FOREST FIRES IN BRAZIL

See https://www.kaggle.com/gustavomodelli/forest-fires-in-brazil for a full description of the dataset.

Import packages

import pandas as pd #Handle datasets
import seaborn as sns #Plots
import matplotlib.pyplot as plt  #Plots
import matplotlib

#Set some graphical parameters
rc={'axes.labelsize': 25, 'figure.figsize': (20,10), 
    'axes.titlesize': 25, 'xtick.labelsize': 18, 'ytick.labelsize': 18}
sns.set(rc=rc)

#Path data
path = 'C:/Users/Andreella/Documents/GitHub/angeella.github.io/Data'
df = pd.read_csv(path + '/amazon.csv',encoding="ISO-8859-1")

First \(3\) observations:

df.head(n=3)
##    year state    month  number        date
## 0  1998  Acre  Janeiro     0.0  1998-01-01
## 1  1999  Acre  Janeiro     0.0  1999-01-01
## 2  2000  Acre  Janeiro     0.0  2000-01-01

Some information about the variables:

df.info() 
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 6454 entries, 0 to 6453
## Data columns (total 5 columns):
##  #   Column  Non-Null Count  Dtype  
## ---  ------  --------------  -----  
##  0   year    6454 non-null   int64  
##  1   state   6454 non-null   object 
##  2   month   6454 non-null   object 
##  3   number  6454 non-null   float64
##  4   date    6454 non-null   object 
## dtypes: float64(1), int64(1), object(3)
## memory usage: 252.2+ KB

We are interested about the number of forest fires in Brazil

df.number.describe()
## count    6454.000000
## mean      108.293163
## std       190.812242
## min         0.000000
## 25%         3.000000
## 50%        24.000000
## 75%       113.000000
## max       998.000000
## Name: number, dtype: float64

To have an simple plot, we take a subset of the dataset:

df1 = df[(df.year > 2010) & (df.year < 2019)]
df1 =df1.loc[(df1.state.str.startswith('M'))]

We do a boxplot about the number of fire by groups, i.e., the states and the years.

sns.boxplot(x = 'year', y = 'number', hue = "state", data = df1) 

We do a timeseries plot with error bands:

sns.lineplot(x="year", y="number",hue="state", data=df1)

also we do a grouped violinplots:

sns.violinplot(x="state", y="number", data=df1)  

For other plots, please refers to https://seaborn.pydata.org/examples/index.html.

ECONOMIC FREEDOM INDEX

See https://www.kaggle.com/lewisduncan93/the-economic-freedom-index for a full description of the dataset.

Load and preprocess data

dt = pd.read_csv(path + '/economic_freedom_index2019_data.csv',encoding="ISO-8859-1")
dt.columns = dt.columns.str.replace(' ', '')
dt.columns = dt.columns.str.replace('2019', '')
dt.columns = dt.columns.str.replace('%', '')
dt.columns = dt.columns.str.replace('(', '')
dt.columns = dt.columns.str.replace(')', '')
dt = dt.dropna(axis = 0,how='any')

Basic info

dt.info()
## <class 'pandas.core.frame.DataFrame'>
## Index: 173 entries, 0 to 185
## Data columns (total 34 columns):
##  #   Column                 Non-Null Count  Dtype  
## ---  ------                 --------------  -----  
##  0   CountryID              173 non-null    int64  
##  1   CountryName            173 non-null    object 
##  2   WEBNAME                173 non-null    object 
##  3   Region                 173 non-null    object 
##  4   WorldRank              173 non-null    float64
##  5   RegionRank             173 non-null    float64
##  6   Score                  173 non-null    float64
##  7   PropertyRights         173 non-null    float64
##  8   JudicalEffectiveness   173 non-null    float64
##  9   GovernmentIntegrity    173 non-null    float64
##  10  TaxBurden              173 non-null    float64
##  11  Gov'tSpending          173 non-null    float64
##  12  FiscalHealth           173 non-null    float64
##  13  BusinessFreedom        173 non-null    float64
##  14  LaborFreedom           173 non-null    float64
##  15  MonetaryFreedom        173 non-null    float64
##  16  TradeFreedom           173 non-null    float64
##  17  InvestmentFreedom      173 non-null    float64
##  18  FinancialFreedom       173 non-null    float64
##  19  TariffRate             173 non-null    float64
##  20  IncomeTaxRate          173 non-null    float64
##  21  CorporateTaxRate       173 non-null    float64
##  22  TaxBurdenofGDP         173 non-null    float64
##  23  Gov'tExpenditureofGDP  173 non-null    float64
##  24  Country                173 non-null    object 
##  25  PopulationMillions     173 non-null    object 
##  26  GDPBillions,PPP        173 non-null    object 
##  27  GDPGrowthRate          173 non-null    float64
##  28  5YearGDPGrowthRate     173 non-null    float64
##  29  GDPperCapitaPPP        173 non-null    object 
##  30  Unemployment           173 non-null    object 
##  31  Inflation              173 non-null    float64
##  32  FDIInflowMillions      173 non-null    object 
##  33  PublicDebtofGDP        173 non-null    float64
## dtypes: float64(24), int64(1), object(9)
## memory usage: 47.3+ KB

Some plots

Boxplot by group, i.e. region:

sns.boxplot(x = 'Region', y = 'Score', data = dt) 

First scatter plot:

plt.scatter(x="PropertyRights", y="Score", data=dt)

We can put directly the linear regression fitting:

sns.lmplot(x="PropertyRights", y="Score", data=dt)

Density plot of the score variable:

sns.distplot(dt.Score, color="r")

Pair plot considering some variables, i.e. Property Rights, Labor Freedom, Government Integrity, Judical Effectiveness, Fiscal Health, Region and Score:

dt1 = dt[['PropertyRights', 'LaborFreedom', 'GovernmentIntegrity', 'JudicalEffectiveness','FiscalHealth', "Score", 'Region']]
matplotlib.rc_file_defaults()
sns.pairplot(dt1, hue="Region")

Linear regression

Import packages

import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error
import sklearn

Correlation matrix

corr = dt[['PropertyRights', 'LaborFreedom', 'GovernmentIntegrity', 'JudicalEffectiveness','FiscalHealth', "Score"]].corr()
corr
##                       PropertyRights  LaborFreedom  ...  FiscalHealth     Score
## PropertyRights              1.000000      0.432746  ...      0.329969  0.876601
## LaborFreedom                0.432746      1.000000  ...      0.104431  0.512976
## GovernmentIntegrity         0.866998      0.413794  ...      0.292240  0.818174
## JudicalEffectiveness        0.826805      0.421694  ...      0.287380  0.805825
## FiscalHealth                0.329969      0.104431  ...      1.000000  0.559395
## Score                       0.876601      0.512976  ...      0.559395  1.000000
## 
## [6 rows x 6 columns]

Heatmap of the correlation matrix:

sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns)

We split the dataset into training (0.8) and test set (0.2):

import numpy as np
msk = np.random.rand(len(dt)) < 0.8
train = dt[msk]
test = dt[~msk]

Linear regression having as dependent variable the Score and PropertyRights, LaborFreedom and FiscalHealth as explicative variables:

results = smf.ols('Score ~ PropertyRights + LaborFreedom + FiscalHealth', data=train).fit()
results.summary()
OLS Regression Results
Dep. Variable: Score R-squared: 0.884
Model: OLS Adj. R-squared: 0.881
Method: Least Squares F-statistic: 330.5
Date: Tue, 07 Jan 2025 Prob (F-statistic): 1.27e-60
Time: 17:26:45 Log-Likelihood: -362.64
No. Observations: 134 AIC: 733.3
Df Residuals: 130 BIC: 744.9
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 26.0150 1.506 17.271 0.000 23.035 28.995
PropertyRights 0.3856 0.020 19.582 0.000 0.347 0.425
LaborFreedom 0.1423 0.026 5.410 0.000 0.090 0.194
FiscalHealth 0.0927 0.011 8.301 0.000 0.071 0.115
Omnibus: 5.626 Durbin-Watson: 2.013
Prob(Omnibus): 0.060 Jarque-Bera (JB): 5.388
Skew: -0.489 Prob(JB): 0.0676
Kurtosis: 3.093 Cond. No. 514.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We predict the score values using the test set:

pred = results.predict(test)
plt.scatter(test.Score, pred,  color='b')

Compute the mean squared error:

mean_squared_error(test.Score, pred)
## 13.342104336561519

We try to use a linear mixed model, considering as random effects the Region variable.

md = smf.mixedlm("Score ~ PropertyRights + LaborFreedom + FiscalHealth", train, groups="Region")
mdf = md.fit()
## C:\Users\ANDREE~1\AppData\Local\Programs\Python\PYTHON~1\Lib\site-packages\statsmodels\base\model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
##   warnings.warn("Maximum Likelihood optimization failed to "
## C:\Users\ANDREE~1\AppData\Local\Programs\Python\PYTHON~1\Lib\site-packages\statsmodels\regression\mixed_linear_model.py:2200: ConvergenceWarning: Retrying MixedLM optimization with lbfgs
##   warnings.warn(
## C:\Users\ANDREE~1\AppData\Local\Programs\Python\PYTHON~1\Lib\site-packages\statsmodels\base\model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
##   warnings.warn("Maximum Likelihood optimization failed to "
## C:\Users\ANDREE~1\AppData\Local\Programs\Python\PYTHON~1\Lib\site-packages\statsmodels\regression\mixed_linear_model.py:2200: ConvergenceWarning: Retrying MixedLM optimization with cg
##   warnings.warn(
## C:\Users\ANDREE~1\AppData\Local\Programs\Python\PYTHON~1\Lib\site-packages\statsmodels\base\model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
##   warnings.warn("Maximum Likelihood optimization failed to "
## C:\Users\ANDREE~1\AppData\Local\Programs\Python\PYTHON~1\Lib\site-packages\statsmodels\regression\mixed_linear_model.py:2206: ConvergenceWarning: MixedLM optimization failed, trying a different optimizer may help.
##   warnings.warn(msg, ConvergenceWarning)
## C:\Users\ANDREE~1\AppData\Local\Programs\Python\PYTHON~1\Lib\site-packages\statsmodels\regression\mixed_linear_model.py:2218: ConvergenceWarning: Gradient optimization failed, |grad| = 0.260398
##   warnings.warn(msg, ConvergenceWarning)
mdf.summary()
Model: MixedLM Dependent Variable: Score
No. Observations: 134 Method: REML
No. Groups: 5 Scale: 12.3137
Min. group size: 12 Log-Likelihood: -369.3094
Max. group size: 37 Converged: No
Mean group size: 26.8
Coef. Std.Err. z P>|z| [0.025 0.975]
Intercept 24.091 1.752 13.748 0.000 20.657 27.526
PropertyRights 0.411 0.023 18.119 0.000 0.367 0.455
LaborFreedom 0.147 0.026 5.714 0.000 0.097 0.198
FiscalHealth 0.100 0.011 8.952 0.000 0.078 0.122
Region Var 2.114 0.564

See http://www.statsmodels.org/stable/index.html for other commands about the linear (mixed) model. Also, https://www.statsmodels.org/stable/examples/notebooks/generated/mixed_lm_example.html makes a comparison between R lmer and Statsmodels MixedLM.

Principal Component Analysis

Import packages:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np

Standardize data:

features = ['PropertyRights', 'LaborFreedom', 'GovernmentIntegrity', 'JudicalEffectiveness','FiscalHealth']
# Separating out the features
x = dt.loc[:, features].values
# Separating out the target
y = dt.loc[:,'Score'].values
# Standardizing the features
x = StandardScaler().fit_transform(x)

Perform PCA considering \(2\) principal components:

pca = PCA(4)
projected = pca.fit_transform(x)
print(x.shape)
## (173, 5)
print(projected.shape)
## (173, 4)

Plot the first \(2\) principal components:

plt.scatter(projected[:, 0], projected[:, 1],
            c=y, edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('seismic', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar()
## <matplotlib.colorbar.Colorbar object at 0x0000016601015810>

plt.plot(np.cumsum(pca.explained_variance_ratio_)) 
plt.xlabel("Number of component") 
plt.ylabel("Variance explained")
plt.xticks(range(4), [1,2,3,4])
## ([<matplotlib.axis.XTick object at 0x000001660FDEFED0>, <matplotlib.axis.XTick object at 0x000001660FE1A350>, <matplotlib.axis.XTick object at 0x000001660FE1AAD0>, <matplotlib.axis.XTick object at 0x000001660FE1B250>], [Text(0, 0, '1'), Text(1, 0, '2'), Text(2, 0, '3'), Text(3, 0, '4')])