FOREST FIRES IN BRAZIL

See https://www.kaggle.com/gustavomodelli/forest-fires-in-brazil for a full description of the dataset.

Import packages

import pandas as pd #Handle datasets
import seaborn as sns #Plots
import matplotlib.pyplot as plt  #Plots
import matplotlib

#Set some graphical parameters
rc={'axes.labelsize': 25, 'figure.figsize': (20,10), 
    'axes.titlesize': 25, 'xtick.labelsize': 18, 'ytick.labelsize': 18}
sns.set(rc=rc)

#Path data
path = 'C:/Users/Andreella/Documents/GitHub/angeella.github.io/Tutorials/Data'
df = pd.read_csv(path + '/amazon.csv',encoding="ISO-8859-1")

First \(3\) observations:

df.head(n=3)
##    year state    month  number        date
## 0  1998  Acre  Janeiro     0.0  1998-01-01
## 1  1999  Acre  Janeiro     0.0  1999-01-01
## 2  2000  Acre  Janeiro     0.0  2000-01-01

Some information about the variables:

df.info() 
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 6454 entries, 0 to 6453
## Data columns (total 5 columns):
##  #   Column  Non-Null Count  Dtype  
## ---  ------  --------------  -----  
##  0   year    6454 non-null   int64  
##  1   state   6454 non-null   object 
##  2   month   6454 non-null   object 
##  3   number  6454 non-null   float64
##  4   date    6454 non-null   object 
## dtypes: float64(1), int64(1), object(3)
## memory usage: 252.2+ KB
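
The date column is stored as a generic object; if we want to treat it as a proper date (optional for what follows), we can convert it with pd.to_datetime. Note also that the month column contains Portuguese month names (e.g., Janeiro for January).

df['date'] = pd.to_datetime(df['date'])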

We are interested in the number of forest fires in Brazil:

df.number.describe()
## count    6454.000000
## mean      108.293163
## std       190.812242
## min         0.000000
## 25%         3.000000
## 50%        24.000000
## 75%       113.000000
## max       998.000000
## Name: number, dtype: float64
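
As a quick aggregate view, we can also sum the reported number of fires per year with a groupby (a small sketch; the exact totals depend on the dataset version):

df.groupby('year')['number'].sum()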

To keep the plots simple, we take a subset of the dataset:

df1 = df[(df.year > 2010) & (df.year < 2019)]
df1 = df1.loc[df1.state.str.startswith('M')]
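
To check which states the filter kept, we can list the unique values (they should be the states whose names start with 'M'):

df1.state.unique()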

We draw a boxplot of the number of fires by group, i.e., by year and state.

sns.boxplot(x = 'year', y = 'number', hue = "state", data = df1) 

We then draw a time series plot with error bands:

sns.lineplot(x="year", y="number",hue="state", data=df1)

We also draw grouped violin plots:

sns.violinplot(x="state", y="number", data=df1)  

For other plots, please refer to https://seaborn.pydata.org/examples/index.html.

ECONOMIC FREEDOM INDEX

See https://www.kaggle.com/lewisduncan93/the-economic-freedom-index for a full description of the dataset.

Load and preprocess data

dt = pd.read_csv(path + '/economic_freedom_index2019_data.csv', encoding="ISO-8859-1")
#Clean the column names: drop spaces, the year suffix, and special characters.
#Passing regex=False treats '(' and ')' as literal strings and avoids a FutureWarning.
dt.columns = dt.columns.str.replace(' ', '', regex=False)
dt.columns = dt.columns.str.replace('2019', '', regex=False)
dt.columns = dt.columns.str.replace('%', '', regex=False)
dt.columns = dt.columns.str.replace('(', '', regex=False)
dt.columns = dt.columns.str.replace(')', '', regex=False)
#Drop rows with any missing value
dt = dt.dropna(axis=0, how='any')
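
Equivalently, the five replace calls above can be collapsed into a single regular expression (an alternative one-liner, assuming we only want to strip spaces, '2019', '%', and parentheses):

dt.columns = dt.columns.str.replace(r'2019|[ %()]', '', regex=True)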

Basic info

dt.info()
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 173 entries, 0 to 185
## Data columns (total 34 columns):
##  #   Column                 Non-Null Count  Dtype  
## ---  ------                 --------------  -----  
##  0   CountryID              173 non-null    int64  
##  1   CountryName            173 non-null    object 
##  2   WEBNAME                173 non-null    object 
##  3   Region                 173 non-null    object 
##  4   WorldRank              173 non-null    float64
##  5   RegionRank             173 non-null    float64
##  6   Score                  173 non-null    float64
##  7   PropertyRights         173 non-null    float64
##  8   JudicalEffectiveness   173 non-null    float64
##  9   GovernmentIntegrity    173 non-null    float64
##  10  TaxBurden              173 non-null    float64
##  11  Gov'tSpending          173 non-null    float64
##  12  FiscalHealth           173 non-null    float64
##  13  BusinessFreedom        173 non-null    float64
##  14  LaborFreedom           173 non-null    float64
##  15  MonetaryFreedom        173 non-null    float64
##  16  TradeFreedom           173 non-null    float64
##  17  InvestmentFreedom      173 non-null    float64
##  18  FinancialFreedom       173 non-null    float64
##  19  TariffRate             173 non-null    float64
##  20  IncomeTaxRate          173 non-null    float64
##  21  CorporateTaxRate       173 non-null    float64
##  22  TaxBurdenofGDP         173 non-null    float64
##  23  Gov'tExpenditureofGDP  173 non-null    float64
##  24  Country                173 non-null    object 
##  25  PopulationMillions     173 non-null    object 
##  26  GDPBillions,PPP        173 non-null    object 
##  27  GDPGrowthRate          173 non-null    float64
##  28  5YearGDPGrowthRate     173 non-null    float64
##  29  GDPperCapitaPPP        173 non-null    object 
##  30  Unemployment           173 non-null    object 
##  31  Inflation              173 non-null    float64
##  32  FDIInflowMillions      173 non-null    object 
##  33  PublicDebtofGDP        173 non-null    float64
## dtypes: float64(24), int64(1), object(9)
## memory usage: 47.3+ KB
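
Note that several numeric-looking columns (e.g., PopulationMillions, Unemployment, FDIInflowMillions) were read as object because they contain formatting such as thousands separators. If we needed them as numbers, a sketch of the conversion (assuming commas are the only formatting) would be:

dt['PopulationMillions'] = pd.to_numeric(
    dt['PopulationMillions'].str.replace(',', '', regex=False),
    errors='coerce')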

Some plots

Boxplot by group, i.e. region:

sns.boxplot(x = 'Region', y = 'Score', data = dt) 

First scatter plot:

plt.scatter(x="PropertyRights", y="Score", data=dt)

We can overlay the linear regression fit directly:

sns.lmplot(x="PropertyRights", y="Score", data=dt)

Density plot of the score variable:

#distplot is deprecated in recent seaborn; histplot with kde=True is the modern equivalent
sns.histplot(dt.Score, color="r", kde=True)

Pair plot of some of the variables, i.e. PropertyRights, LaborFreedom, GovernmentIntegrity, JudicalEffectiveness (the dataset's own spelling), FiscalHealth, and Score, colored by Region:

dt1 = dt[['PropertyRights', 'LaborFreedom', 'GovernmentIntegrity', 'JudicalEffectiveness', 'FiscalHealth', 'Score', 'Region']]
#Reset the graphical parameters set earlier, which are too large for a pair plot
matplotlib.rc_file_defaults()
sns.pairplot(dt1, hue="Region")

Linear regression

Import packages

import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error

Correlation matrix

corr = dt[['PropertyRights', 'LaborFreedom', 'GovernmentIntegrity', 'JudicalEffectiveness','FiscalHealth', "Score"]].corr()
corr
##                       PropertyRights  LaborFreedom  ...  FiscalHealth     Score
## PropertyRights              1.000000      0.432746  ...      0.329969  0.876601
## LaborFreedom                0.432746      1.000000  ...      0.104431  0.512976
## GovernmentIntegrity         0.866998      0.413794  ...      0.292240  0.818174
## JudicalEffectiveness        0.826805      0.421694  ...      0.287380  0.805825
## FiscalHealth                0.329969      0.104431  ...      1.000000  0.559395
## Score                       0.876601      0.512976  ...      0.559395  1.000000
## 
## [6 rows x 6 columns]

Heatmap of the correlation matrix:

sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns)
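
A common refinement, shown in the seaborn gallery, is to annotate the cells and mask the redundant upper triangle (a sketch):

import numpy as np
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f",
            xticklabels=corr.columns, yticklabels=corr.columns)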

We split the dataset into a training set (80%) and a test set (20%):

import numpy as np
#Random ~80/20 split: each row is assigned to the training set with probability 0.8
msk = np.random.rand(len(dt)) < 0.8
train = dt[msk]
test = dt[~msk]
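
The same split can be obtained with scikit-learn's train_test_split, which makes the proportions exact and lets us fix a seed for reproducibility (an alternative sketch, not the split used below):

from sklearn.model_selection import train_test_split
train, test = train_test_split(dt, test_size=0.2, random_state=42)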

Linear regression with Score as the dependent variable and PropertyRights, LaborFreedom, and FiscalHealth as explanatory variables:

results = smf.ols('Score ~ PropertyRights + LaborFreedom + FiscalHealth', data=train).fit()
results.summary()
OLS Regression Results

Dep. Variable:     Score             R-squared:           0.885
Model:             OLS               Adj. R-squared:      0.883
Method:            Least Squares     F-statistic:         344.5
Date:              Sun, 19 Feb 2023  Prob (F-statistic):  8.92e-63
Time:              17:44:32          Log-Likelihood:      -371.13
No. Observations:  138               AIC:                 750.3
Df Residuals:      134               BIC:                 762.0
Df Model:          3
Covariance Type:   nonrobust

                   coef    std err       t    P>|t|   [0.025   0.975]
Intercept       26.4742      1.476  17.941    0.000   23.556   29.393
PropertyRights   0.3783      0.018  20.459    0.000    0.342    0.415
LaborFreedom     0.1275      0.025   5.079    0.000    0.078    0.177
FiscalHealth     0.1028      0.011   9.712    0.000    0.082    0.124

Omnibus:          4.654   Durbin-Watson:     1.867
Prob(Omnibus):    0.098   Jarque-Bera (JB):  4.353
Skew:            -0.433   Prob(JB):          0.113
Kurtosis:         3.094   Cond. No.          518.

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We predict the score values using the test set:

pred = results.predict(test)
#Plot observed vs. predicted scores on the test set
plt.scatter(test.Score, pred, color='b')

Compute the mean squared error:

mean_squared_error(test.Score, pred)
## 14.41123837321759
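
To read this value on the scale of the response, we can take its square root (the RMSE) and compare the model against a naive baseline that always predicts the training mean (a quick sketch):

rmse = np.sqrt(mean_squared_error(test.Score, pred))
#Baseline: always predict the mean training Score
baseline_mse = mean_squared_error(test.Score,
                                  np.repeat(train.Score.mean(), len(test)))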

We also try a linear mixed model, treating the Region variable as a random effect:

md = smf.mixedlm("Score ~ PropertyRights + LaborFreedom + FiscalHealth", train, groups="Region")
mdf = md.fit()
mdf.summary()
Mixed Linear Model Regression Results

Model:             MixedLM   Dependent Variable:  Score
No. Observations:  138       Method:              REML
No. Groups:        5         Scale:               12.0251
Min. group size:   12        Log-Likelihood:      -378.4386
Max. group size:   36        Converged:           Yes
Mean group size:   27.6

                  Coef.   Std.Err.       z    P>|z|   [0.025   0.975]
Intercept        24.976      1.700  14.689    0.000   21.644   28.308
PropertyRights    0.401      0.021  19.104    0.000    0.360    0.443
LaborFreedom      0.125      0.025   4.978    0.000    0.076    0.175
FiscalHealth      0.112      0.011  10.362    0.000    0.091    0.133
Region Var        1.896      0.540
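
As with the OLS fit, we can predict the test-set scores; note that MixedLM's predict method uses the fixed effects only (the estimated Region intercepts are not added), so this is a fixed-effects sketch:

pred_lmm = mdf.predict(test)
mean_squared_error(test.Score, pred_lmm)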

See http://www.statsmodels.org/stable/index.html for other commands for linear (mixed) models. Also, https://www.statsmodels.org/stable/examples/notebooks/generated/mixed_lm_example.html compares R's lmer with statsmodels' MixedLM.

Principal Component Analysis

Import packages:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np

Standardize data:

features = ['PropertyRights', 'LaborFreedom', 'GovernmentIntegrity', 'JudicalEffectiveness','FiscalHealth']
# Separating out the features
x = dt.loc[:, features].values
# Separating out the target
y = dt.loc[:,'Score'].values
# Standardizing the features
x = StandardScaler().fit_transform(x)

Perform PCA retaining \(4\) principal components:

pca = PCA(4)
projected = pca.fit_transform(x)
print(x.shape)
## (173, 5)
print(projected.shape)
## (173, 4)
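
The proportion of variance captured by each component is stored in the explained_variance_ratio_ attribute:

print(pca.explained_variance_ratio_)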

Plot the first \(2\) principal components:

plt.scatter(projected[:, 0], projected[:, 1],
            c=y, edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('seismic', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar()
plt.show()

Plot the cumulative proportion of variance explained by the components:

plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel("Number of components")
plt.ylabel("Cumulative variance explained")
plt.xticks(range(4), [1, 2, 3, 4])
plt.show()
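
Finally, the loadings, i.e. the weight of each original feature in each principal component, can be inspected as a DataFrame (a small sketch using the features list defined above):

loadings = pd.DataFrame(pca.components_.T,
                        index=features,
                        columns=['PC1', 'PC2', 'PC3', 'PC4'])
print(loadings)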