FOREST FIRES IN BRAZIL

See https://www.kaggle.com/gustavomodelli/forest-fires-in-brazil for a full description of the dataset.

Import packages

import pandas as pd #Handle datasets
import seaborn as sns #Plots
import matplotlib.pyplot as plt  #Plots
import matplotlib

#Set some graphical parameters
rc={'axes.labelsize': 25, 'figure.figsize': (20,10), 
    'axes.titlesize': 25, 'xtick.labelsize': 18, 'ytick.labelsize': 18}
sns.set(rc=rc)

#Path data
path = 'C:/Users/Andreella/Documents/GitHub/angeella.github.io/Data'
df = pd.read_csv(path + '/amazon.csv',encoding="ISO-8859-1")

First \(3\) observations:

df.head(n=3)
##    year state    month  number        date
## 0  1998  Acre  Janeiro     0.0  1998-01-01
## 1  1999  Acre  Janeiro     0.0  1999-01-01
## 2  2000  Acre  Janeiro     0.0  2000-01-01

Some information about the variables:

df.info() 
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 6454 entries, 0 to 6453
## Data columns (total 5 columns):
##  #   Column  Non-Null Count  Dtype  
## ---  ------  --------------  -----  
##  0   year    6454 non-null   int64  
##  1   state   6454 non-null   object 
##  2   month   6454 non-null   object 
##  3   number  6454 non-null   float64
##  4   date    6454 non-null   object 
## dtypes: float64(1), int64(1), object(3)
## memory usage: 252.2+ KB
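
The date column is stored as a generic object; if we needed proper dates, it can be parsed into a datetime type (a minimal sketch using pandas):

#Parse the 'YYYY-MM-DD' strings into datetime64 values
df['date'] = pd.to_datetime(df['date'])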

We are interested in the number of forest fires in Brazil:

df.number.describe()
## count    6454.000000
## mean      108.293163
## std       190.812242
## min         0.000000
## 25%         3.000000
## 50%        24.000000
## 75%       113.000000
## max       998.000000
## Name: number, dtype: float64
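
For example, a quick way to see the total reported fires per year is a groupby aggregation (a short sketch):

#Sum the monthly counts within each year
df.groupby('year')['number'].sum()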

To keep the plots simple, we take a subset of the dataset, i.e., the years 2011-2018 and the states whose name starts with 'M':

df1 = df[(df.year > 2010) & (df.year < 2019)] #Years 2011-2018
df1 = df1.loc[df1.state.str.startswith('M')]  #States starting with 'M'

We draw a boxplot of the number of fires by group, i.e., by year and state:

sns.boxplot(x = 'year', y = 'number', hue = "state", data = df1) 
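
Note that if you run these snippets as a plain Python script (rather than in a notebook), each figure must be rendered explicitly:

plt.show() #Display the current figure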

We draw a time series plot with error bands:

sns.lineplot(x="year", y="number",hue="state", data=df1)

We also draw grouped violin plots:

sns.violinplot(x="state", y="number", data=df1)  

For other plots, please refer to https://seaborn.pydata.org/examples/index.html.

ECONOMIC FREEDOM INDEX

See https://www.kaggle.com/lewisduncan93/the-economic-freedom-index for a full description of the dataset.

Load and preprocess data

dt = pd.read_csv(path + '/economic_freedom_index2019_data.csv', encoding="ISO-8859-1")
#Clean the column names: drop spaces, the year and special characters.
#regex=False treats '(' and ')' as literal characters and avoids pandas' FutureWarning.
dt.columns = dt.columns.str.replace(' ', '', regex=False)
dt.columns = dt.columns.str.replace('2019', '', regex=False)
dt.columns = dt.columns.str.replace('%', '', regex=False)
dt.columns = dt.columns.str.replace('(', '', regex=False)
dt.columns = dt.columns.str.replace(')', '', regex=False)
#Drop rows with any missing value
dt = dt.dropna(axis=0, how='any')

Basic info

dt.info()
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 173 entries, 0 to 185
## Data columns (total 34 columns):
##  #   Column                 Non-Null Count  Dtype  
## ---  ------                 --------------  -----  
##  0   CountryID              173 non-null    int64  
##  1   CountryName            173 non-null    object 
##  2   WEBNAME                173 non-null    object 
##  3   Region                 173 non-null    object 
##  4   WorldRank              173 non-null    float64
##  5   RegionRank             173 non-null    float64
##  6   Score                  173 non-null    float64
##  7   PropertyRights         173 non-null    float64
##  8   JudicalEffectiveness   173 non-null    float64
##  9   GovernmentIntegrity    173 non-null    float64
##  10  TaxBurden              173 non-null    float64
##  11  Gov'tSpending          173 non-null    float64
##  12  FiscalHealth           173 non-null    float64
##  13  BusinessFreedom        173 non-null    float64
##  14  LaborFreedom           173 non-null    float64
##  15  MonetaryFreedom        173 non-null    float64
##  16  TradeFreedom           173 non-null    float64
##  17  InvestmentFreedom      173 non-null    float64
##  18  FinancialFreedom       173 non-null    float64
##  19  TariffRate             173 non-null    float64
##  20  IncomeTaxRate          173 non-null    float64
##  21  CorporateTaxRate       173 non-null    float64
##  22  TaxBurdenofGDP         173 non-null    float64
##  23  Gov'tExpenditureofGDP  173 non-null    float64
##  24  Country                173 non-null    object 
##  25  PopulationMillions     173 non-null    object 
##  26  GDPBillions,PPP        173 non-null    object 
##  27  GDPGrowthRate          173 non-null    float64
##  28  5YearGDPGrowthRate     173 non-null    float64
##  29  GDPperCapitaPPP        173 non-null    object 
##  30  Unemployment           173 non-null    object 
##  31  Inflation              173 non-null    float64
##  32  FDIInflowMillions      173 non-null    object 
##  33  PublicDebtofGDP        173 non-null    float64
## dtypes: float64(24), int64(1), object(9)
## memory usage: 47.3+ KB
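
Note that several numeric-looking columns (e.g. PopulationMillions, GDPperCapitaPPP) are still stored as object, presumably because of formatting such as thousands separators; a hedged sketch of a conversion, in case we needed them as numbers:

#Strip commas and coerce to float; entries that cannot be parsed become NaN
pop = pd.to_numeric(dt['PopulationMillions'].astype(str).str.replace(',', '', regex=False),
                    errors='coerce')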

Some plots

Boxplot by group, i.e. region:

sns.boxplot(x = 'Region', y = 'Score', data = dt) 

First scatter plot:

plt.scatter(x="PropertyRights", y="Score", data=dt)

We can overlay the fitted linear regression line directly:

sns.lmplot(x="PropertyRights", y="Score", data=dt)

Density plot of the Score variable:

sns.distplot(dt.Score, color="r")
#Note: distplot is deprecated in recent seaborn releases;
#sns.histplot(dt.Score, color="r", kde=True) is the modern equivalent.

Pair plot of a subset of variables, i.e. Property Rights, Labor Freedom, Government Integrity, Judicial Effectiveness (spelled JudicalEffectiveness in the dataset), Fiscal Health, Region and Score:

dt1 = dt[['PropertyRights', 'LaborFreedom', 'GovernmentIntegrity', 'JudicalEffectiveness','FiscalHealth', "Score", 'Region']]
matplotlib.rc_file_defaults()
sns.pairplot(dt1, hue="Region")

Linear regression

Import packages

import statsmodels.api as sm #Statistical models
import statsmodels.formula.api as smf #R-style model formulas
from sklearn.metrics import mean_squared_error #Model evaluation

Correlation matrix

corr = dt[['PropertyRights', 'LaborFreedom', 'GovernmentIntegrity', 'JudicalEffectiveness','FiscalHealth', "Score"]].corr()
corr
##                       PropertyRights  LaborFreedom  ...  FiscalHealth     Score
## PropertyRights              1.000000      0.432746  ...      0.329969  0.876601
## LaborFreedom                0.432746      1.000000  ...      0.104431  0.512976
## GovernmentIntegrity         0.866998      0.413794  ...      0.292240  0.818174
## JudicalEffectiveness        0.826805      0.421694  ...      0.287380  0.805825
## FiscalHealth                0.329969      0.104431  ...      1.000000  0.559395
## Score                       0.876601      0.512976  ...      0.559395  1.000000
## 
## [6 rows x 6 columns]

Heatmap of the correlation matrix:

sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns)
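
To print the correlation values inside the cells, seaborn's annot argument can be used:

sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns,
            annot=True, fmt=".2f") #Annotate each cell with 2 decimals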

We split the dataset into a training set (about \(80\%\) of the rows) and a test set (about \(20\%\)):

import numpy as np
#Random boolean mask: each row has probability 0.8 of entering the training set
msk = np.random.rand(len(dt)) < 0.8
train = dt[msk]
test = dt[~msk]
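
Note that the random mask gives only an approximate 80/20 split and changes at every run; as an alternative sketch, scikit-learn's train_test_split gives an exact, reproducible split:

from sklearn.model_selection import train_test_split
#random_state fixes the seed, so the split is reproducible
train, test = train_test_split(dt, test_size=0.2, random_state=0)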

Linear regression with Score as the dependent variable and PropertyRights, LaborFreedom and FiscalHealth as explanatory variables:

results = smf.ols('Score ~ PropertyRights + LaborFreedom + FiscalHealth', data=train).fit()
results.summary()
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                  Score   R-squared:                       0.888
## Model:                            OLS   Adj. R-squared:                  0.885
## Method:                 Least Squares   F-statistic:                     343.2
## Date:                Thu, 29 May 2025   Prob (F-statistic):           1.45e-61
## Time:                        17:48:59   Log-Likelihood:                -362.28
## No. Observations:                 134   AIC:                             732.6
## Df Residuals:                     130   BIC:                             744.2
## Df Model:                           3
## Covariance Type:            nonrobust
## ==================================================================================
##                      coef    std err          t      P>|t|      [0.025      0.975]
## ----------------------------------------------------------------------------------
## Intercept         25.5973      1.491     17.165      0.000      22.647      28.548
## PropertyRights     0.3685      0.019     19.743      0.000       0.332       0.405
## LaborFreedom       0.1536      0.025      6.041      0.000       0.103       0.204
## FiscalHealth       0.1012      0.011      9.268      0.000       0.080       0.123
## ==============================================================================
## Omnibus:                        7.081   Durbin-Watson:                   1.959
## Prob(Omnibus):                  0.029   Jarque-Bera (JB):                6.780
## Skew:                          -0.537   Prob(JB):                       0.0337
## Kurtosis:                       3.249   Cond. No.                         513.
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
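
The individual pieces of the summary are also available as attributes of the results object, for example:

results.params     #Estimated coefficients
results.rsquared   #R-squared
results.conf_int() #95% confidence intervals of the coefficients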

We predict the Score values on the test set and plot them against the observed values:

pred = results.predict(test)
plt.scatter(test.Score, pred,  color='b')

Compute the mean squared error:

mean_squared_error(test.Score, pred)
## 13.195898426565337
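
The root mean squared error is on the same scale as Score, so it is often easier to interpret:

np.sqrt(mean_squared_error(test.Score, pred)) #RMSE, in Score units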

We now try a linear mixed model, with the Region variable as a random effect:

md = smf.mixedlm("Score ~ PropertyRights + LaborFreedom + FiscalHealth", train, groups="Region")
mdf = md.fit()
mdf.summary()
##           Mixed Linear Model Regression Results
## =========================================================
## Model:             MixedLM  Dependent Variable:  Score
## No. Observations:  134      Method:              REML
## No. Groups:        5        Scale:               12.8113
## Min. group size:   10       Log-Likelihood:      -371.0757
## Max. group size:   37       Converged:           Yes
## Mean group size:   26.8
## ---------------------------------------------------------
##                 Coef.  Std.Err.   z    P>|z| [0.025 0.975]
## ---------------------------------------------------------
## Intercept      24.815     1.683 14.748 0.000 21.517 28.113
## PropertyRights  0.385     0.022 17.316 0.000  0.342  0.429
## LaborFreedom    0.148     0.026  5.755 0.000  0.098  0.199
## FiscalHealth    0.107     0.012  9.258 0.000  0.085  0.130
## Region Var      1.229     0.421
## =========================================================
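
The estimated random intercept of each Region can be inspected directly:

mdf.random_effects #Dictionary mapping each Region to its estimated random effect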

See http://www.statsmodels.org/stable/index.html for other commands for linear (mixed) models. Also, https://www.statsmodels.org/stable/examples/notebooks/generated/mixed_lm_example.html compares R's lmer with Statsmodels' MixedLM.

Principal Component Analysis

Import packages:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np

Standardize data:

features = ['PropertyRights', 'LaborFreedom', 'GovernmentIntegrity', 'JudicalEffectiveness','FiscalHealth']
# Separating out the features
x = dt.loc[:, features].values
# Separating out the target
y = dt.loc[:,'Score'].values
# Standardizing the features
x = StandardScaler().fit_transform(x)
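
As a sanity check, each standardized feature should now have (approximately) zero mean and unit variance:

print(x.mean(axis=0).round(6)) #~0 for every column
print(x.std(axis=0).round(6))  #~1 for every column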

Perform PCA, retaining \(4\) principal components:

pca = PCA(4)
projected = pca.fit_transform(x)
print(x.shape)
## (173, 5)
print(projected.shape)
## (173, 4)
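
The proportion of variance explained by each component is stored in the fitted object:

print(pca.explained_variance_ratio_) #One value per component, in decreasing order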

Plot the first \(2\) principal components:

plt.scatter(projected[:, 0], projected[:, 1],
            c=y, edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('seismic', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar()
## <matplotlib.colorbar.Colorbar object at 0x000002248B03A250>

Plot the cumulative proportion of variance explained:

plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel("Number of components")
plt.ylabel("Variance explained")
plt.xticks(range(4), [1, 2, 3, 4]) #Label the ticks with the component index
## ([<matplotlib.axis.XTick object at 0x000002248B0AFA00>, <matplotlib.axis.XTick object at 0x000002248B0AFC10>, <matplotlib.axis.XTick object at 0x000002248B0BF9A0>, <matplotlib.axis.XTick object at 0x000002248B107B50>], [Text(0, 0, '1'), Text(1, 0, '2'), Text(2, 0, '3'), Text(3, 0, '4')])
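
Finally, to relate the components back to the original variables, we can inspect the loadings (a minimal sketch building a small DataFrame):

#Rows: original features; columns: principal components
loadings = pd.DataFrame(pca.components_.T, index=features,
                        columns=['PC1', 'PC2', 'PC3', 'PC4'])
print(loadings)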