See https://www.kaggle.com/gustavomodelli/forest-fires-in-brazil for a full description of the dataset.
Import packages
import pandas as pd #Handle datasets
import seaborn as sns #Plots
import matplotlib.pyplot as plt #Plots
import matplotlib
#Set some graphical parameters
rc={'axes.labelsize': 25, 'figure.figsize': (20,10),
'axes.titlesize': 25, 'xtick.labelsize': 18, 'ytick.labelsize': 18}
sns.set(rc=rc)
#Path data
path = 'C:/Users/Andreella/Documents/GitHub/angeella.github.io/Tutorials/Data'
df = pd.read_csv(path + '/amazon.csv',encoding="ISO-8859-1")
First \(3\) observations:
df.head(n=3)
## year state month number date
## 0 1998 Acre Janeiro 0.0 1998-01-01
## 1 1999 Acre Janeiro 0.0 1999-01-01
## 2 2000 Acre Janeiro 0.0 2000-01-01
Some information about the variables:
df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 6454 entries, 0 to 6453
## Data columns (total 5 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 year 6454 non-null int64
## 1 state 6454 non-null object
## 2 month 6454 non-null object
## 3 number 6454 non-null float64
## 4 date 6454 non-null object
## dtypes: float64(1), int64(1), object(3)
## memory usage: 252.2+ KB
We are interested about the number of forest fires in Brazil
df.number.describe()
## count 6454.000000
## mean 108.293163
## std 190.812242
## min 0.000000
## 25% 3.000000
## 50% 24.000000
## 75% 113.000000
## max 998.000000
## Name: number, dtype: float64
To have an simple plot, we take a subset of the dataset:
df1 = df[(df.year > 2010) & (df.year < 2019)]
df1 =df1.loc[(df1.state.str.startswith('M'))]
We do a boxplot about the number of fire by groups, i.e., the states and the years.
sns.boxplot(x = 'year', y = 'number', hue = "state", data = df1)
We do a timeseries plot with error bands:
sns.lineplot(x="year", y="number",hue="state", data=df1)
also we do a grouped violinplots:
sns.violinplot(x="state", y="number", data=df1)
For other plots, please refers to https://seaborn.pydata.org/examples/index.html.
See https://www.kaggle.com/lewisduncan93/the-economic-freedom-index for a full description of the dataset.
Load and preprocess data
dt = pd.read_csv(path + '/economic_freedom_index2019_data.csv',encoding="ISO-8859-1")
dt.columns = dt.columns.str.replace(' ', '')
dt.columns = dt.columns.str.replace('2019', '')
dt.columns = dt.columns.str.replace('%', '')
dt.columns = dt.columns.str.replace('(', '')
## <string>:1: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
dt.columns = dt.columns.str.replace(')', '')
## <string>:1: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
dt = dt.dropna(axis = 0,how='any')
Basic info
dt.info()
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 173 entries, 0 to 185
## Data columns (total 34 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 CountryID 173 non-null int64
## 1 CountryName 173 non-null object
## 2 WEBNAME 173 non-null object
## 3 Region 173 non-null object
## 4 WorldRank 173 non-null float64
## 5 RegionRank 173 non-null float64
## 6 Score 173 non-null float64
## 7 PropertyRights 173 non-null float64
## 8 JudicalEffectiveness 173 non-null float64
## 9 GovernmentIntegrity 173 non-null float64
## 10 TaxBurden 173 non-null float64
## 11 Gov'tSpending 173 non-null float64
## 12 FiscalHealth 173 non-null float64
## 13 BusinessFreedom 173 non-null float64
## 14 LaborFreedom 173 non-null float64
## 15 MonetaryFreedom 173 non-null float64
## 16 TradeFreedom 173 non-null float64
## 17 InvestmentFreedom 173 non-null float64
## 18 FinancialFreedom 173 non-null float64
## 19 TariffRate 173 non-null float64
## 20 IncomeTaxRate 173 non-null float64
## 21 CorporateTaxRate 173 non-null float64
## 22 TaxBurdenofGDP 173 non-null float64
## 23 Gov'tExpenditureofGDP 173 non-null float64
## 24 Country 173 non-null object
## 25 PopulationMillions 173 non-null object
## 26 GDPBillions,PPP 173 non-null object
## 27 GDPGrowthRate 173 non-null float64
## 28 5YearGDPGrowthRate 173 non-null float64
## 29 GDPperCapitaPPP 173 non-null object
## 30 Unemployment 173 non-null object
## 31 Inflation 173 non-null float64
## 32 FDIInflowMillions 173 non-null object
## 33 PublicDebtofGDP 173 non-null float64
## dtypes: float64(24), int64(1), object(9)
## memory usage: 47.3+ KB
Boxplot by group, i.e. region:
sns.boxplot(x = 'Region', y = 'Score', data = dt)
First scatter plot:
plt.scatter(x="PropertyRights", y="Score", data=dt)
We can put directly the linear regression fitting:
sns.lmplot(x="PropertyRights", y="Score", data=dt)
Density plot of the score variable:
sns.distplot(dt.Score, color="r")
Pair plot considering some variables, i.e. Property Rights, Labor Freedom, Government Integrity, Judical Effectiveness, Fiscal Health, Region and Score:
dt1 = dt[['PropertyRights', 'LaborFreedom', 'GovernmentIntegrity', 'JudicalEffectiveness','FiscalHealth', "Score", 'Region']]
matplotlib.rc_file_defaults()
sns.pairplot(dt1, hue="Region")
Import packages
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error
import sklearn
Correlation matrix
corr = dt[['PropertyRights', 'LaborFreedom', 'GovernmentIntegrity', 'JudicalEffectiveness','FiscalHealth', "Score"]].corr()
corr
## PropertyRights LaborFreedom ... FiscalHealth Score
## PropertyRights 1.000000 0.432746 ... 0.329969 0.876601
## LaborFreedom 0.432746 1.000000 ... 0.104431 0.512976
## GovernmentIntegrity 0.866998 0.413794 ... 0.292240 0.818174
## JudicalEffectiveness 0.826805 0.421694 ... 0.287380 0.805825
## FiscalHealth 0.329969 0.104431 ... 1.000000 0.559395
## Score 0.876601 0.512976 ... 0.559395 1.000000
##
## [6 rows x 6 columns]
Heatmap of the correlation matrix:
sns.heatmap(corr,
xticklabels=corr.columns,
yticklabels=corr.columns)
We split the dataset into training (0.8) and test set (0.2):
import numpy as np
msk = np.random.rand(len(dt)) < 0.8
train = dt[msk]
test = dt[~msk]
Linear regression having as dependent variable the Score and PropertyRights, LaborFreedom and FiscalHealth as explicative variables:
results = smf.ols('Score ~ PropertyRights + LaborFreedom + FiscalHealth', data=train).fit()
results.summary()
Dep. Variable: | Score | R-squared: | 0.885 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.883 |
Method: | Least Squares | F-statistic: | 344.5 |
Date: | Sun, 19 Feb 2023 | Prob (F-statistic): | 8.92e-63 |
Time: | 17:44:32 | Log-Likelihood: | -371.13 |
No. Observations: | 138 | AIC: | 750.3 |
Df Residuals: | 134 | BIC: | 762.0 |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 26.4742 | 1.476 | 17.941 | 0.000 | 23.556 | 29.393 |
PropertyRights | 0.3783 | 0.018 | 20.459 | 0.000 | 0.342 | 0.415 |
LaborFreedom | 0.1275 | 0.025 | 5.079 | 0.000 | 0.078 | 0.177 |
FiscalHealth | 0.1028 | 0.011 | 9.712 | 0.000 | 0.082 | 0.124 |
Omnibus: | 4.654 | Durbin-Watson: | 1.867 |
---|---|---|---|
Prob(Omnibus): | 0.098 | Jarque-Bera (JB): | 4.353 |
Skew: | -0.433 | Prob(JB): | 0.113 |
Kurtosis: | 3.094 | Cond. No. | 518. |
We predict the score values using the test set:
pred = results.predict(test)
plt.scatter(test.Score, pred, color='b')
Compute the mean squared error:
mean_squared_error(test.Score, pred)
## 14.41123837321759
We try to use a linear mixed model, considering as random effects the Region variable.
md = smf.mixedlm("Score ~ PropertyRights + LaborFreedom + FiscalHealth", train, groups="Region")
mdf = md.fit()
mdf.summary()
Model: | MixedLM | Dependent Variable: | Score |
No. Observations: | 138 | Method: | REML |
No. Groups: | 5 | Scale: | 12.0251 |
Min. group size: | 12 | Log-Likelihood: | -378.4386 |
Max. group size: | 36 | Converged: | Yes |
Mean group size: | 27.6 |
Coef. | Std.Err. | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 24.976 | 1.700 | 14.689 | 0.000 | 21.644 | 28.308 |
PropertyRights | 0.401 | 0.021 | 19.104 | 0.000 | 0.360 | 0.443 |
LaborFreedom | 0.125 | 0.025 | 4.978 | 0.000 | 0.076 | 0.175 |
FiscalHealth | 0.112 | 0.011 | 10.362 | 0.000 | 0.091 | 0.133 |
Region Var | 1.896 | 0.540 |
See http://www.statsmodels.org/stable/index.html for other commands about the linear (mixed) model. Also, https://www.statsmodels.org/stable/examples/notebooks/generated/mixed_lm_example.html makes a comparison between R lmer and Statsmodels MixedLM.
Import packages:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np
Standardize data:
features = ['PropertyRights', 'LaborFreedom', 'GovernmentIntegrity', 'JudicalEffectiveness','FiscalHealth']
# Separating out the features
x = dt.loc[:, features].values
# Separating out the target
y = dt.loc[:,'Score'].values
# Standardizing the features
x = StandardScaler().fit_transform(x)
Perform PCA considering \(2\) principal components:
pca = PCA(4)
projected = pca.fit_transform(x)
print(x.shape)
## (173, 5)
print(projected.shape)
## (173, 4)
Plot the first \(2\) principal components:
plt.scatter(projected[:, 0], projected[:, 1],
c=y, edgecolor='none', alpha=0.5,
cmap=plt.cm.get_cmap('seismic', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar()
## <matplotlib.colorbar.Colorbar object at 0x000001E29F0F59A0>
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel("Number of component")
plt.ylabel("Variance explained")
plt.xticks(range(4), [1,2,3,4])
## ([<matplotlib.axis.XTick object at 0x000001E29F06E6D0>, <matplotlib.axis.XTick object at 0x000001E29F06E7C0>, <matplotlib.axis.XTick object at 0x000001E29F109C40>, <matplotlib.axis.XTick object at 0x000001E29F00BA00>], [Text(0, 0, '1'), Text(1, 0, '2'), Text(2, 0, '3'), Text(3, 0, '4')])