In this session, we will explore how to analyze relationships between variables using regression models. As we’ve seen in prior classes, pandas allows one to load data and calculate basic statistics. To do more advanced statistics like regressions, we will require the help of other libraries such as statsmodels.
Statsmodels is a Python library that provides a wide range of statistical tools and models for data analysis and modeling. It is an open-source package that is built on top of the NumPy and SciPy libraries, and it has gained popularity among data scientists, statisticians, and researchers.
To use Statsmodels in your Python project, you first need to install and import the library into your code. You can do this by using the following import statement:
! pip3 install statsmodels
Requirement already satisfied: statsmodels in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (0.14.0)
Requirement already satisfied: numpy>=1.18 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from statsmodels) (1.26.4)
Requirement already satisfied: scipy!=1.9.2,>=1.4 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from statsmodels) (1.13.0)
Requirement already satisfied: pandas>=1.0 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from statsmodels) (2.2.1)
Requirement already satisfied: patsy>=0.5.2 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from statsmodels) (0.5.3)
Requirement already satisfied: packaging>=21.3 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from statsmodels) (23.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from pandas>=1.0->statsmodels) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from pandas>=1.0->statsmodels) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from pandas>=1.0->statsmodels) (2023.3)
Requirement already satisfied: six in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from patsy>=0.5.2->statsmodels) (1.16.0)
import statsmodels.api as sm
Loading sample data
For the sake of example, we will be using the Grunfeld dataset. A long introduction is beyond the scope, but it is a panel dataset that contains data on investment and other variables for 10 firms over a period of 20 years. It is a commonly used dataset in econometrics and has been used to study various aspects of investment behavior and firm performance.
data = sm.datasets.grunfeld.load_pandas()df = data.datadf.head()
invest
value
capital
firm
year
0
317.6
3078.5
2.8
General Motors
1935.0
1
391.8
4661.7
52.6
General Motors
1936.0
2
410.6
5387.1
156.9
General Motors
1937.0
3
257.7
2792.2
209.2
General Motors
1938.0
4
330.8
4313.2
203.4
General Motors
1939.0
Every row represents an annual observation of investment made by the company on physical capital (in millions), firm value (in millions), and capital stock (in millions) for General Motors over the years 1935 to 1939.
Ratios
To adjust the numbers per year and make them more comparable, we could express investment and capital stock as percentages of firm value. This normalizes the data relative to the size of the firm each year.
One of the key features of Statsmodels is its Formula API, which allows you to specify statistical models using tilde-style formulas. This makes it easy to specify complex statistical models with minimal coding effort. To use the Formula API, you can import the following module:
import statsmodels.formula.api as smf
Let’s say we want to predict the investment ratio based on the current valuation, capital, and the year. We could do that as follows:
This sounds complicated, until you see how the whole regression is just a 1-liner in statsmodels!
model = smf.ols('InvestRatio ~ CapitalStockRatio + year', data=df).fit()
Once you have defined your model, you can obtain a summary of the results, which includes important statistics such as the R-squared value, standard errors, and p-values. This summary can be accessed using the .summary() method of the model object, as shown below:
model.summary()
OLS Regression Results
Dep. Variable:
InvestRatio
R-squared:
0.561
Model:
OLS
Adj. R-squared:
0.556
Method:
Least Squares
F-statistic:
138.4
Date:
Mon, 02 Dec 2024
Prob (F-statistic):
1.82e-39
Time:
10:39:02
Log-Likelihood:
-729.30
No. Observations:
220
AIC:
1465.
Df Residuals:
217
BIC:
1475.
Df Model:
2
Covariance Type:
nonrobust
coef
std err
t
P>|t|
[0.025
0.975]
Intercept
-300.2015
155.916
-1.925
0.055
-607.506
7.102
CapitalStockRatio
0.0862
0.005
15.723
0.000
0.075
0.097
year
0.1589
0.080
1.980
0.049
0.001
0.317
Omnibus:
31.316
Durbin-Watson:
0.477
Prob(Omnibus):
0.000
Jarque-Bera (JB):
43.824
Skew:
0.871
Prob(JB):
3.05e-10
Kurtosis:
4.321
Cond. No.
6.71e+05
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 6.71e+05. This might indicate that there are strong multicollinearity or other numerical problems.
As you can see, this prints a nicely formatted output of the model summary, including the R-squared value and much more.
For this particular case, we conclude that firms with a higher proportion of capital stock relative to their value are more inclined or able to reinvest in their physical assets. The effect is significant and positive effect (coef = 0.086, p < 0.001) and over time we see a smaller upward trend (year coef = 0.159, p = 0.049). However, the other statistical tests show we may need to refine the model further.
Categorical variables are automatically detected and dummy encoded, but it’s good practice to mark them as categoricals by adding C around the variable names. Let’s say we wanted to consider firm-specific factors like management or culture:
model_cat = smf.ols('InvestRatio ~ CapitalStockRatio + year + C(firm)', data=df).fit()model_cat.summary()
OLS Regression Results
Dep. Variable:
InvestRatio
R-squared:
0.835
Model:
OLS
Adj. R-squared:
0.825
Method:
Least Squares
F-statistic:
87.25
Date:
Mon, 02 Dec 2024
Prob (F-statistic):
4.99e-74
Time:
10:39:03
Log-Likelihood:
-621.59
No. Observations:
220
AIC:
1269.
Df Residuals:
207
BIC:
1313.
Df Model:
12
Covariance Type:
nonrobust
coef
std err
t
P>|t|
[0.025
0.975]
Intercept
-393.8923
112.090
-3.514
0.001
-614.877
-172.907
C(firm)[T.Atlantic Refining]
9.5560
1.540
6.207
0.000
6.521
12.591
C(firm)[T.Chrysler]
7.9287
1.734
4.573
0.000
4.510
11.347
C(firm)[T.Diamond Match]
0.6988
1.789
0.391
0.697
-2.829
4.227
C(firm)[T.General Electric]
0.6825
1.709
0.399
0.690
-2.686
4.051
C(firm)[T.General Motors]
9.5595
1.751
5.459
0.000
6.107
13.012
C(firm)[T.Goodyear]
2.9180
1.386
2.106
0.036
0.186
5.650
C(firm)[T.IBM]
7.9037
1.697
4.657
0.000
4.558
11.249
C(firm)[T.US Steel]
16.6070
1.748
9.503
0.000
13.162
20.052
C(firm)[T.Union Oil]
13.9922
1.600
8.747
0.000
10.838
17.146
C(firm)[T.Westinghouse]
2.3673
1.768
1.339
0.182
-1.119
5.854
CapitalStockRatio
0.0715
0.010
7.230
0.000
0.052
0.091
year
0.2042
0.058
3.521
0.001
0.090
0.318
Omnibus:
92.283
Durbin-Watson:
1.173
Prob(Omnibus):
0.000
Jarque-Bera (JB):
493.190
Skew:
1.557
Prob(JB):
8.04e-108
Kurtosis:
9.642
Cond. No.
7.69e+05
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 7.69e+05. This might indicate that there are strong multicollinearity or other numerical problems.
We see that Union Oil and US Steel firms consistently invest a much larger proportion of their firm value into capital expenditures compared to the baseline firm, after controlling for year and the capital stock ratio. Why?
Making predictions
Finally, once you have fitted your model, you can use it to make predictions on new data. To do this, you can use the .predict() method of the model object, as shown below:
Don’t be misled by the term ‘predict’ though. While it suggests uncovering the unknown, this model is not forward-looking. To achieve that, you would need to use next year’s investments instead of this year’s. For a deeper understanding of using statistical models for business insights, refer to the Business Analytics course offered in the first-year core.
We can also use this function to plot the regression line of a model with 1 variable.
# DEFINE MODELmodel = smf.ols('InvestRatio ~ CapitalStockRatio', data=df).fit()# MAKE PREDICTIONSdf['Predicted'] = model.predict(df)# PLOT MODELimport matplotlib.pyplot as pltdf.plot(x='CapitalStockRatio', y ='Predicted', color='red', ax = plt.gca())df.plot.scatter(y='InvestRatio', x='CapitalStockRatio', alpha=.3, ax = plt.gca());