• Review
  • Image Data
  • Tabular Data
  • Text Data
  • Assignments
    • Haiteng Engineering: Quality Control System Analysis
    • AirBnBarcelona
  1. Tabular Data
  2. Investment Drivers
  • Review
    • Introduction to Python
    • NumPy: Numerical Python
  • Image Data
    • How Computers See
    • Computer Vision
  • Tabular Data
    • Pandas: Better than Excel
    • Investment Drivers
    • From Data to Model: AirBnB
    • Time Series Data in Pandas
  • Text Data
    • How Computers “Read”
    • 30-Text-Data/060-Embeddings.ipynb

Navigation

On this page

  • Loading sample data
    • Ratios
  • Defining a model
  • Making predictions
  • Additional information
  1. Tabular Data
  2. Investment Drivers

Investment Drivers

In this session, we will explore how to analyze relationships between variables using regression models. As we’ve seen in prior classes, pandas allows one to load data and calculate basic statistics. To do more advanced statistics like regressions, we will require the help of other libraries such as statsmodels.

Statsmodels is a Python library that provides a wide range of statistical tools and models for data analysis and modeling. It is an open-source package that is built on top of the NumPy and SciPy libraries, and it has gained popularity among data scientists, statisticians, and researchers.

To use Statsmodels in your Python project, you first need to install and import the library into your code. You can do this by using the following import statement:

! pip3 install statsmodels
Requirement already satisfied: statsmodels in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (0.14.0)
Requirement already satisfied: numpy>=1.18 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from statsmodels) (1.26.4)
Requirement already satisfied: scipy!=1.9.2,>=1.4 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from statsmodels) (1.13.0)
Requirement already satisfied: pandas>=1.0 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from statsmodels) (2.2.1)
Requirement already satisfied: patsy>=0.5.2 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from statsmodels) (0.5.3)
Requirement already satisfied: packaging>=21.3 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from statsmodels) (23.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from pandas>=1.0->statsmodels) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from pandas>=1.0->statsmodels) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from pandas>=1.0->statsmodels) (2023.3)
Requirement already satisfied: six in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from patsy>=0.5.2->statsmodels) (1.16.0)
import statsmodels.api as sm

Loading sample data

For the sake of example, we will be using the Grunfeld dataset. A long introduction is beyond the scope, but it is a panel dataset that contains data on investment and other variables for 10 firms over a period of 20 years. It is a commonly used dataset in econometrics and has been used to study various aspects of investment behavior and firm performance.

data = sm.datasets.grunfeld.load_pandas()
df   = data.data

df.head()
invest value capital firm year
0 317.6 3078.5 2.8 General Motors 1935.0
1 391.8 4661.7 52.6 General Motors 1936.0
2 410.6 5387.1 156.9 General Motors 1937.0
3 257.7 2792.2 209.2 General Motors 1938.0
4 330.8 4313.2 203.4 General Motors 1939.0

Every row represents an annual observation of investment made by the company on physical capital (in millions), firm value (in millions), and capital stock (in millions) for General Motors over the years 1935 to 1939.

Ratios

To adjust the numbers per year and make them more comparable, we could express investment and capital stock as percentages of firm value. This normalizes the data relative to the size of the firm each year.

You try it

Compute the investment and capital ratios:

\[ \begin{align} \text{InvestRatio} &= \left( \frac{\text{Investment}}{\text{Firm Value}} \right) \times 100 \\ \text{CapitalStockRatio} &= \left( \frac{\text{Capital Stock}}{\text{Firm Value}} \right) \times 100 \end{align} \]

Then, using df[column].plot.hist(), plot the distribution of CapitalStockRatio.

...
Ellipsis
Solution
df['InvestRatio']       = df.invest / df.value * 100
df['CapitalStockRatio'] = df.capital / df.value * 100
df.InvestRatio.plot.hist(figsize=(7,5), color='gray', edgecolor='white');

We can also plot the correlation

df.plot.scatter(x='CapitalStockRatio',y='InvestRatio', alpha=.5);

Defining a model

One of the key features of Statsmodels is its Formula API, which allows you to specify statistical models using tilde-style formulas. This makes it easy to specify complex statistical models with minimal coding effort. To use the Formula API, you can import the following module:

import statsmodels.formula.api as smf

Let’s say we want to predict the investment ratio based on the current valuation, capital, and the year. We could do that as follows:

\[ \textrm{InvestRatio} = a + b_0 \textrm{CapitalStockRatio} + b_1 \textrm{Year} \]

The “tilde-style” formula’s that we were referring to before go as follows:

\[ Y \sim X_1 + X_2 + ... \] Which in our case would be:

\[ \textrm{InvestRatio}\,\, \sim\,\, \textrm{CapitalStockRatio} + \textrm{Year} \]

This sounds complicated, until you see how the whole regression is just a 1-liner in statsmodels!

model = smf.ols('InvestRatio ~ CapitalStockRatio + year', data=df).fit()

Once you have defined your model, you can obtain a summary of the results, which includes important statistics such as the R-squared value, standard errors, and p-values. This summary can be accessed using the .summary() method of the model object, as shown below:

model.summary()
OLS Regression Results
Dep. Variable: InvestRatio R-squared: 0.561
Model: OLS Adj. R-squared: 0.556
Method: Least Squares F-statistic: 138.4
Date: Mon, 02 Dec 2024 Prob (F-statistic): 1.82e-39
Time: 10:39:02 Log-Likelihood: -729.30
No. Observations: 220 AIC: 1465.
Df Residuals: 217 BIC: 1475.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -300.2015 155.916 -1.925 0.055 -607.506 7.102
CapitalStockRatio 0.0862 0.005 15.723 0.000 0.075 0.097
year 0.1589 0.080 1.980 0.049 0.001 0.317
Omnibus: 31.316 Durbin-Watson: 0.477
Prob(Omnibus): 0.000 Jarque-Bera (JB): 43.824
Skew: 0.871 Prob(JB): 3.05e-10
Kurtosis: 4.321 Cond. No. 6.71e+05


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.71e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

As you can see, this prints a nicely formatted output of the model summary, including the R-squared value and much more.

For this particular case, we conclude that firms with a higher proportion of capital stock relative to their value are more inclined or able to reinvest in their physical assets. The effect is significant and positive effect (coef = 0.086, p < 0.001) and over time we see a smaller upward trend (year coef = 0.159, p = 0.049). However, the other statistical tests show we may need to refine the model further.

Categorical variables are automatically detected and dummy encoded, but it’s good practice to mark them as categoricals by adding C around the variable names. Let’s say we wanted to consider firm-specific factors like management or culture:

model_cat = smf.ols('InvestRatio ~ CapitalStockRatio + year + C(firm)', data=df).fit()
model_cat.summary()
OLS Regression Results
Dep. Variable: InvestRatio R-squared: 0.835
Model: OLS Adj. R-squared: 0.825
Method: Least Squares F-statistic: 87.25
Date: Mon, 02 Dec 2024 Prob (F-statistic): 4.99e-74
Time: 10:39:03 Log-Likelihood: -621.59
No. Observations: 220 AIC: 1269.
Df Residuals: 207 BIC: 1313.
Df Model: 12
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -393.8923 112.090 -3.514 0.001 -614.877 -172.907
C(firm)[T.Atlantic Refining] 9.5560 1.540 6.207 0.000 6.521 12.591
C(firm)[T.Chrysler] 7.9287 1.734 4.573 0.000 4.510 11.347
C(firm)[T.Diamond Match] 0.6988 1.789 0.391 0.697 -2.829 4.227
C(firm)[T.General Electric] 0.6825 1.709 0.399 0.690 -2.686 4.051
C(firm)[T.General Motors] 9.5595 1.751 5.459 0.000 6.107 13.012
C(firm)[T.Goodyear] 2.9180 1.386 2.106 0.036 0.186 5.650
C(firm)[T.IBM] 7.9037 1.697 4.657 0.000 4.558 11.249
C(firm)[T.US Steel] 16.6070 1.748 9.503 0.000 13.162 20.052
C(firm)[T.Union Oil] 13.9922 1.600 8.747 0.000 10.838 17.146
C(firm)[T.Westinghouse] 2.3673 1.768 1.339 0.182 -1.119 5.854
CapitalStockRatio 0.0715 0.010 7.230 0.000 0.052 0.091
year 0.2042 0.058 3.521 0.001 0.090 0.318
Omnibus: 92.283 Durbin-Watson: 1.173
Prob(Omnibus): 0.000 Jarque-Bera (JB): 493.190
Skew: 1.557 Prob(JB): 8.04e-108
Kurtosis: 9.642 Cond. No. 7.69e+05


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.69e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

We see that Union Oil and US Steel firms consistently invest a much larger proportion of their firm value into capital expenditures compared to the baseline firm, after controlling for year and the capital stock ratio. Why?

Making predictions

Finally, once you have fitted your model, you can use it to make predictions on new data. To do this, you can use the .predict() method of the model object, as shown below:

import pandas as pd

new_data = pd.DataFrame({
    'CapitalStockRatio': [25],
    'year' : [1965]
})
predictedInvestmentRatio = model.predict(new_data)
print(predictedInvestmentRatio)
0    14.093509
dtype: float64

Don’t be misled by the term ‘predict’ though. While it suggests uncovering the unknown, this model is not forward-looking. To achieve that, you would need to use next year’s investments instead of this year’s. For a deeper understanding of using statistical models for business insights, refer to the Business Analytics course offered in the first-year core.

We can also use this function to plot the regression line of a model with 1 variable.

# DEFINE MODEL
model = smf.ols('InvestRatio ~ CapitalStockRatio', data=df).fit()

# MAKE PREDICTIONS
df['Predicted'] = model.predict(df)

# PLOT MODEL
import matplotlib.pyplot as plt

df.plot(x='CapitalStockRatio', y = 'Predicted', color='red', ax = plt.gca())
df.plot.scatter(y='InvestRatio', x='CapitalStockRatio', alpha=.3, ax = plt.gca());

Additional information

https://www.statsmodels.org/stable/examples/index.html