Investment Drivers

In this session, we will explore how to analyze relationships between variables using regression models. As we’ve seen in prior classes, pandas allows one to load data and calculate basic statistics. To do more advanced statistics like regressions, we will require the help of other libraries such as statsmodels.

Statsmodels is a Python library that provides a wide range of statistical tools and models for data analysis and modeling. It is an open-source package that is built on top of the NumPy and SciPy libraries, and it has gained popularity among data scientists, statisticians, and researchers.

To use Statsmodels in your Python project, you first need to install and import the library into your code. You can do this by using the following import statement:

! pip3 install statsmodels

Requirement already satisfied: statsmodels in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (0.14.0)
Requirement already satisfied: numpy>=1.18 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from statsmodels) (1.26.4)
Requirement already satisfied: scipy!=1.9.2,>=1.4 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from statsmodels) (1.13.0)
Requirement already satisfied: pandas>=1.0 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from statsmodels) (2.2.1)
Requirement already satisfied: patsy>=0.5.2 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from statsmodels) (0.5.3)
Requirement already satisfied: packaging>=21.3 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from statsmodels) (23.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from pandas>=1.0->statsmodels) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from pandas>=1.0->statsmodels) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from pandas>=1.0->statsmodels) (2023.3)
Requirement already satisfied: six in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from patsy>=0.5.2->statsmodels) (1.16.0)

import statsmodels.api as sm

Loading sample data

For the sake of example, we will be using the Grunfeld dataset. A long introduction is beyond the scope, but it is a panel dataset that contains data on investment and other variables for 10 firms over a period of 20 years. It is a commonly used dataset in econometrics and has been used to study various aspects of investment behavior and firm performance.

data = sm.datasets.grunfeld.load_pandas()
df   = data.data

df.head()

	invest	value	capital	firm	year
0	317.6	3078.5	2.8	General Motors	1935.0
1	391.8	4661.7	52.6	General Motors	1936.0
2	410.6	5387.1	156.9	General Motors	1937.0
3	257.7	2792.2	209.2	General Motors	1938.0
4	330.8	4313.2	203.4	General Motors	1939.0

Every row represents an annual observation of investment made by the company on physical capital (in millions), firm value (in millions), and capital stock (in millions) for General Motors over the years 1935 to 1939.

Ratios

To adjust the numbers per year and make them more comparable, we could express investment and capital stock as percentages of firm value. This normalizes the data relative to the size of the firm each year.

You try it

Compute the investment and capital ratios:

\[ \begin{align} \text{InvestRatio} &= \left( \frac{\text{Investment}}{\text{Firm Value}} \right) \times 100 \\ \text{CapitalStockRatio} &= \left( \frac{\text{Capital Stock}}{\text{Firm Value}} \right) \times 100 \end{align} \]

Then, using df[column].plot.hist(), plot the distribution of CapitalStockRatio.

...

Ellipsis

Solution

df['InvestRatio']       = df.invest / df.value * 100
df['CapitalStockRatio'] = df.capital / df.value * 100
df.InvestRatio.plot.hist(figsize=(7,5), color='gray', edgecolor='white');

We can also plot the correlation

df.plot.scatter(x='CapitalStockRatio',y='InvestRatio', alpha=.5);

Defining a model

One of the key features of Statsmodels is its Formula API, which allows you to specify statistical models using tilde-style formulas. This makes it easy to specify complex statistical models with minimal coding effort. To use the Formula API, you can import the following module:

import statsmodels.formula.api as smf

Let’s say we want to predict the investment ratio based on the current valuation, capital, and the year. We could do that as follows:

\[ \textrm{InvestRatio} = a + b_0 \textrm{CapitalStockRatio} + b_1 \textrm{Year} \]

The “tilde-style” formula’s that we were referring to before go as follows:

\[ Y \sim X_1 + X_2 + ... \] Which in our case would be:

\[ \textrm{InvestRatio}\,\, \sim\,\, \textrm{CapitalStockRatio} + \textrm{Year} \]

This sounds complicated, until you see how the whole regression is just a 1-liner in statsmodels!

model = smf.ols('InvestRatio ~ CapitalStockRatio + year', data=df).fit()

Once you have defined your model, you can obtain a summary of the results, which includes important statistics such as the R-squared value, standard errors, and p-values. This summary can be accessed using the .summary() method of the model object, as shown below:

model.summary()

OLS Regression Results
Dep. Variable:	InvestRatio	R-squared:	0.561
Model:	OLS	Adj. R-squared:	0.556
Method:	Least Squares	F-statistic:	138.4
Date:	Mon, 02 Dec 2024	Prob (F-statistic):	1.82e-39
Time:	10:39:02	Log-Likelihood:	-729.30
No. Observations:	220	AIC:	1465.
Df Residuals:	217	BIC:	1475.
Df Model:	2
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	-300.2015	155.916	-1.925	0.055	-607.506	7.102
CapitalStockRatio	0.0862	0.005	15.723	0.000	0.075	0.097
year	0.1589	0.080	1.980	0.049	0.001	0.317

Omnibus:	31.316	Durbin-Watson:	0.477
Prob(Omnibus):	0.000	Jarque-Bera (JB):	43.824
Skew:	0.871	Prob(JB):	3.05e-10
Kurtosis:	4.321	Cond. No.	6.71e+05

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.71e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

As you can see, this prints a nicely formatted output of the model summary, including the R-squared value and much more.

For this particular case, we conclude that firms with a higher proportion of capital stock relative to their value are more inclined or able to reinvest in their physical assets. The effect is significant and positive effect (coef = 0.086, p < 0.001) and over time we see a smaller upward trend (year coef = 0.159, p = 0.049). However, the other statistical tests show we may need to refine the model further.

Categorical variables are automatically detected and dummy encoded, but it’s good practice to mark them as categoricals by adding C around the variable names. Let’s say we wanted to consider firm-specific factors like management or culture:

model_cat = smf.ols('InvestRatio ~ CapitalStockRatio + year + C(firm)', data=df).fit()
model_cat.summary()

OLS Regression Results
Dep. Variable:	InvestRatio	R-squared:	0.835
Model:	OLS	Adj. R-squared:	0.825
Method:	Least Squares	F-statistic:	87.25
Date:	Mon, 02 Dec 2024	Prob (F-statistic):	4.99e-74
Time:	10:39:03	Log-Likelihood:	-621.59
No. Observations:	220	AIC:	1269.
Df Residuals:	207	BIC:	1313.
Df Model:	12
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	-393.8923	112.090	-3.514	0.001	-614.877	-172.907
C(firm)[T.Atlantic Refining]	9.5560	1.540	6.207	0.000	6.521	12.591
C(firm)[T.Chrysler]	7.9287	1.734	4.573	0.000	4.510	11.347
C(firm)[T.Diamond Match]	0.6988	1.789	0.391	0.697	-2.829	4.227
C(firm)[T.General Electric]	0.6825	1.709	0.399	0.690	-2.686	4.051
C(firm)[T.General Motors]	9.5595	1.751	5.459	0.000	6.107	13.012
C(firm)[T.Goodyear]	2.9180	1.386	2.106	0.036	0.186	5.650
C(firm)[T.IBM]	7.9037	1.697	4.657	0.000	4.558	11.249
C(firm)[T.US Steel]	16.6070	1.748	9.503	0.000	13.162	20.052
C(firm)[T.Union Oil]	13.9922	1.600	8.747	0.000	10.838	17.146
C(firm)[T.Westinghouse]	2.3673	1.768	1.339	0.182	-1.119	5.854
CapitalStockRatio	0.0715	0.010	7.230	0.000	0.052	0.091
year	0.2042	0.058	3.521	0.001	0.090	0.318

Omnibus:	92.283	Durbin-Watson:	1.173
Prob(Omnibus):	0.000	Jarque-Bera (JB):	493.190
Skew:	1.557	Prob(JB):	8.04e-108
Kurtosis:	9.642	Cond. No.	7.69e+05

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.69e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

We see that Union Oil and US Steel firms consistently invest a much larger proportion of their firm value into capital expenditures compared to the baseline firm, after controlling for year and the capital stock ratio. Why?

Making predictions

Finally, once you have fitted your model, you can use it to make predictions on new data. To do this, you can use the .predict() method of the model object, as shown below:

import pandas as pd

new_data = pd.DataFrame({
    'CapitalStockRatio': [25],
    'year' : [1965]
})
predictedInvestmentRatio = model.predict(new_data)
print(predictedInvestmentRatio)

0    14.093509
dtype: float64

Don’t be misled by the term ‘predict’ though. While it suggests uncovering the unknown, this model is not forward-looking. To achieve that, you would need to use next year’s investments instead of this year’s. For a deeper understanding of using statistical models for business insights, refer to the Business Analytics course offered in the first-year core.

We can also use this function to plot the regression line of a model with 1 variable.

# DEFINE MODEL
model = smf.ols('InvestRatio ~ CapitalStockRatio', data=df).fit()

# MAKE PREDICTIONS
df['Predicted'] = model.predict(df)

# PLOT MODEL
import matplotlib.pyplot as plt

df.plot(x='CapitalStockRatio', y = 'Predicted', color='red', ax = plt.gca())
df.plot.scatter(y='InvestRatio', x='CapitalStockRatio', alpha=.3, ax = plt.gca());

Additional information

https://www.statsmodels.org/stable/examples/index.html