Time Series Data in Pandas

Dates and times can be used as the index of a Pandas data frame. Many time series methods in Pandas rely on the time information in the index to provide time series functionality. Although there are several index types designed for this, this tutorial is restricted to the type DatetimeIndex. To illustrate the use of date indexes, let us take a look at the Bitcoin prices for the past year.
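As a minimal sketch (with made-up numbers, not the Bitcoin data used below), a date-indexed series can be built directly from date strings:

import pandas as pd

# Toy example: three made-up prices indexed by date
prices = pd.Series(
    [101.2, 102.5, 99.8],
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']),
)
prices.index  # DatetimeIndex(['2024-01-01', '2024-01-02', '2024-01-03'], dtype='datetime64[ns]', freq=None)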

Downloading the data

We will call the CoinGecko API (Application Programming Interface) to obtain the price series of Bitcoin over the past year. These APIs are typically just websites that you visit. In this case:

https://api.coingecko.com/api/v3/coins/bitcoin/ohlc?vs_currency=usd&days=365&precision=2

I was able to find this information by reading through the CoinGecko API documentation. In case the download doesn’t work for you for some reason, you can also download the corresponding Bitcoin data that I downloaded some time ago here.

import pandas as pd

df_downloaded = pd.read_json(
    'https://api.coingecko.com/api/v3/coins/bitcoin/ohlc?vs_currency=usd&days=365&precision=2',
)
df_downloaded.to_json('./resources/bitcoin_ohlc.csv') # save a copy (stored as JSON, despite the .csv extension)

Usually these APIs have rate limits, so it’s good practice to save a copy so that we don’t have to download the data again every time we accidentally remove or overwrite it.
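For example, if the API is unavailable later, you can reload the saved copy instead of calling it again (a sketch assuming the file was written by the cell above):

# Fallback: read the local copy saved above (it is JSON, despite the .csv extension)
df_downloaded = pd.read_json('./resources/bitcoin_ohlc.csv')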

df = df_downloaded.copy()
df.head()
0 1 2 3 4
0 1702080000000 43285.62 44676.25 43133.88 44202.18
1 1702425600000 44158.48 44345.69 40321.94 41450.82
2 1702771200000 41459.49 43329.32 40712.28 42247.07
3 1703116800000 42248.13 44200.80 40570.91 43634.12
4 1703462400000 43659.92 44243.31 42895.13 43034.97

This is real, raw data so we will need to make some minor adjustments before we can start our analysis. First we’re going to label the columns:

df.columns = ['Date','Open','High','Low','Close']

Now we need to tell Pandas that the first column is not an integer but in fact a date. This is done using the to_datetime() function, which expects the original series and a format hint as arguments.

# The dates are in Unix time (milliseconds since 1970-01-01), so we
# convert them with unit='ms'. If you're not sure what format a
# timestamp is in, asking an LLM is a quick way to find out.
df['Date'] = pd.to_datetime(df.Date, unit='ms')
# since Date is the key identifier, it's appropriate to remove it
# from the columns and set it as index instead
df = df.set_index('Date')
df.head()
Open High Low Close
Date
2023-12-09 43285.62 44676.25 43133.88 44202.18
2023-12-13 44158.48 44345.69 40321.94 41450.82
2023-12-17 41459.49 43329.32 40712.28 42247.07
2023-12-21 42248.13 44200.80 40570.91 43634.12
2023-12-25 43659.92 44243.31 42895.13 43034.97

That’s better!

Datetime indexes in Pandas

Let’s take a look at our data types:

df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 92 entries, 2023-12-09 to 2024-12-07
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Open    92 non-null     float64
 1   High    92 non-null     float64
 2   Low     92 non-null     float64
 3   Close   92 non-null     float64
dtypes: float64(4)
memory usage: 3.6 KB

The index looks interesting; it is called a DatetimeIndex:

df.index
DatetimeIndex(['2023-12-09', '2023-12-13', '2023-12-17', '2023-12-21',
               '2023-12-25', '2023-12-29', '2024-01-02', '2024-01-06',
               '2024-01-10', '2024-01-14', '2024-01-18', '2024-01-22',
               '2024-01-26', '2024-01-30', '2024-02-03', '2024-02-07',
               '2024-02-11', '2024-02-15', '2024-02-19', '2024-02-23',
               '2024-02-27', '2024-03-02', '2024-03-06', '2024-03-10',
               '2024-03-14', '2024-03-18', '2024-03-22', '2024-03-26',
               '2024-03-30', '2024-04-03', '2024-04-07', '2024-04-11',
               '2024-04-15', '2024-04-19', '2024-04-23', '2024-04-27',
               '2024-05-01', '2024-05-05', '2024-05-09', '2024-05-13',
               '2024-05-17', '2024-05-21', '2024-05-25', '2024-05-29',
               '2024-06-02', '2024-06-06', '2024-06-10', '2024-06-14',
               '2024-06-18', '2024-06-22', '2024-06-26', '2024-06-30',
               '2024-07-04', '2024-07-08', '2024-07-12', '2024-07-16',
               '2024-07-20', '2024-07-24', '2024-07-28', '2024-08-01',
               '2024-08-05', '2024-08-09', '2024-08-13', '2024-08-17',
               '2024-08-21', '2024-08-25', '2024-08-29', '2024-09-02',
               '2024-09-06', '2024-09-10', '2024-09-14', '2024-09-18',
               '2024-09-22', '2024-09-26', '2024-09-30', '2024-10-04',
               '2024-10-08', '2024-10-12', '2024-10-16', '2024-10-20',
               '2024-10-24', '2024-10-28', '2024-11-01', '2024-11-05',
               '2024-11-09', '2024-11-13', '2024-11-17', '2024-11-21',
               '2024-11-25', '2024-11-29', '2024-12-03', '2024-12-07'],
              dtype='datetime64[ns]', name='Date', freq=None)

Pandas inherits the data type datetime64 from NumPy. The expression 'ns' within the square brackets means nanoseconds. The index takes the name of the corresponding column in the data source (Date).
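Beyond its printed form, the index exposes many datetime attributes; a quick sketch of a few of them:

# A few attributes available on a DatetimeIndex
df.index.year      # 2023, 2023, ..., 2024
df.index.month     # 1 to 12
df.index.weekday   # 0 = Monday, ..., 6 = Sunday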

This index can be used to select rows, as the following example shows. Mind that the selection is based on the dates, irrespective of the order of the index values in the data frame.

df['2024-08-01':]
Open High Low Close
Date
2024-08-01 67777.47 69907.77 64542.65 64679.78
2024-08-05 64586.88 65504.97 57399.09 58006.21
2024-08-09 58163.74 62490.53 49221.15 61859.03
2024-08-13 61570.23 61667.52 57814.74 59350.07
2024-08-17 59347.11 61539.91 56646.51 58881.93
2024-08-21 58873.37 61361.50 57918.46 59060.13
2024-08-25 59028.53 64828.31 58905.22 64157.16
2024-08-29 64187.06 64893.52 57967.74 59015.30
2024-09-02 59042.05 61150.33 57257.71 57357.72
2024-09-06 57298.28 59812.90 55746.63 56132.40
2024-09-10 56163.71 57896.82 52690.15 57049.12
2024-09-14 57023.26 60654.99 55573.18 60620.63
2024-09-18 60545.37 61242.97 57558.77 60317.03
2024-09-22 60320.15 64082.40 59212.14 63403.40
2024-09-26 63406.35 64755.49 62479.16 63151.90
2024-09-30 63146.21 66439.49 62811.76 65663.69
2024-10-04 65602.97 65602.97 59953.67 60727.86
2024-10-08 60749.23 64499.65 60469.82 62287.39
2024-10-12 62211.23 63361.72 58935.06 62392.34
2024-10-16 62464.88 67803.50 62059.51 66962.22
2024-10-20 67103.49 68970.43 66738.86 68388.87
2024-10-24 68361.87 69431.86 65161.83 66683.97
2024-10-28 66634.17 68821.39 65563.64 67938.55
2024-11-01 67959.92 73562.33 67557.67 70265.00
2024-11-05 70179.74 71500.10 66853.85 67793.30
2024-11-09 67799.95 77239.75 67447.92 76550.02
2024-11-13 76513.26 89864.13 75868.49 88264.60
2024-11-17 87977.19 93477.11 86353.40 90606.45
2024-11-21 90596.74 94891.17 88774.43 94217.02
2024-11-25 94365.50 99645.39 94159.39 98015.94
2024-11-29 98025.78 98882.48 90752.43 95661.60
2024-12-03 95648.62 98684.89 94497.60 95833.14
2024-12-07 95861.96 103678.84 92980.47 99973.85

The structure of the index also helps to select the data in special ways. For instance, to select the Friday opening prices, you can apply the day_name() method to the index:

df.index.day_name()
Index(['Saturday', 'Wednesday', 'Sunday', 'Thursday', 'Monday', 'Friday',
       'Tuesday', 'Saturday', 'Wednesday', 'Sunday', 'Thursday', 'Monday',
       'Friday', 'Tuesday', 'Saturday', 'Wednesday', 'Sunday', 'Thursday',
       'Monday', 'Friday', 'Tuesday', 'Saturday', 'Wednesday', 'Sunday',
       'Thursday', 'Monday', 'Friday', 'Tuesday', 'Saturday', 'Wednesday',
       'Sunday', 'Thursday', 'Monday', 'Friday', 'Tuesday', 'Saturday',
       'Wednesday', 'Sunday', 'Thursday', 'Monday', 'Friday', 'Tuesday',
       'Saturday', 'Wednesday', 'Sunday', 'Thursday', 'Monday', 'Friday',
       'Tuesday', 'Saturday', 'Wednesday', 'Sunday', 'Thursday', 'Monday',
       'Friday', 'Tuesday', 'Saturday', 'Wednesday', 'Sunday', 'Thursday',
       'Monday', 'Friday', 'Tuesday', 'Saturday', 'Wednesday', 'Sunday',
       'Thursday', 'Monday', 'Friday', 'Tuesday', 'Saturday', 'Wednesday',
       'Sunday', 'Thursday', 'Monday', 'Friday', 'Tuesday', 'Saturday',
       'Wednesday', 'Sunday', 'Thursday', 'Monday', 'Friday', 'Tuesday',
       'Saturday', 'Wednesday', 'Sunday', 'Thursday', 'Monday', 'Friday',
       'Tuesday', 'Saturday'],
      dtype='object', name='Date')
df['Open'][df.index.day_name() == 'Friday'].head()
Date
2023-12-29    42999.49
2024-01-26    41545.55
2024-02-23    52117.12
2024-03-22    68425.48
2024-04-19    65680.45
Name: Open, dtype: float64

Alternatively, you could use df.index.weekday == 4 (a sketch follows the output below). To select the opening prices from a particular month:

df['Open'][df.index.month == 2].head()
Date
2024-02-03    43285.81
2024-02-07    43172.85
2024-02-11    43077.87
2024-02-15    47769.89
2024-02-19    51821.25
Name: Open, dtype: float64
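As mentioned above, the same Friday selection can also be written with the numeric weekday; a minimal sketch:

# Same Friday filter, using the numeric weekday (Monday = 0, ..., Friday = 4)
df['Open'][df.index.weekday == 4].head()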

Also, Pandas takes advantage of the index to format the horizontal axis in a better way, as shown in the following chart:

df['Open'].plot(figsize=(10,6), color='black', linewidth=1, ylabel='Bitcoin Price ($)');

# Fancy extra code for text annotation
import matplotlib.pyplot as plt

# Adding text annotation
plt.scatter(pd.Timestamp('2024-11-05'), df.loc['2024-11-05', 'Open'], color='r')
plt.gca().text(x=pd.Timestamp('2024-10-15'), 
               y=df.loc['2024-11-05', 'Open'] + 2500, 
               s="Election Day", 
               fontsize=10, color='red', ha='center');

Calculating returns

Daily returns can be obtained with the Pandas function pct_change. Though the first observation gets a NaN value, because there is no return for the first day, the index of the series is preserved. Mind that pct_change uses the current order of the rows, not the date values of the index, so the data have to be sorted by date (a sorting sketch follows the output below).

df_return = df['Close'].pct_change()
df_return.head()
Date
2023-12-09         NaN
2023-12-13   -0.062245
2023-12-17    0.019210
2023-12-21    0.032832
2023-12-25   -0.013731
Name: Close, dtype: float64
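If you are not sure the rows are already in chronological order, you can sort by the index before computing the returns; a minimal sketch:

# Sort by date first, then compute the daily returns
df_return = df['Close'].sort_index().pct_change()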

The daily returns can be plotted in the same way as the prices:

df_return.plot(figsize=(10,6), color='black', linewidth=1);

Weekly returns, based on the percentage change across Fridays, can be calculated after filtering out the rest of the days. Beware that this could be wrong for the weeks in which the Friday is not a trading day.

df_w_return = df[df.index.weekday == 4]['Close'].pct_change()
df_w_return.head()
Date
2023-12-29         NaN
2024-01-26   -0.062496
2024-02-23    0.284970
2024-03-22    0.277019
2024-04-19   -0.031653
Name: Close, dtype: float64

You try it

What is the median return (pct_change) on the different days of the week if you always buy/sell at the High price?

  1. Create a new column where you store the day name
  2. Create another column to store the percentage change
  3. Group by the day column, and calculate the relevant statistic for the percentage change.
  4. Plot the result using df.plot.bar()
...
Ellipsis
Solution
df['DayOfWeek'] = df.index.day_name()
df['PctReturn'] = df.High.pct_change()*100

df.groupby('DayOfWeek')['PctReturn'].median().plot.bar();

Prettier Solution
import matplotlib.pyplot  as plt

# Work with % change instead (return from high to high)
df['DayOfWeek'] = df.index.day_name()
df['PctReturn'] = df.High.pct_change()*100

weekday_change = df.groupby('DayOfWeek')['PctReturn'].agg(['median','std'])

# Nicer plot
weekday_change = weekday_change.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.bar(
    weekday_change.index,
    weekday_change['median'],
    yerr=weekday_change['std'],                   # error bars showing the std across weeks
    # error_kw=dict(ecolor='red', elinewidth=2)   # uncomment to style the error bars
)
plt.title('BTC\nMedian Percentage Return by Day of Week')
plt.ylabel("Median Percentage Change (%)");

Resampling

Aggregating time series data by a new time period is often called resampling. The resample method, designed for Pandas data frames with a DatetimeIndex, is driven by the index. In the following example, I apply this method to get the average weekly open price of Bitcoin.

df_w_open = df['Open'].resample('W').mean()

The same can be done to aggregate the data by month (see the sketch after the output below). Note that the series is indexed by the Sundays (even when they are not included in the actual data); this is just how Pandas does weekly resampling by default.

df_w_open.head(5)
Date
2023-12-10    43285.620
2023-12-17    42808.985
2023-12-24    42248.130
2023-12-31    43329.705
2024-01-07    43369.915
Freq: W-SUN, Name: Open, dtype: float64
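A sketch of the monthly aggregation mentioned above (recent Pandas versions use the 'ME' alias for month end; older versions use 'M'):

# Average monthly opening price
df_m_open = df['Open'].resample('ME').mean()
df_m_open.head()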

We can get a line plot of the weekly prices with:

df_w_open.plot(figsize=(10,6), color='black', linewidth=1);

Rolling windows

The logic of rolling is similar to that of resampling, but it groups the rows around every value of the index (except at the extremes). The set of these groups is called a rolling window. This is easily seen in the following calculation, which extracts a moving average (MA) trend for the opening prices (moving average is a popular name for the rolling mean).

df['Open'].rolling(5).mean().head(10)
Date
2023-12-09          NaN
2023-12-13          NaN
2023-12-17          NaN
2023-12-21          NaN
2023-12-25    42962.328
2023-12-29    42905.102
2024-01-02    42591.698
2024-01-06    43129.474
2024-01-10    43512.834
2024-01-14    44009.056
Name: Open, dtype: float64

By default, the terms included in the rolling window are collected leftwards. Here the length has been set to 5, so the first 4 terms of the moving average are NaN’s. But you may prefer to center the window on the current observation. In that case, you will get 2 NaN’s at each extreme of the series. This is done with:

ma = df['Open'].rolling(5, center=True).mean()
ma.head(10)
Date
2023-12-09          NaN
2023-12-13          NaN
2023-12-17    42962.328
2023-12-21    42905.102
2023-12-25    42591.698
2023-12-29    43129.474
2024-01-02    43512.834
2024-01-06    44009.056
2024-01-10    43973.846
2024-01-14    44006.104
Name: Open, dtype: float64

A moving average is frequently used to create a nonparametric trend, that is, a trend which is not given as a mathematical function of time, like a linear or quadratic trend. The chart below illustrates this. The dashed line corresponds to the actual prices, and the solid line to the MA trend.

ma.plot(figsize=(10,6), color='black', linewidth=1)
df['Open'].plot(color='gray', linestyle='--', linewidth=1);

This part is only for when we have enough time, though I highly recommend that you try it, as it shows how to combine datasets.

Microsoft vs. Apple Analysis

We could do a similar analysis for more traditional stocks like Apple and Microsoft. For this, we’d need to download their respective datasets from Yahoo Finance and load them in as usual. Yahoo has recently disabled this functionality on their website, but you can either crawl it yourself or use packages that can help you with that.

Let’s use yfinance (installed using ! pip install yfinance). If it doesn’t work for you, you can download the time series from here: MSFT, AAPL, NVDA.

import yfinance as yf

# Define the date range
start_date = "1980-12-12"
end_date = "2024-12-09"

# Fetch the stock data
df_msft = yf.download("MSFT", start=start_date, end=end_date)
df_msft.columns = [col[0] for col in df_msft.columns]
df_msft.to_csv('./resources/MSFT.csv')

df_appl = yf.download("AAPL", start=start_date, end=end_date)
df_appl.columns = [col[0] for col in df_appl.columns]
df_appl.to_csv('./resources/AAPL.csv')
# Load it back in
df_msft = pd.read_csv('./resources/MSFT.csv', index_col='Date')
df_aapl = pd.read_csv('./resources/AAPL.csv', index_col='Date')
df_aapl.tail()
Adj Close Close High Low Open Volume
Date
2024-12-02 239.589996 239.589996 240.789993 237.160004 237.270004 48137100
2024-12-03 242.649994 242.649994 242.759995 238.899994 239.809998 38861000
2024-12-04 243.009995 243.009995 244.110001 241.250000 242.869995 44383900
2024-12-05 243.039993 243.039993 244.539993 242.130005 243.990005 40033900
2024-12-06 242.839996 242.839996 244.630005 242.080002 242.910004 36852100

We have two close prices here; it’s common to use Adj Close for analysis, as it adjusts for splits, dividends, etc.
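For instance, a sketch of daily returns based on the adjusted close (assuming the Adj Close column is present, as in the output above):

# Daily returns computed from the adjusted close price
aapl_adj_return = df_aapl['Adj Close'].pct_change()
aapl_adj_return.tail()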

We have another problem though, these are separate datasets! How do we combine them into one dataframe?

Combining dataframes

In Pandas, there are several ways to combine two dataframes. If the two datasets look the same w.r.t. their columns and we wish to paste them together vertically, we use the pd.concat([df1,df2]) function. A more frequently occurring scenario is that we need to join two dataframes horizontally based on a common column or index. The most common ways to merge dataframes in Pandas:

Join: df1.join(df2,how='inner'), joins two dataframes by index. You can also specify a different column using the on keyword, but in such cases, it’s probably better to use the merge command explained below. The how keyword specifies what happens in case a value is not shared in the common column:

  • inner means that a row with an index that only appears in df1 OR in df2 will be dropped in the resulting dataframe
  • outer means that all rows are kept; this creates NaN values in rows where the index value only appears in one of df1 or df2
  • left returns all the rows from the left dataframe and only the matching rows from the right dataframe
  • right returns all the rows from the right dataframe and only the matching rows from the left dataframe

Merge: pd.merge(df1,df2,on='common_column',how='inner'), this merges the two dataframes on their common columns (by default any column that appears in both dataframes). The how parameter specifies what happens in case a value is not shared in the common column as explained before.
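A minimal sketch contrasting the three approaches on two tiny made-up frames (the names and values here are purely illustrative):

import pandas as pd

# Two toy frames that share a 'Date' column
a = pd.DataFrame({'Date': ['2024-01-01', '2024-01-02'], 'MSFT': [1.0, 2.0]})
b = pd.DataFrame({'Date': ['2024-01-02', '2024-01-03'], 'AAPL': [10.0, 20.0]})

pd.concat([a, a])                                            # paste vertically (same columns)
a.set_index('Date').join(b.set_index('Date'), how='inner')   # join on the index: only 2024-01-02 survives
pd.merge(a, b, on='Date', how='outer')                       # merge on the common 'Date' column, keeping all dates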

Note: merging frames is a classic way to crash your kernel as you will potentially use massive amounts of memory. This shouldn’t be a problem for this particular dataset though.

Exercises

You try it:

  1. Create two new DataFrames (one for MSFT and one for AAPL) by:
    1. Extracting 1 column with the Open price; take care that you extract a dataframe, not a series!
    2. Renaming that column so that it’s called MSFT and AAPL respectively using .columns
  2. Use the join command to merge the apple and microsoft dataframes.
df_msft_open = ... # MSFT dataframe with index Date and column MSFT
df_appl_open = ... # AAPL dataframe with index Date and column AAPL
df_combined = ... # join the two dataframes
                  # select only data after the year 2000       
Solution
df_msft_open = df_msft[['Open']]
df_appl_open = df_aapl[['Open']]

df_msft_open.columns = ['MSFT']
df_appl_open.columns = ['AAPL']

df_combined = df_msft_open.join(df_appl_open,how='inner')

df_combined.head()
MSFT AAPL
Date
1986-03-13 0.088542 0.110491
1986-03-14 0.097222 0.110491
1986-03-17 0.100694 0.116071
1986-03-18 0.102431 0.116071
1986-03-19 0.099826 0.119978

You try it:

  1. Select only the data from 2000 onward and plot the price.
  2. Use plot() to plot the estimated returns since 2000.
...
Ellipsis
Naive Solution
df_combined = df_combined['2000-01-01':]

# Attempt 1: it shows the price evolution but not the return
df_combined.plot(figsize=(10,4), color=['black','red']);

Pct Change Solution
# Calculate daily percentage change
daily_returns = df_combined.pct_change()

# Convert to cumulative returns
cumulative_returns = (1 + daily_returns).cumprod() - 1

# Plot the cumulative returns
cumulative_returns.plot(figsize=(10,4), color=['black','red']);

Easier Solution
# Here we divide by the starting price to get a 'fairer' comparison
df_combined = df_combined['2000-01-01':]

(df_combined / df_combined.iloc[0,:]).plot(figsize=(10,4), color=['black','red'], title='Gross Returns - Buy and Hold Strategy');

Exercises

You try it:

What would have happened had you started at the beginning of the time series around March 1986?

Try adding NVIDIA in the mix.

...
# NVIDIA
import yfinance as yf

if False:  # set to True to re-download the data from Yahoo Finance
    # Define the date range
    start_date = "1980-12-12"
    end_date = "2024-12-09"

    # Fetch the stock data
    df_nvda = yf.download("NVDA", start=start_date, end=end_date)
    df_nvda.columns = [col[0] for col in df_nvda.columns]
    df_nvda.to_csv('./resources/NVDA.csv')


df_nvda = pd.read_csv('./resources/NVDA.csv', index_col='Date')

df_nvda_open = df_nvda[['Open']]
df_msft_open = df_msft[['Open']]
df_appl_open = df_aapl[['Open']]

df_nvda_open.columns = ['NVDA']
df_msft_open.columns = ['MSFT']
df_appl_open.columns = ['AAPL']

df_combined = df_msft_open.join(df_appl_open,how='inner').join(df_nvda_open, how='inner')

# Plot
(df_combined / df_combined.iloc[0,:]).plot(figsize=(10,4), color=['black','red','green'], title='Gross Returns - Buy and Hold Strategy');

References

Original AAPL/MSFT content adapted from: Miguel Ángel Canela