Time Series Data in Pandas

Dates and times can be used as the index of a Pandas data frame. Many time series methods in Pandas rely on the time information in the index to provide time series functionality. Although there are several index types specific to this, this tutorial is restricted to the type DatetimeIndex. To illustrate the use of date indexes, let us take a look at the Bitcoin prices for the past year.

Downloading the data

We will call the CoinGecko API (Application Programming Interface) to obtain the price series of Bitcoin over the past year. Typically, these APIs are essentially just websites that you visit. In this case:

https://api.coingecko.com/api/v3/coins/bitcoin/ohlc?vs_currency=usd&days=365&precision=2

I was able to find this information by reading through the CoinGecko API documentation. In case the download doesn’t work for you for some reason, you can also download the corresponding bitcoin chart that I downloaded some time ago here.

Usually these APIs have rate limits, so it’s good practice to save a local copy, both to avoid downloading it every time and as a backup in case we accidentally remove or overwrite it.

import pandas as pd

df_downloaded = pd.read_json(
    'https://api.coingecko.com/api/v3/coins/bitcoin/ohlc?vs_currency=usd&days=365&precision=2',
)
df_downloaded.to_json('./resources/bitcoin_ohlc.csv')  # save a copy
df = df_downloaded.copy()
df.head()
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 1702080000000 | 43285.62 | 44676.25 | 43133.88 | 44202.18 |
1 | 1702425600000 | 44158.48 | 44345.69 | 40321.94 | 41450.82 |
2 | 1702771200000 | 41459.49 | 43329.32 | 40712.28 | 42247.07 |
3 | 1703116800000 | 42248.13 | 44200.80 | 40570.91 | 43634.12 |
4 | 1703462400000 | 43659.92 | 44243.31 | 42895.13 | 43034.97 |
This is real, raw data so we will need to make some minor adjustments before we can start our analysis. First we’re going to label the columns:
df.columns = ['Date','Open','High','Low','Close']
Now we need to tell pandas that the first column is not an integer but in fact a date. This is done using the to_datetime() function, which expects the original series and a hint about the format (here, the unit) as arguments.
# the dates are in Unix time (milliseconds since 1970-01-01); when
# you're not sure what format a timestamp is in, an LLM can often
# help you identify it
df['Date'] = pd.to_datetime(df.Date, unit='ms')
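For reference, when timestamps are strings rather than Unix time, the format argument tells pandas how to parse them. A minimal sketch (the date strings below are invented for illustration):

```python
import pandas as pd

# day-first strings: without a hint, '05-01-2024' could be read as May 1
s = pd.Series(['05-01-2024', '06-01-2024'])
parsed = pd.to_datetime(s, format='%d-%m-%Y')

print(parsed.dt.day.tolist())  # [5, 6], both dates are in January
```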
# since Date is the key identifier, it's appropriate to remove it
# from the columns and set it as the index instead
df = df.set_index('Date')
df.head()
Open | High | Low | Close | |
---|---|---|---|---|
Date | ||||
2023-12-09 | 43285.62 | 44676.25 | 43133.88 | 44202.18 |
2023-12-13 | 44158.48 | 44345.69 | 40321.94 | 41450.82 |
2023-12-17 | 41459.49 | 43329.32 | 40712.28 | 42247.07 |
2023-12-21 | 42248.13 | 44200.80 | 40570.91 | 43634.12 |
2023-12-25 | 43659.92 | 44243.31 | 42895.13 | 43034.97 |
That’s better!
Datetime indexes in Pandas
Let’s take a look at our data types:
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 92 entries, 2023-12-09 to 2024-12-07
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Open 92 non-null float64
1 High 92 non-null float64
2 Low 92 non-null float64
3 Close 92 non-null float64
dtypes: float64(4)
memory usage: 3.6 KB
The index looks interesting; it’s called a DatetimeIndex:
df.index
DatetimeIndex(['2023-12-09', '2023-12-13', '2023-12-17', '2023-12-21',
'2023-12-25', '2023-12-29', '2024-01-02', '2024-01-06',
'2024-01-10', '2024-01-14', '2024-01-18', '2024-01-22',
'2024-01-26', '2024-01-30', '2024-02-03', '2024-02-07',
'2024-02-11', '2024-02-15', '2024-02-19', '2024-02-23',
'2024-02-27', '2024-03-02', '2024-03-06', '2024-03-10',
'2024-03-14', '2024-03-18', '2024-03-22', '2024-03-26',
'2024-03-30', '2024-04-03', '2024-04-07', '2024-04-11',
'2024-04-15', '2024-04-19', '2024-04-23', '2024-04-27',
'2024-05-01', '2024-05-05', '2024-05-09', '2024-05-13',
'2024-05-17', '2024-05-21', '2024-05-25', '2024-05-29',
'2024-06-02', '2024-06-06', '2024-06-10', '2024-06-14',
'2024-06-18', '2024-06-22', '2024-06-26', '2024-06-30',
'2024-07-04', '2024-07-08', '2024-07-12', '2024-07-16',
'2024-07-20', '2024-07-24', '2024-07-28', '2024-08-01',
'2024-08-05', '2024-08-09', '2024-08-13', '2024-08-17',
'2024-08-21', '2024-08-25', '2024-08-29', '2024-09-02',
'2024-09-06', '2024-09-10', '2024-09-14', '2024-09-18',
'2024-09-22', '2024-09-26', '2024-09-30', '2024-10-04',
'2024-10-08', '2024-10-12', '2024-10-16', '2024-10-20',
'2024-10-24', '2024-10-28', '2024-11-01', '2024-11-05',
'2024-11-09', '2024-11-13', '2024-11-17', '2024-11-21',
'2024-11-25', '2024-11-29', '2024-12-03', '2024-12-07'],
dtype='datetime64[ns]', name='Date', freq=None)
Pandas inherits the data type datetime64 from NumPy. The 'ns' within the square brackets means nanoseconds. The index takes the name of the corresponding column in the data source (Date).
This index can be used to select rows, as the following example shows. Mind that the selection is based on the dates, irrespective of the order of the index values in the data frame.
df['2024-08-01':]
Open | High | Low | Close | |
---|---|---|---|---|
Date | ||||
2024-08-01 | 67777.47 | 69907.77 | 64542.65 | 64679.78 |
2024-08-05 | 64586.88 | 65504.97 | 57399.09 | 58006.21 |
2024-08-09 | 58163.74 | 62490.53 | 49221.15 | 61859.03 |
2024-08-13 | 61570.23 | 61667.52 | 57814.74 | 59350.07 |
2024-08-17 | 59347.11 | 61539.91 | 56646.51 | 58881.93 |
2024-08-21 | 58873.37 | 61361.50 | 57918.46 | 59060.13 |
2024-08-25 | 59028.53 | 64828.31 | 58905.22 | 64157.16 |
2024-08-29 | 64187.06 | 64893.52 | 57967.74 | 59015.30 |
2024-09-02 | 59042.05 | 61150.33 | 57257.71 | 57357.72 |
2024-09-06 | 57298.28 | 59812.90 | 55746.63 | 56132.40 |
2024-09-10 | 56163.71 | 57896.82 | 52690.15 | 57049.12 |
2024-09-14 | 57023.26 | 60654.99 | 55573.18 | 60620.63 |
2024-09-18 | 60545.37 | 61242.97 | 57558.77 | 60317.03 |
2024-09-22 | 60320.15 | 64082.40 | 59212.14 | 63403.40 |
2024-09-26 | 63406.35 | 64755.49 | 62479.16 | 63151.90 |
2024-09-30 | 63146.21 | 66439.49 | 62811.76 | 65663.69 |
2024-10-04 | 65602.97 | 65602.97 | 59953.67 | 60727.86 |
2024-10-08 | 60749.23 | 64499.65 | 60469.82 | 62287.39 |
2024-10-12 | 62211.23 | 63361.72 | 58935.06 | 62392.34 |
2024-10-16 | 62464.88 | 67803.50 | 62059.51 | 66962.22 |
2024-10-20 | 67103.49 | 68970.43 | 66738.86 | 68388.87 |
2024-10-24 | 68361.87 | 69431.86 | 65161.83 | 66683.97 |
2024-10-28 | 66634.17 | 68821.39 | 65563.64 | 67938.55 |
2024-11-01 | 67959.92 | 73562.33 | 67557.67 | 70265.00 |
2024-11-05 | 70179.74 | 71500.10 | 66853.85 | 67793.30 |
2024-11-09 | 67799.95 | 77239.75 | 67447.92 | 76550.02 |
2024-11-13 | 76513.26 | 89864.13 | 75868.49 | 88264.60 |
2024-11-17 | 87977.19 | 93477.11 | 86353.40 | 90606.45 |
2024-11-21 | 90596.74 | 94891.17 | 88774.43 | 94217.02 |
2024-11-25 | 94365.50 | 99645.39 | 94159.39 | 98015.94 |
2024-11-29 | 98025.78 | 98882.48 | 90752.43 | 95661.60 |
2024-12-03 | 95648.62 | 98684.89 | 94497.60 | 95833.14 |
2024-12-07 | 95861.96 | 103678.84 | 92980.47 | 99973.85 |
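A DatetimeIndex also supports partial string indexing: passing just a year or a year-month to .loc selects every row in that period, with no explicit start and end dates. A small sketch with invented data:

```python
import pandas as pd

# six dates, four days apart, straddling the October/November boundary
idx = pd.date_range('2024-10-28', periods=6, freq='4D')
close = pd.Series([67938.55, 70265.00, 67793.30, 76550.02, 88264.60, 90606.45],
                  index=idx)

# every row that falls in November 2024
november = close.loc['2024-11']
print(len(november))  # 5 of the 6 dates are in November
```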
The structure of the index also helps to select the data in special ways. For instance, to select the Friday opening prices, you can apply the method day_name() to the index:
df.index.day_name()
Index(['Saturday', 'Wednesday', 'Sunday', 'Thursday', 'Monday', 'Friday',
'Tuesday', 'Saturday', 'Wednesday', 'Sunday', 'Thursday', 'Monday',
'Friday', 'Tuesday', 'Saturday', 'Wednesday', 'Sunday', 'Thursday',
'Monday', 'Friday', 'Tuesday', 'Saturday', 'Wednesday', 'Sunday',
'Thursday', 'Monday', 'Friday', 'Tuesday', 'Saturday', 'Wednesday',
'Sunday', 'Thursday', 'Monday', 'Friday', 'Tuesday', 'Saturday',
'Wednesday', 'Sunday', 'Thursday', 'Monday', 'Friday', 'Tuesday',
'Saturday', 'Wednesday', 'Sunday', 'Thursday', 'Monday', 'Friday',
'Tuesday', 'Saturday', 'Wednesday', 'Sunday', 'Thursday', 'Monday',
'Friday', 'Tuesday', 'Saturday', 'Wednesday', 'Sunday', 'Thursday',
'Monday', 'Friday', 'Tuesday', 'Saturday', 'Wednesday', 'Sunday',
'Thursday', 'Monday', 'Friday', 'Tuesday', 'Saturday', 'Wednesday',
'Sunday', 'Thursday', 'Monday', 'Friday', 'Tuesday', 'Saturday',
'Wednesday', 'Sunday', 'Thursday', 'Monday', 'Friday', 'Tuesday',
'Saturday', 'Wednesday', 'Sunday', 'Thursday', 'Monday', 'Friday',
'Tuesday', 'Saturday'],
dtype='object', name='Date')
df['Open'][df.index.day_name() == 'Friday'].head()
Date
2023-12-29 42999.49
2024-01-26 41545.55
2024-02-23 52117.12
2024-03-22 68425.48
2024-04-19 65680.45
Name: Open, dtype: float64
Alternatively, you could use df.index.weekday == 4. To select the opening prices from a particular month:
df['Open'][df.index.month == 2].head()
Date
2024-02-03 43285.81
2024-02-07 43172.85
2024-02-11 43077.87
2024-02-15 47769.89
2024-02-19 51821.25
Name: Open, dtype: float64
Also, Pandas takes advantage of the index to format the horizontal axis in a better way, as shown in the following chart:
df['Open'].plot(figsize=(10,6), color='black', linewidth=1, ylabel='Bitcoin Price ($)');
# Fancy extra code for text annotation
import matplotlib.pyplot as plt

# Adding text annotation
plt.scatter(pd.Timestamp('2024-11-05'), df.loc['2024-11-05', 'Open'], color='r')
plt.gca().text(x=pd.Timestamp('2024-10-15'),
               y=df.loc['2024-11-05', 'Open'] + 2500,
               s="Election Day",
               fontsize=10, color='red', ha='center');
Calculating returns
Daily returns can be obtained with the Pandas method pct_change. Though the first observation gets a NaN value, because there is no return for the first day, the index of the series is preserved. Mind that pct_change uses the current order of the rows, not the date values of the index, so the data have to be sorted by date.
df_return = df['Close'].pct_change()
df_return.head()
Date
2023-12-09 NaN
2023-12-13 -0.062245
2023-12-17 0.019210
2023-12-21 0.032832
2023-12-25 -0.013731
Name: Close, dtype: float64
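Under the hood, pct_change is just division by the shifted series; seeing the two side by side can help demystify it (a sketch with a toy series):

```python
import pandas as pd

close = pd.Series([100.0, 104.0, 91.0])

via_pct = close.pct_change()
via_shift = close / close.shift(1) - 1  # the same computation spelled out

print(via_pct.equals(via_shift))  # True
```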
The daily returns can be plotted in the same way as the prices:
df_return.plot(figsize=(10,6), color='black', linewidth=1);
Weekly returns, based on the percentage change across Fridays, can be calculated after filtering out the rest of the days. Beware that this could be wrong for the weeks in which the Friday is not a trading day.
df_w_return = df[df.index.weekday == 4]['Close'].pct_change()
df_w_return.head()
Date
2023-12-29 NaN
2024-01-26 -0.062496
2024-02-23 0.284970
2024-03-22 0.277019
2024-04-19 -0.031653
Name: Close, dtype: float64
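An alternative that does not depend on every Friday being present is to resample to a Friday-anchored weekly frequency and take the last available price of each week. A sketch with a toy daily series (the values are invented):

```python
import pandas as pd

idx = pd.date_range('2024-01-01', periods=14, freq='D')  # Mon Jan 1 .. Sun Jan 14
close = pd.Series(range(100, 114), index=idx, dtype=float)

# last observation of each week ending on Friday, then week-over-week returns
weekly = close.resample('W-FRI').last()
weekly_returns = weekly.pct_change()
print(weekly.tolist())  # [104.0, 111.0, 113.0]
```

The last week has no Friday in the data, so resample falls back to the latest available observation instead of silently dropping the week.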
You try it
What is the median return (pct_change) on the different days of the week if you always buy/sell at the High price?
- Create a new column where you store the day name
- Create another column to store the percentage change
- Group by the day column, and calculate the relevant statistic for the percentage change.
- Plot the result using df.plot.bar()
...
Solution
df['DayOfWeek'] = df.index.day_name()
df['PctReturn'] = df.Close.pct_change()*100

df.groupby('DayOfWeek')['PctReturn'].median().plot.bar();
Prettier Solution
import matplotlib.pyplot as plt
# Work with % change instead (return from high to high)
df['DayOfWeek'] = df.index.day_name()
df['PctReturn'] = df.Close.pct_change()*100

weekday_change = df.groupby('DayOfWeek')['PctReturn'].agg(['median','std'])

# Nicer plot
weekday_change = weekday_change.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])

plt.bar(
    weekday_change.index,
    weekday_change['median'],
    yerr=weekday_change['std'],
    # error_kw=dict(ecolor='red', elinewidth=2)  # uncomment me.
)
plt.title('BTC\nMedian Percentage Return by Day of Week')
plt.ylabel("Median Percentage Change (%)");
Resampling
Aggregating time series data by a new time period is often called resampling. The method resample, designed for Pandas data frames with a DatetimeIndex, is driven by the index. In the following example, I apply this method to get the average weekly open price of Bitcoin.
df_w_open = df['Open'].resample('W').mean()
The same can be done to aggregate the data by month. Note that the series is indexed by Sundays (even when they are not included in the actual data); this is just how pandas does weekly resampling by default.
df_w_open.head(5)
Date
2023-12-10 43285.620
2023-12-17 42808.985
2023-12-24 42248.130
2023-12-31 43329.705
2024-01-07 43369.915
Freq: W-SUN, Name: Open, dtype: float64
We can get a line plot of the weekly prices with:
df_w_open.plot(figsize=(10,6), color='black', linewidth=1);
Rolling windows
The logic of rolling
is similar to that of resampling
, but it groups the rows around every value of the index (except at the extremes). The set of these groups is called a rolling window. This is easily seen in the following calculation, which extracts a moving average (MA) trend for the opening prices (moving average is a popular name for the rolling mean).
df['Open'].rolling(5).mean().head(10)
Date
2023-12-09 NaN
2023-12-13 NaN
2023-12-17 NaN
2023-12-21 NaN
2023-12-25 42962.328
2023-12-29 42905.102
2024-01-02 42591.698
2024-01-06 43129.474
2024-01-10 43512.834
2024-01-14 44009.056
Name: Open, dtype: float64
As a default, the terms included in the rolling window are collected leftwards. Here the length has been set to 5. So, the first 4 terms of the moving average are NaN
’s. But you may prefer to center the window in the current observation. In that case, you will get 2 NaN
’s at each extreme of the series. This is done with:
ma = df['Open'].rolling(5, center=True).mean()
ma.head(10)
Date
2023-12-09 NaN
2023-12-13 NaN
2023-12-17 42962.328
2023-12-21 42905.102
2023-12-25 42591.698
2023-12-29 43129.474
2024-01-02 43512.834
2024-01-06 44009.056
2024-01-10 43973.846
2024-01-14 44006.104
Name: Open, dtype: float64
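If losing the first observations to NaNs is undesirable, rolling also accepts a min_periods argument, which computes the statistic as soon as that many observations are available (a sketch with a toy series):

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

full = s.rolling(3).mean()                    # first two entries are NaN
partial = s.rolling(3, min_periods=1).mean()  # averages whatever is available

print(partial.tolist())  # [10.0, 15.0, 20.0, 30.0, 40.0]
```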
A moving average is frequently used to create a nonparametric trend, that is, a trend which is not given as a mathematical function of time, like a linear or quadratic trend. The chart below illustrates this. The dashed line corresponds to the actual prices, and the solid line to the MA trend.
ma.plot(figsize=(10,6), color='black', linewidth=1)
df['Open'].plot(color='gray', linestyle='--', linewidth=1);
This part is only for when we have enough time, though I highly recommend that you try it, as it shows how to combine datasets.
Microsoft vs. Apple Analysis
We could do a similar analysis for more traditional stocks like Apple and Microsoft. For this, we’d need to download their respective datasets from Yahoo Finance and load them in as usual. Yahoo has recently disabled this functionality on their website, but you can either crawl it yourself or use packages that help with that.
Let’s use yfinance (installed using !pip install yfinance). If it doesn’t work for you, you can download the time series from here: MSFT, AAPL, NVDA.
import yfinance as yf
# Define the date range
start_date = "1980-12-12"
end_date = "2024-12-09"

# Fetch the stock data
df_msft = yf.download("MSFT", start=start_date, end=end_date)
df_msft.columns = [col[0] for col in df_msft.columns]
df_msft.to_csv('./resources/MSFT.csv')

df_appl = yf.download("AAPL", start=start_date, end=end_date)
df_appl.columns = [col[0] for col in df_appl.columns]
df_appl.to_csv('./resources/AAPL.csv')
# Load it back in, parsing the index back into dates
df_msft = pd.read_csv('./resources/MSFT.csv', index_col='Date', parse_dates=True)
df_aapl = pd.read_csv('./resources/AAPL.csv', index_col='Date', parse_dates=True)
df_aapl.tail()
Adj Close | Close | High | Low | Open | Volume | |
---|---|---|---|---|---|---|
Date | ||||||
2024-12-02 | 239.589996 | 239.589996 | 240.789993 | 237.160004 | 237.270004 | 48137100 |
2024-12-03 | 242.649994 | 242.649994 | 242.759995 | 238.899994 | 239.809998 | 38861000 |
2024-12-04 | 243.009995 | 243.009995 | 244.110001 | 241.250000 | 242.869995 | 44383900 |
2024-12-05 | 243.039993 | 243.039993 | 244.539993 | 242.130005 | 243.990005 | 40033900 |
2024-12-06 | 242.839996 | 242.839996 | 244.630005 | 242.080002 | 242.910004 | 36852100 |
We have two close prices here; it’s common to use Adj Close for analysis, as it adjusts for splits, dividends, etc.
We have another problem though, these are separate datasets! How do we combine them into one dataframe?
Combining dataframes
In Pandas, there are several ways to combine two dataframes. If two datasets look the same w.r.t. their columns and we wish to paste them together vertically, we use the pd.concat([df1,df2]) function. A more frequently occurring scenario is that we need to join two dataframes horizontally based on a common column or index. The most common ways to merge dataframes in Pandas:
Join: df1.join(df2,how='inner') joins two dataframes by index. You can also specify a different column using the on keyword, but in such cases it’s probably better to use the merge command explained below. The how keyword specifies what happens in case a value is not shared in the common column:

- inner means that a row with an index that appears only in df1 OR only in df2 will be dropped from the resulting dataframe
- outer means that all rows are kept; this will create NaN values in rows where the index value appears in only one of df1 and df2
- left returns all the rows from the left dataframe and only the matching rows from the right dataframe
- right returns all the rows from the right dataframe and only the matching rows from the left dataframe
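The effect of how is easiest to see on two tiny frames with partially overlapping indexes (the frames and column names here are invented):

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])
right = pd.DataFrame({'b': [3, 4]}, index=['y', 'z'])

inner = left.join(right, how='inner')  # only the shared index 'y' survives
outer = left.join(right, how='outer')  # all of 'x', 'y', 'z', with NaNs filled in

print(list(inner.index), sorted(outer.index))
```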
Merge: pd.merge(df1,df2,on='common_column',how='inner') merges the two dataframes on their common columns (by default, any column that appears in both dataframes). The how parameter specifies what happens in case a value is not shared in the common column, as explained above.
Note: merging frames is a classic way to crash your kernel, as you can potentially use massive amounts of memory. This shouldn’t be a problem for this particular dataset, though.
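A minimal sketch of pd.merge on a shared column (the frames and values are invented): with how='inner', keys that appear in only one frame are dropped.

```python
import pandas as pd

df1 = pd.DataFrame({'ticker': ['MSFT', 'AAPL'], 'open': [430.0, 190.0]})
df2 = pd.DataFrame({'ticker': ['AAPL', 'NVDA'], 'close': [191.0, 120.0]})

# only the key shared by both frames ('AAPL') survives an inner merge
merged = pd.merge(df1, df2, on='ticker', how='inner')
print(merged['ticker'].tolist())  # ['AAPL']
```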
Exercises
You try it:
- Create two new DataFrames (one for MSFT and one for AAPL) by:
  - Extracting one column with the Open price; take care that you extract a dataframe, not a series!
  - Renaming that column so that it’s called MSFT and AAPL respectively, using .columns
- Use the join command to merge the Apple and Microsoft dataframes.
df_msft_open = ...  # MSFT dataframe with index Date and column MSFT
df_appl_open = ...  # AAPL dataframe with index Date and column AAPL

df_combined = ...  # join the two dataframes
# select only data after the year 2000
Solution
df_msft_open = df_msft[['Open']]
df_appl_open = df_aapl[['Open']]

df_msft_open.columns = ['MSFT']
df_appl_open.columns = ['AAPL']

df_combined = df_msft_open.join(df_appl_open, how='inner')

df_combined.head()
MSFT | AAPL | |
---|---|---|
Date | ||
1986-03-13 | 0.088542 | 0.110491 |
1986-03-14 | 0.097222 | 0.110491 |
1986-03-17 | 0.100694 | 0.116071 |
1986-03-18 | 0.102431 | 0.116071 |
1986-03-19 | 0.099826 | 0.119978 |
You try it:
- Select only the data from 2000 onward and plot the price.
- Use plot() to plot the estimated returns since 2000.
...
Naive Solution
df_combined = df_combined['2000-01-01':]

# Attempt 1: it shows the price evolution but not the return
df_combined.plot(figsize=(10,4), color=['black','red']);
Pct Change Solution
# Calculate daily percentage change
daily_returns = df_combined.pct_change()

# Convert to cumulative returns
cumulative_returns = (1 + daily_returns).cumprod() - 1

# Plot the cumulative returns
cumulative_returns.plot(figsize=(10,4), color=['black','red']);
Easier Solution
# Here we divide by the starting price to get a 'fairer' comparison
df_combined = df_combined['2000-01-01':]

(df_combined / df_combined.iloc[0,:]).plot(figsize=(10,4), color=['black','red'], title='Gross Returns - Buy and Hold Strategy');
Exercises
You try it:
What would have happened had you started at the beginning of the time series around March 1986?
Try adding NVIDIA in the mix.
...
# NVIDIA
import yfinance as yf

if False:  # set to True to re-download the data
    # Define the date range
    start_date = "1980-12-12"
    end_date = "2024-12-09"

    # Fetch the stock data
    df_nvda = yf.download("NVDA", start=start_date, end=end_date)
    df_nvda.columns = [col[0] for col in df_nvda.columns]
    df_nvda.to_csv('./resources/NVDA.csv')

df_nvda = pd.read_csv('./resources/NVDA.csv', index_col='Date', parse_dates=True)

df_nvda_open = df_nvda[['Open']]
df_msft_open = df_msft[['Open']]
df_appl_open = df_aapl[['Open']]

df_nvda_open.columns = ['NVDA']
df_msft_open.columns = ['MSFT']
df_appl_open.columns = ['AAPL']

df_combined = df_msft_open.join(df_appl_open, how='inner').join(df_nvda_open, how='inner')

# Plot
(df_combined / df_combined.iloc[0,:]).plot(figsize=(10,4), color=['black','red','green'], title='Gross Returns - Buy and Hold Strategy');
References
Original AAPL/MSFT content adapted from: Miguel Ángel Canela