Time Series Analysis 101 - Part 1
Tags: data science
Author: Ndamulelo Nemakhavhani (@ndamulelonemakh)
Compiled by endeesa. Last update 23/April/2022
1. Pandas refresher
The following notes revisit a few pandas concepts that are important for time series analysis in Python. Suppose you created a pandas dataframe like the one shown in figure 1, and that the dataframe is named df.
Date | Sales |
---|---|
22/04/2022 | 70 |
23/04/2022 | 50 |
24/04/2022 | 77 |
24/04/2022 | 90 |
These are some common operations that you might want to perform:
i. Convert index values of type string into datetime objects using pd.to_datetime()
df.index = pd.to_datetime(df.index)
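As a minimal sketch of this step, the snippet below rebuilds the example table and converts its string index. The explicit `format` argument is an assumption based on the dates in the table being written day-first (DD/MM/YYYY); without it, pandas has to guess the layout.

```python
import pandas as pd

# Rebuild the example table; the dates start out as plain strings
df = pd.DataFrame(
    {"Sales": [70, 50, 77, 90]},
    index=pd.Index(
        ["22/04/2022", "23/04/2022", "24/04/2022", "24/04/2022"], name="Date"
    ),
)

# The strings are day-first, so state the layout explicitly
df.index = pd.to_datetime(df.index, format="%d/%m/%Y")

print(type(df.index).__name__)  # the index is now a DatetimeIndex
print(df.index[0])              # first entry as a Timestamp
```

Once the index holds datetimes rather than strings, the slicing and resampling operations described in these notes become available.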
ii. Plot a time series on a line graph
# Produces a matplotlib line plot with one line per column
df.plot(grid=True)
iii. Index slicing
- Recall that slicing is used to select a subset of the data based on position
- Similarly, you can slice a pandas datetime index to filter data by year, month, day etc.
# Pandas datetime indexing examples (assumes df has a datetime index)
# Filter by year 2022
time_series_2022 = df.loc['2022']
# Filter values from April 2022
time_series_2022_april = df.loc['2022-04']
- For more examples, visit: pandas datetime indexing tutorial
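To make the partial-string slicing concrete, here is a small self-contained sketch. The dataframe `sales` and its values are made up for illustration; only the `.loc` lookups matter.

```python
import pandas as pd

# Hypothetical daily series that spans the end of 2021 and part of 2022
idx = pd.date_range("2021-12-30", "2022-05-02", freq="D")
sales = pd.DataFrame({"Sales": range(len(idx))}, index=idx)

year_2022 = sales.loc["2022"]      # every row dated in 2022
april_2022 = sales.loc["2022-04"]  # every row dated in April 2022

# Slicing between two dates includes both endpoints
mid_april = sales.loc["2022-04-10":"2022-04-15"]

print(len(year_2022), len(april_2022), len(mid_april))
```

Unlike positional slicing, a label-based date slice includes the end date, which is why `mid_april` contains six rows rather than five.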
iv. Frequency conversion
- Sometimes we may wish to downsample or upsample readings to a monthly, quarterly or yearly frequency
- This is available through the built-in dataframe method df.resample(), which requires a datetime index
# Example: Convert daily readings to MONTHLY readings using the median
df.resample('M').median()
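- Other popular frequencies: Q-quarter, D-day, W-week, A-year, T-minute etc. Note that recent pandas releases (2.2 onwards) rename some of these aliases, for example ME for month end, QE for quarter end, YE for year end and min for minute.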
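The sketch below shows both directions on made-up data: downsampling daily readings to monthly medians, then upsampling the monthly values back to daily by forward-filling. It uses the `MS` (month start) alias as an assumption of convenience, because that alias is spelled the same way across pandas versions; it labels each month by its first day rather than its last.

```python
import pandas as pd

# Hypothetical daily readings for January to March 2022 (values 0..89)
idx = pd.date_range("2022-01-01", "2022-03-31", freq="D")
daily = pd.DataFrame({"Sales": range(len(idx))}, index=idx)

# Downsample: one row per month, aggregated with the median
monthly = daily.resample("MS").median()

# Upsample: back to daily rows, repeating the last known monthly value
upsampled = monthly.resample("D").ffill()

print(monthly)
print(len(upsampled))
```

Downsampling always needs an aggregation (median, mean, sum, last and so on), while upsampling needs a fill rule because the new rows have no observed values.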
v. Merge multiple dataframes
# Assume we have another dataframe df2 similar to df
# We can merge the columns of these 2 dataframes as follows
# (if both share a column name, also pass lsuffix/rsuffix to tell them apart)
df.join(other=df2, how='inner', on=None)
Note that if we don't specify a value for the on argument, the two dataframes will be matched by their index. Read the docs for more info.
- If instead we wanted to stack the rows of the two dataframes, we would use pd.concat([df, df2])
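A short sketch of the difference between the two operations, using small made-up dataframes. The `Returns` column and the `more_rows` dataframe are hypothetical and exist only to make the example runnable.

```python
import pandas as pd

idx = pd.to_datetime(["2022-04-22", "2022-04-23", "2022-04-24"])
df = pd.DataFrame({"Sales": [70, 50, 77]}, index=idx)

# Hypothetical second dataframe covering only the last two dates
df2 = pd.DataFrame({"Returns": [5, 3]}, index=idx[1:])

# Column-wise: match on the index and keep dates present in both
joined = df.join(df2, how="inner")

# Row-wise: stack a later batch of readings under the existing rows
more_rows = pd.DataFrame({"Sales": [90]}, index=pd.to_datetime(["2022-04-25"]))
stacked = pd.concat([df, more_rows])

print(joined)
print(stacked)
```

With `how="inner"` the first date is dropped because it has no match in `df2`; `how="left"` would keep it and fill `Returns` with NaN.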
vi. Calculating correlation and autocorrelation
- Correlation is a simple measure that tells us whether the values between two columns vary together or not
# Assume you have a dataframe named 'stocks' with stock prices for microsoft and google
# The columns are named 'MSFT' and 'GOOGL' respectively
correlation = stocks['MSFT'].corr(stocks['GOOGL'])
Typically, when dealing with time series data, we do not calculate the correlation on the actual prices but on the percentage changes instead. Use the pct_change() function to convert the values before computing the correlation.
- If we are interested in the correlation of a time series with a delayed (lagged) version of itself, we can calculate the autocorrelation as follows:
# First convert the actual prices to returns
msft_returns = stocks['MSFT'].pct_change()
# Then compute the autocorrelation
msft_returns.autocorr()
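The following sketch puts these pieces together on a tiny made-up price table (the numbers are invented, not real quotes). It also shows the `lag` argument of `autocorr()`, which defaults to 1, and checks the result against correlating the series with a shifted copy of itself.

```python
import pandas as pd

# Made-up prices purely for illustration
stocks = pd.DataFrame(
    {
        "MSFT": [100.0, 102.0, 101.0, 105.0, 107.0, 106.0],
        "GOOGL": [200.0, 203.0, 202.0, 209.0, 212.0, 211.0],
    },
    index=pd.date_range("2022-04-18", periods=6, freq="D"),
)

# Convert prices to returns; the first row has no previous value, so drop it
returns = stocks.pct_change().dropna()

# Correlation between the two return series
correlation = returns["MSFT"].corr(returns["GOOGL"])

# Autocorrelation at lag 1 (the default) and at lag 2
lag1 = returns["MSFT"].autocorr()
lag2 = returns["MSFT"].autocorr(lag=2)

# The same lag-2 value computed by hand with a shifted copy
manual_lag2 = returns["MSFT"].corr(returns["MSFT"].shift(2))

print(correlation, lag1, lag2, manual_lag2)
```

With so few observations the numbers themselves are not meaningful; the point is only the sequence of calls.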
Knowledge check
Try putting the concepts covered above into practice with the following short exercise:
- Download the oil prices dataset from here
- Read the data into a pandas dataframe
- Set the index of the dataframe to be the date column
- Plot the oil prices from 2000 to 2020
- Create a new dataframe with 2019 data only and change the frequency to quarters
- Calculate the lag 2 autocorrelation of the oil prices in 2019
- Plot the autocorrelation function of the oil prices for 2020 (optional)
Once completed, move on to Part 2 of this series, where we will cover EDA (exploratory data analysis) methods applicable to time series data.