Time series Analysis 101 - Part 1

data science

Compiled by endeesa. Last update 23/April/2022

1. Pandas refresher

The following notes revisit a few pandas concepts that are important for time series analysis in Python. Suppose you have created a pandas dataframe like the one shown in figure 1, and that the dataframe is named df.


These are some common operations that you might want to perform:

i. Convert index values of type string into datetime objects using pd.to_datetime()

df.index = pd.to_datetime(df.index)
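
As a minimal sketch of the conversion above, the snippet below builds a small dataframe whose index holds dates as strings and then converts it. The column name and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the index holds dates stored as plain strings
df = pd.DataFrame(
    {"price": [10.0, 10.5, 11.2]},
    index=["2022-04-01", "2022-04-02", "2022-04-03"],
)

print(type(df.index))  # a generic Index holding strings

# Parse the strings into timestamps; the index becomes a DatetimeIndex
df.index = pd.to_datetime(df.index)
print(type(df.index))
```

Once the index is a DatetimeIndex, the date-based slicing and resampling described in the following steps become available.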

ii. Plot a time series on a line graph

# Produces matplotlib line plot(s) using all the columns
df.plot()

iii. Index slicing

  • Recall that slicing is used to filter a subset of the data based on the position
  • Similarly you can slice pandas datetime indexes to filter data based on years, months, days etc.
# Pandas datetime indexing examples

# Filter by year 2022 (.loc selects rows; df['2022'] would look for a column)
timeSeries2022 = df.loc['2022']

# Filter values from April 2022
timeSeries2022April = df.loc['2022-04']
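
The following sketch shows date-based slicing on a made-up daily series; the dataframe and its "value" column are assumptions for illustration only. It uses .loc with partial date strings and also shows an explicit date range, which includes both endpoints.

```python
import pandas as pd

# Hypothetical daily series spanning 2021 and 2022
idx = pd.date_range("2021-01-01", "2022-12-31", freq="D")
df = pd.DataFrame({"value": range(len(idx))}, index=idx)

year_2022 = df.loc["2022"]                       # all rows in 2022
april_2022 = df.loc["2022-04"]                   # all rows in April 2022
spring_2022 = df.loc["2022-03-01":"2022-05-31"]  # both endpoints included

print(len(year_2022), len(april_2022), len(spring_2022))
```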

iv. Frequency conversion

  • Sometimes we may wish to downsample or upsample readings into monthly, quarterly or yearly frequency
  • This functionality is available through the built-in dataframe method df.resample()
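# Example: Convert daily readings to MONTHLY readings using the median
# ('M' means month-end; pandas 2.2 and later spell this alias 'ME')
df_monthly = df.resample('M').median()
  • Other popular frequencies: Q-quarter, D-day, W-week, A-year, T-minute etc.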
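
Here is a small sketch of both directions of frequency conversion on made-up daily readings. It uses the "MS" (month start) alias rather than the month-end alias, only because "MS" is spelled the same way across pandas versions; forward-filling is just one possible choice when upsampling.

```python
import pandas as pd

# Hypothetical daily readings for the first quarter of 2022
idx = pd.date_range("2022-01-01", "2022-03-31", freq="D")
df = pd.DataFrame({"reading": range(len(idx))}, index=idx)

# Downsample: one median value per month, labelled by the month start
monthly = df.resample("MS").median()

# Upsample: back to daily rows, repeating each monthly value forward
daily_again = monthly.resample("D").ffill()

print(monthly)
print(len(daily_again))
```

Note that upsampling cannot recover the original daily detail; it only fills the new rows according to the rule you pick (forward fill here).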

v. Merge multiple dataframes

# Assume we have another dataframe df2 similar to df
# We can merge the columns of these 2 dataframes as follows

df.join(other=df2, how='inner', on=None)

Note that if we don't specify a value for the on argument, the two dataframes are matched on their indexes. Read the docs for more info.

  • If instead we wanted to merge the rows, we would use pd.concat([df, df2])
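
To make the difference between the two kinds of merge concrete, here is a minimal sketch on tiny made-up dataframes (the column names "a" and "b" and the dataframe called later are hypothetical).

```python
import pandas as pd

idx = pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"])
df = pd.DataFrame({"a": [1, 2, 3]}, index=idx)
df2 = pd.DataFrame({"b": [10, 20]}, index=idx[:2])

# Column-wise merge on the index; 'inner' keeps only dates present in both
joined = df.join(df2, how="inner")

# Row-wise merge: stack a later chunk with the same column underneath df
later = pd.DataFrame({"a": [4]}, index=pd.to_datetime(["2022-01-04"]))
stacked = pd.concat([df, later])

print(joined)
print(stacked)
```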

vi. Calculating correlation and autocorrelation

  • Correlation is a simple measure that tells us whether the values between two columns vary together or not
# Assume you have a dataframe named 'stocks' with stock prices for microsoft and google
# The columns are named 'MSFT' and 'GOOGL' respectively

correlation = stocks['MSFT'].corr(stocks['GOOGL'])

Typically, when dealing with time series data, we do not calculate the correlation on the actual prices but on the percentage changes instead. Use the pct_change() function to convert the values before computing the correlation.
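
A minimal sketch of that workflow, using invented prices in place of real market data:

```python
import pandas as pd

# Hypothetical prices; real data would come from a file or an API
stocks = pd.DataFrame(
    {
        "MSFT": [100.0, 102.0, 101.0, 105.0, 107.0],
        "GOOGL": [200.0, 203.0, 202.0, 209.0, 212.0],
    },
    index=pd.date_range("2022-04-01", periods=5, freq="D"),
)

# Percentage change from one row to the next; the first row has no
# predecessor and becomes NaN, so it is dropped
returns = stocks.pct_change().dropna()

correlation = returns["MSFT"].corr(returns["GOOGL"])
print(correlation)
```

Working with returns instead of prices avoids the situation where two series look highly correlated only because both trend upward over time.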

  • If we are interested in the correlation of a time series with a delayed version of itself, we can calculate the autocorrelation as follows:

# First convert the actual prices to returns
msft_returns = stocks['MSFT'].pct_change()

# Then compute the autocorrelation (lag 1 by default)
autocorrelation = msft_returns.autocorr()
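
The sketch below, on a made-up price series, shows how the lag can be chosen and that the autocorrelation is simply the correlation between the series and a shifted copy of itself.

```python
import pandas as pd

# Hypothetical prices for illustration
prices = pd.Series(
    [100.0, 102.0, 101.0, 105.0, 107.0, 106.0, 110.0],
    index=pd.date_range("2022-04-01", periods=7, freq="D"),
)
returns = prices.pct_change().dropna()

lag1 = returns.autocorr()       # the default lag is 1
lag2 = returns.autocorr(lag=2)  # correlation with the series two steps back

# Same quantity computed by hand: correlate with a shifted copy
manual_lag1 = returns.corr(returns.shift(1))

print(lag1, lag2, manual_lag1)
```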

Knowledge check

  • Try putting the concepts covered above into practice with the following short exercise

    • Download the oil prices dataset from here
    • Read the data into a pandas dataframe
    • Set the index of the dataframe to be the date column
    • Plot the oil prices from 2000 to 2020
    • Create a new dataframe with 2019 data only and change the frequency to quarters
    • Calculate the lag-2 autocorrelation of the oil prices in 2019
    • Plot the autocorrelation function of the oil prices for 2020 (optional)
  • Once completed, move on to Part 2 of this series, where we will cover EDA (exploratory data analysis) methods applicable to time series data