Time Series Analysis 101 - Part 1

Compiled by endeesa. Last update 23/April/2022

1. Pandas refresher

The following notes revisit a few pandas concepts that are important for time series analysis in Python. Suppose you created a pandas dataframe like the one shown in figure 1 below, and that the dataframe is named df. The code snippets assume pandas has been imported as pd.

	Sales
Date
22/04/2022	70
23/04/2022	50
24/04/2022	77
24/04/2022	90

These are some common operations that you might want to perform:

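i. Convert index values of type string into datetime objects using pd.to_datetime()

# The dates above are written day-first (dd/mm/yyyy), so state the format explicitly
df.index = pd.to_datetime(df.index, format='%d/%m/%Y')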
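As a minimal sketch, the table above can be rebuilt by hand and its string index converted. The values are copied from the table; reading the dates as day-first is an assumption about how they were stored.

```python
import pandas as pd

# Rebuild the example table; the index starts out as plain strings
df = pd.DataFrame(
    {"Sales": [70, 50, 77, 90]},
    index=pd.Index(
        ["22/04/2022", "23/04/2022", "24/04/2022", "24/04/2022"], name="Date"
    ),
)

# Assumption: the strings are day/month/year, so the format is given explicitly
df.index = pd.to_datetime(df.index, format="%d/%m/%Y")

# The index is now a DatetimeIndex rather than a collection of strings
print(type(df.index).__name__)
```

Passing the format avoids relying on pandas to guess whether 04 is the day or the month.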

ii. Plot a time series on a line graph

# Produces a matplotlib line plot with one line per column
df.plot(grid=True)

iii. Index slicing

Recall that slicing selects a subset of the data based on position.
Similarly, you can slice a pandas datetime index to filter data by year, month, day, etc.

# Pandas datetime indexing examples

# Filter by year 2022 (use .loc; recent pandas versions treat df['2022'] as a column lookup)
time_series_2022 = df.loc['2022']

# Filter values from April 2022
time_series_2022_april = df.loc['2022-04']

For more examples, visit: pandas datetime indexing tutorial
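The sketch below shows the same idea on a made-up daily series, including a slice between two dates. The dataframe name `sales` and its values are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical daily data running from late 2021 into April 2022
idx = pd.date_range("2021-12-30", periods=120, freq="D")
sales = pd.DataFrame({"Sales": range(120)}, index=idx)

# Partial string indexing: every row that falls in 2022
year_2022 = sales.loc["2022"]

# Every row that falls in April 2022
april_2022 = sales.loc["2022-04"]

# A date range; with label-based slicing both endpoints are included
first_week_feb = sales.loc["2022-02-01":"2022-02-07"]

print(len(year_2022), len(april_2022), len(first_week_feb))
```

Unlike positional slicing, the end label of a `.loc` slice is part of the result.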

iv. Frequency conversion

Sometimes we may wish to downsample or upsample readings to a different frequency, such as monthly, quarterly or yearly.
This is done with the built-in pandas method df.resample()

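# Example: Convert daily readings to MONTHLY readings using the median
# (pandas 2.2 and later prefer the alias 'ME', month end, over 'M')
df.resample('M').median()

Other popular frequencies: Q-quarter, D-day, W-week, A-year, T-minute etc. Note that pandas 2.2 renamed several of these aliases (for example 'QE', 'YE' and 'min'), and the older spellings now raise a deprecation warning.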
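Here is a small sketch of both directions, using made-up daily values. It uses the weekly ('W') and daily ('D') aliases, and forward-filling is just one possible choice for the gaps that upsampling creates.

```python
import pandas as pd

# Hypothetical daily readings for two full weeks (Monday 4 April to Sunday 17 April 2022)
idx = pd.date_range("2022-04-04", periods=14, freq="D")
daily = pd.DataFrame({"Sales": range(1, 15)}, index=idx)

# Downsample: one row per week, holding the median of that week's readings
weekly = daily.resample("W").median()

# Upsample: back to one row per day, repeating the last known weekly value
daily_again = weekly.resample("D").ffill()

print(weekly)
print(daily_again)
```

Downsampling needs an aggregation such as median() or mean(), while upsampling needs a fill rule such as ffill() or interpolate().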

v. Merge multiple dataframes

# Assume we have another dataframe df2 similar to df
# We can merge the columns of these 2 dataframes as follows

df.join(other=df2, how='inner', on=None)

Note that if we don't specify a value for the on argument, the two dataframes are matched on their indexes. See the pandas documentation for more information.

If, instead, we wanted to combine the rows, we would use pd.concat(), which is a top-level pandas function rather than a dataframe method.
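The difference between the two can be sketched with small made-up dataframes. The column name `Returns` and the frame `more_rows` are hypothetical.

```python
import pandas as pd

dates = pd.to_datetime(["2022-04-22", "2022-04-23", "2022-04-24"])
df = pd.DataFrame({"Sales": [70, 50, 77]}, index=dates)

# df2 only covers the first two dates
df2 = pd.DataFrame({"Returns": [5, 3]}, index=dates[:2])

# Column-wise merge on the index; 'inner' keeps only dates present in both
joined = df.join(df2, how="inner")

# Row-wise merge: append rows that share the same columns
more_rows = pd.DataFrame({"Sales": [90]}, index=pd.to_datetime(["2022-04-25"]))
stacked = pd.concat([df, more_rows])

print(joined)
print(stacked)
```

With how='inner', the date that is missing from df2 is dropped; how='left' would keep it and fill the missing column with NaN.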

vi. Calculating correlation and autocorrelation

Correlation is a simple measure that tells us whether the values in two columns tend to vary together.

# Assume you have a dataframe named 'stocks' with stock prices for Microsoft and Google
# The columns are named 'MSFT' and 'GOOGL' respectively

correlation = stocks['MSFT'].corr(stocks['GOOGL'])

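Typically, when dealing with time series data, we do not calculate the correlation on the actual prices but on the percentage changes instead. Two series that both trend upwards can look strongly correlated even when their day-to-day movements are unrelated. Use the pct_change() method to convert the values before computing the correlation.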
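A minimal sketch of that workflow follows. The prices are invented purely for illustration and are not real market data.

```python
import pandas as pd

# Made-up closing prices, for illustration only
stocks = pd.DataFrame(
    {
        "MSFT": [100.0, 101.0, 103.0, 102.0, 105.0, 107.0],
        "GOOGL": [200.0, 203.0, 202.0, 206.0, 209.0, 208.0],
    }
)

# Percentage change from one row to the next; the first row has no
# previous value, so it becomes NaN and is dropped
returns = stocks.pct_change().dropna()

# Correlation of the raw prices versus correlation of the returns
price_corr = stocks["MSFT"].corr(stocks["GOOGL"])
return_corr = returns["MSFT"].corr(returns["GOOGL"])

print(price_corr, return_corr)
```

Comparing the two numbers on your own data shows how much of the price correlation comes from a shared trend.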

If we are interested in the correlation of a time series with a delayed (lagged) version of itself, we can calculate the autocorrelation as follows:


# First convert the actual prices to returns
msft_returns = stocks['MSFT'].pct_change()

# Then compute the autocorrelation (lag 1 by default)
msft_returns.autocorr()
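The lag can be changed through the lag argument of autocorr(). The sketch below, on invented prices, also shows that the result matches correlating the series with a shifted copy of itself.

```python
import pandas as pd

# Made-up prices, for illustration only
prices = pd.Series([100.0, 102.0, 101.0, 105.0, 104.0, 108.0, 107.0, 111.0])
returns = prices.pct_change().dropna()

# Default: correlation with the series delayed by one step
lag1 = returns.autocorr()

# Correlation with the series delayed by two steps
lag2 = returns.autocorr(lag=2)

# The same quantity computed by hand with shift()
manual_lag2 = returns.corr(returns.shift(2))

print(lag1, lag2, manual_lag2)
```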

Knowledge check

Try putting the concepts covered above into practice with the following short exercise:
- Download the oil prices dataset from here
- Read the data into a pandas dataframe
- Set the index of the dataframe to be the date column
- Plot the oil prices from 2000 to 2020
- Create a new dataframe with 2019 data only and change the frequency to quarters
- Calculate lag 2 autocorrelation of the oil prices in 2019
- Plot the autocorrelation function of the oil prices for 2020 (optional)
Once completed, move on to Part 2 of this series, where we will cover EDA (exploratory data analysis) methods applicable to time series data.