Demystifying Time Series Outliers: 2/4
Unraveling Outliers in Soccer’s Social Media Time Series

After distributing coffee to everyone, Morelli, Zappa and I revisit what happened yesterday:

Rovella and the Rebel Data

We began with the #rovella-related tweets, a time series densely packed with outliers, and pinpointed them in a very straightforward manner, using just two basic pieces of information: the mean and standard deviation.

    import pandas as pd
    import numpy as np

    link = ''
    tweets = pd.read_csv(link, sep=';', decimal=',', index_col='date', parse_dates=['date'])
    tweets_series = tweets['target']
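To make the mean-and-standard-deviation idea concrete, here is a minimal sketch of a plain z-score check. It is not the function we used on the #rovella series: the name `zscore_outliers`, the synthetic Poisson data, and the two injected spikes are all assumptions made for illustration, and the default threshold of 3 simply mirrors the usual rule of thumb.

```python
import numpy as np
import pandas as pd


def zscore_outliers(series: pd.Series, thres: float = 3.0) -> pd.Series:
    """Return a boolean mask that is True where |z-score| exceeds thres."""
    mean = series.mean()
    std = series.std(ddof=0)  # population std, same default as np.std
    if std == 0:
        # A constant series has no outliers by this definition
        return pd.Series(False, index=series.index)
    z = (series - mean) / std
    return z.abs() > thres


# Synthetic stand-in for daily tweet counts (hypothetical data, not the real series)
rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-01", periods=200, freq="D")
values = rng.poisson(lam=20, size=200).astype(float)
values[[50, 120]] = [400.0, 350.0]  # two artificial spikes
demo = pd.Series(values, index=idx, name="target")

mask = zscore_outliers(demo)
print(demo[mask])  # the dates and values flagged as outliers
```

Note that this global version computes the mean and standard deviation over the whole series at once, so large spikes inflate the very statistics used to detect them. Working only on the data seen so far, as the function further down does, is one way to soften that effect.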

Then we began to mercilessly cut them off, as if with a chainsaw.

Cutting-points work: 3 2 1… go! (Image by author)

    # function DEFINITION
    def detect_outliers_zscore(ts, thres=3, points_not_to_touch=60, max_window=40, outliers_param=0.9):
        '''
        param ts : Time series containing datetime index
        param thres : Threshold greater than 3 for making the outliers detection more strict
        param points_not_to_touch : Points you do not manipulate at the beginning of the series
        param max_window : Window considered for computing the local max
        param outliers_param : [0, 1] lower if I want to follow the outliers
        '''
        ts_reworked = ts.copy(deep=True)
        outliers = []
        dates = []
        for i, d in zip(ts, ts.index):
            ts_so_far = ts[ts.index <= d]
            ts_so_far = ts_so_far.iloc[points_not_to_touch:]
            ts_so_far = ts_so_far[~ts_so_far.index.isin(dates)]
            length_so_far = ts_so_far.shape[0]
            mean = np.mean(ts_so_far)
            std = np.std(ts_so_far)
            max_so_far = np.max(ts_so_far.iloc[:-max_window])
            surplus = (outliers_param * (i - max_so_far))
            max_so_far_augmented =…

Read the full blog for free on Medium.

