Timeseries Analytics

Examples:

In [1]: from wax_toolbox.tsanalytics import analyse_datetimeindex

In [2]: idx_gap  # let's look at that pd.DatetimeIndex
Out[2]: 
DatetimeIndex(['2016-03-01 02:00:00+01:00', '2016-03-01 03:00:00+01:00',
               '2016-03-01 04:00:00+01:00', '2016-03-01 07:00:00+01:00',
               '2016-03-01 08:00:00+01:00', '2016-03-01 09:00:00+01:00',
               '2016-03-01 14:00:00+01:00', '2016-03-01 15:00:00+01:00',
               '2016-03-01 16:00:00+01:00', '2016-03-01 17:00:00+01:00',
               ...
               '2016-03-30 15:00:00+02:00', '2016-03-30 16:00:00+02:00',
               '2016-03-30 17:00:00+02:00', '2016-03-30 18:00:00+02:00',
               '2016-03-30 19:00:00+02:00', '2016-03-30 20:00:00+02:00',
               '2016-03-30 21:00:00+02:00', '2016-03-30 22:00:00+02:00',
               '2016-03-30 23:00:00+02:00', '2016-03-31 00:00:00+02:00'],
              dtype='datetime64[ns, CET]', length=712, freq=None)

In [3]: tsinfo = analyse_datetimeindex(idx_gap, start=start, end=end)

In [4]: tsinfo
Out[4]: 
freq: <Hour>
sorted: True
continuous: [(Timestamp('2016-03-01 02:00:00+0100', tz='CET'), Timestamp('2016-03-01 04:00:00+0100', tz='CET')), (Timestamp('2016-03-01 07:00:00+0100', tz='CET'), Timestamp('2016-03-01 09:00:00+0100', tz='CET')), (Timestamp('2016-03-01 14:00:00+0100', tz='CET'), Timestamp('2016-03-31 00:00:00+0200', tz='CET'))]
gaps: [(Timestamp('2016-03-01 00:00:00+0100', tz='CET'), Timestamp('2016-03-01 01:00:00+0100', tz='CET')), (Timestamp('2016-03-01 05:00:00+0100', tz='CET'), Timestamp('2016-03-01 06:00:00+0100', tz='CET')), (Timestamp('2016-03-01 10:00:00+0100', tz='CET'), Timestamp('2016-03-01 13:00:00+0100', tz='CET'))]
duplicates: []

In [5]: print('This timeserie got a (minimal) frequency of {}'.format(tsinfo.freq))
This timeserie got a (minimal) frequency of <Hour>

In [6]: print('This timeserie is sorted ? {}'.format(tsinfo.sorted))
This timeserie is sorted ? True

# continuous parts:
In [7]: tsinfo.continuous
Out[7]: 
[(Timestamp('2016-03-01 02:00:00+0100', tz='CET'),
  Timestamp('2016-03-01 04:00:00+0100', tz='CET')),
 (Timestamp('2016-03-01 07:00:00+0100', tz='CET'),
  Timestamp('2016-03-01 09:00:00+0100', tz='CET')),
 (Timestamp('2016-03-01 14:00:00+0100', tz='CET'),
  Timestamp('2016-03-31 00:00:00+0200', tz='CET'))]

# gaps parts:
In [8]: tsinfo.gaps
Out[8]: 
[(Timestamp('2016-03-01 00:00:00+0100', tz='CET'),
  Timestamp('2016-03-01 01:00:00+0100', tz='CET')),
 (Timestamp('2016-03-01 05:00:00+0100', tz='CET'),
  Timestamp('2016-03-01 06:00:00+0100', tz='CET')),
 (Timestamp('2016-03-01 10:00:00+0100', tz='CET'),
  Timestamp('2016-03-01 13:00:00+0100', tz='CET'))]

# duplicates if any:
In [9]: tsinfo.duplicates
Out[9]: []

A module with timeseries analysis tools.

class wax_toolbox.tsanalytics.TSAnalytics(freq, sorted, continuous, gaps, duplicates)

Wrapper for timeseries analysis results.

Parameters:

freq (str) – frequency.
sorted (bool) – whether the time index is sorted.
continuous (list of datetime tuples) – continuous segments.
gaps (list of datetime tuples) – gap segments.
duplicates (list of datetime) – duplicated index values.

wax_toolbox.tsanalytics.analyse_datetimeindex(idx, start=None, end=None, freq=None)

Check that the given index is a timezone-aware DatetimeIndex, then return the implied frequency, a sorted flag, the list of continuous segments, the list of gap segments and the list of duplicated indices.

Continuous and gap segments are expressed as [start:end], inclusive on both sides. If the index is not sorted, it is sorted before continuity is checked. Specifying start and end also checks for gaps at the beginning and end of the index. Specifying freq makes the gap check use that frequency.

Parameters:

idx (pd.DatetimeIndex) – timezone-aware DatetimeIndex to analyse.
start (datetime expression) – when to start the analysis. Defaults to None, which means from the lower bound of idx.
end (datetime expression) – when to end the analysis. Defaults to None, which means up to the upper bound of idx.
freq (str) – frequency to analyse on. Defaults to None, which means the actual frequency of idx.

Returns:

(TSAnalytics namedtuple) – freq, sorted, continuous, gaps, duplicates
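
The gap search described above can be sketched with plain pandas. This is only an illustration, not the wax_toolbox implementation: the helper name find_gaps and its step argument are hypothetical, and the sketch assumes a fixed step given as a pd.Timedelta.

```python
import pandas as pd


def find_gaps(idx, step, start=None, end=None):
    """Return gaps as (first_missing, last_missing) tuples, both inclusive.

    Hypothetical helper: `step` is assumed to be a fixed pd.Timedelta.
    """
    idx = idx.sort_values().unique()
    start = idx[0] if start is None else start
    end = idx[-1] if end is None else end

    # Every timestamp we would expect to see between start and end.
    expected = pd.date_range(start, end, freq=step)
    missing = expected.difference(idx)

    gaps = []
    for ts in missing:
        if gaps and ts - gaps[-1][1] == step:
            gaps[-1] = (gaps[-1][0], ts)  # extend the current gap
        else:
            gaps.append((ts, ts))  # open a new gap
    return gaps


hour = pd.Timedelta(hours=1)
full = pd.date_range("2016-03-01 00:00", "2016-03-01 17:00", freq=hour, tz="CET")
present = full[[2, 3, 4, 7, 8, 9, 14, 15, 16, 17]]

gaps = find_gaps(present, hour, start=full[0], end=full[-1])
for first, last in gaps:
    print(first, "->", last)
```

Passing start and end is what lets the leading gap (00:00 to 01:00) show up; without them the search would begin at the first timestamp present in the index.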

wax_toolbox.tsanalytics.detect_frequency(idx)

Return the most plausible frequency of a pd.DatetimeIndex, even when it contains gaps. It calculates the deltas between consecutive elements of the index (idx[1:] - idx[:-1]), takes the ‘mode’ of those deltas (the most frequent delta) and transforms it into a frequency (‘H’,‘15T’,…).

Parameters:	idx (pd.DatetimeIndex) – datetime index to analyse.
Returns:	frequency (str)

Note

A solution exists in pandas:

from pandas.tseries.frequencies import _TimedeltaFrequencyInferer
inferer = _TimedeltaFrequencyInferer(idx)
freq = inferer.get_freq()

But for timeseries with a non-constant frequency (such as the ‘publication_date’ of forecast timeseries), inferer.get_freq() returns None.

In those cases, the smallest possible frequency is returned.
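
The mode-of-deltas idea can be sketched as follows. The function name most_common_step is hypothetical and this is not the library's code; when several deltas are equally frequent, the sketch assumes that picking the smallest one is acceptable.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def most_common_step(idx):
    """Return the most frequent delta between consecutive timestamps.

    Hypothetical helper; ties are resolved by taking the smallest delta.
    """
    idx = idx.sort_values()
    deltas = pd.Series(idx[1:] - idx[:-1])
    return deltas.mode().min()


hour = pd.Timedelta(hours=1)
full = pd.date_range("2016-03-01 00:00", "2016-03-01 17:00", freq=hour, tz="CET")
idx_gap = full[[2, 3, 4, 7, 8, 9, 14, 15, 16, 17]]  # hourly index with holes

step = most_common_step(idx_gap)
offset = to_offset(step)  # pandas offset object for that delta
print(step, offset)
```

Here the deltas are mostly one hour, with one 3-hour and one 5-hour jump at the gaps, so the mode is still one hour.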

wax_toolbox.tsanalytics.get_tz_info(tzname, limit_year=2000)

Get DST information.

Parameters:

tzname (str) – a timezone.
limit_year (int) – drop the DST transition datetimes older than this year.

Returns:

(tuple) – 2-element tuple containing:

tz (pytz.timezone): the timezone object built from the tzname string.
df (pd.DataFrame): dataframe containing DST information.

In [1]: from wax_toolbox.tsanalytics import get_tz_info

In [2]: tz, df = get_tz_info('CET')

In [3]: tz
Out[3]: <DstTzInfo 'CET' CET+1:00:00 STD>

In [4]: df.head(10)
Out[4]: 
   dstoffset            timestamp
61  01:00:00  2000-03-26 01:00:00
62  00:00:00  2000-10-29 01:00:00
63  01:00:00  2001-03-25 01:00:00
64  00:00:00  2001-10-28 01:00:00
65  01:00:00  2002-03-31 01:00:00
66  00:00:00  2002-10-27 01:00:00
67  01:00:00  2003-03-30 01:00:00
68  00:00:00  2003-10-26 01:00:00
69  01:00:00  2004-03-28 01:00:00
70  00:00:00  2004-10-31 01:00:00
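
A comparable list of transitions can be rebuilt with pandas alone, which may help when checking the table above. This sketch is an assumption about how one could do it, not how get_tz_info works: the name dst_transitions is hypothetical, and it scans a single year hour by hour in UTC and keeps the instants where the DST offset changes.

```python
import pandas as pd


def dst_transitions(tzname, year):
    """Return the DST offset taking effect at each transition, indexed by UTC time.

    Hypothetical helper: scans one year at hourly resolution.
    """
    utc_hours = pd.date_range(
        f"{year}-01-01", f"{year + 1}-01-01", freq=pd.Timedelta(hours=1), tz="UTC"
    )
    local = utc_hours.tz_convert(tzname)
    offsets = pd.Series([ts.dst() for ts in local], index=utc_hours)

    # Keep only the hours whose DST offset differs from the previous hour.
    values = offsets.to_numpy()
    mask = values[1:] != values[:-1]
    return offsets.iloc[1:][mask]


transitions = dst_transitions("CET", 2000)
print(transitions)
```

For 'CET' in 2000 this yields two rows, at 2000-03-26 01:00 and 2000-10-29 01:00 UTC, matching the first two rows of the dataframe shown above.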

wax_toolbox.tsanalytics.tz_convert_multiindex(ts, to_tz='UTC')

Convert all aware indexes of a MultiIndex timeseries to another timezone. It first checks that the indexes are actually aware.

Parameters:

ts (pd.Series with pd.DatetimeIndex) – timeseries with a MultiIndex.
to_tz (str) – timezone to convert into.

Returns:

(pd.Series) – timeseries with its timezone converted.
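
One way to convert the levels of a MultiIndex is to rebuild each datetime level with MultiIndex.set_levels. The sketch below is illustrative only: convert_aware_levels is a hypothetical name, and raising ValueError on a naive datetime level is an assumption about what the awareness check might do.

```python
import pandas as pd


def convert_aware_levels(ts, to_tz="UTC"):
    """Convert every tz-aware datetime level of a MultiIndex to `to_tz`.

    Hypothetical helper; naive datetime levels raise (assumed behaviour).
    """
    index = ts.index
    for i, level in enumerate(ts.index.levels):
        if not isinstance(level, pd.DatetimeIndex):
            continue  # leave non-datetime levels untouched
        if level.tz is None:
            raise ValueError(f"level {i} is naive, cannot convert")
        index = index.set_levels(level.tz_convert(to_tz), level=i)

    out = ts.copy()
    out.index = index
    return out


pub = pd.date_range("2016-03-01", periods=2, freq=pd.Timedelta(days=1), tz="CET")
tgt = pd.date_range("2016-03-01 12:00", periods=2, freq=pd.Timedelta(hours=1), tz="CET")
mi = pd.MultiIndex.from_product([pub, tgt], names=["publication_date", "target_date"])
ts = pd.Series(range(4), index=mi)

utc = convert_aware_levels(ts)
print(utc)
```

Converting a level changes only its timezone, not the instants it represents, so the ordering of the level and the codes of the MultiIndex stay valid.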

wax_toolbox.tsanalytics.tz_fix(df, time_col, from_tz='Europe/Brussels', split_by=None, dropval_on_fail=False)

Try to fix the timezone of a datetime column.

Parameters:

df (pd.DataFrame) – dataframe to process.
time_col (str) – name of the column to be processed.
from_tz (str) – initial timezone of the naive time_col. Defaults to ‘Europe/Brussels’.
split_by (str) – name of the column to split by. It is necessary when the dataframe has several series. Defaults to None.
dropval_on_fail (bool) –
- if False, raise TzFixFail if the timezone could not be resolved.
- if True, drop the DST values in case of TzFixFail.
Defaults to False.
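
The "drop on failure" idea can be approximated with pandas' own localization options. This is a simplified sketch, not what tz_fix does internally: localize_or_drop is a hypothetical name, and the sketch simply turns every ambiguous or nonexistent local time into NaT and drops those rows, without trying to resolve them first.

```python
import pandas as pd


def localize_or_drop(df, time_col, from_tz="Europe/Brussels"):
    """Localize a naive datetime column, dropping rows that cannot be resolved.

    Hypothetical helper: ambiguous (autumn) and nonexistent (spring) local
    times become NaT and are then removed.
    """
    out = df.copy()
    out[time_col] = pd.to_datetime(out[time_col]).dt.tz_localize(
        from_tz, ambiguous="NaT", nonexistent="NaT"
    )
    return out.dropna(subset=[time_col])


df = pd.DataFrame(
    {
        "time": [
            "2016-03-27 01:30",  # valid, before the spring transition
            "2016-03-27 02:30",  # nonexistent: clocks jump from 02:00 to 03:00
            "2016-03-27 03:30",  # valid, after the spring transition
            "2016-10-30 02:30",  # ambiguous: this local time occurs twice
        ],
        "value": [1, 2, 3, 4],
    }
)

fixed = localize_or_drop(df, "time")
print(fixed)
```

A single ambiguous timestamp cannot be resolved without its neighbours, which is presumably why a per-series split (split_by) matters when several series share one dataframe.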

wax_toolbox.tsanalytics.tz_localize_multiindex(ts, from_tz='UTC')

Localize all naive indexes of a MultiIndex timeseries. It first checks that the indexes are actually naive.

Parameters:

ts (pd.Series with pd.DatetimeIndex) – timeseries with a MultiIndex.
from_tz (str) – timezone to localize into.

Returns:

(pd.Series) – timeseries with a localized MultiIndex.