make_historical_data

Description

Transforming input data to the historical format and extract features. This function prepares the reformed data frame including features and target variable for modeling. The set of features consists of spatial covariates, temporal covariates at current temporal unit (t) and historical values of these covariates at h-1 previous temporal units (t-1 , t-2 , … , t-h+1). The target of the output data frame is the values of the target variable at the temporal unit t+r, where h and r denote the user specified history length and forecast horizon. In addition, if the user prefers to output data frame(s) include the values of some covariates in the future temporal units, the name of these covariates could be specified using the futuristic_covariates argument.

Usage

preprocess.make_historical_data(data, forecast_horizon, history_length=1, column_identifier=None, futuristic_covariates=None, future_data_table=None, step=1, verbose=0)

Parameters

#

Input Name

Input Description

1
















































data
















































type: data frame, string or dict
default: -
details: a data frame or data address of all the covariates and
the target variable. The data on temporal (time dependent) covariates
and spatial (time independent) covariates could also be passed to the
function separately. In this case, the data frame or address of data
on temporal covariates and target variable must be included in the
dictionary with the ‘temporal_data’ key, and the data frame or address
of data on spatial covariates must be the value of key
‘spatial_data’.

The temporal_data must include the following columns:

Spatial ids: The id of the units in the finest spatial scale of input
data must be included in the temporal_data in a column with the name
‘spatial id level 1’.

Temporal ids: The id of time units recorded in the input data for each
temporal scale must be included as a separate column in the
temporal_data with a name in a format ‘temporal id level x’, where ‘x’
is the related temporal scale level beginning with level 1 for the
smallest scale. The temporal units could have a free but sortable
format like year number, week number and so on. The combination of
these temporal scale levels’ ids should form a unique identifier.
However, the integrated format of date and time is also supported. In
the case of using integrated format, only the smallest temporal scale
must be included in the temporal_data with the column name of
‘temporal id’. The expected format of each scale is shown in

Temporal covariates: The temporal covariates must be specified in a
temporal_data with the column name in a format ‘temporal covariate x’
where ‘x’ is the covariate number.

Target: The column of the target variable in the temporal_data must be
named ‘target’.

The spatial_data must include the following columns:

Spatial ids: The id of the units in the finest spatial scale of input
data must be included in the spatial_data with the name ‘spatial id
level 1’.

Spatial covariates : The spatial covariates must be specified in a
spatial_data with the column names in a format ‘spatial covariate x’,
where the ‘x’ is the covariate number.

example: {‘temporal_data’ : ‘./Covid 19 temporal data.csv’,
‘spatial_data’ : ‘./Covid 19 spatial data.csv’}
2



forecast_horizon



type: int
default:-
details: The number of temporal units in the future to be
forecasted.
3










history_length










type: int or dict
default: 1
details: The number of temporal units in the past which their
information is used to predict. This history length could be different
for each temporal covariate, that in this case, a dictionary must be
passed with the temporal covariate names as it’s keys and the
corresponding history lengths as it’s values. The keys could also be a
tuple of multiple covariate names.

example: {(‘temperature’,’precipitation’):2,’social distancing
policy’:5}
4
























column_identifier
























type: dict or None
default: None
details: If the input data column names do not match the
specific format of temporal and spatial ids and covariates (i.e.,
‘temporal id’, ‘temporal id level x’, ‘spatial id level x’, ‘temporal
covariate x’, ‘spatial covariate x’,’target’), a dictionary must be
passed to specify the content of each column.
The keys must be a string in one of the formats: {‘temporal
id’,’temporal id level x’,’spatial id level x’, ‘temporal covariates’,
‘spatial covariates’,’target’}
The values of ‘temporal id level x’ and ‘spatial id level x’ must be
the name of the column containing the temporal or spatial ids in the
scale level x respectively.
If the input data has integrated format for temporal ids, the name
of the corresponding column must be specified with the key ‘temporal
id’.
The values of ‘temporal covariates’ and ‘spatial covariates’ are the
list of temporal and spatial covariates respectively, and the value of
the ‘target’ is the column name of the target variable.

example: {‘temporal id level 1’: ‘week’,’temporal id level 2’:
‘year’,’spatial id level 1’: ‘county_fips’, ‘spatial id level 2’:
‘state_fips’, ‘temporal covariates’:[‘temperature’, ‘social distance
policy’], ‘spatial covariates’:[‘population’,’hospital
beds’],’target’:’covid-19 deaths’}
5










futuristic_covariates










type: dict or None
default: None
details: a dict of temporal covariates whose values at the future
temporal units will be considered for prediction. The keys are the
name of temporal covariates (or tuple of multiple covariate names) and
the values are the list of length 2 representing the start and end
point of the temporal interval in the future in which values of
covariates will be included in the historical data frame.

example: {‘temperature’: [2,4], (‘temperature’, ‘social distancing
policy’): [6,6]}
6












future_data_table












type: data frame or string or None
default: None
details: data address or data frame containing futuristic
covariates values in the temporal units in the corresponding interval
in the future, where the future refers to the temporal units after the
last unit with recorded information for the covariates and target
variable in input data.
These values can also be included in the input data (temporal data)
in the rows corresponding to the future temporal units having values
for the futuristic covariates and NA for other covariates.
Note that all the temporal and spatial id’s in the input data must
be included in the future_data_table. An example of
future_data_table is shown in Fig. 4.
7







step







type: int
default: 1
details: The number of instances in the time sequence to be
considered as a temporal unit in the process of constructing
historical data. Normally the step is equal to one, and each instance
is considered as a temporal unit, but if the augmentation is used in
the temporal scale transformation, the step must be set to the moving
average window size.
8





verbose





type: int
default: 0
details: The level of details in produced logging information
available options:
0: no logging
1: only important information logging

Note

We assume that there is no gap in the time sequences of input data.

Returns

#

Output Name

Output Description

1






historical_data






type: dataframe
details: a data frame including historical values of each
covariates with the specified history length for that covariate and
the target variable values at the user-specified forecast_horizon.
If the futuristic_covariates is not None the output also includes
the values of these covariates at the time points in the specified
interval for that covariate in the future.

Example

import pandas as pd
from stpredict.preprocess import make_historical_data

df1 = pd.read_csv('USA COVID-19 temporal data.csv')
df2 = pd.read_csv('USA COVID-19 spatial data.csv')


historical_data_frame = make_historical_data(data = {'temporal_data':df1,'spatial_data':df2},
                                             forecast_horizon = 4,
                                             history_length = {('temperature','precipitation'):2})