preprocess_data
Description
Transforms the data to the user-defined format and prepares it for modeling. Preprocessing has five steps. First, missing values are imputed. Second, the temporal and spatial scales of the data are transformed to the scale the user wants for prediction. Third, the target variable is modified according to the user-specified mode. Fourth, if the user requests it, the values of the covariates in the spatial units neighbouring each unit are averaged and added to the data as new covariates. Finally, the data are reshaped into the historical format, which contains the historical values of the input covariates and the values of the target variable at the forecast horizon. Additionally, if the output data frame(s) should contain the values of some covariates in future temporal units, the names of these covariates can be specified using the futuristic_covariates argument.
Usage
- preprocess.preprocess_data(data, forecast_horizon, history_length=1, column_identifier=None, spatial_scale_table=None, spatial_scale_level=1, temporal_scale_level=1, target_mode='normal', imputation=True, aggregation_mode='mean', augmentation=False, futuristic_covariates=None, future_data_table=None, neighbouring_matrix=None, neighbouring_layers=0, save_address=None, verbose=0)
Parameters
# |
Input Name |
Input Description |
|---|---|---|
1
|
data
|
type: data frame, string or dict
default: -
details: a data frame or data address of all the covariates and
the target variable. The data on temporal (time dependent) covariates
and spatial (time independent) covariates could also be passed to the
function separately. In this case, a dictionary must be passed. The
data frame or address of data on temporal covariates and target
variable must be included in the dictionary with the ‘temporal_data’
key, and the data frame or address of data on spatial covariates must
be the value of the key ‘spatial_data’. Fig. 3 shows
sample input data tables.
The temporal_data must include the following columns:
Spatial ids: The id of the units in the finest spatial scale of input
data must be included in the temporal_data in a column with the name
‘spatial id level 1’.
The id of units in the secondary spatial scales of input data could be
included in the temporal_data in columns named ‘spatial id level x’,
where x shows the related scale level or could be given in a
spatial_scale_table. Note that spatial id(s) must have unique
values.
Temporal ids: The id of time units recorded in the input data for each
temporal scale must be included as a separate column in the
temporal_data with a name in a format ‘temporal id level x’, where ‘x’
is the related temporal scale level beginning with level 1 for the
smallest scale. The temporal units could have a free but sortable
format like year number, week number and so on. The combination of
these temporal scale levels’ ids should form a unique identifier.
However the integrated format of date and time is also supported. In
the case of using integrated format, only the smallest temporal scale
must be included in the temporal_data with the column name of
‘temporal id’. The expected format of each scale is shown in
Temporal covariates: The temporal covariates must be specified in the
temporal_data with column names in the format ‘temporal covariate x’,
where ‘x’ is the covariate number.
Target: The column of the target variable in the temporal_data must be
named ‘target’.
The spatial_data must include the following columns:
Spatial ids: The id of the units in the finest spatial scale of input
data must be included in the spatial_data with the name ‘spatial id
level 1’. The id of units in the secondary spatial scales of input
data could be included in the spatial_data in columns named ‘spatial
id level x’, where x shows the related scale level or could be given
in the spatial_scale_table.
Spatial covariates: The spatial covariates must be specified in the
spatial_data with column names in the format ‘spatial covariate x’,
where ‘x’ is the covariate number.
example: {'temporal_data': './Covid 19 temporal data.csv',
'spatial_data': './Covid 19 spatial data.csv'}
|
2
|
forecast_horizon
|
type: int
default: -
details: Number of temporal units in the future to be forecasted.
|
3
|
history_length
|
type: int or dict
default: 1
details: The number of past temporal units whose information is
used for prediction. If an integer is passed, the function produces
a single data frame including the historical values of all the
covariates with the same history length (i.e., the specified integer
value). If the maximum history length of each covariate is specified
in a dictionary with the temporal covariate names as its keys and the
corresponding maximum history lengths as its values, the function
produces a data frame for each combination of covariates’ history
lengths; an example is shown in Table 2.
example: {('temperature', 'precipitation'): 3,
'social distancing policy': 5}
|
4
|
column_identifier
|
type: dict or None
default: None
details: If the input data column names do not match the
specific format of temporal and spatial ids and covariates (i.e.,
‘temporal id’, ‘temporal id level x’, ‘spatial id level x’, ‘temporal
covariate x’, ‘spatial covariate x’, ‘target’), a dictionary must be
passed to specify the content of each column.
Each key must be a string in one of the formats: {‘temporal
id’, ‘temporal id level x’, ‘spatial id level x’, ‘temporal covariates’,
‘spatial covariates’, ‘target’}
The values of ‘temporal id level x’ and ‘spatial id level x’ must be
the name of the column containing the temporal or spatial ids in the
scale level x respectively.
If the input data has integrated format for temporal ids, the name
of the corresponding column must be specified with the key ‘temporal
id’.
The values of ‘temporal covariates’ and ‘spatial covariates’ are the
list of temporal and spatial covariates respectively, and the value of
the ‘target’ is the column name of the target variable.
example: {'temporal id level 1': 'week', 'temporal id level 2': 'year',
'spatial id level 1': 'county_fips', 'spatial id level 2': 'state_fips',
'temporal covariates': ['temperature', 'social distance policy'],
'spatial covariates': ['population', 'hospital beds'],
'target': 'covid-19 deaths'}
|
5
|
spatial_scale_table
|
type: data frame, string, or None
default: None
details: If the ids of secondary spatial scale units are not
included in the input data, a data frame containing the information on
the different spatial scales must be passed to the function. Its first
column must be named ‘spatial id level 1’ and include the ids of the
units in the smallest spatial scale; the remaining columns include the
ids of the bigger-scale units for each unit of the smallest scale. If
the column names do not match the format ‘spatial id level x’, the
content of each column must be specified using the column_identifier
argument.
The address of the data frame could also be passed.
|
6
|
spatial_scale_level
|
type: int
default: 1
details: The spatial scale level that is considered for
prediction.
|
7
|
temporal_scale_level
|
type: int
default: 1
details: The temporal scale level that is considered for
prediction.
Note: If the temporal ids have an integrated format, the scale of the
specified level will be determined based on the input scale and the
following sequence of temporal scales:
Second, Minute, Hour, Day, Week, Month, Year
|
9
|
target_mode
|
type: {‘normal’, ‘cumulative’, ‘differential’, ‘moving average’}
default: ‘normal’
details: The mode of the target variable on which the prediction
methods are trained:
‘normal’:
No modification.
‘cumulative’:
Target variable shows the cumulative value of the variable from the
first date in the data.
‘differential’:
Target variable shows the difference between the value of the variable
in the current and previous temporal unit.
‘moving average’:
Target variable values are modified to represent, for each temporal
unit of the current scale, the average of the variable values over the
previous temporal unit of the next higher scale level (e.g., if the
current temporal scale is day, the value of the target variable in each
day will be the average of the values in the previous week).
|
10
|
imputation
|
type: bool
default: True
details: Specify whether or not to perform imputation.
|
11
|
aggregation_mode
|
type: {‘sum’, ‘mean’} or dict
default: ‘mean’
details: The aggregation operator used to derive covariate
values for samples of a bigger spatial scale from samples of the smaller
spatial scale in the spatial scale transformation process.
The operator can differ between covariates, in which case
a dictionary (dict) must be passed with the covariates as its keys and
‘mean’ or ‘sum’ as its values.
example: {'temperature': 'mean', 'precipitation': 'sum',
'population': 'sum'}
|
12
|
augmentation
|
type: bool
default: False
details: Specifies whether to augment the data when using bigger
temporal scales, to avoid a decrease in data volume. With
augmentation, the temporal scale transformation uses the moving
average method instead of taking the average of the smaller-scale
units’ values to obtain the bigger-scale unit value.
|
13
|
futuristic_covariates
|
type: dict or None
default: None
details: A dictionary of temporal covariates whose values at
future temporal units will be considered for prediction. The keys are
the names of temporal covariates (or tuples of multiple covariate names)
and each value is a list of length 2 giving the start and end
points of the future temporal interval in which the values of the
covariates will be included in the historical data frame.
example: {'temperature': [2, 4],
('social distancing policy', 'precipitation'): [6, 6]}
|
14
|
future_data_table
|
type: data frame or string or None
default: None
details: The data address or data frame containing the values of
the futuristic covariates in the future temporal units, where the
future refers to the temporal units after the last unit with recorded
information for the covariates and target variable in the input data.
These values can also be included in the input data (temporal data),
in rows corresponding to the future temporal units that have values
for the futuristic covariates and NA for the other covariates.
Note that all the temporal and spatial ids in the input data must
be included in the future_data_table. An example of
future_data_table is shown in Fig. 4.
|
15
|
neighbouring_matrix
|
type: numpy.ndarray or None
default: None
details: The adjacency matrix of the spatial units.
A two-dimensional binary array with dimensions equal to the number
of spatial units included in the data. The value of each element of this
array indicates the adjacency (value 1) or non-adjacency (value 0) of
two spatial units. Note that the order considered for the spatial units
in the rows and columns of the matrix should be based on their spatial
id order (numerical order for numeric ids or lexicographical order for
string ids).
example: numpy.array([[0,1,0],[1,0,1],[0,1,0]])
|
16
|
neighbouring_layers
|
type: int
default: 0
details: The number of neighbouring layers.
Each neighbouring layer of a spatial unit includes the neighbours at
a certain distance from this spatial unit. The first layer contains
the adjacent neighbours, the second layer contains the neighbours at a
distance of one spatial unit, and so on for the other layers. For each
covariate, the average of its values over the spatial units
included in a neighbouring layer is added to the data as a new
covariate. Therefore, neighbouring_layers = n adds n new covariates
to the data for each covariate. The names of the new covariates
have a special format, e.g. the covariate obtained by averaging the
‘temperature’ in the spatial units of the first neighbouring layer has
the name ‘temperature_l1’.
|
17
|
save_address
|
type: string or None
default: None
details: The path where the resulting data frame(s) are saved as CSV file(s).
If None is passed, the data will not be saved. The number of CSV
files saved depends on the user-specified history length. If the
specified history_length is an integer (x), the single data frame
is saved with a name in the format ‘historical data h=x.csv’. If
the history_length is a dictionary of the max history lengths of
each covariate, a CSV file is saved for each resulting historical data
frame with maximum history length x, with a name in the format
‘historical data h=x.csv’.
example: ‘./’
|
18
|
verbose
|
type: int
default: 0
details: The level of detail in the logging information produced.
available options:
0: no logging
1: only important information logging
2: all details logging
|
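The neighbouring_matrix and neighbouring_layers parameters can be illustrated with a small NumPy sketch. It is not the package's implementation: the helper names `neighbour_layers` and `layer_average` are hypothetical, and the sketch assumes that layer k of a unit contains the units whose shortest path to it in the adjacency matrix has exactly k steps.

```python
import numpy as np

def neighbour_layers(adjacency, n_layers):
    """Return one boolean matrix per layer; entry [i, j] is True when unit j
    is in that layer of unit i (assumed: shortest-path distance == layer number)."""
    adjacency = np.asarray(adjacency, dtype=bool)
    n = adjacency.shape[0]
    reached = np.eye(n, dtype=bool)   # the unit itself plus units in earlier layers
    frontier = np.eye(n, dtype=bool)  # units in the most recent layer
    layers = []
    for _ in range(n_layers):
        # move one edge outward from the frontier, then drop units already reached
        step = (frontier.astype(int) @ adjacency.astype(int)) > 0
        frontier = step & ~reached
        reached |= frontier
        layers.append(frontier)
    return layers

def layer_average(values, layer):
    """Average `values` over each unit's layer; NaN where the layer is empty."""
    counts = layer.sum(axis=1)
    sums = layer.astype(float) @ values
    return np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)

# Three units in a chain (0 - 1 - 2), using the example matrix from the table
adjacency = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
temperature = np.array([10.0, 20.0, 30.0])  # made-up covariate values

layer1, layer2 = neighbour_layers(adjacency, 2)
temperature_l1 = layer_average(temperature, layer1)
temperature_l2 = layer_average(temperature, layer2)
```

In this toy case the middle unit has no second-layer neighbours, so its `temperature_l2` value is NaN; how the package treats empty layers is not stated in this page.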
Note
We assume that there is no gap in the time sequences of input data.
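Before calling the function, the no-gap assumption can be checked with a short pandas sketch like the one below. The helper name `find_missing_time_units` is hypothetical, and the check only finds temporal units that are present for some spatial units but missing for others; it cannot detect a temporal unit that is absent for every spatial unit.

```python
import pandas as pd

def find_missing_time_units(temporal_data,
                            spatial_col="spatial id level 1",
                            temporal_col="temporal id level 1"):
    """Return the (spatial id, temporal id) pairs that are absent from the data."""
    units = sorted(temporal_data[spatial_col].unique())
    times = sorted(temporal_data[temporal_col].unique())
    # every combination that should exist if no spatial unit has a gap
    expected = pd.MultiIndex.from_product([units, times],
                                          names=[spatial_col, temporal_col])
    present = pd.MultiIndex.from_frame(temporal_data[[spatial_col, temporal_col]])
    return expected.difference(present).tolist()

# Made-up data: spatial unit 2 has no record for temporal unit 2
temporal_data = pd.DataFrame({
    "spatial id level 1": [1, 1, 1, 2, 2],
    "temporal id level 1": [1, 2, 3, 1, 3],
    "target": [5, 7, 6, 2, 4],
})
missing = find_missing_time_units(temporal_data)
```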
Note
Gaps in the sequence of temporal id levels are not allowed. For example, if the input data contains the columns ‘temporal id level 1’, ‘temporal id level 2’, … , ‘temporal id level x’, ‘temporal id level x+2’, the column ‘temporal id level x+2’ is not considered and will be removed from the data.
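The rule in this note can be sketched as a small column check. The helper `usable_temporal_levels` is hypothetical and only mirrors the behaviour described here (keep the consecutive run of levels starting at 1, drop anything after a gap); it is not taken from the package.

```python
import re

def usable_temporal_levels(columns):
    """Split 'temporal id level x' columns into the consecutive run starting
    at level 1 (kept) and the levels that follow a gap (dropped)."""
    pattern = re.compile(r"^temporal id level (\d+)$")
    levels = {}
    for name in columns:
        match = pattern.match(name)
        if match:
            levels[int(match.group(1))] = name
    kept, level = [], 1
    while level in levels:
        kept.append(levels[level])
        level += 1
    # any level above the first missing one is not considered
    dropped = [name for lvl, name in sorted(levels.items()) if lvl > level]
    return kept, dropped

columns = ["spatial id level 1", "temporal id level 1",
           "temporal id level 2", "temporal id level 4", "target"]
kept, dropped = usable_temporal_levels(columns)
```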
Returns
# |
Output Name |
Output Description |
|---|---|---|
1
|
preprocessed_data
|
type: data frame or list of data frames
details: If the user-specified history_length is a single
integer, the function returns a data frame including the historical
values of the covariates with this history_length and the target
variable values at the user-specified forecast_horizon. If the
passed history_length is a dictionary of covariates’ max history
lengths, the function returns a list of historical data frames ordered
by their max history length in {1,…,max(h)}, where max(h) is
the greatest maximum history length of the covariates. Each data frame
includes historical values of each covariate with a history length
equal to the data frame’s max history length, or smaller if the maximum
history length of that covariate specified in history_length is less
than the data frame’s max history length. An example of the resulting
preprocessed data frames for a history_length of type dictionary
with different max history lengths for each covariate is shown in
|
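Because the return type depends on history_length, downstream code may want a uniform view of the result. The sketch below is one possible way to do that; `frames_by_history_length` is a hypothetical helper, and the placeholder frames stand in for real output. It relies only on the ordering described above (list position k holds the frame with max history length k).

```python
import pandas as pd

def frames_by_history_length(preprocessed, history_length):
    """Map each max history length h to its data frame, for either return type."""
    if isinstance(preprocessed, pd.DataFrame):
        # integer history_length: one frame whose history length is that integer
        return {history_length: preprocessed}
    # dictionary history_length: frames are ordered by max history length 1..max(h)
    return {h: frame for h, frame in enumerate(preprocessed, start=1)}

# Placeholder frames standing in for the function's output
frames = [pd.DataFrame({"target": [0.0]}) for _ in range(3)]
by_h = frames_by_history_length(frames, {"temperature": 3})

# File names in the format described for save_address
file_names = [f"historical data h={h}.csv" for h in by_h]
```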
Example
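import pandas as pd
from stpredict.preprocess import preprocess_data

# Load the temporal (time-dependent) and spatial (time-independent) data
df1 = pd.read_csv('USA COVID-19 temporal data.csv')
df2 = pd.read_csv('USA COVID-19 spatial data.csv')

# Build the historical data with history length 2 and forecast horizon 2,
# including future values of the 'Social distancing policy' covariate
historical_data = preprocess_data(data={'temporal_data': df1, 'spatial_data': df2},
                                  forecast_horizon=2, history_length=2,
                                  futuristic_covariates={'Social distancing policy': [1, 2]})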
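The example above reads its inputs from CSV files. As a rough illustration of the column layout described for the data parameter, the sketch below builds two tiny frames by hand. All values are made up, and the call at the end is left as a comment because it needs the stpredict package installed.

```python
import pandas as pd

# Temporal data: one row per (spatial unit, temporal unit) pair
temporal_data = pd.DataFrame({
    "spatial id level 1": [1, 1, 1, 2, 2, 2],
    "temporal id level 1": [1, 2, 3, 1, 2, 3],      # e.g. week number
    "temporal covariate 1": [12.5, 13.0, 11.8, 20.1, 19.7, 21.3],
    "target": [3, 5, 4, 10, 12, 9],
})

# Spatial data: one row per spatial unit, time-independent covariates only
spatial_data = pd.DataFrame({
    "spatial id level 1": [1, 2],
    "spatial covariate 1": [1000, 2500],
})

data = {"temporal_data": temporal_data, "spatial_data": spatial_data}

# Each (spatial id, temporal id) combination should appear exactly once
id_columns = ["spatial id level 1", "temporal id level 1"]
has_duplicate_ids = bool(temporal_data.duplicated(id_columns).any())

# With stpredict installed, the dictionary would be passed as in the example above:
# from stpredict.preprocess import preprocess_data
# historical_data = preprocess_data(data=data, forecast_horizon=1, history_length=2)
```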
| | Covariate 1 | Covariate 2 | Covariate 3 | Covariate 4 |
|---|---|---|---|---|
| Covariate maximum history length | 3 | 1 | 5 | 4 |
| Historical data frame number | Covariate history length in historical data frame | | | |
| 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 1 | 2 | 2 |
| 3 | 3 | 1 | 3 | 3 |
| 4 | 3 | 1 | 4 | 4 |
| 5 | 3 | 1 | 5 | 4 |
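The history lengths in the table above follow a simple rule: in historical data frame number h, each covariate uses a history length of min(h, its maximum history length). A minimal sketch of that rule follows; `history_lengths_per_frame` is a hypothetical helper written for illustration, not part of the package.

```python
def history_lengths_per_frame(history_length):
    """For each data frame h = 1..max(h), return {covariate: history length}."""
    # expand tuple keys such as ('temperature', 'precipitation') into single names
    max_lengths = {}
    for key, max_h in history_length.items():
        names = key if isinstance(key, tuple) else (key,)
        for name in names:
            max_lengths[name] = max_h
    longest = max(max_lengths.values())
    # a covariate's history length is capped at its own maximum
    return [
        {name: min(h, cap) for name, cap in max_lengths.items()}
        for h in range(1, longest + 1)
    ]

table = history_lengths_per_frame(
    {"Covariate 1": 3, "Covariate 2": 1, "Covariate 3": 5, "Covariate 4": 4}
)
for number, lengths in enumerate(table, start=1):
    print(number, list(lengths.values()))
```

Each printed line corresponds to one row of the table: the data frame number followed by the history lengths of the four covariates.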