preprocess_data

Description

Transform data to the user defined format and prepare it for modeling. The preprocessing procedure has several steps. First the imputation of missing values will be performed; and in the second step the temporal and spatial scales of data are transformed to the user’s desired scale for prediction. Then in the third step, the target variable will be modified based on the user specified mode. In the fourth step, if the user prefers, the values of covariates in the neighbouring spatial units to each unit are averaged and added to the data as new covariates. The last step is to reform the data to the historical format containing the historical values of input data covariates and values of the target variable at the forecast horizon. Additionally, if the user prefers that the output data frame(s) contain the values of some covariates in future temporal units, the names of these covariates can be specified using the futuristic_covariates argument.

Usage

preprocess.preprocess_data(data, forecast_horizon, history_length=1, column_identifier=None, spatial_scale_table=None, spatial_scale_level=1, temporal_scale_level=1, target_mode='normal', imputation=True, aggregation_mode='mean', augmentation=False, futuristic_covariates=None, future_data_table=None, neighbouring_matrix=None, neighbouring_layers=0, save_address=None, verbose=0)

Parameters

#	Input Name	Input Description
1	data	type: data frame, string or dict default: - details: a data frame or data address of all the covariates and the target variable. The data on temporal (time dependent) covariates and spatial (time independent) covariates could also be passed to the function separately. In this case, a dictionary must be passed. The data frame or address of data on temporal covariates and target variable must be included in the dictionary with the ‘temporal_data’ key, and the data frame or address of data on spatial covariates must be the value of key ‘spatial_data’. Fig. 3 represent a sample input data tables. The temporal_data must include the following columns: Spatial ids: The id of the units in the finest spatial scale of input data must be included in the temporal_data in a column with the name ‘spatial id level 1’. The id of units in the secondary spatial scales of input data could be included in the temporal_data in columns named ‘spatial id level x’, where x shows the related scale level or could be given in a spatial_scale_table. Note that spatial id(s) must have unique values. Temporal ids: The id of time units recorded in the input data for each temporal scale must be included as a separate column in the temporal_data with a name in a format ‘temporal id level x’, where ‘x’ is the related temporal scale level beginning with level 1 for the smallest scale. The temporal units could have a free but sortable format like year number, week number and so on. The combination of these temporal scale levels’ ids should form a unique identifier. However the integrated format of date and time is also supported. In the case of using integrated format, only the smallest temporal scale must be included in the temporal_data with the column name of ‘temporal id’. The expected format of each scale is shown in Table 2. Temporal covariates: The temporal covariates must be specified in a temporal_data with the column name in a format ‘temporal covariate x’ where ‘x’ is the covariate number. Target: The column of the target variable in the temporal_data must be named ‘target’. The spatial_data must includes following columns: Spatial ids: The id of the units in the finest spatial scale of input data must be included in the spatial_data with the name ‘spatial id level 1’. The id of units in the secondary spatial scales of input data could be included in the spatial_data in columns named ‘spatial id level x’, where x shows the related scale level or could be given in the spatial_scale_table. Spatial covariates: The spatial covariates must be specified in a spatial_data with the column names in a format ‘spatial covariate x’, where the ‘x’ is the covariate number. example: {‘temporal_data’ : ‘./Covid 19 temporal data.csv’, ‘spatial_data’ : ‘./Covid 19 spatial data.csv’}
2	forecast_horizon	type: int default: - details: Number of temporal units in the future to be forecasted.
3	history_length	type: int or dict default: 1 details: The number of temporal units in the past which their information is used to predict. If an integer is passed, function will produce only a single data frame including the historical values of all the covariates with the same history length (i.e., the specified integer value), but if the maximum history length of each covariate is specified in a dictionary with the temporal covariate names as it’s keys and the corresponding maximum history lengths as it’s values, the function will produce a dataframe for each combination of covariates’ history lengths, as an example is shown in Table 2. example: {(‘temperature’,’precipitation’):3,’social distancing policy’:5}
4	column_identifier	type: dict or None default: None details: If the input data column names do not match the specific format of temporal and spatial ids and covariates (i.e., ‘temporal id’, ‘temporal id level x’, ‘spatial id level x’, ‘temporal covariate x’, ‘spatial covariate x’,’target’), a dictionary must be passed to specify the content of each column. The keys must be a string in one of the formats: {‘temporal id’,’temporal id level x’,’spatial id level x’, ‘temporal covariates’, ‘spatial covariates’, ‘target’} The values of ‘temporal id level x’ and ‘spatial id level x’ must be the name of the column containing the temporal or spatial ids in the scale level x respectively. If the input data has integrated format for temporal ids, the name of the corresponding column must be specified with the key ‘temporal id’. The values of ‘temporal covariates’ and ‘spatial covariates’ are the list of temporal and spatial covariates respectively, and the value of the ‘target’ is the column name of the target variable. example: {‘temporal id level 1’: ‘week’,’temporal id level 2’: ‘year’,’spatial id level 1’: ‘county_fips’, ‘spatial id level 2’: ‘state_fips’, ‘temporal covariates’:[‘temperature’, ‘social distance policy’], ‘spatial covariates’:[‘population’,’hospital beds’],’target’:’covid-19 deaths’}
5	spatial_scale_table	type: data frame, string, or None default: None details: If the ids of secondary spatial scale units are not included in the input data, a data frame must be passed to the function containing different spatial scales information, with the first column named ‘spatial id level 1’, and including the id of the units in the smallest spatial scale and the rest of the columns including the id of bigger scale units for each unit of the smallest scale. If the column names do not match the format ‘spatial id level x’ the content of each column must be specified using column_identifier argument. The address of the dataframe could also be passed.
6	spatial_scale_level	type: int default: 1 details: The spatial scale level that is considered for prediction.
7	temporal_scale_level	type: int default: 1 details: The temporal scale level that is considered for prediction. Note.If the temporal id have an integrated format, the scale of the specified level will be determined based on the input scale and the following sequence of temporal scales: Second, Minute, Hour, Day, Week, Month, Year
9	target_mode	type: {‘normal’, ‘cumulative’, ‘differential’,’moving average’} default: ‘normal’ details: The mode of target variable which will be used to learn the methods for prediction: ‘normal’: No modification. ‘cumulative’: Target variable shows the cumulative value of the variable from the first date in the data. ‘differential’: Target variable shows the difference between the value of the variable in the current and previous temporal unit. ‘moving average’: Target variable values are modified to represent the average of the variable values in the previous higher level scale temporal unit for each current scale temporal unit. (e.g., If the current temporal scale is day, the value of the target variable in each day will be the average of values in the previous week.)
10	imputation	type: bool default: True details: Specify whether or not to perform imputation.
11	aggregation_mode	type: {‘sum’,’mean’} or dict default: ‘mean’ details: Aggregation operator which is used to derive covariate values for samples of bigger spatial scale from samples of smaller spatial scale in the spatial scale transforming process. This operator could be different for each covariate which in this case a dictionary (dict) must be passed with covariates as its keys and ‘mean’ or ‘sum’ as its values. example: {‘temperature’:’mean’,’precipitation’:’sum’, ‘population’:’sum’}
12	augmentation	type: bool default: False details: Specify whether or not to augment data when using bigger temporal scales to avoid data volume decrease. For this purpose, in the process of temporal scale transformation, instead of taking the average of smaller scale units’ values to get the bigger scale unit value, the moving average method is used.
13	futuristic_covariates	type: dict or None default: None details: a dictionary of temporal covariates whose values at the future temporal units will be considered for prediction. The keys are the name of temporal covariates (or tuple of multiple covariate names) and the values are the list of length 2 representing the start and end point of the temporal interval in the future in which values of covariates will be included in the historical dataframe. example: {‘temperature’: [2,4], (‘social distancing policy’,’precipitation’): [6,6]}
14	future_data_table	type: data frame or string or None default: None details: data address or data frame containing futuristic covariates values in the temporal units in the future, where the future refers to the temporal units after the last unit with recorded information for the covariates and target variable in input data. These values can also be included in the input data (temporal data) in the rows corresponding to the future temporal units having values for the futuristic covariates and NA for other covariates. Note that all the temporal and spatial id’s in the input data must be included in the future_data_table. An example of future_data_table is shown in Fig. 4 .
15	neighbouring_matrix	type: numpy.ndarray or None default: None details: The adjacency matrix of spatial units A two-dimensional binary array with dimensions equal to the number of spatial units included in the data. The value of each element of this array indicates the adjacency (value 1) or non-adjacency (value 0) of two spatial units. Note that the order considered for the spatial units in the rows and columns of the matrix should be based on their spatial id order (numerical order for numeric ids or lexicographical order for string ids). example: numpy.array([[0,1,0],[1,0,1],[0,1,0]])
16	neighbouring_layers	type: int default: 0 details: The number of neighbouring layers Each neighbouring layer for a spatial unit includes neighbours with a certain distance from this spatial unit. The first layer contains adjacent neighbours, the second layer contains neighbours with a distance of one spatial unit, and so on for other layers. For each covariate, the average values of this covariate in the spatial units included in a neighbouring layer are added to the data as a new covariate. Therefore, neighbouring_layers = n, for each covariate, adds n new covariates to the data. The name of the new covariates have a special format, e.g. the covariate obtained by averaging the ‘temperature’ in spatial units of the first neighbouring layer has the name ‘temperature_l1’.
17	save_address	type: string or None default: None details: The path to save a resulting data frame(s) as a CSV file. If None is passed the data will not be saved. The number of CSV file(s) saved depends on the user specified history length. If the specified history_length is an integer (x) the single data frame will be saved with the name in format ‘historical data h=x.csv’, but if the history_length is the dictionary of max history lengths of each covariate, for each resulting historical data frame with maximum history length of x, a CSV file will be saved with the name in format ‘historical data h=x.csv’. example: ‘./’
18	verbose	type: int default: 0 details: The level of details in produced logging information. available options: 0: no logging 1: only important information logging 2: all details logging

Note

We assume that there is no gap in the time sequences of input data.

Note

The gap in the sequence of temporal id levels is not allowed. More clearly if input data contains columns ‘temporal id level 1’,’temporal id level 2’, … , ‘temporal id level x’ , ‘temporal id level x+2’, the column ‘temporal id level x+2’ is not considered and will be removed from the data.

Returns

#	Output Name	Output Description
1	preprocessed_data	type: data frame or list of data frames details: If the user_specified history_length is a single integer, the function returns a data frame including the historical values of the covariates with this history_length and the target variable values at the user-specified forecast_horizon, but if the passed history_length is a dictionary of covariates’ max history lengths, the function returns a list of historical data frames ordered based on their max history length in {1,…,max(h)} where max(h) is the greatest maximum history length of covariates. Each data frame includes historical values of each covariate with the history length equal to the dataframe max history length or smaller (if the maximum history length of that covariate specified in history_length is less than the data frame max history length). An example of the resulting preprocessed data frames for an history_length of type dictionary with different max history lengths for each covariate is shown in Table 1, and Fig. 5 shows an instance of a historical data frame.

Example

import pandas as pd
from stpredict.preprocess import preprocess_data

df1 = pd.read_csv('USA COVID-19 temporal data.csv')
df2 = pd.read_csv('USA COVID-19 spatial data.csv')

historical_data = preprocess_data(data = {'temporal_data':df1,'spatial_data':df2},
                                  forecast_horizon = 2, history_length = 2,
                                  futuristic_covariates = {'Social distancing policy':[1,2]})

Table 1 covariates history length in output historical data frames
	Covariate 1	Covariate 2	Covariate 3	Covariate 4
	Covariate maxsimum history length
	3	1	5	4
Historical data frame number	covariate history length in historical data frame
1	1	1	1	1
2	2	1	2	2
3	3	1	3	3
4	3	1	4	4
5	3	1	5	4