stpredict
Description
Performs all steps of data preprocessing, training, validation, and forecasting. These steps are carried out in two phases.
The first phase is preprocessing, which has four steps. First, missing values are imputed. Second, the temporal and spatial scales of the data are transformed to the user’s desired scale for prediction. Third, the target variable is modified according to the user-specified mode. Last, the data is reformed into the historical format, which contains the historical values of the input covariates and the values of the target variable at the forecast horizon. In addition, if the user prefers to use the values of some covariates in future temporal units for prediction, the names of these covariates can be specified using the futuristic_covariates argument.
The second phase is prediction. First, the best model, history length, and feature set are selected based on the models’ performance on the validation set. The selected configuration is then used to predict the target variable values for the test set and to report the prediction performance. The predictions of the target variable for future temporal units, obtained using the selected configuration, are also saved in a csv file.
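The historical format can be pictured with a small, library-independent sketch. The function name `to_historical_rows`, the column labels, and the convention that a history length of 1 means only the current temporal unit are assumptions made for illustration; this is not stpredict's internal implementation.

```python
def to_historical_rows(covariate, target, history_length, forecast_horizon):
    """Build one row per usable temporal unit t (hypothetical helper).

    Each row holds the covariate at t, t-1, ..., t-(history_length-1)
    and the target at t + forecast_horizon.
    """
    rows = []
    first_t = history_length - 1                   # earliest t with a full history
    last_t = len(target) - 1 - forecast_horizon    # latest t with a known future target
    for t in range(first_t, last_t + 1):
        row = {
            ("covariate t" if lag == 0 else f"covariate t-{lag}"): covariate[t - lag]
            for lag in range(history_length)
        }
        row[f"target t+{forecast_horizon}"] = target[t + forecast_horizon]
        rows.append(row)
    return rows


# Six temporal units of a single spatial unit (made-up numbers).
covariate = [10, 11, 12, 13, 14, 15]
target = [1, 2, 3, 4, 5, 6]
rows = to_historical_rows(covariate, target, history_length=2, forecast_horizon=2)
for row in rows:
    print(row)
```

Units at the start of the series lack a full history and units at the end lack a known future target, so the sketch yields fewer rows than there are temporal units.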
Usage
- stpredict(data, forecast_horizon, history_length=1, column_identifier=None, feature_sets={'covariate': 'mRMR'}, models=['knn'], model_type='regression', test_type='whole-as-one', mixed_models=[], performance_benchmark='MAPE', performance_measures=['MAPE'], performance_mode='normal', splitting_type='training-validation', instance_testing_size=0.2, instance_validation_size=0.3, instance_random_partitioning=False, fold_total_number=5, imputation=True, target_mode='normal', feature_scaler=None, target_scaler=None, forced_covariates=[], futuristic_covariates=None, scenario='current', future_data_table=None, temporal_scale_level=1, spatial_scale_level=1, spatial_scale_table=None, aggregation_mode='mean', augmentation=False, validation_performance_report=True, testing_performance_report=True, save_predictions=True, save_ranked_features=True, plot_predictions=False, verbose=0)
Parameters
# |
Input Name |
Input Description |
|---|---|---|
1
|
data
|
type: data frame, string or dict
default: -
details: a data frame or data address of all the covariates and
the target variable. The data on temporal (time dependent) covariates
and spatial (time independent) covariates could also be passed to the
function separately. In this case, a dictionary must be passed. The
data frame or address of data on temporal covariates and target
variable must be included in the dictionary with the ‘temporal_data’
key, and the data frame or address of data on spatial covariates must
be the value of the ‘spatial_data’ key. Fig. 3 shows
sample input data tables.
The temporal_data must include the following columns:
Spatial ids: The id of the units in the finest spatial scale of input
data must be included in the temporal_data in a column with the name
‘spatial id level 1’.
The id of units in the secondary spatial scales of input data could be
included in the temporal_data in columns named ‘spatial id level x’,
where x shows the related scale level or could be given in a
spatial_scale_table. Note that spatial id(s) must have unique
values.
Temporal ids: The id of time units recorded in the input data for each
temporal scale must be included as a separate column in the
temporal_data with a name in a format ‘temporal id level x’, where ‘x’
is the related temporal scale level beginning with level 1 for the
smallest scale. The temporal units could have a free but sortable
format like year number, week number and so on. The combination of
these temporal scale levels’ ids should form a unique identifier.
However the integrated format of date and time is also supported. In
the case of using integrated format, only the smallest temporal scale
must be included in the temporal_data with the column name of
‘temporal id’. The expected format of each scale is shown in
Temporal covariates: The temporal covariates must be specified in the
temporal_data with column names in the format ‘temporal covariate x’,
where ‘x’ is the covariate number.
Target: The column of the target variable in the temporal_data must be
named ‘target’.
The spatial_data must include the following columns:
Spatial ids: The id of the units in the finest spatial scale of input
data must be included in the spatial_data with the name ‘spatial id
level 1’. The id of units in the secondary spatial scales of input
data could be included in the spatial_data in columns named ‘spatial
id level x’, where x shows the related scale level or could be given
in the spatial_scale_table.
Spatial covariates: The spatial covariates must be specified in the
spatial_data with column names in the format ‘spatial covariate x’,
where ‘x’ is the covariate number.
example: {‘temporal_data’ : ‘./Covid 19 temporal data.csv’,
‘spatial_data’ : ‘./Covid 19 spatial data.csv’}
|
2
|
forecast_horizon
|
type: int
default: -
details: Number of temporal units in the future to be forecasted.
|
3
|
history_length
|
type: int or dict
default: 1
details: The maximum number of past temporal units whose
information is used for prediction. If int, it is the maximum
history length of all the covariates. If dict, it should include the temporal
covariate names as its keys and the corresponding maximum history
lengths as its values.
example: {(‘temperature’,’precipitation’):3,’social distancing
policy’:5}
|
4
|
column_identifier
|
type: dict or None
default: None
details: If the input data column names do not match the
specific format of temporal and spatial ids and covariates (i.e.,
‘temporal id’, ‘temporal id level x’, ‘spatial id level x’, ‘temporal
covariate x’, ‘spatial covariate x’,’target’), a dictionary must be
passed to specify the content of each column.
The keys must be a string in one of the formats: {‘temporal
id’,’temporal id level x’,’spatial id level x’, ‘temporal covariate’,
‘spatial covariate’,’target’}
The values of ‘temporal id level x’ and ‘spatial id level x’ must be
the name of the column containing the temporal or spatial ids in the
scale level x respectively.
If the input data has integrated format for temporal ids, the name
of the corresponding column must be specified with the key ‘temporal
id’.
The values of ‘temporal covariate’ and ‘spatial covariate’ are the
list of temporal and spatial covariates respectively, and the value of
the ‘target’ is the column name of the target variable.
example: {‘temporal id level 1’: ‘week’,’temporal id level 2’:
‘year’,’spatial id level 1’: ‘county_fips’, ‘spatial id level 2’:
‘state_fips’, ‘temporal covariates’:[‘temperature’, ‘social distance
policy’], ‘spatial covariates’:[‘population’,’hospital
beds’],’target’:’covid-19 deaths’}
|
5
|
feature_sets
|
type: dict
default: {‘covariate’: ‘mRMR’}
details:
The key indicates whether, in the rank-based feature selection
process, the ranking is performed on the covariates or
on the whole features (covariates and their historical values),
and the value specifies the ranking method, one of the supported
methods ‘correlation’, ‘mRMR’, and ‘variance’.
If ‘correlation’, covariates (or features) are ranked based on their
correlation with the target variable. If ‘mRMR’, the ranking is
based on the mRMR method, in which the correlation between
the features themselves also affects the ranking.
If ‘variance’, the variance-based sensitivity analysis method is
used, in which covariates (or features) are prioritized based on the
fraction of target variable variance that can be attributed to their
variance.
The candidate feature sets are obtained by slicing the ranked list
of covariates or features from the first item to item number n, where
n varies from 1 to the total number of list items. If the items are
covariates, the covariates in each resulting subset along with their
corresponding historical values form a candidate feature set in the
feature selection process.
dict options: {‘covariate’: ‘correlation’}, {‘covariate’: ‘mRMR’},
{‘covariate’: ‘variance’}, {‘feature’: ‘correlation’},
{‘feature’: ‘mRMR’}, {‘feature’: ‘variance’}
example:
{‘covariate’: ‘correlation’}
|
6
|
models
|
type: list<string, dict, or function>
default: [‘knn’]
details: a list of predefined model names and user-defined model
functions. The supported options are:
‘knn’: K nearest neighbour regressor for model_type of regression, K
nearest neighbour classifier for model_type of classification.
‘gbm’: Gradient boosting regression for model_type of regression,
Gradient boosting classification for model_type of classification.
‘glm’: Elastic net linear regression for model_type of regression,
logistic regression for model_type of classification.
‘nn’: Neural network with adjustable number of layers and neurons for
model_type of regression and classification.
[callable]: a user-defined function that accepts the data frames of
features in the training and validation sets and the target variable
values of the training set in the form of an ndarray, and returns the
predictions on the training and validation sets and the trained model
(See User defined model).
If the user prefers to specify the hyperparameters of a predefined
model, the related item in the list must be a dictionary
with the name of the model as the key and the dictionary of
hyperparameters as its value. To perform a grid search on
hyperparameter values, a list of values can be specified in the
dictionary of hyperparameters.
example: [{‘knn’:{‘n_neighbors’:10, ‘metric’:’minkowski’}}, ‘nn’,
lstm]
or
[{‘knn’:{‘n_neighbors’:[5,10], ‘metric’:’minkowski’}},
{‘nn’:{‘hidden_layers_number’:[1,2,3],’hidden_layers_neurons’:[8,16]}}]
Note. Since the ‘knn’, ‘glm’, and ‘gbm’ models use the implementation of
the scikit-learn package, all of the hyperparameters defined in this
package are supported. The ‘nn’ model uses the TensorFlow
implementation, and the list of supported hyperparameters
is as follows:
[‘hidden_layers_structure’, ‘hidden_layers_number’,
‘hidden_layers_neurons’, ‘hidden_layers_activations’,
‘output_activation’, ‘loss’, ‘optimizer’, ‘metrics’,
‘early_stopping_monitor’, ‘early_stopping_patience’, ‘batch_size’,
‘validation_split’,’epochs’]
The structure of the neural network can be specified using
‘hidden_layers_structure’, which takes a list of two-item tuples, each
corresponding to one hidden layer of the network: the first item of each
tuple is the number of neurons in the layer and the second item is the
activation function of the layer.
Another way is to specify ‘hidden_layers_number’,
‘hidden_layers_neurons’, and ‘hidden_layers_activations’, which
results in a network with ‘hidden_layers_number’ hidden layers that all
share the number of neurons and the activation function given by
‘hidden_layers_neurons’ and ‘hidden_layers_activations’. The ‘output_activation’
hyperparameter is the activation function of the output layer. Note
that to search over different structures, each of these parameters
can take a list of values.
The supported values for the rest of the hyperparameters are the same
as the TensorFlow package.
|
7
|
model_type
|
type: {‘regression’,’classification’}
default: ‘regression’
details: type of the prediction task
|
8
|
test_type
|
type: {‘whole-as-one’,’one-by-one’}
default: ‘whole-as-one’
details: If whole-as-one, the prediction for all the test samples is
made with the best model, feature set, and history length,
which are selected based on the prediction results on a single
training and validation set. The training and
validation sets in this mode are obtained by removing all
the test instances from the data.
If one-by-one, each test sample has its own training
and validation sets, which are obtained by removing only this test
sample and all of its subsequent test samples from the data.
|
9
|
mixed_models
|
type: list<string, dict, or function>
default: []
details: A list of models to be considered as a mixed_model which
uses a combination of simple models’ predictions as input to predict
the target variable. The predictions of the models specified in the
models input will be used to feed the mixed models. The supported
values for the mixed_models input are exactly the same as the models
input.
example: [{‘knn’:{‘n_neighbors’:10, ‘metric’:’minkowski’}}, ‘nn’,
lstm]
|
10
|
performance_benchmark
|
type: {‘MAE’, ‘MAPE’, ‘MASE’, ‘MSE’, ‘R2_score’, ‘AUC’, ‘AUPR’,
‘likelihood’, ‘AIC’, ‘BIC’}
default: ‘MAPE’
details: a performance measure which is used to select best_model,
best_history_length, and best_feature_set
|
11
|
performance_measures
|
type: list<{‘MAE’, ‘MAPE’, ‘MASE’, ‘MSE’, ‘R2_score’, ‘AUC’,
‘AUPR’, ‘likelihood’, ‘AIC’, ‘BIC’}>
default: [‘MAPE’]
details: a list of performance measures to be included in the
performance report
example: [‘MAE’, ‘MAPE’]
|
12
|
performance_mode
|
type: ‘normal’ or ‘cumulative’ or ‘moving_average+x’
default: ‘normal’
details: The mode of the target variable based on which the performance measures
are calculated.
If the user desires to use the moving average mode, the window of the
moving average must also be specified as x in the format
‘moving_average+x’.
|
13
|
splitting_type
|
type: {‘training-validation’, ‘cross-validation’}
default: ‘training-validation’
details: The type of splitting the data.
If ‘training-validation’, the validation set is selected from the last
temporal units in the dataset with the size of
instance_validation_size units.
If ‘cross-validation’, the cross-validation method is performed on
dataset by considering the fold_total_number as the total number of
folds.
In both cases, before splitting the validation set, first the last
part of the dataset with the size of instance_testing_size is selected
as the testing part.
|
14
|
instance_testing_size
|
type: int or float
default: 0.2
details: The number of temporal units to be considered as the testing
set.
If int, it is the number of temporal units to be included
in the testing set.
If float, it is the proportion of temporal units in the
dataset to be included in the testing split, and thus should be between
0.0 and 1.0.
example: 7 or 0.3
|
15
|
instance_validation_size
|
type: int or float
default: 0.3
details: The number of temporal units to be considered as the
validation set.
If int, it is the number of temporal units to be included
in the validation set.
If float, it is the proportion of temporal units in the
dataset (after removing the testing set) to be included in the
validation split, and thus should be between 0.0 and 1.0.
This input is considered only if the splitting_type input is
‘training-validation’.
example: 7 or 0.3
|
16
|
instance_random_partitioning
|
type: bool
default: False
details: The type of data partitioning in validation process
If False, the validation part will be selected from the last recorded
temporal units in the data. But if True the temporal units of
validation part are selected randomly from the recorded temporal units
in the data.
This input is considered if the splitting_type input is
‘training-validation’
Warning: Except for educational purposes, this should always be set to
False.
|
17
|
fold_total_number
|
type: int
default: 5
details: Total number of folds in the cross-validation process.
This input is considered only if the splitting_type input is set to
‘cross-validation’.
|
18
|
imputation
|
type: bool
default: True
details: Specify whether or not to perform imputation.
|
19
|
target_mode
|
type: {‘normal’, ‘cumulative’, ‘differential’,’moving average’}
default: ‘normal’
details: The mode of the target variable that the models learn
to predict:
‘normal’:
No modification.
‘cumulative’:
Target variable shows the cumulative value of the variable from the
first date in the data.
‘differential’:
Target variable shows the difference between the value of the
variable in the current and previous temporal unit.
‘moving average’:
Target variable values are modified to represent the average of the
variable values in the previous higher level scale temporal unit for
each current scale temporal unit. (e.g., If the current temporal scale
is day, the value of the target variable in each day will be the
average of values in the previous week.)
|
20
|
feature_scaler
|
type: {‘logarithmic’,’normalize’,’standardize’,None}
default: None
details: Type of scaling the features in the input data set for
modeling
|
21
|
target_scaler
|
type: {‘logarithmic’,’normalize’,’standardize’,None}
default: None
details: Type of scaling the target variable in the input data set
for modeling
|
22
|
forced_covariates
|
type: list<string>
default: []
details: a list of covariates which are placed in the feature sets
by force with their historical values
example:
[ ‘temperature’, ‘precipitation’]
|
23
|
futuristic_covariates
|
type: dict or None
default: None
details: a dictionary of temporal covariates whose values at
future temporal units will be considered for prediction. The keys are
the names of temporal covariates (or tuples of multiple covariate names)
and the values are lists of length 2 representing the start and end
points of the future temporal interval in which the values of the
covariates will be included in the historical dataframe.
example: {‘temperature’: [2,4], (‘social distancing
policy’,’precipitation’): [6,6]}
|
24
|
scenario
|
type: {‘max’, ‘min’, ‘mean’, ‘current’, None}
default: ‘current’
details: The scenario based on which the future values of
futuristic covariates will be determined in the process of forecasting
the future temporal units.
If min, the minimum observed values for each spatial unit are
considered as future values of the futuristic covariate for that
spatial unit. The same manner is performed for max and mean.
If current, the value of the covariate in the last temporal unit for
each spatial unit is considered as the future value of that
covariate.
If None, the values of futuristic covariates must be provided in the
input data.
example: ‘min’
|
25
|
future_data_table
|
type: data frame or string or None
default: None
details: data address or data frame containing futuristic
covariates values in the temporal units in the future, where the
future refers to the temporal units after the last unit with recorded
information for the covariates and target variable in input data.
These values can also be included in the input data (temporal data)
in the rows corresponding to the future temporal units having values
for the futuristic covariates and NA for other covariates.
Note that all the temporal and spatial ids in the input data must
be included in the future_data_table. An example of
future_data_table is shown in Fig. 4.
|
26
|
temporal_scale_level
|
type: int
default: 1
details: The temporal scale level that is considered for
prediction.
Note. If the temporal ids have an integrated format, the scale of the
specified level will be determined based on the input scale and the
following sequence of temporal scales:
Second, Minute, Hour, Day, Week, Month, Year
|
27
|
spatial_scale_level
|
type: int
default: 1
details: The spatial scale level that is considered for
prediction.
|
28
|
spatial_scale_table
|
type: data frame, string, or None
default: None
details: If the ids of secondary spatial scale units are not
included in the input data, a data frame must be passed to the
function containing different spatial scales information, with the
first column named ‘spatial id level 1’, and including the id of the
units in the smallest spatial scale and the rest of the columns
including the id of bigger scale units for each unit of the smallest
scale. If the column names do not match the format ‘spatial id level
x’ the content of each column must be specified using
column_identifier argument.
The address of the dataframe could also be passed.
|
29
|
aggregation_mode
|
type: {‘sum’,’mean’} or dict
default: ‘mean’
details: Aggregation operator which is used to derive covariate
values for samples of bigger spatial scale from samples of smaller
spatial scale in the spatial scale transforming process.
This operator can be different for each covariate, in which case
a dictionary (dict) must be passed with covariates as its keys and
‘mean’ or ‘sum’ as its values.
example: {‘temperature’:’mean’,’precipitation’:’sum’,
‘population’:’sum’}
|
30
|
augmentation
|
type: bool
default: False
details: Specify whether or not to augment data when using bigger
temporal scales to avoid data volume decrease. For this purpose, in
the process of temporal scale transformation, instead of taking the
average of smaller scale units’ values to get the bigger scale unit
value, the moving average method is used.
|
31
|
validation_performance_report
|
type: bool
default: True
details: performance report of the train_validate process.
If True, a table containing a report on models and their corresponding
errors (based on performance_measures) will be saved in ‘.csv’ format
in the sub directory ‘performance/validation process’ of the directory
in which the code is running.
Each row of the table represents the performance of a specific model
for the given history length and feature set, and each column
represents one of the user specified performance_measures.
|
32
|
testing_performance_report
|
type: bool
default: True
details: performance report of the train_test process.
If True, a table containing a report on performance measures of the
best model for the test set will be saved in ‘.csv’ format in the sub
directory ‘performance/testing process’ of the directory in which the
code is running.
Each column of this table represents one of the user specified
performance_measures.
|
33
|
save_predictions
|
type: bool
default: True
details:
If True, the prediction values of the trained models with the best
configuration (best history length and feature set) for the training
data and validation data through the train_validate process will be saved
in ‘.csv’ format in the sub directory ‘predictions/validation process/’
of the directory in which the code is running, and the
prediction values of the overall best model for the test set will be
saved in ‘.csv’ format in the sub directory ‘predictions/test process/’.
Note.
If the splitting_type is set to ‘cross-validation’ only test set
predictions will be saved.
|
34
|
save_ranked_features
|
type: bool
default: True
details:
If True, the feature ranking will be saved in the sub directory
‘ranked features/’ of the directory in which the code is running.
|
35
|
plot_predictions
|
type: bool
default: False
details:
If True, the real and predicted values of the target variable for some
of the spatial ids in the input data will be plotted and saved in the
sub directory ‘plots/’ of the directory in which the code is running.
|
36
|
verbose
|
type: int
default: 0
details: The level of details in produced logging information
available options:
0: no logging
1: only important information logging
|
Note
In the current version, ‘AIC’ and ‘BIC’ can only be calculated for the ‘glm’ model and ‘classification’ model_type.
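The interaction of instance_testing_size and instance_validation_size under the ‘training-validation’ splitting type can be illustrated with a short sketch. The helper name `split_temporal_units` and the rounding of float proportions are assumptions for illustration. The sketch also ignores any gap the library may leave between the sets to account for the forecast horizon.

```python
def split_temporal_units(units, testing_size, validation_size):
    """Split sorted temporal units into training, validation, and testing parts.

    Hypothetical helper: an int size is a number of units, a float size is a
    proportion (rounded to the nearest unit, which is an assumption).
    """
    units = sorted(units)

    def resolve(size, total):
        return size if isinstance(size, int) else round(size * total)

    # The testing part is taken first, from the end of the data.
    n_test = resolve(testing_size, len(units))
    remaining = units[:len(units) - n_test]
    testing = units[len(units) - n_test:]

    # The validation proportion applies to what is left after removing the test set.
    n_val = resolve(validation_size, len(remaining))
    training = remaining[:len(remaining) - n_val]
    validation = remaining[len(remaining) - n_val:]
    return training, validation, testing


weeks = list(range(1, 21))  # 20 temporal units
training, validation, testing = split_temporal_units(weeks, 0.2, 0.25)
print(len(training), len(validation), len(testing))
```

With 20 temporal units, a testing size of 0.2 reserves the last 4 units, and a validation size of 0.25 then takes 4 of the remaining 16 units, leaving 12 for training.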
Returns
# |
Output Name |
Output Description |
|---|---|---|
1
|
_
|
type: None
|
Example
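import pandas as pd
from stpredict import stpredict
# Load the temporal and spatial tables described under the data parameter
df1 = pd.read_csv('USA COVID-19 temporal data.csv')
df2 = pd.read_csv('USA COVID-19 spatial data.csv')
# Forecast 2 temporal units ahead using up to 2 units of history
stpredict(data = {'temporal_data':df1,'spatial_data':df2},
forecast_horizon = 2, history_length = 2,
models = ['knn'], model_type = 'regression')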
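A fuller call might combine several of the parameters documented above. The sketch below only assembles the keyword arguments, using option values taken from the parameter table. The variable name `config` is illustrative, and the final call is left commented out because it needs the stpredict package and the data files.

```python
# Keyword arguments built only from options listed in the Parameters table.
config = dict(
    forecast_horizon=2,
    history_length=2,
    # Grid search over n_neighbors for 'knn', plus 'glm' with default hyperparameters.
    models=[{"knn": {"n_neighbors": [5, 10], "metric": "minkowski"}}, "glm"],
    model_type="regression",
    splitting_type="training-validation",
    instance_testing_size=0.2,
    instance_validation_size=0.3,
    performance_benchmark="MAE",          # used to select the best configuration
    performance_measures=["MAE", "MAPE"], # reported in the performance tables
    verbose=1,
)

# With the data frames df1 and df2 loaded, the call would then be:
# stpredict(data={'temporal_data': df1, 'spatial_data': df2}, **config)
print(sorted(config))
```

Keeping the arguments in a dictionary makes it easy to rerun the same configuration with a different performance_benchmark or model list.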