train_validate
Description
Runs the training and validation process on preprocessed historical data to find the best model and its corresponding best configuration (best feature set and history length).
Usage
- predict.train_validate(data, feature_sets, forced_covariates=[], instance_validation_size=0.3, instance_testing_size=0, fold_total_number=5, instance_random_partitioning=False, forecast_horizon=1, models=['knn'], mixed_models=None, model_type='regression', splitting_type='training-validation', performance_measures=None, performance_benchmark=None, performance_mode='normal', feature_scaler=None, target_scaler=None, labels=None, performance_report=True, save_predictions=True, verbose=0)
Parameters
| # | Input Name | Input Description |
|---|---|---|
1
|
data
|
type: list<dataframe> or list<string>
default: -
details:
list<dataframe>: a list of preprocessed data frames for history
length=1, 2, …, max history length, sorted based on the data frame
history length.
list<string>: a list of addresses of the preprocessed data frames for
history length=1, 2, …, max history length, sorted based on the data
frame history length.
The preprocessed data frames must have a column name format
conforming to Fig. 5.
|
2
|
feature_sets
|
type: dict
default: {‘covariate’: ‘mRMR’}
details:
The key specifies whether the ranking used for feature selection is
performed on the covariates (‘covariate’) or on all features, that is,
the covariates and their historical values (‘feature’). The value
specifies the ranking method, one of the supported methods
‘correlation’, ‘mRMR’, and ‘variance’.
If ‘correlation’, covariates (or features) are ranked based on their
correlation with the target variable. If ‘mRMR’, the ranking is
conducted with the mRMR method, in which the correlation between
the features themselves also affects the rank.
If ‘variance’, the variance-based sensitivity analysis method is
used, in which covariates (or features) are prioritized based on the
fraction of target variable variance that can be attributed to their
variance.
The candidate feature sets are obtained by slicing the ranked list
of covariates or features from the first item to item number n, where
n varies from 1 to the total number of list items. If the items are
covariates, the covariates in each resulting subset, along with their
corresponding historical values, form a candidate feature set in the
feature selection process.
dict options: {‘covariate’: ‘correlation’}, {‘covariate’: ‘mRMR’},
{‘covariate’: ‘variance’}, {‘feature’: ‘correlation’},
{‘feature’: ‘mRMR’}, {‘feature’: ‘variance’}
example:
{‘covariate’: ‘correlation’}
|
3
|
forced_covariates
|
type: list<string>
default: []
details: a list of covariates that are always included in the
feature sets, together with their historical values.
example:
[‘temperature’, ‘precipitation’]
|
4
|
instance_validation_size
|
type: int or float
default: 0.3
details: The number of temporal units to be considered as the
validation set.
If int, it is the number of temporal units to be included in the
validation set.
If float, it is the proportion of temporal units in the dataset to be
included in the validation split, and should therefore be between 0.0
and 1.0.
This input is considered only if the splitting_type input is
‘training-validation’.
Note that if instance_testing_size is greater than zero and
instance_validation_size is of type float, it is taken as the
proportion of the temporal units remaining after the testing set is
removed.
example: 7 or 0.3
|
5
|
instance_testing_size
|
type: int or float
default: 0
details: The number of temporal units to be considered as the
testing set.
If int, it is the number of temporal units to be included in the
testing set.
If float, it is the proportion of temporal units in the dataset to be
included in the testing split, and should therefore be between 0.0
and 1.0.
example: 7 or 0.3
|
6
|
fold_total_number
|
type: int
default: 5
details: The total number of folds in the cross-validation process.
This input is considered if the splitting_type input is set to
‘cross-validation’.
|
7
|
instance_random_partitioning
|
type: bool
default: False
details: The type of data partitioning in the validation process.
If False, the validation part is taken from the last recorded
temporal units. If True, the temporal units of the validation part are
selected randomly from the recorded temporal units.
This input is considered if the splitting_type input is
‘training-validation’.
Warning: Except for educational purposes, this should always be set to
False.
|
8
|
forecast_horizon
|
type: int
default: 1
details: The forecast horizon, used to determine the gap in the data
splitting process.
The gap is the number of temporal units excluded from the data to
simulate a real prediction situation, in which the information of the
forecast_horizon - 1 units before the time point of the target
variable is not available.
|
9
|
models
|
type: list<string, dict, or function>
default: [‘knn’]
details: a list of predefined model names and user-defined model
functions. The supported options are:
‘knn’: K nearest neighbour regressor for model_type of regression, K
nearest neighbour classifier for model_type of classification.
‘gbm’: Gradient boosting regression for model_type of regression,
Gradient boosting classification for model_type of classification.
‘glm’: Elastic net linear regression for model_type of regression,
logistic regression for model_type of classification.
‘nn’: Neural network with adjustable number of layers and neurons for
model_type of regression and classification.
[callable]: a user-defined function that accepts the data frames of
features in the training and validation set and a target variable
values of the training set in a form of ndarray, and returns the
predictions on the training and validation set and the trained model
(See User defined model).
If the user prefers to specify the hyperparameters of the predefined
model, the related item in the list must be in a form of a dictionary
with the name of the model as the key and the dictionary of
hyperparameters as its value.
example: [{‘knn’: {‘n_neighbors’: 10, ‘metric’: ‘minkowski’}}, ‘nn’,
lstm()]
Note. Since the scikit-learn implementations are used for ‘knn’,
‘glm’, and ‘gbm’, all of the hyperparameters defined in that package
are supported. For the ‘nn’ model the TensorFlow implementation is
used, and the supported hyperparameters are:
[‘hidden_layers_neurons’, ‘hidden_layers_activations’,
‘output_activation’, ‘loss’, ‘optimizer’, ‘metrics’,
‘early_stopping_monitor’, ‘early_stopping_patience’, ‘batch_size’,
‘validation_split’, ‘epochs’]
Note that the value of the ‘hidden_layers_neurons’ hyperparameter must
be an ordered list of the number of neurons in each of the neural
network’s hidden layers. The value of ‘hidden_layers_activations’ is a
list of the activation functions of the hidden layers. The
‘output_activation’ hyperparameter is the activation function of the
output layer, and the supported values for the rest of the
hyperparameters are the same as in the TensorFlow package.
|
10
|
mixed_models
|
type: list<string, dict, or function>
default: []
details: A list of models to be considered as a mixed_model which
uses a combination of simple models’ predictions as input to predict
the target variable. The predictions of the models specified in the
models input will be used to feed the mixed models. The supported
values for the mixed_models input are exactly the same as the models
input.
example: [{‘knn’: {‘n_neighbors’: 10, ‘metric’: ‘minkowski’}}, ‘nn’,
lstm()]
|
11
|
model_type
|
type: {‘regression’, ‘classification’}
default: ‘regression’
details: type of prediction task.
|
12
|
splitting_type
|
type: {‘training-validation’, ‘cross-validation’}
default: ‘training-validation’
details: The method of splitting the input data frames.
If ‘training-validation’, the validation set is taken from the last
temporal units in the data, with a size of instance_validation_size
units. If instance_testing_size is greater than zero, the last part of
the data with a size of instance_testing_size is first set aside as
the testing set, and the validation part is then split from the
remaining data.
If ‘cross-validation’, the cross-validation method is performed
with fold_total_number as the total number of folds.
|
13
|
performance_measures
|
type: list<{‘MAE’, ‘MAPE’, ‘MASE’, ‘MSE’, ‘R2_score’, ‘AUC’,
‘AUPR’, ‘likelihood’, ‘AIC’, ‘BIC’}>
default: [‘MAPE’]
details: a list of performance measures to be included in the
performance report.
example: [‘MAE’, ‘MAPE’]
|
14
|
performance_benchmark
|
type: {‘MAE’, ‘MAPE’, ‘MASE’, ‘MSE’, ‘R2_score’, ‘AUC’, ‘AUPR’,
‘likelihood’, ‘AIC’, ‘BIC’}
default: ‘MAPE’
details: a performance measure which is used to select best_model,
best_history_length, and best_feature_or_covariate_set.
|
15
|
performance_mode
|
type: {‘normal’, ‘cumulative’, ‘moving_average+x’}
default: ‘normal’
details: The mode of the target variable based on which the
performance measures are calculated.
To use the moving average mode, the window of the moving average must
also be specified as x in the format ‘moving_average+x’.
|
16
|
feature_scaler
|
type: {‘logarithmic’, ‘normalize’, ‘standardize’, None}
default: None
details: Type of scaling the features in the input data set for
modeling.
|
17
|
target_scaler
|
type: {‘logarithmic’, ‘normalize’, ‘standardize’, None}
default: None
details: Type of scaling the target variable in the input data set
for modeling.
|
18
|
labels
|
type: list<any type> or None
default: None
details: The class labels of the target variable.
|
19
|
performance_report
|
type: bool
default: True
details: If True, a table reporting the models and their
corresponding errors (based on performance_measures) is saved in
‘.csv’ format in the sub directory ‘performance/validation process’
of the directory where the code is running.
Each row of the table represents the performance of a specific model
for the given history length and feature set, and each column
represents one of the user specified performance_measures.
|
20
|
save_predictions
|
type: bool
default: True
details:
If True, the predictions of the trained models with the best
configuration (best history length and feature set) for the training
and validation data are saved in ‘.csv’ format in the sub directory
‘predictions/validation process/’ of the directory where the code is
running.
The names of the saved csv files have a suffix of the format ‘T = x’,
where x is the total number of temporal units in the training and
validation sets.
Note.
If the splitting_type is set to ‘cross-validation’, only test set
predictions will be saved.
|
21
|
verbose
|
type: int
default: 0
details: The level of detail in the produced logging information.
Available options:
0: no logging
1: only important information logging
|
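The slicing procedure described for feature_sets can be sketched as follows. This is a minimal illustration and not the package's implementation: the function name candidate_feature_sets and the (covariate, lag) tuple representation of a feature are hypothetical, and the ranked list is assumed to have already been produced by one of the ranking methods.

```python
def candidate_feature_sets(ranked_covariates, history_length):
    """Build nested candidate feature sets from a ranked list of covariates.

    Hypothetical helper: a feature is represented as a (covariate, lag)
    tuple, where lag 0 is the current value and lag k is the value k
    temporal units earlier.
    """
    candidates = []
    for n in range(1, len(ranked_covariates) + 1):
        # Slice the ranked list from the first item to item number n.
        selected = ranked_covariates[:n]
        # Each selected covariate enters together with its historical values.
        features = [(covariate, lag)
                    for covariate in selected
                    for lag in range(history_length)]
        candidates.append(features)
    return candidates


ranked = ['temperature', 'precipitation', 'humidity']
sets_h2 = candidate_feature_sets(ranked, history_length=2)
```

With three ranked covariates this yields three nested candidate sets, the first containing only the top-ranked covariate and its history, and the last containing all covariates.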
Note
In the current version, ‘AIC’ and ‘BIC’ can only be calculated for the ‘glm’ model and ‘classification’ model_type.
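The interplay of instance_testing_size, instance_validation_size, and forecast_horizon in the ‘training-validation’ splitting can be sketched as below. This is one reading of the parameter descriptions, not the package's code: the function name split_temporal_units is hypothetical, and placing the gap of forecast_horizon - 1 units directly before the validation set is an assumption.

```python
def split_temporal_units(units, validation_size=0.3, testing_size=0,
                         forecast_horizon=1):
    """Split chronologically ordered temporal units (hypothetical helper)."""
    units = list(units)

    # The testing set is taken first, from the last temporal units.
    if isinstance(testing_size, float):
        n_test = int(round(testing_size * len(units)))
    else:
        n_test = testing_size
    remaining = units[:len(units) - n_test]
    testing = units[len(units) - n_test:]

    # A float validation size is a proportion of the data that remains
    # after the testing set has been removed.
    if isinstance(validation_size, float):
        n_val = int(round(validation_size * len(remaining)))
    else:
        n_val = validation_size
    validation = remaining[len(remaining) - n_val:]

    # Assumed gap placement: drop forecast_horizon - 1 units between
    # the training and validation parts.
    gap = forecast_horizon - 1
    training = remaining[:max(len(remaining) - n_val - gap, 0)]
    return training, validation, testing


train_units, val_units, test_units = split_temporal_units(
    range(1, 21), validation_size=0.25, testing_size=4, forecast_horizon=3)
```

In this call the last 4 of 20 units form the testing set, 25% of the remaining 16 units form the validation set, and 2 units are dropped as the gap.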
Returns
| # | Output Name | Output Description |
|---|---|---|
1
|
best_model
|
type: string or function
details: best model based on performance_benchmark
If the best model is one of the predefined models, the name of the
model will be returned, otherwise, the user-defined function will be
returned.
example: ‘knn’ or lstm()
|
2
|
best_model_parameters
|
type: dict or None
details: a dict containing the name and values of the best model
hyperparameters. If no hyperparameter is specified for the model, None
is returned.
example: {‘n_neighbors’: 10, ‘metric’: ‘minkowski’}
|
3
|
best_history_length
|
type: int
details: The best history length, that is, the history length of the
data frame on which training the best model gives the best
performance.
|
4
|
best_feature_or_covariate_set
|
type: list<string>
details: The list of features or covariates for which the best
performance is obtained when the best model is trained on the
corresponding feature set.
|
5
|
best_model_base_models
|
type: list or None
details: If the selected best model is not a mixed model, None is
returned. If it is a mixed model, the list of models used as its base
models is returned.
example: [‘knn’, ‘nn’]
|
6
|
best_trained_model
|
type: object
details: The best model trained on the whole training data.
|
Example
from stpredict.predict import train_validate

# Unpack the six returned values; the parentheses allow the line breaks.
(best_model, best_model_parameters, best_history_length,
 best_feature_or_covariate_set, best_model_base_models,
 best_trained_model) = train_validate(
    data=['./historical_data h=1.csv', './historical_data h=2.csv',
          './historical_data h=3.csv'],
    feature_sets={'covariate': 'correlation'},
    forecast_horizon=4)
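
The models input also accepts user-defined functions. The sketch below shows what such a function might look like, based only on the description in the models row: it receives the training features, the validation features, and the training target values, and returns the training predictions, the validation predictions, and the trained model. The name mean_baseline, the argument names and their order, and the use of plain lists for the predictions are assumptions; the package's "User defined model" section is the authority on the exact interface.

```python
def mean_baseline(X_training, X_validation, Y_training):
    """Hypothetical user-defined model that always predicts the training mean."""
    # "Training" here only stores the mean of the training targets.
    mean_value = float(sum(Y_training)) / len(Y_training)
    trained_model = {'mean': mean_value}
    # One prediction per row of each feature set.
    train_predictions = [mean_value] * len(X_training)
    validation_predictions = [mean_value] * len(X_validation)
    return train_predictions, validation_predictions, trained_model


# Passing it alongside predefined models (requires the stpredict package):
# from stpredict.predict import train_validate
# results = train_validate(
#     data=['./historical_data h=1.csv', './historical_data h=2.csv'],
#     feature_sets={'covariate': 'mRMR'},
#     models=[{'knn': {'n_neighbors': 10}}, 'glm', mean_baseline])
```

A trivial baseline like this can be a useful reference point when comparing the predefined models in the performance report.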