impute
Description
Imputes the missing values of each covariate in the input data from the observed values of that covariate using the k-nearest neighbors (KNN) imputation method, and removes the spatial units for which all values of a covariate are missing. To impute a covariate, its values are first arranged as a data frame in which each row represents a spatial unit and each column represents a temporal unit (see Fig. 1). The KNN imputation method is then applied to this data frame to fill in the missing values of the covariate.
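The procedure described above can be sketched for a single covariate as follows. This is only an illustration, not the library's actual implementation: the helper name `knn_impute_covariate`, the use of scikit-learn's `KNNImputer`, and the choice of `n_neighbors=1` are assumptions made for the example, and the sketch assumes every temporal unit has at least one observed value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def knn_impute_covariate(data, covariate, n_neighbors=1):
    """Hypothetical helper: impute one covariate in wide (spatial x temporal) form."""
    # Arrange the covariate with one row per spatial unit and one column per temporal unit
    wide = data.pivot(index="spatial id level 1",
                      columns="temporal id level 1",
                      values=covariate)
    # Remove spatial units whose values are all missing for this covariate
    wide = wide.dropna(how="all")
    # Fill the remaining gaps from the most similar spatial units (rows)
    imputer = KNNImputer(n_neighbors=n_neighbors)
    filled = imputer.fit_transform(wide)
    return pd.DataFrame(filled, index=wide.index, columns=wide.columns)


# Toy data: unit B has one gap, unit D has no observed values at all
sample = pd.DataFrame({
    "spatial id level 1": ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3,
    "temporal id level 1": [1, 2, 3] * 4,
    "temporal covariate 1": [1.0, 2.0, 3.0,
                             1.0, np.nan, 3.0,
                             10.0, 20.0, 30.0,
                             np.nan, np.nan, np.nan],
})

wide_imputed = knn_impute_covariate(sample, "temporal covariate 1")
print(wide_imputed)
```

In this toy data, unit D is dropped because it has no observed values, and the gap of unit B is filled from unit A, its nearest neighbor on the temporal units both have observed.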
Usage
- preprocess.impute(data, column_identifier=None, verbose=0)
Parameters
| # | Input Name | Input Description |
|---|---|---|
| 1 | data | type: data frame<br>default: -<br>details: Data frame of temporal covariates, including missing values. The data must include the following columns, either named in the formats specified below or, if the columns have arbitrary names, identified through the column_identifier argument.<br>Spatial id: The id of the units in the finest spatial scale of the input data must be included in the data under the name ‘spatial id level 1’. The spatial id must have unique values.<br>Temporal ids: The id of the temporal units recorded in the input data for each temporal scale must be included as a separate column in the data, with a name in the format ‘temporal id level x’, where ‘x’ is the related temporal scale level, beginning with level 1 for the smallest scale. The temporal units can have any sortable format, such as year number or week number. The combination of the ids of these temporal scale levels should form a unique identifier. The integrated date and time format is also supported. When the integrated format is used, only the smallest temporal scale must be included in the data, under the column name ‘temporal id’. The expected format of each scale is shown in Table 2.<br>Temporal covariates: The temporal (time-dependent) covariates must be included in the data with names in the format ‘temporal covariate x’, where ‘x’ is the covariate number.<br>Target: The column of the target variable in the data must be named ‘target’. |
| 2 | column_identifier | type: dict or None<br>default: None<br>details: If the input data column names do not match the expected formats for temporal and spatial ids and covariates (i.e., ‘temporal id’, ‘temporal id level x’, ‘spatial id level x’, ‘temporal covariate x’, ‘target’), a dictionary must be passed to specify the content of each column.<br>Each key must be a string in one of the formats {‘temporal id’, ‘temporal id level x’, ‘spatial id level x’}.<br>The values of ‘temporal id level x’ and ‘spatial id level x’ must be the names of the columns containing the temporal or spatial ids at scale level x, respectively.<br>If the input data uses the integrated format for temporal ids, the name of the corresponding column must be specified with the key ‘temporal id’.<br>example: {'temporal id level 1': 'week', 'temporal id level 2': 'year', 'spatial id level 1': 'county_fips', 'spatial id level 2': 'state_fips'} |
| 3 | verbose | type: int<br>default: 0<br>details: The level of detail of the logging output.<br>available options:<br>0: no logging<br>1: only important information is logged<br>2: all details are logged |
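To make the meaning of column_identifier concrete, the sketch below shows how a dictionary of this shape relates arbitrary column names to the expected ones. The helper `rename_with_identifier` and the sample data are hypothetical and only illustrate the mapping; they are not part of stpredict, which accepts the dictionary directly.

```python
import pandas as pd


def rename_with_identifier(data, column_identifier):
    """Hypothetical helper: rename columns to the expected names."""
    if column_identifier is None:
        return data
    # The dictionary maps expected name -> actual column name, so invert it for rename()
    actual_to_expected = {actual: expected
                          for expected, actual in column_identifier.items()}
    return data.rename(columns=actual_to_expected)


# Sample data with arbitrary id column names (values are made up for illustration)
raw = pd.DataFrame({
    "county_fips": [1001, 1001, 1003, 1003],
    "state_fips": [1, 1, 1, 1],
    "year": [2020, 2020, 2020, 2020],
    "week": [1, 2, 1, 2],
    "temporal covariate 1": [0.5, None, 0.7, 0.9],
    "target": [3, 4, 5, 6],
})

column_identifier = {
    "temporal id level 1": "week",
    "temporal id level 2": "year",
    "spatial id level 1": "county_fips",
    "spatial id level 2": "state_fips",
}

standardized = rename_with_identifier(raw, column_identifier)
print(list(standardized.columns))
```

With the actual function, the same dictionary would be passed as `impute(data=raw, column_identifier=column_identifier)`, using the parameters documented above.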
Note
Gaps in the sequence of temporal id levels are not allowed. Specifically, if the input data contains the columns ‘temporal id level 1’, ‘temporal id level 2’, … , ‘temporal id level x’, ‘temporal id level x+2’, the column ‘temporal id level x+2’ is not used to identify the temporal units and is removed from the data.
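The rule in this note can be checked before calling the function. The following is a minimal sketch, assuming the column names follow the ‘temporal id level x’ format exactly; the helper `split_temporal_id_levels` is hypothetical and not part of the library.

```python
import re


def split_temporal_id_levels(columns):
    """Hypothetical helper: separate usable temporal id levels from those after a gap."""
    pattern = re.compile(r"^temporal id level (\d+)$")
    levels = {}
    for name in columns:
        match = pattern.match(name)
        if match:
            levels[int(match.group(1))] = name

    # Walk up from level 1 and stop at the first missing level
    used = []
    level = 1
    while level in levels:
        used.append(levels[level])
        level += 1

    # Any level found beyond the gap would not be used to identify temporal units
    ignored = [name for lvl, name in sorted(levels.items()) if lvl >= level]
    return used, ignored


used, ignored = split_temporal_id_levels([
    "spatial id level 1",
    "temporal id level 1",
    "temporal id level 2",
    "temporal id level 4",
    "target",
])
print(used)
print(ignored)
```

Here level 3 is missing, so ‘temporal id level 4’ falls in the ignored list, which corresponds to the column the note says is removed from the data.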
Returns
| # | Output Name | Output Description |
|---|---|---|
| 1 | imputed_data | type: data frame<br>details: The imputed data |
Example
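```python
import pandas as pd
from stpredict.preprocess import impute

# Load the temporal data, which may contain missing covariate values
df = pd.read_csv('data.csv')
# Fill the missing values of each covariate with KNN imputation
imputed_data = impute(data=df)
```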
Fig. 1 Imputation of missing values in temporal data