make_neighbouring_data

Description Extracting new covariates from the data using spatial correlation. To extract a new covariate, for each spatial unit, the values of an existing covariate (spatial or temporal) are averaged in the neighboring spatial units of this unit. It can be repeated for several neighbouring layers, where each layer includes the neighbouring spatial units with a certain distance. The first layer contains adjacent neighbours, the second layer contains neighbours with a distance of one spatial unit, and so on for other layers.

Usage

preprocess.make_neighbouring_data(data, column_identifier=None, number_of_layers=1, neighbouring_matrix=None, time_dependency_flag=1, verbose=0)

Parameters

#

Input Name

Input Description

1





























data





























type: data frame or str
default: -
details: a data frame or address of a data frame containing
temporal or spatial covariates and the id of spatial (and temporal)
units.
The data must include the following columns with names in the
specified format in the description, or if the columns have arbitrary
names their content must be specified using the column_identifier
argument.


Spatial id:
The id of the units in the finest spatial scale of input data must be
included in the data with the name ‘spatial id level 1’. The spatial
id must have unique values.

Temporal ids:
If data is time-dependent (i.e., time_dependency_flag = 1), the id of
the temporal units recorded in the input data must be included in the
data with the column name ‘temporal id’. The temporal ids must have
unique values and a sortable format. The time dimension of data can
also be specified in multiple scales (e.g., hour, min, sec). In that case,
each scale must have a separate temporal id with a column name of
the format ‘temporal id level x’, where ‘x’ is the corresponding temporal
scale level, starting with level 1 for the smallest scale (e.g. sec). The
combination of these temporal scale levels’ ids should form a unique
identifier.

All the remaining columns are considered as covariates except those
with names of the format ‘spatial id level x’ or ‘temporal id level x’
where x is the level of the corresponding spatial or temporal scale.
2

















column_identifier

















type: dict or None
default: None
details: If the input data column names do not match the
specific format of temporal and spatial ids and covariates (i.e.,
‘temporal id’, ‘temporal id level x’, ‘spatial id level x’),
a dictionary must be passed to specify the content of each column.

Keys must be a string in one of the formats: {‘temporal id’,
‘temporal id level x’,’spatial id level x’}
The values of ‘temporal id level x’ and ‘spatial id level x’ should be
the names of the columns containing the temporal and spatial ids
in the x scale level respectively. If the input data has only one
temporal scale, the name of the column including the temporal ids
must be specified with the key ‘temporal id’.

example: {‘temporal id level 1’: ‘week’,’temporal id level 2’:
‘year’,’spatial id level 1’: ‘county_fips’, ‘spatial id level 2’:
‘state_fips’}
3










number_of_layers










type: int
default: 0
details: The number of neighbouring layers
Each neighbouring layer for a spatial unit includes neighbours with
a certain distance from this spatial unit. The first layer contains
adjacent neighbours, the second layer contains neighbours with a
distance of one spatial unit, and so on for other layers. For each
covariate, the average values of this covariate in the spatial units
included in a neighbouring layer are added to the data as a new
covariate. Therefore, neighbouring_layers = n, for each covariate,
adds n new covariates to the data.
4











neighbouring_matrix











type: numpy.ndarray or None
default: None
details: The adjacency matrix of spatial units
A two-dimensional binary array with dimensions equal to the number
of spatial units included in the data. The value of each element of this
array indicates the adjacency (value 1) or non-adjacency (value 0) of
two spatial units. Note that the order considered for the spatial units
in the rows and columns of the matrix should be based on their spatial
id order (numerical order for numeric ids or lexicographical order for
string ids).

example: numpy.array([[0,1,0],[1,0,1],[0,1,0]])
5




time_dependency_flag




type: int
default: 0
details: Time dependency of the data
0 if the data is time-dependent or 1 otherwise. In the second case,
data should be a single data frame or its address.
6





verbose





type: int
default: 0
details: The level of details in produced logging information
available options:
0: no logging
1: only important information logging

Returns

#

Output Name

Output Description

1





neighbouring_data





type: dataframe
details: a data frame including extracted covariates from
the neighbourhood. The name of the extracted covariates
have a special format, e.g. the covariate obtained by averaging the
‘temperature’ in spatial units of the first neighbouring layer has
the name ‘temperature_l1’.

Example

import pandas as pd
from stpredict.preprocess import make_neighbouring_data

df = pd.read_csv('USA COVID-19 temporal data.csv')


neighbouring_data = make_neighbouring_data(data = df, number_of_layers = 2,
                                             neighbouring_matrix = [[0,1,...,0],
                                                                    [1,0,...,1],
                                                                    ...,
                                                                    [0,1,...,0]])