Data Checks¶
Overview¶
During the train step, Continual validates the input data that will be used to train models. Validation consists of a series of checks that ensure the data is valid and ready for training. The checks that Continual currently performs include:
- Checks for duplicate columns
- Checks for duplicate indices
- Checks for type mismatches between the query and data
- Size limit checks
- Null value checks
By default, these checks are enabled. They can, however, be disabled by setting the disable_data_checks flag in the feature set configuration YAML file.
train:
  disable_data_checks: True
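To illustrate how the flag is read, here is a minimal sketch that assumes the feature set configuration has already been parsed into a Python dictionary. The function name `should_run_data_checks` is hypothetical and is not part of Continual's API; only the `train` and `disable_data_checks` keys come from the configuration above.

```python
def should_run_data_checks(feature_set_config):
    """Return True unless train.disable_data_checks is set to a truthy value.

    Hypothetical helper: assumes the YAML was parsed into a plain dict.
    """
    train_section = feature_set_config.get("train") or {}
    # Checks are enabled by default, so a missing flag means "run them".
    return not train_section.get("disable_data_checks", False)


config = {"train": {"disable_data_checks": True}}
print(should_run_data_checks(config))
```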
List of Checks¶
Index or Time Index Present¶
Pass: If either an index or a time index is specified in the query. Fail: Error if an index or time index is specified by the query but is not present in the dataset.
No Duplicates in Index or Time Index¶
Pass: If there are no duplicates in the index or time index columns (if specified). Warn: Warning if there are duplicates in the index or time index (if specified).
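A check of this kind could look roughly like the sketch below. It is an illustration only, not Continual's implementation: the function name is hypothetical and the index column is assumed to be a plain Python list.

```python
from collections import Counter


def find_duplicate_index_values(index_values):
    """Return the index values that occur more than once, in first-seen order."""
    counts = Counter(index_values)
    return [value for value, n in counts.items() if n > 1]


index_column = [101, 102, 103, 102, 101]
duplicates = find_duplicate_index_values(index_column)
if duplicates:
    # Duplicates produce a warning rather than an error.
    print(f"WARNING: duplicate index values: {duplicates}")
```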
No Duplicate Column Names or Identical Columns¶
Pass: If there are no duplicate column names or columns with completely identical values. Fail: Error if there are duplicate column names. Warn: Warning if there are columns with completely identical values.
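The two outcomes of this check (error for repeated names, warning for identical values) can be sketched as follows. This is a hypothetical illustration: columns are assumed to be a list of `(name, values)` pairs so that repeated names can be represented, and `check_columns` is not a Continual function.

```python
from collections import Counter
from itertools import combinations


def check_columns(columns):
    """Return (errors, warnings) for a list of (name, values) pairs."""
    errors, warnings = [], []

    # Repeated column names are an error.
    name_counts = Counter(name for name, _ in columns)
    for name, n in name_counts.items():
        if n > 1:
            errors.append(f"duplicate column name: {name!r}")

    # Differently named columns with identical values only produce a warning.
    for (name_a, values_a), (name_b, values_b) in combinations(columns, 2):
        if name_a != name_b and values_a == values_b:
            warnings.append(f"columns {name_a!r} and {name_b!r} have identical values")

    return errors, warnings


errors, warnings = check_columns([("a", [1, 2]), ("b", [1, 2]), ("a", [3, 4])])
print(errors)
print(warnings)
```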
No Null Values in Index or Time Index¶
Pass: If there are no null values in the index or time index columns (if specified). Fail: Error if there are null values in the index or time index (if specified).
Target Column is not Excluded¶
Pass: If the target column is not excluded in the query. Fail: Error otherwise.
Query Features Exist in Dataset¶
Pass: If all features specified in the query exist in the dataset. Fail: Error if a feature in the query does not exist in the dataset.
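As a rough sketch, this check amounts to a set-membership test. The helper name `missing_features` and the example column names are hypothetical.

```python
def missing_features(query_features, dataset_columns):
    """Return the query features that are absent from the dataset, in query order."""
    available = set(dataset_columns)
    return [feature for feature in query_features if feature not in available]


missing = missing_features(["age", "income", "zip"], ["id", "age", "zip"])
if missing:
    # Any missing feature is an error.
    print(f"ERROR: features not found in dataset: {missing}")
```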
Column Types Match Query¶
Pass: If the column types, inferred on a best-effort basis, match those specified by the query. Fail: Error if a column specified as NUMBER does not have a numerical dtype. Warn: Warning if there are other type mismatches.
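The NUMBER case can be sketched in plain Python as shown below. This is an assumption-laden illustration rather than Continual's inference logic: columns are plain lists, `None` stands in for null, and the helper names are hypothetical. Only the `NUMBER` type name comes from the description above.

```python
from numbers import Real


def is_numeric_column(values):
    """True if every non-null value is a real number (booleans excluded).

    An empty or all-null column is treated as numeric in this sketch.
    """
    non_null = [v for v in values if v is not None]
    return all(isinstance(v, Real) and not isinstance(v, bool) for v in non_null)


def number_type_errors(declared_types, dataset):
    """Return an error message for each NUMBER column holding non-numeric data."""
    errors = []
    for column, declared in declared_types.items():
        if declared == "NUMBER" and not is_numeric_column(dataset[column]):
            errors.append(f"column {column!r} is declared NUMBER but is not numeric")
    return errors


dataset = {"price": [9.5, None, 12], "quantity": [1, "two", 3]}
print(number_type_errors({"price": "NUMBER", "quantity": "NUMBER"}, dataset))
```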
Not Too Many Null Features¶
Pass: If the proportion of null feature values is below a fixed threshold. Warn: Warning reporting the percentage of null values across feature columns.
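A minimal sketch of a null-fraction check follows. The threshold value of 0.5 is a placeholder chosen for illustration, since the actual threshold is not documented here, and the helper names are hypothetical. `None` and float NaN are both treated as null.

```python
import math

NULL_FRACTION_THRESHOLD = 0.5  # hypothetical value; the real threshold is not stated


def is_null(value):
    """Treat None and float NaN as null."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def null_fraction(values):
    """Fraction of null entries in a column (0.0 for an empty column)."""
    if not values:
        return 0.0
    return sum(is_null(v) for v in values) / len(values)


def null_warnings(dataset, threshold=NULL_FRACTION_THRESHOLD):
    """Map each column at or above the threshold to its null fraction."""
    flagged = {}
    for name, values in dataset.items():
        fraction = null_fraction(values)
        if fraction >= threshold:
            flagged[name] = fraction
    return flagged


features = {"a": [1, None, None, 4], "b": [1, 2, 3, 4]}
for name, fraction in null_warnings(features).items():
    print(f"WARNING: {name} is {fraction:.0%} null")
```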
Timestamp Ranges Have Sufficient Overlap¶
Pass: If the range of each column specified to have timestamps overlaps sufficiently with the spine column (time index). Warn: Warnings if the columns overlap too little.
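The description does not define "sufficiently", so the sketch below makes an assumption: overlap is measured as the fraction of the time index range that is covered by the timestamp column's own range. The function name and the example dates are hypothetical.

```python
from datetime import datetime


def overlap_fraction(time_index, timestamp_column):
    """Fraction of the time index range covered by the column's [min, max] range."""
    index_start, index_end = min(time_index), max(time_index)
    col_start, col_end = min(timestamp_column), max(timestamp_column)

    index_span = (index_end - index_start).total_seconds()
    if index_span == 0:
        # Degenerate time index: covered only if the single point is in range.
        return 1.0 if col_start <= index_start <= col_end else 0.0

    overlap = (min(index_end, col_end) - max(index_start, col_start)).total_seconds()
    return max(0.0, overlap) / index_span


time_index = [datetime(2023, 1, 1), datetime(2023, 1, 11)]
signup_dates = [datetime(2023, 1, 6), datetime(2023, 1, 21)]
print(overlap_fraction(time_index, signup_dates))
```

A caller would compare the returned fraction against whatever minimum it considers acceptable and emit a warning below that value.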
Dataset is not Too Large¶
Pass: If the memory usage of the dataset is below the size limit specified in the feature set configuration YAML. Warn: Warning if the dataset size exceeds the size limit.
Classes are Reasonably Balanced¶
Pass: If there are at least 2 unique non-null values in the column to be predicted (target). Fail: If the model is a classifier and there is only 1 unique non-null value in the target column.
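The rule above can be sketched as a count of distinct non-null target values. This is an illustration under stated assumptions: the target column is a plain list, `None` and float NaN count as null, and `check_target_classes` is a hypothetical name. The sketch also fails a classifier whose target has no non-null values at all, which the description does not address explicitly.

```python
import math


def is_null(value):
    """Treat None and float NaN as null."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def check_target_classes(target_values, is_classifier):
    """Return an error message, or None if the check passes."""
    classes = {v for v in target_values if not is_null(v)}
    if is_classifier and len(classes) < 2:
        return (
            f"target has {len(classes)} unique non-null value(s); "
            "a classifier needs at least 2"
        )
    return None


print(check_target_classes(["churned", None, "churned"], is_classifier=True))
```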