
Data Checks

Overview

During the train step, Continual validates the input data that will be used to train models. The validation consists of a series of checks that confirm the data is valid and ready for training. The checks Continual currently performs include:

  • Checks for duplicate columns
  • Checks for duplicate indices
  • Checks for type mismatches between the query and data
  • Size limit checks
  • Null value checks

These checks are enabled by default. To disable them, set the disable_data_checks flag in the feature set configuration YAML file:

train:
  disable_data_checks: True

List of Checks

Index or Time Index Present

Pass: If either an index or a time index is specified in the query. Fail: Error if an index or time index is specified by the query but not present in the dataset.

No Duplicates in Index or Time Index

Pass: If there are no duplicates in the index or time index columns (if specified). Warn: Warnings if there are duplicates in the index or time index (if specified).
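The duplicate check can be illustrated with a minimal sketch in plain Python. It is not Continual's implementation: the function name, the sample rows, and the choice to treat the index and time index together as one composite key are all assumptions made for illustration.

```python
from collections import Counter

def find_duplicate_keys(rows, key_columns):
    """Return the key tuples that occur more than once in `rows`.

    Hypothetical helper: `rows` is a list of dicts and `key_columns`
    names the index and/or time index columns.
    """
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]

rows = [
    {"user_id": 1, "ts": "2023-01-01"},
    {"user_id": 2, "ts": "2023-01-01"},
    {"user_id": 1, "ts": "2023-01-01"},  # same key as the first row
]

duplicates = find_duplicate_keys(rows, ["user_id", "ts"])
if duplicates:
    # The check described above warns rather than fails.
    print(f"warning: duplicated index values: {duplicates}")
```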

No Duplicate Column Names or Identical Columns

Pass: If there are no duplicate column names or columns with completely identical values. Fail: Error if there are duplicated column names. Warn: Warning if there are columns with completely identical values.
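As a rough sketch of the two conditions in this check, the snippet below looks for repeated column names and for pairs of columns whose values are identical. The helper names and the dict-of-lists dataset layout are assumptions for illustration, not part of Continual.

```python
from collections import Counter
from itertools import combinations

def duplicate_names(column_names):
    """Return column names that appear more than once (would be an error)."""
    counts = Counter(column_names)
    return sorted(name for name, n in counts.items() if n > 1)

def identical_columns(columns):
    """Return pairs of column names whose value lists are equal (would be a warning).

    `columns` maps a column name to its list of values.
    """
    return [(a, b) for a, b in combinations(columns, 2) if columns[a] == columns[b]]

print(duplicate_names(["id", "age", "age"]))
print(identical_columns({"a": [1, 2], "b": [1, 2], "c": [2, 1]}))
```

Comparing every pair of columns is quadratic in the number of columns, which is acceptable for a sketch but may need hashing of column contents for wide datasets.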

No Null Values in Index or Time Index

Pass: If there are no null values in the index or time index columns (if specified). Fail: Errors if there are null values in the index or time index (if specified).

Target Column is not Excluded

Pass: If the target column is not excluded in the query. Fail: Error otherwise.

Query Features Exist in Dataset

Pass: If all features specified in the query exist in the dataset. Fail: Error if a feature in the query does not exist in the dataset.
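In essence, this check compares the feature names in the query against the dataset's columns. A minimal sketch, with hypothetical names and sample values:

```python
def missing_features(query_features, dataset_columns):
    """Return the query features that have no matching dataset column."""
    available = set(dataset_columns)
    return [name for name in query_features if name not in available]

missing = missing_features(["age", "income", "zip"], ["age", "income", "city"])
if missing:
    # The check described above reports this as an error.
    print(f"error: features not found in dataset: {missing}")
```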

Column Types Match Query

Pass: If a best attempt at inferring column types matches those specified by the query. Fail: Error if a column specified as a NUMBER is not a numerical dtype. Warn: Warnings if there are other type mismatches.
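The source does not describe how Continual infers column types, so the following is only a sketch of what a NUMBER check could look like. The helper name, the rule that booleans do not count as numbers, and the sample columns are assumptions.

```python
from numbers import Real

def is_numeric_column(values):
    """Return True if every non-null value is a real number.

    Assumptions in this sketch: None marks a null, booleans are not
    treated as numbers, and an all-null column is not considered numeric.
    """
    non_null = [v for v in values if v is not None]
    return bool(non_null) and all(
        isinstance(v, Real) and not isinstance(v, bool) for v in non_null
    )

columns = {
    "age": [31, 45, None],
    "income": ["52000", "61000", None],  # numbers stored as strings
}
declared_numbers = ["age", "income"]

mismatched = [name for name in declared_numbers if not is_numeric_column(columns[name])]
if mismatched:
    print(f"error: columns declared as NUMBER are not numeric: {mismatched}")
```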

Not Too Many Null Features

Pass: If null values in the feature columns stay below a fixed threshold. Warn: Warnings about the percentage of null values across feature columns.
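The idea of a null threshold can be sketched as follows. The source does not state the threshold value, so `NULL_THRESHOLD` below is a placeholder, and the helper name and sample data are likewise hypothetical.

```python
NULL_THRESHOLD = 0.5  # placeholder value; the actual threshold is not documented here

def null_fraction(values):
    """Return the share of values that are None or NaN."""
    if not values:
        return 0.0
    # `v != v` is True only for NaN, which never equals itself.
    nulls = sum(1 for v in values if v is None or v != v)
    return nulls / len(values)

features = {
    "age": [31, None, float("nan"), 45],
    "city": ["Oslo", "Lima", "Pune", None],
}

for name, values in features.items():
    fraction = null_fraction(values)
    if fraction >= NULL_THRESHOLD:
        print(f"warning: {name} is {fraction:.0%} null")
```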

Timestamp Ranges Have Sufficient Overlap

Pass: If the range of each column specified to have timestamps overlaps sufficiently with the spine column (time index). Warn: Warnings if the columns overlap too little.
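One way to quantify "sufficient overlap" is to measure how much of the time index range a timestamp column covers. This sketch assumes that definition; the function name, the `MIN_OVERLAP` value, and the sample timestamps are illustrative and not taken from Continual.

```python
from datetime import datetime

MIN_OVERLAP = 0.8  # placeholder value; the actual requirement is not documented here

def overlap_fraction(column, spine):
    """Return the fraction of the spine's time range also covered by `column`."""
    spine_start, spine_end = min(spine), max(spine)
    spine_span = (spine_end - spine_start).total_seconds()
    if spine_span == 0:
        return 0.0
    start = max(min(column), spine_start)
    end = min(max(column), spine_end)
    # A negative difference means the two ranges do not intersect at all.
    return max((end - start).total_seconds(), 0.0) / spine_span

spine = [datetime(2023, 1, 1), datetime(2023, 1, 11)]
signup_time = [datetime(2023, 1, 6), datetime(2023, 1, 21)]

fraction = overlap_fraction(signup_time, spine)
if fraction < MIN_OVERLAP:
    print(f"warning: column covers only {fraction:.0%} of the time index range")
```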

Dataset is not Too Large

Pass: If memory usage of dataset is below the size limit specified in the feature set configuration YAML. Warn: Warning if dataset size exceeds the size limit.

Classes are Reasonably Balanced

Pass: If there are at least 2 unique non-null values in the column to be predicted (target). Fail: If the model is a classifier and there is only 1 unique non-null value in the target column.
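The pass and fail conditions above can be sketched as a count of distinct non-null target values. The function names are hypothetical, and treating a target with no non-null values the same as a single-class target is an assumption, since the source only mentions the single-value case.

```python
def distinct_classes(target_values):
    """Return the set of unique non-null values in the target column."""
    return {v for v in target_values if v is not None}

def target_check_passes(target_values, is_classifier):
    """Return False when a classifier target has fewer than 2 distinct classes."""
    if not is_classifier:
        return True
    return len(distinct_classes(target_values)) >= 2

print(target_check_passes(["churn", "stay", None, "churn"], is_classifier=True))
print(target_check_passes(["churn", "churn", None], is_classifier=True))
```

Note that this condition only guards against a single-class target; it does not measure how evenly the classes are distributed.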
