YAML reference

Continual feature sets and models can be declaratively defined as YAML.

This document describes the options available when creating YAML files.

Feature set YAML reference

For reference, the general format of the file is as follows:

type: FeatureSet
name: <feature_set_name>
entity: [entity_name]
index: <column_name>
time_index: [column_name]
description: [description]
url: [url]
owners:
  - [owner@domain.com]
  ...
documentation:
    [documentation]
columns:
  - name: <feature_1_name>
    description: [feature_1_description]
    type: <feature_1_type>
        ...
exclude_columns:
  - [list_of_column_names]
profile:
  schedule: [schedule]
  ...
query: <sql_query>
table: <table_name>

type [Required]

Must be FeatureSet.

name [Required]

The name of this feature set. Allowed characters in the name are all alphanumeric characters and '_'.

See Feature Sets & Entities for more information on naming feature sets.

entity [Optional]

The name of the entity that the feature set is a part of. If not provided, this defaults to the name of the feature set. Allowed characters in the name are all alphanumeric characters and '_'.

See Feature Sets & Entities for more information on naming entities.

index [Required]

The name of the column that identifies a member of the entity (i.e. the ID). If you are not using time_index, this column should uniquely identify records in your feature set. If you are using time_index, the combination of index and time_index should uniquely identify records in your feature set.

time_index [Optional]

If using a temporal feature set, this is the name of the column that corresponds to the timestamp which, along with the index, uniquely identifies records in your feature set.
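
As an illustration of how index and time_index work together, here is a minimal sketch of a temporal feature set. The feature set, entity, column, and table names are all hypothetical and are not taken from Continual's documentation.

```yaml
# Hypothetical temporal feature set: one row per customer per day.
type: FeatureSet
name: customer_daily_activity        # hypothetical feature set name
entity: customer                     # hypothetical entity name
index: customer_id                   # identifies the member of the entity
time_index: activity_date            # together with index, identifies one record
table: analytics.customer_daily_activity   # hypothetical warehouse table
```

In this sketch, customer_id alone may repeat across days, but each (customer_id, activity_date) pair is expected to appear only once.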

description [Optional]

Short description of the feature set.

documentation [Optional]

Free-text documentation of the feature set. Used to help others understand how to use the feature set, the features contained therein, etc.

url [Optional]

URL related to the feature set.

owners [Optional]

Email addresses of owners. These should match their Continual accounts.

columns [Optional]

Columns are optional. If they are not provided, Continual infers types from the underlying data warehouse. If you wish to override types or provide descriptions for columns, you may include them in your YAML file.

name [Required]

The column name.

type [Optional]

The column data type. Supported options are:

  • Number
  • Text
  • Categorical
  • Boolean
  • Timestamp

description [Optional]

Short description of the column.

exclude_columns [Optional]

This is a list of columns in the resulting query or table to explicitly exclude from the feature set. This is mainly used if you wish to not explicitly list columns and instead let Continual infer them, but you also want to exclude a few columns from being used.
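
The fragment below sketches how columns and exclude_columns might be combined: two columns get explicit types or descriptions, every other column is left for Continual to infer, and two columns are dropped. All column names are hypothetical.

```yaml
columns:
  - name: plan_tier                  # hypothetical column
    type: Categorical                # override the inferred type
    description: Subscription tier chosen at signup
  - name: signup_ts                  # hypothetical column
    type: Timestamp
exclude_columns:                     # never used as features
  - internal_notes                   # hypothetical column
  - etl_batch_id                     # hypothetical column
```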

profile [Optional]

schedule [Optional]

Cron-based schedule which determines how often to profile the feature set. See, for example, Cron Manual Page.
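
For example, assuming Continual accepts a standard five-field cron expression (minute, hour, day of month, month, day of week), a daily profiling run could be sketched as follows. The time zone the schedule is evaluated in is not stated here, so treat the hour as an assumption.

```yaml
profile:
  # minute=0, hour=2, every day of month, every month, every day of week
  schedule: "0 2 * * *"              # quoted so YAML reads it as a plain string
```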

query [Optional]

A valid SQL query that will generate the feature set.

NOTE

Either query or table is required in your YAML file.
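
A multi-line query is easiest to write as a YAML literal block. The sketch below assumes a hypothetical sales.orders table and hypothetical column names; only the keys (type, name, index, query) come from the format described above.

```yaml
type: FeatureSet
name: customer_orders                # hypothetical feature set name
index: customer_id
query: |
  SELECT customer_id,
         COUNT(*)    AS order_count,
         SUM(amount) AS total_spend
  FROM sales.orders
  GROUP BY customer_id
```

The query returns one row per customer_id, so the index column uniquely identifies each record.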

table [Optional]

A valid table name in your data warehouse that will be used as the feature set.

NOTE

Either query or table is required in your YAML file.

Quoting table names

When using quoted table names, use a YAML literal block, or wrap your quoted table name in single or double quotes.

Example:

Invalid: table: "database"."schema"."table"

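Valid: table: '"database"."schema"."table"'

Valid (inside a YAML literal block the double quotes do not need to be escaped; do not put a # comment on the same line, because everything inside the block, including the comment, becomes part of the value):

table: |
  "database"."schema"."table"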

Model YAML reference

For reference, the general format of the file is as follows:

type: Model
name: <model_name>
index: <column_name>
time_index: [column_name]
target: <column_name>
split: <column_name>
description: [description]
url: [url]
owners:
  - [owner@domain.com]
  ...
documentation:
    [documentation]
columns:
  - name: <feature_1_name>
    description: [feature_1_description]
    type: <feature_1_type>
    entity: [entity_name]
  ...
exclude_columns:
  - [list_of_column_names]
train:
  schedule: [schedule]
  metric: [metric]
  included_model_types: [list_of_model_types]
  excluded_model_types: [list_of_model_types]
  algorithm_configs: [map_of_algorithm_name_to_configs]
    <algorithm_name>:
      hyperparameters: [list_of_experiment_configs]
        - experiment_suffix : <experiment_1_suffix>
          param1: <value_of_param1>
        ...
      ...
    ...
  size: [size]
  plots: [list_of_plots]
  exclude_ensemble: [True/False]
  optimization: [optimization]
  disable_data_checks: [True/False]
  disable_data_profiling: [True/False]
  disable_automl_feature_generation: [True/False]
  disable_metrics: [True/False]
  disable_plots: [True/False]
  disable_feature_importance: [True/False]
  disable_feature_timestamp_generation: [True/False]
  time_limit: [number_of_seconds]
  size_limit: [number_of_bytes]
predict:
  schedule: [schedule]
  incremental: [True/False]
promote:
  policy: [policy]
  ...
query: [sql]
table: [table_name]

type [Required]

Must be Model.

name [Required]

The name of this model. Allowed characters in the name are all alphanumeric characters and '_'.

See Models & Model Versions for more information on naming models.

index [Required]

The name of the column that identifies a member of the entity (i.e. the ID). If you are not using time_index, this column should uniquely identify records in your model spine. If you are using time_index, the combination of index and time_index should uniquely identify records in your model spine.

time_index [Optional]

If using a temporal model, this is the name of the column that corresponds to the timestamp which, along with the index, uniquely identifies records in your model.

target [Required]

The name of the column that contains the value you wish to predict.

split [Optional]

The name of the column that contains the user-defined split. Each value in this column must be one of TRAIN, TEST, or VALI. Rows with any other value are ignored and not used for training.
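
One way to produce such a column is to derive it in the model's query. The sketch below is only an illustration: the model name, the table, the columns, and the date cut-offs are all hypothetical.

```yaml
type: Model
name: customer_churn                 # hypothetical model name
index: customer_id
target: churned
split: split_label                   # column holding TRAIN, TEST, or VALI
query: |
  SELECT customer_id,
         churned,
         CASE
           WHEN signup_date < '2021-01-01' THEN 'TRAIN'
           WHEN signup_date < '2021-07-01' THEN 'VALI'
           ELSE 'TEST'
         END AS split_label
  FROM analytics.customers
```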

description [Optional]

Short description of the model.

documentation [Optional]

Free-text documentation of the model. Used to help others understand how to use the model, the features it relies on, etc.

url [Optional]

URL related to the model.

owners [Optional]

Email addresses of owners. These should match their Continual accounts.

columns [Optional]

Columns are optional. If they are not provided, Continual infers types from the underlying data warehouse. If you wish to link to external entities, override types, or provide descriptions for columns, you may include them in your YAML file.

name [Required]

The column name.

type [Optional]

The column data type. Supported options are:

  • Number
  • Text
  • Categorical
  • Boolean
  • Timestamp

entity [Optional]

The name of the entity that this column links to. That is, values in this column are mapped to the index column of the target entity.

description [Optional]

Short description of the column.
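
As a sketch of entity linking, the fragment below marks a hypothetical product_id column in the model spine as a reference to a hypothetical product entity, so that its values are matched against that entity's index column.

```yaml
columns:
  - name: product_id                 # hypothetical column in the model spine
    type: Categorical
    entity: product                  # hypothetical entity whose index holds product IDs
    description: Product purchased in this transaction
```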

exclude_columns [Optional]

This is a list of columns in the resulting query or table to explicitly exclude from the model. This is mainly used if you wish to not explicitly list columns and instead let Continual infer them, but you also want to exclude a few columns from being used.

train [Optional]

The training section allows you to configure Continual's AutoML Engine. This is optional configuration and Continual will use smart defaults if not provided.

schedule [Optional]

Cron-based schedule which determines how often to train the model. See, for example, Cron Manual Page.

metric [Optional]

The performance metric that AutoML will optimize for. Current options are:

included_model_types [Optional]

The list of model types to include when training a model. Available types are:

  • Supported models:
      - XGBoost
      - GBM
      - Catboost
      - FastAI
      - ExtremeTree
      - Linear
      - RandomForest
      - NeuralNet
      - KNN

  • Experimental models:
      - FASTTEXT
      - AG_TEXT_NN
      - TRANSF
      - LightGBMLarge

If nothing is entered, Continual runs the supported model types by default. You may optionally list model types to include (whitelist) or exclude (blacklist). Included model types take precedence: if included_model_types is set, only those model types are run. If it is not set, Continual runs every supported model type that is not on the excluded list.


excluded_model_types [Optional]

A list of model types to exclude from training. The available types are the same as for included_model_types.
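
The two documents below sketch the two approaches, using model type names from the list above. They are alternatives rather than one file; the choice of model types is arbitrary.

```yaml
# Alternative 1: run only the listed model types.
train:
  included_model_types:
    - XGBoost
    - RandomForest
---
# Alternative 2: run every supported model type except the listed ones.
train:
  excluded_model_types:
    - KNN
    - NeuralNet
```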

algorithm_configs [Optional]

A map of specific algorithm names (e.g. Linear, RandomForest, XGBoost) to configurations that modify their behavior. Currently supported configuration sections are:

  • hyperparameters

Note that if the included_model_types section is used, only the configurations for the algorithms listed in that section are used. Configurations for any algorithm not present in included_model_types, as well as for any algorithm present in excluded_model_types, are ignored.

Hyperparameters

Allows you to specify a list of experiment configurations. An experiment corresponds to running a certain algorithm with one set of hyperparameters. The example below shows how to set hyperparameters for 2 different experiments with the RandomForest algorithm.

train:
  algorithm_configs:
    RandomForest:
      hyperparameters:
        - criterion: entropy
          max_depth: 20
        - criterion: gini
          max_depth: 40
          n_estimators: 10

This train configuration results in a total of 3 experiments being run: the 2 shown above, as well as one with the default values for those hyperparameters, which are fixed by Continual for built-in algorithms and fixed by the user for model extensions.

By default, the names of the above experiments will show up as RandomForest_1 and RandomForest_2 in the Web UI. Users can modify the displayed experiment name by providing values for the experiment_suffix parameter for each experiment. For example, if you would like to include information about the actual parameter values in the experiment name itself, you may use the following experiment configurations.

train:
  algorithm_configs:
    RandomForest:
      hyperparameters:
        - experiment_suffix: ENTROPY_depth20
          criterion: entropy
          max_depth: 20
        - experiment_suffix: GINI_depth40_trees10
          criterion: gini
          max_depth: 40
          n_estimators: 10

Now the experiment names will be displayed as RandomForest_ENTROPY_depth20 and RandomForest_GINI_depth40_trees10 respectively for the above 2 experiments.

size [Optional]

The size of the trainer pod.

Available options are:

  • small
  • medium (default)
  • large
  • xlarge
  • xxlarge

If you receive out-of-memory errors during training, increase the size of the trainer pod and run the training again.

plots [Optional]

A list of optional plots to generate when training a model version. Some plots may be resource- or time-intensive to generate, so they are only generated when requested.

Available options are:

exclude_ensemble [Optional]

Whether or not to generate ensemble models during model training. Default: True.

optimization [Optional]

Types of optimization to run during model training.

Available options are:

disable_data_checks [Optional]

If set to true, Continual will not run basic checks on dataset validity, such as checking for duplicate or null-valued indices. See Data Checks for full details on what checks are performed on data. Default value is False.

disable_data_profiling [Optional]

If set to true, data profiling is disabled during training. Data analysis tools such as the correlation matrix, category scores, and time index data coverage will not be available. Default value is False.

disable_automl_feature_generation [Optional]

If set to true, automatic feature generation using AutoML will be disabled. Default value is False.

disable_metrics [Optional]

If set to true, performance metrics will not be collected or displayed for training experiments. Default value is False.

disable_plots [Optional]

If set to true, plots such as those showing residuals, Cook's distance, and prediction error will not be generated or shown as one of the model insights during training. Default value is False.

disable_feature_importance [Optional]

If set to true, feature importance will not be calculated and shown as one of the model insights. Default value is False.

disable_feature_timestamp_generation [Optional]

If set to true, timestamp feature generation will be disabled during training and batch prediction. Default value is False.
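
As a sketch, a training run that skips some of the optional steps described above could be configured as follows. Which flags to set is an arbitrary choice for illustration; any flag left out keeps its default of False.

```yaml
train:
  disable_data_profiling: True       # skip correlation matrix, category scores, etc.
  disable_plots: True                # skip residual and prediction error plots
  disable_feature_importance: True   # skip feature importance calculation
```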

time_limit [Optional]

The wall-clock time, in seconds, allowed for model training. Training is stopped after this limit regardless of model performance. Default value is 6 hours (21600 seconds).

size_limit [Optional]

The size limit in bytes for the dataset used for training, testing or validation. A warning is thrown if the memory usage of the data exceeds this value. See the Data Checks page for more details. Default value is 10 GB.
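
Because both limits are plain numbers, it can help to note the unit conversion in a comment. The values below are arbitrary examples, and the byte figure assumes 1 GB = 10^9 bytes, which may not match how Continual defines the 10 GB default.

```yaml
train:
  time_limit: 7200                   # 2 hours = 2 * 60 * 60 seconds
  size_limit: 5000000000             # 5 GB, assuming 1 GB = 10^9 bytes
```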

log_level [Optional]

The log level for logging within the training steps. Available options are one of the following strings:

  • DEBUG
  • INFO
  • WARNING
  • ERROR
  • CRITICAL
  • FATAL

predict [Optional]

schedule [Optional]

Cron-based schedule which determines how often to run a prediction job. See, for example, Cron Manual Page.

incremental [Optional]

Whether to run a full (False) or incremental (True) prediction. Default: False.
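
A weekly incremental prediction job could be sketched like this, again assuming a standard five-field cron expression. The schedule itself is an arbitrary example.

```yaml
predict:
  # minute=0, hour=6, any day of month, any month, day of week 1 (Monday)
  schedule: "0 6 * * 1"
  incremental: True                  # incremental rather than full prediction
```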

log_level [Optional]

The log level for logging within the batch prediction steps. Available options are one of the following strings:

  • DEBUG
  • INFO
  • WARNING
  • ERROR
  • CRITICAL
  • FATAL

promote [Optional]

policy [Optional]

Instructs Continual how to handle new model versions that are trained.

Available options are:

  • latest (default)
  • best
  • manual

latest will always promote a new model version, whereas best will only promote a new model version if the specified performance metric of the new model is better than that of the currently deployed version. If manual is specified, no automatic promotions occur and users must promote model versions manually after review.
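
For instance, to promote a newly trained model version only when it outperforms the currently deployed one, the promote section could be written as:

```yaml
promote:
  policy: best                       # one of: latest (default), best, manual
```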

query [Optional]

A valid SQL query that will generate the model spine. This should contain, at minimum, a column for index and target.

NOTE

Either query or table is required in your YAML file.

table [Optional]

A valid table name in your data warehouse that will be used as the model spine. This should contain, at minimum, a column for index and target.

NOTE

Either query or table is required in your YAML file.
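
Putting several of the sections above together, a small model definition might look like the sketch below. The model name, columns, table, owner address, and schedules are hypothetical, and the cron expressions assume the standard five-field format.

```yaml
type: Model
name: customer_churn                 # hypothetical model name
index: customer_id                   # hypothetical ID column in the spine
target: churned                      # hypothetical column to predict
description: Predicts whether a customer will churn
owners:
  - owner@domain.com
exclude_columns:
  - etl_batch_id                     # hypothetical column to leave out
train:
  schedule: "0 3 * * 0"              # 03:00 every Sunday
  size: large
predict:
  schedule: "0 6 * * *"              # 06:00 every day
promote:
  policy: best
table: analytics.churn_spine         # hypothetical table with index and target
```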

Quoting table names

When using quoted table names, use a YAML literal block, or wrap your quoted table name in single or double quotes.

Example:

Invalid: table: "database"."schema"."table"

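Valid: table: '"database"."schema"."table"'

Valid (inside a YAML literal block the double quotes do not need to be escaped; do not put a # comment on the same line, because everything inside the block, including the comment, becomes part of the value):

table: |
  "database"."schema"."table"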