Models and model versions¶
Models are the main objects in Continual that users work with to define and train predictive models. Models are defined very similarly to feature sets, but contain additional metadata that informs the Continual engine how to build and maintain the models and predictions. A new model version is created whenever a model is trained in the system.
In Continual, a Model refers to a predictive model. In other tools, "models" may be associated with a data model rather than a predictive model. Within the Continual context, model always corresponds to a machine learning model unless otherwise specified.
Models are constructed in Continual very similarly to feature sets. Users must specify an index and a target, and may optionally include a time_index. These attributes, along with a sql query that generates this data, form the core of a model definition, which is sometimes referred to as the model spine. Additionally, models can contain indices to other entities, which Continual will then join with the model when constructing the training data set. We typically recommend storing your features in feature sets and connecting your models to them via entity linking, but it's also possible to specify a list of columns in your model that represent additional features to bring into the model. Models that connect to no entities and bring in their own features via this method are referred to as standalone models. Whereas models that are linked to feature sets will be refreshed when those feature sets are updated, standalone models are not, so it is up to the user to modify the model definition accordingly.
As part of the continual aspect of the system, model refreshes can be automated by providing a schedule in the model definition. This defines how often Continual will retrain the model.
Feature sets within an entity are implicitly related via the common index. In complex use cases, you may wish to link to many entities in a model. This can easily be achieved in Continual by providing indices to external entities in your model definition and the corresponding sql query. In Continual, we refer to this as a linked entity, and users must explicitly declare these relationships during model creation. In this use case, the model query will contain an index for the model spine, any number of indices for external entities, a target, and optionally a timestamp in the case of a temporal model. For those with a data warehousing background, the idea of linked entities allows you to extend your model into that of a star schema, where the model definition is your fact table and the entities are dimension tables.
When building models, Continual will expand any linked entities it finds and incorporate the corresponding feature sets into the model as well. This naturally also applies to temporal feature sets, where these joins will be executed in a time-consistent manner. Returning to our customer churn example from before, let's also create a products entity with a product_price_history feature set, sharing the index product_id. We can then update our model definition to include a column for p_id and pull in our external entities like so:
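A sketch of what such an updated model definition might look like in YAML. The specific names here (customer_churn, ts, churned, and the query) are illustrative placeholders, not the exact names from the example:

```yaml
type: Model
name: customer_churn            # illustrative model name
index: customer_id
time_index: ts                  # makes this a temporal model
target: churned
columns:
  - name: p_id
    entity: products            # link this column to the products entity
query: select customer_id, ts, p_id, churned from my_schema.churn_spine
```

The entity field on the p_id column is what declares the link; Continual then joins in all feature sets belonging to the products entity automatically.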
The model is then able to build the training dataset by joining all the feature sets together with the model spine.
Joining feature sets¶
Continual automatically joins feature sets into the model spine when models include a linked entity. There is nothing extra that users need to do in order to join data together. However, it is sometimes helpful to understand how these joins happen when designing your model. All joins are executed as left joins into the model spine -- that is, the index & target from the spine will always be in your final training dataset, even if there are no matches in any included feature sets. Joins work slightly differently depending on whether or not the model includes a time_index.
In the terminology below, we consider a model with an index, an (optional) time_index, and indices to external linked entities (aka external indices). The external index to a linked entity in a model corresponds to the index of any feature set in the linked entity itself.
If the model has a time_index:
- Joining with a temporal feature set: The join matches the external index of the model spine to the index of the feature set AND the time_index of the model spine to the most recent time_index in the feature set (as of the time_index in the model spine) for each index value. This join is done on a row-by-row basis, using the model spine's time_index as the main timestamp.
- Joining with a non-temporal feature set: The join matches the external index of the model spine to the index of the feature set. Since the feature set does not have a time_index, there is no time-based join to calculate.
If the model does not have a time_index:
- Joining with a temporal feature set: This is similar to the above, except that since the model spine does not have a time_index, we use the current time as the time_index for each row. This excludes joining any data timestamped in the future, which likely indicates a data quality error. In particular: the join matches the external index of the model spine to the index of the feature set AND selects the most recent time_index in the feature set (as of the current time) for each index value.
- Joining with a non-temporal feature set: Neither the model spine nor the feature set has a time_index, so a straight left join between indices is performed.
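To make the temporal case concrete, here is a small sketch of the point-in-time ("as of") left join semantics using SQLite. This is illustrative only and is not the query Continual generates; the table and column names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Model spine: one row per (customer_id, ts) with a churn target.
# usage_history is a temporal feature set keyed by customer_id + ts.
cur.executescript("""
CREATE TABLE spine (customer_id TEXT, ts INTEGER, churned INTEGER);
CREATE TABLE usage_history (customer_id TEXT, ts INTEGER, logins INTEGER);
INSERT INTO spine VALUES ('a', 10, 0), ('a', 20, 1);
INSERT INTO usage_history VALUES ('a', 5, 3), ('a', 15, 7), ('a', 25, 9);
""")
# For each spine row, take the most recent feature row whose timestamp
# is at or before the spine's time_index (never data from the future).
rows = cur.execute("""
SELECT s.customer_id, s.ts, s.churned,
       (SELECT u.logins FROM usage_history u
         WHERE u.customer_id = s.customer_id AND u.ts <= s.ts
         ORDER BY u.ts DESC LIMIT 1) AS logins
FROM spine s
ORDER BY s.ts
""").fetchall()
print(rows)  # [('a', 10, 0, 3), ('a', 20, 1, 7)]
```

Note that the spine row at ts=20 picks up the feature value from ts=15, not the later value from ts=25, which keeps the training data time-consistent.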
Continual works on the assumption that your index (or index + time_index combination) is unique. If this is not the case, the joins will produce more rows than the user likely intends and inflate the size of your training set. We perform checks for duplicate indices during EDA of your model training/test/validation data sets to detect these issues. See data checks for more information.
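The fanout from a duplicated index is easy to demonstrate. In this SQLite sketch (illustrative names only), a single spine row joined against a feature set with a duplicated index yields two training rows instead of one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE spine (id TEXT, target INTEGER);
CREATE TABLE features (id TEXT, f REAL);
INSERT INTO spine VALUES ('a', 1);
INSERT INTO features VALUES ('a', 0.1), ('a', 0.2);  -- 'a' appears twice
""")
joined = cur.execute(
    "SELECT * FROM spine s LEFT JOIN features f ON s.id = f.id"
).fetchall()
print(len(joined))  # 2: the single spine row was duplicated by the join
```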
Working with splits¶
Continual's AutoML engine performs cross validation by default when a model is trained. This process splits your training data set into three disjoint subsets: training, validation, and test. The training set is used for the initial model training, the validation set for model optimization, and the test set is a holdout that can be used to see how the model performs on an independent set of data. Continual selects the winning model based on the performance on the validation data set.
By default, Continual will automatically split data in one of two ways:
- For models without a time_index, data will be split randomly: 80% into training, 10% into validation, and 10% into test.
- For models with a time_index, data will be split sequentially based on the time_index: the first 80% into training, the next 10% into validation, and the last 10% into test.
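As a sketch (not Continual's implementation), the sequential split partitions rows already ordered by time_index like this:

```python
# Rows are assumed to be sorted by time_index already.
rows = list(range(100))  # stand-in for 100 ordered training rows
n = len(rows)
train = rows[: int(n * 0.8)]          # earliest 80%
validation = rows[int(n * 0.8) : int(n * 0.9)]  # next 10%
test = rows[int(n * 0.9) :]           # most recent 10%
print(len(train), len(validation), len(test))  # 80 10 10
```

Splitting sequentially rather than randomly prevents the model from training on data that is more recent than its validation and test sets.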
Continual also allows users to specify a user-defined split. This must be a string column and contain the values "TRAIN", "TEST", and "VALI", for training, test, and validation datasets respectively. If you have values in your data set that you would like to use for non-training purposes (such as only for predictions), you can utilize the split field and simply not label those rows (or, label them anything other than "TRAIN", "TEST", or "VALI" -- a good convention is to label prediction-only data as "PREDICT"). This is an optional field. If not used, Continual will randomly determine the splits per the above.
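Conceptually, the split column routes each row, and anything not labeled TRAIN, TEST, or VALI is simply held out of training. A minimal sketch of that routing, with illustrative data:

```python
rows = [
    {"id": 1, "split": "TRAIN"},
    {"id": 2, "split": "VALI"},
    {"id": 3, "split": "TEST"},
    {"id": 4, "split": "PREDICT"},  # prediction-only row, never trained on
]
train = [r for r in rows if r["split"] == "TRAIN"]
held_out = [r for r in rows if r["split"] not in ("TRAIN", "VALI", "TEST")]
print(len(train), len(held_out))  # 1 1
```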
While some ML tools allow users to specify different files/tables for train/validation/test sets, we find that real production tables are never broken down this way and data is usually contained in just a single table. As such, we've designed our splitting interface to be optimized for working on a single table while giving users the flexibility of defining the split as needed. Your tables likely won't come with a split column, but one is easy to define in your model definition, like below:
```sql
select
  index,
  time_index,
  target,
  case
    when ... then "TRAIN"
    when ... then "VALI"
    when ... then "TEST"
    else "PREDICT"
  end as split
from my_table
```
If models are defined via a query, Continual will build a view in your feature store corresponding to the model definition. This is located at <project_id>.model_<model_id> (for BigQuery). If users define the model via a table, then no view is created and Continual will directly use the table.
A new model version is created every time a model is trained by Continual. The model version contains much of the information pertaining to the performance and interpretability of the winning experiment in the model version. Users can use the Web UI to better analyze a model version.
Model YAML definition¶
Users will often interact with models via their YAML files. Below is a skeleton YAML file:
```yaml
type: Model
name: <model_name>
index: <column_name>
time_index: [column_name]
target: <column_name>
split: <column_name>
description: [description]
url: [url]
owners:
  - [email@example.com]
  ...
documentation: [documentation]
columns:
  - name: <feature_1_name>
    description: [feature_1_description]
    type: <feature_1_type>
    entity: [entity_name]
exclude_columns:
  - [list_of_column_names]
train:
  schedule: [schedule]
  metric: [metric]
  included_model_types: [list_of_model_types]
  excluded_model_types: [list_of_model_types]
  size: [size]
  plots: [list_of_plots]
  exclude_ensemble: [True/False]
  optimization: [optimization]
  disable_data_checks: [True/False]
  time_limit: [number_of_seconds]
  size_limit: [number_of_bytes]
predict:
  schedule: [schedule]
  incremental: [True/False]
promote:
  policy: [policy]
  ...
query: [sql]
table: [table_name]
```
See the YAML Reference for full information on constructing models YAMLs. The following sections are most important when constructing your model YAML file:
name¶
This is the name of the model. See Naming & Syntax for naming conventions.
query¶
The sql query to run in the data warehouse to generate the model spine.
index & time_index¶
The column(s) in the model spine that uniquely identify each row. index is required; time_index is optional.
target¶
The column that contains the value you wish to predict.
columns¶
This section contains all fields in the source dataset that you wish to use as input into your model. List all desired features here. Supported types are:
The main reason to specify columns for models is to link a column to an external entity.
Specifying columns is optional. Continual will automatically include any columns it finds in the resulting query as features in your model. Specifying columns may be desirable if you wish to link a column to an external entity, override the types of the columns inferred by Continual, or add descriptions to your columns.
train¶
This section includes options for configuring the Continual AutoML engine. Full details on all the options can be found in the YAML Reference, but the most common use case is setting the schedule for your model retraining.
predict¶
This section allows you to set the schedule for your batch prediction job, as well as specify whether to do a full table prediction (the default) or an incremental prediction, which only predicts new rows created since the last prediction job.
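For example, a model that retrains nightly and incrementally predicts hourly might carry train and predict sections like this (the cron values are illustrative):

```yaml
train:
  schedule: 0 0 * * *    # retrain every day at midnight
predict:
  schedule: 0 * * * *    # batch-predict at the top of every hour
  incremental: True      # only score rows added since the last job
```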
Working with models¶
Naming and syntax¶
Model names can contain only alphanumeric characters (A-Z, a-z, 0-9) as well as "_".
Model names must be unique within a project.
Creating a model¶
Models can easily be created via the Web UI or CLI.
If using the CLI, simply execute the following command on your yaml file (or directory containing multiple yaml files):
```shell
continual push <my_model>.yaml
```
You will then get a response back that shows the plan of actions that Continual will execute based on the resources in your YAML file(s).
For those who prefer to create models via the Web UI, you can start the model creation wizard by clicking the blue plus (+) next to your avatar in the top right corner of the Web UI.
Creating a model requires five steps:
- Enter the SQL query that will generate data for your model spine. Click "Preview" to run the query and validate the returned data. Advance by clicking "Configure Model" at the bottom of the page.
- Enter configuration details for the model. Name, index, and target are required, and you may optionally wish to provide a description and a time index. Click "Define Schema" at the bottom to advance.
- The next page will show all the columns in the table generated by the sql query entered in step 1. Continual will automatically infer all types for you. On this screen you may provide descriptions for any of the columns, change types (if needed), link a column to an external entity, or exclude the column from being included in the model. When finished, click "Review Changes" on the bottom.
- The next page will show all options for scheduling and promoting your model, and for scheduling your batch prediction. Select the desired schedules. You may also select advanced configuration under "Training", where you can set advanced training options as needed. Click "Review Changes" to advance.
- Continual will provide a summary of the actions that it will take when creating this model. When a model is first created, the plan will include steps to train, promote, and predict. Since the model is new, it will run an initial training, promotion, and prediction to score all the existing data in the table. After that, the model and predictions will be refreshed based on the schedules assigned. Click "Submit Changes" on the bottom to accept the plan and have Continual begin executing the steps.
Note, you can at any time click "View YAML" at the top of the screen to export the YAML of the model that you are creating.
Whenever a new model is created, Continual tracks all actions in the Changes section of the Web UI.
Editing a model¶
Editing models is very straightforward. Users simply need to modify their YAML file with any desired updates and re-push it into the system:
```shell
continual push <my_model>.yaml
```
Continual will detect any changes to the yaml file and update the model accordingly. The system will then kick off a new training of the model.
In the Web UI, you can edit any model by navigating to the model overview and clicking the "Edit" button. This will open up the same model creation wizard that we used when creating the model. Modify the model as needed and re-submit the plan.
Viewing models¶
When using the CLI, users can quickly view all models in a project via:
```shell
continual models list
```
The CLI will print a table of models in your current project.
It's also possible to view additional information about a particular model with the following command:
```shell
continual models get <model_id>
```
All models in a project can be viewed from the "Models" tab in a project. This also gives a quick overview of the training and prediction history of the model, as well as the model health.
From there, users can dig down into specific model details. For each model, Continual displays a variety of information:
- Performance of the currently promoted model version. Metrics can be shown based on train/validation/test data sets.
- Training History: bar graph showing duration & status of previous trainings.
- Prediction History: bar graph showing duration & status of previous predictions.
- Historical performance: shows the performance metric of each trained model version and the promoted model version, over time. Users can choose from available metrics and time frames.
- Data Dependency Graph: displays all entities and feature sets currently connected to this model. You can switch between graph and table view.
- Time Index Data Coverage: displays the time_index range for all feature sets used by this model.
- Schema: displays all columns in the model spine definition and the summary stats for each column.
- Query: The query generating data for the model spine.
- Data Preview: a preview of the model spine table.
- Versions: This displays all model versions. The model versions themselves contain a lot of information, and we'll cover that in MLOps. At the table level, you can see the status of each model version, when it was created and last promoted, its duration, and the performance of the winning experiment. This allows you to quickly view differences between model versions on one page.
- Promotions: This displays all promotions in the model and the length of the promotion. Each promotion is tied to a model version.
- Batch Predictions: This displays all batch predictions and the duration of the job. Each batch prediction is tied to a model version.
- Activity: All events on this model.
Deleting a model¶
You can delete models via the CLI by issuing the following command.
```shell
continual models delete <my_model_id>
```
In the Web UI, users with proper permissions can delete a model by opening the model and clicking the Delete button, as shown below:
Training a model¶
Models will be trained in the system via the initial push:
```shell
continual push my_model.yaml
```
After the initial push, a model will be retrained when the model definition changes, or users can optionally force a retraining via:
```shell
continual push --force my_model.yaml
```
Forcing a retraining can be useful when the data in your data warehouse has been updated and you wish to retrain a model outside of its normal schedule, even though the model definition hasn't changed (this is typically a dev workflow).
From any model's overview page in the Web UI, users can kick off a new training session by clicking the "Train" button in the top right corner.
Training a model via scheduling¶
Over the lifetime of a model, it's expected that drift will naturally occur as data changes and the effectiveness of a model declines. To combat this, models need to periodically be retrained on fresh training data. One way to accomplish this in Continual is via scheduling.
You can specify a schedule in your model YAML definition. This instructs Continual how often to retrain the model. The syntax follows cron syntax. An example is below:
```yaml
type: Model
...
train:
  schedule: 0 0 * * *
```
In the above example, the model would be rebuilt every day at midnight.
In the Web UI, users may specify the training schedule in the Refresh Policies section of the Model Creation Wizard. You can select Manual, Daily, Weekly, Monthly, or select a custom schedule.
Working with model versions¶
Viewing model versions¶
Each time you train a model, a model version is created in Continual. The model version corresponds to the top performing model created in the experiment generated by the AutoML framework. Continual tracks all model versions, and it's a simple process to view and interact with them.
In the CLI you can view model versions in a project via the following command:
```shell
continual model-versions list
```
The CLI will print a list of your model versions for your current project:
You may additionally wish to filter on a specific model, via:
```shell
continual model-versions list --model <my_model_id>
```
You can also view more information about a specific model version via the following:
```shell
continual model-versions get <my_model_version_id>
```
Users can also view a list of model versions for any model by opening the model in the Web UI and navigating to the "Versions" tab. This will display a list of all model versions, as well as a graph showing the performance of the model over time.
Cancel a model version training¶
Users can cancel the training of a new model version very simply by issuing the following command in the CLI:
```shell
continual model-versions cancel <my_model_version_id>
```
In the Web UI, model versions can be cancelled from the model version list by simply clicking the cancel button:
Working with experiments¶
Each training of a model comprises one or more experiments that are run by Continual. Depending on your problem type, Continual will select one or more AutoML frameworks and then run multiple experiments through those frameworks. You can think of it as the system testing a number of different algorithms and parameters for those algorithms. The end result is a model version, i.e. the top performing model in your experiment run. Users are always able to view details of the experiment run to see exactly what happened during the training.
When using the CLI, you can view all experiments in a project with the following command:
```shell
continual experiments list
```
The CLI will print a list of experiments:
You may also wish to filter by model or model version:
```shell
continual experiments list --model <my_model_id> --model-version <my_model_version_id>
```
Users may also get more information about a specific experiment via the following command:
```shell
continual experiments get <my_experiment_id>
```
When using the Web UI, you can open up any model version from the model's "Version" tab to view the experiments.
The top box shows how the winning experiment compares to the currently promoted model version. It gives a quick comparison across metrics and a recommendation on whether or not this model version should be promoted (i.e., is this model version's performance better than what is currently deployed?).
Below that is a list of all experiments that were run, as well as the metrics for each run, the state of the experiment, and the training configuration. The purple bar is a graphical representation of the performance metric across experiments. In this example, log loss is used as the performance metric, and smaller log loss is better. As we go down the list, the purple bar gives us a nice visual way to understand how the various experiments compare: