
Feature sets and entities

Feature sets and entities are the main objects users work with to collect, define, and organize features in Continual. Models pull in data via entities and features to enrich training data sets. By using feature sets and entities, users can register data, and optionally transformations, in Continual so that others can readily reuse them. Continual remembers the relationships between features and entities and joins them appropriately whenever they are needed for training, taking care to avoid common ML errors such as data leakage. Below, we discuss in more detail how features and entities work.

Feature set

A feature set is one of the main objects in Continual. It describes a collection of features, as well as the underlying data associated with it (via a SQL query) and some metadata.

Feature sets must include an index and may optionally include a time index. If a feature set contains a time index, it is said to be temporal. Each row in a temporal feature set is uniquely identified by the combination of index and time index, whereas rows in a non-temporal feature set are uniquely identified by index alone. For example, a feature set called customer_account_info may contain demographic information about customers. It would very likely have a column called customer_id, or similar, that uniquely identifies each customer. A separate feature set, customer_transactions, would instead contain information on all transactions with customers. In this feature set, we would need the column customer_id (index) in combination with ts (time index) to uniquely identify a transaction.
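The uniqueness rule above can be checked on a sample of your data before you register it. The following is a minimal sketch in plain Python and is not part of Continual's API: the helper duplicate_keys and the sample rows are hypothetical, and the column names follow the customer_transactions example.

```python
from collections import Counter

def duplicate_keys(rows, index, time_index=None):
    """Return the key tuples that occur more than once.

    rows is a list of dicts. The key is (index,) for a non-temporal
    feature set and (index, time_index) for a temporal one.
    """
    key_columns = [index] if time_index is None else [index, time_index]
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)

# Hypothetical sample rows for customer_transactions.
transactions = [
    {"customer_id": 1, "ts": "2023-01-01", "amount": 20.0},
    {"customer_id": 1, "ts": "2023-01-05", "amount": 35.5},
    {"customer_id": 2, "ts": "2023-01-01", "amount": 12.0},
]

# customer_id alone repeats, so it cannot identify a transaction.
index_only = duplicate_keys(transactions, "customer_id")
# Adding the time index makes every row unique.
index_and_time = duplicate_keys(transactions, "customer_id", "ts")
```

Here index_only reports customer 1 as a repeated key, while index_and_time is empty, which is why customer_transactions needs both an index and a time index.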

In the common use case of tabular data, you can view a feature set as a table or view in your data warehouse. Feature sets are used as inputs to machine learning models.

Feature sets can be created either via the Web UI in Continual or by applying a YAML file via the CLI.

Feature set views

If a feature set is defined via a query, Continual will build a view in your feature store corresponding to the feature set definition. This is located at <feature_store>.<project_id>.featureset_<feature_set_id> (or <project_id>.feature_set<feature_set_id> for BigQuery). If the feature set is defined via a table, no view is created and Continual will use the provided table directly.

Entity

An entity comprises a group of one or more related feature sets. Feature sets within an entity are implicitly related by index. The intended use for entities is to collect data around common business objects, such as "customers", "products", and "sales". For example, I may have a customers entity with feature sets customer_transactions, customer_account_info, and customer_ratings. Each of these contains a column c_id (i.e., customer_id), which is the index for each feature set. An id identifies the same customer in all three feature sets: if I am uniquely identified by c_id = 10001 in one feature set in an entity, then I am uniquely identified by c_id = 10001 in all feature sets in that entity.

Feature sets in an entity all share the same index

Entities are important because they separate the task of registering data in the feature store from that of creating models. This allows users to collaborate on AI use cases, as data experts can quickly register and connect datasets in Continual. Elsewhere, users can create models on top of these entities without having to worry about the relationships between tables. To build a model, users simply need to provide a SQL query that defines the indices that belong to the training data set, a target to predict, and information about which columns link to entities in the system. Continual will automatically join the model with the other data in linked entities. Making sure your data is correctly joined can be complex and cumbersome; Continual handles these joins for you to reduce the risk of error. See our documentation on models for more information.

By linking our customer_churn model to the customers entity, Continual will pull all data in that entity when building training data for the model.

It's important to note that the joins that Continual performs are time-consistent across any temporal feature sets. If your model definition contains a time index, Continual will select the most recent record of each index, relative to the time_index provided in the model definition. This join is performed on a record-by-record basis, not by relying on a global time.

When models contain a time index, Continual will select the data that was most recent as of that index for each id provided.

It's also perfectly acceptable to mix temporal feature sets with non-temporal feature sets in the same entity.
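To make the record-by-record behavior concrete, the following plain-Python sketch joins each row of a model's training query (the spine) to one temporal and one non-temporal feature set. It illustrates the general point-in-time join technique and is not Continual's implementation: the function names and sample data are hypothetical, and treating a feature record whose timestamp equals the spine timestamp as eligible is an assumption.

```python
from bisect import bisect_right

def latest_as_of(history, as_of):
    """Return the most recent record with ts <= as_of, or None.

    history is a list of (ts, record) pairs sorted by ts.
    """
    timestamps = [ts for ts, _ in history]
    pos = bisect_right(timestamps, as_of)  # number of timestamps <= as_of
    return history[pos - 1][1] if pos else None

def build_training_rows(spine, temporal, static):
    """Join each spine row to features as of that row's own time index."""
    rows = []
    for row in spine:
        joined = dict(row)
        # Non-temporal features are matched on the index alone.
        joined.update(static.get(row["customer_id"], {}))
        # Temporal features use the latest record at or before row["ts"].
        record = latest_as_of(temporal.get(row["customer_id"], []), row["ts"])
        joined.update(record or {})
        rows.append(joined)
    return rows

# Hypothetical data; ISO date strings compare correctly as text.
spine = [
    {"customer_id": 1, "ts": "2023-02-01", "churned": False},
    {"customer_id": 1, "ts": "2023-04-01", "churned": True},
    {"customer_id": 2, "ts": "2023-01-15", "churned": False},
]
customer_ratings = {
    1: [("2023-01-10", {"rating": 4}), ("2023-03-20", {"rating": 2})],
    2: [("2023-02-01", {"rating": 5})],
}
customer_account_info = {1: {"region": "EU"}, 2: {"region": "US"}}

training_rows = build_training_rows(spine, customer_ratings, customer_account_info)
```

The two rows for customer 1 receive different ratings because each is resolved against its own timestamp. Customer 2 receives no rating at all, since the only rating on record was created after the spine timestamp and using it would leak future information into training.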

Feature set YAML definition

Users will most often interact with feature sets via the feature set YAML file. Below is a skeleton YAML file:

type: FeatureSet
name: # feature set name, e.g. customer_demographics
entity: # entity name, e.g. customers
index: # index, e.g. customer_date
time_index: # time index, e.g. date
description: # one-line description
url: # reference url
owners:
  -  # owner@company.com
documentation:
  # extended documentation
columns:
  - name: #column name, e.g. customer_id
    description: # column description
    type: # column logical type
exclude_columns:
  -  # excluded column name
profile:
  schedule: # profiling schedule
query: # sql query
table: # table name

See the YAML Reference for full information on constructing feature set YAMLs. The following sections are most important when constructing your feature set YAML file:

name

This is the name of the feature set. See Naming & Syntax for naming conventions.

entity

This is the entity that the feature set belongs to. It's important to place feature sets in the right entity so that the system can better leverage connected feature sets when training models. The entity is optional. If it is not provided, Continual will use the feature set name as the entity name.

query

The SQL query to run in the data warehouse to generate the feature set. Continual will materialize this in the feature store as a view on top of the source tables.

index & time_index

The column(s) in the feature set that uniquely identify each row. index is required; time_index is optional.

columns

This section contains all fields in the source dataset that you wish to use as input into your model. List all desired features here. Supported types are:

  • Number
  • Text
  • Categorical
  • Boolean
  • Timestamp

Note

Specifying columns is optional. Continual will automatically include in your feature set any columns it finds in the query result. Specifying columns may be desirable if you wish to override the column types inferred by Continual or add descriptions to your columns.

exclude_columns

This is a list of columns in the resulting query to explicitly exclude from the feature set. This is mainly used if you wish to not explicitly list columns and instead let Continual infer them, but you also want to exclude a few columns from being used.
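The interplay between inferred columns, explicit columns entries, and excluded columns can be pictured with the plain-Python sketch below. It models the behavior described above and is not Continual's code: resolve_schema and the sample columns are hypothetical, the type spellings are taken from the list of supported types (check the YAML Reference for the exact values the YAML expects), and letting an exclusion win over an explicit columns entry is an assumption.

```python
def resolve_schema(inferred, columns=None, exclude_columns=None):
    """Combine inferred column types with explicit YAML settings.

    inferred maps each column returned by the query to its inferred type.
    columns mirrors the YAML columns list (dicts with name and type).
    exclude_columns mirrors the YAML exclude_columns list.
    """
    excluded = set(exclude_columns or [])
    # Start from everything the query returns, minus explicit exclusions.
    schema = {name: t for name, t in inferred.items() if name not in excluded}
    for column in columns or []:
        # An explicit entry overrides the inferred type for that column.
        if column["name"] in schema and column.get("type"):
            schema[column["name"]] = column["type"]
    return schema

# Hypothetical query result with inferred types.
inferred = {
    "customer_id": "Number",
    "zip_code": "Number",
    "signup_ts": "Timestamp",
    "internal_note": "Text",
}

schema = resolve_schema(
    inferred,
    columns=[{"name": "zip_code", "type": "Categorical"}],
    exclude_columns=["internal_note"],
)
```

In this example only zip_code is listed explicitly, to override its inferred type, while internal_note is dropped and the remaining columns keep their inferred types.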

Working with feature sets

Naming and syntax

Feature set names can contain only alphanumeric characters (A-Z, a-z, 0-9) as well as "_".

Entity and feature set names must be unique within a project.

Entity names are optional when registering feature sets, as Continual will use the resource name as the entity name if none is explicitly given.
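As a quick local check of these rules, the sketch below validates a name and resolves the default entity. It is plain Python rather than anything provided by Continual: is_valid_name and effective_entity are hypothetical helpers, and any further constraints (such as length limits or rules about the first character) are not covered because they are not stated here.

```python
import re

# Letters, digits, and underscore only, per the naming rule above.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")

def is_valid_name(name):
    """True if name uses only alphanumeric characters and underscores."""
    return NAME_PATTERN.fullmatch(name) is not None

def effective_entity(feature_set_name, entity=None):
    """Entity defaults to the feature set name when none is given."""
    return entity if entity else feature_set_name

ok = is_valid_name("customer_transactions")
bad = is_valid_name("customer-transactions")
default_entity = effective_entity("customer_ratings")
explicit_entity = effective_entity("customer_ratings", "customers")
```

A hyphenated name such as customer-transactions fails the check, and a feature set registered without an entity ends up in an entity that carries its own name.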

Creating feature sets

Feature sets can be created via the Web UI or the CLI.

If using the CLI, execute the following command on your YAML file (or on a directory containing multiple feature set YAML files):

continual push <my_feature_set>.yaml

You'll then get a response back that shows the plan of actions Continual will execute based on the resources in your YAML file(s).

For those who prefer to create feature sets via the Web UI, you can start the feature set wizard by clicking the blue plus (+) next to your avatar in the top right corner of the Web UI.

Creating a feature set requires 4 steps:

  1. Enter the SQL query that will generate data for your feature set. Click "Preview" to run the query and validate the returned data. Advance by clicking "Configure Feature Set" at the bottom of the page.
  2. Enter configuration details for the feature set. Name and index are required, and you may optionally provide a description, an entity, and a time index. Click "Configure Schema" at the bottom to advance.
  3. The next page will show all the columns in the table generated by the SQL query entered in step 1. Continual will automatically infer all types for you. On this screen you may provide descriptions for any of the columns, change types (if needed), or exclude columns from the feature set. When finished, click "Review Changes" at the bottom.
  4. Continual will provide a summary of the actions that it will take when creating this feature set. At the very least, you should see that a create and a profile operation will occur for the new feature set. If the feature set is placed into an entity that has any existing models, you will also see that these models will be retrained, as the entity is receiving new data to use for training. Click "Submit Changes" at the bottom to accept the plan and have Continual begin executing the steps.

Note

You can at any time click "View YAML" at the top of the screen to export the YAML of the feature set that you are creating.

Whenever a new feature set is created, Continual tracks all actions in the Changes section of the Web UI.

Editing feature sets

To edit a feature set, modify its YAML file accordingly and re-push it into the system.

continual push <my_feature_set>.yaml

If there are any substantive changes to your feature set definition, Continual will reprocess it, as well as any affected models. If there are no changes, you'll get a corresponding message in the system.

In the Web UI, you can edit any feature set by navigating to the feature set overview and clicking the "Edit" button. This will open up the same feature set wizard that we used when creating the feature set. Modify the feature set as needed and re-submit the plan. Continual will process it and execute any changes.

Viewing feature sets

If using the CLI, you can quickly view feature sets in a project via:

continual feature-sets list

The CLI will print a table of feature sets in your current project.

You may also view additional information about any single feature set by issuing the following command:

continual feature-sets get <feature_set_id>

Each project contains its own feature store. All users granted access to the project will be able to view feature sets in the project. You can view all feature sets and features in a project by navigating to Feature Sets.

By opening a feature set, users will be able to view all information corresponding to that feature set, including:

  1. Overview:

    • Feature Set Data Model: This is a visualization of downstream models affected by changes to this feature set.
    • Time Index Data Coverage (temporal feature sets only): This shows how the time_index range of the feature sets overlaps with any model spines that use it.
    • Schema Description: This shows all columns in the feature set and their types, as well as profile information for each column.
    • Query: The query generating data for the feature set.
  2. Data Preview: This shows a preview of the table in the data warehouse.
  3. Documentation: User-created documentation.
  4. Activity: All events on this feature set.

Deleting feature sets

You can delete feature sets via the CLI by issuing the following command.

continual feature-sets delete <my_feature_set_id>

In the Web UI, users with proper permissions can delete a feature set by opening the feature set and clicking the Delete button.

Limitations

Currently, a project is limited to 1,000 feature sets.
