Feature sets and entities¶
Feature sets and entities are the main objects users work with to collect, define, and organize features in Continual. Models can pull in data via entities and features to enrich training data sets. By using feature sets and entities, users can register data, and optionally transformations, in Continual so that others can readily reuse them. Continual remembers the relationships between features and entities and joins them appropriately whenever they are needed for training, taking care to avoid common ML errors like data leakage. Below, we'll discuss in more detail how features and entities work.
A feature set is one of the main objects in Continual. It describes a collection of features, as well as the underlying data associated with it (via a SQL query) and some metadata.
Feature sets must include an index and may optionally include a time index. If a feature set contains a time index, it is said to be temporal. Each row in a temporal feature set is uniquely identified by the combination of index and time index, whereas rows in a non-temporal feature set are uniquely identified by index alone. For example, one feature set might contain demographic information about customers. It would very likely have a column, customer_id or similar, that uniquely identifies each customer. However, a separate feature set, customer_transactions, would contain information on all transactions with customers. In this feature set, we would need the column customer_id (index) in combination with a transaction timestamp column (time index) to uniquely identify a transaction.
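The uniqueness rule above can be checked before registering data. The following is a minimal sketch in plain Python, not part of Continual itself; the helper `duplicate_keys`, the column name `ts`, and the sample rows are all hypothetical and exist only for illustration.

```python
from collections import Counter


def duplicate_keys(rows, index, time_index=None):
    """Return the key tuples that occur more than once in `rows`.

    Non-temporal feature sets are keyed by `index` alone;
    temporal feature sets are keyed by (`index`, `time_index`).
    """
    key_cols = [index] if time_index is None else [index, time_index]
    counts = Counter(tuple(row[c] for c in key_cols) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical non-temporal feature set: one row per customer.
demographics = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": 51},
]

# Hypothetical temporal feature set: many rows per customer.
transactions = [
    {"customer_id": 1, "ts": "2023-01-05", "amount": 20.0},
    {"customer_id": 1, "ts": "2023-02-11", "amount": 35.5},
    {"customer_id": 2, "ts": "2023-01-05", "amount": 12.0},
]

print(duplicate_keys(demographics, "customer_id"))        # []
print(duplicate_keys(transactions, "customer_id", "ts"))  # []
print(duplicate_keys(transactions, "customer_id"))        # [(1,)]
```

The last call shows why the transactions data needs a time index: customer_id alone does not identify a row.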
In the common use case of tabular data, you can view a feature set as a table or view in your data warehouse. Feature sets are used as inputs to machine learning models.
Feature set views¶
If a feature set is defined via a query, Continual will build a view in your feature store corresponding to the feature set definition. For BigQuery, this is located at <project_id>.feature_set<feature_set_id>. If users define the feature set via a table, then no view is created and Continual will use the table directly.
An entity comprises a group of one or more related feature sets. Feature sets within an entity are implicitly related by index. The intended use for entities is to collect data around common business objects, such as "customers", "products", and "sales". For example, I may have a customer entity with three feature sets, one of which is customer_ratings. Each of these contains a column, c_id, which is the index for each feature set. The ids in one feature set uniquely identify a customer in each of the three feature sets (i.e., if I am uniquely identified by c_id = 10001 in one feature set in an entity, then I am uniquely identified by c_id = 10001 in all feature sets in that entity).
Entities are important because they separate the task of registering data in the feature store from that of creating models. This allows users to collaborate on AI use cases, as data experts can quickly register and connect datasets in Continual. Elsewhere, users can create models on top of these entities without having to worry about the relationships between tables. To build a model, users simply need to provide a SQL query that defines the indices that belong to the training data set, a target to predict, and information about which columns link to entities in the system. Continual will automatically join the model with the other data in linked entities. Making sure your data is correctly joined can be complex and cumbersome, and Continual aims to make it easy and error-proof. See our documentation on models for more information.
It's important to note that the joins that Continual performs are time-consistent across any temporal feature sets. If your model definition contains a time index, Continual will select the most recent record of each index, relative to the time_index provided in the model definition. This join is performed on a record-by-record basis, not by relying on a global time.
It's also perfectly acceptable to mix temporal feature sets with non-temporal feature sets in the same entity.
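To make the record-by-record, time-consistent join concrete, here is a minimal sketch in plain Python. It is not Continual's implementation: the function names, the column names `c_id` and `ts`, the sample data, and the choice to treat a feature row stamped exactly at the model row's time as visible (`<=`) are all assumptions made for illustration.

```python
from datetime import date


def latest_as_of(feature_rows, index, time_index, key, as_of):
    """Most recent feature row for `key` with time_index <= as_of, else None."""
    candidates = [
        r for r in feature_rows
        if r[index] == key and r[time_index] <= as_of
    ]
    return max(candidates, key=lambda r: r[time_index], default=None)


def join_spine(spine, temporal_rows, static_rows, index="c_id", time_index="ts"):
    """Join each model row with static features and the latest temporal features."""
    static_by_key = {r[index]: r for r in static_rows}
    joined = []
    for row in spine:
        out = dict(row)
        # Non-temporal feature sets join on index alone.
        static = static_by_key.get(row[index], {})
        out.update({k: v for k, v in static.items() if k != index})
        # Temporal feature sets join per record, relative to this row's own time.
        match = latest_as_of(temporal_rows, index, time_index, row[index], row[time_index])
        if match is not None:
            out.update({k: v for k, v in match.items() if k not in (index, time_index)})
        joined.append(out)
    return joined


ratings = [  # hypothetical temporal feature set
    {"c_id": 10001, "ts": date(2023, 1, 1), "rating": 3},
    {"c_id": 10001, "ts": date(2023, 3, 1), "rating": 5},
]
demographics = [{"c_id": 10001, "region": "west"}]  # hypothetical non-temporal feature set
spine = [  # hypothetical model rows with a time index and a target
    {"c_id": 10001, "ts": date(2023, 2, 15), "churned": 0},
    {"c_id": 10001, "ts": date(2023, 3, 10), "churned": 1},
]

result = join_spine(spine, ratings, demographics)
print([r["rating"] for r in result])  # [3, 5]
```

The first model row only sees the January rating because the March rating did not exist yet at its timestamp. Using a single global cutoff time instead would leak the later rating into the earlier row.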
Feature set YAML definition¶
Users will most often interact with feature sets via the feature set YAML file. Below is a skeleton YAML file:
type: FeatureSet
name: # feature set name, e.g. customer_demographics
entity: # entity name, e.g. customers
index: # index, e.g. customer_date
time_index: # time index, e.g. date
description: # one line description
url: # reference url
owners:
  - # email@example.com
documentation: # extended documentation
columns:
  - name: # column name, e.g. customer_id
    description: # column description
    type: # column logical type
exclude_columns:
  - # excluded column name
profile:
  schedule: # profiling schedule
query: # sql query
table: # table name
See the YAML Reference for full information on constructing feature set YAMLs. The following sections are most important when constructing your feature set YAML file:
name¶
This is the name of the feature set. See Naming & Syntax for naming conventions.
entity¶
This is the entity that the feature set belongs to. It's important to place feature sets in the right entity so that the system is better able to leverage connected feature sets when training models. The entity is optional. If it is not provided, Continual will use the feature set name as the entity name.
query¶
The SQL query to run in the data warehouse to generate the feature set. Continual will manifest this into the feature store as a view on top of the source tables.
index & time_index¶
The column(s) in the feature set that uniquely identify each row. Index is required, time_index is optional.
columns¶
This section contains all fields in the source dataset that you wish to use as input into your model. List all desired features here. See the YAML Reference for the supported types.
Specifying columns is optional. Continual will automatically include any columns it finds in the resulting query in your feature set. Specifying columns may be desirable if you wish to override the column types inferred by Continual or to add descriptions to your columns.
exclude_columns¶
This is a list of columns in the resulting query to explicitly exclude from the feature set. It is mainly useful when you prefer to let Continual infer columns rather than list them explicitly, but still want to keep a few columns out of the feature set.
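The interaction between inferred columns, explicit column entries, and excluded columns can be sketched as follows. This is an illustrative model in plain Python, not Continual's actual logic; the function `resolve_columns` and the type names `NUMBER`, `TEXT`, and `CATEGORICAL` are hypothetical placeholders.

```python
def resolve_columns(inferred, columns=None, exclude_columns=None):
    """Combine inferred columns with explicit overrides and exclusions.

    inferred: dict mapping column name -> inferred type, in query order.
    columns: optional list of dicts with 'name' and optional 'type'/'description'.
    exclude_columns: optional list of column names to drop.
    """
    excluded = set(exclude_columns or [])
    overrides = {c["name"]: c for c in (columns or [])}
    resolved = []
    for name, inferred_type in inferred.items():
        if name in excluded:
            continue  # explicitly excluded from the feature set
        spec = overrides.get(name, {})
        resolved.append({
            "name": name,
            # An explicit type overrides the inferred one.
            "type": spec.get("type") or inferred_type,
            "description": spec.get("description", ""),
        })
    return resolved


# Hypothetical inference result for a query returning three columns.
inferred = {"customer_id": "NUMBER", "zip_code": "NUMBER", "internal_note": "TEXT"}

features = resolve_columns(
    inferred,
    columns=[{"name": "zip_code", "type": "CATEGORICAL", "description": "Postal code"}],
    exclude_columns=["internal_note"],
)
print([(f["name"], f["type"]) for f in features])
# [('customer_id', 'NUMBER'), ('zip_code', 'CATEGORICAL')]
```

Here only zip_code is listed explicitly, so every other column keeps its inferred type, and internal_note is dropped without having to enumerate the columns that remain.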
Working with feature sets¶
Naming and syntax¶
Feature set names can contain only alphanumeric characters (a-z, A-Z, 0-9) and underscores ("_").
Entity and feature set names must be unique within a project.
Entity names are optional when registering feature sets, as Continual will use the resource name as the entity name if none is explicitly given.
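As a rough illustration of these rules, the sketch below validates a name and applies the default entity name. It assumes that "alphanumeric" means ASCII letters and digits; the helper names are hypothetical and this is not Continual's own validation code.

```python
import re

# Assumption: names are one or more ASCII letters, digits, or underscores.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_name(name):
    """True if `name` uses only letters, digits, and underscores."""
    return NAME_PATTERN.fullmatch(name) is not None


def entity_for(feature_set):
    """Entity name for a feature set definition, defaulting to its own name."""
    return feature_set.get("entity") or feature_set["name"]


print(is_valid_name("customer_demographics"))   # True
print(is_valid_name("customer-demographics"))   # False
print(entity_for({"name": "customer_ratings"}))  # customer_ratings
print(entity_for({"name": "customer_ratings", "entity": "customers"}))  # customers
```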
Creating feature sets¶
Feature Sets can easily be created via the Web UI or CLI.
If using the CLI, run the following command on your YAML file (or on a directory containing multiple feature set YAML files):
continual push <my_feature_set>.yaml
You'll then get a response back that shows the plan of actions Continual will execute based on the resources in your YAML file(s).
For those who prefer to create feature sets via the Web UI, you can start the feature set wizard by clicking the blue plus (+) next to your avatar in the top right corner of the Web UI.
Creating a feature set requires 4 steps:
- Enter the SQL query that will generate data for your feature set. Click "Preview" to run the query and validate the returned data. Advance by clicking "Configure Feature Set" at the bottom of the page.
- Enter configuration details for the feature set. Name and index are required, and you may optionally provide a description, an entity, and a time index. Click "Configure Schema" at the bottom to advance.
- The next page will show all the columns in the table generated by the SQL query entered in step 1. Continual will automatically infer all types for you. On this screen you may provide descriptions for any of the columns, change types (if needed), or exclude columns from the feature set. When finished, click "Review Changes" at the bottom.
- Continual will provide a summary of the actions that it will take when creating this feature set. At the very least, you should see that a profile operation will occur for the new feature set. If the feature set is placed into an entity that has any existing models, you will also see that these models will be retrained, as the entity is receiving new data to use for training. Click "Submit Changes" at the bottom to accept the plan and have Continual begin executing the steps.
You can at any time click "View YAML" at the top of the screen to export the YAML of the feature set that you are creating.
Whenever a new feature set is created, Continual tracks all actions in the Changes section of the Web UI.
Editing feature sets¶
Editing feature sets is straightforward. Users simply need to modify their YAML file accordingly and re-push it into the system.
continual push <my_feature_set>.yaml
If there are any substantive changes to your feature set definition, Continual will reprocess it, as well as any affected models. If there are no changes, you'll get a corresponding message in the system.
In the Web UI, you can edit any feature set by navigating to the feature set overview and clicking the "Edit" button. This will open up the same feature set wizard that we used when creating the feature set. Modify the feature set as needed and re-submit the plan. Continual will process it and execute any changes.
Viewing feature sets¶
If using the CLI, you can quickly view feature sets in a project via:
continual feature-sets list
The CLI will print a table of feature sets in your current project.
You may also view additional information about any single feature set by issuing the following command:
continual feature-sets get <feature_set_id>
Each project contains its own feature store. All users granted access to the project will be able to view feature sets in the project. You can view all feature sets and features in a project in the Web UI.
By opening a feature set, users will be able to view all information corresponding to that feature set, including:
- Feature Set Data Graph: This is a visualization of downstream models affected by changes to this feature set.
- Time Index Data Coverage (temporal feature sets only): This shows how the time_index range of the feature sets overlaps with any model spines that use it.
- Schema Description: This shows all columns in the feature set and their types, as well as profile information for each column.
- Query: The query generating data for the feature set.
- Data Preview: This shows a preview of the table in the data warehouse.
- Documentation: User created documentation.
- Activity: All events on this feature set.
Deleting feature sets¶
You can delete feature sets via the CLI by issuing the following command.
continual feature-sets delete <my_feature_set_id>
In the Web UI, users with proper permissions can delete a feature set by opening the feature set and clicking the Delete button.
Currently, a project is limited to 1,000 feature sets.