Automated feature creation and processing for machine learning

One of the essential tasks in machine learning is feature creation: the process of constructing new features from the existing data on which a machine learning model is trained. This step is as important as the choice of model, because an algorithm learns only from the data we give it. Creating features that are relevant to the task is therefore crucial for good results.


Usually, feature engineering is a tedious manual process that relies on domain knowledge, intuition, and extensive data manipulation. It is time-consuming because each new feature usually takes several steps to build, often using information from multiple tables.


We can broadly group the operations of feature creation into two categories:



A transformation acts on a single table by creating new features from one or more of the existing columns. An example would be computing the logarithm of annual income from a monthly salary column.


Aggregations are performed across tables: they group the observations of a related table and calculate statistics for each group. An example would be taking the balances of a client's mortgages and loans and calculating statistics on that client's financial liabilities.


Performing these operations repeatedly by hand on multi-table datasets is inefficient. Automated feature engineering helps the data scientist by automatically creating many candidate features from a dataset, from which the relevant ones can be selected and used for training.


One good way to do this is Featuretools, an open-source Python library that automatically creates many features from a set of related tables. It is based on a method known as “Deep Feature Synthesis (DFS).” DFS stacks multiple transformation and aggregation operations (called feature primitives) to create features from data spread across many tables.
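To make the two categories concrete, here is a minimal sketch in plain pandas (not Featuretools). The tables, column names and values are hypothetical and only mirror the salary and loan examples mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical parent table: one row per client.
clients = pd.DataFrame({
    "client_id": [1, 2, 3],
    "monthly_salary": [3000.0, 4500.0, 6000.0],
})

# Transformation: new columns derived from columns of the same table.
clients["annual_income"] = clients["monthly_salary"] * 12
clients["log_annual_income"] = np.log(clients["annual_income"])

# Hypothetical child table: many loans per client.
loans = pd.DataFrame({
    "client_id": [1, 1, 2, 3, 3, 3],
    "amount": [1000.0, 2500.0, 4000.0, 500.0, 700.0, 900.0],
})

# Aggregation: group the child table by the parent key and summarize it.
loan_stats = (
    loans.groupby("client_id")["amount"]
    .agg(loan_count="count", loan_total="sum", loan_mean="mean")
    .reset_index()
)

# Attach the aggregated statistics to the parent table.
features = clients.merge(loan_stats, on="client_id", how="left")
```

Each new feature takes a few lines here, and the number of combinations grows quickly with the number of tables and columns, which is the repetition that automated feature engineering is meant to remove.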


Let us first look at some fundamental concepts of Featuretools.


The two main concepts of Featuretools are Entities and EntitySets. An entity is simply a table. An EntitySet is a collection of tables and the relationships between them. The minimal input to DFS is a set of entities, a list of relationships, and the “target_entity” to calculate features for. The output of DFS is a feature matrix and the corresponding list of feature definitions. We show how it works with a prototype example on publicly available data, replicated in our environment.




In this dataset, there are three tables, or entities:



Users: unique users, each identified by a user ID, who share multiple posts. Each user has a zip code, a session count and a join date.


Posts: the posts shared by a user. Each post has a post ID.


Interactions: the likes, views and comments on each shared post. Each post has multiple interactions.


The tables are related through the user_id and post_id variables. We start by creating an empty EntitySet in Featuretools.



Now we add the entities. Each entity must have an index, which is a column whose values are all unique. The index of the users dataframe is user_id, and the index of the posts dataframe is post_id. The interactions dataframe, however, has no unique index. When we add this entity to the entity set, we need to pass the parameter make_index = True and specify a name for the new index.
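The index requirement can be checked in plain pandas before anything is handed to Featuretools. The sketch below uses a hypothetical toy interactions table and adds an integer key by hand, which is roughly what make_index asks the library to do; the column name interaction_id is an assumption.

```python
import pandas as pd

# Hypothetical toy version of the interactions table.
interactions = pd.DataFrame({
    "post_id": [10, 10, 11, 12],
    "likes": [3, 1, 0, 7],
})

# post_id repeats across rows, so it cannot serve as the index of this table.
post_id_is_unique = interactions["post_id"].is_unique

# Add a surrogate integer key with one distinct value per row.
interactions.insert(0, "interaction_id", range(len(interactions)))
new_index_is_unique = interactions["interaction_id"].is_unique
```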



Next, we specify how the entities are related. When two entities have a one-to-many relationship, we call the “one” entity the “parent entity.” In our dataset, for example, the users dataframe is the parent of the posts dataframe, which in turn is the parent of the interactions dataframe. To specify a relationship in Featuretools, we only need to name the variable that links the two tables. The users and posts tables are linked via the user_id variable, and the posts and interactions tables are linked via post_id.
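Whether two tables really form a parent and child pair can be verified with pandas alone. This is a minimal sketch on hypothetical toy data: `merge` with `validate="one_to_many"` raises an error if the key is not unique on the parent side, and an `isin` check finds child rows that point at no parent.

```python
import pandas as pd

# Hypothetical toy versions of the three tables.
users = pd.DataFrame({"user_id": [1, 2]})
posts = pd.DataFrame({"post_id": [10, 11, 12], "user_id": [1, 1, 2]})
interactions = pd.DataFrame({"post_id": [10, 10, 11, 12], "likes": [3, 1, 0, 7]})

# Raises pandas.errors.MergeError if the key is duplicated in the parent table.
users_posts = users.merge(posts, on="user_id", validate="one_to_many")
posts_interactions = posts.merge(interactions, on="post_id", validate="one_to_many")

# Child rows whose key has no matching parent row would show up as True here.
orphan_posts = ~posts["user_id"].isin(users["user_id"])
orphan_interactions = ~interactions["post_id"].isin(posts["post_id"])
```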



Feature creation operations are called feature primitives in Featuretools. These primitives can be used by themselves or combined to create features. To make features with specified primitives we use the ft.dfs function (short for deep feature synthesis). We pass in the entity set, the target_entity (the table to which we want to add the features), the selected trans_primitives (transformations), and the agg_primitives (aggregations). For example, we may be interested in when a user joined and in the number of interactions. Featuretools then builds features automatically by stacking these primitives.



We now have hundreds of new features to describe a user’s behavior. They were produced by combining, or stacking, multiple feature primitives, which is what deep feature synthesis means. The same call can be adapted to produce even more second-order features.
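Stacking can also be reproduced by hand, which shows what a second-order feature actually is. The pandas sketch below, on hypothetical toy data, first counts interactions per post and then averages that count per user. The final feature name imitates the style Featuretools uses for stacked features.

```python
import pandas as pd

# Hypothetical toy versions of the posts and interactions tables.
posts = pd.DataFrame({"post_id": [10, 11, 12], "user_id": [1, 1, 2]})
interactions = pd.DataFrame({"post_id": [10, 10, 11, 12], "likes": [3, 1, 0, 7]})

# Depth 1: aggregate interactions up to their parent post.
per_post = (
    interactions.groupby("post_id")
    .size()
    .rename("interaction_count")
    .reset_index()
)
posts_feat = posts.merge(per_post, on="post_id", how="left")
# Posts without any interaction get a count of zero.
posts_feat["interaction_count"] = posts_feat["interaction_count"].fillna(0)

# Depth 2: aggregate the per-post feature up to the parent user.
per_user = (
    posts_feat.groupby("user_id")["interaction_count"]
    .mean()
    .rename("MEAN(posts.COUNT(interactions))")
)
```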


Next steps

Automated feature engineering solves the problem of creating features but introduces another: we now have too many features, and not all of them are relevant to the task we want to train our model on. Having too many features can lead to poor model performance because the less useful features overwhelm those that are more important.

This is known as the curse of dimensionality. As the number of features increases, it becomes exponentially more difficult for a model to learn the mapping between features and targets. The problem is addressed with feature reduction and selection techniques such as Principal Component Analysis (PCA), SelectKBest, and autoencoders. A powerful and frequently used technique is LASSO (least absolute shrinkage and selection operator), a regression analysis method that performs both feature selection and regularization.
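As a small illustration of LASSO used for feature selection, the sketch below fits scikit-learn's `Lasso` on synthetic data in which only two of ten features carry signal. The data, the random seed and the penalty `alpha=0.1` are assumptions chosen for the example. On real data the features would normally be standardized and the penalty tuned, for instance by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 10 candidate features, only columns 0 and 3 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# The L1 penalty shrinks coefficients and sets the unhelpful ones to exactly zero.
model = Lasso(alpha=0.1)
model.fit(X, y)

# Features with a non-zero coefficient are the ones LASSO keeps.
selected = np.flatnonzero(model.coef_)
```

The indices in `selected` can then be used to subset a generated feature matrix before training the final model.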


A good data scientist can thus combine these two steps, automated feature creation and feature selection (through domain knowledge or feature reduction techniques), to achieve the desired results.
