Feathr: LinkedIn’s feature store is now available on Azure

This blog post was co-authored by David Stein, Senior Staff Software Engineer, Jinghui Mo, Staff Software Engineer, and Hangfei Lin, Staff Software Engineer, all from the Feathr team.

Motivation for feature stores

With the advancement of AI and machine learning, companies are beginning to deploy complex machine learning pipelines in various applications such as recommender systems, fraud detection, and more. These complex systems typically require hundreds to thousands of features to support time-sensitive business applications, and the feature pipelines are maintained by different team members in different business groups.

In these machine learning systems, we see many problems that consume a lot of energy from machine learning engineers and data scientists, especially duplicated feature engineering, online-offline skew, and low-latency feature serving.

Figure 1: The problems that a feature store solves.

Duplicated feature engineering

  • In an organization, thousands of features are buried in different scripts and in different formats; they are not captured, organized, or preserved, and therefore cannot be reused and leveraged by teams other than the ones that created them.
  • Because feature engineering is so important to machine learning models and features cannot be shared, data scientists must duplicate their feature engineering efforts across teams.

Online-offline skew

  • For features, offline training and online inference typically require different data serving pipelines, and ensuring consistent features across these environments is expensive.
  • Teams are discouraged from using real-time data for inference because of the difficulty of serving the right data.
  • Providing a convenient way to ensure point-in-time correctness of the data is key to avoiding label leakage.
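
To make point-in-time correctness concrete, here is a minimal sketch in plain Python. It is an illustration of the idea only, not Feathr's implementation; the function name, the data layout, and the integer timestamps are all assumptions made for this example. Each training observation is joined only with the latest feature value that was known at or before the observation time, so values from the future never leak into training data.

```python
from bisect import bisect_right


def point_in_time_join(observations, feature_history):
    """Attach to each observation the latest feature value known at its timestamp.

    observations: list of (entity_id, event_time) tuples.
    feature_history: dict mapping entity_id -> list of (feature_time, value),
        sorted by feature_time in ascending order.
    """
    joined = []
    for entity_id, event_time in observations:
        history = feature_history.get(entity_id, [])
        times = [t for t, _ in history]
        # Index of the first feature row strictly after the event time.
        idx = bisect_right(times, event_time)
        # Use only rows at or before the event; None if nothing was known yet.
        value = history[idx - 1][1] if idx > 0 else None
        joined.append((entity_id, event_time, value))
    return joined


# Hypothetical data: timestamps are plain integers (for example, day numbers).
feature_history = {"user_1": [(1, 10.0), (5, 25.0), (9, 40.0)]}
observations = [("user_1", 0), ("user_1", 5), ("user_1", 7), ("user_2", 3)]
result = point_in_time_join(observations, feature_history)
```

The observation at time 7 receives the value written at time 5, never the one written at time 9; using the later value is exactly the kind of leakage a point-in-time join prevents.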

Serving features with low latency

  • For real-time applications, looking up features from a database for real-time inference, at high throughput and without hurting response latency, can be challenging.
  • Easy access to features with very low latency is critical in many machine learning scenarios, and optimizations are needed to combine the different REST API calls to features.

To solve these problems, the concept of a feature store was developed, so that:

  • Features are centralized in an organization and can be reused
  • Features can be served consistently between offline and online environments
  • Features can be served in real time with low latency

Introducing Feathr, a battle-tested feature store

Developing a feature store from scratch takes time, and it takes much more time to make it stable, scalable, and user-friendly. Feathr is the feature store that has been used and battle-tested in production at LinkedIn for over 6 years, serving the entire LinkedIn machine learning feature platform with thousands of features.

At Microsoft, the LinkedIn team and the Azure team worked very closely to open source Feathr, make it extensible, and build a native integration with Azure. It’s available in this GitHub repository and you can read more about Feathr on the LinkedIn Engineering Blog.

Some of the highlights of Feathr are:

  • Scalable with built-in optimizations. For example, based on an internal use case, Feathr can process billions of rows and PB-scale data with built-in optimizations like bloom filters and salted joins.
  • Extensive support for point-in-time joins and aggregations: Feathr has powerful built-in operators designed for feature stores, including time-based aggregations, sliding window joins, and lookup features, all with point-in-time correctness.
  • Highly customizable user-defined functions (UDFs) with native PySpark and Spark SQL support to shorten the learning curve for data scientists.
  • Pythonic APIs to access everything with a low learning curve, integrated with model building so data scientists can be productive from day one.
  • Rich type system, including support for embeddings for advanced machine learning and deep learning scenarios. One of the most common use cases is to build embeddings for customer profiles, and these embeddings can be reused in all machine learning applications across the organization.
  • Native cloud integration with a simplified and scalable architecture, presented in the next section.
  • Sharing and reusing features made easy: Feathr has a built-in feature registry, making it easy to share features across teams and increase team productivity.
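
The salted joins mentioned in the first highlight can be sketched in a few lines of plain Python. This is an illustrative sketch of the general technique, not Feathr's or Spark's implementation; the function name `salted_join`, the row layout, and the round-robin salt are assumptions made for this example. The idea is to append a salt to a heavily skewed join key so that its rows are spread over several partitions, while the smaller side of the join is replicated once per salt value so that every salted key still finds its match.

```python
from collections import defaultdict


def salted_join(large_rows, small_rows, num_salts):
    """Inner-join two lists of (key, value) rows using key salting.

    Returns the joined rows and the number of large-side rows per salted key.
    """
    # Replicate every small-side row once per salt value.
    small_index = defaultdict(list)
    for key, value in small_rows:
        for salt in range(num_salts):
            small_index[(key, salt)].append(value)

    # Spread large-side rows for the same key across salted partitions.
    # A round-robin salt keeps this sketch deterministic; a real job would
    # typically pick the salt at random.
    partitions = defaultdict(list)
    for i, (key, value) in enumerate(large_rows):
        partitions[(key, i % num_salts)].append(value)

    # Join each salted partition independently against the replicated side.
    joined = []
    for salted_key, large_values in partitions.items():
        for large_value in large_values:
            for small_value in small_index.get(salted_key, []):
                joined.append((salted_key[0], large_value, small_value))
    return joined, {k: len(v) for k, v in partitions.items()}


# Hypothetical skewed data: the key "hot" appears far more often than "cold".
large_rows = [("hot", i) for i in range(6)] + [("cold", 100)]
small_rows = [("hot", "H"), ("cold", "C")]
joined, partition_sizes = salted_join(large_rows, small_rows, num_salts=3)
```

The joined rows are the same as for a plain inner join, but the six "hot" rows are processed as three partitions of two rows each instead of a single partition of six. The cost is that the small side is replicated `num_salts` times.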

Feathr on Azure architecture

The high-level architecture diagram below shows how a user would interact with Feathr on Azure:

Figure 2: Feathr on Azure architecture.

  1. A data or machine learning engineer creates features using their favorite tools (like Pandas, Azure Machine Learning, Azure Databricks, and more). These features are ingested into offline stores, which can be either:

    • Azure SQL Database (including serverless) or an Azure Synapse dedicated SQL pool (formerly SQL DW).
    • Object storage, such as Azure Blob Storage, Azure Data Lake Store, and more. The format can be Parquet, Avro, or Delta Lake.

  2. The data or machine learning engineer can persist the feature definitions in a central registry, which is built with Azure Purview.
  3. The data or machine learning engineer can join the feature datasets in a point-in-time correct way, using the Feathr Python SDK with Spark engines such as Azure Synapse or Databricks.
  4. The data or machine learning engineer can materialize features into an online store, such as Azure Cache for Redis with active-active, enabling a multi-primary, multi-write architecture that ensures eventual consistency between clusters.
  5. Data scientists or machine learning engineers consume the offline features with their favorite machine learning libraries, such as scikit-learn, PyTorch, or TensorFlow, to train a model on their favorite machine learning platform, such as Azure Machine Learning, and then deploy the model in their preferred environment with services such as Azure Machine Learning endpoints.
  6. The back-end system makes a request to the deployed model, which makes a request to Azure Cache for Redis to get the online features with the Feathr Python SDK.
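
Steps 4 and 6 above can be pictured with the following minimal sketch. It deliberately does not use the Feathr SDK or Redis: the class `InMemoryOnlineStore` and its `materialize` and `lookup` methods are hypothetical names invented for this example, and a Python dictionary stands in for the online store. The sketch only shows the shape of the flow, in which the latest feature values per key are copied from offline rows into a key-value store and then fetched by key at inference time.

```python
class InMemoryOnlineStore:
    """Hypothetical stand-in for an online store such as a Redis cache."""

    def __init__(self):
        self._tables = {}

    def materialize(self, table, offline_rows, feature_names):
        """Copy the latest value of each feature per key into the online store."""
        latest = {}
        # Process rows in time order so later rows overwrite earlier ones.
        for row in sorted(offline_rows, key=lambda r: r["timestamp"]):
            latest[row["key"]] = {name: row[name] for name in feature_names}
        self._tables[table] = latest

    def lookup(self, table, key, feature_names):
        """Return feature values for one key; None for anything unknown."""
        row = self._tables.get(table, {}).get(key, {})
        return [row.get(name) for name in feature_names]


# Hypothetical offline feature rows, e.g. the output of a batch job.
offline_rows = [
    {"key": "user_1", "timestamp": 1, "avg_spend": 10.0, "num_orders": 2},
    {"key": "user_1", "timestamp": 5, "avg_spend": 12.5, "num_orders": 3},
    {"key": "user_2", "timestamp": 3, "avg_spend": 7.0, "num_orders": 1},
]

store = InMemoryOnlineStore()
store.materialize("user_features", offline_rows, ["avg_spend", "num_orders"])

# At inference time, the deployed model fetches features by key.
features = store.lookup("user_features", "user_1", ["avg_spend", "num_orders"])
missing = store.lookup("user_features", "user_9", ["avg_spend"])
```

Because the same offline rows feed both training and the online store, the model sees the same feature definitions in both environments, which is the consistency property described earlier.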

A sample notebook containing all of the above flows is in the Feathr repository for further reference.

Feathr has native integration with Azure and other cloud services. The following table shows these integrations:

Feathr component                   Cloud integrations
Offline store – object storage     Azure Blob Storage, Azure ADLS Gen2, AWS S3
Offline store – SQL                Azure SQL Database, Azure Synapse dedicated SQL pools (formerly SQL DW), Azure SQL in VM, Snowflake
Online store                       Azure Cache for Redis
Feature registry                   Azure Purview
Compute engine                     Azure Synapse Spark pools, Databricks
Machine learning platform          Azure Machine Learning, Jupyter Notebook
File format                        Parquet, ORC, Avro, Delta Lake

Table 1: Feathr on Azure integration with Azure services.

Installation and first steps

Feathr has a Python interface for accessing all Feathr components, including feature definition and cloud interactions, and it is open source. The Feathr Python client can be easily installed with pip:

pip install -U feathr

For more details on how to get started, see the Feathr Quick Start Guide. The Feathr team can also be reached in the Feathr community.

Going forward

In this blog, we featured a battle-tested feature store called Feathr that is scalable, enterprise-ready, and includes native Azure integrations. We’re committed to bringing more functionality to Feathr and Feathr on Azure integrations, and you’re welcome to provide feedback by reporting issues in the Feathr GitHub repository.
