
Unleashing the Power of Feature Stores: How They Can Supercharge Your MLOps

Last Updated on July 17, 2023 by Editorial Team

Author(s): Natalia Koupanou

Originally published on Towards AI.

Discover the Benefits of Feature Stores for Streamlined and Efficient MLOps

Edited Photo by Joshua Sortino

If you’re interested in Machine Learning Operations (MLOps), you’ve probably heard about feature stores. But what exactly are they, and why are they so important? Having worked with feature stores for several years and helped my teams adopt them successfully, I had the honor of sharing my experiences and insights with fellow industry professionals at the MLOps Summit in London in November 2022, and I’d love to share more here. In this post, we’ll introduce feature stores, explain why they’re needed, highlight their benefits, and cover what to consider before buying or building one.

So, what is a feature store? Simply put, it’s a centralized repository of preprocessed data features used to train machine learning models. Feature stores let data scientists and machine learning engineers easily access and manage these features, rather than repeatedly preprocessing and re-engineering the data for each model. But why do we need feature stores? In the past, data scientists manually engineered features for each machine learning model, resulting in redundant work and wasted time. With the rise of big data and the increasing complexity of machine learning models, this process has become even more challenging. Feature stores streamline it and enable more efficient and scalable machine learning.

Maximizing ML Impact with MLOps

MLOps is the integration of Machine Learning (ML) into a product or business process. The key question is: why invest in and improve MLOps? The answer lies in operationalizing ML to unlock its value. MLOps is becoming an essential aspect of businesses and digital products, offering benefits like personalization, enhanced efficiency, real-time insights, and better customer experience. The more mature our operationalization of ML, the more value it brings to our business. For example, imagine we want to provide recommendations to website visitors. One-off recommendations based on static data samples become irrelevant over time. We could update recommendations more frequently using batch processing, but if a visitor’s interactions within a session make even weekly recommendations stale, we’d need to find relevant content on the fly. As our operational ML infrastructure and expertise grow, so do our capabilities, enabling us to deliver more value as a data science team.

Journey of unlocking ML value as ML maturity grows, using product recommendations as an example (diagram inspired by the Tecton feature store).

However, operationalizing ML can be challenging: a recent McKinsey Global Survey found that only 36% of participants had deployed an ML project beyond the pilot stage. One of the main reasons for this low success rate is the difficulty of managing the ML process, including data leakage and training-serving skew, duplicated feature-engineering effort, and serving models fast in production. A feature store is a fully managed, unified solution that can help alleviate these challenges by sharing and serving features at scale across the organization.

Supercharging MLOps with Feature Stores

Challenges solved by Feature Stores

Point-in-time correctness

Data leakage is a common mistake uncovered when reviewing data science code. It happens when we fail to account for changes in feature values over time, so information from the future leaks into training and the model cannot predict outcomes accurately in production. This is where a feature store comes in handy.

A feature store ensures point-in-time correctness by returning the latest stored value of a feature at or before a given timestamp. For instance, say we want to train a model that predicts whether a user will make a purchase on a website within a session. The illustration below shows two labels for two separate sessions of the same user; features, such as the number of items in the user’s basket, have different values at different points in time. By requesting data from the feature store based on one or more entity identifiers (e.g., user ID, session ID), the feature names, and the timestamps (e.g., T1, T2), we guarantee that no information from the future is used to create a prediction during training. This not only eliminates the risk of data leakage but also provides a reliable point-in-time lookup for more accurate predictions.

Point-in-time lookup for retrieving a feature vector from a feature store, using training data for predicting a user’s purchase conversion as an example (diagram inspired by Vertex AI Feature Store).
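As a concrete sketch of such a lookup, here is roughly how it might look with the open-source Feast feature store (one of the open-source options discussed later in this post). The entity, feature view, and feature names are hypothetical, and the snippet assumes a recent Feast version with a configured feature repository:

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # path to a Feast feature repository

# One row per training label: the entity key plus the label's timestamp
# (the T1, T2 of the illustration above).
entity_df = pd.DataFrame({
    "user_id": [1001, 1001],
    "event_timestamp": pd.to_datetime(
        ["2023-01-05 10:00:00", "2023-01-12 15:30:00"]
    ),
})

# Feast joins each row against the latest feature values at or before
# that row's timestamp, so no future information leaks into training.
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "session_stats:items_in_basket",
        "session_stats:session_duration_s",
    ],
).to_df()
```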

Consistency across development & production environments

When developing and deploying machine learning models, it’s crucial to avoid training-serving skew. This occurs when different code paths are used to generate features in development and in production, leading to discrepancies between training and serving data. It can result in incorrect model behavior and make backtesting a lengthy and frustrating process. So how can we reduce the risk of training-serving skew?

By using a feature store, we can ensure consistency across development and production environments. The same code, data sources, and pipelines are used for both training and serving, making backtesting much easier. With a feature store, batch and real-time data can be easily combined, and any changes in data can be quickly incorporated via a streaming flow. Hence, by fetching all features from the same feature store during both development and production, we avoid surprises during backtesting and reduce the likelihood of training-serving skew in our data.
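To make “defined once, used everywhere” concrete, here is a minimal, hypothetical sketch of a feature definition in Feast; the entity, data source, and field names are assumptions. The same definition then backs both the offline (training) retrieval shown earlier and the online (serving) retrieval shown later, which is what removes the duplicated code path:

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# The entity our features are keyed on.
user = Entity(name="user", join_keys=["user_id"])

# A single source of truth for the raw data behind the features.
session_stats_source = FileSource(
    path="data/session_stats.parquet",  # hypothetical batch source
    timestamp_field="event_timestamp",
)

# Defined once; the exact same definition serves training and production,
# avoiding training-serving skew from divergent feature code.
session_stats = FeatureView(
    name="session_stats",
    entities=[user],
    ttl=timedelta(days=90),
    schema=[
        Field(name="items_in_basket", dtype=Int64),
        Field(name="session_duration_s", dtype=Float32),
    ],
    source=session_stats_source,
)
```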

Reusability of features across various applications

One of the challenges that data science teams often face is duplicated effort in feature engineering. This happens when there is no centralized place to manage and fetch features: teams work in silos, and reusing features is not straightforward. There is also an overhead cost associated with maintaining many features. A feature store alleviates these challenges by allowing teams to easily share and reuse features across different applications.

With a central repository for managing and organizing features, duplicated effort can be avoided, making it faster to create, iterate, and deploy. By using a feature store across several projects, we can reduce the cost of ML applications and break down silos within the organization. Furthermore, feature stores enable data catalog and lineage practices and let us attach metadata to each feature, making features discoverable and reusable across different teams. This leads to faster deployment times, consistency in architecture and infrastructure, and a head start in the development of MVP projects or POCs. With a feature store, sharing is caring (and it also saves time and resources)!
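As an illustration of this sharing, feature store APIs typically let you group related features into a named, versioned bundle that other teams can fetch by name instead of re-engineering them. In Feast this is a FeatureService; the sketch below reuses the hypothetical feature view defined earlier:

```python
from feast import FeatureService

# A named, versioned bundle of features that any project can discover
# and fetch by name, rather than rebuilding the same features.
purchase_model_v1 = FeatureService(
    name="purchase_model_v1",
    features=[session_stats],  # the hypothetical FeatureView from above
)

# Another team can then retrieve the whole bundle by name, e.g.:
# store.get_historical_features(
#     entity_df=entity_df,
#     features=store.get_feature_service("purchase_model_v1"),
# )
```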

Serving models fast in production

As more and more applications require real-time capabilities, it’s becoming increasingly challenging to extract data from multiple sources and serve inferences quickly. Let’s face it, nobody likes waiting 20 seconds for a response! Fortunately, feature stores come with two storage options, online and offline, which are priced differently. Offline storage is great for training and batch predictions, while online storage is essential for real-time serving. With a feature store, we can retrieve features in milliseconds and scale our computational resources as needed. We can even write the logic that generates features but let the feature store handle serving them in real time. With a feature store, we can say goodbye to slow response times and hello to happy users!

High-level architecture of a feature store
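Continuing the hypothetical Feast sketch, online retrieval of the latest feature values for a single user is a short call served from the low-latency online store, suitable for use inside a real-time request handler:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Served from the online store in milliseconds; entity and feature
# names match the hypothetical definitions sketched earlier.
feature_vector = store.get_online_features(
    features=[
        "session_stats:items_in_basket",
        "session_stats:session_duration_s",
    ],
    entity_rows=[{"user_id": 1001}],
).to_dict()
```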

Easy and reliable management of features

Managing ML features can be a challenge, especially when it comes to reliability and ease of use. Applying version control to the code behind a feature store helps: we can roll back to previous versions of feature definitions, just as we do with source code on GitHub. Feature stores also make it possible to detect drift, since they track the distribution of imported feature values, which makes data monitoring much easier. We can set an expiry date on features, limiting cost and making data-retention management a breeze. Furthermore, we can control costs by setting limits on computational resources, such as quotas on the number of online serving nodes or the number of online serving requests per minute. With these capabilities, managing ML features becomes much more manageable and cost-effective.
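For instance, in the hypothetical Feast sketch above, the ttl on a FeatureView bounds how far back values are considered valid, and a scheduled materialization job controls what lands in the online store; quotas on serving nodes or requests per minute are typically configured on the vendor side (e.g., in Vertex AI) rather than in code:

```python
from datetime import datetime, timedelta
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Typically run on a schedule: load only the most recent window of
# feature values into the online store, keeping serving costs bounded.
store.materialize(
    start_date=datetime.utcnow() - timedelta(days=1),
    end_date=datetime.utcnow(),
)
```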

To Buy or to Build?

Whether your team needs a feature store depends on various factors, such as the number and complexity of ML applications in your technology roadmap, budget, team size, and expertise. Despite the many benefits of using a feature store, it’s crucial not to rush into it and instead take an agile approach that gradually proves its value in improving MLOps. For instance, a feature store can provide more significant returns to a large organization with multiple ML applications in production and globally distributed teams than to a startup with a small data team trying to deploy its first ML model. If your organization falls into the former category, then it’s worth considering buying or building a feature store.


Buy

When it comes to getting a feature store for your organization, you can either build one yourself or buy one from a company that specializes in such solutions. Several options are available. Tecton, founded by some of the people behind Uber’s Michelangelo, is one I have personally used and been generally pleased with. H2O, Databricks, Google Cloud’s Vertex AI, and Amazon Web Services’ SageMaker also offer feature store solutions. Buying a feature store can be a time- and cost-efficient option, as it is less complex than building in-house: you save the resources that would be needed to build, maintain, and further develop the feature store infrastructure, since the vendor takes care of these tasks and may even provide 24/7 support. You can therefore leave building a feature store to the experts and focus on the exciting work that aligns with your company’s vision!

Build

Companies like Airbnb, Uber, and Spotify had the resources and expertise to successfully build their own feature stores. Building a feature store internally has its own perks: it gives you complete control over the roadmap and ensures alignment with your company’s goals and requirements, and there is no vendor lock-in, which gives you the flexibility to tailor the solution to your specific needs. In addition, the availability of open-source feature stores like Hopsworks, Feast, and Feathr makes it easier to get started without incurring vendor costs. On the downside, building an internal feature store can be time-consuming and resource-intensive, and you may need specialized expertise in-house. Ultimately, the decision to build or buy depends on your company’s specific circumstances and priorities.

Putting it all together

In summary, a feature store can be a valuable solution for companies looking to address the challenges of operationalizing ML and unlock the full potential of their data. Its benefits include point-in-time correctness, consistency across environments, feature reusability, fast model serving, and easy feature management. When deciding whether to adopt one, consider factors such as the size of your organization and the number of ML applications in production. You can build a feature store in-house for complete control and flexibility, or buy one from a specialized vendor for cost-efficiency and convenience. Lastly, with the potential of GPT and other generative models to reshape feature stores, it’s an interesting time for the field, but it’s crucial to weigh all these factors before incorporating a feature store into your MLOps ecosystem.

Thanks for reading! If you’d like to stay updated with my latest articles, provide feedback or discuss further ML and AI, you can follow me on Medium or connect with me on LinkedIn.


