Header image

Image credit: https://unsplash.com/photos/ZiQkhI7417A

Author: Kamui

Overview

Across the end-to-end machine learning lifecycle, the Feature Store is the bridge between data and models. It reduces duplicated feature engineering by storing and managing the datasets and data pipelines used in the ML process, enabling efficient feature data development and shortening model iteration cycles.

From ML-Ops to Feature-Ops

A standard machine learning system consists of three parts, "data", "model", and "code", which correspond to the three stages of "feature engineering", "model training", and "model deployment". These parts are interrelated and interdependent, and each carries important responsibilities in its own phase to fulfill the mission of the entire machine learning process.

data model code

With the rapid development of AI applications and their large-scale adoption in fields such as face recognition, advertising, search, and personalized recommendation, people have begun to pay attention to building foundational AI system capabilities. Major cloud platform vendors have successively launched general-purpose AI platforms to accelerate the "model training" and "model deployment" stages, such as AWS SageMaker, Google Vertex AI, and Alibaba PAI. We can collectively refer to this process and tooling as ML-Ops [1]:

the extension of the DevOps methodology to include Machine Learning and Data Science assets as first class citizens within the DevOps ecology.

With the popularization of AI platforms, the efficiency of "model training" and "model deployment" has been greatly improved, while "feature engineering", the initial step of the entire machine learning process, is still stuck with traditional data development workflows. In order to meet machine learning's various customized requirements for data development, the AI field has gradually begun to explore data development solutions for machine learning scenarios. Thus Feature-Ops, which extends ML-Ops, was born, and the industry has launched a series of systems oriented toward feature engineering, called Feature Stores, for example: Feast, Tecton, AWS SageMaker Feature Store, and Databricks Feature Store.

Feature Store Definition

The concept of the Feature Store was first proposed and articulated by Uber's Michelangelo platform [2] in 2017. It described the main purpose of the Feature Store as facilitating the registration, discovery, and reuse of features in the machine learning process, and ensuring the consistency of feature data between offline batch processing and online reads. It can provide high-performance, low-latency data services (for online prediction scenarios) as well as high-throughput, large-capacity data services (for offline training and batch prediction scenarios) for models to use.

A simple and standard Feature Store looks like this:

Feature Store

Music FeatureBox

Problem solved by FeatureBox

At Cloud Music, we built our self-developed Feature Store, Music FeatureBox, around the specific business problems of Cloud Music's algorithm scenarios. It is committed to addressing the following issues:

  • Feature discovery/governance/reuse: Without centralized management, different algorithm teams usually cannot reuse each other's feature data; feature engineering takes up a large share of algorithm engineers' time and wastes computing and storage resources. FeatureBox registers and centrally manages feature metadata to support feature discovery and governance, making features easier to reuse and accelerating feature engineering in the machine learning process.
  • High-performance feature storage and services: Feature data storage engines face completely different requirements in different scenarios (training/batch prediction requires good scalability and large storage capacity; real-time prediction requires low latency and fast responses). FeatureBox provides self-developed storage engines with different kernels (MDB/RDB/FDB/TDB) and encapsulates a logical storage layer that routes to different physical storage engines, so that each scenario can use the engine that meets its specific requirements.
  • Consistency of feature data for model training/prediction: The feature data used for training and prediction is often heterogeneous or inconsistent because of differing data implementations, which leads to biased model predictions. We abstract a single data access layer in the Datahub system to isolate and decouple models from physical storage, and ensure consistency of the feature data used for training/prediction through a unified data access API and automated data synchronization tasks.
  • Feature extraction & operator reuse: Because the computing environments and data contexts differ, offline training and online prediction usually each implement their own feature extraction logic. This not only brings extra development work, but also causes inconsistent calculation precision, quality risks, and higher maintenance costs due to cross-language and cross-environment factors. We designed a cross-language, cross-platform operator library and feature extraction engine, so that one operator code base plus a unified DSL configuration can take effect in all online/offline computing environments.
  • Training sample production/management: Going from feature data to the final sample dataset fed into model training usually involves feature selection, feature extraction, sample sampling, sample joining, and similar steps. FeatureBox standardizes the inputs and outputs of this process through standard APIs, supports custom data pipelines, and hosts the data pipeline tasks of the whole process to achieve a seamless connection between feature data and model training.
  • Feature quality monitoring and analysis: A large share of the errors produced by machine learning systems come from data problems. FeatureBox helps algorithm engineers discover and monitor such data quality problems by collecting metrics from storage and services, including but not limited to feature quality, feature importance, and service performance.

In summary, FeatureBox is a data system customized for machine learning scenarios to solve the problems described by Feature-Ops, mainly covering the following three aspects:

  • Store data and manage metadata.
  • Create and manage feature extraction pipelines.
  • Provide consistent data services for model training/estimation.

FeatureBox overall architecture

FeatureBox is not a single service or code base, but a complete data system for machine learning processes.

FeatureBox is built on top of Cloud Music's self-developed data service management system, "Datahub". The overall architecture is shown below:

FeatureBox System

What do these modules do, and how do they relate to each other? Below we introduce the core modules in detail.

Datahub

"Datahub" is the core module in FeatureBox, which can be said to be the cornerstone of the entire FeatureBox. It constructs a set of abstract feature metadata and encapsulates APIs of various physical storages, abstracting all reading and writing of physical data into operations on features. We can get the Schema and Storage metadata of the feature through Datahub, and you can use the Datahub API to access the feature data you need in any language and environment. Through Datahub, FeatureBox enables algorithm engineers to maintain a consistent experience when operating feature data in offline/real-time/online environments.

At the same time, as the proxy for accessing Storage, Datahub also handles serialization, compression, and metrics instrumentation, shielding users from low-level technical optimizations and achieving higher read and write efficiency. In addition, Datahub can act as an interception pipeline between data and physical storage, into which various custom processing steps (syntax filtering, security processing, cache optimization, etc.) can be added.
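To make this unified access layer concrete, here is a minimal Java sketch of what such a feature access interface could look like. The interface name, method signatures, and the FeatureMeta holder are illustrative assumptions rather than Datahub's actual API.

```java
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch of a unified feature access API: one interface for
 * metadata lookup, online point reads, offline batch reads, and writes.
 * Names and signatures are assumptions, not Datahub's real interface.
 */
public interface DatahubClient {

    /** Look up the registered schema and storage metadata of a feature table. */
    FeatureMeta describe(String featureTable);

    /** Online point lookup: read one feature row by primary key (KV/KKV access). */
    Map<String, Object> get(String featureTable, String pk);

    /** Offline batch read: scan one partition of a feature table for training. */
    List<Map<String, Object>> scan(String featureTable, String partition);

    /** Write a feature row; the implementation handles serialization, compression, and metrics. */
    void put(String featureTable, String pk, Map<String, Object> value);

    /** Minimal metadata holder for the sketch. */
    record FeatureMeta(String protoSchema, String storageType) {}
}
```

Because training and prediction code both go through the same interface, the physical storage behind a feature can change (or differ between offline and online) without touching model code, which is how the training/prediction consistency described above is maintained.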

datahub

Schema & Serialization

In order for all stored data to have metadata, the first step is to design a standard table schema that can express the format of all current business data. The key design decision for this schema is the choice of value serialization scheme, for which we considered the following goals:

  • The schema should be easy to understand and easy to extend with new fields
  • It should support cross-language serialization
  • It should have efficient encode/decode performance and a high compression ratio

Based on these points, two alternatives come to mind: JSON and Protobuf. Each has its own advantages and disadvantages, so let's analyze them.

JSON:

Advantages: easy to understand, highly extensible, and supported in virtually every language.

Disadvantages: plaintext string storage, so the compression ratio and encode/decode performance are poor.

Protobuf:

Advantages: as Google's long-established serialization format, it offers excellent encode/decode performance and compression ratio, along with good cross-language support.

Disadvantages: a .proto file must be generated and maintained for the schema, which hinders dynamic field extension (adding a field to a table may require schema changes in online applications, Flink jobs, ETL jobs, Spark training scripts, and so on).

So is there a way to get both the efficient performance of Protobuf and the flexibility of JSON? The answer is yes.

We use the Protobuf library's com.google.protobuf.DynamicMessage and com.google.protobuf.Descriptors.Descriptor classes to implement Protobuf-based metadata management and conversion, and the open-source library protostuff to compile .proto files dynamically. This gives the Protobuf format JSON-like convenience: values can be manipulated directly as Map<String, Object>, with no need to update and republish .proto files alongside every change.

After determining the serialization method for the value, building the table schema is much easier. Since Datahub only provides KV/KKV data interfaces for feature services, the table schema we define only needs the pk and sk key columns; the remaining columns form the Protobuf schema of the value. In this way we satisfy both the storage engine's requirement for efficient reads and writes and the business systems' requirement for simplicity and ease of use.

Example: music_alg:fm_dsin_user_static_ftr_dpb

schema

The automatically generated Protobuf schema:

 syntax = "proto3";
package alg.datahub.dto.proto;
message UserStaticFeature {
  repeated float userTag = 1;
  repeated float userLan = 2;
  repeated float userRedTag = 3;
  repeated float userRedLan = 4;
  SparseVector userMultiStyleSparseVector = 5;
  repeated float userRedSongTimespan = 6;
  repeated int32 userBaseFeatureStr = 7;
  float userAgeType = 8;
  float userRank = 9;
  repeated float userSong2VectorEmbedding = 10;
  repeated float userChineseTag = 11;
  repeated float userTagPlayEndRate = 12;
  repeated float userLanPlayEndRate = 13;
  repeated float userPubTimePlayEndRateAll = 14;
  SparseVector artistPlayEndRatioSparseVector = 15;
  repeated float dsUserTag = 16;
  repeated float dsUserLan = 17;
  repeated float dsUserRedTag = 18;
  repeated float dsUserRedLan = 19;
  repeated float fatiRatio = 20;
}
message SparseVector {
  int32 size = 1;
  repeated int32 indices = 2;
  repeated double values = 3;
}
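With a Descriptor for a message like UserStaticFeature available at runtime, feature values can be encoded and decoded without any generated Java classes. The sketch below is a minimal illustration of that idea using com.google.protobuf.DynamicMessage; how the Descriptor itself is obtained (e.g. from the dynamically compiled .proto) is assumed, and the helper class name is ours, not FeatureBox's.

```java
import com.google.protobuf.Descriptors;
import com.google.protobuf.DynamicMessage;
import com.google.protobuf.InvalidProtocolBufferException;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal sketch: encode/decode feature values via a runtime Descriptor, no generated classes. */
public class DynamicFeatureCodec {

    /** Serialize a Map of column name -> value into Protobuf bytes. */
    public static byte[] encode(Descriptors.Descriptor descriptor, Map<String, Object> fields) {
        DynamicMessage.Builder builder = DynamicMessage.newBuilder(descriptor);
        for (Map.Entry<String, Object> entry : fields.entrySet()) {
            Descriptors.FieldDescriptor fd = descriptor.findFieldByName(entry.getKey());
            if (fd == null) {
                continue; // column not in this schema version: skip so schemas can evolve independently
            }
            if (fd.isRepeated()) {
                for (Object item : (List<?>) entry.getValue()) {
                    builder.addRepeatedField(fd, item);
                }
            } else {
                builder.setField(fd, entry.getValue());
            }
        }
        return builder.build().toByteArray();
    }

    /** Parse Protobuf bytes back into column name -> value pairs. */
    public static Map<String, Object> decode(Descriptors.Descriptor descriptor, byte[] bytes)
            throws InvalidProtocolBufferException {
        DynamicMessage message = DynamicMessage.parseFrom(descriptor, bytes);
        Map<String, Object> result = new HashMap<>();
        for (Map.Entry<Descriptors.FieldDescriptor, Object> entry : message.getAllFields().entrySet()) {
            result.put(entry.getKey().getName(), entry.getValue());
        }
        return result;
    }
}
```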

Transform

"Transform" is another core module of FeatureBox besides Datahub. It mainly manages the entire process from feature reading to model input, and is the link between feature engineering and model engineering in machine learning systems. Transform is arranged and configured by feature metadata and operator metadata registered in FeatureBox. It can express the execution process of feature extraction across languages ​​and engines.

Unlike the common industry definition of Transform, the Transform here is just a configuration description in a custom DSL. It represents the entire calculation process of feature extraction and does not include concrete tasks or task pipelines (those live in the Job Generator and the Web Console's task management functions).

Transform can be divided into three situations according to the actual application scenarios:

  • Offline training: batch Transform for offline model training. Features are read as a DataSet from Hive/HDFS; operators are implemented in Java/Scala/C++; the output is TFRecord files.
  • Online prediction: Transform over a specified feature set for online model prediction. Features are queried by key from Redis/Tair; operators are implemented in Java/Scala/C++; the output is Vector<Tensor> objects.
  • Real-time features (planned): streaming-data Transform for real-time feature production. Features are consumed as a stream from Kafka/Nydus; operators are implemented in Java/Scala/C++; the output is dynamic Protobuf objects.

With the same Transform syntax (MFDL), we can express the feature calculation process for different environments and computing engines and produce the final feature values required:

Transform
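The actual MFDL syntax is described in the article referenced below; purely to illustrate the idea of one operator library driven by a DSL configuration, here is a hypothetical Java sketch. The interface and class names (ExtractOperator, TransformPipeline) are ours, not FeatureBox's real MFDL classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical reusable extraction operator: raw feature map in, extracted value out. */
interface ExtractOperator extends Function<Map<String, Object>, Object> {
    String name();
}

/** Hypothetical DSL-driven pipeline that runs registered operators by name. */
class TransformPipeline {
    private final Map<String, ExtractOperator> operators = new LinkedHashMap<>();

    // Register an operator once; the same registry can back batch, streaming,
    // and online engines, which is the point of a shared operator library.
    void register(ExtractOperator op) {
        operators.put(op.name(), op);
    }

    // "dsl" is a parsed config: output column -> operator name.
    // A real DSL would also carry operator arguments and input mappings.
    Map<String, Object> run(Map<String, String> dsl, Map<String, Object> rawFeatures) {
        Map<String, Object> out = new LinkedHashMap<>();
        dsl.forEach((column, opName) -> out.put(column, operators.get(opName).apply(rawFeatures)));
        return out;
    }
}
```

An operator registered once in such a library could then be referenced by name from both the offline and online Transform configurations, which is what keeps the two extraction paths consistent.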

For how MFDL is implemented and applied in our Transform module, see the earlier article on the construction and practice of the Cloud Music prediction system, which describes the use of MFDL in the online prediction system in detail.

Monitor

When machine learning systems go wrong, the cause is most often a data problem. Because FeatureBox holds feature storage, feature metadata, feature service information, and so on, it is well placed to act as a feature monitoring center that helps the entire machine learning process locate and surface feature data problems. Under normal circumstances, we mainly collect and monitor the following three types of metrics:

  • Feature basic metrics: "feature basic metrics" are statistics computed over the feature data in the storage engine, such as feature coverage, storage size, freshness, and distribution. These basic metrics help us quickly understand a feature, so that algorithm engineers and data engineers can decide how to use or operate on the feature data.
  • Feature service metrics: "feature service metrics" are the real-time operational metrics of online systems such as DataService/Storage, for example storage metrics (availability, capacity, utilization, etc.) and service metrics (QPS, RT, error rate, etc.). These metrics help us observe and analyze in real time whether the whole online FeatureBox system is stable and available, so that the upstream businesses and the services they provide to the app stay stable and available.
  • Feature/model drift metrics: "feature/model drift metrics" describe the quality of feature data through indicators such as feature importance and training/prediction data skew. As time passes or unexpected external events occur, the data a deployed model was trained on may deviate significantly from the data it actually predicts on, degrading model performance, so we track these drift metrics to help maintain the performance of machine learning models in production.

For feature basic metrics and drift detection, the Monitor module of FeatureBox mainly integrates the Data Validation component of TFX (TFDV) to analyze and monitor datasets. We mainly provide the following three analysis and monitoring capabilities:

  • Visual analysis of statistics over static datasets.
  • Validation of dataset statistics against an a priori expected schema.
  • Two-sample comparison to detect data skew and drift (see the sketch below).
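TFDV itself is configured and run from Python; purely to illustrate what such a two-sample comparison computes, here is a small standalone Java sketch of an L-infinity distance check between two categorical value distributions (the kind of distance TFDV uses for categorical drift). The data, threshold, and class name are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal sketch of a two-sample drift check, assuming we already have
 * categorical value counts for one feature in two datasets (e.g. training
 * vs. serving). Illustrative only, not FeatureBox's or TFDV's actual code.
 */
public class DriftCheck {

    /** L-infinity distance between two normalized value distributions. */
    static double lInfinityDistance(Map<String, Long> baseline, Map<String, Long> current) {
        double baseTotal = baseline.values().stream().mapToLong(Long::longValue).sum();
        double curTotal = current.values().stream().mapToLong(Long::longValue).sum();
        Map<String, Long> allValues = new HashMap<>(baseline);
        current.forEach((k, v) -> allValues.merge(k, v, Long::sum));
        double maxDiff = 0.0;
        for (String value : allValues.keySet()) {
            double p = baseline.getOrDefault(value, 0L) / baseTotal;
            double q = current.getOrDefault(value, 0L) / curTotal;
            maxDiff = Math.max(maxDiff, Math.abs(p - q));
        }
        return maxDiff;
    }

    public static void main(String[] args) {
        // Hypothetical value counts of one categorical feature in two samples.
        Map<String, Long> train = Map.of("pop", 700L, "rock", 200L, "jazz", 100L);
        Map<String, Long> serve = Map.of("pop", 500L, "rock", 350L, "jazz", 150L);
        double drift = lInfinityDistance(train, serve);
        // Alert when the distance exceeds a configured threshold, e.g. 0.1.
        System.out.println("drift = " + drift + ", alert = " + (drift > 0.1));
    }
}
```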

The following figure details the position and role of the Monitor module in the entire machine learning process.

Data Validation In ML

Example: provide a visual view of a dataset's basic statistics and distributions so that algorithm engineers can quickly troubleshoot data anomalies. (Native TFDV generates the visualization by running a script in a Jupyter notebook; we can also collect the stats produced by each statistics run and display them in the FeatureBox interface.)

The statistics view divides features into two categories, continuous and discrete. Both have distribution statistics (continuous values use a standard histogram), and continuous values additionally have statistics such as the median, variance, and standard deviation.

TFDV_1

Storage

" Storage " is the physical storage layer in FeatureBox, responsible for storing real feature data and providing data read and write services to the upstream data service layer. According to different characteristic application scenarios, the Storage module can be divided into offline storage and online storage.

Offline storage: Offline storage is usually used in training or batch prediction scenarios. It stores TB-scale feature data covering recent months or years and provides hourly/daily batch read and write capabilities. Common offline storage includes Hive/HDFS, etc.

Online storage: Online storage is generally used in real-time prediction scenarios. It stores only the latest value of each feature and has strict latency and response requirements. Common online storage includes Redis/Tair/MySQL, etc. At Cloud Music, in order to meet different types of feature storage requirements and the response requirements of different scenarios, we have also customized storage engine kernels based on the Tair architecture:

  • MDB: an in-memory storage engine based on an in-memory hash table. It offers fast responses and low latency at a relatively high storage cost, and is usually used in online prediction scenarios for small feature datasets with strict response requirements.
  • RDB: a disk-based storage engine built on RocksDB. Its response time and latency are slightly worse than MDB, but its storage costs are lower. It supports bulk batch updates of data and is usually used in online prediction scenarios for large feature datasets. For details, see the earlier article on the practice of the self-developed disk-based feature storage engine RDB in Cloud Music.
  • FDB: a RocksDB storage engine using the FIFO compaction strategy. Thanks to FIFO compaction, it can store log-style data without write amplification and is usually used to store Snapshot feature snapshot data.
  • TDB: a self-developed time-series storage engine that can aggregate data at different time granularities. Its responsiveness and latency are not as good as MDB/RDB, and it is usually used to store statistical features aggregated over time.

FeatureBox uses Datahub/DataService as a routing agent: reads and writes of feature data from upper-layer services are routed and translated to operations on the corresponding physical Storage connection. Users therefore never deal with the underlying Storage APIs or their operations; they simply define the Schema in the Web Console and pick the Storage best suited to their feature data. This also makes feature storage management, operations and maintenance, data migration, fast failover, and capacity expansion much more convenient for FeatureBox.

FeatureBox Storage
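As a rough sketch of the routing idea (not FeatureBox's actual code), the example below resolves a feature table's registered engine from metadata and dispatches reads to the corresponding physical client; the interface and names are assumptions.

```java
import java.util.Map;

/** Hypothetical sketch of metadata-driven storage routing; not FeatureBox's real API. */
public class StorageRouter {

    /** Minimal abstraction over a physical storage engine (MDB/RDB/FDB/TDB/Hive, ...). */
    interface FeatureStorage {
        byte[] get(String table, String pk);
    }

    private final Map<String, String> tableToEngine;          // feature table -> engine name, from metadata
    private final Map<String, FeatureStorage> engineClients;  // engine name -> physical client

    public StorageRouter(Map<String, String> tableToEngine, Map<String, FeatureStorage> engineClients) {
        this.tableToEngine = tableToEngine;
        this.engineClients = engineClients;
    }

    /** Upper layers only name the feature table and key; the router picks the physical engine. */
    public byte[] read(String table, String pk) {
        String engine = tableToEngine.get(table);
        FeatureStorage client = engineClients.get(engine);
        if (client == null) {
            throw new IllegalStateException("No storage registered for table " + table);
        }
        return client.get(table, pk);
    }
}
```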

Epilogue

That is all for this article. We briefly introduced the definitions of Feature-Ops and the Feature Store and the problems they solve, and then described the main design and module functions of Cloud Music's self-built Feature Store, FeatureBox. We hope it brings some inspiration and help to readers interested in feature engineering. Due to space limitations, many details of the Feature Store have not been covered here; please look out for follow-up articles.


NetEase Cloud Music Technology Team