By: Yashal Kanungo – Application Scientist, Kamran Khan – Senior Technical Product Manager, Shubha Kumbadakone – Senior Specialist in ML Frameworks

Amazon Ads uses PyTorch, TorchServe, and AWS Inferentia to scale out ad processing while reducing inference costs by 71%.

Amazon Ads helps companies build their brands and connect with shoppers through ads shown on and off Amazon stores, including websites, apps, and streaming TV content in more than 15 countries. Businesses and brands of all sizes, including registered sellers, vendors, book vendors, Kindle Direct Publishing (KDP) authors, app developers, and agencies, can upload their own ad creative, which can include images, video, audio, and, of course, products sold on Amazon.

To promote an accurate, safe, and enjoyable shopping experience, these ads must comply with content guidelines. For example, ads must not flash on and off, products must be shown in an appropriate context, and images and text should be suitable for a general audience. To help ensure that ads meet the required policies and standards, we need scalable mechanisms and tools.

As a solution, we use machine learning (ML) models to surface ads that may need to be revised. As deep neural networks have flourished over the past decade, our data science team began exploring more versatile deep learning (DL) methods that can process text, images, audio, or video with minimal human intervention. To this end, we used PyTorch to build computer vision (CV) and natural language processing (NLP) models that automatically flag potentially non-compliant ads. PyTorch is intuitive, flexible, and user-friendly, and it made our transition to DL models seamless. By deploying these new models on AWS Inferentia-based Amazon EC2 Inf1 instances instead of GPU-based instances, we reduced inference latency by 30% and inference costs by 71% for the same workload.

Moving to Deep Learning

Our machine learning system paired classical models with word embeddings to evaluate ad text. But our needs are constantly changing, and as the volume of submissions continues to grow, we need a method that is flexible enough to scale with our business. Additionally, our models must evaluate ads within milliseconds to provide the best customer experience.

Over the past decade, deep learning has become very popular in many domains, including natural language, vision, and audio. Because deep neural networks pass data through multiple layers, incrementally extracting higher-level features, they can make finer-grained inferences than classical ML models. For example, a deep learning model can reject an ad that makes a false claim, rather than simply detecting banned language.

Furthermore, deep learning techniques are transferable: a model trained for one task can be adapted to perform related tasks. For example, a neural network pretrained to detect objects in images can be fine-tuned to identify specific objects that are not allowed to appear in advertisements.

Deep neural networks can automate two of the most time-consuming steps of classical ML: feature engineering and data labeling. Unlike traditional supervised learning methods that require exploratory data analysis and hand-engineered features, deep neural networks learn relevant features directly from the data. DL models can also analyze unstructured data, such as text and images, without the preprocessing required in ML. Deep neural networks scale efficiently with more data and perform particularly well in applications involving large datasets.

We chose PyTorch to develop our models because it helped us maximize the performance of our system. With PyTorch, we can better serve our customers while working with Python's most intuitive concepts. Programming in PyTorch is object-oriented: it groups processing functions with the data they modify, so our codebase is modular and we can reuse code snippets across applications. Additionally, PyTorch's eager execution mode supports loops and control structures, so our models can include more complex operations. Eager execution makes it easy to prototype and iterate on our models, and we can work with a variety of data structures. This flexibility helps us update our models quickly to meet changing business needs.

"We've tried other 'Pythonic' frameworks before this, but PyTorch is the clear winner here for us," says application scientist Yashal Kanungo. "Using PyTorch is easy because the structure feels native to Python programming, and data scientists are very familiar with it".

Training Pipeline

Today, we build our text models entirely in PyTorch. To save time and money, we often skip the early stages of training by fine-tuning pretrained NLP models for language analysis. If we need a new model to evaluate images or videos, we first browse PyTorch's torchvision library, which provides pretrained models for image and video classification, object detection, instance segmentation, and pose estimation. For specialized tasks, we build custom models from scratch. PyTorch is great for this, as eager execution and the user-friendly frontend make it easy to experiment with different architectures.

To learn how to fine-tune neural networks in PyTorch, see this tutorial.
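
For illustration, here is a minimal sketch of that fine-tuning pattern with a pretrained torchvision backbone; the backbone choice, class count, and learning rate are placeholders rather than our production setup.

import torch
import torch.nn as nn
from torchvision import models

# Load a pretrained backbone from torchvision (ResNet-50 is just an example)
model = models.resnet50(pretrained=True)

# Freeze the pretrained feature extractor so only the new head is trained at first
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head with one sized for the task at hand
# (num_classes is a placeholder for the number of ad-policy categories)
num_classes = 2
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the parameters of the new head are passed to the optimizer
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)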

Before starting training, we optimize the model's hyperparameters: the variables that define the network architecture (such as the number of hidden layers) and the training mechanics (such as learning rate and batch size). Choosing appropriate hyperparameter values is critical because they shape the training behavior of the model. In this step, we rely on the Bayesian search capability of Amazon SageMaker. Bayesian search treats hyperparameter tuning as a regression problem: it proposes combinations of hyperparameters likely to yield the best results and runs training jobs to test those values. After each trial, the regression algorithm determines the next set of hyperparameter values to test, so performance improves over successive trials.
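
As an illustration, the sketch below configures a Bayesian search with the SageMaker Python SDK; the training script, metric regex, hyperparameter ranges, framework version, and instance types are placeholders, not our actual settings.

import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

# A PyTorch estimator pointing at a training script (train.py is a placeholder)
estimator = PyTorch(
    entry_point="train.py",
    role=sagemaker.get_execution_role(),
    framework_version="1.8.1",
    py_version="py3",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
)

# Hyperparameter ranges to explore; names must match the script's CLI arguments
hyperparameter_ranges = {
    "lr": ContinuousParameter(1e-5, 1e-2),
    "batch-size": IntegerParameter(16, 128),
}

# Bayesian search proposes new combinations based on the results of previous trials
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="validation:loss",
    objective_type="Minimize",
    metric_definitions=[{"Name": "validation:loss", "Regex": "val_loss=([0-9\\.]+)"}],
    hyperparameter_ranges=hyperparameter_ranges,
    strategy="Bayesian",
    max_jobs=20,
    max_parallel_jobs=2,
)

# tuner.fit({"training": "s3://your-bucket/path/to/training-data"})  # placeholder S3 path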

We use SageMaker Notebooks to prototype and iterate on our models. Eager execution lets us prototype rapidly by building a new computational graph for each training batch; the order of operations can change between iterations to accommodate different data structures or intermediate results. This allows us to tune the network during training without starting from scratch. These dynamic graphs are particularly valuable for recursive computations over variable sequence lengths, such as the words, sentences, and paragraphs of ads analyzed with NLP.

Once the model architecture is settled, we launch the training job on SageMaker. PyTorch helps us develop large models faster by running many training processes in parallel. PyTorch's Distributed Data Parallel (DDP) module replicates a single model across multiple interconnected machines within SageMaker, and each process runs the forward pass concurrently on its own unique portion of the dataset. During backpropagation, the module averages the gradients across all processes, so each local replica is updated with the same parameter values.
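
The sketch below shows the general DDP training pattern. It assumes a torchrun-style launcher that sets the usual rank environment variables; the model, dataset, and hyperparameters are placeholders.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model, dataset, epochs=1):
    # Each process joins the job; the launcher supplies rank and world size
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # DDP replicates the model and averages gradients across processes
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])

    # DistributedSampler gives each process its own unique shard of the data
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for inputs, labels in loader:
            inputs, labels = inputs.cuda(local_rank), labels.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()   # gradients are all-reduced across processes here
            optimizer.step()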

Model Deployment Pipeline

When we deploy models to production, we want to reduce inference costs without compromising prediction accuracy. Several features of PyTorch and AWS services help us meet that challenge.

The flexibility of dynamic graphs enriches training, but for deployment we want to maximize performance and portability. One advantage of developing NLP models in PyTorch is that, out of the box, they can be traced into a static sequence of operations by TorchScript, a subset of Python specialized for ML applications. TorchScript converts PyTorch models into production-friendly intermediate representation (IR) graphs that are more efficient and easier to compile. We run an example input through the model, and TorchScript records the operations performed during the forward pass. The resulting IR graph can run in high-performance environments, including C++ and other Python-free multithreaded contexts, and optimizations such as operator fusion can speed up execution.
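
As a minimal illustration of tracing, the snippet below traces a stand-in torchvision model rather than one of our ad models; the input shape and file name are placeholders.

import torch
import torchvision.models as models

# Any eager-mode model can be traced; ResNet-18 here is just a stand-in
model = models.resnet18(pretrained=True).eval()

# TorchScript records the operations executed during one forward pass
example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)

# The resulting IR graph can be serialized and later loaded from Python or C++
traced_model.save("model_traced.pt")
loaded = torch.jit.load("model_traced.pt")
output = loaded(example_input)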

The Neuron SDK and AWS Inferentia-Powered Compute

We deployed our models on Amazon EC2 Inf1 instances powered by AWS Inferentia, Amazon's first ML chip designed to accelerate deep learning inference workloads. Inferentia has been shown to reduce inference costs by up to 70% compared with GPU-based Amazon EC2 instances. We used the AWS Neuron SDK, a set of software tools for working with Inferentia, to compile and optimize our models for deployment on EC2 Inf1 instances.

The following code snippet shows how to compile a Hugging Face BERT model with Neuron. Like torch.jit.trace(), torch.neuron.trace() records the model's operations on example inputs during the forward pass to build a static IR graph.

import torch
import torch.neuron
from transformers import BertModel, BertTokenizer

# Load the tokenizer and model from local artifacts
tokenizer = BertTokenizer.from_pretrained("path to saved vocab")
model = BertModel.from_pretrained("path to the saved model", return_dict=False)

# Tokenize an example input used to record the forward pass
inputs = tokenizer("sample input", return_tensors="pt")

# Compile the model for Inferentia; Neuron traces the operations on the example inputs
neuron_model = torch.neuron.trace(model,
                                  example_inputs=(inputs['input_ids'], inputs['attention_mask']),
                                  verbose=1)

# Run inference with the compiled model
output = neuron_model(*(inputs['input_ids'], inputs['attention_mask']))

Automatic Conversion and Recalibration

Under the hood, Neuron optimizes the performance of models by automatically converting them to smaller data types. By default, most applications represent neural network values in 32-bit single-precision floating-point (FP32) number format. Automatic conversion of models to 16-bit formats—half-precision floating point (FP16) or brain floating point (BF16)—reduces model memory footprint and execution time. In our case, we decided to use FP16 to optimize performance while maintaining high accuracy.

In some cases, automatic conversion to a smaller data type can cause subtle differences in model predictions. To ensure that accuracy is not compromised, Neuron compares the performance metrics and predictions of the FP16 and FP32 models. When automatic conversion reduces accuracy, we can tell the Neuron compiler to convert only the weights and certain data inputs to FP16, keeping the remaining intermediate results in FP32. In addition, we often recalibrate the auto-cast model with a few iterations over the training data. This recalibration requires far less training than the original run.

Serving

To analyze multimedia ads, we run a suite of DL models. All ads uploaded to Amazon run through specialized models that evaluate every type of content they include: images, video and audio, headlines, text, context, and even syntax, grammar, and potentially inappropriate language. The signals we receive from these models indicate whether an ad meets our criteria.

Deploying and monitoring multiple models is incredibly complex, so we rely on TorchServe, SageMaker's default PyTorch model serving library. TorchServe was jointly developed by Facebook's PyTorch team and AWS to simplify the transition from prototyping to production, and it helps us deploy trained PyTorch models at scale without writing custom serving code. It provides a set of secure REST APIs for inference, management, metrics, and explanations. With features such as multi-model serving, model versioning, ensemble support, and automatic batching, TorchServe is well suited to our huge workload. You can read more about deploying PyTorch models on SageMaker with native TorchServe integration in this blog post.

In some use cases, we take advantage of PyTorch's object-oriented programming paradigm to wrap multiple DL models into a single parent object (a PyTorch nn.Module) and serve them as one unit. In other cases, we use TorchServe to serve separate models on separate SageMaker endpoints running on Amazon EC2 Inf1 instances.
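
The following sketch illustrates the wrapping pattern with hypothetical text and image sub-models; it is not our production ensemble, just an example of bundling several models behind one forward() call.

import torch
import torch.nn as nn

class AdReviewEnsemble(nn.Module):
    """Parent module that bundles several content models behind a single forward()."""

    def __init__(self, text_model, image_model):
        super().__init__()
        # Sub-models are registered as ordinary child modules
        self.text_model = text_model
        self.image_model = image_model

    def forward(self, text_inputs, image_inputs):
        # Each sub-model scores its own modality; the signals are returned together
        text_scores = self.text_model(**text_inputs)
        image_scores = self.image_model(image_inputs)
        return {"text": text_scores, "image": image_scores}

# ensemble = AdReviewEnsemble(text_model, image_model)
# signals = ensemble(text_inputs, image_inputs)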

Custom Handlers

We especially appreciate that TorchServe lets us consolidate model initialization, preprocessing, inference, and postprocessing code into a single Python script, handler.py, that runs on the server. The script (the handler) preprocesses the unlabeled ad data, runs it through our model, and passes the resulting inferences to downstream systems. TorchServe provides several default handlers that load weights and architectures and prepare models to run on specific devices. We can bundle all additional artifacts needed (such as vocabulary files or label maps) with the model in a single archive file.

We design custom handlers in TorchServe when we need to deploy models that have complex initialization procedures or originate from third-party libraries. This lets us load any model from any library using any desired procedure. The following code snippet shows a simple handler that can serve a Hugging Face BERT model on any SageMaker managed endpoint instance.

import os

import torch
import torch.neuron
from ts.torch_handler.base_handler import BaseHandler
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MAX_LENGTH = 128  # maximum sequence length for the tokenizer (placeholder value)


class MyModelHandler(BaseHandler):

    def initialize(self, context):
        # Locate the model artifacts that TorchServe unpacked from the model archive
        self.manifest = context.manifest
        properties = context.system_properties
        model_dir = properties.get("model_dir")
        serialized_file = self.manifest["model"]["serializedFile"]
        # Path to the serialized weights; unused here because from_pretrained loads the directory
        model_pt_path = os.path.join(model_dir, serialized_file)

        # Load the tokenizer and model from the archive directory
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir, do_lower_case=True)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_dir)

    def preprocess(self, data):
        # TorchServe passes a batch of requests; for clarity this handler uses the first one
        request = data[0]
        input_text = request.get("data")
        if input_text is None:
            input_text = request.get("body")

        # Tokenize and pad the ad text to the model's expected input shape
        inputs = self.tokenizer.encode_plus(
            input_text,
            max_length=MAX_LENGTH,
            pad_to_max_length=True,
            add_special_tokens=True,
            return_tensors="pt",
        )
        return inputs

    def inference(self, inputs):
        predictions = self.model(**inputs)
        return predictions

    def postprocess(self, output):
        return output

Batching

Hardware accelerators are optimized for parallelism, and batching (feeding the model multiple inputs in a single step) helps saturate all available capacity, often resulting in higher throughput. An excessively large batch size, however, increases latency with only a marginal gain in throughput. Experimenting with different batch sizes helped us find the sweet spot for each model and hardware accelerator, taking model size, payload size, and request traffic patterns into account.

The Neuron compiler now supports variable batch sizes. Previously, traced models were hardcoded to a predefined batch size, so we had to pad the data, which wastes compute, reduces throughput, and inflates latency. Inferentia is optimized to maximize throughput for small batches, which keeps latency low by preventing the system from becoming saturated.
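
As a simplified sketch of such an experiment, the snippet below times a compiled model at several batch sizes and reports latency and throughput. The model, input shapes, and batch sizes are placeholders, and a model traced with a fixed batch size would need to be recompiled for each size tested.

import time
import torch

def benchmark(model, batch_sizes, seq_len=128, n_iters=100):
    """Measure latency and throughput of a model callable at several batch sizes."""
    for batch_size in batch_sizes:
        # Dummy BERT-style inputs; shapes stand in for real ad payloads
        input_ids = torch.randint(0, 30000, (batch_size, seq_len))
        attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)

        # Warm up before timing
        for _ in range(10):
            model(input_ids, attention_mask)

        start = time.time()
        for _ in range(n_iters):
            model(input_ids, attention_mask)
        elapsed = time.time() - start

        latency_ms = elapsed / n_iters * 1000
        throughput = batch_size * n_iters / elapsed
        print(f"batch={batch_size:3d}  latency={latency_ms:6.1f} ms  "
              f"throughput={throughput:7.1f} inferences/s")

# benchmark(neuron_model, batch_sizes=[1, 2, 4, 8, 16])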

Parallelism

Running models in parallel on multiple cores also improves throughput and latency, which is critical for our heavy workloads. Each Inferentia chip contains four NeuronCores, which can either run separate models simultaneously or be pipelined to stream a single model across cores. In our use case, the data-parallel configuration provides the highest throughput at the lowest cost, because it scales concurrent request processing.

Data parallelism: [figure omitted]

Model parallelism: [figure omitted]
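
As a sketch of the data-parallel configuration, the snippet below uses the Neuron SDK's torch.neuron.DataParallel utility to replicate a compiled model across the NeuronCores on an Inf1 instance; the model file name and input shape are placeholders, and the exact inputs depend on how the model was compiled.

import torch
import torch.neuron

# Load a model previously compiled with torch.neuron.trace()
model = torch.jit.load("model_neuron.pt")

# Replicate the model across the available NeuronCores on the Inf1 instance;
# incoming batches are split along dimension 0 and processed by the cores in parallel
model_parallel = torch.neuron.DataParallel(model)

# Inputs are batched as usual; the shape here is a placeholder
batch = torch.rand(8, 3, 224, 224)
outputs = model_parallel(batch)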

Monitoring

Monitoring inference accuracy in production is critical. Models that initially make accurate predictions can degrade in deployment as they are exposed to a wider variety of data. This phenomenon, called model drift, usually occurs when the input data distribution or the prediction target changes.

We use SageMaker Model Monitor to track parity between the training and production data. Model Monitor notifies us when predictions in production begin to deviate from the training and validation results. Thanks to this early warning, we can restore accuracy — by retraining the model if necessary — before our advertisers are affected. To track performance in real time, Model Monitor also sends us metrics about the quality of predictions, such as accuracy, F-scores, and the distribution of the predicted classes.

To determine whether our application needs to scale, TorchServe periodically records resource utilization metrics for CPU, memory, and disk; it also records the number of requests received and served. For custom metrics, TorchServe provides a Metrics API.
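
As an illustration of that pattern, the hypothetical handler below emits a custom counter through the metrics object that TorchServe attaches to the request context; the handler name and metric name are placeholders.

from ts.torch_handler.base_handler import BaseHandler

class MetricsAwareHandler(BaseHandler):

    def inference(self, inputs):
        # BaseHandler stores the request context; its metrics object accepts custom metrics
        metrics = self.context.metrics

        predictions = self.model(**inputs)

        # Emit a custom counter for each inference call (metric name is a placeholder)
        metrics.add_counter("AdModelInferenceCount", 1)
        return predictions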

A Beneficial Outcome

Developed in PyTorch and deployed on Inferentia, our deep learning models speed up ad analysis while reducing costs. From our first explorations in DL, programming in PyTorch felt natural, and its user-friendly features smoothed the path from our early experiments to the deployment of our multimodal models. PyTorch lets us prototype and build models rapidly, which has been critical to the development and expansion of our advertising services. As an added benefit, PyTorch works seamlessly with Inferentia and the rest of our AWS machine learning stack. We look forward to building more use cases with PyTorch so that we can continue to deliver accurate, real-time results to our customers.

