Abstract: Most existing methods for handling imbalanced data and long-tailed distributions target classification problems; data imbalance in regression problems has rarely been studied.
This article is shared from the article "How to Solve the Problem of Unbalanced Data in Regression Tasks?", original author: PG13.
Most existing methods for handling imbalanced data and long-tailed distributions target classification problems; data imbalance in regression problems has rarely been studied. However, many real industrial forecasting scenarios are regression problems, involving continuous or even infinite target values. How, then, can data imbalance be addressed in regression? An ICML 2021 paper accepted as a long oral presentation, Delving into Deep Imbalanced Regression, generalizes the traditional imbalanced classification paradigm, extends the data imbalance problem from the discrete value domain to the continuous value domain, and proposes two methods for solving deep imbalanced regression problems.
The main contributions are threefold: 1) a Deep Imbalanced Regression (DIR) task is proposed, defined as learning from imbalanced data with continuous targets and generalizing to the entire target range; 2) two new methods for solving DIR are proposed, label distribution smoothing (LDS) and feature distribution smoothing (FDS), to address learning from imbalanced data with continuous targets; 3) five new DIR benchmark datasets are established, covering imbalanced regression tasks in CV, NLP, and healthcare, to support future research on imbalanced data.
Data imbalance problem background
Real-world data rarely follows an ideal uniform distribution over categories; instead it exhibits a long-tailed, skewed distribution in which some target values have significantly fewer observations, which poses a serious challenge for deep learning models. Traditional solutions fall into two types, data-based and model-based: data-based solutions essentially over-sample the minority groups and under-sample the majority groups, as in the SMOTE algorithm; model-based solutions include re-weighting the loss function or applying related learning techniques such as transfer learning, meta-learning, and two-stage training.
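For the re-weighting baseline, a minimal sketch of inverse-class-frequency weighting might look like the following (the function name and toy labels are mine, not from the paper):

```python
import numpy as np
import torch

def inverse_frequency_weights(labels, num_classes):
    """Per-class weights proportional to 1 / class frequency."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    counts[counts == 0] = 1.0  # guard against empty classes
    weights = 1.0 / counts
    weights *= num_classes / weights.sum()  # normalize to mean ~1
    return torch.tensor(weights, dtype=torch.float32)

# Toy usage: pass the weights to a standard cross-entropy loss.
labels = np.random.randint(0, 10, size=1000)  # hypothetical class ids
criterion = torch.nn.CrossEntropyLoss(
    weight=inverse_frequency_weights(labels, num_classes=10))
```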
However, existing data imbalance solutions mainly target categorical labels: the target values belong to distinct categories with strict hard boundaries and no overlap between them. Many real-world prediction scenarios instead involve continuous target labels. For example, when predicting age from a face image, age is a continuous target value and may be highly imbalanced across the target range. Similar problems arise in industry: in cement production, the quality of cement clinker is generally a continuous target value; in coal blending, the thermal strength index of coke is likewise continuous. The target variables in these applications often contain many rare and extreme values. The imbalance problem in the continuous domain affects both linear models and deep models, and it is even more serious in deep models, because the predictions of deep learning models are often over-confident, which severely magnifies the imbalance problem.
Therefore, this article defines the Deep Imbalanced Regression (DIR) problem: learning from imbalanced data with continuous target values while handling potentially missing data in certain target regions, so that the final model generalizes to the entire range of target values.
The challenges of imbalanced regression
Solving the DIR problem presents three challenges:
- For continuous target values (labels), hard boundaries between different target values no longer exist, so imbalanced classification methods cannot be applied directly.
- Continuous labels inherently mean that the distance between different target values is meaningful: the target values tell us which data points are close to each other, and guide how we should understand the degree of data imbalance over the continuous interval.
- In DIR, some target values may have no data at all, which calls for both interpolation and extrapolation over target values.
Solution 1: Label distribution smoothing (LDS)
First, an example illustrates how classification and regression problems differ under data imbalance. The author compares two datasets: (1) CIFAR-100, a 100-class image classification dataset; and (2) IMDB-WIKI, an image dataset for estimating age (regression) from portraits. Data imbalance is simulated by sampling so that the two datasets have exactly the same label density distribution, as shown in the following figure:
Then a ResNet-50 model is trained on each dataset, and the distribution of its test errors is plotted. On the imbalanced classification dataset CIFAR-100, the test error distribution is highly negatively correlated with the label density distribution. This is easy to understand, since categories with more samples are easier to learn. However, on IMDB-WIKI the test error distribution over the continuous label space is much smoother and no longer correlates well with the label density distribution. This shows that for continuous labels, the empirical label density does not accurately reflect the imbalance seen by the model, because data samples with adjacent labels are related and interdependent.
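To make this observation concrete, here is one way such a correlation could be measured; this is a hypothetical analysis sketch (binning scheme and toy data are mine), not code from the paper:

```python
import numpy as np
from scipy.stats import pearsonr

def density_error_correlation(labels, errors, num_bins=100):
    """Bin samples by label, then correlate per-bin label density
    with per-bin mean test error."""
    edges = np.linspace(labels.min(), labels.max(), num_bins + 1)
    idx = np.clip(np.digitize(labels, edges) - 1, 0, num_bins - 1)
    density = np.bincount(idx, minlength=num_bins).astype(float)
    mean_err = np.array([errors[idx == b].mean() if (idx == b).any()
                         else np.nan for b in range(num_bins)])
    ok = ~np.isnan(mean_err)
    return pearsonr(density[ok], mean_err[ok])

# Toy data: a skewed label distribution, with larger errors on
# rarer (larger) label values, giving a negative correlation.
labels = np.random.exponential(scale=20.0, size=5000)
errors = 0.05 * labels + 0.5 * np.random.rand(5000)
r, p = density_error_correlation(labels, errors)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```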
Label distribution smoothing: based on these findings, the author proposes label distribution smoothing (LDS), which borrows kernel density estimation from statistical learning. Given the continuous empirical label density distribution, LDS convolves it with a symmetric kernel function k to obtain an effective, kernel-smoothed label density distribution that accounts for the overlap of information among data samples with adjacent labels. The correlation between the effective label density computed by LDS and the error distribution is significantly stronger. With the effective label density estimated by LDS, methods for class imbalance can be applied directly to the DIR problem. For example, the simplest sensible approach is re-weighting: multiply the loss for each target value by the reciprocal of its LDS-estimated label density.
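A minimal sketch of this idea, assuming a Gaussian kernel and inverse effective-density re-weighting (the bin count, kernel choice, and normalization are my assumptions; the paper's reference implementation may differ):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def lds_weights(labels, num_bins=100, sigma=2.0):
    """Re-weighting via Label Distribution Smoothing: convolve the
    empirical label histogram with a symmetric (here Gaussian) kernel,
    then weight each sample by the inverse of the effective density."""
    edges = np.linspace(labels.min(), labels.max(), num_bins + 1)
    idx = np.clip(np.digitize(labels, edges) - 1, 0, num_bins - 1)
    empirical = np.bincount(idx, minlength=num_bins).astype(float)
    effective = gaussian_filter1d(empirical, sigma=sigma)  # LDS smoothing
    w = 1.0 / np.maximum(effective[idx], 1e-8)
    return w / w.mean()  # normalize so the average weight is 1

# Usage with a weighted regression loss, e.g. weighted MSE:
#   loss = (weights * (pred - target) ** 2).mean()
ages = np.random.exponential(scale=25.0, size=10000)  # hypothetical skewed ages
weights = lds_weights(ages)
```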
Solution 2: Feature Distribution Smoothing (FDS)
If the model predicts well and the data is balanced, then the feature statistics of samples with similar labels should be close to each other. The author verifies this intuition with an example, again using the ResNet-50 model trained on IMDB-WIKI, but focusing on the feature space learned by the model rather than the label space. Since the minimum age difference of interest is 1 year, the label space is divided into equal-width bins, and samples whose targets fall in the same bin are grouped together. For the data in each bin, the corresponding feature statistics (mean and variance) are computed. The similarity between these feature statistics can be visualized as in the following figure:
The red interval represents the anchor bin. The cosine similarity is computed between the feature statistics (i.e., mean and variance) of this anchor label and those of all other labels. The differently colored regions (purple, yellow, pink) indicate different data densities. Two conclusions can be drawn from the figure (a code sketch of the binning-and-statistics computation follows the list):
- The feature statistics of the anchor label and its adjacent bins are highly similar. The anchor label of 30 happens to lie in a region with plenty of training data, which shows that when there is enough data, feature statistics are similar at nearby labels.
- In regions with little data, such as the 0-6 age range, the feature statistics are nonetheless highly similar to those of the 30-year-old anchor. This unreasonable similarity is caused by data imbalance: because there is little data for ages 0-6, the features in this range inherit their prior from the range with the largest amount of data.
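A minimal sketch of that binning-and-statistics analysis (the feature dimensionality, bin width, and toy data are my assumptions):

```python
import numpy as np

def bin_feature_stats(features, labels, edges):
    """Group samples into label bins and compute per-bin feature
    mean and variance."""
    idx = np.clip(np.digitize(labels, edges) - 1, 0, len(edges) - 2)
    return {b: (features[idx == b].mean(axis=0), features[idx == b].var(axis=0))
            for b in np.unique(idx)}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Toy setup: 512-d encoder features, ages 0-99, 1-year-wide label bins.
feats = np.random.randn(5000, 512)
ages = np.random.randint(0, 100, size=5000)
stats = bin_feature_stats(feats, ages, np.arange(0, 101))

anchor = 30  # anchor bin, as in the paper's figure
mean_sims = {b: cosine(stats[anchor][0], stats[b][0]) for b in stats}
print(mean_sims[29], mean_sims[75])  # adjacent bin vs. distant bin
```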
Feature distribution smoothing: inspired by these observations, the author proposes feature distribution smoothing (FDS). FDS smooths the distribution over the feature space, essentially transferring feature statistics between adjacent bins. Its main purpose is to calibrate potentially biased estimates of the feature distribution, especially for target values with few samples.
Specifically, consider a model in which f is an encoder that maps the input to hidden-layer features z, and g is a predictor that outputs the continuous target value. FDS first estimates the feature statistics of each bin; here the covariance of the features is used instead of the variance, to capture the relationships among the elements of z. Given these statistics, a symmetric kernel function k is again used to smooth the distributions of the feature means and covariances across bins, yielding smoothed versions of the statistics. Using both the estimated and the smoothed statistics, the feature representation of each input sample is calibrated by the standard whitening and re-coloring procedure. The whole FDS process can then be integrated into a deep network by inserting a feature calibration layer after the final feature map. Finally, a momentum update in each epoch yields a more stable and accurate estimate of the feature statistics during training.
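A simplified single-pass sketch of this calibration, using diagonal variance instead of the full covariance the paper describes, and omitting the momentum update (all names here are mine):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def fds_calibrate(features, labels, edges, sigma=2.0, eps=1e-6):
    """Simplified FDS: estimate per-bin feature mean/variance, smooth the
    statistics across neighboring label bins with a Gaussian kernel, then
    whiten each sample with its bin's raw statistics and re-color it with
    the smoothed statistics."""
    num_bins = len(edges) - 1
    idx = np.clip(np.digitize(labels, edges) - 1, 0, num_bins - 1)
    d = features.shape[1]
    mu, var = np.zeros((num_bins, d)), np.ones((num_bins, d))
    for b in range(num_bins):
        z = features[idx == b]
        if len(z):
            mu[b], var[b] = z.mean(axis=0), z.var(axis=0)
    # Kernel-smooth the statistics along the label-bin axis.
    mu_s = gaussian_filter1d(mu, sigma=sigma, axis=0)
    var_s = gaussian_filter1d(var, sigma=sigma, axis=0)
    # Whiten with raw bin stats, re-color with smoothed stats.
    z_white = (features - mu[idx]) / np.sqrt(var[idx] + eps)
    return z_white * np.sqrt(var_s[idx] + eps) + mu_s[idx]
```

In the paper itself, this calibration sits as a layer after the final feature map, the statistics use the full covariance, and they are tracked with a momentum update across epochs rather than recomputed in a single pass as above.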
Benchmark DIR datasets
- IMDB-WIKI-DIR (vision, age): based on the IMDB-WIKI dataset; estimate age from images containing human faces.
- AgeDB-DIR (vision, age): based on the AgeDB dataset; likewise estimates age from an input image.
- NYUD2-DIR (vision, depth): based on the NYU Depth V2 dataset; a DIR task constructed for depth estimation.
- STS-B-DIR (NLP, text similarity score): based on the STS-B dataset; the task is to infer the semantic similarity score between two input sentences.
- SHHS-DIR (healthcare, health condition score): based on the SHHS dataset; the task is to infer a person's overall health score.
The detailed experiments can be found in the paper, which is available at:
[Paper]: https://arxiv.org/abs/2102.09554