
Introduction: Super Symmetry Technologies has released BigBang Transformer [Qianyuan] (BBT), a 1-billion-parameter financial pre-trained language model. Built on a time series-text cross-modal architecture, the BBT large model is trained jointly on text and time series data. Its downstream task accuracy is nearly 10% higher than T5 models of the same scale, and its R2 score on time series forecasting is greatly improved. The cross-modal architecture enables the language model to identify changes in time series data and to analyze and interpret its findings in human language. The BBT model can be used for factor mining in quantitative financial investment to support multi-factor strategies, as well as for large-scale data visualization and time series analysis in the Internet of Things. The goal of the BBT model is a pre-trained large model with human-level analytical capability, building a general artificial intelligence architecture that can be deployed in industry.

Shortcomings of general-purpose large models

Language and multi-modal models with more than 100 billion parameters, such as OpenAI's GPT-3 and Google's LaMDA and PaLM, can approach or even surpass human performance on tasks such as writing, text generation, and dialogue. But these large models share some common flaws:

① Large models are pre-trained on general-purpose corpora and data. They perform well in general scenarios but show obvious deficiencies in specialized domains. That is why models such as GPT-3, Wudao (Enlightenment), and Pangu are usually demonstrated with novels, poetry, or human-machine dialogue; when it comes to serious work scenarios, they deliver far less than they promise. So far there has been no large-scale industrial application of products built on large models, and the reasons still need to be explored. What is the capability boundary of a large model pre-trained only on general corpora, without industry data? If the Super Symmetry team can show that models trained on industry data are more accurate, does that mean the overall design of existing large models must be rethought before they can generalize across industries?

② Pre-trained multi-modal models such as DALL-E 2 have achieved striking results in text-to-image generation, but multi-modal models have made little progress on more practical and more complex modalities such as time series data and tabular or document data, which account for a large share of real work. Besides processing the three common modalities of language, speech, and images, the ability to read and analyze data is a prominent human skill; humans can process language and data in parallel to reach conclusions. Whether large models can also acquire this human ability to analyze data determines whether they can be widely applied in industrial scenarios.

Super Symmetry Technologies focuses on developing algorithms and data products for industries such as finance, media, and manufacturing. For applications in financial investment, the company designed and trained the large-parameter pre-trained language model BigBang Transformer Qianyuan (BBT). The current Base version has 220 million parameters and the Large version has 1 billion parameters. The team also released BBT-FinCUGE, an evaluation benchmark for pre-trained models in the financial industry, which is open-sourced on GitHub. The BBT model adopts the encoder-decoder structure of T5 so that it covers both NLU and NLG downstream tasks. The Super Symmetry team compiled a financial-industry dataset and built a Transformer-based architecture that jointly trains text and time series data across modalities.

Large models are one road toward artificial general intelligence (AGI), and Super Symmetry believes that the ability to analyze data is one of the foundations of AGI. Super Symmetry Technologies cooperates with Xiao Yanghua's Knowledge Works Laboratory at the School of Computer Science, Fudan University, Xu Renjun's laboratory at Zhejiang University, and faculty from Nankai University and the School of Artificial Intelligence at Beijing Normal University to advance AGI research in basic theory, architecture, and algorithm implementation, and to build the foundation for AGI's industrial application. This research was supported with computing infrastructure by the "Eastern Data, Western Computing" project in Gaotai, Gansu.

Taking Google's T5 framework as the reference baseline, experiments with the BBT model verify the following conclusions:

  • A large model pre-trained on domain-specific datasets improves average downstream task accuracy by nearly 10% compared with a T5 model of the same parameter scale.
  • The mixing proportions of corpora from different sources affect the accuracy of downstream tasks.
  • Prompt learning with Source Prompts keyed to the downstream task category can greatly improve the accuracy of downstream tasks.
  • The BBT time series model performs multivariate time series forecasting with a greatly improved R2 score compared with an ordinary Transformer.
  • Trained jointly on text and time series data, the model can relate changes in the numbers to the real world they describe.

A pre-training architecture focused on cross-modal joint training of time series and text

Traditional time series models usually rely only on the information in the series itself and ignore the dependence of time series data on external information. For example, the movement of stock prices or economic indicators at a given moment is not fully determined by the data before that moment. Language models have a strong ability to represent textual information. Combining a language model with a time series model not only lets world knowledge, in the form of text, support time series tasks, but also strengthens the language model's understanding of the text through the information contained in the time series data.

To this end, the Super Symmetry team designed a Transformer-based time series-text cross-modal pre-trained model, one of the earliest pre-training architectures in the industry focused on jointly training these two modalities. The pre-training objective is to predict the time series data at time T from the text and time series information before time T. Time series data and text data are fed into the encoder together at the embedding layer, pass through a bidirectional Transformer, and the output vectors are routed to three types of decoders: NLU, NLG, and time series.
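To make the routing concrete, here is a minimal PyTorch-style sketch of the idea described above: text tokens and time series steps are embedded into the same space, pass through one bidirectional Transformer encoder, and feed three separate heads. All module names, dimensions, and pooling choices are illustrative assumptions, not the released BBT implementation.

```python
# Illustrative sketch (not the released BBT code): a shared Transformer encoder
# over concatenated text-token and time-series embeddings, feeding three heads
# (NLU classification, NLG token prediction, time series regression).
import torch
import torch.nn as nn

class CrossModalEncoder(nn.Module):
    def __init__(self, vocab_size=32000, n_series=8, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)        # text token embedding
        self.ts_proj = nn.Linear(n_series, d_model)               # per-step projection of the multivariate series
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)     # bidirectional encoder shared by both modalities
        self.nlu_head = nn.Linear(d_model, 3)                     # e.g. sentiment classes
        self.nlg_head = nn.Linear(d_model, vocab_size)            # token logits for generation / MLM
        self.ts_head = nn.Linear(d_model, n_series)               # predict the series values at time T

    def forward(self, token_ids, series):                         # series: (batch, steps, n_series)
        h = torch.cat([self.text_emb(token_ids), self.ts_proj(series)], dim=1)
        h = self.encoder(h)
        text_len = token_ids.size(1)
        return (self.nlu_head(h[:, 0]),                            # pooled NLU output
                self.nlg_head(h[:, :text_len]),                    # NLG logits over text positions
                self.ts_head(h[:, -1]))                            # forecast for the next time step

# usage: nlu, nlg, ts = CrossModalEncoder()(torch.randint(0, 32000, (2, 16)), torch.randn(2, 32, 8))
```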

[Figure: BBT time series-text cross-modal architecture]

The BBT model includes a general module that vectorizes time for the embedding layer. A multivariate time series is affected by signal pulses along both the spatial and temporal dimensions. The activation range of these pulses forms a continuous spectrum that can be roughly divided into four types: low-frequency local pulses, low-frequency global pulses, high-frequency local pulses, and high-frequency global pulses. Here "low frequency"/"high frequency" describe the activation range from the temporal view, while "global"/"local" describe it from the spatial view.

  • "Low frequency" means that the pulse changes smoothly and tends to remain stable for a longer period of time;
  • "High frequency" means that the pulse changes sharply;
  • "Global" means that this impulse has a similar effect on all time series;
  • "Local" means that the pulse affects only a single time series, or exerts different effects on different time series.

Based on this, Super Symmetry proposes DWT-ST2Vec, a general, model-agnostic, learnable time-vectorization component that can be applied to a variety of model structures and downstream tasks. The component decomposes a sequence into high-frequency and low-frequency components along both the spatial and temporal dimensions, so the model can learn the sequence information more fully.
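As a rough illustration of the frequency split that DWT-ST2Vec builds on, the sketch below uses a one-level discrete wavelet transform (PyWavelets) to separate each series into a low-frequency and a high-frequency part. The real component is learnable and also distinguishes global from local (spatial) effects, which this toy example does not attempt.

```python
# Toy illustration of splitting each series into low- and high-frequency parts with a
# one-level discrete wavelet transform (PyWavelets). DWT-ST2Vec itself is a learnable
# component that additionally separates global vs. local (spatial) effects.
import numpy as np
import pywt

def low_high_split(series: np.ndarray, wavelet: str = "db4"):
    """series: (n_steps, n_series). Returns low- and high-frequency reconstructions."""
    lows, highs = [], []
    for x in series.T:                                    # decompose each variable independently
        approx, detail = pywt.dwt(x, wavelet)             # approximation = low freq, detail = high freq
        lows.append(pywt.idwt(approx, None, wavelet)[: len(x)])
        highs.append(pywt.idwt(None, detail, wavelet)[: len(x)])
    return np.stack(lows, axis=1), np.stack(highs, axis=1)

prices = np.cumsum(np.random.randn(256, 3), axis=0)       # synthetic multivariate series
low, high = low_high_split(prices)
print(low.shape, high.shape)                              # (256, 3) (256, 3)
```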

[Figure: DWT-ST2Vec component]

The largest and most complete financial investment dataset, covering both academia and industry

The quality, quantity, and diversity of the corpus directly affect the result of language model pre-training. Existing Chinese financial pre-trained language models, such as FinBERT and NVIDIA's FinMegatron, were pre-trained on corpora of very limited size and diversity.

To better promote the development of Chinese financial natural language processing (NLP), Super Symmetry collected and crawled almost all publicly available Chinese financial corpus data: financial and economic news released by all mainstream media platforms over the past 20 years, announcements and financial reports of all listed companies, tens of millions of research reports published by research institutes and consulting firms, millions of books on finance, economics, politics, and other social sciences, announcements and documents from more than 40 central and local government websites, and posts by users of social media platforms. After cleaning and organizing, this became BBTCorpus, a large-scale Chinese financial corpus covering five categories with more than 300 GB of high-quality, diverse text and 80 billion tokens, the largest financial investment dataset to date. The size distribution is shown in Table 1.

Table 1: The size distribution of BBTCorpus corpus, in which the original files of public company announcements and research reports are in PDF format.

Innovative Pre-training Approaches Dramatically Improve Language Model Accuracy: Similarity Sampling and Source Prompt

To verify the effectiveness of domain corpus pre-training, the Super Symmetry team compared its model with t5-v1_1-base-chinese-cluecorpussmall, a model pre-trained on the general corpus CLUECorpus-small. The experimental results are shown in Table 2.

The Super Symmetry team made two innovative improvements to T5's pre-training method to address specific problems.

The first is a corpus-source similarity weighted sampling algorithm, proposed to address the corpus sampling problem. Because the corpus is so large that only about 10% of the text can be sampled during the whole pre-training run, the model must sample corpora from different sources. If all corpora are simply sampled at random, the sources are effectively mixed in proportion to their sizes: in the subset actually used for pre-training, the ratio of announcements : research reports : news : stock forum (Guba) : Xueqiu (Snowball) posts is about 105 : 11 : 30 : 74 : 44. The Super Symmetry team proposed that, instead of simple random sampling, weighting the sampling by the similarity between the texts in the evaluation benchmark and the corpora from different sources is a more reasonable choice. A model trained on a corpus subset drawn with these weights achieves an average improvement of 0.7% on the evaluation benchmark. The experimental results are shown in Table 2.
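The exact similarity measure and weighting scheme are not spelled out here, so the sketch below only illustrates the general idea under assumed choices: score each corpus source by TF-IDF cosine similarity to the evaluation benchmark, normalize the scores into a sampling distribution, and draw pre-training documents from sources according to those weights.

```python
# Hedged sketch of corpus-source similarity weighted sampling: weight each source by
# how similar its text is to the evaluation benchmark, then sample documents with
# those weights instead of purely in proportion to source size.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def source_weights(benchmark_texts, corpus_by_source):
    """corpus_by_source: dict mapping source name -> list of documents."""
    names = list(corpus_by_source)
    vec = TfidfVectorizer().fit(benchmark_texts + sum(corpus_by_source.values(), []))
    bench = vec.transform([" ".join(benchmark_texts)])
    sims = np.array([
        cosine_similarity(bench, vec.transform([" ".join(corpus_by_source[n])]))[0, 0]
        for n in names
    ])
    return dict(zip(names, sims / sims.sum()))            # normalize to a sampling distribution

def sample_documents(corpus_by_source, weights, k, rng=np.random.default_rng(0)):
    sources = rng.choice(list(weights), size=k, p=list(weights.values()))
    return [rng.choice(corpus_by_source[s]) for s in sources]
```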

This innovation is not only applicable to pre-training language models in the financial field; the idea can be extended to other fields with diverse, heterogeneous corpus sources, such as biomedicine and law. On this basis, the Super Symmetry team further scaled the model up to the Large level of one billion parameters. The experimental results are shown in Table 2.

Table 2: The score is the model's average score on the evaluation benchmark. T5-base stands for t5-v1_1-base-chinese-cluecorpussmall. "ss" denotes the team's first innovation, the corpus-source similarity weighted sampling algorithm. Both base models have 220 million parameters; the large model has 1 billion.

For the problem of mixing heterogeneous corpora, the Super Symmetry team also pioneered the Source Prompt (SP) method: during pre-training, a prompt indicating the source of each text is prepended to it.

For example, take the corpus text "According to the National Bureau of Statistics, in May 2022 the national consumer price rose by 2.1% year-on-year." During pre-training, the source prompt [News] is prepended, giving "[News] According to the National Bureau of Statistics, in May 2022 the national consumer price rose by 2.1% year-on-year.", after which MLM pre-training proceeds as usual. On the Base model, Source Prompt scores 3.21% higher than the similarity-sampling model.
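A minimal sketch of the Source Prompt step follows, assuming a simple word-level stand-in for the real subword MLM masking; the tag strings follow the [News] example above, and everything else is illustrative.

```python
# Minimal illustration of Source Prompt: prefix each pre-training example with a tag
# identifying its corpus source before normal MLM-style corruption. The masking here
# is a word-level placeholder for the real subword MLM objective.
import random

SOURCE_TAGS = {"news": "[News]", "announcement": "[Announcement]",
               "research_report": "[Research Report]", "forum": "[Forum]"}

def add_source_prompt(text: str, source: str) -> str:
    return f"{SOURCE_TAGS[source]} {text}"

def random_mask(text: str, mask_token: str = "[MASK]", ratio: float = 0.15) -> str:
    words = text.split()
    return " ".join(mask_token if random.random() < ratio else w for w in words)

example = add_source_prompt(
    "According to the National Bureau of Statistics, in May 2022 the national "
    "consumer price rose by 2.1% year-on-year.", "news")
print(random_mask(example))
```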

Table 3: The performance of different T5-base and BBT models on 8 downstream tasks.

The universal time-vectorization component DWT-ST2Vec can be plugged into different models

The basic capabilities of the BBT model for processing time series data include:

  • A general, model-agnostic, learnable time-vectorization component, DWT-ST2Vec, which feeds time into the Encoder as an embedding to be learned jointly with the text.
  • More accurate multivariate time series forecasting.
  • Decomposition of time series data along the "global-local", "period-trend", and "low frequency-high frequency" dimensions.
  • Through fused learning with text, the large model can generate text describing changes in time series data.

In one experiment, 40 domestic listed companies were randomly selected, with the opening-price time series as the main evaluation object. The first 4,000 points of each series since listing were used as the training set and points 4,000-4,200 as the test set, with MSE, RMSE, MAE, and MAPE as evaluation metrics. Taking the Transformer as the baseline, the trained model improves MSE, RMSE, MAE, and MAPE on the evaluation benchmark by 0.5%-2% on average.
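The evaluation protocol above is easy to reproduce in outline; the sketch below uses the 4,000/200 split from the text but substitutes synthetic data and a naive forecast for the actual model.

```python
# Sketch of the evaluation protocol described above: first 4000 points for training,
# points 4000-4200 for testing, scored with MSE / RMSE / MAE / MAPE.
import numpy as np

def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    err = y_pred - y_true
    mse = float(np.mean(err ** 2))
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "MAE": float(np.mean(np.abs(err))),
        "MAPE": float(np.mean(np.abs(err / y_true))) * 100,   # assumes no zero prices
    }

series = np.cumsum(np.random.randn(4200)) + 100    # placeholder for one stock's opening prices
train, test = series[:4000], series[4000:4200]
naive_forecast = np.full_like(test, train[-1])     # stand-in for the model's predictions
print(evaluate(test, naive_forecast))
```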

[Figures: forecasting results compared with the Transformer baseline]

By identifying stock price changes, BBT's time series-text cross-modal architecture can trigger the model's language generation ability to produce comments similar to those of analysts and retail investors.

Input stock price series:

[Figure: input stock price series]

Based on the massive amount of news it has learned from, the model can write comments in the style of a professional journalist, for example:

[Figures: model-generated journalist-style commentary]

It can also comment on market trends like a retail investor:

[Figures: model-generated retail-investor-style commentary]

The BBT time series-text cross-modal architecture enables the model to read a company's financial reports and news and write an analysis of the company's development trends; to learn a brand's multi-year sales data and product characteristics on e-commerce platforms, predict future product sales, and write targeted marketing reports; or to learn monitoring data from manufacturing equipment and write operations and maintenance fault reports that non-specialists can understand.

BBT-KG: a knowledge graph with dynamic source tracing

The Super Symmetry team constructed knowledge graphs covering 200,000 primary-market companies and 4,500 A-share listed companies in China for knowledge-enhanced language model learning. What distinguishes BBT-KG from other financial knowledge graphs on the market is that the team uses the language model to build dynamic relationships between news events and enterprises as well as causal relationships between events, so the model can judge the impact of new events on companies and the market and trace the sources of market fluctuations.
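As a toy picture of the structure this describes, the sketch below builds a tiny graph with event and company nodes, "causes" edges between events, and "affects" edges from events to companies, then walks the causal edges backwards to trace a source. Every node and relation shown is invented for illustration.

```python
# Toy illustration of the BBT-KG structure: events linked to the companies they affect,
# plus causal edges between events, so that market moves can be traced back to sources.
# All nodes and edges here are invented examples.
import networkx as nx

kg = nx.MultiDiGraph()
kg.add_node("Company A", type="listed_company")
kg.add_node("Typhoon in coastal region", type="event")
kg.add_node("Supply decrease", type="event")
kg.add_edge("Typhoon in coastal region", "Supply decrease", relation="causes")
kg.add_edge("Supply decrease", "Company A", relation="affects", polarity="negative")

# trace possible sources of an impact on Company A by walking causal edges backwards
for event in kg.predecessors("Company A"):
    chain = [event] + list(kg.predecessors(event))
    print("Company A <-", " <- ".join(chain))
```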

[Figures: BBT-KG knowledge graph examples]

Applying the BBT model to build new factors for quantitative investment and support multi-factor strategies

The Super Symmetry team uses the BBT model to compute a sentiment index for individual stocks, monitors sentiment changes between adjacent periods, and selects pronounced changes as long-short factors to construct a quantitative factor strategy whose final return far exceeds the market. Backtesting the sentiment index's strong stock-selection ability, the team found that the model effectively learns from financial texts, quantitatively reflects market information, and creatively provides alternative factors. Beyond computing market sentiment, the BBT model's multi-dimensional capabilities can be used elsewhere in finance.
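A hedged sketch of one way such a long-short factor could be formed: compute the change in each stock's sentiment index between adjacent periods, go long the largest increases and short the largest decreases. The quantile cutoffs, weighting, and rebalancing shown here are assumptions, not the team's actual strategy.

```python
# Sketch of a sentiment-change long-short factor: rank stocks by the change in their
# sentiment index between adjacent periods, long the top quantile, short the bottom.
# Quantile cutoffs and holding rules are illustrative assumptions.
import numpy as np
import pandas as pd

def sentiment_factor(sentiment: pd.DataFrame, quantile: float = 0.1) -> pd.DataFrame:
    """sentiment: rows = dates, columns = tickers, values = sentiment index."""
    change = sentiment.diff()                               # period-over-period sentiment change
    longs = change.ge(change.quantile(1 - quantile, axis=1), axis=0)
    shorts = change.le(change.quantile(quantile, axis=1), axis=0)
    weights = longs.astype(float) - shorts.astype(float)    # +1 long, -1 short, 0 otherwise
    return weights.div(weights.abs().sum(axis=1), axis=0).fillna(0.0)  # unit gross exposure

dates = pd.date_range("2022-01-03", periods=5, freq="B")
scores = pd.DataFrame(np.random.rand(5, 4), index=dates, columns=["A", "B", "C", "D"])
print(sentiment_factor(scores, quantile=0.25))
```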

For example, using BBT's event extraction capability, similar events or news can be extracted and compared with volume and price data to study how quickly different events are transmitted to the market. Through the team's proprietary financial knowledge graph, BBT can also learn the relationships between economic entities along supply chains, and machine learning can be used to remove collinearity between factors, bringing disruptive innovation to the traditional linear-regression multi-factor model.

In addition, BBT's ability to identify negative news can add real-time public opinion monitoring to credit risk assessment systems, and its news classification ability can help financial analysts quickly process large amounts of information and reach more comprehensive, objective conclusions.


Benchmark evaluation dataset: the first Chinese financial NLP evaluation dataset

Evaluation benchmarks play an important guiding role in the development of natural language processing (NLP). While research and application of Chinese financial NLP are booming, the industry has lacked an authoritative evaluation benchmark. To solve this problem, the Super Symmetry team proposed BBT-FinCUGE, open-sourced at:

GitHub.com/ssymmetry/BBT-FinCUGE-Application

This is a Chinese financial natural language understanding and generation benchmark with the following characteristics:

① Professionalism: financial experts are involved in the screening and labeling of all datasets.
② Practicality: all tasks are scored by financial experts for practicality, and these scores serve as the basis for task selection and final scoring.

The evaluation benchmark consists of the following eight datasets:

  • Forum Sentiment Analysis FinFE

In stock forums such as Guba and Xueqiu, retail investors produce large amounts of comment text every day, containing both emotional reactions and rational predictions about price movements. For these texts, this dataset requires the model to learn and predict a sentiment index (0, 1, 2, representing negative, neutral, and positive, respectively).

  • Event extraction FinQA

Event extraction automatically identifies events occurring in text, extracts event arguments, and organizes them into structured data; it covers the detection and argument extraction of corporate investment and financing, listings, acquisitions, and other events. (To better compare different models, the Super Symmetry team organized this dataset in a reading-comprehension question-answering (QA) format.)

  • Causal Event Extraction FinCQA

Unlike regular event extraction, causal event extraction focuses on identifying pairs of events in text that have a causal relationship, together with their event arguments, and organizing them into structured data. The team's causal event dataset covers causal events in the commodity sector. The event types include typhoons/earthquakes, supply increases/decreases, demand increases/decreases, and price increases/decreases, which may appear as either the cause or the effect, together with their correspondence and the related products, regions, and other arguments. (To better compare different models, this dataset is also organized in a reading-comprehension QA format.)

  • News text summaries FinNA

A Chinese financial news summarization task. The dataset is drawn from large-scale Chinese short news from Sina Finance and contains 20,000 real Chinese short texts and their corresponding abstracts.

  • Relation extraction FinRE

A manually fine-labeled dataset in the finance and economics domain. Given a sentence and its head and tail entities, the model must predict the relationship between them. The dataset is annotated on the Sina Finance news corpus, with commercial companies as the named entities, and defines 44 (bidirectional) relationship categories specific to finance, including ownership, shareholding, competition, acquisition, transaction, cooperation, and shareholding reduction.

  • Negative news identification and subject judgment FinNSP

This dataset contains two tasks:

Negative information determination: determine whether the text contains negative information about a financial entity. If the text contains no negative information, or contains negative information that does not involve a financial entity, the result is 0.

Negative subject determination: if task 1 finds negative information about financial entities, determine which entities in the entity list are the subjects of the negative information.

  • News classification FinNL

Categorize financial news into one or more categories related to its content. The news is sampled from Sina Finance. There are currently 14 categories: companies (individual stocks), industries (sectors), the broad market, China, international, economy, policy, futures, bonds, real estate, foreign exchange, virtual currency, COVID-19, and energy.

  • Event subject extraction

The main goal of this task is to extract the subjects of specified event types from real news corpora. Given a text T and the event type S to which the text belongs, extract from T the event subject of the specified type S. Input: a piece of text and an event type S; output: the event subject.

Developer services: Open APIs to developers in financial and non-financial industries to build a BBT large model developer ecosystem

The Super Symmetry team opens 11 API capabilities to developers in financial and non-financial industries to build a BBT large-model developer ecosystem. The first batch of open APIs includes: knowledge graph, article summarization, social media sentiment recognition, news sentiment recognition, news classification and labeling, named entity recognition, relation extraction, event extraction, event causality extraction, announcement extraction, and negative news and subject recognition.

API Documentation:

https://www.ssymmetry.com/newproduct/bbtlink
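The real request format is defined in the documentation linked above; the snippet below only shows the general shape of calling a hosted NLP capability over HTTP, with a hypothetical endpoint path, field names, and credential that are not taken from the actual API.

```python
# Hypothetical illustration of calling one of the hosted capabilities over HTTP.
# The endpoint path, field names, and authentication shown here are placeholders;
# consult the official API documentation for the real interface.
import requests

API_BASE = "https://www.ssymmetry.com/api"      # placeholder base URL
API_KEY = "your-api-key"                        # placeholder credential

def news_sentiment(text: str) -> dict:
    resp = requests.post(
        f"{API_BASE}/news_sentiment",           # hypothetical endpoint name
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"text": text},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

# print(news_sentiment("According to the National Bureau of Statistics, CPI rose 2.1% year-on-year in May 2022."))
```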

A cornerstone model for finance and economics

The goal of the BBT 1.0 model is to establish a unified artificial intelligence algorithm framework for financial investment: a Transformer-based architecture that can integrate the different modalities of data involved in financial investment, with a large-parameter pre-trained model trained on top of that unified architecture. As model parameters and training data continue to grow, the Super Symmetry team hopes to develop a model that approaches human-level intelligence in the financial domain.

As a cornerstone model for the financial field, BBT provides fine-tuning services for deep learning downstream tasks across financial investment, economic analysis, business consulting, and other scenarios. The financial investment field has a huge number of practitioners; large firms can afford to hire algorithm engineers, while small teams cannot afford even basic text extraction algorithms. As algorithmic infrastructure for finance, the BBT model equips all practitioners with tools of the same level, letting the whole industry compete for better investment strategies from the same starting line and thereby promoting a more efficient flow of information and factors in financial and economic markets.

Making the model understand numbers is the core capability of the time series-text cross-modal architecture developed for the BBT model, and it is one of the core capabilities of the general artificial intelligence that humans pursue. The model can identify changing patterns and regularities in massive time series data and, through the pre-trained language model, map them accurately to the real world, building a bridge between the world of data and the world of human language. This will bring change to a wide range of digital technologies, including business data analysis, data visualization, and database technology. The BBT model can be applied not only to finance but also to manufacturing, the Internet of Things, smart cities, and Internet big-data analysis, wherever time series data needs to be processed.

That's all for today's sharing, thank you all.

