Recently, the first session of Getui's 2022 TechDay "Numerics Training Camp" live broadcast series was successfully held. A senior big data R&D engineer at Getui walked the audience through the history of the data warehouse and the common methods of data modeling.
This article summarizes the key takeaways from the first session, "Data Warehouse and Dimensional Modeling", and collects selected questions from the live broadcast room into a Q&A, taking you through a review of the first course.
Click to watch the live playback video >> https://mp.weixin.qq.com/s/NK9AvOQGVDlmvuhPqjIasg
1. A Quick Introduction to the Data Warehouse
The data warehouse (Data Warehouse, or "数仓" for short in Chinese) is a concept that no big data practitioner can avoid. Bill Inmon, the "father of the data warehouse", first proposed the concept and defined it as follows: "A data warehouse is a subject-oriented, integrated, relatively stable collection of data that reflects historical changes and is used to support management decisions."
In addition, big data architecture expert Ralph Kimball defines the data warehouse in his book "The Data Warehouse Toolkit": a data warehouse is a system that extracts, cleans, and conforms data from source systems, delivers it to dimensional data stores, and provides query and analysis capabilities in support of decision-making.
Bill Inmon's definition emphasizes the overall characteristics of a data warehouse, while Ralph Kimball defines it from the perspective of the implementation process. Whichever definition we adopt, it is clear that building a data warehouse matters greatly to an enterprise. By building one, an enterprise can centrally manage the data scattered across its business systems and break down data silos, while also laying the foundation for efficient downstream analysis and application of data, empowering business growth through big data.
2. Data Warehouse Construction and Data Modeling
So how does an enterprise build a data warehouse? How can it build one that is efficient, stable, easy to use, and fits its business needs? This comes down to the choice of data model and the data modeling approach.
"Data modeling" refers to the process of data description and abstraction of entities and the relationships between entities and entities. A "data model" refers to a method of organizing and storing data.
At present, there are two mainstream data modeling methods, namely paradigm modeling and dimensional modeling:
Paradigm modeling <br>Paradigm modeling was proposed by Bill Inmon. It abstracts subjects from an enterprise-wide perspective, generally using the ER (entity-relationship) model, which represents things and events in terms of "entities", "attributes", and "relationships". Paradigm modeling does not abstract entity relationships for a specific business process; it requires modelers to have a comprehensive, holistic understanding of the enterprise's business and data. As a result, it not only has a long implementation cycle but also demands a great deal of the modelers.
Dimensional modeling <br>Dimensional modeling was proposed by Ralph Kimball. It advocates building models starting from analysis and decision-making needs, so that the model serves analysis. It therefore focuses on letting users complete data analysis quickly while keeping good response performance for large-scale, complex queries. Compared with paradigm modeling, dimensional modeling has a shorter construction cycle, supports agile iteration, and generally avoids overly complex data warehouse architecture designs.
When building a data warehouse, we need to choose the modeling method that suits the specific data analysis scenario and business processing system. An OLTP (On-line Transaction Processing) system, for example, is mainly oriented toward random reads and writes and cares about transactional integrity, so for enterprises using OLTP systems and traditional databases we recommend designing the data model with paradigm modeling, which addresses data redundancy and consistency issues in transaction processing. An OLAP (On-line Analytical Processing) system, by contrast, is oriented toward batch reads and writes and does not focus on transactional consistency; it mainly cares about data integration and performance on large-scale queries and processing, so dimensional modeling is the recommended approach.
How do we perform paradigm modeling and dimensional modeling? Let's look at each in turn.
3. Paradigm Modeling Method and Example Analysis
Let's first look at the basic process of paradigm modeling.
When carrying out paradigm modeling, we design the model according to different normalization requirements, and these requirements are the "normal forms" (the "paradigms" in paradigm modeling). The industry recognizes several levels, such as first normal form (1NF), second normal form (2NF), and third normal form (3NF). The higher the normal form, the less redundancy in the database, but the more complicated data computation becomes. Most enterprises model to third normal form, which reduces the complexity of data processing while preserving flexibility and acceptable computation speed.
The process of paradigm modeling can be broken down into the following four steps:
- Abstract the subjects
- Sort out the relationships between subjects
- Sort out the attributes of the subjects
- Draw the ER diagram
For example, we want to use paradigm modeling to design the data model of a course management system.
The system manages data about a school's teachers, students, and courses, covering course selection, exam scores, teaching assignments, class membership, and so on. First, we sort out the entities: teachers, courses, students, and classes. Second, we sort out the relationships between entities: teachers teach courses, students take elective courses, students belong to classes, and so on. Third, we list the attributes of the entities and relationships; for example, the entity "student" has attributes such as name, gender, and age, and the relationship "student takes an elective course" has attributes such as enrollment time and total class hours. Fourth, we draw the ER diagram, where rectangles represent "entities", diamonds represent "relationships", and ovals represent "attributes", clearly and visually showing how the subjects relate to one another.
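To make this concrete, here is a minimal sketch in Python of how the entities and relationships above might land as normalized (3NF-style) table structures. The class and field names are illustrative assumptions, not part of the course material; in a real system these would be relational tables.

```python
from dataclasses import dataclass

# Entity tables: in 3NF, every non-key attribute depends only on the key.
@dataclass
class Teacher:
    teacher_id: int     # primary key
    name: str

@dataclass
class SchoolClass:
    class_id: int       # primary key
    class_name: str

@dataclass
class Student:
    student_id: int     # primary key
    name: str
    gender: str
    age: int
    class_id: int       # foreign key -> SchoolClass ("student belongs to class")

@dataclass
class Course:
    course_id: int      # primary key
    title: str
    teacher_id: int     # foreign key -> Teacher ("teacher teaches course")

# A many-to-many relationship gets its own table and carries its own attributes.
@dataclass
class Enrollment:       # "student takes an elective course"
    student_id: int     # foreign key -> Student
    course_id: int      # foreign key -> Course
    enrolled_at: str    # elective time
    total_hours: int    # total class hours
```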
4. Dimensional Modeling Method and Example Analysis
Compared with paradigm modeling, dimensional modeling is slightly more complicated: it involves both fact tables and dimension tables.
Fact table
Let's look at the fact table first. There are three types of fact tables: transactional fact tables, periodic snapshot fact tables, and accumulating snapshot fact tables.
A transactional fact table typically uses one record to represent an event or behavior that occurred at a point in time. For order payment in an e-commerce scenario, for example, a transactional fact table is generally used to organize and store the data.
A record in a periodic snapshot fact table describes the state of an entity over a certain period of time. A customer's monthly point balance, for example, is a typical periodic snapshot fact record.
A record in an accumulating snapshot fact table accumulates the multiple events that occur over the course of a business process, generally to support statistics on the efficiency of each step in the process.
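As an illustration of the three fact-table types, here is a hedged sketch of what one record of each might look like, using plain Python dicts; every field name and value here is an assumption made up for illustration.

```python
# One illustrative record for each of the three fact-table types.

# 1. Transactional fact table: one row per event at a point in time.
payment_fact_row = {
    "order_id": 1001,
    "user_id": 42,
    "paid_at": "2022-06-01 10:23:00",
    "payment_amount": 199.0,
}

# 2. Periodic snapshot fact table: one row per entity per period,
#    describing its state over that period.
point_balance_row = {
    "user_id": 42,
    "period": "2022-06",        # monthly snapshot
    "point_balance": 1280,
}

# 3. Accumulating snapshot fact table: one row per process instance,
#    with a milestone column filled in as each step of the process completes.
order_lifecycle_row = {
    "order_id": 1001,
    "ordered_at": "2022-06-01 10:20:00",
    "paid_at": "2022-06-01 10:23:00",
    "shipped_at": "2022-06-02 09:00:00",
    "received_at": None,        # this step has not happened yet
}
```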
Let's walk through the design of a transactional fact table as an example of how fact tables are designed:
- Select the business processes relevant to the analysis needs. A "business process" is an indivisible behavioral event within a larger flow. In an e-commerce scenario, for example, the shopping flow includes adding to cart, placing an order, payment, merchant shipment, and the user confirming receipt. If we want to analyze sales, "payment" is a business process we must select.
- Declare the granularity. We should choose the finest granularity possible and precisely define the business meaning of each row in the fact table, to preserve maximum flexibility. For example, a user may purchase multiple products in one order, and each purchased product is a sub-order; we generally declare the sub-order as the granularity.
- Determine the dimensions. A dimension describes the environment in which the business process takes place. For example, if a user buys a product in a certain store at a certain time, the store's industry and the product's category can both serve as dimensions.
- Determine the facts, that is, the measures of the business process. For the "payment" business process, the measure is the payment amount; in more complex e-commerce scenarios, it may also include measures such as the apportioned shipping fee and the discount amount.
Note that every data warehouse contains one or more fact tables. A fact table holds the measurements of the analysis subject; it contains foreign keys to each dimension table and is connected to the dimension tables through joins.
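To show how a fact table joins its dimension tables through foreign keys, here is a small, self-contained sketch using Python's built-in sqlite3 module. The table and column names (fact_payment, dim_product, product_key, and so on) are assumptions for illustration; it mirrors the steps above, with "payment" as the business process, one row per paid order line as the granularity, the product as the dimension, and the payment amount as the fact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: one row per product, keyed by a surrogate key
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category     TEXT
    );
    -- Fact table: measures plus foreign keys to the dimension tables
    CREATE TABLE fact_payment (
        order_id       INTEGER,
        product_key    INTEGER REFERENCES dim_product(product_key),
        payment_amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'Keyboard', 'Electronics'),
                                   (2, 'Green Tea', 'Grocery');
    INSERT INTO fact_payment VALUES (1001, 1, 199.0),
                                    (1002, 2, 30.0),
                                    (1003, 1, 199.0);
""")

# Analyze the measure (payment amount) along a dimension attribute (category)
# by joining the fact table to the dimension table on the foreign key.
query = """
    SELECT d.category, SUM(f.payment_amount) AS total_paid
    FROM fact_payment f
    JOIN dim_product d ON f.product_key = d.product_key
    GROUP BY d.category
"""
for category, total_paid in conn.execute(query):
    print(category, total_paid)  # e.g. Electronics 398.0, Grocery 30.0
```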
Dimension table
A dimension table is the window through which users analyze data; it records the attributes, and the meanings of those attributes, for the transactions and events in the fact table.
The design process of the dimension table is mainly divided into the following four steps:
- Choose the dimension. For example, to generate a product dimension table, the dimension we choose is the product dimension.
- Determine the main dimension table. For the product dimension table, the main dimension table is the product table from the business system.
- Identify the related dimension tables. Once the main dimension table is determined, the related dimension tables follow; for the product dimension table, they include the product category table, the brand table, and the table of industries that products belong to.
- Determine the dimension attributes. These attributes generally come from the main dimension table and the related dimension tables. We integrate their attributes, merge duplicates (for example, if both the category table and the brand table carry an industry attribute, we merge it into one), and put the resulting attributes into the dimension table being generated, as the sketch below illustrates.
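Here is a minimal Python sketch of that attribute-integration step: attributes from the main dimension table and the related tables are merged into one product dimension row, with the duplicated industry attribute collapsed into a single column. All table and field names here are illustrative assumptions.

```python
# Main dimension table row (from the business system's product table).
main_product = {"product_key": 1, "product_name": "Keyboard",
                "category_id": 10, "brand_id": 7}

# Related dimension table rows; note both carry an "industry" attribute.
category = {"category_id": 10, "category_name": "Electronics",
            "industry": "Consumer Electronics"}
brand = {"brand_id": 7, "brand_name": "Acme",
         "industry": "Consumer Electronics"}

# Integrate the attributes and merge the duplicated "industry" into one column.
dim_product = {
    **main_product,
    "category_name": category["category_name"],
    "brand_name": brand["brand_name"],
    "industry": category["industry"],  # merged attribute, kept once
}
print(dim_product)
```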
In addition, this TechDay "Numerics Training Camp" session also elaborated on the basic principles of paradigm modeling and dimensional modeling, common problems in modeling (such as transitive dependencies in paradigm modeling and slowly changing dimensions in dimensional modeling), data warehouse layering, and more.
For more of the content, check out the replay of the live broadcast!
Bonus: selected Q&A from the live broadcast room
1. What kind of data warehouse can be considered relatively complete?
A data warehouse mainly serves data analysis, so a complete data warehouse must, first of all, meet the enterprise's business analysis needs.
Building a data warehouse is a process of continuous iteration and optimization that has to balance cost, performance, and efficiency. Generally speaking, when the business is in a stable stage, our work is mainly to maintain the existing data warehouse and make incremental improvements and optimizations; for enterprises whose business iterates quickly, exploratory analysis needs will require new development investment in the data warehouse.
2. In data warehouse design, the granularity analysts need changes considerably over time as people's understanding and requirements evolve. How can this be taken into account as much as possible in the early stage of data warehouse construction?
This requires the data warehouse team to do thorough requirements research, business research, and data research in the early stage.
When using dimensional modeling, strictly follow the design principles of fact tables: when declaring granularity, choose the finest granularity possible and retain as much dimensional information as possible, to keep the fact tables maximally flexible.
3. For the problem of slowly changing dimensions, can it be solved by turning the dimension table into a zipper table?
The zipper table is also a common solution; it belongs to the "insert a new row" family of approaches and is very similar to the insert-a-new-row example mentioned in the live broadcast.
The difference is that in the live broadcast example, the surrogate key "product key" is used as the field that joins to the fact table, so a dimension change is handled by issuing a new "product key"; a zipper table instead leaves the join field untouched and adds two fields, start and end, to mark the validity window of each data row. When using it, you first filter for the currently valid rows.
Relatively speaking, the surrogate-key approach is easier to understand and more convenient for users.
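For reference, here is a hedged Python sketch of the zipper-table idea described above: each dimension row carries start and end fields marking its validity window, and queries first filter for the rows valid at the time of interest. The field names and dates are made-up examples.

```python
from datetime import date

# Zipper-table-style dimension rows for one product whose category changed.
# An open-ended row is marked with a far-future end date (a common convention).
dim_product_zipper = [
    {"product_key": 1, "product_name": "Keyboard", "category": "Office",
     "start": date(2022, 1, 1), "end": date(2022, 5, 31)},
    {"product_key": 1, "product_name": "Keyboard", "category": "Electronics",
     "start": date(2022, 6, 1), "end": date(9999, 12, 31)},
]

def rows_valid_on(rows, as_of):
    """Return the dimension rows whose validity window covers the given date."""
    return [r for r in rows if r["start"] <= as_of <= r["end"]]

# Before analyzing, filter for the rows valid at the time of interest.
print(rows_valid_on(dim_product_zipper, date(2022, 3, 1)))   # category 'Office'
print(rows_valid_on(dim_product_zipper, date(2022, 6, 15)))  # category 'Electronics'
```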
4. On data indicators: when there are multiple business departments, how should indicators with the same or similar names be managed? How should indicators with the same name but different definitions be handled?
This comes down to the indicator naming conventions in the data warehouse construction standards. The problem can be avoided through naming, for example by including the department dimension in the indicator name.
In addition, the follow-up courses of our "Numerics Training Camp" will cover this in more detail, so stay tuned.
Recommended reading
When a company makes the strategic decision to provide cloud computing and big data services, how does it decompose that strategy step by step and finally land it? This involves a series of activities such as technology construction, operations management, and organizational capability building. What methodologies and practices are worth borrowing? We believe this book can bring you inspiration!
**Follow the "Getui Technology Practice" WeChat official account
and reply "数仓" ("data warehouse") in the background
to get the courseware for this live broadcast~**