Abstract: CarbonData acts as an intermediary service between Apache Spark and the storage system, and provides 4 important functions for Spark.

This article is shared from the Huawei Cloud community post "Make Apache Spark better with CarbonData"; original author: a big data practitioner.

Spark is undoubtedly a powerful processing engine and a distributed cluster computing framework for fast processing. Unfortunately, Spark falls short in a few areas. Combining Apache Spark with Apache CarbonData can overcome these shortcomings:

  1. No support for ACID transactions
  2. No schema enforcement
  3. Small file problem
  4. Inefficient data skipping

What is ACID?

ACID stands for Atomicity, Consistency, Isolation, and Durability: the four properties of a reliable transaction.

Spark and ACID

ATOMICITY

The A in ACID stands for atomicity. Basically, it means all or nothing: either everything succeeds or everything fails. Therefore, when you use the Spark DataFrame writer API, it should either write the complete data or no data at all. Let's take a quick look at the Spark documentation: "It is important to realize that these save modes (overwrite) do not utilize any locking and are not atomic. Additionally, when performing an Overwrite, the data will be deleted before writing out the new data."

Although this sounds a bit scary, it is actually not that bad. The Spark DataFrame writer API performs a job-level commit internally, which helps achieve a degree of atomicity; with the "append" mode this works through Hadoop's FileOutputCommitter. However, the default implementation brings a performance overhead, especially when using cloud storage [S3/OBS] instead of HDFS.

We can now run the following code to prove that a Spark overwrite is not atomic and can cause data corruption or loss. The first part of the code mimics job 1: it creates 100 records and saves them in the ACIDpath directory. The second part of the code mimics job 2: it tries to overwrite the existing data but throws an exception partway through the operation. The result of these two jobs is data loss: in the end, we lose the data created by the first job.
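The snippet below is a minimal Scala sketch of those two jobs (the original article shows them only as a screenshot; the output path, the row counts, and the failure-injecting UDF are illustrative):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().appName("OverwriteAtomicity").master("local[*]").getOrCreate()
import spark.implicits._

val path = "/tmp/ACIDpath"   // illustrative output directory

// Job 1: write 100 records and commit successfully.
(1 to 100).toDF("value").write.mode(SaveMode.Overwrite).parquet(path)

// A UDF that throws partway through the data, simulating a job that fails mid-write.
val failMidway = udf { v: Int =>
  if (v > 50) throw new RuntimeException("simulated failure")
  v
}

// Job 2: overwrite the same path, but fail before the job-level commit.
try {
  (1 to 200).toDF("value")
    .withColumn("value", failMidway(col("value")))
    .write.mode(SaveMode.Overwrite).parquet(path)
} catch {
  case e: Exception => println("Job 2 failed: " + e.getMessage)
}

// Overwrite already deleted the old files before job 2 could commit, so the
// 100 rows written by job 1 are gone: this read finds no committed data.
spark.read.parquet(path).show()
```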

Due to the exception, the job-level commit does not happen, so no new files are saved, and because Spark already deleted the old files, we lose the existing data. The Spark DataFrame writer API is not atomic, but it behaves like an atomic operation for append operations.

CONSISTENCY

Distributed systems are usually built on machines with lower availability, and consistency is a key concern in highly available systems. A system is consistent if all nodes see and return the same data at the same time. There are several consistency models; the ones most commonly used in distributed systems are strong consistency, weak consistency, and eventual consistency. We saw that the overwrite mode of the Spark writer API first deletes the old files and then writes the new ones. Between these two states there is a window of time in which no data is available, and if the job fails during that window, we lose data. In other words, there is no smooth transition between the two states. This is the typical atomicity problem of the Spark overwrite operation, and it also breaks the consistency of the data. The Spark overwrite mode therefore does not support consistency.

Isolation and Durability in Spark

Isolation means separation: an operation is isolated from any other concurrent operation. Suppose we are writing to a data set that has not yet been committed, and another concurrent process is reading or writing the same data set. According to the isolation property, the second process should not be affected. A typical database offers different isolation levels, such as read committed and serializable. Although Spark has task-level commits and job-level commits, because write operations lack atomicity, Spark cannot provide proper isolation.

Finally, durability means that committed state/data is saved by the system so that, even in the event of a failure and system restart, the data is available in its correct state. Durability is provided by the storage layer; in the case of a Spark application, that is the role of HDFS or S3/OBS. However, since Spark cannot perform a proper commit due to the lack of atomicity, we cannot expect durability without a proper commit.

If we look closely, all of these ACID properties are interrelated. Because of the lack of atomicity we lose consistency and isolation, and because of the lack of isolation we lose durability.

Lack of Schema Enforcement

We know that Spark applies the schema on read. Therefore, when we write data, it does not throw an exception even if the schema does not match. Let's try to understand this with an example. Suppose we have an input CSV file containing a handful of records with an integer "Cost" column; the program below reads the CSV and converts it to a DataFrame.
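The input records and the program only appear as screenshots in the original article; the sketch below recreates the flow with an illustrative headerless CSV file whose third column (_c2, the "Cost" column) contains integers:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("SchemaOnRead").master("local[*]").getOrCreate()

// Illustrative first input file, e.g. lines such as:  1,ProductA,100
val csv1 = "/tmp/batch1.csv"
val df1  = spark.read.option("inferSchema", "true").csv(csv1)
df1.printSchema()                       // _c2 is inferred as integer

// Write the data back out in Parquet format and display it.
val out = "/tmp/schema_demo_parquet"
df1.write.mode(SaveMode.Overwrite).parquet(out)
spark.read.parquet(out).show()
```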

The program reads from the CSV file, writes the data back in Parquet format, and displays it without any issue.

Now let's read another input CSV file, in which the "Cost" column has decimal values instead of integers, and append it to the Parquet output above.
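Continuing the sketch above (the second input file is again illustrative, with decimal values in its third column):

```scala
// Illustrative second input file, e.g. lines such as:  4,ProductD,100.5
val csv2 = "/tmp/batch2.csv"
val df2  = spark.read.option("inferSchema", "true").csv(csv2)
df2.printSchema()                       // _c2 is now inferred as double

// The append succeeds: Spark performs no schema validation on write.
df2.write.mode(SaveMode.Append).parquet(out)   // "out" is the Parquet directory from the previous sketch

// The problem only surfaces on read, when an action is executed.
spark.read.parquet(out).show()          // fails: the Parquet files disagree on the type of _c2
```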

In this case, our program reads the CSV and writes it to the Parquet files without any exception. Only when we try to display the DataFrame does the program throw an error.

This is because Spark never validates the schema during write operations. The schema of the "Cost" column was inferred as integer during the first load, and during the second write Spark appends the double values without any problem. Only when we read the appended data and trigger an action does it raise an error because of the incompatible schemas.

How to overcome the above drawbacks of Spark

If we plug CarbonData into Apache Spark as an additional storage layer, we can manage the problems described above.

What is CarbonData

Since the Hadoop Distributed File System (HDFS) and object stores behave like file systems, they are not designed to provide transactional support. Implementing transactions in a distributed processing environment is a challenging problem; for example, implementations typically have to consider locking access to the storage system, which comes at the cost of overall throughput. Storage solutions such as Apache CarbonData address these ACID requirements of the data lake by pushing the transaction semantics and rules into the file format itself, or into a combination of metadata and file format. CarbonData acts as an intermediary service between Apache Spark and the storage system, and takes over responsibility for ACID compliance. The underlying storage system can be anything: HDFS, Huawei OBS, Amazon S3, or Azure Blob Storage. Several important capabilities CarbonData provides for Spark are listed below (a short sketch of how they surface in Spark SQL follows the list):

  1. ACID transactions.
  2. Schema enforcement/Schema validation.
  3. Enables Updates, Deletes and Merge.
  4. Automatic data indexing.
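As a rough illustration of how these capabilities surface in Spark SQL (a hedged sketch only: the exact DDL/DML syntax, such as STORED AS carbondata and the parenthesised UPDATE form, depends on the CarbonData version, and a SparkSession created with the CarbonData extensions is assumed):

```scala
// Create a CarbonData-backed table (CarbonData 2.x style DDL).
spark.sql("CREATE TABLE IF NOT EXISTS sales (id INT, product STRING, cost INT) STORED AS carbondata")

// Load data through the usual INSERT path.
spark.sql("INSERT INTO sales SELECT 1, 'ProductA', 100")

// Updates and deletes, which plain Parquet directories do not support.
spark.sql("UPDATE sales SET (cost) = (120) WHERE id = 1")
spark.sql("DELETE FROM sales WHERE id = 1")
```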

CarbonData in Apache Spark: ACID

Recall the code snippet from the atomicity section above: the first part imitates job 1, creating 100 records and saving them in the ACIDpath directory, while the second part imitates job 2, which tries to overwrite the existing data but throws an exception partway through the operation.

The result of those two jobs was data loss: we lost the data created by the first job. Now let us change the code shown below to use CarbonData.
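A sketch of the same two jobs rewritten against a CarbonData table. It assumes the CarbonData datasource is registered under the "carbondata" format name with a tableName option (the exact options vary by CarbonData version), and it reuses the SparkSession, implicits, and illustrative failMidway UDF from the earlier sketch:

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

val tableName = "acid_demo"   // illustrative table name

// Job 1: write 100 records into a CarbonData table.
(1 to 100).toDF("value")
  .write
  .format("carbondata")
  .option("tableName", tableName)
  .mode(SaveMode.Overwrite)
  .save()

// Job 2: overwrite the same table, but fail partway through (failMidway throws after 50 rows).
try {
  (1 to 200).toDF("value")
    .withColumn("value", failMidway(col("value")))
    .write
    .format("carbondata")
    .option("tableName", tableName)
    .mode(SaveMode.Overwrite)
    .save()
} catch {
  case e: Exception => println("Job 2 failed: " + e.getMessage)
}

// The failed job never gets a successful entry in the tablestatus file,
// so reads still see the 100 rows committed by job 1.
spark.read.format("carbondata").option("tableName", tableName).load().count()   // 100
```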

Perform the first job and count the number of rows. As expected, you will get 100 rows.
If you check the data directory, you will see a snappy-compressed CarbonData file. The data file stores the 100 rows in a columnar, encoded format. You will also see a metadata directory containing a tablestatus file. Now execute the second job. What do you expect from this second job? As mentioned earlier, this job should try to do the following things:

  1. Delete the previous file.
  2. Create a new file and start writing records.
  3. Throw a runtime exception in the middle of the job.

Because of the exception, the job-level commit will not happen; without CarbonData, as we observed above, this is exactly how we lost the existing data.

But now, if you execute the second job, you will still get an exception. Then count the number of rows: the output is 100, and the old records are not lost. It looks like CarbonData has made the overwrite atomic. If you look at the data directory, you will find two CarbonData files.

One file was created by the first job, and the other was created by job 2. Instead of deleting the old file, job 2 directly created a new file and started writing data into it. This approach leaves the old data state untouched, which is why we did not lose the old data: the old files remain unchanged. The new, incomplete file is also there, but its data is never read. This logic is hidden in the metadata directory and managed using the tablestatus file. The second job could not create a successful entry in the tablestatus file because it failed in the middle, and the read API does not read files whose entries in the tablestatus file are marked as deleted.

This time, let us write the code without the exception and overwrite the old 100 records with 50 records.
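Continuing the sketch, the overwrite now runs to completion:

```scala
// Overwrite the old 100 rows with 50 new rows, with no simulated failure this time.
(1 to 50).toDF("value")
  .write
  .format("carbondata")
  .option("tableName", tableName)
  .mode(SaveMode.Overwrite)
  .save()

spark.read.format("carbondata").option("tableName", tableName).load().count()   // 50
```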

Now the record count shows 50, as expected: you have overwritten the older data set of 100 rows with a new data set of 50 rows.

CarbonData brings metadata management to Apache Spark and makes the Spark DataFrame writer API atomic, which solves the data consistency problem. Once the consistency issue is resolved, CarbonData can also provide update and delete functionality.

Spark With CarbonData: Schema Enforcement

Let us consider a simple scenario in which data arrives in multiple batches for conversion. For simplicity, assume there are only two batches, and that the second batch carries a "Cost" column whose type (double) differs from the first batch (integer).

To start the experiment, let's read the first batch of data and write it both with and without CarbonData, using the "Overwrite" mode in both cases.
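A sketch of this step, continuing the session from the earlier sketches, with the same caveats about the CarbonData datasource options; the input path and table name are illustrative:

```scala
// Read the first batch; the third column (_c2, the "Cost" column) is inferred as integer.
val batch1 = spark.read.option("inferSchema", "true").csv("/tmp/batch1.csv")

// Write it without CarbonData (plain Parquet) ...
batch1.write.mode(SaveMode.Overwrite).parquet("/tmp/schema_parquet")

// ... and with CarbonData.
batch1.write
  .format("carbondata")
  .option("tableName", "schema_carbon")
  .mode(SaveMode.Overwrite)
  .save()
```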

Now let's read the second batch, in which the cost column holds double values, and write the DataFrame to both the Parquet output and the CarbonData table (note: _c2 was inferred as integer, and we are now trying to append double values). Appending data that does not match the schema is no problem for Parquet, but when the program tries to append the same data to the CarbonData table, it throws an error.
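Sketched below (same assumptions as before): the Parquet append goes through silently, while CarbonData rejects the incompatible type at write time.

```scala
// Second batch: _c2 now holds double values.
val batch2 = spark.read.option("inferSchema", "true").csv("/tmp/batch2.csv")

// Parquet accepts the mismatched append without complaint.
batch2.write.mode(SaveMode.Append).parquet("/tmp/schema_parquet")

// CarbonData validates the schema on write and throws instead of corrupting the table.
batch2.write
  .format("carbondata")
  .option("tableName", "schema_carbon")
  .mode(SaveMode.Append)
  .save()                               // fails: double _c2 is incompatible with the table's integer column
```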

Therefore, based on the above experiment, we can see that CarbonData validates the schema before writing to the underlying storage; in other words, CarbonData enforces schema validation on write. If the types are not compatible, CarbonData cancels the transaction. This helps track the problem down at the start, instead of mixing bad data with good data and then trying to find the root cause.

English link: https://brijoobopanna.medium.com/making-apache-spark-better-with-carbondata-d37f98d235de

Author: Brijoobopanna
