Li Auto's technical architecture in the Hadoop era

First, let me briefly review the development of big data technology. Based on my personal understanding, it can be divided into four periods:

The first period: 2006 to 2008. Around 2008, Hadoop became a top-level Apache project and was moving toward its 1.0 release. Its foundation was largely defined by Google's "troika" of papers: GFS, MapReduce, and BigTable.

The second period: 2009 to 2013. Enterprises such as Yahoo, Alibaba, and Facebook adopted big data at increasing scale. At the end of 2013, Hadoop 2.x became generally available. I was fortunate to start working with big data in 2012, beginning with Hadoop 1.0 plus Hive. At the time it was amazing that, with just a few machines, big data could quickly solve problems that SQL Server or MySQL could not.

The third period: 2014 to 2019. Development was very fast in this period, during which Spark and Flink became top-level Apache projects. During this rapid climb we also tried Storm, which was later replaced by Flink.

The fourth period: 2020 to the present. After Hudi graduated to a top-level Apache project in 2020, my personal understanding is that the data lake entered a mature stage of its development, what I would call the data lake 2.0 stage. A data lake has three main characteristics: unified and open storage, open formats, and a rich set of compute engines.

Throughout this development, big data has been characterized by the four "V"s that are often cited: volume, velocity, variety, and value. A fifth "V", veracity, i.e. the accuracy and trustworthiness of the data, has since been added. Data quality has always been criticized, and I hope the industry will develop a set of standards to improve the quality of data lakes. That may be what marks the arrival of data lake 2.0: projects such as Hudi and Iceberg appeared precisely to improve data lake quality and to manage data lakes well.

In my view, Hadoop was once synonymous with big data, but big data is more than Hadoop. Big data is a set of solutions, formed by integrating multiple components over the course of its development, for processing and using large amounts of data. In recent years the general view has been that Hadoop is in decline: the merger and delisting of the Hadoop commercialization companies Cloudera and Hortonworks showed that the original business model could not be sustained, and the Hadoop ecosystem itself faces usability challenges and growing complexity.

The current architecture of Li Auto's big data platform

Li Auto's current big data platform is shown in the figure above. It relies heavily on open source components.

  • Transport layer: Kafka and Pulsar. We used Kafka exclusively in the early days of the platform, but Kafka's cloud-native capabilities are relatively weak, whereas Pulsar was designed for cloud-native architectures from the start and has capabilities well suited to IoT scenarios, which match our business. We therefore recently introduced Pulsar.
  • Storage layer: HDFS + JuiceFS.
  • Computing layer: the main compute engines are Spark and Flink, running on YARN. These engines are managed through Apache Linkis, which was open-sourced by WeBank and which we currently use heavily.
  • Databases: the three databases on the right are MatrixDB, a commercial time-series database; TiDB, which targets mixed OLTP/OLAP workloads but is mainly used for OLTP on our side; and StarRocks, which handles OLAP.
  • ShardingSphere: we want to use its Database Plus concept to unify the underlying databases behind a gateway layer. This is still exploratory, and there are many new features we are interested in.
  • Thanos: further to the right, Thanos is a cloud-native monitoring solution into which we have integrated the monitoring of components, engines, and machines.
  • Application layer: our four main data middle-platform products, covering data application, data development, data integration, and data governance.

Features

Looking at the current state of the big data platform, a few characteristics stand out:

  • First, the solution involves many components; users depend heavily on them, and the components depend heavily on one another. For future selections we recommend choosing the more mature cloud-native components wherever possible.
  • Second, our data has clear peaks and valleys. Travel scenarios generally peak in the morning and evening, and traffic is higher on Saturdays and Sundays.
  • Third, our data is only hot for a short time: we generally access only the last few days or the last week of data. However, a large volume of data is generated, and occasionally large-scale backfills are required, so the data also needs long-term storage and overall data utilization is low.

Finally, the entire data system currently lacks effective file-level management. Since the platform was built, HDFS has remained the main storage, and it holds a large amount of useless data that wastes resources; this is a problem we urgently need to solve.

Pain points of big data platforms

  • First, there are many components, deployment is difficult, and efficiency is low. There are more than 30 big data components in the Hadoop ecosystem, with as many as 10 in common use. Strong and weak dependencies among them make unified configuration and management very complicated.
  • Second, machine and maintenance costs are relatively high. For stable business operation, offline and real-time clusters are deployed separately, but given the peaks and valleys described above, overall utilization is not high, and many cluster components require dedicated personnel to manage and maintain.
  • Third, cross-platform data sharing is limited. Data shared across clusters can currently only be synchronized to other Hadoop clusters via DistCp; it cannot be synchronized easily and quickly to other platforms or servers.
  • Fourth, data security and privacy compliance. Ordinary users are managed through Ranger, but special security requirements can only be met by building separate clusters with their own VPC policies, which creates many data silos and high maintenance costs.

Li Auto's evolution toward cloud native, and some thoughts

First of all, let me briefly share my personal understanding of cloud native:

First, cloud native is derived from cloud computing. Cloud vendors such as Alibaba Cloud, AWS, Tencent Cloud, and Baidu Cloud initially provided IaaS-layer services, helping enterprises encapsulate and manage the most basic resources, storage, compute, and networking, in a unified way. Enterprises only need to apply for servers on the cloud, and those servers are still managed by the cloud vendor; that is traditional cloud operation.

Cloud native is inseparable from cloud computing. Broadly speaking, cloud native is a PaaS-layer service of cloud computing aimed mainly at developers. Cloud-native applications must run on the cloud; cloud native is a way of building and running applications based on cloud computing. "Cloud" refers to cloud computing; "native" means abandoning the traditional development and operations framework and, through containerization, DevOps, and microservice architecture, achieving elastic scaling and automated deployment, making full use of cloud resources to do the most with the least. It can also address some of the pain points of our current big data system, such as poor scalability and maintainability that demand a lot of manpower and time.

The figure above briefly lists several milestones in the development of cloud native.

  • In the first stage, AWS put forward the concept of cloud native and launched EC2 in 2006. This is the server stage, the cloud-computing stage mentioned above.
  • The second stage, the cloudification stage, came mainly after Docker was open-sourced and Google open-sourced Kubernetes. Kubernetes is a lightweight, extensible open source platform for managing containerized applications and services that enables automated deployment and scaling of applications.
  • In the third stage, the CNCF was founded in 2015 to promote the cloud-native concept and support the healthy development of the ecosystem. Later, Knative was open-sourced; one of its key goals is to define a cloud-native, cross-platform serverless orchestration standard. By now we have reached the cloud native 2.0 stage, i.e. the serverless stage. I personally believe big data should also move in the serverless direction; AWS's online services, for example, are basically all serverless.

Big Data Cloud Native Architecture

Next, let me introduce how the components of Li Auto's big data platform change after going cloud native:

  • Storage layer: after going cloud native, storage is basically all object storage. The architecture diagram also shows Lustre, which is described in detail below. The "cloud storage" layer mainly uses JuiceFS to manage object storage and the Lustre parallel file system. (Note: because of Lustre's single-replica limitation, we are also considering the parallel file system products offered by cloud providers.)
  • Container layer: compute, storage, and networking are all replaced by Kubernetes and Docker, and every component runs on top of them.
  • Components: for the big data computing frameworks, we may drop Hive and use Spark and Flink directly, with Hudi providing the underlying capabilities for data lake 2.0 and gradually replacing HDFS.
  • Middleware: besides Pulsar there is Kafka, whose cloud-native support is currently not very good; I personally lean toward replacing Kafka with Pulsar. Linkis has already been used to adapt all online Spark engines, and adaptation and integration of Flink will follow. ShardingSphere only began supporting cloud native in version 5.1.2, and we will verify scenarios and explore its capabilities as planned.
  • Database layer: still TiDB, StarRocks, and MatrixDB. All three now offer cloud-native capabilities and support object storage, but we have not tested this separately and still run them on physical machines, because the IO that object storage currently provides cannot meet database performance requirements and would significantly degrade overall performance.
  • Operations: Loki is added alongside Thanos, mainly for cloud-native log collection. But Loki and Thanos are only two pieces of the puzzle. Going forward, I think we should align with Alibaba's open source SREWorks and encapsulate quality, cost, efficiency, and security into a comprehensive operations capability, so that the whole cloud-native stack can be managed properly.
  • Observability: a recently popular concept in the cloud-native field. Some of the components we run predate the cloud-native trend; they were not born on the cloud and are only later being moved onto it, and so they run into problems, the first being the lack of comprehensive monitoring. We are considering how to develop a unified plan so that all components can be effectively monitored once they run cloud natively.

To sum up, I personally think the cloud-native future of big data comes down to:

  1. Unified use of cloud-native storage as the underlying storage for all components (including databases)
  2. All components run in containers
  3. Serving upper-layer applications using serverless architecture

However, this also poses a challenge for our current data platform products: how to design serverless capabilities into the products so that users can actually use them.

Advantages of Big Data Cloud Native

The first advantage is storage-compute separation and elastic scaling. With Hadoop deployed on physical machines, expanding or shrinking the cluster means contacting the vendor, and the cycle can be long; storage-compute separation solves this problem well.
Pay-as-you-go also means we no longer need to purchase idle resources. Our business data has clear peaks and troughs: we need machines at the peak and would like to release them in the trough, but today we cannot. We currently size all machines for the peak, so demand is met at peak and the system runs stably, yet the machines sit idle for at least 12 hours a day in the trough, and we still pay for those resources. After going cloud native, we no longer have to.

The second advantage is automated deployment and operability. Kubernetes supports DevOps-integrated deployment, so our components can be deployed quickly (for example, via a Helm chart), and component operations can be pushed down to the cloud-native platform, so the big data team no longer has to handle component operations itself.
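
As a small illustration of that deployment flow, a component can be rolled out idempotently through its Helm chart from a CI job; the chart repository URL and release name below are placeholders, not a specific recommendation, and this is only a sketch under the assumption that Helm is already installed on the runner.

```python
import subprocess

CHART_REPO = "https://charts.example.internal"   # placeholder chart repository URL
RELEASE = "juicefs-csi-driver"                   # example release; any chart works the same way

# Idempotent install/upgrade: Helm renders the chart's manifests and applies them
# to the cluster, so operating the component becomes a CI/GitOps step instead of
# hand-run deployment scripts on individual machines.
subprocess.run(["helm", "repo", "add", "bigdata-charts", CHART_REPO], check=True)
subprocess.run(["helm", "repo", "update"], check=True)
subprocess.run(
    ["helm", "upgrade", "--install", RELEASE, f"bigdata-charts/{RELEASE}",
     "--namespace", "kube-system", "--create-namespace"],
    check=True,
)
```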

The third advantage is object storage. Object storage is the core and most important product of cloud computing. Its benefits are self-evident: it is easy to scale, offers virtually unlimited capacity, and has a relatively low unit price. Object storage also provides tiers such as infrequent-access and archive storage, which further reduce costs and allow data to be kept longer. Controllable cost, high reliability, and low operational complexity are further advantages.

The fourth advantage is security and compliance. After going cloud native, dedicated namespaces, multi-tenant isolation, and remote authentication become possible. At present what we do is basically network-level isolation. The widely recognized solution for HDFS file management is Ranger: it manages HDFS directory permissions and can also manage permissions for HiveServer, HBase, Kafka, and so on, but these controls are relatively weak.

Another solution is Kerberos, which greatly improves the security of the big data components but comes at a high cost, since every request must be authenticated. We do not use it at present, which has to do with our cluster environment and scenarios: we run entirely on the intranet and do not provide external services. If your big data platform needs to serve the external network, you still need strong authentication; otherwise data is easily leaked.

Difficulties of Big Data Cloud Native

Big data cloud native also has its difficulties.

First, there are many big data components, while Kubernetes itself evolves quickly; once they are combined, compatibility, complexity, and scalability all become problems.

Second, the allocation and re-allocation of resources. Kubernetes is a general-purpose container scheduler and struggles to satisfy the resource-usage patterns of different big data components. In big data scenarios, resource usage is heavy, request frequency is high, and each job may start a large number of pods; there is currently no good solution for this. We are looking at Fluid, which also implements a runtime for JuiceFS and is something we will study in depth later. Fluid claims to support not only AI but also big data scenarios, which makes sense since both are data-intensive workloads, and it has made some breakthroughs in compute efficiency and data abstraction management.

Third, object storage has its own disadvantages: low metadata performance, poor compatibility with big data components, and eventual consistency.

Last but not least, data-intensive applications: plain storage-compute separation cannot meet the requirements of data-intensive applications such as big data and AI in terms of compute efficiency and data abstraction management.

Exploration and implementation of JuiceFS in big data cloud native solutions

We had been following JuiceFS before it was open-sourced and ran some proof-of-concept tests; once the open source version was released, we adopted it immediately. When going live we hit some permission issues and a few small bugs, and the community was very helpful and quickly resolved them all.

The reason for retiring HDFS was its poor scalability; at the same time, our data volume is large, so HDFS storage costs were high. After a few batches of data were stored, the physical machines ran out of space and considerable effort was needed to expand capacity. We were still in the early stages of the business and wanted to keep as much data as possible in order to extract as much value from it as possible. HDFS requires three replicas by default; we later reduced it to two, but two replicas are still risky.

On this basis we tested JuiceFS in depth and, once testing was complete, quickly introduced it into our online environment. Migrating some of the larger tables from HDFS to JuiceFS relieved the immediate pressure.

We value three points about JuiceFS:

  • First, multi-protocol compatibility. JuiceFS is fully compatible with the POSIX, HDFS, and S3 protocols; in our use so far it has been 100% compatible and we have not encountered any problems. (See the short sketch after this list.)
  • Second, cross-cloud capability. Once an enterprise reaches a certain scale, it will avoid relying on a single cloud provider in order to reduce systemic risk; it will not be tied to one cloud and will operate multi-cloud. In that situation, JuiceFS's ability to synchronize data across clouds comes into play.
  • Third, cloud-native scenarios. JuiceFS supports CSI. We do not use CSI in this scenario yet, we basically mount via POSIX, but CSI would be simpler and better integrated. We are also moving toward cloud native, although our components have not really landed on Kubernetes yet.
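
To illustrate the first point: on a machine where the volume is POSIX-mounted, the data is just files, and the same objects are reachable through the other protocols. The mount point and paths below are hypothetical; this is only a sketch of the idea.

```python
import os

# Hypothetical POSIX mount point of a JuiceFS volume.
MOUNT = "/mnt/jfs"

# Plain file APIs work unchanged on the mount...
demo_dir = os.path.join(MOUNT, "warehouse/ods/demo")
print(os.listdir(demo_dir))

with open(os.path.join(demo_dir, "part-00000.parquet"), "rb") as f:
    print(f.read(4))  # b'PAR1' for a Parquet file

# ...and the very same data can be addressed as jfs://<volume>/warehouse/ods/demo/...
# from Spark/Flink via the Hadoop SDK, or as S3 objects through the JuiceFS S3 gateway.
```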

How JuiceFS is used at Li Auto

Persisting data from HDFS to object storage

After JuiceFS was open-sourced, we began synchronizing data from HDFS to JuiceFS. We started with DistCp, which is very convenient together with JuiceFS's Hadoop SDK, and the overall migration went smoothly. We migrated data from HDFS to JuiceFS because of several problems.
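
As a rough illustration, such a migration can be driven with DistCp once the JuiceFS Hadoop SDK is on the cluster classpath so that the jfs:// scheme resolves. The source and destination paths and the volume name below are hypothetical; this is a sketch of the approach, not our exact job.

```python
import subprocess

# Hypothetical paths: replace with your own HDFS directory and JuiceFS volume.
SRC = "hdfs://nameservice1/warehouse/ods/vehicle_signals"
DST = "jfs://bigdata-volume/warehouse/ods/vehicle_signals"

# DistCp copies data in parallel as a MapReduce job; -update skips files that
# already match, -m bounds the number of map tasks.
subprocess.run(
    ["hadoop", "distcp", "-update", "-m", "50", SRC, DST],
    check=True,
)
```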

The first problem is that HDFS's coupled storage-compute design scales poorly, and there is no way around that. From the very beginning, my perception was that big data had to be deployed on physical machines rather than cloud hosts. The various EMR systems that cloud vendors later launched were essentially packaged Hadoop, and over the past year or two these EMR systems have gradually been de-Hadooped.

The second problem is that HDFS is hard to adapt to cloud native because it is relatively heavy. Although the community has been working on this, I personally think Hadoop's trend is downward and object storage should be the main focus in the future.

Third, object storage also has drawbacks: it does not adapt well to the HDFS API, its performance is far from that of local disks because of the network and other factors, and metadata operations such as listing directories are very slow. We accelerate it through JuiceFS, and the measured performance is impressive; with caching it is basically comparable to local disk. Based on this, we quickly switched the current scenarios directly to JuiceFS.

Platform-level file sharing

The second scenario is platform-level file sharing. The data of our scheduling system and real-time system, and the shared files of the development platform, are all currently stored on HDFS; if we stop using HDFS, these data must be migrated. The current solution is to connect JuiceFS to object storage and mount it in POSIX mode through an application-layer service, so that users can access files in JuiceFS transparently.
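
On the consuming side, a shared file then behaves like any local file under the mount point. The directory and file names below are hypothetical; the sketch only shows the publish-and-consume pattern.

```python
from pathlib import Path

# Hypothetical shared directory exposed by the platform's JuiceFS POSIX mount.
SHARED = Path("/mnt/jfs/shared/flink/udf-jars")

# Publishing a file: a plain copy onto the mount is all that is needed.
(SHARED / "my_udf.jar").write_bytes(Path("build/my_udf.jar").read_bytes())

# Any other machine that mounts the same volume sees the file through normal file APIs.
for jar in sorted(SHARED.glob("*.jar")):
    print(jar.name, jar.stat().st_size)
```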

JuiceFS meets most of our needs in this scenario, but a few small scenarios still have problems. The original idea was to also put the Python environments there, but in practice that proved too hard: Python environments contain huge numbers of small files, and loading them is still problematic. Scenarios like Python environments, with many fragmented files, still need to run from local disk; in the future we plan to attach block storage specifically for this.

Let me share a few problems we ran into with HDFS before:

First, when the NameNode is under heavy load or in a full GC, downloads fail. There is no perfect solution; ours is to increase memory as much as possible, or add retries when downloading packages to avoid the peak period. But it is hard to completely solve this with HDFS, because it is written in Java and GC pauses cannot be avoided.

Second, using HDFS across systems is difficult. For example, with two clusters, sharing files from a single cluster is basically unrealistic: the networks of the two clusters would have to be connected, or the files would have to go through an application layer, and in neither case can security be guaranteed. At present each of our two clusters maintains its own shared files independently. The real-time platform (the Flink platform) has now been switched to JuiceFS; the switch was very smooth and we have had no problems.

Third, we currently have a large number of physical-machine deployments, all single cluster and without a disaster recovery strategy; if the data center ever suffers a catastrophic failure, our entire service becomes unavailable. Object storage, by contrast, spans data centers within a region and keeps at least three copies, and the cloud vendor handles backup for us. In the future we may go multi-cloud and hope to share important files, core databases, and core backups across clouds through JuiceFS, with backups in multiple clouds and regions. That would solve the current single-point disaster recovery problem.

Cross-platform use of massive data

In another scenario, platforms share massive data with one another through JuiceFS. The first type of shared data is road test data: road tests upload large amounts of video, audio, and image data, which go directly into JuiceFS so that downstream systems can conveniently synchronize and share them. This includes data filtering, after which selected data is pulled into PFS, a parallel file system backed by SSDs; this keeps GPU utilization high, since the throughput of object storage alone is relatively weak and would otherwise waste a lot of GPU capacity.

The remaining data types include logs reported by vehicles for analysis, event tracking data, and vehicle signal data required by national platforms. These data go into the data warehouse for analysis; we also extract feature data from them for the algorithm team's model training, as well as for NLP retrieval and other scenarios.

Cloud-native storage acceleration: Lustre as a read cache (in testing)

What we are testing now is another scenario: attaching Lustre on top of the object storage layer as a read cache for JuiceFS, using the Lustre cache to improve JuiceFS's read speed and cache hit rate.

One advantage we have is that we run on physical machines, which have local disks that can be used to cache data. However, because computing tasks run across many nodes, the cache hit rate is not high: the community edition of JuiceFS does not yet support P2P distributed caching, only per-node local caching, and each node may read a lot of data. This also puts disk pressure on the compute nodes, because the cache occupies a certain amount of disk space.

Our current solution is to use Lustre as JuiceFS's read cache. Specifically, based on the amount of data to be cached, a Lustre file system of roughly 20-30 TB is mounted on the compute nodes, and the Lustre mount point is used as JuiceFS's cache directory. After JuiceFS reads data, it is cached asynchronously into Lustre. This effectively solves the low cache hit rate and greatly improves read performance.
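
As a minimal sketch of this wiring, pointing a JuiceFS mount at a directory on the Lustre file system is a matter of the --cache-dir option; the metadata URL, mount point, Lustre path, and cache size below are hypothetical.

```python
import subprocess

META_URL = "redis://meta.example.internal:6379/1"   # hypothetical metadata engine
MOUNT_POINT = "/mnt/jfs"
LUSTRE_CACHE_DIR = "/mnt/lustre/jfs-cache"          # directory on the shared Lustre mount

# --cache-dir tells the JuiceFS client where to keep its local block cache; here that
# "local" directory actually lives on Lustre, so cached blocks are visible to every
# compute node that mounts the same Lustre file system.
subprocess.run(
    [
        "juicefs", "mount", "--background",
        "--cache-dir", LUSTRE_CACHE_DIR,
        "--cache-size", "20480000",   # cache quota in MiB, sized to the Lustre capacity
        META_URL, MOUNT_POINT,
    ],
    check=True,
)
```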

If Spark writes directly to object storage, it runs into bandwidth and QPS limits, and if writes are too slow, upstream tasks may jitter. In this case JuiceFS's write cache can be used: write to Lustre first, then asynchronously upload to object storage. This works in some scenarios, but Lustre is not a cloud-native solution and is visible to users, who must explicitly run a mount command when starting a pod. We therefore hope to modify JuiceFS in the future so that it automatically recognizes object storage and Lustre and applies the caching itself, so users do not need to be aware of Lustre at all.
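
For jobs that go through the JuiceFS Hadoop SDK rather than a FUSE mount, the equivalent knobs are, to my understanding, the juicefs.cache-dir and juicefs.writeback options passed as Spark Hadoop configuration; treat the exact keys and values below as assumptions to verify against the current JuiceFS documentation, and the volume and paths as hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes the JuiceFS Hadoop SDK jar and juicefs.meta are already configured on the cluster.
spark = (
    SparkSession.builder
    .appName("jfs-writeback-sketch")
    .config("spark.hadoop.juicefs.cache-dir", "/mnt/lustre/jfs-cache")  # cache blocks on Lustre
    .config("spark.hadoop.juicefs.cache-size", "1024000")               # cache quota in MiB
    .config("spark.hadoop.juicefs.writeback", "true")                   # upload to object storage asynchronously
    .getOrCreate()
)

# Write-heavy output is buffered on Lustre first, then uploaded in the background.
spark.range(0, 1_000_000).write.mode("overwrite").parquet(
    "jfs://bigdata-volume/tmp/writeback_demo"
)
```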

The PoC of this solution is complete and the basic tests have passed; next we will run extensive stress tests in the production environment. We expect to launch it officially in Q3 this year to cover some edge services.

JuiceFS's overall solution for big data cloud native

As the architecture diagram of the overall solution shows, we currently use all three access methods provided by the JuiceFS client.

As shown in the left half of the figure, we have independent Spark and Flink clusters. JuiceFS is mounted into the whole cluster directly through the CSI Driver, so when users launch Spark or Flink they are not aware of JuiceFS at all, and the reads and writes of computing tasks go to object storage.
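
For illustration, with the JuiceFS CSI Driver installed, exposing the volume to Spark or Flink pods comes down to a StorageClass plus a PersistentVolumeClaim that the pods mount. The storage class name, namespace, and size below are hypothetical; this is a sketch of the wiring, not our production manifests.

```python
from kubernetes import client, config

config.load_kube_config()

# Hypothetical PVC bound to a StorageClass backed by the JuiceFS CSI Driver
# (provisioner csi.juicefs.com). Spark/Flink pods that mount this claim see the
# JuiceFS volume as an ordinary directory.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "jfs-data", "namespace": "bigdata"},
    "spec": {
        "accessModes": ["ReadWriteMany"],
        "storageClassName": "juicefs-sc",           # hypothetical StorageClass name
        "resources": {"requests": {"storage": "10Pi"}},
    },
}

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="bigdata", body=pvc
)
```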

One open question here is shuffle. Spark jobs spill a large amount of data to disk during the shuffle phase, and the large number of file reads and writes generated in that period places high performance demands on the underlying storage. Flink is better off because it is streaming and does not spill as much. In the future we hope JuiceFS can write directly to Lustre, which would require some modification to JuiceFS: with client integration, JuiceFS could read and write Lustre directly and transparently to users, improving read/write performance in the shuffle stage.

The applications in the right half of the figure cover two scenarios. The first is simply querying data in JuiceFS, for example previewing data through Hive JDBC; in this scenario JuiceFS can be accessed through the S3 gateway.
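
The S3 gateway exposes a JuiceFS volume through the S3 API, so any S3 client works against it. A minimal boto3 sketch, with a hypothetical gateway endpoint, credentials, and prefix:

```python
import boto3

# Hypothetical endpoint of a `juicefs gateway` instance and its access keys.
s3 = boto3.client(
    "s3",
    endpoint_url="http://jfs-gateway.example.internal:9000",
    aws_access_key_id="JFS_ACCESS_KEY",
    aws_secret_access_key="JFS_SECRET_KEY",
)

# The JuiceFS volume appears as a bucket; files are listed and fetched like
# ordinary S3 objects, which is convenient for data preview.
resp = s3.list_objects_v2(Bucket="bigdata-volume", Prefix="warehouse/ods/")
for obj in resp.get("Contents", [])[:10]:
    print(obj["Key"], obj["Size"])
```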

The second is the linkage between the big data platform and the AI platform. For example, colleagues on the AI platform regularly need to read sample data, feature data, and so on, which are usually generated by Spark or Flink jobs on the big data platform and already stored in JuiceFS. To share data between platforms, when an AI platform pod starts, JuiceFS is mounted into the pod directly through FUSE, so AI colleagues can access the data in JuiceFS directly from Jupyter for model training, instead of copying data back and forth between platforms as in traditional architectures. This improves the efficiency of cross-team collaboration.

Because JuiceFS controls permissions with standard POSIX users and groups, and containers start as the root user by default, permissions are hard to control. We therefore modified JuiceFS to mount the file system with an authentication token that carries the metadata engine connection information and other access-control information.
For scenarios where multiple JuiceFS file systems must be accessed at the same time, we use the JuiceFS S3 gateway combined with IAM policies for unified permission management.

Problems we currently encounter with JuiceFS

The first is that permission management based on users and groups is relatively simple. In some scenarios containers start as root by default, which makes permissions hard to control.

The second concerns configuration tuning of the JuiceFS Hadoop SDK. The three main tuning options are juicefs.prefetch, juicefs.max-uploads, and juicefs.memory-size. We ran into problems while tuning juicefs.memory-size: its default is 300 MB, and the official recommendation is to set off-heap memory to four times that value, i.e. 1.2 GB. Most of our tasks are configured with 2 GB of off-heap memory, but some tasks occasionally fail to write even with more than 2 GB (the same jobs write stably to HDFS). This is not necessarily a JuiceFS problem; it may also be caused by Spark or the object storage. We are therefore planning to adapt Spark and JuiceFS more deeply, track down the cause step by step, and work through these pitfalls so that we can reduce memory while keeping tasks stable.
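
For reference, these SDK options can be passed to a job as Hadoop configuration via the spark.hadoop. prefix; the sketch below simply echoes the settings discussed above as a starting point, not a recommendation, and assumes the JuiceFS Hadoop SDK is already configured on the cluster.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("jfs-hadoop-sdk-tuning-sketch")
    .config("spark.hadoop.juicefs.memory-size", "300")   # SDK read/write buffer in MB (default 300)
    .config("spark.hadoop.juicefs.prefetch", "4")        # prefetch threads per client
    .config("spark.hadoop.juicefs.max-uploads", "40")    # concurrent uploads per client
    # The SDK buffer lives off-heap, so executor memoryOverhead must cover it;
    # per the discussion above we currently give most tasks 2 GB.
    .config("spark.executor.memoryOverhead", "2g")
    .getOrCreate()
)
```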

Third, because the overall architecture (JuiceFS + object storage + Lustre) is more complex, there are more possible failure points and task stability may drop somewhat, which requires additional fault-tolerance mechanisms. For example, Spark tasks may report errors such as "lost task" during the shuffle write phase, and we have not yet located the exact cause.

The JuiceFS + object storage + Lustre combination described above improves read and write performance to a degree, but it also makes the architecture more complex and adds possible failure points. For example, Lustre does not have strong replication for disaster recovery: if a Lustre node suddenly goes down, can running tasks keep reading and writing data in Lustre stably, and if data in Lustre is accidentally lost, can the system remain stable? This is currently uncertain, and we are running this kind of failure testing.

Future plans and outlook

Real-time data lake solution based on Flink + Hudi + JuiceFS

One of the projects we will work on soon is a real-time data lake solution based on Flink + Hudi + JuiceFS. The left side of the figure above shows the data sources; data is written into Hudi in real time via Flink and Kafka/Pulsar, and Hudi's data lands in JuiceFS, replacing our current real-time data warehouse.
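
As a rough sketch of what such a pipeline could look like: the Kafka topic, schema, servers, and JuiceFS path below are hypothetical, and the connector options should be checked against the current Flink and Hudi documentation.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical streaming source: vehicle signals from Kafka.
t_env.execute_sql("""
    CREATE TABLE vehicle_signals (
        vin STRING,
        signal_name STRING,
        signal_value DOUBLE,
        ts TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'vehicle-signals',
        'properties.bootstrap.servers' = 'kafka.example.internal:9092',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

# Hudi table whose files live on JuiceFS (jfs:// via the Hadoop SDK).
t_env.execute_sql("""
    CREATE TABLE vehicle_signals_hudi (
        vin STRING,
        signal_name STRING,
        signal_value DOUBLE,
        ts TIMESTAMP(3),
        PRIMARY KEY (vin, signal_name) NOT ENFORCED
    ) WITH (
        'connector' = 'hudi',
        'path' = 'jfs://bigdata-volume/lake/vehicle_signals_hudi',
        'table.type' = 'MERGE_ON_READ'
    )
""")

# Continuously upsert the stream into the Hudi table.
t_env.execute_sql("INSERT INTO vehicle_signals_hudi SELECT * FROM vehicle_signals")
```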

Long-term planning for big data cloud native

Finally, let me introduce Li Auto's long-term plan for big data cloud native, which is also an outlook.

The first point is a unified data management and governance system. We believe the biggest problem to solve in the data lake 2.0 era is the data swamp problem of the data lake 1.0 era. But there do not yet seem to be good open source products for unified metadata management, data catalog management, and data security control comparable to AWS Glue and AWS Lake Formation. We are currently building an "origin system" whose first step is unified catalog management, unified security control, and unified data management for all the metadata in the databases and object storage described above. We are feeling our way forward.

The second point is faster, more stable, and lower-cost underlying storage. The biggest difficulty in all current scenarios is object storage. Its advantages are stability and low cost, and it keeps iterating; in my view, for big data cloud native to develop, object storage must provide better performance while maintaining stability.

Meanwhile, S3 may claim to support strong consistency, but as I understand it, an architecture built on object storage will find strong consistency hard to achieve, or will have to sacrifice something to achieve it; it is a question of balance. JuiceFS natively supports strong consistency, which is very friendly to big data platforms.

The third point is a smarter, more efficient, and easier-to-use query engine. Extending the lakehouse discussion above: the lakehouse is still in an early stage and may take 5 to 10 years to mature. Both Databricks and Microsoft are trying to build vectorized MPP engines on top of the data lake to push the lakehouse architecture forward. This may well be a future direction, but it seems unlikely that a single engine will meet the needs of all scenarios in the short term.

Our current architecture essentially keeps every kind of query engine, such as Spark, Flink, relational databases (for OLTP scenarios), time-series databases, and OLAP databases; whichever fits best is used, and we manage them through unified middleware at the upper layer. Snowflake is another example: although it now supports querying structured and semi-structured data, how unstructured data (such as images, audio, and video) used in artificial intelligence will be supported in the future is still unclear. But I think this is definitely a future direction. Li Auto also has similar AI scenarios, so we will explore and build together with the various business teams.

Finally, the ultimate goal of big data development is to deliver data analysis at the lowest cost and with the highest performance, and thereby realize real business value.

If it is helpful, please follow our project Juicedata/JuiceFS! (0ᴗ0✿)

