Background: one copy of the data, one system that covers multiple analytical and ad-hoc scenarios. Concretely, four use cases: reporting/dashboards (query volumes on the order of tens of thousands, millisecond latency, heavy real-time ingestion, filters, aggregations, and the like); embedded statistics (simple but high-QPS queries, millions of real-time updates per second); monitoring (time series); ad-hoc analysis (a handful of users, complex queries answered in seconds over trillions of rows, e.g. Snowflake). Related work is listed below along these four directions.
1. dashboard / reporting: Mesa, Bigtable
1. Mesa
A. Gupta, F. Yang, et al. Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing. PVLDB, 7(12):1259–1270, 2014.
https://segmentfault.com/a/11...
2. Bigtable
F. Chang, J. Dean, et al. Bigtable: A Distributed Storage System for Structured Data. TOCS, 26(2):4, 2008.
https://segmentfault.com/a/11...
3. Druid: does not support ad-hoc queries
2. monitoring: Monarch
1. Monarch
R. Lupi. Monarch, Google's Planet Scale Monitoring Infrastructure, 2016.
https://www.datacouncil.ai/talks/monarch-googles-planet-scale-streaming-monitoring-infrastructure
2. Gorilla
https://segmentfault.com/a/11...
3. InfluxDB and others
3. ad-hoc analysis
1. Dremel
S. Melnik, A. Gubarev, J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive Analysis of Web-Scale Datasets. PVLDB, 3(1):330–339, 2010.
Improvements over Dremel (from Procella's perspective): However, there are significant design differences, for example,
- extensive use of stateful caching (Dremel is mostly stateless),
- separate instances where each is optimized for different use cases (Dremel is a global shared service across all users),
- use of indexable columnar formats optimized for lookups (Capacitor, unlike Artus, does not have indexes), etc.
These make Procella suitable for many additional workloads (e.g. high-QPS reporting and lookup queries).
- BigQuery
ad hoc, trial-and-error analysis; tens of seconds over 35 billion rows.
Traditional approaches: relational OLAP requires building indices; multi-dimensional OLAP requires pre-defining dimensions.
Full scan instead: no indices or pre-aggregation needed; relies on in-memory/flash storage, columnar storage, and parallel disk I/O (a toy sketch follows).
While MapReduce is suitable for long-running batch processes such as data mining, BigQuery is the best choice for ad hoc OLAP/BI queries that require results as fast as possible (Dremel).
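A minimal illustration of the full-scan argument above: with columnar storage, a filtered aggregation can be answered by scanning chunks of a single column in parallel, with no index or pre-aggregation. The chunk layout, thread pool, and function name are assumptions for this sketch, not BigQuery internals.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_and_sum(column_chunks, predicate):
    """Toy columnar full scan: each chunk of one column is scanned in
    parallel and the partial sums are combined. Real systems parallelize
    across disks and machines; this only sketches the idea."""
    def scan_chunk(chunk):
        return sum(v for v in chunk if predicate(v))

    with ThreadPoolExecutor() as pool:
        return sum(pool.map(scan_chunk, column_chunks))

# e.g. scan_and_sum([[1, 5, 9], [2, 8]], lambda v: v > 4) == 22
```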
The Dremel paper focuses on the protobuf-based nested columnar format and on query execution:
1. Splitting records into columns (column striping), recording repetition and definition levels per value.
2. Reassembling the original records from the column values: a finite-state machine (FSM) is maintained, and the current field plus the next repetition level r determine which field to read next.
3. FSM construction.
4. After the scan, a reader evaluates the query expressions over the assembled values to produce the result.
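As a concrete illustration of step 1, below is a minimal sketch of column striping for a single nested column, emitting (value, repetition level, definition level) triples in the spirit of the Dremel paper. The hard-coded two-level schema (repeated links, each with repeated urls) and the function name are assumptions for the example, not the paper's general algorithm.

```python
def stripe_links_urls(records):
    """Emit (value, r, d) triples for the column links.urls of each record.
    r: level of the most recently repeated field (0 = start of a new record).
    d: how many fields along the path links.urls are actually present."""
    column = []
    for rec in records:
        links = rec.get("links", [])
        if not links:                      # neither links nor urls present
            column.append((None, 0, 0))
            continue
        first_link = True
        for link in links:
            r_link = 0 if first_link else 1
            first_link = False
            urls = link.get("urls", [])
            if not urls:                   # links present, urls missing
                column.append((None, r_link, 1))
                continue
            first_url = True
            for url in urls:
                r = r_link if first_url else 2   # 2 = repetition within urls
                first_url = False
                column.append((url, r, 2))
    return column

# stripe_links_urls([{"links": [{"urls": ["a", "b"]}, {"urls": []}]}])
# -> [("a", 0, 2), ("b", 2, 2), (None, 1, 1)]
```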
2. Presto
Read the prestodb docs and several blog posts. OLAP engine; a deployment consists of a coordinator and workers; connectors attach various data sources (Hive, etc.) and expose schemas and tables; a SQL statement is decomposed as query -> stage -> task -> split -> driver/operator, with exchanges between stages.
Introductory article: https://tech.meituan.com/2014...
Several blog posts and one paper:
1) Scan optimization: reordering filters.
https://prestodb.io/blog/2019...
The ideal table scan can be described in terms of first principles as follows:
Do not materialize data that is not part of the result
Only read the data that is strictly necessary for producing the result
Filter early, run the most efficient filters first.
Produce output in predictable sized chunks and run in fixed memory.
The cost function of a filter is time / (rows_in / rows_out). The lower the value, the better the filter. Reorder accordingly: simple single-column filters first, then multi-column filters, then non-filtered columns, widest first (see the sketch below).
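A small sketch of this adaptive reordering, under assumptions of my own: filters are plain Python predicates, statistics are tracked per filter, and the order is recomputed per batch using the cost function quoted above. It illustrates the rule, not Presto's Aria scan code.

```python
import time

def apply_filters_adaptively(batches, filters):
    """Run the cheapest / most selective filters first, using
    cost = time / (rows_in / rows_out) measured on earlier batches."""
    stats = {id(f): {"time": 1e-9, "rows_in": 1, "rows_out": 1} for f in filters}

    def cost(f):
        s = stats[id(f)]
        return s["time"] / (s["rows_in"] / s["rows_out"])  # lower is better

    for batch in batches:
        rows = batch
        for f in sorted(filters, key=cost):
            start = time.perf_counter()
            kept = [row for row in rows if f(row)]
            s = stats[id(f)]
            s["time"] += time.perf_counter() - start
            s["rows_in"] += len(rows)
            s["rows_out"] += len(kept)
            rows = kept
        yield rows
```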
2) Memory limits:
https://tech.meituan.com/2014...
When there is a worker that has exhausted its general pool, the reserved pool comes into play. In that state, the coordinator selects the query with the largest total (user + system) memory reservation across the cluster, and assigns that query to the reserved pool on all workers.
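A minimal sketch of that selection step, assuming a made-up data structure (query id -> per-worker user/system reservations) and a hypothetical `assign_to_reserved_pool` worker call; it only illustrates the "largest total reservation wins" rule quoted above, not Presto's memory manager.

```python
def promote_to_reserved_pool(reservations, workers):
    """Pick the query with the largest total (user + system) memory
    reservation across the cluster and move it to the reserved pool on
    every worker. `reservations`: {query_id: {worker_id: {"user": int,
    "system": int}}} in bytes (illustrative layout)."""
    def total(query_id):
        per_worker = reservations[query_id].values()
        return sum(m["user"] + m["system"] for m in per_worker)

    victim = max(reservations, key=total)
    for worker in workers:
        worker.assign_to_reserved_pool(victim)  # hypothetical worker API
    return victim
```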
3) Grouped execution and scheduling
FixedCountScheduler: sources that cannot be partitioned are not split into more tasks.
FixedSourcePartitionedScheduler
SourcePartitionedScheduler
https://github.com/prestodb/p...
Exchange materialization for memory-intensive queries
Stages that cannot run with grouped execution (not partitioned on the required keys) are turned into exchange materialization; there is logic to decide when to do so. Recoverable grouped execution for long-running queries (multiple hours).
Data that exceeds memory is spilled to disk and processed in small chunks (a toy sketch follows).
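A toy sketch of the spill-and-process-in-chunks idea: a hash aggregation that dumps partial results to temporary files when the in-memory table grows too large, then merges the runs at the end. This is a generic external aggregation, not Presto's spilling operators.

```python
import os
import pickle
import tempfile
from collections import defaultdict

def grouped_sum_with_spill(rows, key_fn, val_fn, max_groups_in_memory=100_000):
    """Hash aggregation that spills partial results to disk when the
    in-memory hash table passes a size limit, then merges the spilled runs."""
    partials = defaultdict(int)
    spill_paths = []

    def spill():
        with tempfile.NamedTemporaryFile(delete=False) as f:
            pickle.dump(dict(partials), f)
            spill_paths.append(f.name)
        partials.clear()

    for row in rows:
        partials[key_fn(row)] += val_fn(row)
        if len(partials) >= max_groups_in_memory:
            spill()

    result = dict(partials)
    for path in spill_paths:           # merge spilled runs with in-memory state
        with open(path, "rb") as f:
            for k, v in pickle.load(f).items():
                result[k] = result.get(k, 0) + v
        os.unlink(path)
    return result
```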
4) Top-level vs. nested structures (the outer layers need not change)
https://engineering.fb.com/da...
Pruning complex types while extracting data from ORC files
shifting filter evaluation from the engine into the Hive connector
The read stream returns columns directly; the order of filtered columns is adjusted according to how effectively each filter prunes records.
Paper
Presto: SQL on Everything
- use cases:
ad-hoc analytics and BI; each cluster supports roughly 50–100 concurrent queries, with second-level latencies
ETL: CPU- and memory-intensive (aggregations, joins); throughput matters more than latency
A/B testing
- conclusions:
approximate performance numbers and resource utilization
engineering advice: 1. keep configuration generic with little per-deployment customization, and use backpressure; 2. performance visibility: performance statistics from their own libraries and per-query, per-stage statistics, so the system can be optimized in a data-driven way; 3. static configuration; 4. develop their own libraries so debugging and similar tasks are faster
- related work:
Hive/MapReduce and Spark SQL/Spark have higher latency; Vertica/Redshift require loading the data first; Impala only supports the Hadoop ecosystem
- key points:
1. SQL => ANTLR-based parser -> syntax tree => IR encoded as a tree of plan nodes
2. Optimizations: predicate/limit pushdown, column pruning, decorrelation; parallelism (shuffling data in memory and transferring it consumes memory and CPU); cost-based evaluation of plans; data layouts reported by connectors
3. Scheduling: stages (all-at-once, or phased for memory-heavy workloads such as ETL); tasks (leaf and intermediate stages); splits (leaf stages receive their data divided into splits; connectors enumerate splits into smaller queues)
4. Local data flow is controlled by a driver loop (not a pull model: data is pushed to whichever operators can make progress; a minimal sketch appears after this list)
Shuffles: over HTTP, monitoring the output buffers and the HTTP input buffers, with backpressure
For write-heavy ETL-style tasks, the write concurrency is adjusted dynamically
5. A split runs on a thread for at most about one second at a time; CPU scheduling uses five priority levels
Memory spilling to disk; two memory pools (general and reserved; the reserved pool holds only one query at a time)
6. Fault tolerance: monitoring keeps unavailability short, but recovery still relies on query retries and is being improved continuously
Query optimizations: JVM bytecode generation; file format (columnar, with a flat in-memory layout); lazy blocks are decompressed after the connector only when actually used; operating on compressed data (is what comes from the connector already compressed??)
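A minimal sketch of the driver loop mentioned in point 4, under assumptions of my own: operators expose hypothetical `needs_input` / `has_output` / `add_input` / `get_output` / `is_finished` methods, and the loop pushes pages to whichever operator can make progress instead of pulling from the top. It is not Presto's actual Driver class.

```python
def run_driver(operators):
    """Push pages along a linear pipeline of operators: on every pass, any
    operator with output feeds the next operator that can accept input, so
    whichever operator can make progress does."""
    while not operators[-1].is_finished():
        made_progress = False
        for upstream, downstream in zip(operators, operators[1:]):
            if upstream.has_output() and downstream.needs_input():
                downstream.add_input(upstream.get_output())
                made_progress = True
        if not made_progress:
            break  # pipeline blocked (e.g. waiting on I/O); a real driver yields the thread here
```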
3. Snowflake
- 1. Problem and related work
1.1. Pre-cloud data warehouse systems do not fit cloud architectures well; they assume small, static clusters, while data volumes are growing and data structures are becoming more complex
1.2. SaaS, ACID transactions, relational, semi-structured data, columnar storage
1.3. Also, following a pure service principle, Snowflake requires no physical tuning, data grooming, manual gathering of table statistics, or table vacuuming on the part of users.
2.1. Parallel database systems: Redshift, shared-nothing architecture, requires data movement when scaling (A. Gupta et al. Amazon Redshift and the Case for Simpler Data Warehouses. In Proc. SIGMOD, 2015.)
2.2. BigQuery: SQL-like language, tricky for SQL-based tooling; tables are append-only and require schemas
2.3. Document stores and Big Data: document stores such as MongoDB, Couchbase Server, and Apache Cassandra; the challenge with their simple key-value and CRUD (create, read, update, and delete) APIs is the difficulty of expressing more complex queries.
Additionally, many "Big Data" engines now support queries over nested data, for example Apache Hive, Apache Spark, Apache Drill, Cloudera Impala, and Facebook Presto. We believe that this shows a real need for complex analytics over schema-less and semi-structured data.
When Snowflake was founded in 2012, the database world was fully focused on SQL on Hadoop, with over a dozen systems appearing within a short time span. At that time, the decision to work in a completely different direction, to build a "classic" data warehouse system for the cloud, seemed a contrarian and risky move. After 3 years of development we are confident that it was the right one. Hadoop has not replaced RDBMSs; it has complemented them. People still want a relational database, but one that is more efficient, flexible, and better suited for the cloud.
- 2. Solution
1. Architecture
S3 (data storage, and the spill target for local compute)
EC2 (compute)
2. Query processing: Snowflake's query optimizer follows a typical Cascades-style approach; there are no indexes.
Storage layout: a well-known scheme called PAX, or hybrid columnar [1, 2, 15].
Execution: push-based (T. Neumann. Efficiently Compiling Efficient Query Plans for Modern Hardware. PVLDB, 4(9):539–550, 2011.)
Pruning: static, filtering on each file's metadata, for both plain and semi-structured data (see the sketch after this list); dynamic, at joins ([40] J. K. Mullin. Optimal Semijoins for Distributed Database Systems. IEEE TSE, 16(5):558–560, 1990.)
3. ACID via MVCC and snapshots
4. VARIANT (document-like), ARRAY, and OBJECT types: ETL => ELT; changing the structure does not require coordinated changes across many components; the transform step can run concurrently; UDFs are supported
5. Schema-less data: Cloudera Impala [21] (using Parquet [10]) and Google Dremel [34] have demonstrated that columnar storage of semi-structured data is possible and beneficial. Statistics are collected automatically to extract frequently occurring paths, which are stored as separate columns, so static pruning and materialized aggregates apply to semi-structured data as well as plain data. Path-based bloom filters are added for path-based queries. Conversion of JSON and similar formats is optimistic: both the converted and the original representations are kept (whatever is unused is never loaded or accessed)
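A small sketch of the static, file-level pruning mentioned in point 2 above, assuming each file carries per-column min/max statistics in a made-up `stats` dictionary; it illustrates the pruning rule, not Snowflake's metadata layer.

```python
def prune_files(files, column, lo, hi):
    """Keep only files whose [min, max] range for `column` can overlap the
    predicate range [lo, hi]; every other file is skipped without being read.
    `files` is a list of dicts like {"name": ..., "stats": {column: (min, max)}}."""
    survivors = []
    for f in files:
        col_min, col_max = f["stats"][column]
        if col_max < lo or col_min > hi:
            continue  # no row in this file can satisfy the predicate
        survivors.append(f)
    return survivors
```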
- 3. Resulting performance
The paper gives no absolute numbers; it compares schema-less against schema'd storage and shows the overhead is acceptable
Raw performance can largely be bought with more cloud nodes; the emphasis is on building a metadata layer that can support hundreds of users concurrently, on handling various types of node failures, network failures, and supporting services ("a never-ending fight"), on security, and so on
4. Spark SQL (Spark)
https://segmentfault.com/a/11...
5. Redshift
4. embedded statistics
1. Vitess
Google, Inc. Vitess: Database clustering system for horizontal scaling of MySQL, 2003.
A layer for scaling and managing MySQL (https://vitess.io/zh/docs/sch...)
2. HBase
https://segmentfault.com/a/11...
3. HyperLogLog
5. formats
Capacitor
Mosha Pasumansky. Inside Capacitor, BigQuery’s next-generation columnar storage format, 2016.