[clickhouse column] Statistical analysis of JSON data documents of the mongodb storage class

The need for document storage

Many developers have used mongodb, where data records exist in the form of documents (similar to a form of multi-level nested SQL). For example, the following JSON data structure: dev_ip represents the ip of a server, the location field stores the latitude and longitude, and meminfo.total and meminfo.userd represent the total memory and usage, respectively.

 [{
  "dev_ip": "123.46.5.111",
  "location": [39.916527, 116.397128],
  "meminfo": {
    "total": 64,
    "used": 23.2
  }
},
{
  "dev_ip": "123.46.5.112",
  "location": [39.916144, 116.392582],
  "meminfo": {
    "total": 64,
    "used": 27.8
  }
}]

2. Build the table

The data type of Map needs to be used to store documents. In some older versions, the Map data type is still experimental and cannot be used directly. If you want to use it, you need to execute set allow_experimental_map_type = 1; .
Then we can build a table according to the JSON data structure, location is an array data type, and meminfo is a Map data type.

 CREATE TABLE dev_meminfo (
  dev_ip String,  
  location Array(Float64),
  meminfo Map(String, Float32)
)
ENGINE = MergeTree()
ORDER BY dev_ip;

clickhouse provides an array type Array(T), which consists of multiple T elements. T can be any underlying data type, or it can be an array type. If it is an array type, a multidimensional array is formed. Cickhouse has limited support for multidimensional arrays, so it is not recommended to use multidimensional arrays. All T elements in the array must have the same data type, otherwise an exception will be thrown.

3. Data storage

In the Linux environment, we can save JSON data as a document, named such as: dev_meminfo.json. Then use the following command line to complete the storage of the document JSON data. The function of jq is to strip the JSON array and Map from the structure, so as to meet the input format required by JSONEachRow.

 cat dev_meminfo.json |jq -c .[] | clickhouse-client --database acaidb -m -u acai --password '<你的密码>'  \
--query="INSERT INTO dev_meminfo FORMAT JSONEachRow"

Or we can directly use the INSERT statement to complete a single insertion of data, as follows:

 INSERT  INTO dev_meminfo FORMAT JSONEachRow {"dev_ip": "123.46.5.112", "location": [39.916144, 116.392582],"meminfo": {"total": 64,"used": 27.8}};

4. Query data

select * Query data:

 #  clickhouse-client --database acaidb -m -u acai --password '4rfv$RFV'  --query="SELECT * FROM dev_meminfo"
123.46.5.111    [39.916527,116.397128]  {'total':64,'used':23.2}
123.46.5.112    [39.916144,116.392582]  {'total':64,'used':27.8}

When querying data by field name, it should be noted that the subscript of Ciickhouse's array starts from 0, which is inconsistent with the specification in our traditional programming development.

 # clickhouse-client --database acaidb -m -u acai --password '4rfv$RFV'  --query="SELECT dev_ip,location[1],location[2],meminfo['total'],meminfo['used'] FROM dev_meminfo"
123.46.5.111    39.916527       116.397128      64      23.2
123.46.5.112    39.916144       116.392582      64      27.8

V. Summary

We see that clickhouse can replace mongodb to store JSON-like document data and support SQL query statistical analysis, which is quite attractive. I just introduced simple queries. In fact, clickhouse provides a lot of statistical analysis functions, window functions, etc. for various statistical analysis scenarios. Of course, there are also many statistical analysis functions for array data types.
What is more noteworthy is: I have seen some evaluation articles, the statistical analysis performance of clickhouse is nearly 100 times that of mongodb (the author has not verified it myself, please pay attention to the follow-up articles of my column). But it does not mean that clickhouse can replace all application scenarios of mongodb, at least the user-oriented concurrent request scenario clickhouse is not satisfied, because clickhouse is positioned as a data warehouse, mainly for data analysis OLAP scenarios, not for users with high concurrency Online Transaction Processing OLTP.

[clickhouse column] Statistical analysis of JSON data documents of the mongodb storage class

The need for document storage

2. Build the table

3. Data storage

4. Query data

V. Summary

Recommended reading

字母哥博客

引用和评论

高频时序数据的储存与统计方案

Java8的新特性

Java11的新特性

Java5的新特性

Java9的新特性

Java13的新特性

Java7的新特性