Preface
Data visualization
The essence of data visualization is converting data into visual encodings. Visualization is well suited to data exploration, scientific insight, communication, and education. Visualization and statistics are distinct but related: visualization does not require a clearly defined question up front, whereas statistics studies a specific one, and the two work as partners. Visualization attracts the audience's attention through visual encoding, conveys the data to the observer, and lets the observer explore and analyze the data interactively through media such as computers.
Good visual encoding makes full use of the innate human ability to process information such as position, color, and shape in parallel. Still, a large amount of information can rarely be displayed reasonably in a single static graphic. Design is therefore not only about how to display something, but also about deciding what matters to the imagined reader and what to leave out. With a computer, we can tailor the graphics to the reader's interests.
When designing interaction, we follow the guideline for human-computer interaction proposed by Ben Shneiderman [1]: overview first, zoom and filter, then details on demand. The overview is the initial form of the chart; its purpose is not to display all the content but to provide a "macro" view of all the data. Zooming and filtering reduce the displayed content so the reader can focus on topics of interest. Details on demand let the reader extract exact values from the chart.
Machine learning visualization
Applying data visualization to machine learning is called machine learning visualization. Depending on the data used by different user groups at different stages, it can be roughly divided into four categories: training data, model performance, interpretability and model inspection, and high-dimensional data. Data visualization exposes patterns to our eyes. Visualization tools in turn use machine learning to extract patterns for us, help us find deeper patterns, and give us new ways to navigate data. The patterns extracted by machine learning form high-dimensional data in the form of feature vectors. Embedding Projector [2] is a tool for interactive visualization and analysis of high-dimensional data. It provides four dimensionality reduction methods that are very useful for visualizing high-dimensional data: UMAP [3], t-SNE [4], PCA, and custom linear projections [5].
- UMAP is a dimensionality reduction algorithm based on manifold learning techniques and ideas from topological data analysis. It provides a very general framework for manifold learning and dimensionality reduction, along with a concrete implementation;
- t-SNE can be used to explore local neighborhoods and find clusters;
- PCA is usually effective at exploring the internal structure of the embeddings and revealing the most influential dimensions of the data;
- Custom linear projections can help find meaningful directions in the data set.
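To make the idea of a custom linear projection concrete, here is a minimal sketch. It assumes that each axis is simply a user-chosen direction vector and that a point's coordinate is its dot product with the normalized direction. The function names and the sample vectors are hypothetical, and this is not a description of how Embedding Projector implements the feature.

```javascript
// Dot product of two equal-length numeric vectors.
function dot(a, b) {
  if (a.length !== b.length) {
    throw new Error('vectors must have the same length');
  }
  let sum = 0;
  for (let i = 0; i < a.length; i += 1) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Scale a vector to unit length so projected coordinates are comparable.
function normalize(v) {
  const norm = Math.sqrt(dot(v, v));
  if (norm === 0) {
    throw new Error('direction vector must be non-zero');
  }
  return v.map((x) => x / norm);
}

// Project every high-dimensional point onto two chosen directions,
// producing 2D coordinates that a scatter plot could display.
function projectLinear(points, xDirection, yDirection) {
  const xAxis = normalize(xDirection);
  const yAxis = normalize(yDirection);
  return points.map((p) => ({ x: dot(p, xAxis), y: dot(p, yAxis) }));
}

// Hypothetical 4-dimensional feature vectors (illustrative values only).
const points = [
  [1, 0, 2, 0],
  [0, 3, 0, 4],
];
const projected = projectLinear(points, [1, 0, 0, 0], [0, 3, 0, 4]);
console.log(projected);
```

Choosing different direction vectors changes which structure in the data becomes visible, which is why this kind of projection is useful for looking for meaningful directions.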
Problems and challenges
In machine learning application scenarios, we ran into a series of challenges: (1) The research object is uncertain. It is impossible to design in advance a static image that expresses all the content clearly, and it is of course not necessary to do so. (2) The group's current ecosystem lacks support for high-dimensional data visualization. The group's G2 [6] is a low-level visualization engine based on the grammar of graphics. It is data-driven, provides a graphics grammar and an interaction grammar, and is highly usable and extensible. With G2 and its ecosystem, users can build a wide range of interactive statistical charts on Canvas or SVG without attending to the tedious implementation details of each chart. As of version 4.0, however, G2 and its ecosystem focus mainly on visualizing a single statistical chart. Machine learning scenarios care not only about the distribution of a feature space but also about comparative analysis across multiple feature subspaces and about comparative analysis of the spatiotemporal distribution of features. For exploring the unknown, a high-level visualization grammar such as Vega-Lite [7] helps analyze data quickly and create a range of expressive visualizations, but Vega-Lite cannot reuse the group's existing visualization capabilities. (3) Unit visualization has physical limitations. Because every data item gets its own visual mark during visual encoding, large data volumes cause performance problems; the necessary interaction support is also lacking.
Framework design
In response to these challenges, we combine visualization with human-computer interaction to handle the uncertainty of the research object, and we use multi-view visualization for high-dimensional data space visualization and subspace comparative analysis. The result is a visual analysis framework with a high-level syntax that expresses the visual design space intuitively and supports multi-data, multi-attribute, multi-view visual analysis of massive high-dimensional data, covering comparative analysis scenarios such as time series and geographic space. The framework is built on the following concepts:
- Data: a data table with multiple rows, each row containing multiple columns (attributes). To simplify data processing, the data generally uses a flat structure;
- Containers: a container is a geometric abstraction that describes the position and area in which a Group will be placed;
- Bin: select an attribute of interest and apply the bin operation to it;
- Layout: custom computation of the positions of visual elements or groups;
- Groups: a subset of the rows. Groups can be nested, so a Group may contain other Groups;
- Cells: a specific instance of a Container associated with one row of the data set;
- Units: the graphical representation of one row of data. Units have visual attributes such as color, shape, size (relative to the enclosing cell), and opacity;
- View: a specific visualization of a data table, which can be linked with other views of the same data;
- Interaction: runs through the whole visualization pipeline, including data-level filtering and sorting, attribute selection for the bin operation, layout method selection, visual encoding selection, interactions on visual units such as selection, tooltips, and hover, and linked analysis across views;
- Animation: animation of visual elements during the add, update, and delete phases.
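The Groups and Bin concepts above can be illustrated with a small sketch. It assumes that grouping means collecting rows that share an attribute value and that binning means equal-width intervals over a numeric attribute. The helper names (`getByPath`, `groupBy`, `bin`) and the `amount` attribute are hypothetical and are not the framework's actual API.

```javascript
// Read a nested attribute such as 'basic_info.id' from a row.
function getByPath(row, path) {
  return path
    .split('.')
    .reduce((value, key) => (value == null ? undefined : value[key]), row);
}

// Group rows by the value found at `key`; each Map entry is one Group.
function groupBy(rows, key) {
  const groups = new Map();
  for (const row of rows) {
    const groupKey = getByPath(row, key);
    if (!groups.has(groupKey)) groups.set(groupKey, []);
    groups.get(groupKey).push(row);
  }
  return groups;
}

// Split rows into `binCount` equal-width bins over a numeric attribute.
function bin(rows, key, binCount) {
  const values = rows.map((row) => getByPath(row, key));
  const min = Math.min(...values);
  const max = Math.max(...values);
  // Fall back to a width of 1 when all values are equal.
  const width = (max - min) / binCount || 1;
  const bins = Array.from({ length: binCount }, () => []);
  rows.forEach((row, i) => {
    // Clamp so that the maximum value falls into the last bin.
    const index = Math.min(Math.floor((values[i] - min) / width), binCount - 1);
    bins[index].push(row);
  });
  return bins;
}

// Hypothetical rows; `amount` is an illustrative numeric attribute.
const rows = [
  { basic_info: { id: '1' }, amount: 0 },
  { basic_info: { id: '1' }, amount: 10 },
  { basic_info: { id: '2' }, amount: 20 },
  { basic_info: { id: '2' }, amount: 100 },
];
const groups = groupBy(rows, 'basic_info.id');
const bins = bin(rows, 'amount', 2);
console.log([...groups.keys()]); // [ '1', '2' ]
console.log(bins.map((b) => b.length)); // [ 3, 1 ]
```

Because Groups may contain other Groups, the same two operations can be applied again to each resulting subset.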
Key steps
Raw data
To better explain the framework, we analyze a concrete set of business data. The data is provided as an array. Each record in the array describes one entity and includes the fields basic information (basic_info), selection (selection), features (feats), and details (details). The visualization scheme, settled after many discussions, uses multi-view visualization to support longitudinal comparison of the feature data of different entities, with the feature data arranged in descending order of time.
// Business data
[
  // One entity description
  {
    // Basic information
    basic_info: {
      "id": "1", // group id
      ...
    },
    selection: {
      ...
    },
    // Entity feature space
    feats: {
      "feature_1": 0, // boolean
      "feature_2": 1,
      "feature_3": 1,
      "feature_4": 0,
      "feature_5": 1,
      "feature_6": 0,
      "feature_7": 1,
      "feature_8": 0,
      "feature_9": 0,
      "feature_10": 1,
      "feature_11": 104, // numeric
      "feature_12": 104
    },
    details: {
      ...
    },
  },
  ...
]
// end
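Since the framework prefers a flat data structure, nested records like the ones above would need to be flattened at some point. The following is a minimal sketch of one way to do that, producing one row per entity and feature. The function name `flattenFeats`, the row shape, and the abbreviated sample values are assumptions made for illustration.

```javascript
// Turn nested entity records into flat rows, one row per (entity, feature) pair.
function flattenFeats(records) {
  const rows = [];
  for (const record of records) {
    for (const [feature, value] of Object.entries(record.feats)) {
      rows.push({ id: record.basic_info.id, feature, value });
    }
  }
  return rows;
}

// Abbreviated, hypothetical sample in the shape of the business data.
const records = [
  { basic_info: { id: '1' }, feats: { feature_1: 0, feature_2: 1, feature_11: 104 } },
  { basic_info: { id: '2' }, feats: { feature_1: 1, feature_2: 0, feature_11: 98 } },
];

const rows = flattenFeats(records);
console.log(rows.length); // 6
console.log(rows[2]); // { id: '1', feature: 'feature_11', value: 104 }
```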
High-level syntax configuration
const config = {
  width: 600,
  height: 200,
  margin: {
    top: 10,
    right: 30,
    bottom: 30,
    left: 100,
  },
  autoFit: true, // if set to false, width and height must be set manually
  layouts: [
    {
      name: 'layout1',
      type: 'gridxy',
      aspect_ratio: 'fillY', // layout mode
      align: 'TB', // top to bottom
      subgroup: { // subspace
        type: 'groupby', // bin | groupby | flatten
        key: 'basic_info.id', // group by id
      },
      size: {
        type: 'count', // count: size is drawn according to the number of elements in the subspace
      },
      sort: null, // sort order of subspace elements
      padding: { // layout padding
        top: 0,
        left: 0,
        bottom: 0,
        right: 0,
      },
      margin: { // layout margin
        top: 0,
        left: 0,
        bottom: 0,
        right: 0,
      },
      box: { // box style
        fill: 'white',
        stroke: 'gray',
        'stroke-width': 1,
        opacity: 0.5,
      },
    }, {
      name: 'layout2',
      type: 'gridxy',
      subgroup: {
        type: 'flatten', // flatten (tiled) layout
        key: 'feats', // features
      },
      aspect_ratio: 'fillX',
      size: {
        type: 'count',
      },
      align: 'LR', // left to right
      interactions: [], // interactions
      padding: { // padding
        top: 5,
        left: 5,
        bottom: 5,
        right: 5,
      },
      margin: { // margin
        top: 5,
        left: 5,
        bottom: 5,
        right: 5,
      },
      box: { // box style
        fill: 'white',
        stroke: 'gray',
        'stroke-width': 1,
        opacity: 0.5,
      },
    },
  ],
  mark: {
    shape: 'rect', // unit shape
    isColorScaleShared: true,
    size: { // depends on the unit shape
      type: 'uniform', // uniform size
      width: 20, // rect width
      height: 20, // rect height
      rx: 2, // rx
      ry: 2, // ry
    },
  },
  filters: [ // filter fields (cross processing)
    'feature_1',
    'feature_2',
    'feature_3',
    'feature_4',
    'feature_5',
    'feature_6',
    'feature_7',
    'feature_8',
    'feature_9',
    'feature_10',
    'feature_11',
    'feature_12',
  ],
  chart: undefined, // use a custom chart; no chart name is specified explicitly
};
With reference to Figure 4, the layout starts from the root container, which holds all the data. The entity data is first grouped by the id in the basic information; each sub-container is sized according to the number of elements it contains, and the sub-containers are laid out from top to bottom. Each group is then split by the feature attributes of the entity data into equally sized containers laid out from left to right. Finally, the multi-view order visualization is drawn.
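The two layout steps described here can be sketched as plain rectangle arithmetic. This is a simplified illustration that ignores padding and margin and assumes that sizing by count means the height is proportional to the number of rows. The function names and the sample groups are hypothetical.

```javascript
// Split a parent rectangle into vertically stacked child rectangles whose
// heights are proportional to the number of rows in each group.
function layoutTopToBottom(parent, groups) {
  const total = groups.reduce((sum, group) => sum + group.rows.length, 0);
  let y = parent.y;
  return groups.map((group) => {
    const height = total === 0 ? 0 : (parent.height * group.rows.length) / total;
    const rect = { key: group.key, x: parent.x, y, width: parent.width, height };
    y += height;
    return rect;
  });
}

// Split a parent rectangle into equally wide cells laid out left to right.
function layoutLeftToRight(parent, keys) {
  const width = keys.length === 0 ? 0 : parent.width / keys.length;
  return keys.map((key, i) => ({
    key,
    x: parent.x + i * width,
    y: parent.y,
    width,
    height: parent.height,
  }));
}

// Root container matching the width and height of the configuration.
const root = { x: 0, y: 0, width: 600, height: 200 };

// Hypothetical groups: id '1' has three rows, id '2' has one row.
const groups = [
  { key: '1', rows: [{}, {}, {}] },
  { key: '2', rows: [{}] },
];

const bands = layoutTopToBottom(root, groups);
const cells = layoutLeftToRight(bands[0], [
  'feature_1',
  'feature_2',
  'feature_3',
  'feature_4',
]);
console.log(bands.map((b) => b.height)); // [ 150, 50 ]
console.log(cells.map((c) => c.x)); // [ 0, 150, 300, 450 ]
```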
High-level syntax parsing
To generate the target visualization, our grammar constructs a root container and applies unit visualization operations recursively until every container has become a unit. In other words, rendering becomes a tree traversal in which the root container is the root node and the unit containers are the leaf nodes. Once all the cells have been generated, the layout is complete and the cells can be drawn. Before parsing the syntax, we first build the root container, RootContainer, which holds the original data, the predecessor node, a label, the visual space (width, height, padding, and position), and other information. The layouts configuration is parsed into a hierarchical nested structure, which is then laid out starting from RootContainer to generate the child containers (ChildrenContainer) at each level.
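The recursion described above can be sketched as follows. This is a simplified model that leaves out geometry: each layout is reduced to a hypothetical `split` function that turns one container into its children, and containers at the last level are treated as units. The names `buildTree`, `collectUnits`, and `split`, as well as the sample data, are assumptions for illustration and not the framework's actual API.

```javascript
// Build a container tree by applying one layout per level until no layouts
// remain; containers at the last level are the leaves (units).
function buildTree(container, layouts, depth = 0) {
  if (depth === layouts.length) {
    return { ...container, children: [] }; // leaf: a unit
  }
  const layout = layouts[depth];
  const children = layout
    .split(container)
    .map((child) => buildTree({ ...child, parent: container.label }, layouts, depth + 1));
  return { ...container, children };
}

// Collect the labels of all leaves with a depth-first traversal.
function collectUnits(node) {
  if (node.children.length === 0) return [node.label];
  return node.children.flatMap((child) => collectUnits(child));
}

// Hypothetical layouts mirroring 'groupby' followed by 'flatten'.
const layouts = [
  {
    name: 'layout1',
    // groupby: one child container per distinct id
    split: (c) => {
      const ids = [...new Set(c.data.map((d) => d.id))];
      return ids.map((id) => ({
        label: `id=${id}`,
        data: c.data.filter((d) => d.id === id),
      }));
    },
  },
  {
    name: 'layout2',
    // flatten: one child container per feature of each row
    split: (c) =>
      c.data.flatMap((d) =>
        Object.keys(d.feats).map((f) => ({ label: `${c.label}/${f}`, data: [d] }))
      ),
  },
];

// Root container holding all the (hypothetical) data.
const root = {
  label: 'root',
  data: [
    { id: '1', feats: { feature_1: 0, feature_2: 1 } },
    { id: '2', feats: { feature_1: 1, feature_2: 0 } },
  ],
};

const tree = buildTree(root, layouts);
console.log(collectUnits(tree));
// [ 'id=1/feature_1', 'id=1/feature_2', 'id=2/feature_1', 'id=2/feature_2' ]
```

Keeping each level as a separate layout entry means that adding another level of nesting only requires appending one more entry to the list.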
Applications
Order heat map for a single user
Order heat map for multiple users
Interactive behavior
Click, mouseover, mouseout, and other interactions are supported: click retrieves all the information for an order, while mouseover and mouseout highlight and clear the current focus, respectively.
Attribute (feature) filtering is also supported. When the user cares only about feature 1 and feature 2, the effect is as shown in the figure below.
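The filtering step itself can be sketched in a few lines. This assumes that filtering means keeping only the selected entries of `feats` on each record before drawing; the function name `filterFeatures` and the sample records are hypothetical.

```javascript
// Keep only the selected features of each record; other fields are untouched.
function filterFeatures(records, selected) {
  const keep = new Set(selected);
  return records.map((record) => ({
    ...record,
    feats: Object.fromEntries(
      Object.entries(record.feats).filter(([name]) => keep.has(name))
    ),
  }));
}

// Hypothetical records in the shape of the business data.
const records = [
  { basic_info: { id: '1' }, feats: { feature_1: 0, feature_2: 1, feature_3: 1 } },
  { basic_info: { id: '2' }, feats: { feature_1: 1, feature_2: 0, feature_3: 0 } },
];

const filtered = filterFeatures(records, ['feature_1', 'feature_2']);
console.log(Object.keys(filtered[0].feats)); // [ 'feature_1', 'feature_2' ]
```

The function returns new records instead of modifying the input, so the full feature set stays available when the user changes the selection.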
References
[1] Ben Shneiderman. The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations
[2] https://projector.tensorflow.org/
[3] https://umap-learn.readthedocs.io/en/latest/how_umap_works.html
[4] https://distill.pub/2016/misread-tsne/
[5] Visualization of Labeled Data Using Linear Transformations
[6] https://g2.antv.vision/zh/docs/manual/about-g2
[7] https://vega.github.io/vega-lite/
Author: ES2049 / Teacher Li
The article can be reprinted at will, but please keep this link to the original text.
You are very welcome to join ES2049 Studio if you are passionate. Please send your resume to caijun.hcj@alibaba-inc.com