Design and Implementation of a Visual Analysis Framework for Machine Learning

Preface

Data visualization

The essence of data visualization is to convert data into visual encodings. Visualization is well suited to data exploration, scientific insight, communication, and education. Visualization and statistics are distinct but related: visualization does not require a clearly stated question up front, whereas statistics studies a specific question, and the two complement each other as partners. Visualization captures the audience's attention through visual encoding, conveys the data to the observer, and supports interactive exploration and analysis through media such as computers. Good visual encoding makes full use of humans' innate ability to process space, color, shape, and other information in parallel. However, when there is too much information, a single static graphic can almost never display it reasonably. Design is therefore not only about how to display something, but also about deciding what matters to the imagined reader and what to leave out. With a computer, we can tailor the graphic to the reader's interests. When designing the interaction, we follow the guideline for human-computer interaction proposed by Ben Shneiderman [1]: overview first, zoom and filter, then details on demand. The overview is the initial form of the graphic; its purpose is not to display everything but to provide a "macro" view of all the data. Zooming and filtering remove displayed content so that the reader can focus on topics of interest. Details on demand let the reader extract exact values from the chart.

Machine learning visualization

The application of data visualization to machine learning is called machine learning visualization. Depending on which data different user groups work with at different stages, it can be roughly divided into four categories: training data, model performance, interpretability and model inspection, and high-dimensional data. Data visualization exposes patterns to our eyes. Visualization tools use machine learning to extract patterns for us, help us find deeper patterns, and give us new ways to navigate data. The patterns extracted by machine learning form high-dimensional data in the form of feature vectors. Embedding Projector [2] is a tool for interactive visualization and analysis of high-dimensional data. It provides four dimensionality reduction methods that are very useful for visualizing high-dimensional data: UMAP [3], t-SNE [4], PCA, and custom linear projections [5].

UMAP is a dimensionality reduction algorithm based on manifold learning and ideas from topological data analysis. It provides a very general framework for approaching manifold learning and dimensionality reduction, and it also provides concrete implementations;
t-SNE can be used to explore local neighborhoods and find clusters;
PCA can usually explore the internal structure of an embedding effectively and reveal the most influential dimensions of the data;
Custom linear projections can help find meaningful directions in the data set.

Figure 1: Embedding Projector
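Of the four methods, the custom linear projection is the simplest to illustrate. The following is a minimal sketch, not the Embedding Projector's implementation: each vector is projected onto two direction vectors by taking dot products. The function names, the sample embeddings, and the chosen axes are all assumptions made for illustration.

```javascript
// Dot product of two equal-length numeric arrays.
function dot(a, b) {
  return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

// Project each high-dimensional vector onto two direction vectors,
// yielding 2D points that could be drawn in a scatter plot.
function linearProjection(vectors, xAxis, yAxis) {
  return vectors.map((v) => ({ x: dot(v, xAxis), y: dot(v, yAxis) }));
}

// Hypothetical 4-dimensional embeddings.
const embeddings = [
  [1, 0, 2, 0],
  [0, 1, 0, 3],
  [1, 1, 1, 1],
];

// Hypothetical axes: x keeps dimension 0, y sums dimensions 1 and 3.
const points = linearProjection(embeddings, [1, 0, 0, 0], [0, 1, 0, 1]);
console.log(points);
```

Choosing different direction vectors changes which structure in the data becomes visible, which is why the projection is described as a way to look for meaningful directions.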

Problems and challenges

In machine learning application scenarios, we encountered a series of challenges. (1) The research object is uncertain. It is impossible to design in advance a static image that expresses all the content clearly, and of course it is not necessary to do so. (2) The group's current ecosystem lacks support for high-dimensional data visualization. The group's G2 [6] is an underlying visualization engine based on the Grammar of Graphics. It is data-driven, provides a graphics grammar and an interaction grammar, and is highly usable and extensible. With G2 and its ecosystem, users can build a variety of interactive statistical charts on Canvas or SVG without attending to the tedious implementation details of the chart. As of version 4.0, however, G2 and its ecosystem focus mainly on visualizing single statistical charts. Beyond the distribution of a feature space, machine learning scenarios also call for comparative analysis of multiple feature subspaces and of how features are distributed over space and time. When exploring the unknown, a high-level visualization grammar such as Vega-Lite [7] helps to analyze data quickly and create a range of expressive visualizations, but Vega-Lite cannot reuse the group's existing visualization capabilities. (3) Unit visualization has physical limitations. Because every data item gets its own visual mark during visual encoding, large data volumes cause performance problems; the necessary interaction support is also lacking.

Figure 2: Official G2 statistical chart examples

Framework design

In response to these challenges, we combine visualization with human-computer interaction to address the uncertainty of the research object. For visualizing high-dimensional data spaces and comparing subspaces, we use multi-view visualization. The result is a visual analysis framework with a high-level syntax that expresses the visual design space intuitively and supports multi-data, multi-attribute, multi-view visual analysis of massive high-dimensional data, covering comparative analysis scenarios such as time series and geographic space. The framework is built on the following concepts:

Data: a data table containing multiple rows, where each row contains multiple columns (attributes). To simplify processing, the data generally uses a flat structure;
Containers: a container is a geometric abstraction that defines the location and area in which a Group is placed
- bin: select an attribute of interest and perform the bin operation on it
- layout: custom calculation of the position of visual elements or groups
Groups: a subset of the rows. Groups can be nested, so a Group may contain other Groups
Cells: a specific instance of a Container associated with a row in the data set
Units: a graphical representation of a row of data. Units can have visual attributes such as color, shape, size (relative to the enclosing cell), and opacity
View: a specific visualization of a data table; it can be linked with other views of the same data
Interaction: runs through the whole visualization pipeline, including data-level filtering and sorting; attribute selection for the bin operation; layout method selection; visual encoding selection; unit selection, tooltips, hover, and other interactions; and linked analysis between views
Animation: the animation of visual elements during the add, update, and delete phases

Figure 3: Visual analysis framework for machine learning
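The grouping operations that turn rows into Groups can be sketched roughly as follows. This is an illustrative assumption about how groupby and bin might behave, not the framework's actual code; the helper names (getPath, groupBy, bin) and the sample rows are hypothetical, and bin is shown with equal-width intervals only.

```javascript
// Read a nested attribute such as 'basic_info.id' from a row.
function getPath(row, path) {
  return path.split('.').reduce((obj, key) => (obj == null ? undefined : obj[key]), row);
}

// groupby: one Group per distinct value of the chosen attribute.
function groupBy(rows, path) {
  const groups = new Map();
  for (const row of rows) {
    const key = getPath(row, path);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(row);
  }
  return groups;
}

// bin: split a numeric attribute into `count` equal-width intervals.
function bin(rows, path, count) {
  const values = rows.map((row) => getPath(row, path));
  const min = Math.min(...values);
  const max = Math.max(...values);
  const width = (max - min) / count || 1; // avoid a zero width when all values are equal
  const bins = Array.from({ length: count }, () => []);
  rows.forEach((row, i) => {
    // The maximum value falls on the upper edge, so clamp it into the last bin.
    const index = Math.min(count - 1, Math.floor((values[i] - min) / width));
    bins[index].push(row);
  });
  return bins;
}

// Hypothetical rows.
const rows = [
  { basic_info: { id: '1' }, feats: { feature_11: 104 } },
  { basic_info: { id: '1' }, feats: { feature_11: 20 } },
  { basic_info: { id: '2' }, feats: { feature_11: 60 } },
];

const groups = groupBy(rows, 'basic_info.id');
const bins = bin(rows, 'feats.feature_11', 2);
console.log([...groups.keys()], bins.map((b) => b.length));
```

Each resulting Group would then be assigned a Container, whose position and area are decided by the layout step.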

Key steps

Raw data

To explain the framework better, we analyze a specific set of business data. The data is provided as an array. Each record in the array describes one entity and includes the fields basic information (basic_info), selection (selection), features (feats), and details (details). The visualization scheme, settled after many discussions, uses multi-view visualization to support longitudinal comparison of the feature data of different entities, with the feature data arranged in descending order of time.

// Business data
[
  // One entity description
  {
    // Basic information
    basic_info: {
      "id": "1", // group id
      ...
    },
    selection: {
      ...
    },
    // Entity feature space
    feats: {
      "feature_1": 0, // boolean type
      "feature_2": 1,
      "feature_3": 1,
      "feature_4": 0,
      "feature_5": 1,
      "feature_6": 0,
      "feature_7": 1,
      "feature_8": 0,
      "feature_9": 0,
      "feature_10": 1,
      "feature_11": 104, // numeric type
      "feature_12": 104
    },
    details:{
      ...
    },
  },
  ...
]
// end
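As a rough illustration of how a record of this shape could be turned into one row per feature, consider the sketch below. The record values are made up, the elided fields are simply left out, and flattenFeats is a hypothetical helper rather than part of the framework.

```javascript
// Hypothetical record following the shape above; the values are made up.
const entity = {
  basic_info: { id: '1' },
  feats: { feature_1: 0, feature_2: 1, feature_11: 104 },
};

// Flatten: emit one unit per feature, keeping a reference to the group id.
function flattenFeats(record) {
  return Object.entries(record.feats).map(([name, value]) => ({
    groupId: record.basic_info.id,
    name,
    value,
  }));
}

const units = flattenFeats(entity);
console.log(units);
```

Keeping the group id on every unit makes it possible to trace a drawn cell back to the entity it came from.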

Advanced syntax configuration

{
  width: 600,
  height: 200,
  margin: {
    top: 10,
    right: 30,
    bottom: 30,
    left: 100,
  },
  autoFit: true, // if set to false, width and height must be set manually
  layouts: [
    {
      name: 'layout1',
      type: 'gridxy',
      aspect_ratio: 'fillY', // layout mode
      align: 'TB', // top to bottom
      subgroup: { // subspace
        type: 'groupby', // bin | groupby | flatten
        key: 'basic_info.id', // group by ID
      },
      size: {
        type: 'count', // size each subspace by its number of elements
      },
      sort: null, // how elements within a subspace are sorted
      padding: { // layout padding
        top: 0,
        left: 0,
        bottom: 0,
        right: 0,
      },
      margin: { // layout margin
        top: 0,
        left: 0,
        bottom: 0,
        right: 0,
      },
      box: { // box style
        fill: 'white',
        stroke: 'gray',
        'stroke-width': 1,
        opacity: 0.5,
      },
    }, {
      name: 'layout2',
      type: 'gridxy',
      subgroup: {
        type: 'flatten', // flattened (tiled) layout
        key: 'feats', // features
      },
      aspect_ratio: 'fillX',
      size: {
        type: 'count',
      },
      align: 'LR', // left to right
      interactions: [], // interactions
      padding: { // padding
        top: 5,
        left: 5,
        bottom: 5,
        right: 5,
      },
      margin: { // margin
        top: 5,
        left: 5,
        bottom: 5,
        right: 5,
      },
      box: { // box style
        fill: 'white',
        stroke: 'gray',
        'stroke-width': 1,
        opacity: 0.5,
      },
    },
  ],
  mark: {
    shape: 'rect', // unit shape
    isColorScaleShared: true,
    size: { // depends on the unit shape
      type: 'uniform', // uniform size
      width: 20, // rect width
      height: 20, // rect height
      rx: 2, // rx
      ry: 2, // ry
    },
  },
  filters: [ // fields to filter on (cross-filter processing)
    'feature_1',
    'feature_2',
    'feature_3',
    'feature_4',
    'feature_5',
    'feature_6',
    'feature_7',
    'feature_8',
    'feature_9',
    'feature_10',
    'feature_11',
    'feature_12',
  ],
  chart: undefined, // uses a custom chart; no chart name is specified explicitly
};

As Figure 4 shows, the process starts from the root container, which contains all the data. The entity data is first grouped by the ID in the basic information; each sub-container is sized according to the number of elements it contains, and the sub-containers are laid out from top to bottom. Then, within each group, the feature attributes of the entity data are laid out from left to right as cells of equal size. The result is the multi-view visualization of the orders.

Figure 4: Multi-view order visualization generated by the advanced syntax
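The two layout passes described above can be approximated with simple rectangle arithmetic. The sketch below is an assumption about what count-based sizing with 'TB' alignment and equal-size cells with 'LR' alignment amount to; it ignores padding and margin, and the function names and group counts are hypothetical.

```javascript
// Stack sub-containers top to bottom; each height is proportional to its element count.
function layoutTopToBottom(container, counts) {
  const total = counts.reduce((sum, n) => sum + n, 0);
  let y = container.y;
  return counts.map((n) => {
    const height = (container.height * n) / total;
    const box = { x: container.x, y, width: container.width, height };
    y += height;
    return box;
  });
}

// Place `count` equally sized cells from left to right inside a container.
function layoutLeftToRight(container, count) {
  const width = container.width / count;
  return Array.from({ length: count }, (_, i) => ({
    x: container.x + i * width,
    y: container.y,
    width,
    height: container.height,
  }));
}

const root = { x: 0, y: 0, width: 600, height: 200 };
// Hypothetical groups: id "1" has 3 records, id "2" has 1 record.
const groupBoxes = layoutTopToBottom(root, [3, 1]);
// 12 features per entity, as in the sample data.
const featureCells = layoutLeftToRight(groupBoxes[0], 12);
console.log(groupBoxes, featureCells[1]);
```

With these inputs the first group takes three quarters of the height, and each of its feature cells is 50 units wide.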

Advanced syntax analysis

To generate the target visualization, our grammar constructs a root container and applies unit visualization operations recursively until every container has become a unit. In other words, rendering becomes a tree traversal in which the root container is the root node and the unit containers are the leaf nodes. Once all the cells have been generated, the layout is complete and the cells can be drawn. Before parsing the syntax, we first build the root container, RootContainer, which holds the original data, the predecessor node, a label, the visual space (width, height, padding, and position), and other information. The layouts configuration is parsed into a hierarchical nested structure, which is then laid out starting from RootContainer to generate the child containers, ChildrenContainer, at each level.

Figure 5: Result for the example data
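The recursive construction can be sketched as a small tree builder. This is a simplified assumption about the parsing step, not the framework's parser: it handles only the groupby and flatten subgroup types, stores the parent's label instead of a full predecessor reference, and omits the visual space. The names split and buildTree are hypothetical.

```javascript
// Split rows according to one layout's subgroup rule (only groupby and flatten are handled).
function split(rows, subgroup) {
  if (subgroup.type === 'groupby') {
    const groups = new Map();
    for (const row of rows) {
      const key = subgroup.key.split('.').reduce((obj, k) => obj[k], row);
      groups.set(key, [...(groups.get(key) || []), row]);
    }
    return [...groups].map(([label, data]) => ({ label, data }));
  }
  // flatten: one child per attribute of the object stored under `key`.
  return rows.flatMap((row) =>
    Object.entries(row[subgroup.key]).map(([label, value]) => ({ label, data: [{ value }] }))
  );
}

// Recursively apply the layouts; when none are left, the container is a unit (leaf).
function buildTree(container, layouts) {
  if (layouts.length === 0) return { ...container, isUnit: true, children: [] };
  const [layout, ...rest] = layouts;
  const children = split(container.data, layout.subgroup).map((child) =>
    buildTree({ ...child, parent: container.label }, rest)
  );
  return { ...container, isUnit: false, children };
}

// Hypothetical data and a reduced layouts configuration.
const data = [
  { basic_info: { id: '1' }, feats: { feature_1: 0, feature_2: 1 } },
  { basic_info: { id: '2' }, feats: { feature_1: 1, feature_2: 1 } },
];
const layouts = [
  { subgroup: { type: 'groupby', key: 'basic_info.id' } },
  { subgroup: { type: 'flatten', key: 'feats' } },
];
const tree = buildTree({ label: 'root', data }, layouts);
console.log(JSON.stringify(tree, null, 2));
```

The depth of the tree equals the number of entries in layouts, and every leaf corresponds to one drawable unit.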

Applications

Single user order heat map

Figure 6: When the order data contains only one user

Heat map of multiple user orders

Figure 7: Comparison of multiple users

Interactive behavior

Click, mouseover, mouseout, and other interactions are supported. Click retrieves all the information for an order, while mouseover and mouseout highlight the current focus and clear the highlight, respectively.

Figure 8: Mouse interaction effects
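One way to reason about these three events is as transitions on a small piece of interaction state. The sketch below is an assumption for illustration only; the state shape, the event objects, and the unit id are hypothetical, and no rendering or DOM code is involved.

```javascript
// Hypothetical interaction state: which unit is highlighted and which was last clicked.
function reduceInteraction(state, event) {
  switch (event.type) {
    case 'mouseover':
      return { ...state, highlighted: event.unitId };
    case 'mouseout':
      return { ...state, highlighted: null };
    case 'click':
      return { ...state, selected: event.unitId };
    default:
      return state;
  }
}

// A hover, a click, and then the pointer leaving the unit.
const events = [
  { type: 'mouseover', unitId: 'order-7' },
  { type: 'click', unitId: 'order-7' },
  { type: 'mouseout', unitId: 'order-7' },
];
const finalState = events.reduce(reduceInteraction, { highlighted: null, selected: null });
console.log(finalState);
```

After this sequence the highlight is cleared while the clicked unit stays selected, so its details could still be shown.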

Filtering by feature attribute is also supported. When the user cares only about feature 1 and feature 2, the effect is as shown in the figure below.

Figure 9: Comparison of field filtering effects
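Feature filtering of this kind can be sketched as keeping only the named attributes of each record. This assumes that the filters list names the features to keep, which the configuration does not state explicitly; filterFeats and the sample record are hypothetical.

```javascript
// Keep only the feature attributes named in `filters`.
function filterFeats(record, filters) {
  const feats = Object.fromEntries(
    Object.entries(record.feats).filter(([name]) => filters.includes(name))
  );
  return { ...record, feats };
}

// Hypothetical record with three features.
const record = {
  basic_info: { id: '1' },
  feats: { feature_1: 0, feature_2: 1, feature_3: 1 },
};
const filtered = filterFeats(record, ['feature_1', 'feature_2']);
console.log(filtered.feats);
```

The original record is left unchanged, so switching the filter back only requires calling the function again with a different list.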

References

[1] Ben Shneiderman, The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations
[2] https://projector.tensorflow.org/
[3] https://umap-learn.readthedocs.io/en/latest/how_umap_works.html
[4] https://distill.pub/2016/misread-tsne/
[5] Visualization of Labeled Data Using Linear Transformations
[6] https://g2.antv.vision/zh/docs/manual/about-g2
[7] https://vega.github.io/vega-lite/

Author: ES2049 / Teacher Li

The article can be reprinted at will, but please keep this link to the original text.
You are very welcome to join ES2049 Studio if you are passionate. Please send your resume to caijun.hcj@alibaba-inc.com
