1. Summary
Clustering is a statistical data analysis technique that is widely used in many fields, including machine learning, data mining, and image analysis. Clustering divides objects into groups, or subsets, so that the members of each subset share similar properties.
A clustering algorithm is a method for automatically dividing a set of unlabeled data into several categories. In practice, clustering can help us solve many classification problems in computing, such as grouping colors into categories, grouping points by density in spatial coordinates, and grouping shoppers by their characteristics in e-commerce. Beyond classification, it can also help with anomaly detection. What is anomaly detection? We can understand it as looking for noise, that is, finding the mouse droppings in a pot of porridge.
This article introduces how the clustering algorithm works and how it is applied in D2C (design-to-code), where code is generated from design drafts.
2. DBSCAN clustering algorithm
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that handles noise. Compared with K-Means, which is only suitable for convex sample sets, DBSCAN can be used for both convex and non-convex sample sets. It can group scattered samples by similarity; that is, when the number of clusters is not known in advance, it can divide and group the samples according to how tightly they are packed. For example:
Suppose we need to cluster the data "100, 101, 123, 98, 200, 203, 220", and the minimum number of points per cluster (minPoints) is 2.
If we set the cluster density threshold (that is, the neighborhood radius) to 30, then "100, 101, 123, 98" and "200, 203, 220" are divided into 2 clusters.
If the cluster density threshold is 10, then "100, 101, 98" and "200, 203" are divided into 2 clusters, and "123" and "220" are noise points (abnormal data).
2.1 Core idea
The DBSCAN algorithm finds all the dense regions among the sample points; we call these dense regions clusters. Sample points that are not in any dense region are called noise points. Therefore, in addition to classifying, DBSCAN can also find the "mouse droppings in a pot of porridge".
2.2 Algorithm parameters
parameter | description |
---|---|
Neighborhood radius Eps | The search radius of each sample point. Other sample points found within this radius can be regarded as similar to the center point. |
Minimum number of points minPoints | The minimum number of samples needed to form a cluster, i.e. the minimum number of samples each cluster must contain. In the figure above, the red and blue points each find more than minPoints sample points within the radius R, while the yellow point finds fewer than minPoints. |
2.3 Categories of Points
category | description |
---|---|
core point | The number of sample points within its neighborhood radius Eps is >= the minimum number of points minPoints. |
boundary point | A point that is not a core point but lies within the neighborhood of a core point. |
noise point | Neither a core point nor a boundary point. |
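The three categories in the table above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `classify_points` is hypothetical, the samples are the one-dimensional values from the example in section 2, and the sketch assumes that a point counts itself when its neighborhood is compared against minPoints. The second call uses parameters (Eps = 22, minPoints = 3) that are not from the article and were chosen only so that a boundary point appears.

```python
from typing import Dict, List


def classify_points(values: List[float], eps: float, min_points: int) -> Dict[float, str]:
    """Label each 1-D sample as 'core', 'boundary' or 'noise'."""
    # Eps-neighborhood of every sample (assumption: a sample counts itself).
    neighbors = {
        i: [j for j, other in enumerate(values) if abs(value - other) <= eps]
        for i, value in enumerate(values)
    }
    core = {i for i, near in neighbors.items() if len(near) >= min_points}

    labels: Dict[float, str] = {}
    for i, value in enumerate(values):
        if i in core:
            labels[value] = "core"
        elif any(j in core for j in neighbors[i]):
            # Not a core point, but inside the neighborhood of one.
            labels[value] = "boundary"
        else:
            labels[value] = "noise"
    return labels


samples = [100, 101, 123, 98, 200, 203, 220]

# Parameters from the article's example: 123 and 220 come out as noise.
print(classify_points(samples, eps=10, min_points=2))

# Illustrative parameters (not from the article): 123 becomes a boundary point,
# because it is within 22 of the core point 101 but has too few neighbors itself.
print(classify_points(samples, eps=22, min_points=3))
```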
2.4 Relationships between points
relation | description |
---|---|
directly density-reachable | If A is a core point and B is within the Eps neighborhood of A, then B is directly density-reachable from A. Any boundary point within the Eps neighborhood of a core point is directly density-reachable from that core point. |
density-reachable | Suppose there are core points C, D, E, F, where D is directly density-reachable from C, E from D, and F from E. Then F is density-reachable from C. If G (a boundary point) is directly density-reachable from F (a core point), then G is also density-reachable from C. |
density-connected | If there is a core point from which both sample point X and sample point Y are density-reachable, then X and Y are density-connected. |
not density-connected | Two points that are not density-connected. Such points belong to different clusters, or at least one of them is a noise point. |
2.5 Algorithm Implementation Steps
The maximal set of density-connected samples derived from the density-reachability relation forms one category, or cluster, of the final clustering. In terms of implementation, we can divide it into the following 4 steps:
Step 1: Select any core point that has no category yet as the initial point;
Step 2: Find the set of samples that are density-reachable from this core point, that is, everything in the neighborhoods it can reach, down to the boundary points; this set becomes a cluster;
Step 3: Find another core point that has no category yet and repeat Step 2;
Step 4: Repeat until all the points have been processed.
Let’s take a more vivid example: imagine there is a pyramid-scheme recruiter (_core point_) in a group of people. To build a downline, the recruiter first needs to find N people (_minPoints_), so they look for recruits among the people nearby (_neighborhood_). Each recruit then keeps recruiting in the same way until there is no one left around; a recruit who cannot find enough people nearby simply stays at the edge of the network (boundary point).
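The four steps in section 2.5 can be sketched as follows. This is a minimal, unoptimized illustration under a few assumptions: the names `dbscan`, `NOISE` and `gap` are hypothetical, the distance function is passed in by the caller, a point counts itself toward minPoints, and neighborhoods are computed by brute force. The usage at the end reruns the numeric example from section 2.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
NOISE = -1  # label for points that end up in no cluster


def dbscan(points: Sequence[T], eps: float, min_points: int,
           distance: Callable[[T, T], float]) -> List[int]:
    """Return one cluster id per point; NOISE (-1) marks noise points."""
    n = len(points)
    # Eps-neighborhood of every point (assumption: a point counts itself).
    neighbors = [
        [j for j in range(n) if distance(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(near) >= min_points for near in neighbors]

    labels = [NOISE] * n
    cluster_id = 0
    for start in range(n):
        # Step 1: pick a core point that has no category yet.
        if not is_core[start] or labels[start] != NOISE:
            continue
        # Step 2: collect everything density-reachable from it.
        labels[start] = cluster_id
        frontier = [start]
        while frontier:
            current = frontier.pop()
            for j in neighbors[current]:
                if labels[j] == NOISE:
                    labels[j] = cluster_id
                    # Only core points keep expanding the cluster;
                    # boundary points join but do not spread further.
                    if is_core[j]:
                        frontier.append(j)
        # Step 3: move on to the next core point without a category.
        cluster_id += 1
    # Step 4: all points processed; unreached points keep the NOISE label.
    return labels


def gap(a: float, b: float) -> float:
    """Distance between two 1-D samples."""
    return abs(a - b)


samples = [100, 101, 123, 98, 200, 203, 220]
print(dbscan(samples, eps=30, min_points=2, distance=gap))  # [0, 0, 0, 0, 1, 1, 1]
print(dbscan(samples, eps=10, min_points=2, distance=gap))  # [0, 0, -1, 0, 1, 1, -1]
```

Passing the distance as a parameter is a deliberate choice in this sketch: the same loop can then be reused with a distance between layout blocks instead of a distance between points.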
3. Combining the layout algorithm with DBSCAN
After a brief introduction to the algorithm concept and algorithm implementation of DBSCAN, let's talk about the application scenarios of the clustering algorithm in the Deco layout algorithm.
The core of the layout algorithm is actually grouping. The key question is how to judge, from the position and size of each module in the design draft, whether a set of modules can form a group and be wrapped in a DIV.
As shown in the figure above, there are 11 white block nodes on the design draft. With the naked eye we can see that, based on how close the nodes are to each other, the upper part and the lower part are separate. But that relies on human vision, so how do we make the machine see them as separate too? This is where the DBSCAN clustering algorithm just mentioned comes in: our goal is to make the upper part form one cluster and the lower part form another.
DBSCAN uses the Euclidean distance between points as the measure of closeness. For layout nodes, we change our thinking and use the shortest distance between blocks as the measure of closeness instead.
3.1 From point distance to block distance
Obtaining the shortest distance between two blocks is relatively simple. There are three cases:
Case 1: the two blocks intersect, so the distance is 0;
Case 2: block B lies directly above, below, to the left of, or to the right of block A, so you only need the gap between their facing edges;
Case 3: block B lies to the upper left, lower left, upper right, or lower right of block A, so use the Pythagorean theorem to obtain the diagonal distance between the two nearest opposing vertices.
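The three cases above collapse into one small function if the horizontal and vertical gaps are computed separately. The sketch below is only an illustration: the `Block` type, its field names, and the coordinate convention (origin at the top left, `x`/`y` as the top-left corner) are assumptions, not something the article specifies.

```python
import math
from typing import NamedTuple


class Block(NamedTuple):
    """Axis-aligned rectangle (hypothetical shape for a design-draft node)."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def block_distance(a: Block, b: Block) -> float:
    """Shortest distance between two blocks."""
    # Gap along each axis; 0 when the projections on that axis overlap.
    dx = max(b.x - (a.x + a.width), a.x - (b.x + b.width), 0)
    dy = max(b.y - (a.y + a.height), a.y - (b.y + b.height), 0)
    # Case 1: dx == 0 and dy == 0  -> blocks intersect, distance 0.
    # Case 2: exactly one is 0     -> distance is the other gap.
    # Case 3: both are positive    -> diagonal via the Pythagorean theorem.
    return math.hypot(dx, dy)


a = Block(0, 0, 10, 10)
print(block_distance(a, Block(5, 5, 10, 10)))    # intersecting: 0.0
print(block_distance(a, Block(15, 0, 10, 10)))   # side by side: 5.0
print(block_distance(a, Block(13, 14, 10, 10)))  # diagonal (gaps 3 and 4): 5.0
```

A function like this can be passed as the distance measure to a DBSCAN implementation in place of the Euclidean distance between points.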
The effect after this change is shown in the figure below. Running the clustering algorithm, we can finally divide the upper and lower parts into 2 clusters:
3.2 Derivation of Neighborhood Radius
The inputs to the DBSCAN clustering algorithm are the sample data set, the threshold number of data objects minPoints, and the neighborhood radius Eps. In the layout algorithm, what value of Eps is appropriate? It cannot always be a fixed value: some modules have larger overall spacing and some have smaller. When we group the blocks of a real layout, we need to derive the neighborhood radius Eps dynamically.
Step 1: We first collect statistics on the distances within the sample data set, finding the shortest distance between each pair of these 5 blocks:
module | Module 1 | Module 2 | Module 3 | Module 4 | Module 5 |
---|---|---|---|---|---|
Module 1 | - | 5 | 5 | 7 | 210 |
Module 2 | 5 | - | 7 | 5 | 100 |
Module 3 | 5 | 7 | - | 5 | 214 |
Module 4 | 7 | 5 | 5 | - | 107 |
Module 5 | 210 | 100 | 214 | 107 | - |
Step 2: Then we can get the shortest distance between each module and its nearest module according to the distance matrix table:
module | Module 1 | Module 2 | Module 3 | Module 4 | Module 5 |
---|---|---|---|---|---|
shortest distance | 5 | 5 | 5 | 5 | 100 |
Step 3: From this data, we need to extract the values that account for the majority, that is, the more representative data, as the basis for our Eps value, and remove the interfering outliers:
Using the standard deviation formula, we take 1 standard deviation as the filter and keep the range that covers the majority of samples. For [5, 5, 5, 5, 100], the mean is 24 and the population standard deviation is 38.
Taking one standard deviation as the basis, the maximum value within one standard deviation of the mean is 24 + 38 = 62, so we can take 62 as the neighborhood radius Eps for this sample set.
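The arithmetic in Steps 2 and 3 can be checked with a short script. This is a sketch using Python's standard `statistics` module; the function names `nearest_distances` and `derive_eps` are hypothetical, and the matrix is copied from the Step 1 table with 0 on the diagonal in place of "-".

```python
from statistics import mean, pstdev
from typing import List


def nearest_distances(matrix: List[List[float]]) -> List[float]:
    """Step 2: for each module, the distance to its nearest other module."""
    return [
        min(d for j, d in enumerate(row) if j != i)
        for i, row in enumerate(matrix)
    ]


def derive_eps(nearest: List[float]) -> float:
    """Step 3: mean plus one population standard deviation."""
    return mean(nearest) + pstdev(nearest)


# Distance matrix from the Step 1 table (diagonal entries are ignored).
distance_matrix = [
    [0, 5, 5, 7, 210],
    [5, 0, 7, 5, 100],
    [5, 7, 0, 5, 214],
    [7, 5, 5, 0, 107],
    [210, 100, 214, 107, 0],
]

nearest = nearest_distances(distance_matrix)
print(nearest)              # [5, 5, 5, 5, 100]
print(mean(nearest))        # 24
print(pstdev(nearest))      # 38.0
print(derive_eps(nearest))  # 62.0
```

With Eps = 62, Modules 1 to 4 (all within 7 of each other) fall into one neighborhood, while Module 5, whose nearest neighbor is 100 away, stays outside it.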
3.3 Algorithm optimization
With the adaptations above, module clustering is essentially complete. In practice, we also optimize how the neighborhood radius Eps is dynamically generated for real layout scenarios:
For example, a layout like the following: the horizontal spacing is 5 and the vertical spacing is 10:
If we rely only on the standard deviation of the shortest distances, the shortest distance of all 8 modules is 5, so Eps is finally calculated to be 5, and the upper and lower rows are very likely to be separated.
Therefore, in practice, when generating the samples for the standard deviation, the vertical spacing of 10 is also taken into account according to certain rules and included in the samples.
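The effect can be illustrated numerically. The article does not spell out the rules for adding the vertical spacing to the samples, so the rule used below is purely an assumption made for this sketch: every module contributes its nearest distance along each axis (one horizontal sample and one vertical sample). The names `derive_eps`, `nearest_only` and `both_axes` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def derive_eps(samples: List[float]) -> float:
    """Mean plus one population standard deviation of the samples."""
    return mean(samples) + pstdev(samples)


# Layout from the example: 8 modules, horizontal spacing 5, vertical spacing 10.
horizontal_gap, vertical_gap, module_count = 5, 10, 8

# Shortest-distance samples only: every module's nearest neighbor is 5 away.
nearest_only = [horizontal_gap] * module_count
eps_nearest = derive_eps(nearest_only)
print(eps_nearest, eps_nearest >= vertical_gap)  # 5.0 False: the rows split

# ASSUMED rule (not from the article): each module also contributes its
# nearest vertical distance, so the 10 enters the sample set.
both_axes = [horizontal_gap] * module_count + [vertical_gap] * module_count
eps_both = derive_eps(both_axes)
print(eps_both, eps_both >= vertical_gap)        # 10.0 True: the rows can merge
```

Under this assumed rule the mean is 7.5 and the population standard deviation is 2.5, so Eps reaches the vertical spacing of 10. A different sampling rule would give a different Eps, so the numbers here show only the direction of the optimization.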
4. Application in production
The techniques above have been put into production in the Deco intelligent code generation project. Deco is our team's exploration in the direction of "front-end intelligence". It focuses on generating multi-terminal code from design drafts with one click: it parses design drafts from Sketch, Photoshop, and other tools and directly generates multi-terminal code (Taro/React/Vue). With Deco, front-end engineers no longer need to spend much effort reproducing design drafts, which greatly reduces development costs, provides strong support for delivering more multi-terminal pages, and helps the business cut costs and improve efficiency.
In the past year, Deco has been successfully used in two major promotional campaigns. In building personalized event venues, R&D efficiency increased by 48%.
Interested readers can try it out on the Deco official website. I have also attached a step-by-step tutorial for getting started with Deco.
5. Summary
This article mainly introduces how DBSCAN works. It does not aim to provide a complete production implementation; if you are interested, many concrete implementations can be found online. The main purpose is to explain the idea behind the clustering algorithm and how it is applied in D2C layout. Besides density-based clustering algorithms such as DBSCAN, there are many more algorithms waiting to be explored for D2C layout.
Welcome to the Aotu Labs blog: aotu.io
You can also follow the AOTULabs official account (AOTULabs), which publishes articles from time to time.