What are the cluster analysis methods?

Clustering divides a data set into different classes or clusters according to some criterion (for example a distance criterion, i.e. the distance between data points), so that data objects within the same cluster are as similar as possible, while data objects in different clusters are as dissimilar as possible. Put simply, data of the same kind are gathered together as much as possible, and data of different kinds are separated as much as possible.

Clustering technology is developing rapidly, with contributions from research areas including data mining, statistics, machine learning, spatial database technology, biology, and marketing. Various clustering methods have been proposed and are continually being improved, and different methods suit different types of data, so comparing clustering methods and their results has become a subject worth studying.

Cluster analysis is an ideal multivariate statistical technique, consisting mainly of hierarchical clustering and iterative clustering. Also known as group analysis or point-group analysis, it is a multivariate statistical method for studying classification.

For example, bank outlets can be divided into several grades based on factors such as savings volume, human resources, business floor area, featured functions, outlet level, and functional district of each outlet, and the number of outlets in each grade can then be compared across banks.

Classification of clustering algorithms

There are currently a large number of clustering algorithms. For a specific application, the choice of clustering algorithm depends on the type of data and the purpose of the clustering. If cluster analysis is used as a descriptive or exploratory tool, several algorithms can be tried on the same data to see what the data may reveal.

The main clustering algorithms can be divided into the following categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.

Research on clustering is not limited to the hard clustering described above, in which each data object can be assigned to only one class; fuzzy clustering [10] is also a widely studied branch of cluster analysis. Fuzzy clustering uses membership functions to determine the degree to which each data object belongs to each cluster, rather than assigning an object outright to a single cluster. Many fuzzy clustering algorithms have been proposed, such as the well-known FCM algorithm discussed later.

Common clustering methods

1. k-means cluster analysis, suitable for clustering samples;

2. Hierarchical clustering, suitable for clustering variables;

3. Two-step clustering, suitable for clustering categorical and continuous variables together;

4. Density-based clustering algorithms;

5. Grid-based clustering;

6. Clustering algorithms in machine learning.

The first three can be carried out with simple operations in SPSS.

Four common clustering algorithms

k-means clustering algorithm

K-means is one of the classical clustering algorithms among the partitioning methods. Because of its efficiency, it is widely used for clustering large-scale data, and many algorithms extend and improve upon it.

With k as a parameter, the goal of the k-means algorithm is to divide n objects into k clusters so that similarity within a cluster is high while similarity between clusters is low.

The k-means algorithm works as follows. First, k objects are selected at random, each initially representing the mean, or center, of a cluster. Each remaining object is then assigned to the nearest cluster according to its distance from each cluster center, after which the mean of each cluster is recalculated. This process repeats until the criterion function converges. Usually the squared-error criterion is used, defined as

E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2

Here E is the sum of the squared errors over all objects in the data set, p is a point in space, and m_i is the mean of cluster C_i [9]. This objective function makes the generated clusters as compact and as separate as possible. The distance measure used is the Euclidean distance, although other distance measures can be used. The flow of the k-means clustering algorithm is as follows:

Input: a data set containing n objects and the number of clusters k;

Output: k clusters that minimize the squared-error criterion.

Steps:

(1) Arbitrarily select k objects as the initial cluster centers;

(2) Repeat;

(3) (Re)assign each object to the most similar cluster, based on the mean of the objects in the cluster;

(4) Update the cluster means, i.e. calculate the mean of the objects in each cluster;

(5) Until the assignments no longer change.
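To make these steps concrete, here is a minimal NumPy sketch of the algorithm; the function name, default parameters, and random initialization are illustrative assumptions rather than a reference implementation:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch following the steps above."""
    rng = np.random.default_rng(seed)
    # (1) arbitrarily select k objects as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):                      # (2) repeat:
        # (3) assign each object to the nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                  # (5) assignments stopped changing
        labels = new_labels
        # (4) update each center to the mean of the objects in its cluster
        for i in range(k):
            if np.any(labels == i):
                centers[i] = X[labels == i].mean(axis=0)
    # squared-error criterion E = sum_i sum_{p in C_i} ||p - m_i||^2
    E = float(((X - centers[labels]) ** 2).sum())
    return labels, centers, E
```

Calling `labels, centers, E = kmeans(X, 3)` on a numeric array `X` returns the cluster assignments, the cluster means, and the squared-error criterion E defined above.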

Summary:

Advantages: simple and straightforward (in both logic and implementation), easy to understand, and effective on low-dimensional data sets (a simple algorithm does not necessarily perform poorly).

Disadvantages: for high-dimensional data (hundreds of dimensions or more, which is common in practice), computation is very slow, mainly because of the distance calculations (the Euclidean distance; this can of course be parallelized, but that is an implementation-level issue). Another drawback is that the desired number of clusters k must be set in advance; without a good understanding of the data, choosing k becomes guesswork.

Hierarchical clustering algorithm

Depending on the order of hierarchical decomposition, bottom-up or top-down, hierarchical clustering algorithms are divided into agglomerative hierarchical clustering and divisive hierarchical clustering.

The strategy of agglomerative hierarchical clustering is to first treat each object as its own cluster and then merge clusters into larger and larger clusters until all objects are in a single cluster or some termination condition is met. Most hierarchical clustering methods are agglomerative; they differ only in the definition of similarity between clusters. Four widely used measures of the distance between clusters are the minimum distance (single linkage), the maximum distance (complete linkage), the mean distance (between cluster centroids), and the average distance (average linkage).

Here is the flow of the agglomerative hierarchical clustering algorithm using the minimum distance:

(1) Treat each object as a class and calculate the minimum distance between every pair of classes;

(2) Merge the two classes with the smallest distance into one new class;

(3) Recalculate the distances between the new class and all the other classes;

(4) Repeat (2) and (3) until all classes are merged into one class.

Summary:

Advantages:

1. Distance and similarity rules are easy to define, and there are few restrictions;

2. The number of clusters does not need to be determined in advance;

3. It can discover the hierarchical relationships between classes (very useful in certain fields, such as biology);

Disadvantages:

1. The computational complexity is too high (consider parallelization);

2. Outliers can have a large impact;

3. The algorithm is prone to chaining (clusters nested layer within layer);

4. Although the algorithm does not require the number of clusters in advance, we must choose which level of the hierarchy to take as the final clustering, and this has to be done from the actual situation and experience; after all, agglomerative clustering runs from every object as its own cluster at the bottom to all objects merged into one at the top, so many different clusterings are possible.

Of course, there are many remedies for this problem. A common one is to stop merging once the distance between the clusters to be merged exceeds a given threshold, as in the sketch below.
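Assuming SciPy is available, the following sketch performs single-linkage (minimum-distance) agglomerative clustering and shows both stopping rules: cutting the hierarchy at a fixed number of clusters, and stopping once the merge distance exceeds a threshold; the toy data and the threshold value are made up for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))               # toy 2-D data for illustration

# Agglomerative clustering with the minimum-distance (single-linkage) rule;
# linkage() performs steps (1)-(4) above and returns the full merge tree.
Z = linkage(X, method='single', metric='euclidean')

# Cut the hierarchy at a chosen number of clusters ...
labels_by_count = fcluster(Z, t=3, criterion='maxclust')

# ... or stop merging once the inter-cluster distance exceeds a threshold.
labels_by_threshold = fcluster(Z, t=0.5, criterion='distance')
```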

SOM clustering algorithm

The SOM neural network [11] was proposed by the Finnish neural network expert Professor Kohonen. The algorithm assumes that the input objects have some topological structure or ordering and achieves a dimensionality-reducing mapping from the input space (n-dimensional) to an output plane (2-dimensional). The mapping preserves topological features and has strong theoretical links with actual brain processing.

The SOM network consists of an input layer and an output layer. The input layer corresponds to the high-dimensional input vector, and the output layer consists of a series of ordered nodes organized on a 2-dimensional grid. Input and output nodes are connected by weight vectors. During learning, the output-layer unit closest to the input, i.e. the winning unit, is found and updated; at the same time, the weights of its neighbors are updated so that the output nodes preserve the topological features of the input vectors.

Algorithm flow:

(1) Initialize the network, assigning an initial value to the weight vector of each output-layer node;

(2) Randomly select an input vector from the samples and find the weight vector with the smallest distance to it;

(3) Define the winning unit and adjust the weights in its neighborhood so that they move closer to the input vector;

(4) Provide a new sample and continue training;

(5) Shrink the neighborhood radius and reduce the learning rate; repeat until both are below the allowed values, then output the clustering result.
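The following minimal NumPy sketch follows this flow; the grid size, the exponential decay schedules for the learning rate and neighborhood radius, and the Gaussian neighborhood function are common choices but are assumptions here, not taken from the source:

```python
import numpy as np

def som(X, grid=(5, 5), n_iter=2000, lr0=0.5, radius0=2.0, seed=0):
    """Minimal SOM sketch mapping n-dimensional inputs onto a 2-D grid."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # (1) initialize the weight vector of every output-layer node
    W = rng.normal(size=(rows * cols, X.shape[1]))
    # 2-D grid coordinates of the output nodes, for neighborhood distances
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    for t in range(n_iter):
        # (2)/(4) randomly pick an input sample
        x = X[rng.integers(len(X))]
        # (3) the winning unit is the node whose weights are closest to x
        winner = int(np.argmin(np.linalg.norm(W - x, axis=1)))
        # (5) shrink the neighborhood radius and learning rate over time
        lr = lr0 * np.exp(-t / n_iter)
        radius = radius0 * np.exp(-t / n_iter)
        # Gaussian neighborhood: nodes near the winner on the grid move more
        d = np.linalg.norm(coords - coords[winner], axis=1)
        h = np.exp(-(d ** 2) / (2 * radius ** 2))
        W += lr * h[:, None] * (x - W)
    # each sample is clustered to its best-matching unit
    return np.argmin(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1)
```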

FCM clustering algorithm

In 1965, Professor Zadeh of the University of California, Berkeley first proposed the concept of fuzzy sets. After more than ten years of development, fuzzy set theory was gradually applied to many practical problems. To overcome the either/or limitation of crisp classification, cluster analysis based on fuzzy set theory was introduced; cluster analysis carried out with fuzzy mathematics is called fuzzy cluster analysis [12].

The FCM algorithm determines, through membership degrees, the extent to which each data point belongs to each cluster. It is an improvement on traditional hard clustering algorithms.

Algorithm flow:

(1) Standardize the data matrix;

(2) Build a fuzzy similarity matrix and initialize the membership matrix;

(3) Iterate until the objective function converges to a minimum;

(4) From the final membership matrix, determine the class to which each data object belongs and output the final clustering result.
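Here is a minimal NumPy sketch of FCM along these lines, assuming the data matrix X has already been standardized as in step (1); the fuzzifier m = 2, the tolerance, and the random initialization are illustrative defaults:

```python
import numpy as np

def fcm(X, k, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy c-means sketch; X is assumed to be standardized already."""
    rng = np.random.default_rng(seed)
    # (2) initialize the membership matrix U; each row sums to 1 over the k clusters
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):                  # (3) iterate until convergence
        Um = U ** m
        # cluster centers are membership-weighted means of the data
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # squared Euclidean distance from every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)                # guard against division by zero
        # standard membership update: u_ij proportional to d_ij^(-2/(m-1))
        inv = d2 ** (-1.0 / (m - 1))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    # (4) the final membership matrix determines the clustering
    return U, centers
```

A hard assignment can be read off at the end with `U.argmax(axis=1)`, and how strongly the winning membership dominates indicates how reliable that assignment is.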

Summary:

Advantages: compared with the "hard clustering" above, FCM computes each sample's membership in every class, which gives a measure of how reliable the classification of a sample is. If a sample's membership in one class clearly dominates its memberships in all the others, assigning the sample to that class is very safe; if its memberships are spread fairly evenly across the classes, other means are needed to classify it.

Disadvantages: it shares essentially all the shortcomings of k-means.

Test data for the four clustering algorithms

In the experiments, the IRIS [13] data set from the UCI database, which is dedicated to testing classification and clustering algorithms, was selected. The IRIS data set contains 150 samples, taken from three different iris species: Setosa, Versicolour, and Virginica. Each sample has 4 attributes: sepal length, sepal width, petal length, and petal width, all in cm. Running different clustering algorithms on this data set yields clustering results of different accuracy.
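For readers who want to reproduce the setup in Python rather than MATLAB, the same data set ships with scikit-learn (assuming it is installed):

```python
from sklearn.datasets import load_iris

iris = load_iris()                 # 150 samples, 4 attributes, 3 species
X, y = iris.data, iris.target
print(iris.feature_names)          # sepal/petal length and width, in cm
print(iris.target_names)           # ['setosa' 'versicolor' 'virginica']
```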

Description of the test results

Based on the algorithm principles and flows above, the algorithms were implemented and run in MATLAB, yielding the clustering results shown in Table 1.

As shown in Table 1, the four clustering algorithms are compared in three respects:

(1) Number of misclustered samples: the total number of samples clustered incorrectly, i.e. the sum of the misclustered samples over all classes;

(2) Running time: the time taken by the whole clustering process, in seconds;

(3) Average accuracy: let the original data set have k classes, let c_i denote the i-th class, n_i the number of samples in c_i, and m_i the number of samples correctly clustered into c_i; then m_i / n_i is the accuracy of the i-th class, and the average accuracy is

avg = \frac{1}{k} \sum_{i=1}^{k} \frac{m_i}{n_i}
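As an illustration of how these three quantities can be computed, the sketch below runs scikit-learn's k-means on IRIS and derives the error count and average accuracy; mapping each cluster to the majority true class is our own assumption for resolving the arbitrary ordering of cluster labels:

```python
import time
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

t0 = time.time()
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
runtime = time.time() - t0                    # (2) running time, in seconds

# Cluster indices are arbitrary, so map each cluster to the true class
# it overlaps most before counting errors.
mapped = np.empty_like(pred)
for c in np.unique(pred):
    mapped[pred == c] = np.bincount(y[pred == c], minlength=3).argmax()

errors = int((mapped != y).sum())             # (1) number of misclustered samples
per_class = [(mapped[y == i] == i).mean() for i in range(3)]
avg_accuracy = float(np.mean(per_class))      # (3) avg = (1/k) * sum(m_i / n_i)
print(errors, f"{runtime:.3f}s", avg_accuracy)
```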

Analysis of test results

Among the four clustering algorithms, k-means and FCM are superior to the others in running time and accuracy. However, each algorithm still has fixed drawbacks. The initial points of the k-means clustering algorithm are selected randomly, which makes the clustering result unstable; the figures in this experiment are averages over many runs, and how best to choose the initial points needs further study. Hierarchical clustering does not require the number of classes to be fixed in advance, but once a split or merge is performed it cannot be undone, which limits cluster quality. FCM is sensitive to the initial cluster centers, requires the number of clusters to be set manually, and easily falls into local optima. SOM has strong theoretical links with actual brain processing, but its processing time is long, and further research is needed to adapt it to large databases.
