[Text Mining] Text clustering and topic modelling
This post is based on the 732A92 Text Mining course, given by Marco Kuhlmann at LiU in 2019.
Before diving into text clustering, I want to first introduce clustering in general.
Clustering
- Clustering is a form of unsupervised learning.
- Typical applications
- As a stand-alone tool to get insight into data distribution
- As a preprocessing step for other algorithms
- Cluster: a collection of data objects. But which data objects end up in the same cluster? This is the most important concept in clustering: similar and dissimilar.
- Similar to one another within the same cluster
- Dissimilar to the objects in other clusters
=> Talking about similar and dissimilar necessarily brings up distance (or similarity) measures; only with a distance (similarity) measure can "similar" and "dissimilar" be defined.
Distance (Similarity)
- Distances are normally used to measure the similarity or dissimilarity between two data objects
Distances can be defined in many ways, but every definition must satisfy the following properties.
- \(d(i, j) \ge 0\) (non-negativity)
- \(d(i, i) = 0\) (identity of indiscernibles)
- \(d(i, j) = d(j, i)\) (symmetry)
- \(d(i, j) \le d(i, k) + d(k, j)\) (triangle inequality)
Minkowski distance
\(d(i, j) = \sqrt[q]{|x_{i_1} - x_{j_1}|^q + |x_{i_2} - x_{j_2}|^q + \ldots + |x_{i_p} - x_{j_p}|^q}\), where q is a positive integer
- If q = 1, d is Manhattan distance
- If q = 2, d is Euclidean distance
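As a minimal sketch (the function name and the example vectors are my own, not from the course), the Minkowski distance and both special cases follow from one function:

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance between two vectors, for a positive integer q."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(minkowski(x, y, 1))  # q = 1, Manhattan distance: 5.0
print(minkowski(x, y, 2))  # q = 2, Euclidean distance: sqrt(13) ≈ 3.61
```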
Binary Variables
- symmetric binary variables: both states are equally important; 0/1
- asymmetric binary variables: one state is more important than the other (e.g. the outcome of a disease test); 1 is the important state, 0 the other
- Accuracy for symmetric binary variables
- Jaccard similarity for asymmetric binary variables
Text clustering
The figure above shows a conceptual framework for text mining; this post covers Clustering and Topic Analysis.
- Text clustering is the task of grouping similar texts together. What is considered ‘similar’ depends on the application.
- Clustering is a central tool in exploratory data analysis, where it can help us to get insights into the distribution of a data set.
- Clustering is also useful as a pre-processing technique in knowledge-focused applications. Example: Brown clustering
Similarity measures
How can the similarity measures introduced above be applied to text?
- Accuracy for symmetric binary vectors
- Jaccard similarity for asymmetric binary vectors
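To make this concrete, here is a small sketch (the vocabulary and documents are made up for illustration) that represents two documents as binary term vectors and computes both measures:

```python
import numpy as np

# Binary term vectors over a shared vocabulary (1 = the term occurs in the document).
vocab = ["cat", "dog", "fish", "runs", "sleeps"]
d1 = np.array([1, 1, 0, 1, 0])  # "cat dog runs"
d2 = np.array([1, 0, 0, 1, 1])  # "cat runs sleeps"

# Accuracy: fraction of positions where the vectors agree (0/0 matches count too).
accuracy = np.mean(d1 == d2)

# Jaccard similarity: shared 1s over positions where at least one vector has a 1.
jaccard = np.sum(d1 & d2) / np.sum(d1 | d2)

print(accuracy)  # 0.6  (agree on cat, fish, runs)
print(jaccard)   # 0.5  (share {cat, runs} out of {cat, dog, runs, sleeps})
```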
Hard clustering and soft clustering
Hard clustering
- Each document either belongs to a cluster or not. Ex. hierarchical clustering (Brown clustering), partitioning clustering (k-means), density-based clustering (DBSCAN)
Soft clustering
- Each document belongs to each cluster to a certain degree. Ex. LDA (topic model)
An overview of hard clustering methods
Hierarchical clustering
This method does not require the number of clusters k as an input, but needs a termination condition
- Agglomerative: Each document starts in its own cluster. Hierarchy is created by merging pairs of clusters, step by step, until all points end up in a single cluster. Which pair of clusters gets merged is decided by a linkage criterion, e.g. single-link, complete-link, or average-link. Different linkage criteria lead to different clusterings, as in the figure below and the sketch after this list.
- Divisive clustering: All documents start in one cluster. Hierarchy is created by splitting clusters recursively.
- Brown clustering
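A small sketch of agglomerative clustering with SciPy (the 2-D points are made-up stand-ins for document vectors), showing where the linkage criterion comes in:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 2-D points standing in for document vectors.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [9.0, 0.0]])

# Each criterion merges the closest pair of clusters but defines
# "closest" differently, so the resulting hierarchies can differ.
for method in ["single", "complete", "average"]:
    Z = linkage(X, method=method)                    # the full merge hierarchy
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 flat clusters
    print(method, labels)
```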
Partitioning clustering
- K-means: iteratively assigns each document to its nearest centroid and recomputes each centroid as the mean of its assigned documents, until the assignments stop changing.
- Issues with the k-means algorithm
- The k-means algorithm always converges, but there is no guarantee that it finds a global optimum. (Solution: random restarts)
- The number of clusters needs to be specified in advance, or chosen based on heuristics and cross-validation. (Example: elbow method; see the sketch after this list)
- The k-means algorithm is not good at handling outliers – every document will eventually belong to some cluster.
- K-means is restricted to clusters with convex shapes => Density-Based clustering
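A sketch with scikit-learn (the data is made up) illustrating two of these points: `n_init` implements the random restarts, and looping over k gives the inertia curve used by the elbow method:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up document vectors.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0],
              [5.1, 5.2], [9.0, 0.1], [9.1, 0.0]])

# n_init restarts k-means from several random initialisations and keeps
# the best run, mitigating convergence to a poor local optimum.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)

# Elbow method: inspect inertia (within-cluster sum of squares) for each k
# and pick the k where the curve bends.
for k in range(1, 6):
    print(k, KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)
```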
Density-Based clustering
- The basic idea behind density-based algorithms is that different regions of the vector space can be more or less densely populated.
- Under this view, clusters can take any shape; they are not constrained to convex clusters as in k-means.
- Directly density-reachable: a point is directly density-reachable from a core point p if it lies within distance eps of p; p is a core point if at least minPts points lie within eps of it.
- DBSCAN builds clusters by connecting density-reachable points and labels points that are not reachable from any core point as noise.
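A minimal DBSCAN sketch with scikit-learn (the points are made up): two dense groups become clusters, and the isolated point is labelled as noise instead of being forced into a cluster:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated outlier (made-up data).
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [5.0, 5.0], [5.1, 5.1], [5.0, 5.2],
              [9.0, 9.0]])

# eps is the neighbourhood radius; min_samples is the number of points
# (including the point itself) required within eps for a core point.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1 -1]; -1 marks noise
```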
Evaluation of hard clustering
Intrinsic and extrinsic evaluation
- In intrinsic evaluation, a clustering is evaluated based on internal measures such as coherence and separation. Are documents in the same cluster similar? Are clusters well-separated?
- In extrinsic evaluation, a clustering is evaluated based on data that was not used for the clustering, such as known class labels. Examples: cluster purity, Rand index
Rand index
Suppose a collection contains N documents; then there are N(N-1)/2 pairs of documents.
- TP: documents of the same class are assigned to the same cluster
- TN: documents of different classes are assigned to different clusters
- FP: documents of different classes are assigned to the same cluster
- FN: documents of the same class are assigned to different clusters
The Rand index measures the percentage of correct decisions: \(RI = \frac{TP + TN}{TP + FP + FN + TN}\)
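A direct sketch of this pair-counting definition (the gold classes and predicted clusters are made up):

```python
from itertools import combinations

def rand_index(classes, clusters):
    """Rand index computed over all N(N-1)/2 document pairs."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            tp += 1
        elif not same_class and not same_cluster:
            tn += 1
        elif same_cluster:   # different class, same cluster
            fp += 1
        else:                # same class, different cluster
            fn += 1
    return (tp + tn) / (tp + fp + fn + tn)

# Six documents: gold classes vs. predicted clusters (made-up labels).
print(rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))  # ≈ 0.667
```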
Topic models
- A topic model is a statistical model for representing the abstract topics that are expressed in a collection of documents.
- Topic models are examples of soft clustering techniques – each document belongs to each cluster (topic) to a certain degree.
Latent Dirichlet Allocation (LDA)
LDA rests on two principles (both illustrated in the sketch below):
- Each document is a mixture of topics (each document belongs to each cluster (topic) to a certain degree).
- Each topic is described by its own distribution over terms, and the same term can appear in several topics at once.
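A small end-to-end sketch with scikit-learn (the corpus is made up and far too small for real use) showing both principles: `transform` returns each document's topic mixture, and `components_` holds each topic's term weights:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# A tiny made-up corpus; real applications need far more documents.
docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats make friendly pets",
    "stocks fell sharply as markets closed",
    "investors sold stocks and bought bonds",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Principle 1: each document is a mixture of topics.
print(lda.transform(counts))  # per-document topic proportions

# Principle 2: each topic is a distribution over terms (a term can
# carry weight in several topics).
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:3]
    print(f"topic {k}:", [terms[i] for i in top])
```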
References:
732A92 Text Mining
732A75 Advanced Data Mining
自然語言處理 – Brown Clustering
直觀理解 LDA (Latent Dirichlet Allocation) 與文件主題模型
聚類評價指標 Rand Index,RI,Recall,Precision,F1