Skip to content

LSIApproach

Huascar Sanchez edited this page Aug 23, 2016 · 1 revision

Clustering: Kmeans algorithm

Problem: You are given N documents (e.g., fully qualified class names), M terms that occur in those documents, and the distance d(u, v) between two documents. d(u,v) is the Euclidean distance and determines how similar two things are. Each document is a d-dimensional real vector consisting of tf_idf scores, which are scores that rank the importance of terms in a document.

Our goal is to partition those N documents into k (<= N) sets so as to minimize the total sum of the squared Euclidean distance of every point to its corresponding cluster centroid (within-cluster "nearest" mean). The k is calculated using this formula: k = floor(sqrt(N))

Idea:

  1. Randomly choose k documents as seeds, one per cluster.
  2. Create initial clusters based on these seeds
  3. Iterate, repeatedly reassign documents to different clusters to improve the overall clustering. This reassignment of documents to clusters is based on the distance to the current cluster centroids.
  4. Stop when clustering converges or after a fixed number of iterations.

The final result is a list KC of k document clusters.

label recommendation

Two approaches:

  1. Same as Kruskal + Text Processing approach.
  2. Based on Latent Semantic Indexing (LSI)

Application of Latent Semantic Indexing (LSI)

Let C (e.g., "FooBar", "FooOk", "FooPen") be a given cluster produced by the Kmeans algorithm, and W the set of words considered to be the most typical words in our project corpus. The following set of steps describes how labels are recommended for the cluster C.

Problem: Use LSI to rank words for the cluster C.

  1. Set document weights (based of tf_idf) and then construct the document-term matrix A (based on KC, and W) and query matrix Q.
  2. (Using SVD) Decompose matrix A and then find matrices U (unitary matrix), S (diagonal matrix), and V (complex unitary matrix), where A = U * S * V^T
  3. Implement a rank k approximation by keeping the first k columns of U and V and the first k columns and rows of S. This will result in _U_k, S_k, and V{k}^T matrices. k is calculated using: k = floor(sqrt(RowsCount(A)))
  4. Find the new term vector coordinates in this reduced k-dimensional space. Rows of V hold eigenvector values. These are the coordinates of individual term vectors.
  5. Find the new query vector coordinates in the reduced k-dimensional space:q = Q^T * U_k * (S_k)^-1
  6. Rank terms in decreasing order of query-term cosine similarities.

The final step will result in a list of terms or labels our graph-relabeling algorithm can use.

Clone this wiki locally