-
Notifications
You must be signed in to change notification settings - Fork 7
LSIApproach
Problem: You are given N documents (e.g., fully qualified class names), M terms that occur in those documents, and the distance d(u, v) between two documents. d(u,v) is the Euclidean distance and determines how similar two things are. Each document is a d-dimensional real vector consisting of tf_idf scores, which are scores that rank the importance of terms in a document.
Our goal is to partition those N documents into k (<= N) sets so as to minimize the total sum of the squared Euclidean distance of every point to its corresponding cluster centroid (within-cluster "nearest" mean). The k is calculated using this formula: k = floor(sqrt(N))
Idea:
- Randomly choose k documents as seeds, one per cluster.
- Create initial clusters based on these seeds
- Iterate, repeatedly reassign documents to different clusters to improve the overall clustering. This reassignment of documents to clusters is based on the distance to the current cluster centroids.
- Stop when clustering converges or after a fixed number of iterations.
The final result is a list KC of k document clusters.
Two approaches:
- Same as Kruskal + Text Processing approach.
- Based on Latent Semantic Indexing (LSI)
Let C (e.g., "FooBar", "FooOk", "FooPen") be a given cluster produced by the Kmeans algorithm, and W the set of words considered to be the most typical words in our project corpus. The following set of steps describes how labels are recommended for the cluster C.
Problem: Use LSI to rank words for the cluster C.
- Set document weights (based of tf_idf) and then construct the document-term matrix A (based on KC, and W) and query matrix Q.
- (Using SVD) Decompose matrix A and then find matrices U (unitary matrix), S (diagonal matrix), and V (complex unitary matrix), where A = U * S * V^T
- Implement a rank k approximation by keeping the first k columns of U and V and the first k columns and rows of S. This will result in _U_k, S_k, and V{k}^T matrices. k is calculated using: k = floor(sqrt(RowsCount(A)))
- Find the new term vector coordinates in this reduced k-dimensional space. Rows of V hold eigenvector values. These are the coordinates of individual term vectors.
- Find the new query vector coordinates in the reduced k-dimensional space:q = Q^T * U_k * (S_k)^-1
- Rank terms in decreasing order of query-term cosine similarities.
The final step will result in a list of terms or labels our graph-relabeling algorithm can use.