
LDA Tutorial: Exploring documents and words

junotk edited this page Apr 17, 2013 · 6 revisions

The previous tutorial (link here) illustrated various methods for examining topics in an estimated LDA model. In this document we focus on analyzing documents and words with LDA. As before, we use the 100-topic LDA model trained on the Stanford Encyclopedia of Philosophy (SEP), and assume that the viewer object 'v' has been created from an appropriate corpus and model.

Exploring documents

LDA provides powerful methods to search, sort and relate documents in the corpus.

As a first step, we illustrate how to find documents related to a particular topic or set of topics. Suppose we are interested in articles across the whole SEP that deal with classical physics. To find them we use sim_top_doc. As we have seen before (LINK HERE), classical physics is featured in topic 2 of our model, so we look for documents related to topic 2:

$ v.sim_top_doc([2])
Topics: 2
Document Prob
newton-principia.txt 0.92416
newton-stm.txt 0.91340
descartes-physics.txt 0.89946
leibniz-physics.txt 0.87209
atomism-modern.txt 0.85466
newton-philosophy.txt 0.84752
galileo.txt 0.82841
spacetime-theories.txt 0.81533
copernicus.txt 0.76159
gassendi.txt 0.70505
The table lists the top 10 relevant documents, with the corresponding probabilities as a measure of relevance.

One can also use a set of topics as a query. For example, suppose we are interested not only in classical physics but in physics in general. For this purpose we may first run sim_top_top([2]) and then use the result (i.e. a set of topics similar to topic 2) as a query for sim_top_doc. Here, we use the 5 topics most related to topic 2 (topic 2 itself included) as the query.

$ query, prob = zip(*v.sim_top_top([2])[:5])

$ query
(2, 89, 79, 93, 21)

$ v.sim_top_doc(query)
Topics: 2, 89, 79, 93, 21
Document Prob
spacetime-iframes.txt 0.58719
time-machine.txt 0.54281
spacetime-theories.txt 0.52951
arabic-islamic-natural.txt 0.50608
spacetime-bebecome.txt 0.49068
equivME.txt 0.48692
time-thermo.txt 0.48065
causation-backwards.txt 0.47840
spacetime-convensimul.txt 0.47770
spacetime-singularities.txt 0.46816

Note that this time the list contains documents on more general issues.
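
The query above is built with the zip(*...) idiom, which splits a list of pairs into two parallel tuples. The following is a minimal sketch of that step on its own. It assumes that sim_top_top returns a sequence of (topic index, score) pairs, which is what the unpacking in the session suggests. The list 'pairs' is a hand-written stand-in: the topic indices are taken from the session above, while the scores are made up for illustration.

```python
# Stand-in for the first entries of v.sim_top_top([2]). The topic indices
# come from the session above; the scores are hypothetical.
pairs = [(2, 1.00), (89, 0.41), (79, 0.37), (93, 0.33), (21, 0.30), (5, 0.28)]

# Slice off the top 5 pairs, then transpose them into two parallel tuples.
query, score = zip(*pairs[:5])

print(query)  # (2, 89, 79, 93, 21)
print(score)
```

Only the tuple of topic indices is passed on to sim_top_doc; the scores are discarded.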

In LDA, each document assigns probabilities over topics, and thus can be located in a K-dimensional real space, where K is the total number of topics in the model. Similarity among documents can thus be defined in this space. Currently our similarity function uses cosine values, so that two documents with a high cosine value are judged to be similar.
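
As a rough illustration of the cosine measure, the sketch below compares three hand-made topic distributions with K = 4. The vectors and the helper name 'cosine' are assumptions for this example only; this is not the viewer's own implementation.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two topic-probability vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical topic distributions for three documents (K = 4).
doc_a = [0.70, 0.20, 0.05, 0.05]
doc_b = [0.65, 0.25, 0.05, 0.05]  # topic mix close to doc_a
doc_c = [0.05, 0.05, 0.20, 0.70]  # dominated by a different topic

print(cosine(doc_a, doc_a))  # a document compared with itself
print(cosine(doc_a, doc_b))
print(cosine(doc_a, doc_c))
```

A document compared with itself gives a cosine of 1, which is why the query document heads its own result list in the session that follows.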

To look for documents similar to 'descartes.txt', use

$ v.sim_doc_doc('descartes.txt')
Documents:
Document Cosine
descartes.txt 1.00000
desgabets.txt 0.93482
henricus-regius.txt 0.90705
legrand.txt 0.90498
margaret-cavendish.txt 0.84320
leibniz.txt 0.83812
malebranche.txt 0.83744
cordemoy.txt 0.82133
john-norris.txt 0.79888
spinoza-physics.txt 0.79842

As with topics, one can obtain pairwise similarities for a set of documents in the form of a similarity matrix. As an example, the similarity matrix for the first five documents listed above can be obtained by:

$ docs, prob = zip(*v.sim_doc_doc('descartes.txt')[:5])

$ docs
('descartes.txt',
 'desgabets.txt',
 'henricus-regius.txt',
 'legrand.txt',
 'margaret-cavendish.txt')

$ v.simmat_docs(docs)
IndexedSymmArray([[ 1.        ,  0.93481742,  0.90705438,  0.90498041,  0.84319661],
                  [ 0.93481742,  1.        ,  0.8757895 ,  0.91618604,  0.8773178 ],
                  [ 0.90705438,  0.8757895 ,  1.        ,  0.92957827,  0.74286053],
                  [ 0.90498041,  0.91618604,  0.92957827,  1.        ,  0.69783225],
                  [ 0.84319661,  0.8773178 ,  0.74286053,  0.69783225,  1.        ]])
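
A matrix of this kind can be sketched with plain NumPy: normalize each document's topic vector to unit length, then take all pairwise dot products. The function name 'cosine_matrix' and the array 'theta' are hypothetical and chosen for this example; how simmat_docs computes its result internally is not shown here.

```python
import numpy as np

def cosine_matrix(doc_topic):
    """Pairwise cosine similarities between the rows of doc_topic."""
    x = np.asarray(doc_topic, dtype=float)
    # Scale every row to unit length so that dot products become cosines.
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T

# Hypothetical topic distributions: one row per document, K = 3.
theta = [[0.8, 0.1, 0.1],
         [0.6, 0.3, 0.1],
         [0.1, 0.1, 0.8]]

sim = cosine_matrix(theta)
print(np.round(sim, 3))
```

The result is symmetric with ones on the diagonal, the same structure as the matrix printed above.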

Exploring words

A word query is the most common way to search documents. In LDA, each occurrence of a word is assigned a topic, which gives an idea of the context in which the word is used. Our word search function thus reports not only the documents that contain the query word, but also its positions in those documents and the topic assigned to each occurrence. Take for example the term 'anthropomorphism':

$ v.word_topics('anthropomorphism')
Word: anthropomorphism
Document Pos Topic
abraham-daud.txt 2161 19
arnauld.txt 4076 19
causation-mani.txt 5906 91
cognition-animal.txt 1373 76
cognition-animal.txt 2006 76
cognition-animal.txt 2014 76
cognition-animal.txt 2016 76
cognition-animal.txt 2035 76
cognition-animal.txt 2060 76
cognition-animal.txt 2086 76
cognition-animal.txt 2121 76
cognition-animal.txt 2275 76
cognition-animal.txt 2354 76
cognition-animal.txt 2441 76
cognition-animal.txt 3061 76
cognition-animal.txt 3770 76
comte.txt 2506 19
consciousness-animal.txt 1803 76
consciousness-animal.txt 1816 76
ethics-environmental.txt 3435 19
feminist-religion.txt 3171 19
hume-religion.txt 4133 19
hume-religion.txt 7637 19
kant-religion.txt 4596 19
kukai.txt 1684 31
ludwig-feuerbach.txt 3874 19
maimonides.txt 4194 19
nothingness.txt 4360 76
philolaus.txt 5881 31
reduction-biology.txt 149 76
relativism.txt 11889 76
xenophanes.txt 21 19

Hence this word instantiates four topics: 19, 31, 76 and 91. Three of them are shown here:

$ v.topics(k_indices=[19, 76, 91])
Topics Sorted by Index
Topic Words
19 god, divine, world, human, religion, theological, power, christian, creation, nature
76 behavior, psychology, cognitive, mental, human, mind, psychological, attention, imagery, animals
91 one, two, system, set, case, first, given, way, also, example
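
To see at a glance how a word's occurrences are spread over topics, the word_topics output can be tallied. The sketch below does this for a handful of rows copied by hand from the output above; the list 'rows' is an assumption made for illustration, and the return type of word_topics may well differ.

```python
from collections import Counter

# A few (document, position, topic) rows copied from the output above.
rows = [
    ("abraham-daud.txt", 2161, 19),
    ("causation-mani.txt", 5906, 91),
    ("cognition-animal.txt", 1373, 76),
    ("cognition-animal.txt", 2006, 76),
    ("hume-religion.txt", 4133, 19),
    ("hume-religion.txt", 7637, 19),
    ("kukai.txt", 1684, 31),
    ("xenophanes.txt", 21, 19),
]

# Number of occurrences assigned to each topic.
topic_counts = Counter(topic for _, _, topic in rows)

# Documents in which each topic is assigned to the word.
docs_by_topic = {}
for doc, _, topic in rows:
    docs_by_topic.setdefault(topic, set()).add(doc)

for topic, n in topic_counts.most_common():
    print(topic, n, sorted(docs_by_topic[topic]))
```

In this excerpt the religious sense (topic 19) is spread over several articles, while the psychological sense (topic 76) comes from a single one.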
