LDA Tutorial: Exploring documents and words

The previous tutorial (link here) illustrated various methods to examine topics in an estimated LDA model. In this document we focus on the analysis of documents and words using LDA. As before, we use the 100-topic LDA model trained on the Stanford Encyclopedia of Philosophy (SEP), and assume that the viewer object 'v' has been created from the corresponding corpus and model.

Exploring documents

LDA provides powerful methods to search, sort and relate documents in the corpus.

As a first step, we illustrate how to find documents that are related to a particular topic or set of topics. Suppose we are interested in articles across the whole SEP that deal with classical physics. To find them we use sim_top_doc. As we have seen before (LINK HERE), classical physics is featured in topic 2 of our model. So we look for documents related to topic 2:

$ v.sim_top_doc([2])

Topics: 2
Document	Prob
newton-principia.txt	0.92416
newton-stm.txt	0.91340
descartes-physics.txt	0.89946
leibniz-physics.txt	0.87209
atomism-modern.txt	0.85466
newton-philosophy.txt	0.84752
galileo.txt	0.82841
spacetime-theories.txt	0.81533
copernicus.txt	0.76159
gassendi.txt	0.70505

The table lists the top 10 relevant documents, with the corresponding probabilities as a measure of relevance.
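The ranking above can be pictured as sorting documents by the probability each one assigns to the query topic. The following is a minimal sketch under that assumption. It does not use the viewer object; the toy table `doc_topic` and the helper `rank_docs_by_topic` are hypothetical names introduced for illustration only, and the values are made up.

```python
# Toy document-topic probabilities (made-up values, three topics per document).
doc_topic = {
    "doc-a.txt": [0.70, 0.20, 0.10],
    "doc-b.txt": [0.10, 0.80, 0.10],
    "doc-c.txt": [0.45, 0.05, 0.50],
}


def rank_docs_by_topic(doc_topic, topic, n=10):
    """Return up to n (document, probability) pairs, highest probability first."""
    pairs = [(doc, probs[topic]) for doc, probs in doc_topic.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


# Rank the toy documents by the probability they give to topic 0.
ranking = rank_docs_by_topic(doc_topic, topic=0)
for doc, prob in ranking:
    print(f"{doc}\t{prob:.5f}")
```

The output has the same two-column shape as the table printed by sim_top_doc, with the document that gives the query topic the most weight listed first.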

One can also use a set of topics as a query. For example, suppose we are interested not only in classical physics but in physics in general. For this purpose we may first run sim_top_top([2]) and then use the result (i.e. a set of topics similar to topic 2) as a query for sim_top_doc. Here, we take the first 5 topics returned for topic 2 (topic 2 itself and the four topics most similar to it) as the query.

$ query, prob = zip(*v.sim_top_top([2])[:5])

$ query
(2, 89, 79, 93, 21)

$ v.sim_top_doc(query)

Topics: 2, 89, 79, 93, 21
Document	Prob
spacetime-iframes.txt	0.58719
time-machine.txt	0.54281
spacetime-theories.txt	0.52951
arabic-islamic-natural.txt	0.50608
spacetime-bebecome.txt	0.49068
equivME.txt	0.48692
time-thermo.txt	0.48065
causation-backwards.txt	0.47840
spacetime-convensimul.txt	0.47770
spacetime-singularities.txt	0.46816

Note that this time the list contains documents on more general issues.
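Two steps in the session above can be illustrated on their own: the zip(*...) idiom, which splits a list of (topic, value) pairs into a tuple of topics and a tuple of values, and the scoring of a document against several topics at once. The sketch below assumes that the score for a set of topics is the sum of the document's probabilities for those topics. This is an assumption made for illustration, not a confirmed description of how sim_top_doc combines topics. The helper `score` is hypothetical and all numbers are made up.

```python
# A list of (topic, similarity) pairs, in the shape that the session above
# slices with [:5]. The similarity values are made up for illustration.
pairs = [(2, 1.0), (89, 0.6), (79, 0.5), (93, 0.4), (21, 0.3), (7, 0.1)]

# zip(*...) transposes the first five pairs into two tuples.
query, values = zip(*pairs[:5])


def score(doc_probs, query):
    """Hypothetical rule (an assumption): sum the probabilities of the query topics."""
    return sum(doc_probs.get(k, 0.0) for k in query)


# Sparse toy topic distributions (topic index -> probability).
docs = {
    "narrow.txt": {2: 0.25, 5: 0.75},
    "broad.txt": {2: 0.125, 89: 0.25, 79: 0.25, 5: 0.375},
}
scores = {name: score(probs, query) for name, probs in docs.items()}

print(query)
print(scores)
```

Under this rule, a document that spreads its weight over several of the query topics can outrank one that concentrates on a single query topic, which is consistent with the more general documents appearing in the second list.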

In LDA, each document is associated with a probability distribution over topics, and can thus be located in K-dimensional real space, where K is the total number of topics in the model. Similarity among documents can therefore be defined in this space. Currently our similarity function uses the cosine of the angle between two such vectors, so that two documents with a high cosine value are judged to be similar.
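As a minimal sketch of the cosine measure described above, the following computes the cosine between toy topic distributions in plain Python. The function name `cosine` and the vectors are hypothetical and made up for illustration; they are not taken from the model or from the library.

```python
import math


def cosine(p, q):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


# Toy topic distributions over K = 4 topics (made-up values).
doc_x = [0.7, 0.1, 0.1, 0.1]
doc_y = [0.6, 0.2, 0.1, 0.1]
doc_z = [0.0, 0.1, 0.1, 0.8]

# doc_x and doc_y concentrate on the same topic; doc_z does not.
print(round(cosine(doc_x, doc_y), 3))
print(round(cosine(doc_x, doc_z), 3))
```

Because topic probabilities are non-negative, the cosine between two documents lies between 0 and 1, and a document compared with itself gives 1.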

To look for documents similar to 'descartes.txt', use:

$ v.sim_doc_doc('descartes.txt')

Documents:
Document	Cosine
descartes.txt	1.00000
desgabets.txt	0.93482
henricus-regius.txt	0.90705
legrand.txt	0.90498
margaret-cavendish.txt	0.84320
leibniz.txt	0.83812
malebranche.txt	0.83744
cordemoy.txt	0.82133
john-norris.txt	0.79888
spinoza-physics.txt	0.79842

As with topics, one can obtain pairwise similarities for a set of documents in the form of a similarity matrix. As an example, the similarity matrix for the first five documents in the list above can be obtained by:

$ docs, prob = zip(*v.sim_doc_doc('descartes.txt')[:5])

$ docs
('descartes.txt',
 'desgabets.txt',
 'henricus-regius.txt',
 'legrand.txt',
 'margaret-cavendish.txt')

$ v.simmat_docs(docs)
IndexedSymmArray([[ 1.        ,  0.93481742,  0.90705438,  0.90498041,  0.84319661],
                  [ 0.93481742,  1.        ,  0.8757895 ,  0.91618604,  0.8773178 ],
                  [ 0.90705438,  0.8757895 ,  1.        ,  0.92957827,  0.74286053],
                  [ 0.90498041,  0.91618604,  0.92957827,  1.        ,  0.69783225],
                  [ 0.84319661,  0.8773178 ,  0.74286053,  0.69783225,  1.        ]])
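The structure of such a matrix can be sketched in plain Python: entry (i, j) is the cosine between documents i and j, so the matrix is symmetric with ones on the diagonal. The helper names `cosine` and `similarity_matrix` and the toy vectors below are hypothetical; this is an illustration under those assumptions, not the implementation behind simmat_docs.

```python
import math


def cosine(p, q):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def similarity_matrix(vectors):
    """Build a symmetric matrix of pairwise cosines, computing each pair once."""
    n = len(vectors)
    # The diagonal is 1.0 because a non-zero vector has cosine 1 with itself.
    mat = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            mat[i][j] = mat[j][i] = cosine(vectors[i], vectors[j])
    return mat


# Toy topic distributions for three documents (made-up values).
vectors = [
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.0, 0.1, 0.1, 0.8],
]
mat = similarity_matrix(vectors)
for row in mat:
    print([round(x, 3) for x in row])
```

Only the upper triangle is computed and then mirrored, which is one reason a symmetric array type is a natural container for this kind of result.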

Exploring words

A word query is the most common way to search documents. In LDA, each occurrence of a word is assigned a topic, which gives an idea of the context in which the word is used. Our word search function thus reports not only the documents that contain the query word, but also the positions of the word in those documents and the topic assigned to each occurrence. Take for example the term 'anthropomorphism':

$ v.word_topics('anthropomorphism')

Word: anthropomorphism
Document	Pos	Topic
abraham-daud.txt	2161	19
arnauld.txt	4076	19
causation-mani.txt	5906	91
cognition-animal.txt	1373	76
cognition-animal.txt	2006	76
cognition-animal.txt	2014	76
cognition-animal.txt	2016	76
cognition-animal.txt	2035	76
cognition-animal.txt	2060	76
cognition-animal.txt	2086	76
cognition-animal.txt	2121	76
cognition-animal.txt	2275	76
cognition-animal.txt	2354	76
cognition-animal.txt	2441	76
cognition-animal.txt	3061	76
cognition-animal.txt	3770	76
comte.txt	2506	19
consciousness-animal.txt	1803	76
consciousness-animal.txt	1816	76
ethics-environmental.txt	3435	19
feminist-religion.txt	3171	19
hume-religion.txt	4133	19
hume-religion.txt	7637	19
kant-religion.txt	4596	19
kukai.txt	1684	31
ludwig-feuerbach.txt	3874	19
maimonides.txt	4194	19
nothingness.txt	4360	76
philolaus.txt	5881	31
reduction-biology.txt	149	76
relativism.txt	11889	76
xenophanes.txt	21	19

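Hence most occurrences of this word are assigned to topics 19 and 76, with a single occurrence assigned to topic 91 (two occurrences are also assigned to topic 31). Topics 19, 76 and 91 are: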

$ v.topics(k_indices=[19, 76, 91])

Topics Sorted by Index
Topic	Words
19	god, divine, world, human, religion, theological, power, christian, creation, nature
76	behavior, psychology, cognitive, mental, human, mind, psychological, attention, imagery, animals
91	one, two, system, set, case, first, given, way, also, example
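A table of (document, position, topic) rows like the one above can be summarized by counting how often the word is assigned to each topic. The sketch below does this with the standard library on a few rows copied from the 'anthropomorphism' output. The variable names are hypothetical, and the rows are a subset chosen for illustration rather than the return value of word_topics.

```python
from collections import Counter

# A few (document, position, topic) rows copied from the output above.
rows = [
    ("abraham-daud.txt", 2161, 19),
    ("causation-mani.txt", 5906, 91),
    ("cognition-animal.txt", 1373, 76),
    ("cognition-animal.txt", 2006, 76),
    ("hume-religion.txt", 4133, 19),
    ("hume-religion.txt", 7637, 19),
    ("kukai.txt", 1684, 31),
]

# How often the word was assigned to each topic in these rows.
topic_counts = Counter(topic for _, _, topic in rows)

# Which documents contribute occurrences to each topic.
docs_by_topic = {}
for doc, _, topic in rows:
    docs_by_topic.setdefault(topic, set()).add(doc)

# Print topics from most to least frequent, with their documents.
for topic, count in topic_counts.most_common():
    print(topic, count, sorted(docs_by_topic[topic]))
```

Grouping the rows this way makes it easy to see which senses of a word dominate and which documents use it in each sense.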
