LDA Tutorial: Exploring documents and words
The previous tutorial (link here) illustrated various methods to examine topics in an estimated LDA model. In this document we focus on the analysis of documents using LDA. As before, we use the 100-topic LDA model trained on the Stanford Encyclopedia of Philosophy (SEP), and assume that the viewer object 'v' has been created from an appropriate corpus and model.
LDA provides powerful methods to search, sort and relate documents in the corpus.
As a first step, we illustrate how to find documents that are related to a particular topic or topics.
Suppose we are interested in all the SEP articles that deal with classical physics.
To find them we use sim_top_doc.
As we have seen before (LINK HERE), classical physics is featured in topic 2 of our model.
So we look for documents related to topic 2:
$ v.sim_top_doc([2])
| Topics: 2 | |
|---|---|
| Document | Prob |
| newton-principia.txt | 0.92416 |
| newton-stm.txt | 0.91340 |
| descartes-physics.txt | 0.89946 |
| leibniz-physics.txt | 0.87209 |
| atomism-modern.txt | 0.85466 |
| newton-philosophy.txt | 0.84752 |
| galileo.txt | 0.82841 |
| spacetime-theories.txt | 0.81533 |
| copernicus.txt | 0.76159 |
| gassendi.txt | 0.70505 |
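Conceptually, a single-topic query of this kind amounts to sorting documents by the probability they assign to that topic. The following is a minimal sketch of that idea in plain Python, not the viewer's own implementation: the helper `rank_docs_by_topic`, the dictionary `doc_topic`, and the document names and probabilities are all hypothetical.

```python
def rank_docs_by_topic(doc_topic, topic, n=3):
    """Return the n (document, probability) pairs with the highest probability for `topic`."""
    pairs = [(doc, probs[topic]) for doc, probs in doc_topic.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical topic distributions for four made-up documents (each list sums to 1).
doc_topic = {
    "a.txt": [0.10, 0.20, 0.70],
    "b.txt": [0.05, 0.05, 0.90],
    "c.txt": [0.60, 0.30, 0.10],
    "d.txt": [0.30, 0.30, 0.40],
}

# Rank the documents by the probability they assign to topic 2.
ranking = rank_docs_by_topic(doc_topic, topic=2)
for doc, prob in ranking:
    print(f"{doc}\t{prob:.5f}")
```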
One can also use a set of topics as a query.
For example, suppose we are interested not only in classical physics but in physics in general.
For this purpose we may first run sim_top_top([2]) and then use the result (i.e. a set of topics similar to topic 2) as a query for sim_top_doc.
Here, we use the 5 topics most similar to topic 2 (topic 2 itself included) as a query.
$ query, prob = zip(*v.sim_top_top([2])[:5])
$ query
(2, 89, 79, 93, 21)
$ v.sim_top_doc(query)
| Topics: 2, 89, 79, 93, 21 | |
|---|---|
| Document | Prob |
| spacetime-iframes.txt | 0.58719 |
| time-machine.txt | 0.54281 |
| spacetime-theories.txt | 0.52951 |
| arabic-islamic-natural.txt | 0.50608 |
| spacetime-bebecome.txt | 0.49068 |
| equivME.txt | 0.48692 |
| time-thermo.txt | 0.48065 |
| causation-backwards.txt | 0.47840 |
| spacetime-convensimul.txt | 0.47770 |
| spacetime-singularities.txt | 0.46816 |
Note that this time the list contains documents on more general issues.
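The query above is built with `zip(*...)`, a standard Python idiom that splits a sequence of pairs into two parallel tuples. The sketch below shows the idiom on a plain list, assuming only that the result of sim_top_top can be sliced and iterated as (topic, value) pairs. The topic indices are those shown above; the scores and the sixth pair are made up for illustration.

```python
# Hypothetical (topic, score) pairs; the scores and the last pair are made up.
pairs = [(2, 1.00), (89, 0.61), (79, 0.55), (93, 0.52), (21, 0.50), (40, 0.47)]

# Slice first, then transpose: zip(*...) turns a list of pairs into two tuples.
query, score = zip(*pairs[:5])

print(query)       # the topic indices, usable as a query
print(len(score))  # one score per selected topic
```

Slicing before unpacking keeps only the leading pairs, so the length of the query is controlled by the slice bound alone.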
In LDA, each document is assigned a probability distribution over topics, and can thus be located in K-dimensional real space, where K is the total number of topics in the model.
Similarity among documents can thus be defined in this space.
Currently our similarity function uses cosine values, so that two documents with a high cosine value are judged to be similar.
To look for documents similar to 'descartes.txt', use
$ v.sim_doc_doc('descartes.txt')
| Documents: | |
|---|---|
| Document | Cosine |
| descartes.txt | 1.00000 |
| desgabets.txt | 0.93482 |
| henricus-regius.txt | 0.90705 |
| legrand.txt | 0.90498 |
| margaret-cavendish.txt | 0.84320 |
| leibniz.txt | 0.83812 |
| malebranche.txt | 0.83744 |
| cordemoy.txt | 0.82133 |
| john-norris.txt | 0.79888 |
| spinoza-physics.txt | 0.79842 |
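The cosine value between two documents is the dot product of their topic distributions divided by the product of their lengths. Here is a minimal sketch of that computation in plain Python. The function name `cosine` and the three-topic vectors are hypothetical and stand in for the K-dimensional vectors of the model.

```python
import math

def cosine(p, q):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

# Hypothetical topic distributions over K = 3 topics.
doc_a = [0.70, 0.20, 0.10]
doc_b = [0.60, 0.30, 0.10]  # topic mix close to doc_a
doc_c = [0.05, 0.15, 0.80]  # dominated by a different topic

sim_aa = cosine(doc_a, doc_a)
sim_ab = cosine(doc_a, doc_b)
sim_ac = cosine(doc_a, doc_c)
print(sim_aa, sim_ab, sim_ac)
```

Because topic probabilities are non-negative, the cosine lies between 0 and 1, and a document always has cosine 1 with itself, which is why 'descartes.txt' heads its own list.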
As with topics, one can obtain pairwise similarities for a set of documents in the form of a similarity matrix. As an example, the similarity matrix for the top five documents above can be obtained by:
$ docs, prob = zip(*v.sim_doc_doc('descartes.txt')[:5])
$ docs
('descartes.txt',
'desgabets.txt',
'henricus-regius.txt',
'legrand.txt',
'margaret-cavendish.txt')
$ v.simmat_docs(docs)
IndexedSymmArray([[ 1. , 0.93481742, 0.90705438, 0.90498041, 0.84319661],
[ 0.93481742, 1. , 0.8757895 , 0.91618604, 0.8773178 ],
[ 0.90705438, 0.8757895 , 1. , 0.92957827, 0.74286053],
[ 0.90498041, 0.91618604, 0.92957827, 1. , 0.69783225],
[ 0.84319661, 0.8773178 , 0.74286053, 0.69783225, 1. ]])
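The matrix is symmetric with ones on the diagonal, since cosine similarity does not depend on the order of its arguments and every document is maximally similar to itself. A minimal sketch of how such a matrix can be built in plain Python follows. The helpers `unit` and `similarity_matrix` and the input vectors are hypothetical, and simmat_docs may well compute its result differently.

```python
import math

def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_matrix(vectors):
    """Pairwise cosine similarities, computed as dot products of unit vectors."""
    units = [unit(v) for v in vectors]
    return [[sum(a * b for a, b in zip(u, w)) for w in units] for u in units]

# Hypothetical topic distributions for three documents over K = 3 topics.
vectors = [
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.05, 0.15, 0.80],
]

simmat = similarity_matrix(vectors)
for row in simmat:
    print(["%.5f" % value for value in row])
```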
A word query is the most common way to search documents. In LDA, each occurrence of a word is assigned a topic, which gives an idea of the context in which the word is used. Our word search function thus reports not only the documents that contain the query word, but also its positions in those documents and the topic assigned to each occurrence. Take for example the term 'anthropomorphism':
$ v.word_topics('anthropomorphism')
| Word: anthropomorphism | ||
|---|---|---|
| Document | Pos | Topic |
| abraham-daud.txt | 2161 | 19 |
| arnauld.txt | 4076 | 19 |
| causation-mani.txt | 5906 | 91 |
| cognition-animal.txt | 1373 | 76 |
| cognition-animal.txt | 2006 | 76 |
| cognition-animal.txt | 2014 | 76 |
| cognition-animal.txt | 2016 | 76 |
| cognition-animal.txt | 2035 | 76 |
| cognition-animal.txt | 2060 | 76 |
| cognition-animal.txt | 2086 | 76 |
| cognition-animal.txt | 2121 | 76 |
| cognition-animal.txt | 2275 | 76 |
| cognition-animal.txt | 2354 | 76 |
| cognition-animal.txt | 2441 | 76 |
| cognition-animal.txt | 3061 | 76 |
| cognition-animal.txt | 3770 | 76 |
| comte.txt | 2506 | 19 |
| consciousness-animal.txt | 1803 | 76 |
| consciousness-animal.txt | 1816 | 76 |
| ethics-environmental.txt | 3435 | 19 |
| feminist-religion.txt | 3171 | 19 |
| hume-religion.txt | 4133 | 19 |
| hume-religion.txt | 7637 | 19 |
| kant-religion.txt | 4596 | 19 |
| kukai.txt | 1684 | 31 |
| ludwig-feuerbach.txt | 3874 | 19 |
| maimonides.txt | 4194 | 19 |
| nothingness.txt | 4360 | 76 |
| philolaus.txt | 5881 | 31 |
| reduction-biology.txt | 149 | 76 |
| relativism.txt | 11889 | 76 |
| xenophanes.txt | 21 | 19 |
Hence this word instantiates four topics: 19, 31, 76 and 91. Topics 19, 76 and 91 are:
$ v.topics(k_indices=[19, 76, 91])
| Topics Sorted by Index | |
|---|---|
| Topic | Words |
| 19 | god, divine, world, human, religion, theological, power, christian, creation, nature |
| 76 | behavior, psychology, cognitive, mental, human, mind, psychological, attention, imagery, animals |
| 91 | one, two, system, set, case, first, given, way, also, example |
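To see at a glance which topics a word is assigned to, the per-occurrence rows of a word_topics listing can be tallied by topic. The sketch below does this with the standard library, assuming the rows are available as plain (document, position, topic) tuples. The `rows` list is a hand-copied subset of the 'anthropomorphism' listing, and the variable names are hypothetical.

```python
from collections import Counter

# (document, position, topic) rows copied from a subset of the
# word_topics listing for 'anthropomorphism'.
rows = [
    ("abraham-daud.txt", 2161, 19),
    ("causation-mani.txt", 5906, 91),
    ("cognition-animal.txt", 1373, 76),
    ("cognition-animal.txt", 2006, 76),
    ("hume-religion.txt", 4133, 19),
    ("hume-religion.txt", 7637, 19),
    ("kukai.txt", 1684, 31),
    ("xenophanes.txt", 21, 19),
]

# Number of occurrences of the word assigned to each topic.
topic_counts = Counter(topic for _doc, _pos, topic in rows)

# Documents in which the word occurs under each topic.
docs_by_topic = {}
for doc, _pos, topic in rows:
    docs_by_topic.setdefault(topic, set()).add(doc)

for topic, count in sorted(topic_counts.items()):
    print(topic, count, sorted(docs_by_topic[topic]))
```

Grouping by topic in this way separates the religious uses of the word (topic 19) from the psychological ones (topic 76).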