Categorise words using fastText word embeddings.
- Download crawl-300d-2M.vec from the fastText website and place it in the root directory.
- Create a categories.txt file in the root directory and list the categories you want, space-separated. For example:

      Red Blue Green Yellow Purple Pink Brown Black White Magenta

  Capitalisation is ignored.
- Create a words.txt file in the root directory and list the words you want to categorise, space-separated. For example:

      Apple Banana Mango Pineapple Strawberry Blueberry Raspberry Blackberry Watermelon Cantaloupe Honeydew Grapes Kiwi Papaya Pear Peach Plum Cherry Pomegranate Fig Date Coconut Guava Lychee Passionfruit Dragonfruit Jackfruit Persimmon Mulberry Avocado

  Capitalisation is ignored.
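As a rough illustration of the input format described above, the following sketch reads a space-separated file and ignores capitalisation. The function names `parse_tokens` and `read_tokens` are hypothetical and are not taken from parse_files.py; dropping repeated entries is also an assumption rather than documented behaviour.

```python
from pathlib import Path


def parse_tokens(text):
    """Split whitespace-separated text into lower-cased tokens.

    Repeated entries (after lower-casing) are dropped, keeping first-seen order.
    This de-duplication is an assumption made for the sketch.
    """
    seen = set()
    tokens = []
    for token in text.split():
        key = token.lower()
        if key not in seen:
            seen.add(key)
            tokens.append(key)
    return tokens


def read_tokens(path):
    """Read a file such as categories.txt or words.txt and parse its tokens."""
    return parse_tokens(Path(path).read_text(encoding="utf-8"))
```

For example, `parse_tokens("Red Blue green RED")` yields three entries, because "Red" and "RED" are treated as the same category.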
- After installing requirements, run:

      python parse_files.py

  This will extract your words and categories from crawl-300d-2M.vec, and assign words to each category. In our example, we are trying to categorise fruit into a collection of colours. The internal representation looks like:

      # Category: ([most-linked items], [second-most-linked items], [third-most-linked items]).
      {
          'Brown': (['Apple', 'Kiwi', 'Mulberry'], [], []),
          'magenta': (['FIG', 'BlackBerry', 'raspberry', 'persimmon'],
                      ['Apple', 'plum', 'papaya', 'guava', 'passionfruit'],
                      ['mango', 'lychee', 'dragonfruit']),
          'green': (['coconut', 'avocado'],
                    ['FIG', 'Kiwi', 'jackfruit'],
                    ['Apple', 'banana']),
          'yellow': (['banana', 'papaya', 'honeydew', 'jackfruit'],
                     ['peach', 'mango', 'watermelon', 'pear', 'cantaloupe'],
                     ['FIG', 'coconut', 'grapes', 'pineapple', 'avocado', 'strawberry', 'guava', 'persimmon']),
          'purple': (['plum', 'blueberry', 'pomegranate'],
                     ['banana', 'grapes', 'pineapple', 'avocado', 'strawberry', 'lychee', 'persimmon', 'dragonfruit'],
                     ['cherry', 'BlackBerry', 'peach', 'watermelon', 'pear', 'raspberry', 'Mulberry', 'cantaloupe', 'passionfruit', 'honeydew', 'jackfruit']),
          'pink': (['cherry', 'pineapple', 'strawberry', 'peach', 'mango', 'watermelon', 'pear', 'cantaloupe', 'guava', 'lychee', 'passionfruit', 'dragonfruit'],
                   ['coconut', 'BlackBerry', 'raspberry', 'blueberry', 'Mulberry', 'honeydew'],
                   ['plum', 'pomegranate', 'papaya']),
          'red': (['grapes'], ['cherry', 'pomegranate'], []),
          'white': ([], [], ['Kiwi']),
          'blue': ([], [], ['blueberry'])
      }

  Note the capitalisation in our example for "Apple" and "BlackBerry". This is because DictionMap uses the most common capitalisation (and associated embedding) for each word.
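The sketch below shows one way a representation of this shape could be produced; it is not the actual parse_files.py or DictionMap code. It assumes that the .vec file is plain text with a header line followed by one word and its vector per line, that more frequent spellings appear earlier in the file (so the first match stands in for the most common capitalisation), and that "most-linked" means highest cosine similarity. All function names are hypothetical.

```python
import math


def load_vectors(lines, wanted):
    """Scan .vec-style lines, keeping the first spelling found for each wanted word.

    `wanted` is a set of lower-cased words.
    Returns {lower-cased word: (spelling as stored in the file, vector)}.
    """
    found = {}
    iterator = iter(lines)
    next(iterator, None)  # skip the header line "<word count> <dimension>"
    for line in iterator:
        parts = line.rstrip().split(" ")
        key = parts[0].lower()
        if key in wanted and key not in found:
            found[key] = (parts[0], [float(x) for x in parts[1:]])
            if len(found) == len(wanted):
                break  # every wanted word has been located
    return found


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_categories(words, categories, vectors, top_n=3):
    """For each word, list its top_n categories by descending cosine similarity."""
    ranking = {}
    for word in words:
        if word not in vectors:
            continue  # no embedding available: left for manual tagging
        spelling, word_vec = vectors[word]
        scored = sorted(
            (c for c in categories if c in vectors),
            key=lambda c: cosine(word_vec, vectors[c][1]),
            reverse=True,
        )
        ranking[spelling] = [vectors[c][0] for c in scored[:top_n]]
    return ranking


def group_by_category(ranking, top_n=3):
    """Invert the ranking into {category: ([rank-1 words], [rank-2 words], ...)}."""
    grouped = {}
    for word, cats in ranking.items():
        for rank, cat in enumerate(cats[:top_n]):
            slots = grouped.setdefault(cat, tuple([] for _ in range(top_n)))
            slots[rank].append(word)
    return grouped
```

With the real data, `load_vectors` would be passed an open file handle for crawl-300d-2M.vec rather than a list of strings. Ranking per word and then inverting matches the example above in one respect: each fruit appears at most once in each of the three positions.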
- Run:

      python app.py

  This starts a Flask web app which allows you to interact with your categories.
- [Optional.] Any words in words.txt which are not in the fastText database cannot be categorised automatically. These get put into a new file, for_manual_tagging.txt. Imagine that for some reason, "raspberry" and "coconut" couldn't be categorised automatically. Then for_manual_tagging.txt would look like:

      raspberry coconut

  We can now manually assign them categories by modifying the file as follows:

      raspberry pink red coconut brown white

  The next time we run parse_files.py, these will be put into a new file, manual_tags.txt, where they will be processed by the categoriser.
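To make the manual tagging format concrete, here is a minimal sketch of how such a file could be interpreted. It assumes that any token matching a known category is a tag for the most recent non-category word; the function name `parse_manual_tags` is hypothetical, and the actual parsing rules in parse_files.py may differ.

```python
def parse_manual_tags(text, categories):
    """Map each word to the list of categories that follow it.

    A token that matches a known category is attached to the most recent word;
    any other token starts a new word. Capitalisation is ignored.
    """
    known = {c.lower() for c in categories}
    tags = {}
    current = None
    for token in text.split():
        key = token.lower()
        if key in known and current is not None:
            tags[current].append(key)
        else:
            current = key
            tags.setdefault(current, [])
    return tags


colours = ["Red", "Blue", "Green", "Yellow", "Purple", "Pink", "Brown", "Black", "White", "Magenta"]
tagged = parse_manual_tags("raspberry pink red coconut brown white", colours)
# tagged == {"raspberry": ["pink", "red"], "coconut": ["brown", "white"]}
```

One limitation of this approach is that a word which is itself a category name would be read as a tag rather than as a new word, so a real implementation may need a stricter layout, for example one word per line.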