Dataset processing and download.

The dataset currently in use is a stripped-down version of the [Kaggle arXiv Dataset](https://www.kaggle.com/Cornell-University/arxiv/) in which only the following categories are retained: `cs.AI`, `cs.CL`, `cs.CV`, `cs.LG`, `cs.MA`, `cs.NE`.
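As a rough illustration of the filtering step, here is a minimal sketch. It assumes the Kaggle metadata is a JSON Lines file in which each record has a `categories` field holding a space-separated string, and that a paper is kept if any of its categories is in the list above (rather than only its primary one). The names `KEPT_CATEGORIES`, `keep_record`, and `filter_lines` are hypothetical and not part of any existing script.

```python
import json
from typing import Iterable, Iterator

# Categories retained in the stripped-down dataset (from the list above).
KEPT_CATEGORIES = frozenset({"cs.AI", "cs.CL", "cs.CV", "cs.LG", "cs.MA", "cs.NE"})


def keep_record(record: dict) -> bool:
    """Return True if any of the record's categories is in KEPT_CATEGORIES.

    Assumption: `categories` is a space-separated string such as "cs.AI math.OC".
    """
    categories = record.get("categories", "").split()
    return any(category in KEPT_CATEGORIES for category in categories)


def filter_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSON Lines input and yield only the records to keep."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if keep_record(record):
            yield record
```

Because `filter_lines` accepts any iterable of strings, it can be fed an open file handle directly, so the full metadata file never has to be loaded into memory at once.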

We should self-host this dataset, provide the scripts to process it, and keep it up to date with the upstream arXiv data.