The dataset currently in use is a stripped-down version of the Kaggle arXiv Dataset in which only the following categories are retained: cs.AI, cs.CL, cs.CV, cs.LG, cs.MA, cs.NE.
We should self-host this dataset, provide the scripts used to process it, and keep it up to date with the original arXiv.
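The category filtering described above could be sketched roughly as follows. This is a minimal illustration, not the script actually used to build the dataset. It assumes the Kaggle metadata is a JSON Lines file in which each record carries a space-separated `categories` string, and it assumes a paper is kept when any of its categories is in the retained list (the text does not say whether only the primary category counts). The names `KEPT_CATEGORIES`, `filter_records`, and `filter_file` are hypothetical.

```python
import json
from typing import Iterable, Iterator

# Categories retained in the stripped-down dataset (taken from the text above).
KEPT_CATEGORIES = frozenset({"cs.AI", "cs.CL", "cs.CV", "cs.LG", "cs.MA", "cs.NE"})


def filter_records(lines: Iterable[str], kept=KEPT_CATEGORIES) -> Iterator[dict]:
    """Yield records that list at least one retained category.

    Assumption: each line is one JSON object whose "categories" field is a
    space-separated string such as "cs.AI math.OC".
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        categories = record.get("categories", "").split()
        if kept.intersection(categories):
            yield record


def filter_file(src_path: str, dst_path: str) -> int:
    """Write the retained records to dst_path as JSON Lines; return the count."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for record in filter_records(src):
            dst.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Streaming line by line keeps memory use flat regardless of the snapshot size, so the same function could be rerun on each new snapshot to keep the self-hosted copy current.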