Generate Plausible Python 3 Documentation using Deep Learning
This project is NOT pure Python (unfortunately for the purists): I use Go for web crawling and some parallelized data cleaning. That said, it would be entirely possible to turn this into a 'pure' Python project using other tools.
- Write a web crawler to fetch text from the python 3.7 docs
- Create a data pipeline to turn that text into a learnable format.
- Train an RNN (or LSTM, or GRU) on the text.
- Generate some documentation using deep learning
Originally, the code used a character-level approach. That method performed poorly, and training was time-consuming even on a GPU.
Then I fine-tuned the pretrained GPT-2 model, which was much more successful. Samples of the generated text can be found in the 'samples' folder; I've included the best of it below.
The first line of Formatter class is defined by
Formatter.setfield(value, type)
If type is a sequence of
strings that contains a field named value, the field is
modified by the setfield() method. If type is a single
string, the field is modified by the setfield() method. For example:
>>> class _Field:
... 'fieldname' = '__class__'
... 'value' = 2
This class is documented in section Field objects.
Impressively, the model was able to produce believable Python code.
To duplicate the results of this repository, I recommend pre-installing Go and Python 3.7+.
- Upload the file `notebook/gpt2_model.ipynb` to a Google Colab notebook.
- Upload the file `src/data/raw_corpus.tar.gz` to that notebook's workspace.
- Ensure the Colab runtime has GPU acceleration enabled.
- Run the notebook, and sit back for ~10-20 minutes while the model fine-tunes.
I use conda to manage my Python packages, but because this codebase is multi-lingual, there is no one-size-fits-all solution. The data pipeline uses very few non-standard packages, and I recommend doing model training and text generation on Colab using notebook/gpt2_model.ipynb, which installs a single package into the cloud environment. In other words, this project is small enough that I don't think it requires a dedicated virtual environment.
The Go pipeline only requires you to install the package colly and its dependencies; the rest of the Go packages used come from the standard library.
However, if you wish to create your own Python environment (with conda or venv), I have provided a requirements.txt file for your convenience. Note: model training is done on Google Colab, so although TensorFlow and Keras are used in this codebase, they need not be installed locally.
I've included a compressed npz file which you can use to train your own models. Alternatively, you can prepare the data yourself from scratch.
There is a file src/data/raw_corpus.tar.gz which you can unpack, either from
the command line or with Python's standard 'shutil' package. Make sure it
is unpacked into the directory data/raw/corpus/ so the rest of the repo's paths resolve correctly.
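A minimal, self-contained sketch of the shutil route (the tiny demo archive and paths here are stand-ins so the snippet runs anywhere; the real call would target src/data/raw_corpus.tar.gz and data/raw/corpus/):

```python
import os
import shutil
import tarfile

# Build a tiny stand-in archive so this snippet is runnable on its own.
os.makedirs("demo_src", exist_ok=True)
with open("demo_src/page.txt", "w") as f:
    f.write("hello docs")
with tarfile.open("raw_corpus_demo.tar.gz", "w:gz") as tar:
    tar.add("demo_src/page.txt", arcname="page.txt")

# The real call for this repo would be:
#   shutil.unpack_archive("src/data/raw_corpus.tar.gz", "data/raw/corpus/")
# unpack_archive infers the format from the .tar.gz extension and creates
# missing parent directories while extracting.
shutil.unpack_archive("raw_corpus_demo.tar.gz", "data/raw/corpus_demo/")
print(os.path.exists("data/raw/corpus_demo/page.txt"))  # True
```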
After installing Go and Python, run the script src/download_and_prepare_data.sh from the repository root:
chmod u+x src/download_and_prepare_data.sh
./src/download_and_prepare_data.sh
The last Python script (txt-to-numpy.py) may fail due to memory pressure; rerun it on a machine with sufficient memory and it should complete without much trouble.
- Install the following go package to replicate the web crawling.
go get github.com/gocolly/colly
- Run the web scraping tool
go run src/data/crawl.go
The next steps are only necessary if you want to integer-encode the characters.
- Process the data in the raw corpus into character tokens
go run src/features/unique_chars.go -dir=data/raw/corpus
go run src/features/encode_text.go -dir=data/raw/corpus -target=data/interim/cleaned_corpus
- Transform the integer-encoded character tokens into a compressed NumPy matrix
python src/features/count-chars.py
python src/features/txt-to-numpy.py
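The encoding steps above can be sketched in pure Python plus NumPy. This is an illustrative stand-in for what unique_chars.go, encode_text.go, and txt-to-numpy.py appear to do, not the repo's actual implementation; the sample text, file name, and the array key "tokens" are all hypothetical:

```python
import numpy as np

# Stand-in corpus text; the real pipeline reads files under data/raw/corpus/.
text = "def spam(eggs):\n    return eggs"

# Collect the unique characters and assign each one an integer id
# (the character-token vocabulary).
vocab = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(vocab)}

# Integer-encode the text, one id per character.
encoded = np.array([char_to_id[ch] for ch in text], dtype=np.int32)

# Save the tokens as a compressed .npz; "tokens" is an assumed key name.
np.savez_compressed("corpus_demo.npz", tokens=encoded)

# Round-trip check: reload the archive and decode back into text.
with np.load("corpus_demo.npz") as archive:
    decoded = "".join(vocab[i] for i in archive["tokens"])
print(decoded == text)  # True
```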