ScienceBeam Utils

Overview

Provides utility functions related to the ScienceBeam project.

Please refer to the development documentation if you wish to contribute to the project.

Most tools are not yet documented. Please feel free to browse the code or tests, or raise an issue.

Pre-requisites

Python 3
Apache Beam

Apache Beam may be used for preprocessing, but it is also used for its transparent FileSystems API, which makes it easy to access files in the cloud.

Install

pip install apache_beam[gcp]

pip install sciencebeam-utils

CLI Tools

Find File Pairs

The preferred input layout is one directory per manuscript, containing a gzipped PDF (.pdf.gz) and a gzipped XML (.nxml.gz) file, e.g.:

manuscript_1/
- manuscript_1.pdf.gz
- manuscript_1.nxml.gz
manuscript_2/
- manuscript_2.pdf.gz
- manuscript_2.nxml.gz

Using compressed files is optional but recommended to reduce file storage cost.

The parent directory per manuscript is optional. Without it, the PDF and XML files of a manuscript must share the same name before the extension (which is recommended in any case).
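
To illustrate the naming rule, here is a minimal Python sketch that pairs files sharing the same path before the extension. The function name pair_files and its parameters are hypothetical and are not part of sciencebeam-utils; the actual tool may match files differently.

```python
def pair_files(paths, source_suffix='.pdf.gz', xml_suffix='.nxml.gz'):
    """Pair source and XML files whose path before the extension is identical.

    Illustration only: this is not how find_file_pairs is implemented.
    """
    sources, xmls = {}, {}
    for path in paths:
        if path.endswith(source_suffix):
            sources[path[:-len(source_suffix)]] = path
        elif path.endswith(xml_suffix):
            xmls[path[:-len(xml_suffix)]] = path
    # Keep only the keys that have both a source file and an XML file.
    return [(sources[key], xmls[key]) for key in sorted(sources) if key in xmls]


paths = [
    'data/manuscript_1.pdf.gz',
    'data/manuscript_1.nxml.gz',
    'data/manuscript_2.pdf.gz',
    'data/other.nxml.gz',
]
pairs = pair_files(paths)
print(pairs)
```

Only manuscript_1 is paired here, because manuscript_2 has no XML file with a matching name.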

Run:

python -m sciencebeam_utils.tools.find_file_pairs \
--data-path <source directory> \
--source-pattern '*.pdf.gz' --xml-pattern '*.nxml.gz' \
--out <output file list csv/tsv>

e.g.:

python -m sciencebeam_utils.tools.find_file_pairs \
--data-path gs://some-bucket/some-dataset \
--source-pattern '*.pdf.gz' --xml-pattern '*.nxml.gz' \
--out gs://some-bucket/some-dataset/file-list.tsv

That will create the TSV (tab separated) file file-list.tsv with the following columns:

source_url
xml_url

That file could also be generated using any other preferred method.
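
If you generate the file list yourself, a sketch along the following lines may help. It assumes the file starts with a header row naming the two columns; the helper write_file_list and the example URLs are hypothetical.

```python
import csv
import io


def write_file_list(pairs, out):
    """Write (source_url, xml_url) pairs as tab-separated values with a header."""
    writer = csv.writer(out, delimiter='\t', lineterminator='\n')
    writer.writerow(['source_url', 'xml_url'])  # assumed header row
    writer.writerows(pairs)


# An in-memory buffer stands in for the output file in this sketch.
buffer = io.StringIO()
write_file_list(
    [('manuscript_1/manuscript_1.pdf.gz', 'manuscript_1/manuscript_1.nxml.gz')],
    buffer,
)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open(path, 'w', newline='') in place of the buffer.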

Split File List

To separate the file list into a training, validation and test dataset, the following script can be used:

python -m sciencebeam_utils.tools.split_csv_dataset \
--input <csv/tsv file list> \
--train 0.5 --validation 0.2 --test 0.3 --random --fill

e.g.:

python -m sciencebeam_utils.tools.split_csv_dataset \
--input gs://some-bucket/some-dataset/file-list.tsv \
--train 0.5 --validation 0.2 --test 0.3 --random --fill

That will create three separate files in the same directory:

file-list-train.tsv
file-list-validation.tsv
file-list-test.tsv

The file pairs will be randomly selected (--random), and one group will also include all remaining file pairs that would otherwise be left out due to rounding (--fill).
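
The effect of rounding and filling can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the function split_rows is hypothetical, group sizes are assumed to be rounded down, and the leftover rows are assigned to the first group (the tool may pick a different one).

```python
import random


def split_rows(rows, ratios, shuffle=False, fill=False, seed=None):
    """Split rows into consecutive groups sized by the given ratios."""
    rows = list(rows)
    if shuffle:
        # Mirrors the idea behind --random; the seed only makes the
        # sketch reproducible.
        random.Random(seed).shuffle(rows)
    # Round each group size down, so some rows may be left over.
    sizes = [int(len(rows) * ratio) for ratio in ratios]
    if fill:
        # Assumption: the leftover rows go to the first group.
        sizes[0] += len(rows) - sum(sizes)
    groups, start = [], 0
    for size in sizes:
        groups.append(rows[start:start + size])
        start += size
    return groups


rows = [f'manuscript_{i}' for i in range(11)]
train_rows, validation_rows, test_rows = split_rows(
    rows, [0.5, 0.2, 0.3], fill=True
)
print(len(train_rows), len(validation_rows), len(test_rows))  # 6 2 3
```

With 11 rows, the rounded-down sizes are 5, 2 and 3, which leaves one row unassigned unless filling is enabled.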

As with the previous step, you may decide to use your own process instead.

Note: once you have started using those files, they should not be changed anymore.

Get Output Files

ScienceBeam converts files, so each source file has a corresponding output file. To make the output filenames explicit, they are also kept in a file list. This tool generates that file list; for this purpose it doesn't matter whether the output files already exist.

e.g.

python -m sciencebeam_utils.tools.get_output_files \
  --source-file-list path/to/source/file-list-train.tsv \
  --source-file-column=source_url \
  --output-file-suffix=.xml \
  --output-file-list path/to/results/file-list.lst

With the --check argument, the tool will also check whether the output files exist (see below).
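
As a rough illustration of how an output filename could be derived from a source filename, consider the sketch below. The mapping rule is an assumption: the function get_output_file and the list of stripped source suffixes are hypothetical, and the real tool may name or place output files differently.

```python
def get_output_file(source_url, output_file_suffix,
                    source_suffixes=('.pdf.gz', '.pdf')):
    """Replace a known source extension with the output suffix.

    Assumed rule for illustration; not taken from sciencebeam-utils.
    """
    for suffix in source_suffixes:
        if source_url.endswith(suffix):
            return source_url[:-len(suffix)] + output_file_suffix
    # Unknown extension: append the suffix rather than guess.
    return source_url + output_file_suffix


output_file = get_output_file(
    'gs://some-bucket/some-dataset/manuscript_1/manuscript_1.pdf.gz', '.xml'
)
print(output_file)
```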

Check File List

After generating an output file list, this tool can be used to check whether the output files exist or are complete.

e.g.

python -m sciencebeam_utils.tools.check_file_list \
  --file-list path/to/results/file-list.lst \
  --file-column=source_url \
  --limit=100

This will check the first 100 output files and report on them. The command will fail if none of the output files exist.
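
The described behaviour (check up to a limit, report, fail if nothing exists) can be sketched in Python as below. The function check_file_list here is a hypothetical stand-in, not the tool's code, and it only handles local paths, whereas the tool is meant to work with cloud storage as well.

```python
import os


def check_file_list(file_list, limit=None):
    """Check the first `limit` files and return the ones that are missing."""
    checked = list(file_list)[:limit]
    missing = [path for path in checked if not os.path.exists(path)]
    found = len(checked) - len(missing)
    print(f'{found} of {len(checked)} files exist')
    if checked and found == 0:
        # Mirrors "the command will fail if none of the output files exist".
        raise FileNotFoundError('none of the checked files exist')
    return missing
```

For example, check_file_list(paths, limit=100) returns the missing paths among the first 100 entries and raises an error if none of them exist.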
