This codebase is based on the latest version of the Play framework
and as such it needs Java 8 to build. Modules are defined under
`modules`. The main Play app is defined in `app`. To build the
main app, type

$ ./activator {target}

where {target} can be one of {compile, run, test, dist}. Building modules is
similar:

$ ./activator {module}/{target}

where {module} is the module name as it appears under modules/
and {target} can be {compile, test}. To run a particular
class in a particular module, use the runMain syntax, e.g.,

$ ./activator "project stitcher" "runMain ncats.stitcher.tools.DuctTape"

We propose a graph-based approach to entity stitching and resolution. Briefly, our approach uses clique detection to do the stitching and resolution as follows:
- For a given hypergraph (multi-edge) of stitched entities, extract connected components based on stitching keys as defined in `StitchKey`.
- For each connected component, perform exhaustive clique enumeration over each stitch key. A clique is a complete subgraph of size 3 or larger.
- Next, we identify a set of high-confidence cliques. A high-confidence clique is a clique whose members do not belong to any other clique. All nodes in a clique are merged to become a stitched node.
- For the leftover cliques, we sort in descending order of the value |V| * |E|, where |V| and |E| are the clique size and the cardinality of stitch keys, respectively. Stitched nodes are created as we iterate through this order, ignoring any nodes that have already been stitched.
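
The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's Java implementation. It assumes edges are given as `(node, node, stitch_key)` triples, and it enumerates maximal cliques with Bron-Kerbosch instead of every clique. It takes |E| to be the number of stitch keys under which the same node set forms a clique, and it lets leftover nodes form a stitched node of any size. The function names and the keys `KEY_1` and `KEY_2` are hypothetical.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes that are linked by at least one stitch-key edge."""
    adjacency = defaultdict(set)
    for u, v, _key in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def maximal_cliques(adjacency):
    """Bron-Kerbosch (no pivoting): yield every maximal clique as a frozenset."""
    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            yield frozenset(clique)
            return
        for v in list(candidates):
            yield from expand(clique | {v},
                              candidates & adjacency[v],
                              excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}
    yield from expand(set(), set(adjacency), set())


def stitch(nodes, edges, min_size=3):
    """Return stitched nodes (sets of members) following the steps above."""
    stitched = []
    for component in connected_components(nodes, edges):
        # One adjacency map per stitch key, restricted to this component.
        per_key = defaultdict(lambda: defaultdict(set))
        for u, v, key in edges:
            if u in component:
                per_key[key][u].add(v)
                per_key[key][v].add(u)
        # clique -> stitch keys under which it is a clique (assumed meaning of |E|).
        support = defaultdict(set)
        for key, adjacency in per_key.items():
            for clique in maximal_cliques(adjacency):
                if len(clique) >= min_size:
                    support[clique].add(key)
        # High confidence: no member belongs to any other clique.
        membership = defaultdict(int)
        for clique in support:
            for node in clique:
                membership[node] += 1
        used, leftovers = set(), []
        for clique in support:
            if all(membership[node] == 1 for node in clique):
                stitched.append(set(clique))
                used |= clique
            else:
                leftovers.append(clique)
        # Leftovers: descending |V| * |E|; ties broken by member names.
        leftovers.sort(key=lambda c: (-len(c) * len(support[c]), sorted(c)))
        for clique in leftovers:
            remaining = set(clique) - used
            if remaining:
                stitched.append(remaining)
                used |= remaining
    return stitched


# Hypothetical data: one triangle backed by two keys, two overlapping triangles.
edges = [
    ("a", "b", "KEY_1"), ("b", "c", "KEY_1"), ("a", "c", "KEY_1"),
    ("a", "b", "KEY_2"), ("b", "c", "KEY_2"), ("a", "c", "KEY_2"),
    ("d", "e", "KEY_1"), ("e", "f", "KEY_1"), ("d", "f", "KEY_1"),
    ("e", "f", "KEY_2"), ("f", "g", "KEY_2"), ("e", "g", "KEY_2"),
]
nodes = ["a", "b", "c", "d", "e", "f", "g"]
result = stitch(nodes, edges)
```

In this toy input, `{a, b, c}` is a high-confidence clique and is merged directly. The triangles `{d, e, f}` and `{e, f, g}` share members, so they are handled in sorted order: the first becomes a stitched node and only `g` is left over from the second.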
- Try invoking the `sbt` shell to check if it is available, then `exit`.
$ sbt
- Initiate (define auxiliary functions, check for Java version, etc.), then `exit`.
$ bash activator
- Build, stitch, and calculate events.
a) Make sure you have a file `.sbtopts` in your `stitcher` directory that has the following content:
-J-Xms1024M -J-Xmx16G -J-Xss1024M -J-XX:+CMSClassUnloadingEnabled -J-XX:+UseConcMarkSweepGC
b) Check the script and search for the database name (e.g. stitchv1.db):
$ cat scripts/stitch-all-current.sh
If you have a database with the same name in your stitcher directory, either remove it or modify the script to use a different db name (e.g. stitchv2.db).
c) From the stitcher directory, run:
$ bash scripts/stitch-all-current.sh
NOTE: Building the database and stitching should take about 4 and 5 hours, respectively, on a laptop (i5-4200U @ 2.3 GHz, 8GB RAM).
The complete process on a server (ifxdev.ncats.nih.gov) takes approximately 5-6 hours.
NOTE: Since the process takes a while, it is better to run it in a separate screen session, so that it keeps running if the connection to the server/terminal is reset.
While nohup is another option, it is problematic in this case, as it will stop the job at the end of every command due to a tty output attempt.
$ screen
$ bash scripts/stitch-all-current.sh > stitch.out 2>&1
#press 'ctrl+a', then 'd' to disconnect from the screen
NOTE: If you encounter errors, try cleaning the project by removing all target directories directly, and then re-run the script:
$ find . -name target -type d -exec rm -rf {} \;
$ bash scripts/stitch-all-current.sh
- In your `stitcher` directory, make a symbolic link `stitcher.ix/data.db` pointing to the database you have just made.
#first, remove old link or a folder with the same name (if present)
$ rm -r stitcher.ix/data.db
#then create the symlink
$ ln -s ../stitchv1.db stitcher.ix/data.db
- Navigate to your `stitcher` directory and run the project.
$ sbt run
- When prompted in the console, navigate to http://localhost:9000/app/stitches/latest in your browser.
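
The symlink step above (remove the old `stitcher.ix/data.db`, then link it to the new database) can also be scripted, which may be convenient when databases are rebuilt often. The following is a minimal sketch under stated assumptions: the helper name `relink_database` is hypothetical and not part of the repository, and it assumes the database folder sits directly inside the stitcher directory, as in the commands above. Like `rm -r`, it deletes a real folder found at the link location, so use it with care.

```python
import os
import shutil


def relink_database(stitcher_dir, db_name, link_rel="stitcher.ix/data.db"):
    """Point stitcher.ix/data.db at db_name, replacing an old link or folder."""
    db_path = os.path.join(stitcher_dir, db_name)
    link_path = os.path.join(stitcher_dir, link_rel)
    if not os.path.isdir(db_path):
        raise FileNotFoundError(db_path)
    os.makedirs(os.path.dirname(link_path), exist_ok=True)
    # Equivalent of `rm -r stitcher.ix/data.db`: drop an old link or folder.
    if os.path.islink(link_path):
        os.remove(link_path)
    elif os.path.isdir(link_path):
        shutil.rmtree(link_path)
    # Relative target, as in `ln -s ../stitchv1.db stitcher.ix/data.db`.
    target = os.path.relpath(db_path, os.path.dirname(link_path))
    os.symlink(target, link_path)
    return link_path
```

For example, `relink_database(".", "stitchv1.db")` run from the stitcher directory would create the same link as the `ln -s` command shown earlier.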
#### (optional -- only do this if you have changed the stitcher code or are starting anew)
- !!!Please make sure you run the following test when you update the stitching algorithm
$ sbt stitcher/"testOnly ncats.stitcher.test.TestStitcher"
and ensure that all the basic stitching test cases pass before doing a build.
- Make a distribution. In the `stitcher` directory run:
$ sbt dist
It will be created in stitcher/target/universal/ and have a name similar to ncats-stitcher-master-20171110-400d1f1.zip.
- Copy the archive to the deployment server (e.g.
dev.ncats.io). For example:
#navigate to path-to-stitcher-parent-directory/stitcher/target/universal/
#scp to the server
$ scp ncats-stitcher-master-20171110-400d1f1.zip centos@dev.ncats.io:/tmp
- Unzip into the desired folder (on centos@dev.ncats.io, it is `~`).
#navigate to the desired folder on the deployment server
$ ssh centos@dev.ncats.io
#unzip
$ unzip /tmp/ncats-stitcher-master-20171110-400d1f1.zip
- In the `stitcher` folder (where you have prepared the database), archive the database folder and copy it over to the deployment server.
$ zip -r stitchv1db.zip stitchv1.db/
$ scp stitchv1db.zip centos@dev.ncats.io:/tmp
- On the deployment server, navigate to a directory containing the stitcher distribution folder and unzip the database.
$ ssh centos@dev.ncats.io
$ unzip /tmp/stitchv1db.zip
- Start up the app. The script takes the distribution and db folders as arguments.
$ bash restart-stitcher.sh ncats-stitcher-master-20171110-400d1f1 stitchv1.db
- A distribution folder (e.g. `~/ncats-stitcher-master-20171110-400d1f1`).
- A database (e.g. `~/stitchv1.db`).
- A `files-for-stitcher.ix` folder with three files.
- The script for (re)starting stitcher, `restart-stitcher.sh`.
https://stitcher.ncats.io/app/stitches/latest
https://stitcher.ncats.io/app/stitches/latest/ + UNII
https://stitcher.ncats.io/app/stitches/latest/aspirin
https://stitcher.ncats.io/api/datasources
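
The URL patterns listed above can be assembled programmatically. The sketch below only builds the URLs and does not contact the server. The helper names are hypothetical, and the assumption that a lookup term (a UNII or a name such as `aspirin`) is appended as a single URL-encoded path segment is inferred from the examples above rather than from any API documentation.

```python
from urllib.parse import quote

BASE_URL = "https://stitcher.ncats.io"


def latest_stitch_url(term=None):
    """URL of the latest stitches, optionally narrowed to a UNII or a name."""
    url = BASE_URL + "/app/stitches/latest"
    if term:
        # Encode the term so spaces or slashes cannot break the path.
        url += "/" + quote(term, safe="")
    return url


def datasources_url():
    """URL of the data source listing."""
    return BASE_URL + "/api/datasources"
```

For instance, `latest_stitch_url("aspirin")` yields the third URL in the list above.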
- Problem:
java.lang.NumberFormatException: For input string: "0x100"
Cause:
`SBT` uses `jline` for terminal output, which in turn relies on the `infocmp` utility provided by `ncurses` and expects only decimal values. This behaviour was fixed in a newer version of `jline` and a newer version of `SBT`; however, version `0.13.15` used for this project still suffers from it.
Solution:
Add the following to your `~/.bashrc`:
export TERM=xterm-color
The underlying Neo4j for stitcher is publicly accessible here. Please specify stitcher.ncats.io:80 in the Host field. No credentials are needed.
$ cd scripts
$ python approvalYears.py
(requires Python 3+)
In the /data folder, there should now be a file like approvalYears-2020-12-14.txt. If acceptable, update the filename reference in /data/conf/ob.conf to point to this new file.
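
To find which generated file the configuration should point to, a small helper can pick the most recent dated file. This is a sketch only: the function name is hypothetical, and it assumes the output always follows the `approvalYears-YYYY-MM-DD.txt` naming seen in the example above. It does not edit `ob.conf`; updating the filename reference remains a manual step.

```python
import re
from datetime import date
from pathlib import Path

# Assumed naming scheme, based on the example approvalYears-2020-12-14.txt.
PATTERN = re.compile(r"^approvalYears-(\d{4})-(\d{2})-(\d{2})\.txt$")


def newest_approval_years_file(data_dir):
    """Return the Path of the most recently dated file, or None if absent."""
    best = None
    for path in Path(data_dir).iterdir():
        match = PATTERN.match(path.name)
        if not match:
            continue
        try:
            stamp = date(*map(int, match.groups()))
        except ValueError:
            continue  # skip names with an impossible date, e.g. month 13
        if best is None or stamp > best[0]:
            best = (stamp, path)
    return best[1] if best else None
```

Calling `newest_approval_years_file("data")` from the repository root would return the newest matching file in the data folder, if one exists.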