An Open Source Web Application for Genetic Data (SNPs) using 23AndMe and Data Crawling Technologies
OSGenome2 is a Flask-based web application that cross-references your personal SNP data with a comprehensive SNPedia database to provide personalized genomic insights and trait analysis.
From Bioinformatics - A Practical Approach by Shui Qing Ye, M.D., Ph.D. (pg 108):
SNP, pronounced “snip,” stands for single-nucleotide polymorphism, which represents a substitution of one base for another, e.g., C to T or A to G. SNP is the most common variation in the human genome and occurs approximately once every 100 to 300 bases. SNP is terminologically distinguished from mutation based on an arbitrary population frequency cutoff value: 1%, with SNP [greater than] 1% and mutation [less than] 1%. A key aspect of research in genetics is associating sequence variations with heritable phenotypes. Because SNPs are expected to facilitate large-scale association genetics studies, there has been an increasing interest in SNP discovery and detection.
23andMe genotypes hundreds of thousands of SNPs, which inform everything from your genetic ancestry (haplogroups) to whether you are more likely to think cilantro tastes like soap or how quickly you likely metabolize caffeine. Fortunately and unfortunately, a great deal of information has been published on individual SNPs and their possible associations. Much like phrenology in the late 18th and early 19th centuries, which tried to infer personality from the shape of the skull, it is easy to draw conclusions from noise. Enter OSGenome2, where you can explore links and research at your own pace. It shows which genotype is yours and what that genotype means in the context of published findings. From there you can search for the SNP's rsID yourself or follow the rsID link to read more about that SNP on SNPedia.
OSGenome2/
├── app.py # Flask application & SNP cross-referencing logic
├── crawler.py # SNPedia web crawler
├── GenomeImporter.py # 23AndMe Genome Importer
├── snpDict.json # Your 23AndMe SNP data (Genome Importer Generated)
├── category_snps.jsonl # Claude-curated category tags (JSONL format)
├── detailed_snps.json # SNPedia data — starter dataset included (see below)
├── snpedia_snps.json # Cached list of all SNPedia rsIDs — included
├── crawl_progress.jsonl # Crawler progress tracking — created on first crawl run
├── templates/
│ └── index.html # Dashboard UI
├── README.md
└── LICENSE
- Python 3.7 or higher
- pip (Python package manager)
pip install flask requests
detailed_snps.json ships with a curated starter dataset of 3,975 SNPs sourced from SNPedia. You can use it immediately without running the crawler, or replace it with a full personalized crawl.
The included dataset contains every SNP where at least one genotype has a SNPedia magnitude ≥ 1. SNPedia's magnitude scale is roughly:
| Magnitude | Meaning |
|---|---|
| 0 | No known significance / benign |
| 1 | Interesting, worth knowing |
| 2 | Moderate clinical relevance |
| 3+ | High significance (e.g. hereditary cancer, cardiomyopathy) |
Filtering at ≥ 1 removes ~21,500 low-signal entries (variants with no documented effect) while keeping every SNP with a meaningful annotation. This covers the vast majority of clinically relevant results for most users.
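The magnitude filter described above could be sketched as follows. The record layout (an `rsid` key plus a `genotypes` mapping with one `magnitude` per genotype) is an assumption made for illustration; the actual layout of detailed_snps.json may differ, and the sample records are invented.

```python
def keep_snp(record, threshold=1.0):
    """Return True if at least one genotype reaches the magnitude threshold."""
    magnitudes = [
        genotype.get("magnitude") or 0  # treat a missing or None magnitude as 0
        for genotype in record.get("genotypes", {}).values()
    ]
    return any(m >= threshold for m in magnitudes)


# Hypothetical records, not real SNPedia data.
records = [
    {"rsid": "rs_example_1",
     "genotypes": {"(A;A)": {"magnitude": 0}, "(A;G)": {"magnitude": 2.1}}},
    {"rsid": "rs_example_2",
     "genotypes": {"(C;C)": {"magnitude": 0}, "(C;T)": {"magnitude": None}}},
]

starter = [r for r in records if keep_snp(r)]
```

Only the first record survives, because it is the only one with a genotype at or above the threshold.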
Just import your genome (Step 1 below) and run the app — the starter data is already there.
For complete coverage of every SNP in your specific genome:
- Delete detailed_snps.json
- Run python crawler.py
The crawler queries SNPedia for each rsID in your snpDict.json at one request per second (to be respectful to SNPedia's servers). A full 23andMe genome contains ~600,000 SNPs, though SNPedia only has meaningful data for a fraction of them. Expect the crawl to run for several hours. The app can be used at any point during the crawl; it reloads new results automatically as the file grows.
Use GenomeImporter.py to convert your raw 23AndMe DNA text file into the required SNP dictionary format.
python GenomeImporter.py -f <path_to_23andme_file.txt> -o snpDict.json
What happens:
- Reads your raw 23AndMe text file line-by-line
- Extracts SNP IDs (rsids) and genotypes
- Formats genotypes to SNPedia standard: (A;G) syntax
- Exports the processed data to snpDict.json
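The conversion steps above can be sketched roughly as follows. This is not the code in GenomeImporter.py; it assumes the usual 23andMe raw layout (tab-separated rsid, chromosome, position, genotype, with `#` comment lines), and the decisions to skip no-calls (`--`) and non-`rs` identifiers are assumptions. The sample lines use placeholder positions.

```python
def parse_23andme(lines):
    """Map each rsID to a SNPedia-style genotype string such as "(A;G)"."""
    snps = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # skip malformed rows
        rsid, genotype = fields[0], fields[3]
        if not rsid.startswith("rs") or "-" in genotype:
            continue  # assumption: drop internal IDs and no-calls
        # "AG" becomes "(A;G)"; a single-allele call "A" becomes "(A)"
        snps[rsid] = "(" + ";".join(genotype) + ")"
    return snps


# Illustrative input only; positions are placeholders.
SAMPLE = """# rsid\tchromosome\tposition\tgenotype
rs53576\t3\t1\tAG
rs1815739\t11\t2\tCT
i3000001\tMT\t3\tG
rs0000001\t1\t4\t--
"""

snp_dict = parse_23andme(SAMPLE.splitlines())
```

Writing `snp_dict` out with `json.dump` would then produce a file in the same key/value shape as snpDict.json.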
Expected output format:
{
"rs53576": "(A;G)",
"rs1815739": "(C;T)",
"rs6152": "(A;G)"
}
Skip this step if you want to use the included starter dataset.
Delete detailed_snps.json first, then run:
python crawler.py
What happens:
- Reads all SNP IDs from snpDict.json
- For each SNP, queries the SNPedia API (bots.snpedia.com)
- Extracts SNP metadata: gene, chromosome, orientation, summary
- Retrieves all genotype-specific data (magnitude & traits)
- Writes results line-by-line to detailed_snps.json (JSONL format)
The crawler tracks every attempted SNP in crawl_progress.jsonl — including failures and gateway errors — so re-running the command will always pick up exactly where it left off without re-scanning anything.
python crawler.py # resumes automatically from crawl_progress.jsonl

| Flag | Description |
|---|---|
| -s / --start <rsID\|index> | Skip all SNPs before this point and mark them as skipped in the progress file. Accepts an rsID (e.g. rs53576) or a zero-based numeric index. |
| --crawl-skipped | Re-crawl SNPs previously marked as skipped via --start, while still skipping anything already successfully crawled or confirmed missing. |
| --reset | Clear the progress file entirely and start the crawl from scratch. |
| --refresh-snplist | Re-fetch the SNPedia SNP list even if a local cache (snpedia_snps.json) exists. |
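For readers who want to see how flags like these are typically wired up, here is a minimal argparse sketch. Only the flag names come from the table above; the helper names `build_parser` and `resolve_start` are hypothetical, and the real crawler.py may parse its arguments differently.

```python
import argparse


def build_parser():
    """Declare the four crawler flags described in the table."""
    parser = argparse.ArgumentParser(description="SNPedia crawler (sketch)")
    parser.add_argument("-s", "--start", metavar="rsID|index",
                        help="skip all SNPs before this rsID or zero-based index")
    parser.add_argument("--crawl-skipped", action="store_true",
                        help="re-crawl SNPs previously marked as skipped")
    parser.add_argument("--reset", action="store_true",
                        help="clear the progress file and start over")
    parser.add_argument("--refresh-snplist", action="store_true",
                        help="re-fetch the SNPedia SNP list")
    return parser


def resolve_start(start, rsids):
    """Turn the --start value into a zero-based index into rsids."""
    if start is None:
        return 0
    if start.isdigit():
        return int(start)       # numeric index given directly
    return rsids.index(start)   # raises ValueError if the rsID is absent


args = build_parser().parse_args(["--start", "rs53576", "--crawl-skipped"])
```

Accepting one string and resolving it afterwards keeps a single `--start` flag usable for both an rsID and an index.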
snpedia_snps.json is a cached list of every rsID that SNPedia has a page for, fetched from Category:Is_a_snp. This file is included in the repository — generating it from scratch takes ~30 minutes of paginated API calls, so the cached copy lets you start crawling immediately.
Before each crawl, your genome's SNPs are pre-filtered against this list so SNPs with no SNPedia page are never queried. A typical 23andMe genome has ~600,000 SNPs; SNPedia covers roughly 25,000 of them, so this eliminates the majority of wasted requests.
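A rough sketch of the pre-filter and of the time it saves is shown below. The function names are hypothetical, and the estimate assumes exactly one request per SNP at one request per second, which may understate the real crawl if more than one request is needed per SNP.

```python
def prefilter(genome_rsids, snpedia_rsids):
    """Keep only genome rsIDs that have a SNPedia page, preserving order."""
    known = set(snpedia_rsids)  # set membership keeps the filter fast
    return [rsid for rsid in genome_rsids if rsid in known]


def crawl_hours(n_snps, seconds_per_request=1.0, requests_per_snp=1):
    """Estimate crawl duration in hours under the stated assumptions."""
    return n_snps * requests_per_snp * seconds_per_request / 3600


unfiltered = crawl_hours(600_000)  # every SNP in a typical genome
filtered = crawl_hours(25_000)     # only SNPs that SNPedia covers
```

Under these assumptions the unfiltered crawl would take roughly 167 hours, while the filtered one takes roughly 7.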
To refresh the cache (e.g. to pick up SNPs newly added to SNPedia):
python crawler.py --refresh-snplist
Or delete the file manually and re-run; the crawler will regenerate it automatically.
Example workflow — start mid-list, then backfill:
# Start crawling from rs53576 (skips everything before it)
python crawler.py --start rs53576
# Later: go back and fill in the entries that were skipped above
python crawler.py --crawl-skipped
Start the Flask app (can be run at any point, with or without running the crawler):
python app.py
The application will be available at http://localhost:5000
Extracts all SNP IDs (top-level keys) from your snpDict.json file.
Fetches main SNP page data from SNPedia including:
- Summary
- Chromosome location
- Gene name
- Strand orientation (+ or -)
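As a hedged illustration of how fields like these might be pulled out of a page, the sketch below parses `|key=value` template parameters from wikitext. It assumes the page wikitext carries these values as template parameters; the field names, the `minus`/`plus` spelling, and the sample text are all assumptions made for illustration, not taken from the project's crawler.

```python
import re

# One "|key=value" parameter per line; spaces around "=" are tolerated.
PARAM = re.compile(r"^\|[ \t]*(\w+)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE)


def parse_template_fields(wikitext):
    """Collect |key=value template parameters into a dict."""
    return {m.group(1): m.group(2) for m in PARAM.finditer(wikitext)}


def orientation_symbol(fields):
    """Map an assumed 'plus'/'minus' value to '+' or '-' (None if unknown)."""
    return {"plus": "+", "minus": "-"}.get(fields.get("Orientation", "").lower())


# Invented sample wikitext, not copied from SNPedia.
SAMPLE = """{{Rsnum
|rsid=0000001
|Gene=EXAMPLE1
|Chromosome=3
|Orientation=minus
}}"""

fields = parse_template_fields(SAMPLE)
```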
Retrieves all genotype variations for a specific SNP with their:
- Allele pairs (e.g., A;A, A;G, G;G)
- Magnitude (SNPedia's subjective 0–10 measure of interest, not a statistical effect size)
- Associated traits/summaries
Converts line-delimited JSON to a standard JSON array format.
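The conversion amounts to parsing each non-empty line and collecting the results in a list. A minimal sketch, with a hypothetical function name:

```python
import json


def jsonl_to_array(jsonl_text):
    """Parse line-delimited JSON into a list, ignoring blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


sample = '{"snp": "rs53576"}\n\n{"snp": "rs6152"}\n'

# Serialize the list as one standard JSON array.
array_text = json.dumps(jsonl_to_array(sample), indent=2)
```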
The crawler maintains a JSONL progress file alongside detailed_snps.json. Each line records one attempted SNP:
{"snp": "rs53576", "status": "success"}
{"snp": "rs1815739", "status": "not_found"}
{"snp": "rs6152", "status": "skipped"}

| Status | Meaning |
|---|---|
| success | SNPedia returned data; written to detailed_snps.json |
| not_found | SNP has no SNPedia page |
| skipped | Skipped via --start; can be re-crawled with --crawl-skipped |
Re-running the crawler always skips success and not_found entries. Only skipped entries can be selectively resumed.
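The resume rule can be expressed in a few lines. The sketch below uses the progress-line format shown above; the function names are hypothetical, and letting later lines override earlier ones for the same SNP is an assumption.

```python
import json

DONE = {"success", "not_found"}  # statuses that are never re-crawled


def load_progress(lines):
    """Build an rsID -> status map from JSONL progress lines."""
    status = {}
    for line in lines:
        line = line.strip()
        if line:
            entry = json.loads(line)
            status[entry["snp"]] = entry["status"]  # assumption: last line wins
    return status


def remaining(rsids, status, crawl_skipped=False):
    """Return the rsIDs that still need crawling, in their original order."""
    todo = []
    for rsid in rsids:
        state = status.get(rsid)
        if state in DONE:
            continue
        if state == "skipped" and not crawl_skipped:
            continue
        todo.append(rsid)
    return todo


progress = load_progress([
    '{"snp": "rs53576", "status": "success"}',
    '{"snp": "rs1815739", "status": "not_found"}',
    '{"snp": "rs6152", "status": "skipped"}',
])
genome = ["rs53576", "rs1815739", "rs6152", "rs429358"]
```

With this input, a plain re-run would crawl only the SNP that has no progress entry, while passing `crawl_skipped=True` would also pick up rs6152.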
OSGenome2 integrates with Ollama to provide local, private AI-powered explanations of your genetic data. No data leaves your machine.
- Install Ollama from ollama.com
- Pull a model (e.g. llama3 is a good starting point): ollama pull llama3
- Make sure Ollama is running before you start the app. The app detects it automatically on page load.
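One way such a detection check could work is to ask Ollama's local HTTP API for its installed models and treat any connection failure as "not running". This is a sketch, not the app's actual code: the function names are hypothetical, and it assumes Ollama's default address and its `/api/tags` model-listing endpoint.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address


def model_names(tags_payload):
    """Extract model names from a decoded /api/tags response."""
    return [model["name"] for model in tags_payload.get("models", [])]


def list_ollama_models(base_url=OLLAMA_URL, timeout=2.0):
    """Return installed model names, or None if Ollama cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts;
        # ValueError covers a response that is not valid JSON.
        return None
```

A caller could disable the AI buttons when `list_ollama_models()` returns None and otherwise use the list to fill a model selector.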
Per-SNP Analysis — Each row in the table has a 🤖 button on the right. Clicking it opens a modal that streams a plain-language explanation of:
- What the gene does in the body
- What your specific allele means
- Relevant lifestyle or health considerations
- Caveats about interpreting DTC genetic testing
Genome Chat — A floating chat button in the bottom-right corner opens a conversational interface. Your top 25 SNPs (by magnitude) are automatically included as context, so you can ask questions like:
- "Which of my variants are most clinically significant?"
- "What does my APOE status mean?"
- "Are any of my variants related to cardiovascular risk?"
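Selecting the highest-magnitude SNPs as chat context might look like the sketch below. The record keys (`rsid`, `genotype`, `magnitude`, `summary`) and the one-line-per-SNP text format are assumptions for illustration; the sample entries are invented.

```python
def top_snps(matches, limit=25):
    """Return the `limit` matches with the highest magnitude."""
    return sorted(matches, key=lambda m: m["magnitude"], reverse=True)[:limit]


def build_context(matches, limit=25):
    """Render the top matches as one line each for use in a prompt."""
    return "\n".join(
        f'{m["rsid"]} {m["genotype"]} (magnitude {m["magnitude"]}): {m["summary"]}'
        for m in top_snps(matches, limit)
    )


# Hypothetical matches, not real annotations.
matches = [
    {"rsid": "rs_a", "genotype": "(A;G)", "magnitude": 1.5, "summary": "example trait A"},
    {"rsid": "rs_b", "genotype": "(C;T)", "magnitude": 3.0, "summary": "example trait B"},
    {"rsid": "rs_c", "genotype": "(G;G)", "magnitude": 0.0, "summary": "example trait C"},
]

context = build_context(matches, limit=2)
```

Capping the list keeps the prompt short enough for small local models while still surfacing the variants most likely to matter.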
Both features support multiple Ollama models. Use the model selector in each panel to switch between installed models.
Privacy note: All AI processing runs locally through Ollama. Your genetic data is never sent to any external server.
Medical disclaimer: AI-generated explanations are for educational purposes only. Always confirm significant findings with a licensed clinician.
- SNP Explorer: Browse all matched SNPs with detailed genotype information
- Category Filter: Clickable color-coded pills to filter by disease area
- AI Analysis: Per-SNP explanations and genome-wide chat powered by local Ollama models
- Filtering: Hide "Common in ClinVar" variants and empty traits
- Sorting: Sort by magnitude (SNPedia's interest score) to identify high-impact variants
- Search: Quick search across genes, SNP IDs, and traits
- Responsive Design: Mobile-friendly Bootstrap interface
Category badges are color-coded by disease area — both in the filter panel and in the table — so you can spot related variants at a glance.
| Color | Disease Area | Example Categories |
|---|---|---|
| 🔴 Red | Hereditary Cancer & Autoimmune | Hereditary Cancer, Lynch Syndrome, Breast/Ovarian Cancer, Rheumatoid Arthritis |
| 🟠 Orange | Cardiovascular | Cardiovascular Disease, Cardiomyopathy, Hypertrophic Cardiomyopathy, Familial Hypercholesterolemia, Aortic Aneurysm |
| 🟣 Purple | Neurological | Neurological Disease, Alzheimer's Disease, Parkinson's Disease, Epilepsy, Tuberous Sclerosis |
| 🟤 Dark Orange | Metabolic | Metabolic Disease, Phenylketonuria, Gaucher Disease, MCAD Deficiency, Hemochromatosis |
| 🟢 Teal | Drug Metabolism | Drug Metabolism, Pharmacogenomics, Warfarin Sensitivity, Alcohol Metabolism |
| 🔴 Dark Red | Blood Disorders | Blood Disorders, Thrombophilia, Sickle Cell Disease, Hemophilia |
| 🔵 Blue | Eye & Blood Type | Eye Disease, Retinitis Pigmentosa, Age-Related Macular Degeneration, Blood Type |
| 🔵 Cyan | Respiratory & Hearing | Respiratory Disease, Cystic Fibrosis, Hearing Loss |
| 🟢 Green | Connective Tissue & Kidney | Connective Tissue Disease, Marfan Syndrome, Ehlers-Danlos Syndrome, Kidney Disease |
| 🟤 Brown | Bone | Bone Disease, Hypophosphatasia |
| 🟠 Burnt Orange | Skin | Skin Disease, Epidermolysis Bullosa |
| 🟣 Deep Purple | Neuromuscular & Psychiatric | Neuromuscular Disease, Psychiatric Traits, Bipolar Disorder Risk, Dopamine Metabolism |
| 🟢 Olive | Nutrition & Physical Traits | Nutrition & Vitamins, Vitamin D, Physical Traits, Height |
| ⚫ Grey | Unclassified | Clinical Variant (SNPs not matched to a known disease category) |
- The starter detailed_snps.json contains 3,975 SNPs with SNPedia magnitude ≥ 1, i.e. all entries with a documented clinical effect
- The app reloads detailed_snps.json automatically as the crawler adds new data (at most once every 15 seconds)
- The crawler includes a 1-second delay between API requests to be respectful to SNPedia
- User alleles are automatically flipped when the SNP uses the minus strand orientation
- Magnitude values > 2.0 are highlighted as high-impact variants in the dashboard
- All data is processed locally — no personal genetic data is sent to external servers
- AI analysis requires Ollama running on localhost:11434; the app works normally without it (the chat button turns grey)
- Categories were generated by Claude (Anthropic) using a curated gene→disease mapping; always confirm significant findings with a clinician
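The strand flip mentioned in the notes above amounts to complementing each allele (A↔T, C↔G). A minimal sketch follows; the function names are hypothetical, and re-sorting the alleles alphabetically after flipping is an assumption made so the result matches the usual ordering of genotype names such as (C;T).

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip_genotype(genotype):
    """Complement each allele in a "(X;Y)" genotype string."""
    alleles = genotype.strip("()").split(";")
    # Unknown symbols (e.g. insertion/deletion markers) pass through unchanged.
    flipped = sorted(COMPLEMENT.get(allele, allele) for allele in alleles)
    return "(" + ";".join(flipped) + ")"


def orient(genotype, orientation):
    """Flip only when the SNP is annotated on the minus strand."""
    return flip_genotype(genotype) if orientation == "minus" else genotype
```

For example, a user genotype of (A;G) on a minus-strand SNP would be looked up as (C;T).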
A March 2018 study, "False-positive results released by direct-to-consumer genetic tests highlight the importance of clinical confirmation testing for appropriate patient care," found that 40% of the variants in clinically significant genes reported in raw data from direct-to-consumer companies such as 23andMe and Ancestry.com were false positives. For this reason, it is important to confirm any at-risk clinical SNPs with your doctor, who can order confirmatory testing through a clinical laboratory.
- Dr. Sergey Kornilov (Biostochastics) for a brief audit of bioinformatics correctness
GNU General Public License v3.0 - See LICENSE for details

