Customizing the projects.json file
The projects.json file is a JSON file that controls the MiCall pipeline by:
- Defining the seed references, where a seed reference is a reference sequence against which the preliminary mapping of reads is performed
- Grouping seed references into seed groups, such that MiCall selects a single reference from each seed group to carry forward from the preliminary mapping stage to the iterative remapping stage
- Defining the coordinate references that are used to extract and interpret nucleotide and amino acid frequencies
- Defining how each seed reference is partitioned into regions according to the coordinate reference system; typically, these regions are genes that we want to pull out of a whole genome sequence.
The JSON file has the following structure:
- projects
  - max_variants: integer, number of sequence variants for aln2counts:write_nuc_variants to output
  - regions
    - coordinate_region: string reference to a coordinate reference in regions
    - seed_region_names: string reference to a seed reference in regions
- regions
  - is_nucleotide: boolean
  - reference: list of strings comprising a nucleotide or amino acid sequence
  - seed_group: string or null
Separating the definition of individual regions from the regions field within each entry in projects facilitates a many-to-one mapping, such that a defined region may be used in more than one project.
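As a rough illustration of this layout, the sketch below builds a small configuration as a Python dictionary and checks that every region named by a project is defined in the top-level regions. Everything specific in it is an assumption made for the example, not taken from the projects.json that ships with MiCall: the project and region names, the sequences, the idea that projects and regions are keyed by name, and the choice to store seed_region_names as a list. The helper functions find_undefined_regions and projects_using are hypothetical as well.

```python
import json

# Hypothetical configuration: all names and sequences are invented.
config = {
    "projects": {
        "ProjectA": {
            "max_variants": 5,
            "regions": [
                {"coordinate_region": "GeneX",
                 "seed_region_names": ["SeedOne"]},
            ],
        },
        "ProjectB": {
            "max_variants": 0,
            "regions": [
                {"coordinate_region": "GeneX",
                 "seed_region_names": ["SeedOne", "SeedTwo"]},
            ],
        },
    },
    "regions": {
        "GeneX": {"is_nucleotide": False,
                  "reference": ["MKVLA"],
                  "seed_group": None},
        "SeedOne": {"is_nucleotide": True,
                    "reference": ["ATGAAAGTGCTGGCG"],
                    "seed_group": "GroupS"},
        "SeedTwo": {"is_nucleotide": True,
                    "reference": ["ATGAAGGTGTTGGCC"],
                    "seed_group": "GroupS"},
    },
}


def find_undefined_regions(config):
    """Return sorted region names used by projects but not defined in regions."""
    defined = set(config["regions"])
    used = set()
    for project in config["projects"].values():
        for entry in project["regions"]:
            used.add(entry["coordinate_region"])
            used.update(entry["seed_region_names"])
    return sorted(used - defined)


def projects_using(config, region_name):
    """Return sorted names of projects that refer to region_name."""
    names = []
    for project_name, project in config["projects"].items():
        for entry in project["regions"]:
            if (region_name == entry["coordinate_region"]
                    or region_name in entry["seed_region_names"]):
                names.append(project_name)
                break
    return sorted(names)


if __name__ == "__main__":
    print(json.dumps(config, indent=2))
    print("Undefined regions:", find_undefined_regions(config))
    print("Projects using GeneX:", projects_using(config, "GeneX"))
```

Here the region GeneX is defined once and used by both projects, which is the reuse that the separate regions section is meant to allow.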
Both seed and coordinate reference sequences are defined by region entries in the JSON file. Typically, a seed reference is a nucleotide sequence (is_nucleotide=true) and may be assigned to a seed group. In contrast, a coordinate reference is typically an amino acid sequence (is_nucleotide=false) and has no seed group assignment (null).
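The distinction between the two kinds of region can be sketched as a small sorting function. This is only a heuristic based on the typical case described above, since nothing here confirms that MiCall itself classifies regions this way. The function name summarize_regions and the sample region names are hypothetical.

```python
from collections import defaultdict


def summarize_regions(regions):
    """Sort region names into seed groups, coordinate-like regions, and others.

    Heuristic only: a region with a seed group is treated as a seed reference,
    and an amino acid region without a seed group as a coordinate reference.
    """
    seed_groups = defaultdict(list)
    coordinate_like = []
    other = []
    for name, region in sorted(regions.items()):
        if region["seed_group"] is not None:
            seed_groups[region["seed_group"]].append(name)
        elif not region["is_nucleotide"]:
            coordinate_like.append(name)
        else:
            # Nucleotide sequence with no seed group: leave unclassified.
            other.append(name)
    return dict(seed_groups), coordinate_like, other


# Hypothetical regions section; names and sequences are invented.
regions = {
    "GeneX": {"is_nucleotide": False, "reference": ["MKVLA"],
              "seed_group": None},
    "SeedOne": {"is_nucleotide": True, "reference": ["ATGAAAGTGCTGGCG"],
                "seed_group": "GroupS"},
    "SeedTwo": {"is_nucleotide": True, "reference": ["ATGAAGGTGTTGGCC"],
                "seed_group": "GroupS"},
    "Unassigned": {"is_nucleotide": True, "reference": ["ATGC"],
                   "seed_group": None},
}

if __name__ == "__main__":
    seed_groups, coordinate_like, other = summarize_regions(regions)
    print("Seed groups:", seed_groups)
    print("Coordinate-like regions:", coordinate_like)
    print("Unclassified regions:", other)
```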
The nucleotide or amino acid sequence is specified as one or more double-quoted strings. The convention in MiCall is to break the sequence into comma-separated substrings of at most 65 characters each:
"ATAGGACAAGGAATTTGTAGAGCTATTTTAAACATACCTAGAAGAATCAGACAGGGCCTCGAAAG",
"AGCTTTGCTATAA"
These strings are contained in a JSON array, which is delimited by square brackets [...].
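When adding a new reference, the sequence can be split by a short script instead of by hand. The following is a minimal sketch of the 65-character convention. The helper names chunk_sequence and join_reference are hypothetical and are not part of MiCall.

```python
import json


def chunk_sequence(sequence, width=65):
    """Split a sequence into consecutive substrings of at most width characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def join_reference(reference):
    """Rebuild the full sequence from a list of substrings."""
    return "".join(reference)


if __name__ == "__main__":
    # The two substrings shown above, joined back into a single sequence.
    sequence = join_reference([
        "ATAGGACAAGGAATTTGTAGAGCTATTTTAAACATACCTAGAAGAATCAGACAGGGCCTCGAAAG",
        "AGCTTTGCTATAA",
    ])
    # json.dumps writes the list as a JSON array of double-quoted strings.
    print(json.dumps(chunk_sequence(sequence), indent=2))
```

Joining the substrings in order recovers the original sequence, so the line breaks affect only the readability of the file.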
A project is defined by a set of seed references and a map of these references to regions within coordinate references (coordinate regions). A project may comprise more than one coordinate reference. For example,