Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 4 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,12 +17,12 @@ under construction.

# Requirements

Install the Python requirements from requirements.txt:
Install the Python requirements from `requirements.txt`:
```
$ python -m pip install -r requirements.txt
```

You will also need a working R installation with reticulate and irkernel installed.
You will also need a working R installation with `reticulate` and `irkernel` installed.
This command should do the trick:
```
$ R -e 'install.packages(c("reticulate", "IRkernel")); IRkernel::installspec()'
Expand All @@ -35,8 +35,8 @@ $ R -e 'install.packages(c("reticulate", "IRkernel")); IRkernel::installspec()'
[jupytext](https://github.com/mwouts/jupytext) to convert the notebook into
the right format.
- To build locally, run ``make``. The output tells you where to find the
built HTML.
- Pages rendered at https://tskit.dev/tutorials
built `HTML` file(s).
- Pages rendered at https://tskit.dev/tutorials.
- Pages might take a while to be updated after a new tutorial is merged.

If you have an idea for a tutorial, please open an issue to discuss.
60 changes: 40 additions & 20 deletions tskitr.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,29 +19,29 @@ kernelspec:

# Tskit and R

To interface with `tskit` in R, we can use the [reticulate](https://rstudio.github.io/reticulate/) R package, which lets you call Python functions within an R session. In this tutorial, we'll go through a couple of examples to show you how to get started. If you haven't done so already, you'll need to install `reticulate` in your R session via `install.packages("reticulate")`.
To interface with `tskit` in R, we can use the [reticulate](https://rstudio.github.io/reticulate) R package, which lets you call `tskit`'s Python API within an R session. In this tutorial, we'll go through a couple of examples to show you how to get started. You'll need to install `reticulate` in your R session via `install.packages("reticulate")` and at a minimum state the required Python packages via `reticulate::py_require(c("tskit", "msprime"))`. This setting will use `reticulate`'s isolated Python virtual environment, but you can also use an alternative Python environment as described at https://rstudio.github.io/reticulate.

We'll begin by simulating a small tree sequence using `msprime`.

```{code-cell}
msprime <- reticulate::import("msprime")

ts <- msprime$sim_ancestry(80, sequence_length=1e4, recombination_rate=1e-4, random_seed=42)
ts # See "Jupyter notebook tips", below for how to render this nicely
```

## Attributes and methods

`reticulate` allows us to access a Python object's attributes via
the `$` operator. For example, we can access (and assign to a variable) the number of
samples in the tree sequence:
the `$` operator. For example, we can access (and assign to a native R object)
the number of samples in the tree sequence:

```{code-cell}
n <- ts$num_samples
n
is(n) # Show the type of the object
```

The `$` operator can also be used to call methods, for example, the
The `$` operator can also be used to call methods, for example, the
{meth}`~TreeSequence.simplify` method associated with the tree sequence.
The method parameters are given as native R objects
(but note that object IDs still use tskit's 0-based indexing system).
Expand All @@ -58,33 +58,44 @@ paste(
### IDs and indexes

Note that if a bare digit is provided to one of these methods, it will be treated as a
floating point number. This is useful to know when calling `tskit` methods that
require integers (e.g. object IDs). For example, the following will not work:
floating point number (numeric in R). This is useful to know when calling `tskit` methods that
require integers (such as object IDs). For example, the following will not work:

```{code-cell}
:tags: [raises-exception, remove-output]
ts$node(0) # Will raise an error
ts$node(0) # Will raise a TypeError
is(0)
```

In this case, to force the `0` to be passed as an integer, you can either coerce it
using `as.integer` or simply prepend the letter `L`:
using `as.integer` or simply append the letter `L`:

```{code-cell}
is(as.integer(0))
ts$node(as.integer(0))
# or
is(0L)
ts$node(0L)
```

Coercing in this way is only necessary when passing parameters to those underlying
`tskit` methods that expect integers. It is not needed e.g. to index into numeric arrays.
`tskit` methods that expect integers. It is not needed to index into numeric arrays.
_However_, when using arrays, very careful attention must be paid to the fact that
`tskit` IDs start at zero, whereas R indexes start at one:

```{code-cell}
root_id <- ts$first()$root
paste("Root time via tskit method:", ts$node(root_id)$time)
# When indexing into tskit arrays in R, add 1 to the ID
paste("Root time via array access:", ts$nodes_time[root_id + 1])
root_id_int <- ts$first()$root
is(root_id_int)
root_id_num <- as.numeric(root_id_int)

# Using integer ID will work
paste("Root time via tskit method:", ts$node(root_id_int)$time)
# Using numeric ID will not work
# paste("Root time via tskit method:", ts$node(root_id_num)$time)

# When indexing into tskit arrays in R, add 1 to the ID (integer or numeric)
paste("Root time via array access:", ts$nodes_time[root_id_int + 1])
paste("Root time via array access:", ts$nodes_time[root_id_num + 1])
```

## Analysis
Expand All @@ -98,7 +109,6 @@ for each of the tree sequence's sample nodes:

```{code-cell}
ts_mut = msprime$sim_mutations(reduced_ts, rate=1e-4, random_seed=321)

paste(ts_mut$num_mutations, "mutations, genetic diversity is", ts_mut$diversity())
```

Expand All @@ -109,6 +119,7 @@ the tree sequence as a matrix object in R.
```{code-cell}
G = ts_mut$genotype_matrix()
G
is(G)
```

We can then use R functions directly on the genotype matrix:
Expand Down Expand Up @@ -142,10 +153,9 @@ It also allows trees and tree sequences to be plotted inline:
ts_mut$draw_svg(y_axis=TRUE, y_ticks=0:10)
```


## Interaction with R libraries

R has a number of libraries to deal with genomic data and trees. Below we focus on the
R has a number of libraries to deal with genomic data and trees. Below we demo the
phylogenetic tree representation defined in the the popular
[ape](http://ape-package.ird.fr) package, taking all the trees
{meth}`exported in Nexus format<TreeSequence.write_nexus>`, or
Expand All @@ -165,14 +175,24 @@ plot(tree, direction="downward", srt=90, adj=0.5) # or equivalently use trees[[
```

Note that nodes are labelled with the prefix `n`, so that nodes `0`, `1`, `2`, ...
become `n0`, `n1`, `n2` ... etc. This helps to avoid
become `n0`, `n1`, `n2`, ... This helps to avoid
confusion between the the zero-based counting system used natively
by `tskit`, and the one-based counting system used in `R`.

## Further information

Be sure to check out the [reticulate](https://rstudio.github.io/reticulate/)
Be sure to check out the [reticulate](https://rstudio.github.io/reticulate)
documentation, in particular on
[Calling Python from R](https://rstudio.github.io/reticulate/articles/calling_python.html),
which includes important information on how R data types are converted to their
equivalent Python types.
equivalent Python types.

## Advanced usage through the tskit's C API

While the approach with `reticulate` is very powerful and will be useful for
most analyses, it does have an overhead due to calling Python and conversion of objects.
This overhead will be minimal for some use cases but could be significant for others.
An alternative approach is to use the `tskit` C API from R,
via the standard [R Extensions](https://cran.r-project.org/doc/manuals/R-exts.html) or
via [Rcpp](https://www.rcpp.org),
but these are advanced topics beyond the scope of this tutorial.
Loading