This repo provides a configurable way to collate data from multiple sources into a single denormalized dataframe and create tokenized timelines from the results.
You can use uv to create an environment for running this code (with Python >= 3.12) as follows:

    uv sync
    uv run cocoa --help

Cocoa does two things: collation and tokenization.
The collator reads raw data tables (parquet or CSV) and combines them into a
single denormalized dataframe in a
MEDS-like format. Each row
in the output represents an event with a subject_id, time, code, and
optional numeric_value / text_value columns.
Collation is driven by a YAML config that specifies:
- A reference table with a primary key (`subject_id`), start/end times, and optional augmentation joins (e.g. joining a patient demographics table).
- A list of entries, each mapping a source table (or the reference frame itself via `table: REFERENCE`) to the output schema. Each entry declares which columns provide the `code` and `time`, and optionally the `numeric_value` and `text_value`; it can also set a `prefix`, a `filter_expr`, and a `with_col_expr`.
- Subject splits (`train_frac` / `tuning_frac`) that partition subjects chronologically into train, tuning, and held-out sets.
The tokenizer consumes the collated parquet output and converts events into integer token sequences suitable for sequence models. It:
- Adds `BOS`/`EOS` sentinel tokens to each subject's timeline.
- Computes quantile-based bins for numeric values (from training data only).
- Maps codes (and optionally their binned values) to integer tokens via a vocabulary that grows during training and is frozen for tuning/held-out data.
- Aggregates per-subject token sequences in a configurable sort order.
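The binning and vocabulary steps above can be sketched with the standard library alone. This is a minimal illustration, not the `Tokenizer` implementation: the `Vocab` class, the `UNK` id of 0, and the `_Q<bin>` suffix are assumptions made for the example (only `BOS` = 1 and `EOS` = 2 come from the example later in this document).

```python
from bisect import bisect_right
from statistics import quantiles


def fit_bin_edges(train_values, n_bins):
    """Return the n_bins - 1 interior quantile cut points of the training values."""
    return quantiles(train_values, n=n_bins, method="inclusive")


def bin_index(value, edges):
    """Map a value to a bin index between 0 and len(edges), inclusive."""
    return bisect_right(edges, value)


class Vocab:
    """Hypothetical vocabulary: grows until frozen, then maps unknown words to UNK."""

    def __init__(self):
        # UNK = 0 is an assumption of this sketch; BOS/EOS ids follow the README example.
        self.lookup = {"UNK": 0, "BOS": 1, "EOS": 2}
        self.frozen = False

    def __call__(self, word):
        if word not in self.lookup:
            if self.frozen:
                return self.lookup["UNK"]
            self.lookup[word] = len(self.lookup)
        return self.lookup[word]


# Bin edges are fit on training values only.
edges = fit_bin_edges([0.0, 1.0, 2.0, 3.0, 4.0], n_bins=4)
vocab = Vocab()
train_token = vocab(f"LAB-RES//sodium_Q{bin_index(2.5, edges)}")

# Freeze before touching tuning / held-out data so no new ids are created.
vocab.frozen = True
unseen_token = vocab("LAB-RES//never_seen_in_training")
```

Fitting the edges on the training split and freezing the vocabulary afterwards keeps information from tuning and held-out subjects out of the token definitions.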
Tokenization is driven by its own YAML config that specifies:
- `n_bins`: number of quantile bins for numeric values.
- `fused`: whether to fuse the code, binned value, and text value into a single token (`true`) or keep them as separate tokens (`false`).
- `collated_inputs`: paths to the collated parquet files to tokenize.
- `ordering`: the priority order of code prefixes when sorting events within the same timestamp.
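To make the role of `ordering` concrete, here is a small sketch of a sort key that orders events by time and breaks ties by prefix priority. The `ADMN` prefix, the sample events, and the rule that unlisted prefixes sort last are assumptions for illustration, not behavior confirmed by the tokenizer.

```python
from datetime import datetime

# Hypothetical `ordering` value: prefixes listed from first to last within a timestamp.
ordering = ["ADMN", "LAB-RES", "DSCG"]
rank = {prefix: i for i, prefix in enumerate(ordering)}


def sort_key(event):
    """Sort by time first, then by the priority of the code's prefix."""
    time, code = event
    prefix = code.split("//", 1)[0]
    # Prefixes missing from `ordering` sort last (an assumption of this sketch).
    return (time, rank.get(prefix, len(ordering)))


events = [
    (datetime(2025, 1, 1, 8), "LAB-RES//sodium"),
    (datetime(2025, 1, 1, 8), "ADMN//ed"),
    (datetime(2025, 1, 1, 7), "DSCG//home"),
]
ordered = sorted(events, key=sort_key)
```

The earliest event comes first regardless of prefix; the two events at 08:00 are then ordered by their position in `ordering`.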
The tokenizer produces two main outputs:
- `tokens_times.parquet`: one row per subject with three columns:
  - `subject_id`
  - `tokens`: the integer token sequence for the subject's timeline.
  - `times`: a parallel list of timestamps, one per token, indicating when each event occurred.
- `tkzr.pkl.gz`: a gzipped pickle of the frozen `Tokenizer` object, including its vocabulary and bin definitions, for use at inference time.
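A gzipped pickle can be written and read back with the standard library, as sketched below. The dictionary is only a stand-in for the frozen `Tokenizer` object and its field names are hypothetical. Loading the real `tkzr.pkl.gz` would additionally require the `cocoa` package to be importable, and pickles should only be loaded from trusted sources.

```python
import gzip
import pickle
import tempfile
from pathlib import Path

# Stand-in for the frozen tokenizer state (hypothetical field names).
state = {
    "vocab": {"BOS": 1, "EOS": 2},
    "bin_edges": {"LAB-RES//sodium": [135.0, 140.0]},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tkzr.pkl.gz"
    # gzip.open returns a file-like object that pickle can write to directly.
    with gzip.open(path, "wb") as f:
        pickle.dump(state, f)
    with gzip.open(path, "rb") as f:
        restored = pickle.load(f)
```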
For example, a subject with three events might look like:
| subject_id | tokens | times |
|---|---|---|
| "100" | [1, 5, 8, 12, 2] | [2025-01-01, 2025-01-01, 2025-01-02, 2025-01-03, 2025-01-03] |
Here 1 is the BOS token, 2 is EOS, and the tokens in between correspond to
the subject's clinical events in chronological order (with ties broken by the
configured ordering). In fused mode each event is a single token; in unfused
mode an event with a numeric value becomes two tokens (code + quantile bin).
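The difference between fused and unfused mode can be sketched as follows. This is an illustration under assumptions, not the tokenizer's code: ids are handed out as the next free integer after the sentinels, text values are ignored, and `BOS`/`EOS` take the time of the first and last event, which matches the example table but is not confirmed as a rule.

```python
from datetime import date

BOS, EOS = 1, 2  # same ids as in the example table above


def build_timeline(events, vocab, fused):
    """Turn time-sorted (time, code, bin_label) events into parallel token/time lists.

    `bin_label` is None for events without a numeric value.
    """
    tokens, times = [BOS], [events[0][0]]
    for time, code, bin_label in events:
        if bin_label is None:
            words = [code]
        elif fused:
            words = [f"{code}_{bin_label}"]  # one token per event
        else:
            words = [code, bin_label]  # code token followed by bin token
        for word in words:
            # Hypothetical id assignment: next free integer after the sentinels.
            tokens.append(vocab.setdefault(word, len(vocab) + 3))
            times.append(time)
    tokens.append(EOS)
    times.append(events[-1][0])
    return tokens, times


events = [
    (date(2025, 1, 1), "ADMN//ed", None),
    (date(2025, 1, 2), "LAB-RES//sodium", "Q3"),
]
fused_tokens, fused_times = build_timeline(events, {}, fused=True)
unfused_tokens, unfused_times = build_timeline(events, {}, fused=False)
```

In the unfused result the lab event contributes two tokens that share one timestamp, so `tokens` and `times` stay the same length in both modes.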
All configuration lives under `config/`. The entrypoint is `config/main.yaml`,
which points to the collation and tokenization configs and sets shared paths:

    data_home: ~/path/to/raw/data
    processed_data_home: ~/path/to/output
    collation_config: ./config/collation/clif-21.yaml
    tokenization_config: ./config/tokenization/clif-21.yaml

To use a different dataset or schema, create new YAML files under
`config/collation/` and `config/tokenization/` and update the paths in
`config/main.yaml`.
Both the `Collator` and `Tokenizer` classes also accept `**kwargs` that are
merged on top of the YAML config via OmegaConf, so any config value can be
overridden programmatically:

    from cocoa.collator import Collator
    from cocoa.tokenizer import Tokenizer

    collator = Collator(data_home="~/other/data")
    tokenizer = Tokenizer(n_bins=20, fused=False)

A collation config has three top-level sections: identifiers, subject splits, and the reference + entries that define which events to extract.

    subject_id: hospitalization_id # the atomic unit of interest
    group_id: patient_id # multiple subjects can belong to a group
    subject_splits:
      train_frac: 0.7
      tuning_frac: 0.1
      # the remainder is held out

`subject_id` is the column that uniquely identifies each subject (e.g. a
hospitalization). `group_id` is an optional higher-level grouping column.
Subjects are sorted chronologically and split into train / tuning / held-out sets
according to the specified fractions.
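A chronological split of this kind might look like the sketch below. The function name, the integer start times, and the use of `round` to turn fractions into counts are assumptions; the collator may count subjects differently or account for `group_id`.

```python
def split_subjects(start_times, train_frac, tuning_frac):
    """Partition subject ids into train / tuning / held-out by start time.

    `start_times` maps subject id -> a sortable start time.
    """
    ordered = sorted(start_times, key=start_times.get)
    # The rounding rule is an assumption; the real collator may count differently.
    n_train = round(len(ordered) * train_frac)
    n_tuning = round(len(ordered) * tuning_frac)
    return {
        "train": ordered[:n_train],
        "tuning": ordered[n_train:n_train + n_tuning],
        "held_out": ordered[n_train + n_tuning:],
    }


# Ten hypothetical hospitalizations with shuffled start times (integers for brevity).
start_times = {f"h{i}": (i * 3) % 10 for i in range(10)}
splits = split_subjects(start_times, train_frac=0.7, tuning_frac=0.1)
```

Because subjects are ordered by time before slicing, every training subject starts before every tuning subject, and every tuning subject before every held-out subject.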
The reference table is the primary static table to which everything else is joined:

    reference:
      table: clif_hospitalization
      start_time: admission_dttm
      end_time: discharge_dttm
      augmentation_tables:
        - table: clif_patient
          key: patient_id
          validation: "m:1"
          with_col_expr: pl.lit("AGE").alias("AGE")

- `table`: the name of the parquet (or CSV) file in `data_home` (without the extension).
- `start_time` / `end_time`: columns that define the subject's time window; used to filter events from other tables when `reference_key` is set (see below).
- `augmentation_tables`: optional list of tables to join onto the reference frame. Each needs a `key` to join on and a `validation` mode (e.g. `"m:1"`). You can also add computed columns via `with_col_expr`.
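The `"m:1"` validation mode corresponds to a many-to-one join: many reference rows may share a key, but the key must be unique in the augmentation table. The sketch below shows that check in plain Python; the function name and sample columns (`birth_year`) are hypothetical, and the collator itself presumably delegates this to its dataframe library.

```python
def join_many_to_one(left, right, key):
    """Left-join `right` onto `left`, requiring `key` to be unique in `right`."""
    lookup = {}
    for row in right:
        if row[key] in lookup:
            raise ValueError(f"m:1 validation failed: duplicate {key}={row[key]!r}")
        lookup[row[key]] = row
    # Columns from `left` win if both sides share a column name.
    return [{**lookup.get(row[key], {}), **row} for row in left]


# Two hospitalizations of the same patient (many) joined to one patient row (one).
hospitalizations = [
    {"hospitalization_id": "100", "patient_id": "p1"},
    {"hospitalization_id": "101", "patient_id": "p1"},
]
patients = [{"patient_id": "p1", "birth_year": 1960}]
joined = join_many_to_one(hospitalizations, patients, "patient_id")
```

Failing loudly on a duplicated key avoids silently multiplying reference rows, which would otherwise duplicate every downstream event for the affected subjects.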
The entries list defines the events to extract. Every entry produces rows with
the columns subject_id, time, code, numeric_value, and text_value. The
entry's fields tell the collator which source columns map to these outputs.
Required fields:
| Field | Description |
|---|---|
| `table` | Source table name, or `REFERENCE` to pull from the reference frame. |
| `code` | Column whose values become the event code. |
| `time` | Column whose values become the event timestamp. |
Optional fields:
| Field | Description |
|---|---|
| `prefix` | String prepended to the code (separated by `//`), e.g. `LAB-RES`. |
| `numeric_value` | Column to use as the numeric value for the event. |
| `text_value` | Column to use as the text value for the event. |
| `filter_expr` | A Polars expression (or list of expressions) to filter rows before extraction. |
| `with_col_expr` | A Polars expression (or list) to add computed columns before extraction. |
| `reference_key` | Join the source table to the reference frame on this key and keep only rows within the subject's `start_time` to `end_time` window. |
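As a rough picture of how `prefix` and `reference_key` shape the output, the sketch below applies one entry to in-memory rows. It is an illustration only: the `extract` helper, the `start_dttm` column, the sample data, and the inclusive window bounds are assumptions, and the collator works on dataframes rather than lists of dicts.

```python
from datetime import datetime

# Reference frame: one row per subject with its time window.
reference = [
    {"hospitalization_id": "100", "patient_id": "p1",
     "admission_dttm": datetime(2025, 1, 1), "discharge_dttm": datetime(2025, 1, 3)},
]
# Source rows keyed by patient, not by hospitalization (hypothetical data).
source = [
    {"patient_id": "p1", "code_status_category": "full", "start_dttm": datetime(2025, 1, 2)},
    {"patient_id": "p1", "code_status_category": "dnr", "start_dttm": datetime(2025, 2, 1)},
]
entry = {"prefix": "CODE", "code": "code_status_category",
         "time": "start_dttm", "reference_key": "patient_id"}


def extract(entry, source, reference, subject_id, start_time, end_time):
    """Join source rows to the reference frame and keep those inside the window."""
    key = entry["reference_key"]
    rows = []
    for ref in reference:
        for src in source:
            if src[key] != ref[key]:
                continue
            t = src[entry["time"]]
            # Inclusive bounds are an assumption of this sketch.
            if not (ref[start_time] <= t <= ref[end_time]):
                continue
            rows.append({
                "subject_id": ref[subject_id],
                "time": t,
                "code": f'{entry["prefix"]}//{src[entry["code"]]}',
            })
    return rows


rows = extract(entry, source, reference,
               subject_id="hospitalization_id",
               start_time="admission_dttm", end_time="discharge_dttm")
```

Only the first source row falls inside the hospitalization window, so it is the only one that becomes an event for subject `"100"`.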
Examples:

A simple categorical event from the reference frame:

    - table: REFERENCE
      prefix: DSCG
      code: discharge_category
      time: discharge_dttm

A numeric event from an external table:

    - table: clif_labs
      prefix: LAB-RES
      code: lab_category
      numeric_value: lab_value_numeric
      time: lab_result_dttm

Filtering rows before extraction (single filter):

    - table: clif_position
      prefix: POSN
      filter_expr: pl.col("position_category") == "prone"
      code: position_category
      time: recorded_dttm

Multiple filters (applied as a list):

    - table: clif_medication_admin_intermittent_converted
      prefix: MED-INT
      filter_expr:
        - pl.col("mar_action_category") == "given"
        - pl.col("_convert_status") == "success"
      code: med_category
      numeric_value: med_dose_converted
      time: admin_dttm

Creating a computed column with `with_col_expr` to use as the code:

    - table: clif_respiratory_support_processed
      prefix: RESP
      with_col_expr: pl.lit("fio2_set").alias("code")
      filter_expr: pl.col("fio2_set").is_finite()
      code: code
      numeric_value: fio2_set
      time: recorded_dttm

Using `reference_key` to restrict events to a subject's time window:

    - table: clif_code_status
      prefix: CODE
      code: code_status_category
      time: admission_dttm
      reference_key: patient_id

Cocoa provides a CLI with the following commands:

    # collate raw data into a denormalized parquet file
    cocoa collate [-o OUTPUT_DIR]

    # tokenize collated data into integer sequences
    cocoa tokenize [-o OUTPUT_DIR]

    # run both steps in sequence
    cocoa pipeline [-o OUTPUT_DIR]

    # display current configuration
    cocoa info