
Conversation

@bentsherman

This PR is my attempt to try out the proposed syntax for static types and record types on a real nf-core pipeline:

  • Fix strict syntax errors and warnings
  • Replace ch_versions channels with topic channel
  • Add params block with typed parameters
  • Add type annotations to workflow takes
  • Convert processes to static types
  • Replace tuples with records

TODO:

  • Fix remaining strict syntax errors
  • Replace publishDir with workflow outputs
  • Run with test profile

@bentsherman bentsherman requested a review from a team as a code owner January 13, 2026 03:37
@bentsherman bentsherman requested a review from ewels January 13, 2026 03:37
@nf-core-bot (Member)

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.3.2.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.


process BEDTOOLS_INTERSECT {
-    tag "$meta.id"
+    tag id
@bentsherman (Author)

Previously this had to be specified with a closure, e.g. tag { id }, or a dynamic string, e.g. tag "$id", but with the strict syntax it can be specified without either.
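
For comparison, the three forms side by side:

// previously: a closure or a dynamic string
tag { id }
tag "$id"

// with the strict syntax
tag id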

Comment on lines +13 to +14
(id: String, intervals1: Path, intervals2: Path) : Record
chrom_sizes: Path?
@bentsherman (Author)

Here we refactor the first input as a record, specifying the expected fields and types.

We replace the meta-map with the specific fields that are used by the process. In this case it's only id, but other common fields include single_end and strandedness. The incoming record can have other meta-fields and even files; they will simply be ignored by the process.
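
For instance (a hypothetical sketch, using the record constructor from this PR), an upstream channel could carry extra fields without breaking the process:

ch_inputs = channel.of(
    record(id: 'sample1', intervals1: file('a.bed'), intervals2: file('b.bed'), strandedness: 'reverse')
)
// strandedness is not declared by the process, so it is simply ignored
BEDTOOLS_INTERSECT( ch_inputs, null )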

@bentsherman (Author)

One implication is that you can't do something like this in the config:

process {
    withName:'BEDTOOLS_INTERSECT' {
        ext.prefix = { "${meta.my_custom_field}" }
    }
}

Because meta no longer exists, and even if my_custom_field happens to be in the incoming record, there is no way to access it from here. You can only use explicitly declared inputs like id.

So that could be an argument for continuing to have a meta map in the record. On the other hand, I think this pattern is difficult to understand and error-prone, so I'm not inclined to bend over backwards to support it.

@awgymer (Jan 15, 2026)

This was a bit of a rollercoaster reading these two comments.

Beyond making it slightly easier to use records throughout a pipeline (by not having to constantly drop extra values), the main driver for allowing extra values is so they can be used by closures to set custom values dynamically at run time.

It's only personal anecdotal evidence, but I make very heavy use of closures like the one above to set custom prefixes, args, etc.

Losing this ability, or having to still pass around a meta object (which in this case I would probably rename to ext or extra_args), would really dent the benefits of the record types, I think.

I assume the reason the extra values are not available is that the record is basically unwrapped by that point, so only the bare value names are present? Would it not be possible to expose the record object and then use e.g. record.id, which would presumably allow the undeclared extra fields to be accessed as well?

I appreciate that this may not actually be how records really work and so may not be possible in such a manner.

(reviewer)

What would be an alternative way to handle metadata in this case without having to specify each possible option?

(reviewer)

I agree with @awgymer that this is concerning. We did already have a discussion about this on Slack (https://nextflow.slack.com/archives/C09T3N7KNAY/p1765798958982349) but I think this is definitely worth picking up on again.

I also make heavy use of parameterising modules using a meta object. More specifically, in one pipeline I am developing, various command-line flags are sample-specific and supplied as part of the sample sheet, and these need to be passed through the pipeline in the various sample records until they are needed. So I would be deeply concerned if this feature went away without a clear and easy way to replace it.

Part of the problem I see with records is that the output record only includes some subset of fields, plus whatever new files were generated. So if you want to write a chain of processes, you might have to repeatedly join back to the initial input record in order to "re-capture" some essential piece of sample metadata that's an important downstream input.
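
For example (a sketch with hypothetical process and channel names, using the record-aware join shown later in this thread):

ch_annotated = SOME_PROCESS( ch_samples )
    .join( ch_samples, by: 'id' )   // re-capture metadata the process output dropped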

I think this is definitely a bigger problem for "generic", re-usable modules (all nf-core modules), compared to modules that are written for and exist only within a specific pipeline context. nf-core modules simply cannot account for all the possible use-cases or inputs, and have to be written to be as flexible as possible.

This is partly why I'm so in favour of the idea of wrapping process calls inside the map operator: this would allow you to manipulate the input channels, generate the things we normally include in a config closure as variables, and run the process. Then, in the same closure, you can co-mingle the inputs and the outputs to return a record with a structure of your own choosing. This gives a great degree of flexibility with minimal need for additional join operations, which can be tricky to set up and may require modifications to records on both sides to ensure the joins line up.


From an nf-core perspective, borrowing from the syntax in this PR, and assuming config closures are not supported, I think we will need an approach something like the following, where all processes take an (undeclared, unstructured) meta map, as well as args and a prefix. The output needs to pass that meta object through untouched.

input:
(
    id: String,
    intervals1: Path,
    intervals2: Path,
    meta: Map,
    args: String,
    prefix: String
) : Record
chrom_sizes: Path?

output: 
record(id: id, meta: meta, intersect: file("*.${extension}"))

Then you have to add these fields to the record before calling the process, either in a closure that calls the process, or in a map closure before the call. Here r.meta.args is some arbitrary sample-specific bit of metadata that we want to keep attached to the record.

ch_intersect_inputs = ch_inputs
    .map { r -> record(
        id: r.id,
        intervals1: r.bedgraph,
        meta: r.meta,               // keep the meta map attached so the process can pass it through
        args: r.meta.args ?: "",
        prefix: r.id) }
    .cross( target_regions.map { tr -> record(intervals2: tr) } )

ch_bedgraph_filtered = BEDTOOLS_INTERSECT( ch_intersect_inputs, null )

(reviewer)

I am personally fine with options 2 and 3, with maybe a small preference for option 2 since it's more declarative about the actual input to the module.

(reviewer)

With option 2, can a Metadata record contain fields beyond those that are specified in the declaration?

@awgymer

Oh, option 3 is basically what I wanted, right? I think I maybe misunderstood what was implemented in this example: if I now understand correctly, it's some sort of anonymous record type, and simply using a defined record type solves it?

@bentsherman (Author)

@awgymer yes, I think you understood the PR example, but I didn't make it clear in my first comment that there were alternatives.

You only lose the meta-map config closure pattern if you destructure the record and remove the meta-map. But if you keep the meta-map, or use an external record type instead of destructuring, then you still have a variable like meta or sample that you can use in the config to access arbitrary metadata.
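
Concretely, a sketch of the keep-the-meta-map variant of the earlier BEDTOOLS_INTERSECT input:

input:
(id: String, intervals1: Path, intervals2: Path, meta: Map) : Record
chrom_sizes: Path?

With meta declared on the record, the ext.prefix = { "${meta.my_custom_field}" } closure from the config example above keeps working.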

Comment on lines 12 to +13
input:
-    tuple val(meta), path(reads)
-    tuple val(meta2), path(fasta, stageAs: 'tmp/*') // This change mounts as directory containing the FASTA file to prevent nested symlinks
-    tuple val(meta3), path(index)
+    (id: String, single_end: Boolean, reads: Path, fasta: Path, bismark_index: Path): Record
@bentsherman (Author)

This process originally had three separate inputs for the reads, fasta, and index. This is fine as long as you only ever have a single fasta/index, but if you wanted to do combinations of many reads and many fasta/index files, you would have to "cheat" with combine and multiMap.

The "correct" way to do that is to declare everything in a single input. This is quite tedious with a tuple, somewhat less tedious with a record.

The order doesn't matter, but the names do, so you have to make sure that the input channel uses the same field names.

I suspect most modules will converge on this convention of "one big record input" for maximum flexibility. More on this in later comments.
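
For example, a single fat input channel could be wired up like this (a sketch with hypothetical channel names, assuming cross merges records as in the BISMARK_ALIGN example later in this thread):

ch_align_inputs = ch_reads
    .cross( ch_fasta.map { fa -> record(fasta: fa) } )
    .cross( ch_index.map { idx -> record(bismark_index: idx) } )

ch_alignments = BISMARK_ALIGN( ch_align_inputs )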

Comment on lines 18 to 28
output:
-    tuple val(meta), path("*bam")       , emit: bam
-    tuple val(meta), path("*report.txt"), emit: report
-    tuple val(meta), path("*fq.gz")     , emit: unmapped, optional: true
-    tuple val("${task.process}"), val("bismark"), eval('bismark --version | grep Version | sed -e "s/Bismark Version: v//" | xargs'), topic: versions, emit: versions_bismark
+    record(
+        id: id,
+        single_end: single_end,
+        bam: file("*bam"),
+        align_report: file("*report.txt"),
+        unmapped: file("*fq.gz", optional: true)
+    )
+
+    topic:
+    file("versions.yml") >> 'versions'
@bentsherman (Author)

Here we turn the versions output into a topic output and combine the tuple outputs into a single record. I call them "skinny tuples" and "fat records".

The language server should be able to convert most of this for you. Just change emit: versions to topic: versions and it should handle the rest.

We could probably enforce a rule of "only one output allowed" because you might as well put everything into a single fat record. Downstream processes can pick out what they need.

So maybe the convention will be "one big record input" and "one big record output". It has a nice symmetry to it...
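
A downstream step would then pick out just the fields it needs, e.g. (sketch):

ch_reports = ch_alignments.map { r -> record(id: r.id, report: r.align_report) }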

(reviewer)

We decided to have a different structure for the versions topic output; you can see an example in fastqc. It basically contains the raw values of the versions YAML file instead of emitting the file itself, which makes it easier to parse down the line. AFAICT this should be possible by having a "versions" record type in nf-core, right?

@bentsherman (Author)

Yes, you can structure that versions output however you want, be it a file or tuple or record. I kept it as a file here just to keep the PR focused on record types.

Comment on lines +16 to +17
replace_names : Path?
sample_names : Path?
@bentsherman (Author)

Several processes in this pipeline have optional file inputs, but they couldn't be specified as optional in the old syntax, and you had to supply an empty list [] because null didn't work.

With static types, you can now specify optional file inputs as Path?, and you can use null when calling the process.
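
Side by side, with a hypothetical MY_PROCESS:

// old syntax: fake the missing optional file with an empty list
MY_PROCESS( ch_input, [] )

// strict syntax: the input is declared as Path?, so null just works
MY_PROCESS( ch_input, null )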

Comment on lines +8 to +10
container "${workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container
? 'https://depot.galaxyproject.org/singularity/bedtools:2.31.1--hf5e1c6e_0'
: 'biocontainers/bedtools:2.31.1--hf5e1c6e_0'}"
@bentsherman (Author)

Pardon the formatting changes; they're a side effect of running the auto-converter. The only notable changes are the inputs and outputs.

-params.bwamem_index = getGenomeAttribute('bwa')
-params.bismark_index = params.aligner == 'bismark_hisat' ? getGenomeAttribute('bismark_hisat2') : getGenomeAttribute('bismark')
+params {
@bentsherman (Author)

This params block was generated by the "Convert script to static types" language server command. It uses the nextflow_schema.json and any legacy parameter assignments to infer the names, types, default values, and even the description comments.

Likely some of these params should be moved to the config file, if they are only used in the config, e.g. igenomes_ignore and igenomes_base.
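
For illustration, one generated entry might look like this (a sketch; the default is taken from the legacy assignment shown above, exact syntax per the proposal):

params {
    // Path to the bwa index
    bwamem_index: Path? = getGenomeAttribute('bwa')
}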

(reviewer)

What is the resolution order for params now? Currently the only reason we put params in the workflow itself is to set generated defaults (such as here for fasta).

@bentsherman (Author)

The resolution order should be the same

But I think the convention should be that all params used by the script should be declared in the script, and not in the config (except for e.g. profile overrides)

That way Nextflow knows up-front which params you intend to use in your script, and it can validate your workflow code accordingly

Comment on lines +29 to +31
ch_fasta = isGzipped(fasta)
? GUNZIP( fasta )
: channel.value(fasta)
@bentsherman (Author)

It looks like this subworkflow was designed to process multiple fasta files, bismark indices, etc, but it seems the pipeline always assumes one of each, so I applied that assumption here to simplify the logic.

You could restore support for multiple fasta files if you wanted to; just use filter and mix, as in the sketch below (with a hypothetical ch_fasta_in channel).
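
ch_gzipped = ch_fasta_in.filter { fa -> isGzipped(fa) }
ch_plain   = ch_fasta_in.filter { fa -> !isGzipped(fa) }
// decompress only the gzipped ones, then merge back into a single channel
ch_fasta   = GUNZIP( ch_gzipped ).mix( ch_plain )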

-ch_bismark_index // channel: [ path(bismark index) ]
-ch_bwameth_index // channel: [ path(bwameth index) ]
-ch_bwamem_index // channel: [ path(bwamem_index) ]
+ch_samples: Channel<Sample>
@bentsherman (Author)

This is where I think record types will mostly be used: at the boundaries of subworkflows. Basically, you figure out what fields are required by the processes that use this channel, throw them into a record type (in this case Sample), and use the record type here.

This makes it clear that whatever ch_samples contains, it needs to at least contain the fields in Sample.
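
For instance (hypothetical; the exact record-type declaration syntax is part of the proposal), Sample might look like:

record Sample {
    id: String
    single_end: Boolean
    reads: Path
}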

ch_fasta_index: Value<Path>
ch_bismark_index: Value<Path>
ch_bwameth_index: Value<Path>
params: MethylseqParams
@bentsherman (Author)

You may have heard that using params outside the entry workflow is discouraged... well, here's a trick you can use to get around it quickly.

Grab all the references to params.<whatever> in the subworkflow, throw them into a record type (see MethylseqParams at the bottom of this file), and declare a params input with the record type. When you call the subworkflow, pass in the main params, and as long as your params block is compatible with your record type, it should work. Easy and type-safe!
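
At the call site it might look like this (a sketch with a hypothetical subworkflow name):

MY_SUBWORKFLOW(
    ch_fasta_index,
    ch_bismark_index,
    ch_bwameth_index,
    params   // structurally compatible with MethylseqParams
)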

(reviewer)

Does declaring an input named params not risk confusion with the core Nextflow params object? Is the params object no longer exposed at all in the workflow, or does this mask it?

@bentsherman (Author)

It essentially masks the global params just in this subworkflow

I don't plan to disallow the global params anytime soon, since many people still use it and it doesn't really hurt anything. I will probably just promote it to a warning once record types are supported, because then there is a reasonable workaround

Comment on lines +24 to +27
ch_bismark_inputs = ch_reads
.cross( fasta.map { fa -> fa ? record(fasta: fa) : null } )
.cross( bismark_index.map { bi -> bi ? record(bismark_index: bi) : null } )
.filter { r -> r.fasta && r.bismark_index }
@bentsherman (Author)

This logic was moved from the METHYLSEQ workflow. It's somewhat better thanks to the records (no more multiMap), but still pretty ugly.

There might be a good use case for named arguments here. Something like this:

ch_alignments = BISMARK_ALIGN( ch_reads, fasta: fasta, bismark_index: bismark_index )

The named arguments would be added to each record in ch_reads, which is fine because they are single values.

(reviewer)

What happens in the named-arguments case if the fasta was null and the process doesn't take it optionally? Would it fail, or just skip running, which is what happens in the current situation AFAICT?

@bentsherman (Author)

Good point. It would fail. We'll have to think about how one would recover the filter logic.

You could just set fasta and bismark_index to null if they are missing, instead of channel.value(null). Then you could use an if statement:

if (fasta && bismark_index) {
  ch_alignments = BISMARK_ALIGN( ch_reads, fasta: fasta, bismark_index: bismark_index )
}

So their type would be Value<Path>? instead of Value<Path?>, i.e. the channel itself may be absent, rather than being a channel that emits a null path.

Comment on lines +81 to +88
/*
* Collect per-sample results
*/
ch_results = ch_bam
.join(ch_bai, by: 'id')
.join(ch_coverage2cytosine, by: 'id')
.join(ch_methylation, by: 'id')
.join(ch_bismark_report, by: 'id')
@bentsherman (Author)

In the spirit of "fat records" and keeping related data together, a common subworkflow convention might be to join related channels as shown here. This way, the subworkflow can provide one big record output, and downstream processes can pick out what they need.

The main risk is name conflicts, so you might want to use field name prefixes or nested records to keep everything organized.
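
For example, prefixing fields as they are joined in (sketch):

ch_results = ch_bam
    .join( ch_bismark_report.map { r -> record(id: r.id, bismark_report: r.report) }, by: 'id' )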

(reviewer)

Would the first or the last record take precedence in overwriting names? Or will it just error if the field is already set?

@bentsherman (Author)

Likely the last record would take precedence, similar to map addition. But it might be a good idea to give a type-checker warning or runtime warning if there is a name conflict

(reviewer)

I agree!
