
Conversation

@bentsherman

This PR is my attempt to try out the proposed syntax for static types and record types on a real nf-core pipeline:

  • Fix strict syntax errors and warnings
  • Replace ch_versions channels with topic channel
  • Add params block with typed parameters
  • Add type annotations to workflow takes
  • Convert processes to static types
  • Replace tuples with records

TODO:

  • Fix remaining strict syntax errors
  • Replace publishDir with workflow outputs
  • Run with test profile

@bentsherman bentsherman requested a review from a team as a code owner January 13, 2026 03:37
@bentsherman bentsherman requested a review from ewels January 13, 2026 03:37
@nf-core-bot (Member)

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.3.2.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.


process BEDTOOLS_INTERSECT {
-    tag "$meta.id"
+    tag id
@bentsherman (Author)

Previously this had to be specified with a closure, e.g. tag { id }, or a dynamic string, e.g. tag "$id", but with the strict syntax it can be specified without either.
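
For comparison, the three forms side by side:

// previously: a closure or a dynamic string
tag { id }
tag "$id"

// with the strict syntax
tag id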

Comment on lines +13 to +14
(id: String, intervals1: Path, intervals2: Path) : Record
chrom_sizes: Path?
@bentsherman (Author)

Here we refactor the first input as a record, specifying the expected fields and types.

We replace the meta-map with the specific fields that are used by the process. In this case it's only id, but other common fields include single_end and strandedness. The incoming record can have other meta-fields and even files; they will simply be ignored by the process.
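
For instance (a hypothetical sketch, using the record constructor from this PR), an upstream channel could carry extra fields without breaking the process:

ch_inputs = channel.of(
    record(id: 'sample1', intervals1: file('a.bed'), intervals2: file('b.bed'), strandedness: 'reverse')
)
// strandedness is not declared by the process, so it is simply ignored
BEDTOOLS_INTERSECT( ch_inputs, null )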

@bentsherman (Author)

One implication is that you can't do something like this in the config:

process {
    withName:'BEDTOOLS_INTERSECT' {
        ext.prefix = { "${meta.my_custom_field}" }
    }
}

Because meta no longer exists, and even if my_custom_field happens to be in the incoming record, there is no way to access it from here. You can only use explicitly declared inputs like id.

So that could be an argument for continuing to have a meta map in the record. On the other hand, I think this pattern is difficult to understand and error-prone, so I'm not inclined to bend over backwards to support it.

@awgymer (Jan 15, 2026)

This was a bit of a rollercoaster reading these two comments.

Beyond making it slightly easier to use records throughout a pipeline (by not having to constantly drop extra values), the main driver for allowing extra values is so they can be used by closures to set custom values dynamically at run time.

It's only personal anecdotal evidence, but I make very heavy use of closures like the one above to set custom prefixes, args, etc.

Losing this ability, or having to still pass around a meta object (which in this case I would probably rename to ext or extra_args), would really dent the benefits of the record types, I think.

I assume the reason the extra values are not available is that the record is basically unwrapped by that point, so only the bare value names are present? Would it not be possible to expose the record object and then use e.g. record.id, which would presumably allow the undeclared extra fields to be accessed as well?

I appreciate that this may not actually be how records really work and so may not be possible in such a manner.

(reviewer)

What would be an alternative way to handle metadata in this case without having to specify each possible option?

(reviewer)

I agree with @awgymer that this is concerning. We did already have a discussion about this on Slack (https://nextflow.slack.com/archives/C09T3N7KNAY/p1765798958982349) but I think this is definitely worth picking up on again.

I also make heavy use of parameterising modules using a meta object. More specifically, in one pipeline I am developing, various command-line flags are sample-specific and supplied as part of the sample sheet, and these need to be passed through the pipeline in the various sample records until they are needed. So I would be deeply concerned if this feature went away without a clear and easy way to replace it.

Part of the problem I see with records is that the output record only includes some subset of fields, plus whatever new files were generated. So if you want to write a chain of processes, you might have to repeatedly join back to the initial input record in order to "re-capture" some essential piece of sample metadata that's an important downstream input.
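
For example (a sketch with hypothetical process and channel names, using the record-aware join shown later in this thread):

ch_annotated = SOME_PROCESS( ch_samples )
    .join( ch_samples, by: 'id' )   // re-capture metadata the process output dropped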

I think this is definitely a bigger problem for "generic", re-usable modules (all nf-core modules), compared to modules that are written for and exist only within a specific pipeline context. nf-core modules simply cannot account for all the possible use-cases or inputs, and have to be written to be as flexible as possible.

This is partly why I'm so in favour of the idea of wrapping process calls inside the map operator: this would allow you to manipulate the input channels, generate the things we normally include in a config closure as variables, and run the process. Then, in the same closure, you can co-mingle the inputs and the outputs to return a record with a structure of your own choosing. This gives a great degree of flexibility with minimal need for additional join operations, which can be tricky to set up and may require modifications to records on both sides to ensure the joins line up.


From an nf-core perspective, borrowing from the syntax in this PR, and assuming config closures are not supported, I think we will need an approach something like the following, where all processes take an (undeclared, unstructured) meta map, as well as args and a prefix. The output needs to pass that meta object through untouched.

input:
(
    id: String,
    intervals1: Path,
    intervals2: Path,
    meta: Map,
    args: String,
    prefix: String
) : Record
chrom_sizes: Path?

output: 
record(id: id, meta: meta, intersect: file("*.${extension}"))

Then you have to add these fields to the record before calling the process, either in a closure that calls the process, or in a map closure before the call. Here r.meta.args is some arbitrary sample-specific bit of metadata that we want to keep attached to the record.

ch_intersect_inputs = ch_inputs
    .map { r -> record(
        id: r.id,
        intervals1: r.bedgraph,
        meta: r.meta,               // keep the meta map attached so the process can pass it through
        args: r.meta.args ?: "",
        prefix: r.id) }
    .cross( target_regions.map { tr -> record(intervals2: tr) } )

ch_bedgraph_filtered = BEDTOOLS_INTERSECT( ch_intersect_inputs, null )

(reviewer)

I am personally fine with options 2 and 3, with maybe a small preference for option 2 since it's more declarative about the actual input to the module.

(reviewer)

With option 2, can a Metadata record contain fields beyond those that are specified in the declaration?

@awgymer

Oh, option 3 is basically what I wanted, right? I think I maybe misunderstood what was implemented in this example: if I now understand correctly, it's some sort of anonymous record type, and simply using a defined record type solves it?

@bentsherman (Author)

@awgymer yes, I think you understood the PR example, but I didn't make it clear in my first comment that there were alternatives.

You only lose the meta-map config closure pattern if you destructure the record and remove the meta-map. But if you keep the meta-map, or use an external record type instead of destructuring, then you still have a variable like meta or sample that you can use in the config to access arbitrary metadata.
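
Concretely, a sketch of the keep-the-meta-map variant of the earlier BEDTOOLS_INTERSECT input:

input:
(id: String, intervals1: Path, intervals2: Path, meta: Map) : Record
chrom_sizes: Path?

With meta declared on the record, the ext.prefix = { "${meta.my_custom_field}" } closure from the config example above keeps working.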

Comment on lines 12 to +13
input:
-    tuple val(meta), path(reads)
-    tuple val(meta2), path(fasta, stageAs: 'tmp/*') // This change mounts as directory containing the FASTA file to prevent nested symlinks
-    tuple val(meta3), path(index)
+    (id: String, single_end: Boolean, reads: Path, fasta: Path, bismark_index: Path): Record
@bentsherman (Author)

This process originally had three separate inputs for the reads, fasta, and index. This is fine as long as you only ever have a single fasta/index, but if you wanted to do combinations of many reads and many fasta/index files, you would have to "cheat" with combine and multiMap.

The "correct" way to do that is to declare everything in a single input. This is quite tedious with a tuple, somewhat less tedious with a record.

The order doesn't matter, but the names do, so you have to make sure that the input channel uses the same field names.

I suspect most modules will converge on this convention of "one big record input" for maximum flexibility. More on this in later comments.
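
For example, a single fat input channel could be wired up like this (a sketch with hypothetical channel names, assuming cross merges records as in the BISMARK_ALIGN example later in this thread):

ch_align_inputs = ch_reads
    .cross( ch_fasta.map { fa -> record(fasta: fa) } )
    .cross( ch_index.map { idx -> record(bismark_index: idx) } )

ch_alignments = BISMARK_ALIGN( ch_align_inputs )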

Comment on lines 18 to 28
output:
-    tuple val(meta), path("*bam")       , emit: bam
-    tuple val(meta), path("*report.txt"), emit: report
-    tuple val(meta), path("*fq.gz")     , emit: unmapped, optional: true
-    tuple val("${task.process}"), val("bismark"), eval('bismark --version | grep Version | sed -e "s/Bismark Version: v//" | xargs'), topic: versions, emit: versions_bismark
+    record(
+        id: id,
+        single_end: single_end,
+        bam: file("*bam"),
+        align_report: file("*report.txt"),
+        unmapped: file("*fq.gz", optional: true)
+    )
+
+    topic:
+    file("versions.yml") >> 'versions'
@bentsherman (Author)

Here we turn the versions output into a topic output and combine the tuple outputs into a single record. I call them "skinny tuples" and "fat records".

The language server should be able to convert most of this for you. Just change emit: versions to topic: versions and it should handle the rest.

We could probably enforce a rule of "only one output allowed" because you might as well put everything into a single fat record. Downstream processes can pick out what they need.

So maybe the convention will be "one big record input" and "one big record output". It has a nice symmetry to it...
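
A downstream step would then pick out just the fields it needs, e.g. (sketch):

ch_reports = ch_alignments.map { r -> record(id: r.id, report: r.align_report) }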

(reviewer)

We decided to have a different structure for the versions topic output; you can see an example in fastqc. It basically contains the raw values of the versions YAML file instead of emitting the file itself, which makes it easier to parse down the line. AFAICT this should be possible by having a "versions" record type in nf-core, right?

@bentsherman (Author)

Yes, you can structure that versions output however you want, be it a file or tuple or record. I kept it as a file here just to keep the PR focused on record types.

Comment on lines +16 to +17
replace_names : Path?
sample_names : Path?
@bentsherman (Author)

Several processes in this pipeline have optional file inputs, but they couldn't be specified as optional in the old syntax, and you had to supply an empty list [] because null didn't work.

With static types, you can now specify optional file inputs as Path?, and you can use null when calling the process.
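
Side by side, with a hypothetical MY_PROCESS:

// old syntax: fake the missing optional file with an empty list
MY_PROCESS( ch_input, [] )

// strict syntax: the input is declared as Path?, so null just works
MY_PROCESS( ch_input, null )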

Comment on lines +8 to +10
container "${workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container
? 'https://depot.galaxyproject.org/singularity/bedtools:2.31.1--hf5e1c6e_0'
: 'biocontainers/bedtools:2.31.1--hf5e1c6e_0'}"
@bentsherman (Author)

Pardon the formatting changes; they're a side effect of running the auto-converter. The only notable changes are the inputs and outputs.

-params.bwamem_index = getGenomeAttribute('bwa')
-params.bismark_index = params.aligner == 'bismark_hisat' ? getGenomeAttribute('bismark_hisat2') : getGenomeAttribute('bismark')
+params {
@bentsherman (Author)

This params block was generated by the "Convert script to static types" language server command. It uses the nextflow_schema.json and any legacy parameter assignments to infer the names, types, default values, and even the description comments.

Likely some of these params should be moved to the config file, if they are only used in the config, e.g. igenomes_ignore and igenomes_base.
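
For illustration, one generated entry might look like this (a sketch; the default is taken from the legacy assignment shown above, exact syntax per the proposal):

params {
    // Path to the bwa index
    bwamem_index: Path? = getGenomeAttribute('bwa')
}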

(reviewer)

What is the resolution order for params now? Currently the only reason we put params in the workflow itself is to set generated defaults (such as here for fasta).

@bentsherman (Author)

The resolution order should be the same

But I think the convention should be that all params used by the script should be declared in the script, and not in the config (except for e.g. profile overrides)

That way Nextflow knows up-front which params you intend to use in your script, and it can validate your workflow code accordingly

Comment on lines +29 to +31
ch_fasta = isGzipped(fasta)
? GUNZIP( fasta )
: channel.value(fasta)
@bentsherman (Author)

It looks like this subworkflow was designed to process multiple fasta files, bismark indices, etc, but it seems the pipeline always assumes one of each, so I applied that assumption here to simplify the logic.

You could restore support for multiple fasta files if you wanted to; just use filter and mix, as in the sketch below (with a hypothetical ch_fasta_in channel).
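
ch_gzipped = ch_fasta_in.filter { fa -> isGzipped(fa) }
ch_plain   = ch_fasta_in.filter { fa -> !isGzipped(fa) }
// decompress only the gzipped ones, then merge back into a single channel
ch_fasta   = GUNZIP( ch_gzipped ).mix( ch_plain )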

-ch_bismark_index // channel: [ path(bismark index) ]
-ch_bwameth_index // channel: [ path(bwameth index) ]
-ch_bwamem_index // channel: [ path(bwamem_index) ]
+ch_samples: Channel<Sample>
@bentsherman (Author)

This is where I think record types will mostly be used: at the boundaries of subworkflows. Basically, you figure out what fields are required by the processes that use this channel, throw them into a record type (in this case Sample), and use the record type here.

This makes it clear that whatever ch_samples contains, it needs to at least contain the fields in Sample.
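
For instance (hypothetical; the exact record-type declaration syntax is part of the proposal), Sample might look like:

record Sample {
    id: String
    single_end: Boolean
    reads: Path
}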

ch_fasta_index: Value<Path>
ch_bismark_index: Value<Path>
ch_bwameth_index: Value<Path>
params: MethylseqParams
@bentsherman (Author)

You may have heard that using params outside the entry workflow is discouraged... well, here's a trick you can use to get around it quickly.

Grab all the references to params.<whatever> in the subworkflow, throw them into a record type (see MethylseqParams at the bottom of this file), and declare a params input with the record type. When you call the subworkflow, pass in the main params, and as long as your params block is compatible with your record type, it should work. Easy and type-safe!
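
At the call site it might look like this (a sketch with a hypothetical subworkflow name):

MY_SUBWORKFLOW(
    ch_fasta_index,
    ch_bismark_index,
    ch_bwameth_index,
    params   // structurally compatible with MethylseqParams
)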

(reviewer)

Does declaring an input named params not risk confusion with the core Nextflow params object? Is the params object no longer exposed at all in the workflow, or does this mask it?

@bentsherman (Author)

It essentially masks the global params just in this subworkflow

I don't plan to disallow the global params anytime soon, since many people still use it and it doesn't really hurt anything. I will probably just promote it to a warning once record types are supported, because then there is a reasonable workaround

Comment on lines +24 to +27
ch_bismark_inputs = ch_reads
.cross( fasta.map { fa -> fa ? record(fasta: fa) : null } )
.cross( bismark_index.map { bi -> bi ? record(bismark_index: bi) : null } )
.filter { r -> r.fasta && r.bismark_index }
@bentsherman (Author)

This logic was moved from the METHYLSEQ workflow. It's somewhat better thanks to the records (no more multiMap), but still pretty ugly.

There might be a good use case for named arguments here. Something like this:

ch_alignments = BISMARK_ALIGN( ch_reads, fasta: fasta, bismark_index: bismark_index )

The named arguments would be added to each record in ch_reads, which is fine because they are single values.

(reviewer)

What happens in the named-arguments case if the fasta was null and the process doesn't take it optionally? Would it fail, or just skip running, which is what happens in the current situation AFAICT?

@bentsherman (Author)

Good point. It would fail. We'll have to think about how one would recover the filter logic.

You could just set fasta and bismark_index to null if they are missing, instead of channel.value(null). Then you could use an if statement:

if (fasta && bismark_index) {
  ch_alignments = BISMARK_ALIGN( ch_reads, fasta: fasta, bismark_index: bismark_index )
}

So their type would be Value<Path>? instead of Value<Path?>, i.e. the channel itself may be absent, rather than being a channel that emits a null path.

Comment on lines +81 to +88
/*
* Collect per-sample results
*/
ch_results = ch_bam
.join(ch_bai, by: 'id')
.join(ch_coverage2cytosine, by: 'id')
.join(ch_methylation, by: 'id')
.join(ch_bismark_report, by: 'id')
@bentsherman (Author)

In the spirit of "fat records" and keeping related data together, a common subworkflow convention might be to join related channels as shown here. This way, the subworkflow can provide one big record output, and downstream processes can pick out what they need.

The main risk is name conflicts, so you might want to use field name prefixes or nested records to keep everything organized.
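
For example, prefixing fields as they are joined in (sketch):

ch_results = ch_bam
    .join( ch_bismark_report.map { r -> record(id: r.id, bismark_report: r.report) }, by: 'id' )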

(reviewer)

Would the first or the last record take precedence in overwriting names? Or will it just error if the field is already set?

@bentsherman (Author)

Likely the last record would take precedence, similar to map addition. But it might be a good idea to give a type-checker warning or runtime warning if there is a name conflict

(reviewer)

I agree!
