

Introduction to Nextflow on Unity

Nextflow is a workflow manager and domain-specific language (DSL) that lets you automate pipelines within Slurm (or other workload managers). Nextflow is ideal for automating data analysis pipelines and can simplify the management of software environments, resource allocation, error handling, and data organization. The Nextflow documentation is extensive, so this guide focuses on the steps required to translate a Slurm pipeline to Nextflow, with specific tips for using Nextflow on Unity.

Summary

If you are already familiar with Nextflow or workflow management, these are the main steps required to build and run your pipeline on Unity:

  • Define your pipeline in the Nextflow language (.nf) using processes to encapsulate your code and workflows to control the dataflow, as described in the Nextflow documentation.
  • Copy the Unity nextflow config into a file called nextflow.config to use Unity's default Slurm allocation limits. Override the actual memory, CPU, and other requirements per process with directives in your main.nf script.
  • Set the publishDir directive to tell Nextflow where to save the output files of each process.
  • Use conda or apptainer by setting conda.enabled = true or apptainer.enabled = true in your nextflow.config. Specify the environment per process with directives, using recipes (e.g. conda 'pandas numpy') or absolute paths (e.g. container '/home/user/containers/container.sif').
  • Launch your Nextflow runner with nextflow run main.nf inside a Slurm job. Allocate enough time for it to complete your whole pipeline, and make sure to load the nextflow module, plus the conda or apptainer module if you use those.
  • Set the working directory (where intermediate files and conda environments are stored) using nextflow run main.nf -work-dir /path/to/workdir or by setting workDir = '/path/to/workdir/' in your nextflow.config. Otherwise Nextflow writes these files under a directory called work/ in your current working directory.
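
Collected into one place, a minimal nextflow.config implementing the points above might look like the following sketch. The workDir path and resource values are illustrative placeholders, not Unity defaults:

```
// nextflow.config -- minimal sketch; paths and values are illustrative
workDir = '/path/to/workdir'        // where intermediate files are stored

conda.enabled = true                // allow per-process conda directives
// apptainer.enabled = true        // or enable apptainer instead

process {
    executor = 'slurm'              // submit each process as a Slurm job
    // Fallback allocations; override per process with directives in main.nf
    cpus = 1
    memory = '2 GB'
    time = '1h'
}
```

Directives set inside a process definition in main.nf take precedence over these config-level defaults.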

Building a Nextflow project on Unity

  1. Setting up your Nextflow directory
    • main.nf - your Nextflow file that defines your processes and workflow. Processes can also be written as separate files and imported into your workflow script.

    • nextflow.config - Nextflow automatically looks for a configuration file called nextflow.config in your project directory.

    • Additional scripts and libraries can be organized however you like. In the example below, they are stored in a lib/ directory.

      Note: Consider using git to version control your Nextflow directory.
      Note: It is useful to keep your code libraries within the Nextflow project directory so that you can refer to them with relative paths in your Nextflow script and version control them together with your Nextflow code. The Nextflow documentation gives additional tips on project organization: https://www.nextflow.io/docs/latest/sharing.html

      nextflow_project/
      ├── lib
      │   ├── calculate_statistics.py
      │   ├── combine_files.py
      │   └── generate_random_numbers.py
      ├── main.nf
      └── nextflow.config

  2. Writing your Nextflow (.nf) script
    1. Nextflow processes encapsulate individual analysis steps and contain the following sections:
      • Script: executes bash code, though this can be changed to another interpreter by adding a shebang line (e.g. #!/usr/bin/python)
      • Input: and Output: define the filenames/patterns consumed and produced by your script. Nextflow verifies that the appropriate input files are present before executing a process, and that the appropriate output files were generated after the process completes.
      • Directives are supplied separately from the input, output, and script blocks. They are especially useful for specifying Slurm properties (e.g. memory and time limits) and for specifying conda or apptainer environments specific to that process.
      • The process runs using the default executor, which can be defined in your configuration file. Adding the directive executor 'slurm' to a process makes it create a Slurm job for each instance of the process; executor 'local' launches the process using your local environment/resources.
      • Custom Slurm arguments can be passed via directives, e.g. clusterOptions '--nodes 1'. This allows you to allocate GPUs, apply node constraints, or enable MPI.
    2. The workflow block defines how data passes through your script
      • Nextflow uses a language based on Groovy, giving you access to its syntax to define logic, including common data structures and logic operations: https://www.nextflow.io/docs/latest/script.html
      • Parallelization is enabled by using channels to pass asynchronous values to processes. Common use cases are generating a channel of filepaths to run your pipeline on different inputs, or a channel of integers to replicate a Slurm array job. https://www.nextflow.io/docs/latest/channel.html#dataflow-page
      • Nextflow infers how to run your processes and how to store your data based on how the inputs and outputs are defined.
        Note: Use variables and pattern matching in your Nextflow script to avoid hard-coding filepaths.
  3. Configuring your project with nextflow.config
    1. Before setting up your configuration file, look at the Unity nf-core config file. You can copy this file directly into your nextflow.config, as it provides useful default settings for Unity and enables apptainer.
    2. Parameters specific to your pipeline are added in the params block, and these can be referred to within your Nextflow script as params.<parameter_name>
      Note: Command line arguments to `nextflow run` will also be read as parameters. E.g. adding `--output_path <path>` creates a variable params.output_path that can be referred to in your script.
    3. Conda and apptainer must be manually enabled within your nextflow.config before they can be used. For example, to enable conda, add conda.enabled = true. Conda environments are specified as a per-process directive using the environment path, e.g. conda '/path/to/conda_env/'. Similarly, container environments can be specified with a directive like container '/path/to/container.sif'
      Note: Additional arguments for apptainer can be specified in the config. For example, apptainer commonly requires binding the parts of the filesystem that contain your data so they can be accessed within the container. This can be specified in the configuration file as follows:
      ```
      apptainer {
          enabled = true
          runOptions = "-B ${projectDir}/lib"
      }
      ```
      
  4. The working directory and publish directory determine where data is written
    1. The working directory stores intermediate files, outputs, and logs for your pipeline. This is useful for debugging and is how Nextflow caches your results so you can resume the pipeline if it is interrupted.
      • By default, Nextflow creates a directory called work in your current working directory
      • You can change the working directory location using the -work-dir command line argument of nextflow run, or by setting workDir = '/path/to/workdir' in your nextflow.config file.
        Warning: If your job writes large amounts of data, make sure the working directory is set to a suitable location such as `/work` or `/scratch`.
    2. The publish directory is where data is written after processes are finished.
      • Each process can have its own publish directory, which is set by the publishDir directive in the process (e.g. publishDir '/path/to/output_dir', mode: 'copy', overwrite: true)
      • All outputs defined in the output block of the process will be copied to the publish directory
        Warning: If `publishDir` is not defined for a process, you won't see any of the results of that process!
  5. Running your pipeline
    • On Unity, you will need to load the appropriate module to run Nextflow, plus the corresponding modules for conda or apptainer if your pipeline uses those
      module load nextflow/24.10.3
      module load conda/latest
      module load apptainer/latest
    • Launch the Nextflow script, here with a custom working directory and a command line parameter output_dir:
      nextflow run main.nf -work-dir /path/to/workdir --output_dir /path/to/outputDir
    • The nextflow run command is typically executed from an sbatch script, and it stays alive for as long as your pipeline takes to complete.
      Warning: Make sure you allocate enough run time in your sbatch script to cover the whole pipeline.
      Warning: Your Nextflow job will need enough resources to execute any processes that use the 'local' executor.
  6. Additional coding tips:
    • You can declare variables in your Nextflow script with =.
    • Variables are generally scoped within the block where they are defined. However, the inputs and outputs of processes/workflows can be referred to in other scopes (e.g. generateNumbers.out), and parameters are available globally (params.<parameter_name>).
    • Variables are referred to without quotes. To refer to a variable within double quotes, use the syntax "${variable}". Single quotes are treated as literal and variables cannot be used within them, as in bash.
    • Nextflow has several built-in variables that may be useful. The example script below uses projectDir, a Nextflow variable that gives the path of your Nextflow script. https://www.nextflow.io/docs/latest/config.html#constants
    • Use the executor 'local' directive for processes that don't require separate Slurm allocations. These will run in your current environment.
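
As an illustration of the pieces described above (directives, channels, and params), here is a skeletal process and workflow. The file patterns, resource values, and clusterOptions argument are hypothetical placeholders, not a ready-made pipeline:

```
// main.nf -- skeletal sketch; names and values are illustrative
params.input_dir = 'data'   // override at launch: nextflow run main.nf --input_dir <path>

process countLines {
    executor 'slurm'                  // one Slurm job per task
    cpus 1
    memory '1 GB'
    time '30m'
    clusterOptions '--nodes=1'        // example custom Slurm argument
    publishDir 'results', mode: 'copy', overwrite: true

    input:
    path infile

    output:
    path "${infile}.count"

    script:
    """
    wc -l ${infile} > ${infile}.count
    """
}

workflow {
    // One task per matching input file; the channel drives parallelization
    files = channel.fromPath("${params.input_dir}/*.txt")
    countLines(files)
}
```

Each file emitted by the channel becomes an independent task, so the process instances run in parallel as separate Slurm jobs.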

Simple example

The Nextflow code for this example is hosted at: https://github.com/jvjvjvjv/nextflow-testing.git

Below is a graphical representation of a three-step pipeline in which a process generates results asynchronously via an array job; those results are then combined and used in a later process to calculate statistics. To keep this example quick and simple, we simply generate random numbers, concatenate them, and then calculate their means across the rows and columns.

Executing with SLURM

In SLURM, each step is specified using its own sbatch script, allowing us to specify unique resource requirements for each step and enable parallelization of step 1 via an array job. If we were to run the pipeline again, we would have to manually edit the filepaths across all three scripts, then launch and verify the success of each step manually.

Workflow overview with SLURM

slurm_script1.sh:

#!/bin/bash
#SBATCH --job-name=generate_numbers
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=1G
#SBATCH --partition=cpu
#SBATCH --time=01:00:00
#SBATCH --array=1-12%12

script_dir="/work/pi_yingzhang_uri_edu/jvailionis/nextflow_testing/lib"
work_dir="/scratch3/workspace/jason_vailionis_uri_edu-kinetics/nextflow_testing"

cd "$work_dir"
# writes a tsv of random numbers in the results_dir directory
mkdir -p results_dir
python "$script_dir"/generate_random_numbers.py \
    results_dir/result_"$SLURM_ARRAY_TASK_ID".tsv

slurm_script2.sh:

#!/bin/bash
#SBATCH --job-name=combine_files
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=4G
#SBATCH --partition=cpu
#SBATCH --time=01:00:00

module load conda/latest
conda activate Py-3.12-rdkit  # environment with pandas

script_dir="/work/pi_yingzhang_uri_edu/jvailionis/nextflow_testing/lib"
work_dir="/scratch3/workspace/jason_vailionis_uri_edu-kinetics/nextflow_testing"

cd "$work_dir"
# combines all files within results_dir into combined_results.tsv
python "$script_dir"/combine_files.py results_dir/ combined_results.tsv

slurm_script3.sh:

#!/bin/bash
#SBATCH --job-name=calculate_statistics
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem=2G
#SBATCH --partition=cpu
#SBATCH --time=01:00:00

module load conda/latest
conda activate Py-3.12-rdkit  # environment with pandas

script_dir="/work/pi_yingzhang_uri_edu/jvailionis/nextflow_testing/lib"
work_dir="/scratch3/workspace/jason_vailionis_uri_edu-kinetics/nextflow_testing"

cd "$work_dir"

# calculates per-row and per-column mean and creates two output files beginning with 'output_file'
python "$script_dir"/calculate_statistics.py combined_results.tsv output_file_

To run the pipeline:

$ sbatch slurm_script1.sh
... wait for completion, check output files
$ sbatch slurm_script2.sh
... wait for completion, check output files
$ sbatch slurm_script3.sh

Executing with Nextflow

The pipeline can be defined within a single Nextflow workflow, which will automate the SLURM allocation, dataflow, error handling, and cleanup.

Workflow overview with Nextflow

This script demonstrates some of Nextflow's distinctive features: the use of hard-coded file names, file glob patterns, or variables to specify dataflow; the built-in projectDir variable, which lets you refer to scripts within your Nextflow directory with relative paths; and per-process specification of different conda environments and local or Slurm executors.

main.nf:

process generateNumbers {
    executor 'slurm'
    cpus 1
    memory '1G'
    time '1h'

    input:
    val x

    output:
    path "result_${x}.tsv"

    script:
    """
    python ${projectDir}/lib/generate_random_numbers.py result_${x}.tsv
    """
}

process combineFiles {
    conda '/work/pi_yingzhang_uri_edu/jvailionis/conda_env/Py-3.12-rdkit'
    executor 'local'
    memory '4G'
    cpus 1
    time '1h'
    publishDir params.output_dir, mode: 'copy', overwrite: true

    input:
    path "results_dir/*"

    output:
    path "combined_results.tsv"

    script:
    """
    python ${projectDir}/lib/combine_files.py results_dir/ combined_results.tsv
    """
}

process calculateStatistics {
    conda '/work/pi_yingzhang_uri_edu/jvailionis/conda_env/Py-3.12-rdkit'
    publishDir params.output_dir, mode: 'copy', overwrite: true
    executor 'local'
    cpus 1
    time '1h'
    memory '1G'

    input:
    path combined_results

    output:
    path("output_file_*.csv", arity: '2')

    script:
    """
    python ${projectDir}/lib/calculate_statistics.py ${combined_results} output_file_
    """
}

// Workflow definition
workflow {
    def num = channel.from(1..params.num_replicates)
    random_numbers = generateNumbers(num)
    combined_numbers = combineFiles(random_numbers.collect())
    calculateStatistics(combined_numbers)
}

run_nextflow.sh:

#!/bin/bash
#SBATCH --job-name=nf-runner
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --partition=cpu
#SBATCH --time=02:00:00

module load nextflow/24.10.3
module load conda/latest

# Path to conda environment
conda_path='/work/pi_yingzhang_uri_edu/jvailionis/conda_env/Py-3.12-rdkit'

# Path to temporary working directory. Recommended if your script will generate lots of data
workdir='/scratch3/workspace/jason_vailionis_uri_edu-kinetics/nextflow_work'

# Path to output dir. If not specified, will create a folder called 'output' in the project directory
outdir='/scratch3/workspace/jason_vailionis_uri_edu-kinetics/nextflow_results'

nextflow run main.nf \
    -resume \
    -process.conda "$conda_path" \
    -work-dir "$workdir" \
    --output_dir "$outdir" \
    --num_replicates 10

To run the pipeline:

$ sbatch run_nextflow.sh
Last modified: Wednesday, January 28, 2026 at 1:53 PM.