Creating Pipelines with Builders

Builders create the pipeline configuration that is later used to run the pipeline. The configuration includes the library used, the input and output filenames, and the run parameters for the related algorithm.

Here are individual file readers and builders:

1. File Readers

FastqReader

from cosap import FastqReader

sample_fastq = FastqReader("/path/to/fastq.fastq", name="normal_sample")

You can bundle paired FASTQ files in a list:

germline_fastqs = [
    FastqReader("/path/to/fastq_1.fastq", name="normal_sample", read=1),
    FastqReader("/path/to/fastq_2.fastq", name="normal_sample", read=2)
]

tumor_fastqs = [
    FastqReader("/path/to/fastq_1.fastq", name="tumor_sample", read=1),
    FastqReader("/path/to/fastq_2.fastq", name="tumor_sample", read=2)
]

BamReader

from cosap import BamReader

sample_bam = BamReader("/path/to/sample.mdup.bam")

When running COSAP without Dockerization, relative file paths passed to Readers are resolved relative to the directory from which you run the Python command.

The workdir option in the pipeline builder only affects where intermediate and final files will be created.
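To avoid surprises from relative paths, you can resolve them to absolute paths before passing them to a Reader. The following is a minimal sketch using only the standard library; the helper name to_absolute and the example path are hypothetical and not part of COSAP.

```python
from pathlib import Path


def to_absolute(path_str: str) -> str:
    """Resolve a path against the current working directory.

    This mirrors how a relative path is interpreted when a script is
    started from a given directory.
    """
    return str(Path(path_str).expanduser().resolve())


# Hypothetical relative path, used only for illustration.
fastq_path = to_absolute("data/normal_1.fastq")
print(fastq_path)
```

The resulting string can then be passed to FastqReader or BamReader in place of the relative path.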

2. Builders

Trimmer

Trimmer builder for adapter trimming and quality control. Takes the list of paired FastqReaders as input. Uses fastp.

from cosap import Trimmer

trimmer_germline = Trimmer(input_step=germline_fastqs)

Mapper

Mapper builder for read mapping. Takes a Trimmer or FastqReaders as input. The examples below use the bwa2 and bowtie libraries:

from cosap import Mapper

mapper_germline_params = {
    "read_groups": {
        "ID": "H0164.2",
        "SM": "Pt28N",
        "PU": "0",
        "PL": "illumina",
        "LB": "Solexa-272222"
    }
}

mapper_germline_bwa = Mapper(
    library="bwa2",
    input_step=trimmer_germline,
    params=mapper_germline_params
)

mapper_germline_bowtie = Mapper(
    library="bowtie",
    input_step=trimmer_germline,
    params=mapper_germline_params
)
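For orientation, the keys in read_groups are standard SAM read-group tags (ID, SM, PU, PL, LB). The sketch below shows how such a dictionary corresponds to an @RG header line in a SAM/BAM file. It is only an illustration: the helper format_rg_header is hypothetical, and how COSAP actually passes these values to the mapper is not covered here.

```python
# Same read-group values as in the Mapper example above.
read_groups = {
    "ID": "H0164.2",
    "SM": "Pt28N",
    "PU": "0",
    "PL": "illumina",
    "LB": "Solexa-272222",
}


def format_rg_header(groups: dict) -> str:
    """Format a read-group dict as a tab-separated SAM @RG header line."""
    fields = [f"{tag}:{value}" for tag, value in groups.items()]
    return "\t".join(["@RG"] + fields)


print(format_rg_header(read_groups))
```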

MarkDuplicates Builder

Duplicate read tagger and remover builder. Takes Mapper as input.

from cosap import MDUP

mdup_germline = MDUP(input_step=mapper_germline_bwa)

# By default, this removes all duplicates.
# If you only want to mark them, use the duplicate_handling_method argument.
mdup_germline = MDUP(input_step=mapper_germline_bwa, duplicate_handling_method="mark")

BaseRecalibrator

GATK BaseRecalibrator builder. Takes Mapper or MDUP as input.

from cosap import Recalibrator

recalibrator_germline = Recalibrator(input_step=mdup_germline)

Elprep Preprocessing Tool

Elprep is a high-performance preprocessing tool. It combines the functionality of the duplicate remover and the base recalibrator. Because it can require up to 200 GB of memory, it is only recommended on capable workstations and servers.

from cosap import Elprep

elprep_recalibrator_germline = Elprep(input_step=mapper_germline_bwa)

VariantCaller

Variant caller builder for variant detection tools. Takes the Mapper, MDUP, Recalibrator, or Elprep steps of both the normal and tumor samples as input. The examples below use two of the supported libraries, mutect and strelka:

from cosap import VariantCaller

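# Must match the SM read-group value given to the germline Mapper.
sample_params = {"germline_sample_name": "Pt28N"}

# recalibrator_tumor is not defined in this section; it is assumed to be
# built from tumor_fastqs with the same steps as recalibrator_germline.
mutect_caller = VariantCaller(
    library="mutect",
    germline=recalibrator_germline,
    tumor=recalibrator_tumor,
    params=sample_params
)
strelka_caller = VariantCaller(
    library="strelka",
    germline=recalibrator_germline,
    tumor=recalibrator_tumor,
    params=sample_params
)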

If a sample name is provided in the Mapper as a read group (the SM field), it must also be provided in the VariantCaller params. In the examples above, both are set to "Pt28N".
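A small check before building the VariantCaller can catch a mismatch early. This is a sketch under the assumption that the two values must be identical, as they are in the examples above; the helper sample_names_match is hypothetical and not part of COSAP.

```python
mapper_germline_params = {
    "read_groups": {
        "ID": "H0164.2",
        "SM": "Pt28N",
        "PU": "0",
        "PL": "illumina",
        "LB": "Solexa-272222",
    }
}
sample_params = {"germline_sample_name": "Pt28N"}


def sample_names_match(mapper_params: dict, caller_params: dict) -> bool:
    """Return True if no SM read group is set, or if it equals germline_sample_name."""
    sm = mapper_params.get("read_groups", {}).get("SM")
    if sm is None:
        # No sample name in the read groups, so there is nothing to compare.
        return True
    return sm == caller_params.get("germline_sample_name")


if not sample_names_match(mapper_germline_params, sample_params):
    raise ValueError("SM read group and germline_sample_name differ")
```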

For Strelka2, Manta, and VarNet, COSAP requires Docker to be installed on the system.

On some systems, Mutect2 may cause crashes when used with multithreading. To turn off multithreading in COSAP, set COSAP_THREADS_PER_JOB to 1.
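One way to apply this setting from Python is sketched below. It assumes that COSAP_THREADS_PER_JOB is read from the process environment and that it must be set before cosap is imported; neither detail is confirmed in this section, so treat it as an assumption.

```python
import os

# Assumption: COSAP reads this value from the environment, so it is set
# before any `import cosap` statement in the same process.
os.environ["COSAP_THREADS_PER_JOB"] = "1"

print(os.environ["COSAP_THREADS_PER_JOB"])
```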

VariantAnnotator

Variant annotator builder.

The example below uses the vep library:

from cosap import Annotator

annotator = Annotator(library="vep", input_step=mutect_caller)

For Ensembl-vep, COSAP requires Docker to be installed on the system.
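Since several steps depend on Docker, it can help to check for it before building a pipeline that uses them. The sketch below only checks that a docker executable is on PATH; it does not verify that the Docker daemon is running. The helper docker_available is hypothetical and not part of COSAP.

```python
import shutil


def docker_available() -> bool:
    """Return True if a `docker` executable is found on PATH."""
    return shutil.which("docker") is not None


if not docker_available():
    print("Docker not found on PATH; Docker-based steps may fail.")
```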

Building Pipeline Config

After creating individual pipeline steps, it is time to gather them under a pipeline. To do this you can simply create a Pipeline instance and add the previously created steps to it.

from cosap import Pipeline

pipeline = Pipeline()

Adding steps to the pipeline is as easy as calling .add():

pipeline.add(trimmer_germline)
pipeline.add(trimmer_tumor)
pipeline.add(mapper_germline_bwa)
pipeline.add(mapper_tumor_bwa)
pipeline.add(mdup_germline)
pipeline.add(mdup_tumor)
pipeline.add(recalibrator_germline)
pipeline.add(recalibrator_tumor)
pipeline.add(mutect_caller)
pipeline.add(annotator)

You must add every step you want to run to the pipeline.

To create the configuration file:

pipeline_config = pipeline.build(workdir="/path/to/pipeline/workdir")

This will create a YAML file in the workdir that you specified.
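To locate the generated file afterwards, you can list the YAML files in the workdir. This is a minimal sketch: the name of the file is not stated here, so the helper find_config_files (hypothetical, not part of COSAP) simply matches on the .yaml and .yml extensions.

```python
from pathlib import Path
from typing import List


def find_config_files(workdir: str) -> List[Path]:
    """Return the YAML files directly inside workdir, sorted by name."""
    workdir_path = Path(workdir)
    if not workdir_path.is_dir():
        # Nothing to list if the workdir does not exist yet.
        return []
    return sorted(
        p for p in workdir_path.iterdir()
        if p.is_file() and p.suffix in {".yaml", ".yml"}
    )


for config_file in find_config_files("/path/to/pipeline/workdir"):
    print(config_file)
```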
