# Creating Pipelines with Builders

Builders create the pipeline configuration that is later used to run the pipeline. The configuration includes the library used for each step, the input and output filenames, and the run parameters for the related algorithm.

The sections below describe the individual file readers and builders.

## 1. File Readers

### FastqReader

```python
from cosap import FastqReader

sample_fastq = FastqReader("/path/to/fastq.fastq", name="normal_sample")
```

You can bundle paired FASTQ files in a list:

```python
germline_fastqs = [
    FastqReader("/path/to/fastq_1.fastq", name="normal_sample", read=1),
    FastqReader("/path/to/fastq_2.fastq", name="normal_sample", read=2)
]

tumor_fastqs = [
    FastqReader("/path/to/fastq_1.fastq", name="tumor_sample", read=1),
    FastqReader("/path/to/fastq_2.fastq", name="tumor_sample", read=2)
]
```

### BamReader

```python
from cosap import BamReader

sample_bam = BamReader("/path/to/sample.mdup.bam")
```

{% hint style="info" %}
When running COSAP without Dockerization, relative file paths passed to Readers are resolved relative to the directory from which you run the Python command.

The workdir option in the pipeline builder only affects where intermediate and final files will be created.
{% endhint %}
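To avoid surprises with relative paths, one option is to convert them to absolute paths before passing them to a Reader. The sketch below uses only the standard library; `to_absolute` and the example path are hypothetical names chosen for illustration and are not part of COSAP.

```python
from pathlib import Path


def to_absolute(path_str: str) -> str:
    """Return an absolute path.

    Relative inputs are resolved against the current working directory,
    which is the directory the Python command was started from.
    """
    return str(Path(path_str).expanduser().resolve())


relative_bam = "data/sample.mdup.bam"  # hypothetical relative path
absolute_bam = to_absolute(relative_bam)

# The resulting string can then be passed to a Reader instead of the relative path.
print(absolute_bam)
```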

## 2. Builders

### Trimmer

Trimmer builder for adapter trimming and quality control. Takes the list of paired [FastqReaders](#fastqreader) as input. Uses [fastp](https://github.com/OpenGene/fastp).

```python
from cosap import Trimmer

trimmer_germline = Trimmer(input_step=germline_fastqs)
```

### Mapper

Mapper builder for read mapping. Takes a [Trimmer](#trimmer) or [FastqReader](#fastqreader) as input. The following libraries are currently supported:

* [BWA](https://github.com/lh3/bwa)
* [BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2)
* [Bowtie2](https://github.com/BenLangmead/bowtie2)
* [Parabricks fq2bam](https://docs.nvidia.com/clara/parabricks/4.0.1/documentation/tooldocs/man_fq2bam.html#man-fq2bam) (Integrated into the "bwa" library. Pipeline runner device must be "gpu")

```python
from cosap import Mapper

mapper_germline_params = {
    "read_groups": {
        "ID": "H0164.2",
        "SM": "Pt28N",
        "PU": "0",
        "PL": "illumina",
        "LB": "Solexa-272222"
    }
}

mapper_germline_bwa = Mapper(
    library="bwa2",
    input_step=trimmer_germline,
    params=mapper_germline_params
)

mapper_germline_bowtie = Mapper(
    library="bowtie",
    input_step=trimmer_germline,
    params=mapper_germline_params
)
```
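The `read_groups` keys in the example follow the SAM `@RG` header tags: `ID` (read group identifier), `SM` (sample name), `PU` (platform unit), `PL` (sequencing platform), and `LB` (library). As a rough illustration of how such a dictionary corresponds to a header line, the sketch below renders one. The `to_rg_header` helper is hypothetical and only demonstrates the SAM convention; how COSAP passes these values to the mapper internally is not shown here.

```python
def to_rg_header(read_groups: dict) -> str:
    """Render a read-group dictionary as a tab-separated SAM @RG header line."""
    if "ID" not in read_groups:
        # The SAM specification requires every @RG line to carry an ID tag.
        raise ValueError("read_groups must contain an 'ID' entry")
    # Emit ID first, then the remaining tags in the order they were given.
    fields = [f"ID:{read_groups['ID']}"]
    fields += [f"{tag}:{value}" for tag, value in read_groups.items() if tag != "ID"]
    return "\t".join(["@RG", *fields])


read_groups = {
    "ID": "H0164.2",
    "SM": "Pt28N",
    "PU": "0",
    "PL": "illumina",
    "LB": "Solexa-272222",
}

header_line = to_rg_header(read_groups)
print(header_line)
```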

### MarkDuplicates Builder

Duplicate read tagger and remover builder. Takes Mapper as input.

```python
from cosap import MDUP

mdup_germline = MDUP(input_step=mapper_germline_bwa)

# By default, this removes all duplicates.
# If you only want to mark them, use the duplicate_handling_method argument.
mdup_germline = MDUP(input_step=mapper_germline_bwa, duplicate_handling_method="mark")
```

### BaseRecalibrator

GATK BaseRecalibrator builder. Takes Mapper or MDUP as input.

```python
from cosap import Recalibrator

recalibrator_germline = Recalibrator(input_step=mdup_germline)
```

### Elprep Preprocessing Tool

[Elprep](https://github.com/ExaScience/elprep) is a high-performance preprocessing tool. It combines the functionality of the duplicate remover and the base recalibrator. It requires up to 200GB of memory, so it is recommended only for workstations and servers with enough RAM.

```python
from cosap import Elprep

elprep_recalibrator_germline = Elprep(input_step=mapper_germline_bwa)
```

### VariantCaller

Variant caller builder for variant detection tools. Takes Mapper, MDUP, Recalibrator, or Elprep of both normal and tumor samples as input. Currently the following libraries are supported:

* [Mutect2](https://gatk.broadinstitute.org/hc/en-us/articles/360046788432-Mutect2)
* [Varscan2](http://varscan.sourceforge.net/)
* [Strelka2](https://github.com/Illumina/strelka)
* [Octopus](https://github.com/luntergroup/octopus)
* [MuSe](https://github.com/danielfan/MuSE)
* [VarDict](https://github.com/AstraZeneca-NGS/VarDict)
* [SomaticSniper](https://github.com/genome/somatic-sniper)
* [VarNet](https://github.com/skandlab/VarNet)
* [DeepVariant](https://github.com/google/deepvariant)
* [HaplotypeCaller](https://gatk.broadinstitute.org/hc/en-us/articles/360037225632-HaplotypeCaller)
* [Manta](https://github.com/Illumina/manta)

```python
from cosap import VariantCaller

# recalibrator_tumor is assumed to be built from tumor_fastqs
# in the same way as recalibrator_germline above.
sample_params = {"germline_sample_name": "Pt28N"}

mutect_caller = VariantCaller(
    library="mutect",
    germline=recalibrator_germline,
    tumor=recalibrator_tumor,
    params=sample_params
)
strelka_caller = VariantCaller(
    library="strelka",
    germline=recalibrator_germline,
    tumor=recalibrator_tumor,
    params=sample_params
)
```

{% hint style="info" %}
If a sample name is provided to the Mapper as a read group (the `SM` entry), the same name must also be provided in the VariantCaller params.
{% endhint %}
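A simple way to catch a mismatch early is to compare the two parameter dictionaries before building the steps. The following is a minimal sketch that works on plain dictionaries shaped like the examples above; `check_sample_name` is a hypothetical helper and not a COSAP function.

```python
def check_sample_name(mapper_params: dict, caller_params: dict) -> None:
    """Raise ValueError if the Mapper's SM read group and the caller's sample name differ."""
    rg_sample = mapper_params.get("read_groups", {}).get("SM")
    caller_sample = caller_params.get("germline_sample_name")
    # Only complain when the Mapper actually sets a sample name.
    if rg_sample is not None and rg_sample != caller_sample:
        raise ValueError(
            f"Mapper read group SM is {rg_sample!r} but "
            f"germline_sample_name is {caller_sample!r}"
        )


mapper_params = {"read_groups": {"ID": "H0164.2", "SM": "Pt28N"}}
caller_params = {"germline_sample_name": "Pt28N"}

# Passes silently because both names are "Pt28N".
check_sample_name(mapper_params, caller_params)
```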

{% hint style="info" %}
For Strelka2, Manta, and VarNet, COSAP requires [Docker](https://docs.docker.com/engine/install/) to be installed on the system.
{% endhint %}

{% hint style="warning" %}
On some systems, Mutect2 may cause crashes when used with multithreading. To turn off multithreading in COSAP, set `COSAP_THREADS_PER_JOB` to 1.
{% endhint %}
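Assuming `COSAP_THREADS_PER_JOB` is read from the process environment (an assumption, since this page does not say where the setting lives), it could be set from Python before the pipeline runs, as sketched below. Exporting it in the shell before starting Python would have the same effect.

```python
import os

# Assumption: COSAP picks this value up from the environment.
# Environment variable values are always strings, so use "1" rather than 1.
os.environ["COSAP_THREADS_PER_JOB"] = "1"

print(os.environ["COSAP_THREADS_PER_JOB"])
```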

### VariantAnnotator

Variant annotator builder. The following libraries are currently supported:

* [Ensembl-vep](https://www.ensembl.org/info/docs/tools/vep/index.html)
* [Annovar](https://annovar.openbioinformatics.org/en/latest/)
* [SnpEff](http://pcingola.github.io/SnpEff/#snpeff)
* [InterVar](https://github.com/WGLab/InterVar)
* [CancerVar](https://github.com/WGLab/CancerVar)
* [PharmGKB](https://www.pharmgkb.org/)
* [Annotsv](https://github.com/lgmgeo/AnnotSV)

```python
from cosap import Annotator

annotator = Annotator(library="vep", input_step=mutect_caller)
```

{% hint style="info" %}
For Ensembl-vep, COSAP requires [Docker](https://docs.docker.com/engine/install/) to be installed on the system.
{% endhint %}

## Building Pipeline Config

After creating individual pipeline steps, it is time to gather them under a pipeline. To do this you can simply create a Pipeline instance and add the previously created steps to it.

```python
from cosap import Pipeline

pipeline = Pipeline()
```

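Stacking steps into the pipeline is as easy as calling `.add()`. The example below assumes that the tumor steps (`trimmer_tumor`, `mapper_tumor_bwa`, `mdup_tumor`, `recalibrator_tumor`) were created in the same way as their germline counterparts: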

```python
pipeline.add(trimmer_germline)
pipeline.add(trimmer_tumor)
pipeline.add(mapper_germline_bwa)
pipeline.add(mapper_tumor_bwa)
pipeline.add(mdup_germline)
pipeline.add(mdup_tumor)
pipeline.add(recalibrator_germline)
pipeline.add(recalibrator_tumor)
pipeline.add(mutect_caller)
pipeline.add(annotator)
```

{% hint style="warning" %}
You must add every step you want to run to the pipeline.
{% endhint %}

To create the configuration file:

```python
pipeline_config = pipeline.build(workdir="/path/to/pipeline/workdir")
```

This will create a YAML file in the workdir that you specified.
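To confirm that the file was written, you can list the YAML files in the workdir. The sketch below uses only the standard library and demonstrates the idea on a temporary directory; the file names in it are placeholders, since the actual name of the generated config file is not stated here.

```python
import tempfile
from pathlib import Path


def find_yaml_files(workdir: str) -> list:
    """Return the YAML files found directly inside workdir, sorted by name."""
    return sorted(
        p for p in Path(workdir).iterdir()
        if p.is_file() and p.suffix in {".yaml", ".yml"}
    )


# A temporary directory stands in for the real pipeline workdir.
with tempfile.TemporaryDirectory() as workdir:
    (Path(workdir) / "example_config.yaml").write_text("placeholder: true\n")  # hypothetical name
    (Path(workdir) / "notes.txt").write_text("not a config\n")
    found = [p.name for p in find_yaml_files(workdir)]

print(found)
```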


