nf-core/eager
A fully reproducible and state-of-the-art ancient DNA analysis pipeline
22.10.6
Define where the pipeline should find input data and save output data.
Path to tab- or comma-separated file containing information about the samples in the experiment.
string
^\S+\.(c|t)sv$
You will need to create a design file with information about the samples in your experiment before running the pipeline. Use this parameter to specify its location. It has to be a tab- or comma-separated file with 11 columns, and a header row. See usage docs.
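For example, a minimal launch command using this samplesheet might look like the following (file names, profile, and output path are purely illustrative):
nextflow run nf-core/eager -profile docker --input samplesheet.tsv --fasta reference.fasta --outdir ./results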
Specify to convert input BAM files back to FASTQ for remapping
boolean
This parameter tells the pipeline to convert the BAM files listed in the --input TSV or CSV sheet back to FASTQ format to allow re-preprocessing and mapping.
This can be useful when you want to ensure consistent mapping parameters across all libraries when incorporating public data. However, be careful of biases that may come from re-processing (the BAM files may already be clipped, or may only contain mapped reads generated with different settings, so you may not have all reads from the original publication).
The output directory where the results will be saved. You have to use absolute paths to storage on Cloud infrastructure.
string
Email address for completion summary.
string
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file (~/.nextflow/config
) then you don't need to specify this on the command line for every run.
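For instance, a minimal sketch of such a setting in ~/.nextflow/config (the address is a placeholder):
params {
    email = "you@example.com"
}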
MultiQC report title. Printed as page header, used for filename if not otherwise specified.
string
Reference genome related files and options required for the workflow.
Path to FASTA file of the reference genome.
string
^\S+\.fn?a(sta)?(\.gz)?$
This parameter is mandatory if --genome
or --fasta_sheet
are not specified. If you don't supply a mapper index (e.g. for BWA), this will be generated for you automatically. Combine with --save_reference
to save mapper index for future runs.
Specify path to samtools FASTA index.
string
If you want to use a pre-existing samtools faidx
index, use this to specify the required FASTA index file for the selected reference genome. This should be generated by samtools faidx and has a file suffix of .fai
.
Specify path to Picard sequence dictionary file.
string
If you want to use a pre-existing picard CreateSequenceDictionary
dictionary file, use this to specify the required .dict
file for the selected reference genome.
Specify path to directory containing index files of the FASTA for a given mapper.
string
For most people this will likely be the same directory that contains the file you provided to --fasta
.
If you want to use pre-existing bwa index
indices, the directory should contain files ending in '.amb' '.ann' '.bwt'. If you want to use pre-existing bowtie2 build
indices, the directory should contain files ending in '.1.bt2', '.2.bt2', '.rev.1.bt2'.
In any case do not include the files themselves in the path. nf-core/eager will automagically detect the index files by searching for the FASTA filename with the corresponding bwa index
/bowtie2 build
file suffixes. If not supplied, the indices will be generated for you.
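As an illustration, for a reference supplied as reference.fasta, a pre-built BWA index directory would typically contain files such as the following (names illustrative):
reference.fasta.amb  reference.fasta.ann  reference.fasta.bwt  reference.fasta.pac  reference.fasta.sa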
Specify to generate '.csi' BAM indices instead of '.bai' for larger reference genomes.
boolean
This parameter is required to be set for large reference genomes. If your reference genome is larger than 3.5GB, the samtools index
calls in the pipeline need to generate .csi
indices instead of .bai
indices to compensate for the size of the reference genome (with samtools: -c). This parameter is not required for smaller references (including the human reference genomes hg19 or GRCh37/GRCh38).
Specify to save any pipeline-generated reference genome indices in the results directory.
boolean
Use this if you do not have pre-made reference FASTA indices for bwa, samtools and picard. If you turn this on, the indices nf-core/eager generates for you will be saved in <your_output_dir>/results/reference_genomes. If this flag is not supplied, any reference indices generated by nf-core/eager will be deleted.
Modifies SAMtools index command:
-c
Path to a tab-/comma-separated file containing reference-specific files.
string
^\S+\.(c|t)sv$
This parameter is mandatory if --genome
or --fasta
are not specified. If you don't supply a mapper index (e.g. for BWA), this will be generated for you automatically.
Name of iGenomes reference.
string
If using a reference genome configured in the pipeline using iGenomes, use this parameter to give the ID for the reference. This is then used to build the full paths for all required reference genome files e.g. --genome GRCh38
.
See the nf-core website docs for more details.
Directory / URL base for iGenomes references.
string
s3://ngi-igenomes/igenomes/
Do not load the iGenomes reference config.
boolean
Do not load igenomes.config
when running the pipeline. You may choose this option if you observe clashes between custom parameters and those supplied in igenomes.config
.
Specify the FASTA header of the extended chromosome when using circularmapper
.
string
The entry (chromosome, contig, etc.) in your FASTA reference that you'd like to be treated as circular.
Applies only when providing a single FASTA file via --fasta
(NOT multi-reference input - see reference TSV/CSV input).
Modifies tool parameter(s):
- circulargenerator
-s
Specify the number of bases to extend reference by (circularmapper only).
integer
500
The number of bases to extend the beginning and end of each reference genome with.
Specify an elongated reference FASTA to be used for circularmapper.
string
Specify an already elongated FASTA file for circularmapper to avoid regeneration.
Specify a samtools index for the elongated FASTA file.
string
Specify the index for an already elongated FASTA file to avoid regeneration.
Parameters used to describe centralised config profiles. These should not be edited.
Git commit id for Institutional configs.
string
master
Base directory for Institutional configs.
string
https://raw.githubusercontent.com/nf-core/configs/master
If you're running offline, Nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell Nextflow where to find them with this parameter.
Institutional config name.
string
Institutional config description.
string
Institutional config contact information.
string
Institutional config URL link.
string
Set the top limit for requested resources for any single job.
Maximum number of CPUs that can be requested for any single job.
integer
16
Use to set an upper-limit for the CPU requirement for each process. Should be an integer e.g. --max_cpus 1
Maximum amount of memory that can be requested for any single job.
string
128.GB
^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$
Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. --max_memory '8.GB'
Maximum amount of time that can be requested for any single job.
string
240.h
^(\d+\.?\s*(s|m|h|d|day)\s*)+$
Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. --max_time '2.h'
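As a sketch, all three caps could be set together in a custom configuration file passed with -c (values are examples only):
params {
    max_cpus   = 8
    max_memory = '32.GB'
    max_time   = '48.h'
}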
Less common options for the pipeline, typically set in a config file.
Display help text.
boolean
Display version and exit.
boolean
Method used to save pipeline results to output directory.
string
The Nextflow publishDir
option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See Nextflow docs for details.
Email address for completion summary, only when pipeline fails.
string
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
An email address to send a summary email to when the pipeline is completed - ONLY sent if the pipeline does not exit successfully.
Send plain-text email instead of HTML.
boolean
File size limit when attaching MultiQC reports to summary emails.
string
25.MB
^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$
Do not use coloured log outputs.
boolean
Incoming hook URL for messaging service
string
Incoming hook URL for messaging service. Currently, MS Teams and Slack are supported.
Custom config file to supply to MultiQC.
string
Custom logo file to supply to MultiQC. File name must also be set in the MultiQC config file.
string
Custom MultiQC yaml file containing HTML including a methods description.
string
Specify whether to validate parameters against the schema at runtime.
boolean
true
Show all params when using --help
boolean
By default, parameters set as hidden in the schema are not shown on the command line when a user runs with --help
. Specifying this option will tell the pipeline to show all parameters.
Validation of parameters fails when an unrecognised parameter is found.
boolean
By default, when an unrecognised parameter is found, a warning is returned.
Validation of parameters in lenient mode.
boolean
Allows string values that are parseable as numbers or booleans. For further information see JSONSchema docs.
Base URL or local path to location of pipeline test dataset files
string
https://raw.githubusercontent.com/nf-core/test-datasets/
Removal of adapters, paired-end merging, poly-G removal, etc.
Specify which tool to use for sequencing quality control.
string
Specify which tool to use for sequencing quality control.
Falco is designed as a drop-in replacement for FastQC but written in C++ for faster computation. We recommend using falco with very large datasets (due to reduced memory constraints).
Specify to skip all preprocessing steps (adapter removal, paired-end merging, poly-G trimming, etc).
boolean
Specify to skip all preprocessing steps (adapter removal, paired-end merging, poly-G trimming etc).
This will also mean you will only get one set of FastQC results (of the input reads).
Specify which preprocessing tool to use.
string
Specify which preprocessing tool to use.
AdapterRemoval is commonly used in palaeogenomics, however fastp has similar performance and much additional functionality (including inbuilt complexity trimming) that can often be useful.
Specify to skip read-pair merging.
boolean
Turns off paired-end read merging, which will result in paired-end mapping modes being used during alignment of reads against the reference.
This can be useful in cases where you have long ancient DNA reads, modern DNA or when you want to utilise mate-pair 'spatial' information.
⚠️ If you run this with --preprocessing_minlength set to a value (as is the default!), you may end up removing single reads from either the pair1 or pair2 file. These reads will NOT be mapped when aligning with either BWA or Bowtie 2, as both can only accept one (forward) or two (forward and reverse) FASTQs as input in paired-end mode.
⚠️ If you run metagenomic screening as well as skipping merging, all reads will be screened as independent reads - not as pairs! - as all FASTQ files from BAM filtering are merged into one. This merged file is not saved in the results directory.
Modifies AdapterRemoval parameter:
--collapse
Modifies fastp parameter:--merge
Specify to exclude read-pairs that did not overlap sufficiently for merging (i.e., keep merged reads only).
boolean
Specify to exclude read-pairs that did not overlap sufficiently for merging (i.e., keep merged reads only). Singletons (i.e. reads missing a pair) or un-merged reads (where there wasn't sufficient overlap) are discarded.
Most ancient DNA molecules are very short, and the majority are expected to merge. Specifying this parameter can sometimes be useful when dealing with ultra-short aDNA reads to reduce the number of longer-reads you may have in your library that are derived from modern contamination. It can also speed up run time of mapping steps.
You may want to use this if you want to ensure only the best quality reads for your analysis, but with the penalty of potentially losing still-valid data (even if some reads have slightly lower quality and/or are longer). It is highly recommended when using the 'dedup' deduplication tool.
Specify to skip removal of adapters.
boolean
Specify to turn off trimming of adapters from reads.
You may wish to do this if you are using publicly available data that should already have had all library artefacts removed from the reads.
This will override any other adapter parameters provided (i.e., --preprocessing_adapterlist and --preprocessing_adapter{1,2} will be ignored)!
Modifies AdapterRemoval parameter:
--adapter1
and--adapter2
(sets both to an empty string)
Applies fastp parameter:--disable_adapter_trimming
Specify the nucleotide sequence for the forward read/R1.
string
Specify a nucleotide sequence for the forward read/R1.
If not modified by the user, the default for the particular preprocessing tool will be used. Therefore, to turn off adapter trimming use --preprocessing_skipadaptertrim
.
Modifies AdapterRemoval parameter:
--adapter1
Modifies fastp parameter:--adapter_sequence
Specify the nucleotide sequence for the reverse read/R2.
string
Specify a nucleotide sequence for the reverse read/R2.
If not modified by the user, the default for the particular preprocessing tool will be used. To turn off adapter trimming use --preprocessing_skipadaptertrim
.
Modifies AdapterRemoval parameter:
--adapter2
Modifies fastp parameter:--adapter_sequence_r2
Specify a list of all possible adapters to trim.
string
Specify a file with a list of adapter (combinations) to remove from all files.
Overrides the --preprocessing_adapter1
/--preprocessing_adapter2
parameters.
Note that the two tools have slightly different behaviours.
For AdapterRemoval this consists of a two-column table with a .txt extension: the first column represents the forward strand, the second column the reverse strand. You must supply all possible combinations, one per line, and this list is applied to all files. Only adapters in this list will be screened for and removed. See the AdapterRemoval documentation for more information.
For fastp this consists of a standard FASTA format with a .fasta
/.fa
/.fna
/.fas
extension. The adapter sequence in this file should be at least 6 bp long, otherwise it will be skipped. fastp will first perform auto-detection and removal of adapters, and then additionally remove adapters present in the FASTA file one by one.
Modifies AdapterRemoval parameter:
--adapter-list
Modifies fastp parameter:--adapter_fasta
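As a hedged illustration of the two formats: an AdapterRemoval adapter list is a plain .txt file with whitespace-separated columns (forward adapter, then reverse), while the fastp adapter list is a FASTA file. The sequences below are common Illumina adapter prefixes and serve only as placeholders:
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC    AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
and, for fastp:
>adapter_read1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
>adapter_read2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA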
Specify the minimum length reads must have to be retained.
integer
25
Specify the minimum length reads must have to be retained.
Reads smaller than this length after trimming are discarded and not included in downstream analyses. Typically in ancient DNA, users will set this to 30, or for very old samples around 25 bp - reads any shorter than this are often not specific enough to provide useful information.
Modifies AdapterRemoval parameter:
--minlength
Modifies fastp parameter:--length_required
Specify number of bases to hard-trim from 5 prime or front of reads.
integer
Specify number of bases to hard-trim from 5 prime or front of reads. Exact behaviour varies per tool, see documentation. By default set to 0
to not perform any hard trimming.
This parameter allows users to 'hard' remove a number of bases from the beginning or end of reads, regardless of quality.
⚠️ When this trimming occurs depends on the tool, i.e., the exact behaviour is not the same between AdapterRemoval and fastp.
For fastp: 5p/3p trimming occurs prior to any other trimming (quality, poly-G, adapter). Please see the fastp documentation for more information. If you wish to use this to remove damage prior to mapping (to allow more specific mapping), ensure you have manually removed adapters/quality trimmed prior to giving the reads to nf-core/eager. Alternatively, you can use Bowtie 2's inbuilt pre-mapping read-end trimming functionality. Note that nf-core/eager only allows this hard trimming equally for both forward and reverse reads (i.e., you cannot provide different values for the 5p end for R1 and R2).
For AdapterRemoval, this trimming happens after the removal of adapters, however prior to quality trimming. Therefore, this is more suitable for hard-removal of damage before mapping (however the Bowtie 2 system will be more reliable).
Modifies AdapterRemoval parameters:
--trim5p
Modifies fastp parameters:--trim_front1
and/or--trim_front2
Specify number of bases to hard-trim from 3 prime or tail of reads.
integer
Specify number of bases to hard-trim from 3 prime or tail of reads. Exact behaviour varies per tool, see documentation. By default set to 0
to not perform any hard trimming.
This parameter allows users to 'hard' remove a number of bases from the beginning or end of reads, regardless of quality.
⚠️ When this trimming occurs depends on the tool, i.e., the exact behaviour is not the same between AdapterRemoval and fastp.
For fastp: 5p/3p trimming occurs prior to any other trimming (quality, poly-G, adapter). Please see the fastp documentation for more information. If you wish to use this to remove damage prior to mapping (to allow more specific mapping), ensure you have manually removed adapters/quality trimmed prior to giving the reads to nf-core/eager. Alternatively, you can use Bowtie 2's inbuilt pre-mapping read-end trimming functionality. Note that nf-core/eager only allows this hard trimming equally for both forward and reverse reads (i.e., you cannot provide different values for the 3p end for R1 and R2).
For AdapterRemoval, this trimming happens after the removal of adapters, however prior to quality trimming. Therefore this is more suitable for hard-removal of damage before mapping (however the Bowtie 2 system will be more reliable).
Modifies AdapterRemoval parameters:
--trim3p
Modifies fastp parameters:--trim_tail1
and/or--trim_tail2
Specify to save the preprocessed reads in the results directory.
boolean
Specify to save the preprocessed reads in FASTQ format in the results directory.
This can be useful for re-analysing FASTQ files manually, or uploading to public data repositories such as ENA/SRA (provided you don't filter by length or merge paired reads).
Specify to turn on sequence complexity filtering of reads.
boolean
Performs a poly-G tail removal step in the beginning of the pipeline using fastp.
This can be useful for trimming poly-G tails from short fragments sequenced on two-colour Illumina chemistry such as NextSeqs or NovaSeqs (where no fluorescence is read as a G on two-colour chemistry), which can inflate reported GC content values.
Modifies fastp parameter:
--trim_poly_g
Specify the complexity threshold that must be reached or exceeded to retain reads.
integer
10
This option can be used to define the minimum length of a poly-G tail to begin low complexity trimming.
Modifies fastp parameter:
--poly_g_min_len
Skip AdapterRemoval quality and N base trimming at 5 prime end.
boolean
Turns off quality-based trimming at the 5p end of reads when any of the AdapterRemoval quality or N trimming options are used. Only the 3p end of reads will be trimmed.
This also entirely disables quality based trimming of collapsed reads, since both ends of these are informative for PCR duplicate filtering. For more information see the AdapterRemoval documentation.
Modifies AdapterRemoval parameters:
--preserve5p
Specify to skip AdapterRemoval quality and N trimming at the ends of reads.
boolean
Turns off AdapterRemoval quality trimming from ends of reads.
This can be useful to reduce runtime when running public data that has already been processed.
Modifies AdapterRemoval parameters:
--trimqualities
Specify AdapterRemoval minimum base quality for trimming off bases.
integer
20
Defines the minimum read quality per base that is required for a base to be kept by AdapterRemoval. Individual bases at the ends of reads falling below this threshold will be clipped off.
Modifies AdapterRemoval parameter:
--minquality
Specify to skip AdapterRemoval N trimming (quality trimming only).
boolean
Turns off AdapterRemoval N trimming from ends of reads.
This can be useful to reduce runtime when running publicly available data that has already been processed.
Modifies AdapterRemoval parameters:
--trimns
Specify the AdapterRemoval minimum adapter overlap required for trimming.
integer
1
Specifies a minimum number of bases that must overlap with the adapter sequence before AdapterRemoval trims adapter sequences from reads.
Modifies AdapterRemoval parameter:
--minadapteroverlap
Specify the AdapterRemoval maximum Phred score used in input FASTQ files.
integer
41
Specify maximum Phred score of the quality field of FASTQ files.
The quality-score range can vary depending on the machine and version (e.g. see the diagram here), and this allows you to increase the value from the default AdapterRemoval value of 41.
Note that while this can theoretically provide you with more confident and precise base call information, many downstream tools only accept FASTQ files with Phred scores limited to a max of 41, and therefore increasing the default for this parameter may make the resulting preprocessed files incompatible with some downstream tools.
Modifies AdapterRemoval parameters:
--qualitymax
Options for aligning reads against reference genome(s)
Specify to turn on FASTQ sharding.
boolean
Sharding will split the FASTQs into smaller chunks before mapping. These chunks are then mapped in parallel. This approach can speed up the mapping process for larger FASTQ files.
Specify the number of reads in each shard when splitting.
integer
1000000
Make sure to choose a value that makes sense for your dataset. Small values can create many files, which can end up negatively affecting the overall speed of the mapping process.
Specify which mapper to use.
string
Specify which mapping tool to use. Options are BWA aln ('bwaaln
'), BWA mem ('bwamem
'), circularmapper ('circularmapper
'), or Bowtie 2 ('bowtie2
'). BWA aln is the default and highly suited for short-read ancient DNA. BWA mem can be quite useful for modern DNA, but is rarely used in projects for ancient DNA. CircularMapper enhances the mapping procedure to circular references, using the BWA algorithm but utilizing an extend-remap procedure (see Peltzer et al 2016 for details). Bowtie 2 is similar to BWA aln, and has recently been suggested to provide slightly better results under certain conditions (Poullet and Orlando 2020), as well as providing extra functionality (such as FASTQ trimming).
More documentation can be found in each tool's own manual.
Specify the amount of allowed mismatches in the alignment for mapping with BWA aln.
number
0.01
Specify how many mismatches are allowed in a read during alignment with BWA aln. Default is set following recommendations from Oliva et al. 2021 who compared alignment to human reference genomes.
If you're uncertain what value to use, check out this Shiny App for more information.
Modifies BWA aln parameter:
-n
Specify the maximum edit distance allowed in a seed for mapping with BWA aln.
integer
2
Specify the maximum edit distance during the seeding phase of the BWA aln mapping algorithm.
Modifies BWA aln parameter:
-k
Specify the length of seeds to be used for BWA aln.
integer
1024
Specify the length of the seed used in BWA aln. Default is set to be 'turned off' at the recommendation of Oliva et al. 2021, who tested this when aligning to human reference genomes. Seeding is 'turned off' by specifying an arbitrarily long number to force the entire read to act as the seed.
Note: Despite being recommended, turning off seeding can result in long runtimes!
Modifies BWA aln parameter:
-l
Specify the number of gaps allowed for alignment with BWA aln.
integer
2
Specify the number of gaps allowed for mapping with BWA aln. Default is set to BWA default.
Modifies BWA aln parameter:
-o
Specify the minimum seed length for alignment with BWA mem.
integer
19
Configures the minimum seed length used in BWA mem. Default is set to BWA default.
Modifies BWA mem parameter:
-k
Specify the re-seeding threshold for alignment with BWA mem.
number
1.5
Configures the re-seeding threshold used in BWA mem. Default is set to BWA default.
Modifies BWA mem parameter:
-r
Specify the Bowtie 2 alignment mode.
string
Specify the type of read alignment to use with Bowtie 2. 'Local' allows partial alignment of a read, with the ends of the read possibly 'soft-clipped' (i.e. remaining unaligned/ignored), if the soft-clipped alignment provides the best alignment score. 'End-to-end' requires all nucleotides of a read to be aligned.
Default is set following Cahill et al. (2018) and Poullet and Orlando (2020).
Modifies Bowtie 2 presets:
--local
,--end-to-end
Specify the level of sensitivity for the Bowtie 2 alignment mode.
string
Specify the Bowtie 2 'preset' to use. These strings apply to both --mapping_bowtie2_alignmode
options. See the Bowtie 2 manual for actual settings.
Default is set following Poullet and Orlando (2020), when running damaged data without UDG treatment.
Modifies the Bowtie 2 parameters:
--fast
,--very-fast
,--sensitive
,--very-sensitive
,--fast-local
,--very-fast-local
,--sensitive-local
,--very-sensitive-local
Specify the number of mismatches in seed for alignment with Bowtie 2.
integer
Specify the number of mismatches allowed in the seed during seed-and-extend procedure of Bowtie 2. This will override any values set with --mapping_bowtie2_sensitivity
. Can either be 0 or 1.
Modifies Bowtie 2 parameter:
-N
Specify the length of seed substrings for Bowtie 2.
integer
20
Specify the length of the seed sub-string to use during seeding of Bowtie 2. This will override any values set with --mapping_bowtie2_sensitivity
.
Modifies Bowtie 2 parameter:
-L
Specify the number of bases to trim off from 5 prime end of read before alignment with Bowtie 2.
integer
Specify the number of bases to trim at the 5' (left) end of read before alignment with Bowtie 2. This may be useful when left-over sequencing artefacts of in-line barcodes are present.
Modifies Bowtie 2 parameter:
--trim5
Specify the number of bases to trim off from 3 prime end of read before alignment with Bowtie 2.
integer
Specify the number of bases to trim at the 3' (right) end of read before alignment with Bowtie 2. This may be useful when left-over sequencing artefacts of in-line barcodes are present.
Modifies Bowtie 2 parameter:
--trim3
Specify the maximum fragment length for Bowtie2 paired-end mapping mode only.
integer
500
The maximum fragment length for valid paired-end alignments. Only applies to paired-end mapping (i.e. unmerged reads), and is therefore typically only useful for modern data.
Modifies Bowtie2 parameter:
--maxins
Turn on to remove reads that did not map to the circularised genome.
boolean
If you want to filter out reads that don't map to elongated/circularised chromosome (and also non-circular chromosome headers) from the resulting BAM file, turn this on.
Modifies
-f
and-x
parameters of CircularMapper's RealignSAMFile
Options related to length, quality, and map status filtering of reads.
Specify to turn on filtering of reads in BAM files after mapping. By default, only mapped reads are retained.
boolean
Turns on the filtering subworkflow for mapped BAM files after the read alignment step. Filtering includes removal of unmapped reads, length filtering, and mapping quality filtering.
When turning on BAM filtering, by default only the mapped/unmapped filter is activated, thus only mapped reads are retained for downstream analyses. See --bamfiltering_retainunmappedgenomicbam
to retain unmapped reads, if filtering only for length and/or quality is preferred.
Note this subworkflow can also be activated if --run_metagenomics
is supplied.
Specify the minimum read length mapped reads should have for downstream genomic analysis.
integer
Specify to remove mapped reads that fall below a certain length threshold after mapping.
This can be useful to get more realistic 'endogenous DNA' or 'on target read' percentages.
If used instead of minimum length read filtering at AdapterRemoval, you can get more realistic endogenous DNA estimates when most of your reads are very short (e.g. in single-stranded libraries or samples with highly degraded DNA). In these cases, the default minimum length filter at the earlier adapter clipping/read merging step will remove a very large number of reads from your library (including valid reads), thus making an artificially small denominator for a typical endogenous DNA calculation.
Therefore by retaining all of your reads until after mapping (i.e., turning off the adapter clipping/read merging filter), you can generate more 'real' endogenous DNA estimates immediately after mapping (with a better denominator). Then after estimating this, filter using this parameter to retain only 'useful' reads (i.e., those long enough to provide higher confidence of their mapped position) for downstream analyses.
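As a worked example (numbers purely illustrative): if a library produced 1,000,000 raw reads of which 100,000 map, the endogenous DNA estimate is 100,000 / 1,000,000 = 10%. If early length filtering had already discarded 500,000 short reads before mapping, the same 100,000 mapped reads would instead be reported as 100,000 / 500,000 = 20%, inflating the estimate. Retaining all reads until after mapping preserves the 10% figure, and this parameter can then be used to remove short mapped reads for downstream analyses.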
By specifying 0
, no length filtering is performed.
Note that by default the output BAM files of this step are not stored in the results directory (as it is assumed that deduplicated BAM files are preferred). See --bamfiltering_savefilteredbams
if you wish to save these.
Modifies filter_bam_fragment_length.py parameter:
-l
Specify the minimum mapping quality reads should have for downstream genomic analysis.
integer
Specify a mapping quality threshold for mapped reads to be kept for downstream analysis.
By default all reads are retained and this option is therefore set to 0 to ensure no quality filtering is performed.
Note that by default the output BAM files of this step are not stored in the results directory (as it is assumed that deduplicated BAM files are preferred). See --bamfiltering_savefilteredbams
if you wish to save these.
Modifies samtools view parameter:
-q
Specify the SAM format flag of reads to remove during BAM filtering for downstream genomic steps.
integer
4
Specify to customise the exact SAM format flag of reads you wish to remove from your BAM file for downstream genomic analyses.
You can explore SAM flags further using the interactive tool from the Broad Institute.
⚠️ Modify at your own risk, alternative flags are not necessarily supported in downstream steps!
Modifies samtools parameter:
-F
Specify to retain unmapped reads in the BAM file used for downstream genomic analyses.
boolean
Specify to retain unmapped reads (optionally also length filtered) in the genomic BAM for downstream analysis. By default, the pipeline only keeps mapped reads for downstream analysis.
This is also turned on if --metagenomics_input
is set to all
.
⚠️ This will likely slow down run time of downstream pipeline steps!
Modifies tool parameter(s):
- samtools view:
-f 4
/-F 4
Specify to generate FASTQ files containing only unmapped reads from the aligner generated BAM files.
boolean
Specify to turn on the generation and saving of FASTQs of only the unmapped reads from the mapping step in the results directory.
This can be useful if you wish to do other analysis of the unmapped reads independently of the pipeline.
Note: the reads in these FASTQ files have not undergone length or quality filtering.
Modifies samtools fastq parameter:
-f 4
Specify to generate FASTQ files containing only mapped reads from the aligner generated BAM files.
boolean
Specify to turn on the generation and saving of FASTQs of only the mapped reads from the mapping step in the results directory.
This can be useful if you wish to do other analysis of the mapped reads independently of the pipeline, such as remapping with different parameters (whereby only including mapped reads will speed up computation time during the re-mapping due to reduced input data).
Note: the reads in these FASTQ files have not undergone length or quality filtering.
Modifies samtools fastq parameter:
-F 4
Specify to save the intermediate filtered genomic BAM files in the results directory.
boolean
Specify to save intermediate length- and/or quality-filtered genomic BAM files in the results directory.
Options related to metagenomic screening.
Specify to turn on metagenomic screening of mapped, unmapped or all reads.
boolean
Specify to turn on the metagenomic screening subworkflow of the pipeline, where reads are screened against large databases. Typically used for pathogen screening or microbial community analysis.
If supplied, this will also turn on the BAM filtering subworkflow of the pipeline.
Requires subsequent specification of --metagenomics_profiling_tool
and --metagenomics_profiling_database
.
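For example, a sketch of the relevant flag combination (tool choice and database path are placeholders):
--run_metagenomics --metagenomics_profiling_tool malt --metagenomics_profiling_database /path/to/malt_db/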
Specify which type of reads to use for metagenomic screening.
string
Specify to select which reads will be sent for metagenomic analysis.
This influences which reads are sent to this step: unmapped reads (used in most cases, as 'host reads' can often be contaminants in microbial genomes), mapped reads (e.g. when doing competitive mapping against a genomic reference containing multiple genomes, to which you wish to apply LCA correction), or all reads.
⚠️ If you skip paired-end merging, all reads will be screened as independent reads - not as pairs! - as all FASTQ files from BAM filtering are merged into one. This merged file is not saved in the results directory.
Modifies samtools fastq parameters:
-f 4
/-F 4
Specify to run a complexity filter on the metagenomics input files before classification.
boolean
Specify to turn on a subworkflow of the pipeline that filters the FASTQ files for complexity before the metagenomics profiling.
Use the --metagenomics_complexity_tool
parameter to select a method.
Specify to save FASTQ files containing the complexity-filtered reads before metagenomic classification.
boolean
Specify to save the complexity-filtered FASTQ files to the results directory.
Specify which tool to use for trimming, filtering or reformatting of FASTQ reads that go into metagenomics screening.
string
Specify to select which tool is used to generate a final set of reads for the metagenomic classifier after any necessary trimming, filtering or reformatting of the reads.
This intermediate file is not saved in the results directory unless marked with --metagenomics_complexity_savefastq
.
Specify the entropy threshold under which a sequencing read will be complexity-filtered out.
number
0.3
Specify the minimum 'entropy' value for complexity filtering for the BBDuk or PRINSEQ++ tools.
This value will only be used for PRINSEQ++ if --metagenomics_prinseq_mode
is set to entropy
.
Entropy here corresponds to the amount of sequence variation existing within the read. Higher values correspond to more variety and thus will likely result in more specific matching to a taxon's reference genome. The trade-off here is fewer reads (or abundance information) available for having a confident identification.
Modifies parameters:
- BBDuk:
entropy=
- PRINSEQ++:
-lc_entropy
Specify the complexity filter mode for PRINSEQ++.
string
Specify the complexity filter mode for PRINSEQ++.
Use the selected mode together with the correct flag:
'dust' requires the --metagenomics_prinseq_dustscore
parameter set
'entropy' requires the --metagenomics_complexity_entropy
parameter set
Modifies parameters:
- PRINSEQ++:
-lc_entropy
- PRINSEQ++:
-lc_dust
Specify the minimum dust score for PRINSEQ++ complexity filtering.
number
0.5
Specify the minimum dust score below which low-complexity reads will be removed. A DUST score is based on how often different tri-nucleotides occur along a read.
Modifies tool parameter(s):
- PRINSEQ++:
--lc_dust
Specify which tool to use for metagenomic profiling and screening. Required if --run_metagenomics
flagged.
string
Select which tool to use for metagenomics profiling of the designated metagenomics_input. These tools behave very differently, as they perform read profiling using different methods, and yield vastly different results.
MALT and MetaPhlAn are alignment based, whereas Kraken2 and KrakenUniq are k-mer based.
MALT has additional postprocessing available (via --run_metagenomics_postprocessing) which can help authenticate alignments to a provided list of taxonomic nodes using established ancient DNA characteristics.
MetaPhlAn performs profiling on the metagenomics input data. This may be used to characterise the metagenomic community of a sample, but care must be taken that you are not just looking at the modern metagenome of an ancient sample (for instance, soil microbes on a bone).
Kraken2 and KrakenUniq are metagenomics classifiers that rely on fast k-mer-matching rather than whole-read alignments and are very memory efficient.
Specify a database directory or .tar.gz file of a database directory to run metagenomics profiling on. Required if --run_metagenomics
flagged.
string
Specify a metagenomics profiling database to use with the designated metagenomics_profiling_tool on the selected metagenomics_input. Databases can be provided either as a directory or as a tar.gz of a directory. Metagenomic databases are NOT compatible across different tools (i.e. a MALT database is different from a Kraken2 database).
All databases need to be pre-built/downloaded for use in nf-core/eager. Database construction is often a balancing act between breadth of sequence diversity and size.
Modifies tool parameter(s):
- krakenuniq:
--db
- kraken2:
--db
- MetaPhlAn:
--bowtie2db
and--index
- MALT: '-index'
Turn on saving reads assigned by KrakenUniq or Kraken2
boolean
Save reads that do and do not have a taxonomic classification in your output results directory in FASTQ format.
Modifies tool parameter(s):
- krakenuniq:
--classified-out
and--unclassified-out
Turn on saving of KrakenUniq or Kraken2 per-read taxonomic assignment file
boolean
Save a text file that contains a list of each read that had a taxonomic assignment, with information on the specific taxonomic assignment that that read received.
Modifies tool parameter(s):
- krakenuniq:
--output
Specify how large to chunk the database when loading into memory for KrakenUniq.
string
16G
nf-core/eager utilises a 'low memory' option for KrakenUniq that can reduce the amount of RAM the process requires using the --preloaded
option.
A further extension to this option is that you can specify how large each chunk of the database should be when it is loaded into memory at any one time. You can specify the amount of RAM to chunk the database to with this parameter, which is particularly useful for people with limited computational resources.
More information about this parameter can be seen here.
Modifies KrakenUniq parameter: --preload-size
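As an illustrative sketch, setting this parameter to 8G would result in the underlying KrakenUniq call containing something like the following (the database path is a placeholder and the full invocation is constructed by the pipeline):
krakenuniq --preload-size 8G --db /path/to/krakenuniq_db ...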
Turn on saving minimizer information in the kraken2 report thus increasing to an eight column layout.
boolean
Turn on saving minimizer information in the kraken2 report thus increasing to an eight column layout.
Modifies kraken2 parameter: --report-minimizer-data
.
Specify which alignment mode to use for MALT.
string
Use this to run the program in 'BlastN', 'BlastP' or 'BlastX' mode, to align DNA against DNA, protein against protein, or DNA reads against protein references, respectively. Ensure your database matches the mode. Check the MALT manual for more details.
Only when --metagenomics_profiling_tool malt
is also supplied.
Modifies tool parameter(s):
- MALT:
-m
Specify alignment method for MALT.
string
Specify which alignment algorithm to use. Options are 'Local' or 'SemiGlobal'. Local is a BLAST-like alignment, but is much slower. Semi-global alignment aligns reads end-to-end. Default: 'SemiGlobal'.
Only when --metagenomics_profiling_tool malt
is also supplied.
Modifies tool parameter(s):
- MALT:
-at
Percent identity value threshold for MALT.
integer
85
Specify the minimum percent identity (or similarity) a sequence must have to the reference for it to be retained.
Only used when --metagenomics_profiling_tool malt
is also supplied.
Modifies tool parameter(s):
- MALT:
-id
Specify the percent for LCA algorithm for MALT (see MEGAN6 CE manual).
integer
1
Specify the top percent value of the LCA algorithm. From the MALT manual: "For each read, only those matches are used for taxonomic placement whose bit score is within 10% of the best score for that read.".
Only when --metagenomics_profiling_tool malt
is also supplied.
Modifies tool parameter(s):
- MALT:
-top
Specify whether to use a percentage or a raw number of reads as the minimum support required for a taxon to be retained for MALT.
string
Specify whether to use a percentage, or raw number of reads as the value used to decide the minimum support a taxon requires to be retained.
Only when --metagenomics_profiling_tool malt
is also supplied.
Modifies tool parameter(s):
- MALT:
-sup
and-supp
Specify the minimum percentage of the sample's total reads that a taxon is required to have to be retained for MALT.
number
0.01
Specify the minimum number of reads (as a percentage of all assigned reads) a given taxon is required to have to be retained as a positive 'hit' in the RMA6 file. This only applies when --malt_min_support_mode
is set to 'percent'.
Only when --metagenomics_profiling_tool malt
is also supplied.
Modifies tool parameter(s):
- MALT:
-supp
Specify the minimum number of reads a taxon is required to have (out of the sample total) to be retained in MALT or Kraken. Not compatible with --malt_min_support_mode 'percent'.
integer
1
For usage in malt: Specify the minimum number of reads a given taxon is required to have to be retained as a positive 'hit'.
For malt, this only applies when --malt_min_support_mode
is set to 'reads'.
Modifies tool parameter(s):
- MALT:
-sup
Specify the maximum number of queries a read can have for MALT.
integer
100
Specify the maximum number of alignments a read can have. All further alignments are discarded.
Only when --metagenomics_profiling_tool malt
is also supplied.
Modifies tool parameter(s):
- MALT:
-mq
Specify the memory load method. Do not use 'map' with GPFS file systems for MALT as it can be very slow.
string
How to load the database into memory. Options are 'load', 'page' or 'map'.
'load' directly loads the entire database into memory prior to seed look-up. This is slow but compatible with all servers/file systems. 'page' and 'map' perform a sort of 'chunked' database loading, allowing seed look-up prior to loading the entire database. Note that 'page' and 'map' modes do not work properly with many remote file systems such as GPFS.
Only when --metagenomics_profiling_tool malt
is also supplied.
Modifies tool parameter(s):
- MALT:
--memoryMode
Specify to also produce SAM alignment files. Note this includes both aligned and unaligned reads, and the files are gzipped. Note this will result in very large file sizes.
boolean
Specify to also produce gzipped SAM files of all alignments and un-aligned reads in addition to RMA6 files. These are not soft-clipped or in 'sparse' format. Can be useful for downstream analyses due to more common file format.
⚠️ This can result in very large run output directories as it is essentially a duplication of the RMA6 files.
Sets tool parameter(s):
- MALT:
--alignments
Define how many FASTQ files should be submitted in the same MALT run. The default value of 0 runs all files at once.
integer
Running very many (large) FASTQ files through MALT at the same time can lead to excessively long runtimes. This parameter allows for parallelisation of MALT runs. Please note, MALT is resource heavy, and setting this value to N above the default (0) will spawn multiple MALT jobs with N samples per group. Please only use this if it is necessary to avoid runtime limits on your HPC cluster, since the overhead of loading a database is high.
Activate post-processing for the selected metagenomics profiling tool.
boolean
Activate the corresponding post-processing tool for your metagenomics profiling software.
malt --> maltextract
krakenuniq/kraken2/metaphlan --> taxpasta
Note: Postprocessing is automatically carried out when using kraken2
and krakenuniq
Path to a text file with taxa of interest (one taxon per row, NCBI taxonomy name format)
string
Path to a .txt file with taxa of interest you wish to assess for aDNA characteristics. The .txt file should contain one taxon per row, and the taxon should be in a valid NCBI taxonomy name format corresponding to a taxonomic node in your MALT database. An example can be found on the HOPS GitHub.
Necessary when --metagenomics_profiling_tool malt is specified and --metagenomics_run_postprocessing is flagged.
Modifies tool parameter(s):
- MaltExtract:
-t
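A minimal taxa-of-interest file might look like the following (the taxon names are placeholders and must correspond to nodes present in your MALT database):
Yersinia pestis
Mycobacterium tuberculosis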
Path to directory containing NCBI resource files (ncbi.tre and ncbi.map; available: https://github.com/rhuebler/HOPS/)
string
Path to a directory containing the NCBI resource tree and taxonomy table files (ncbi.tre and ncbi.map; available at the HOPS repository).
Necessary when --metagenomics_profiling_tool malt and --metagenomics_run_postprocessing are specified.
Modifies tool parameter(s):
- MaltExtract:
-r
Specify which MaltExtract filter to use.
string
Specify which MaltExtract filter to use. This is used to specify what types of characteristics to scan for. The default will output statistics on all alignments, and then a second set with just reads with one C to T mismatch in the first 5 bases. Further details on other parameters can be seen in the HOPS documentation.
Only when --metagenomics_profiling_tool malt
is also supplied.
Modifies tool parameter(s):
- MaltExtract:
-f
Specify percent of top alignments to use.
number
0.01
Specify the frequency of top alignments for each read to be considered for each node. Note, the value should be given as a proportion (where 1 would correspond to 100%, and 0.1 would correspond to 10%).
⚠️ This parameter follows the same concept as --malt_top_percent but uses a different notation, i.e. integer (MALT) versus float (MaltExtract).
Only when --metagenomics_profiling_tool malt
is also supplied.
Modifies tool parameter(s):
- MaltExtract:
-a
Turn off destacking.
boolean
Turn off destacking. If left on, a read that overlaps with another read will be removed (leaving a depth coverage of 1).
Only when --metagenomics_profiling_tool malt
is also supplied.
Sets tool parameter(s):
- MaltExtract:
--destackingOff
Turn off downsampling.
boolean
Turn off downsampling. By default, downsampling is on and will randomly select 10,000 reads if the number of reads on a node exceeds this number. This is to speed up processing, under the assumption that at 10,000 reads the species is a 'true positive'.
Only when --metagenomics_profiling_tool malt
is also supplied.
Sets tool parameter(s):
- MaltExtract:
--downSampOff
Turn off duplicate removal.
boolean
Turn off duplicate removal. By default, reads that are an exact copy (i.e. same start, stop coordinate and exact sequence match) will be removed as they are considered PCR duplicates.
Only when --metagenomics_profiling_tool malt
is also supplied.
Sets tool parameter(s):
- MaltExtract:
--dupRemOff
Turn on exporting alignments of hits in BLAST format.
boolean
Export alignments of hits for each node in BLAST format.
Only when --metagenomics_profiling_tool malt
is also supplied.
Modifies tool parameter(s):
- MaltExtract:
--matches
Turn on export of MEGAN summary files.
boolean
Export 'minimal' summary files (i.e. without alignments) that can be loaded into MEGAN6.
Only when --metagenomics_profiling_tool malt
is also supplied.
Sets tool parameter(s):
- MaltExtract:
--meganSummary
Minimum percent identity that alignments are required to have to be reported as candidate reads. Recommended to set the same as the MALT parameter.
number
85
Minimum percent identity that alignments are required to have to be reported. Higher values allow fewer mismatches between read and reference sequence, and therefore provide greater confidence in the hit. Lower values allow more mismatches, which can account for damage and divergence of a related strain/species to the reference. Recommended to set the same as the MALT parameter or higher.
Only when --metagenomics_profiling_tool malt
is also supplied.
Modifies tool parameter(s):
- MaltExtract:
--minPI
Turn on using top alignments per read after filtering.
boolean
Use the best alignment of each read for every statistic, except for those concerning read distribution and coverage.
Only when --metagenomics_profiling_tool malt
is also supplied.
Sets tool parameter(s):
- MaltExtract:
--useTopAlignment
Options for removal of PCR duplicates
Specify to skip the removal of PCR duplicates.
boolean
Specify which tool to use for deduplication.
string
Specify which duplicate read removal tool to use. While markduplicates
is set by default, an ancient DNA specific read deduplication tool dedup
is offered (see Peltzer et al. 2016 for details). The latter utilises both ends of paired-end data to remove duplicates (i.e. true exact duplicates, as markduplicates will over-zealously deduplicate anything with the same starting position even if the ends are different).
⚠️ DeDup can only be used on collapsed (i.e. merged) reads from paired-end sequencing.
Options for filtering for, trimming or rescaling characteristic ancient DNA damage patterns
Specify to turn on damage rescaling of BAM files using mapDamage2 to probabilistically remove damage.
boolean
Specify to turn on mapDamage2's BAM rescaling functionality. This probabilistically replaces Ts back to Cs depending on the likelihood this reference-mismatch was originally caused by damage. If the library is specified to be single-stranded, this will automatically use the --single-stranded
mode.
This process will ameliorate the effects of aDNA damage, but also increase reference-bias.
This functionality does not have any MultiQC output.
⚠️ Rescaled libraries will not be merged with non-scaled libraries of the same sample for downstream genotyping, as the model may be different for each library. If you wish to merge these, please do this manually and re-run nf-core/eager using the merged BAMs as input.
Modifies mapDamage2 parameter:
--rescale
Specify the length of read sequence to use from each side for rescaling.
integer
12
Specify the length in bp from the end of the read that mapDamage should rescale at both ends. This can be overridden separately for each end by the 5 prime and 3 prime rescale length parameters below.
Modifies mapDamage2 parameter:
--seq-length
Specify the length of read for mapDamage2 to rescale from 5 prime end.
integer
Specify the length in bp from the end of the read that mapDamage should rescale. This overrides --rescale_seqlength
.
Modifies mapDamage2 parameter:
--rescale-length-5p
Specify the length of read for mapDamage2 to rescale from 3 prime end.
integer
Specify the length in bp from the end of the read that mapDamage should rescale. This overrides --rescale_seqlength
.
Modifies mapDamage2 parameter
--rescale-length-3p
Specify to turn on PMDtools filtering.
boolean
Specify to run PMDtools for damage-based read filtering in sequencing libraries.
Specify PMD score threshold for PMDtools.
integer
3
Specify the PMDScore threshold to use when filtering BAM files for DNA damage. Only reads which surpass this damage score are considered for downstream analysis.
Modifies PMDtools parameter:
--threshold
Specify a masked FASTA file with positions to be used with PMDtools.
string
^\S+\.fa(sta)?$
Specify a FASTA file to use as reference for samtools calmd
prior to PMD filtering.
Setting the SNPs that are part of the used capture set as N
can alleviate reference bias when running PMD filtering on capture data, where you might not want the allele of a SNP to be counted as damage when it is a transition.
Specify a BED file to be used to mask the reference FASTA prior to running PMDtools.
string
^\S+\.bed(\.gz)?$
Specify a BED file to activate masking of the reference FASTA at the contained sites prior to running PMDtools. Positions that are in the provided BED file will be replaced by Ns in the reference genome.
This can alleviate reference bias when running PMD filtering on capture data, where you might not want the allele of a transition SNP to be counted as damage. Masking of the reference is done using bedtools maskfasta
.
Specify to turn on BAM trimming for non-UDG or half-UDG libraries.
boolean
Specify to turn on the BAM trimming of [n] bases from reads in the deduplicated BAM file. Damage assessment in PMDtools or DamageProfiler remains untouched, as data is routed through this independently. BAM trimming is typically performed to reduce errors during genotyping that can be caused by aDNA damage.
BAM trimming will only affect libraries with 'damage_treatment' of 'none' or 'half'. Complete UDG treatment ('full') should have removed all damage during library construction, so trimming of 0 bp is performed. The amount of bases that will be trimmed off from each side of the molecule should be set separately for libraries depending on their 'strandedness' and 'damage_treatment'.
Note: additional artefacts such as barcodes or adapters should be removed prior to mapping and not in this step.
Specify the number of bases to clip off reads from 'left' (5 prime) end of reads for double-stranded non-UDG libraries.
integer
Specify the number of bases to clip off reads from 'left' (5 prime) end of reads for double-stranded non-UDG libraries. By default, this is set to 0, and therefore clips off no bases on the left side of reads from double-stranded libraries whose UDG treatment is set to 'none'. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).
Modifies bamUtil's trimBam parameter:
-L
Specify the number of bases to clip off reads from 'right' (3 prime) end of reads for double-stranded non-UDG libraries.
integer
Specify the number of bases to clip off reads from 'right' (3 prime) end of reads for double-stranded non-UDG libraries. By default, this is set to 0, and therefore clips off no bases on the right side of reads from double-stranded libraries whose UDG treatment is set to 'none'. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).
Modifies bamUtil's trimBam parameter:
-R
Specify the number of bases to clip off reads from 'left' (5 prime) end of read for double-stranded half-UDG libraries.
integer
Specify the number of bases to clip off reads from 'left' (5 prime) end of read for double-stranded half-UDG libraries. By default, this is set to 0, and therefore clips off no bases on the left side of reads from double-stranded libraries whose UDG treatment is set to 'half'. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).
Modifies bamUtil's trimBam parameter:
-L
Specify the number of bases to clip off reads from 'right' (3 prime) end of read for double-stranded half-UDG libraries.
integer
Specify the number of bases to clip off reads from 'right' (3 prime) end of read for double-stranded half-UDG libraries. By default, this is set to 0, and therefore clips off no bases on the right side of reads from double-stranded libraries whose UDG treatment is set to 'half'. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).
Modifies bamUtil's trimBam parameter:
-R
Specify the number of bases to clip off reads from 'left' (5 prime) end of read for single-stranded non-UDG libraries.
integer
Specify the number of bases to clip off reads from 'left' (5 prime) end of read for single-stranded non-UDG libraries. By default, this is set to 0, and therefore clips off no bases on the left side of reads from single-stranded libraries whose UDG treatment is set to 'none'. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).
Modifies bamUtil's trimBam parameter:
-L
Specify the number of bases to clip off reads from 'right' (3 prime) end of read for single-stranded non-UDG libraries.
integer
Specify the number of bases to clip off reads from 'right' (3 prime) end of read for single-stranded non-UDG libraries. By default, this is set to 0, and therefore clips off no bases on the right side of reads from single-stranded libraries whose UDG treatment is set to 'none'. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).
Modifies bamUtil's trimBam parameter:
-R
Specify the number of bases to clip off reads from 'left' (5 prime) end of read for single-stranded half-UDG libraries.
integer
Specify the number of bases to clip off reads from 'left' (5 prime) end of read for single-stranded half-UDG libraries. By default, this is set to 0, and therefore clips off no bases on the left side of reads from single-stranded libraries whose UDG treatment is set to 'half'. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).
Modifies bamUtil's trimBam parameter:
-L
Specify the number of bases to clip off reads from 'right' (3 prime) end of read for single-stranded half-UDG libraries.
integer
Specify the number of bases to clip off reads from 'right' (3 prime) end of read for single-stranded half-UDG libraries. By default, this is set to 0, and therefore clips off no bases on the right side of reads from single-stranded libraries whose UDG treatment is set to 'half'. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).
Modifies bamUtil's trimBam parameter:
-R
Specify to turn on soft-trimming instead of hard masking.
boolean
Specify to turn on soft-trimming instead of hard masking of bases. By default, nf-core/eager uses hard trimming, which sets trimmed bases to 'N' with quality '!' in the BAM output. Turn this on to use soft-trimming instead, which masks reads at the read ends using the CIGAR string instead.
Modifies bamUtil's trimBam parameter:
-c
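For orientation, a rough sketch of the underlying bamUtil trimBam calls (file names and clip lengths are placeholders, not pipeline defaults):
bam trimBam library.bam library.trimmed.bam -L 2 -R 2      # hard mask: 2 bp at each end set to 'N' with base quality '!'
bam trimBam library.bam library.trimmed.bam -L 2 -R 2 -c   # same positions, but soft-clipped via the CIGAR string instead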
Options for variant calling
Specify to turn on genotyping of BAM files.
boolean
Specify to turn on genotyping. --genotyping_source
and --genotyping_tool
must also be provided together with this option.
Specify which input BAM to use for genotyping.
string
Specify which BAM file to use for genotyping, depending on what BAM processing modules you have turned on. Options are: 'raw' (to use the reads used as input for damage manipulation); 'pmd' (for pmdtools output); 'trimmed' (for base-clipped BAMs. Base-clipped-PMD-filtered BAMs if both filtering and trimming are requested); 'rescaled' (for mapDamage2 rescaling output).
Warning: Depending on the parameters you provided, 'raw' can refer to all mapped reads, filtered reads (if BAM filtering has been performed), or the deduplicated reads (if deduplication was performed).
Specify which genotyper to use.
string
Specify which genotyper to use. Current options are: pileupCaller, ANGSD, GATK UnifiedGenotyper (v3.5), GATK HaplotypeCaller (v4) or FreeBayes.
Note that while UnifiedGenotyper is more suitable for low-coverage ancient DNA (HaplotypeCaller does de novo assembly around each variant site), be aware that GATK v3.5 is officially deprecated by the Broad Institute (but is used here for compatibility with MultiVCFAnalyzer).
Specify to skip generation of VCF-based variant calling statistics with bcftools.
boolean
Specify to disable running of bcftools stats
against VCF files from GATK and FreeBayes genotypers.
When the statistics are generated, the FASTA reference is automatically included so that INDEL-related statistics can be computed.
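When not skipped, the call is roughly equivalent to the following (file names are placeholders):
bcftools stats -F reference.fa sample.vcf.gz > sample.vcf.stats   # -F supplies the FASTA reference for INDEL context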
Specify the ploidy of the reference organism.
integer
2
Specify the desired ploidy value of your reference organism for genotyping with GATK or FreeBayes. E.g. if you want to allow heterozygous calls this value should be >= 2.
Modifies GATK UnifiedGenotyper parameter:
--sample_ploidy
Modifies GATK HaplotypeCaller parameter: --sample-ploidy
Modifies FreeBayes parameter: -p
Specify the base mapping quality to be used for genotyping with pileupCaller.
integer
30
Specify the minimum base quality to be used when generating the samtools mpileup used as input for genotyping with pileupCaller.
Modifies samtools mpileup parameter:
-Q
Specify the minimum mapping quality to be used for genotyping with pileupCaller.
integer
30
Specify the minimum mapping quality to be used when generating the samtools mpileup used as input for genotyping with pileupCaller.
Modifies samtools mpileup parameter:
-q
Specify the path to SNP panel in BED format for pileupCaller.
string
Specify a SNP panel in the form of a BED file of sites at which to generate a pileup for pileupCaller.
Specify the path to SNP panel in EIGENSTRAT format for pileupCaller.
string
Specify a SNP panel in EIGENSTRAT format of sites to be called with pileupCaller.
Specify the SNP calling method to use for genotyping with pileupCaller.
string
Specify the SNP calling method to use for genotyping. 'randomHaploid' will randomly sample a read overlapping the SNP and produce a homozygous genotype with the allele supported by that read (often called 'pseudohaploid' or 'pseudodiploid'). 'randomDiploid' will randomly sample two reads overlapping the SNP and produce a genotype comprised of the two alleles supported by the two reads. 'majorityCall' will produce a genotype that is homozygous for the allele that appears in the majority of reads overlapping the SNP.
Modifies pileupCaller parameters:
--randomHaploid
--randomDiploid
--majorityCall
Specify the calling mode for transitions with pileupCaller.
string
Specify if genotypes of transition SNPs should be called, set to missing, or excluded from the genotypes respectively.
Modifies pileupCaller parameters:
--skipTransitions
--transitionsMissing
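Taken together, the pileupCaller genotyping step is broadly equivalent to a pipe of this shape (paths, sample name, and the chosen method and transition flags are illustrative placeholders):
samtools mpileup -B -q 30 -Q 30 -l panel.bed -f reference.fa sample.bam | pileupCaller --randomHaploid --skipTransitions -f panel.snp --sampleNames sample1 -e sample_genotypes
# -q/-Q set the minimum mapping/base qualities; -f takes the EIGENSTRAT SNP panel; -e writes EIGENSTRAT output with the given prefix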
Specify GATK phred-scaled confidence threshold.
integer
30
Specify a GATK genotyper phred-scaled confidence threshold of a given SNP/INDEL call.
Modifies GATK UnifiedGenotyper or HaplotypeCaller parameter:
-stand_call_conf
Specify VCF file for SNP annotation of output VCF files for GATK.
string
^\S+\.vcf$
Specify VCF file for output VCF SNP annotation, e.g. if you want to annotate your VCF file with 'rs' SNP IDs. Check GATK documentation for more information. Gzip not accepted.
Specify the maximum depth coverage allowed for genotyping with GATK before down-sampling is turned on.
integer
250
Specify the maximum depth coverage allowed for genotyping before down-sampling is turned on. Any position with a coverage higher than this value will be randomly down-sampled to this many reads.
Modifies GATK UnifiedGenotyper parameter:
-dcov
Specify GATK UnifiedGenotyper output mode.
string
Specify GATK UnifiedGenotyper output mode to use when producing the output VCF (i.e. produce calls for every site or just confident sites).
Modifies GATK UnifiedGenotyper parameter:
--output_mode
Specify UnifiedGenotyper likelihood model.
string
Specify GATK UnifiedGenotyper likelihood model, i.e. whether to call only SNPs or INDELS etc.
Modifies GATK UnifiedGenotyper parameter:
--genotype_likelihoods_model
Specify to keep the BAM output of re-alignment around variants from GATK UnifiedGenotyper.
boolean
Specify to output the BAMs that have realigned reads (with GATK (v3) IndelRealigner) around possible variants for improved genotyping with GATK UnifiedGenotyper in addition to the standard VCF output.
These BAMs will be stored in the same folder as the corresponding VCF files.
Specify to supply a default base quality if a read is missing a base quality score.
integer
-1
Specify a value to set base quality scores for genotyping with GATK UnifiedGenotyper, if reads are missing this information. Might be useful if you have 'synthetically' generated reads (e.g. chopping up a reference genome). Default is set to -1
which is to not set any default quality (turned off).
Modifies GATK UnifiedGenotyper parameter:
--defaultBaseQualities
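A hedged sketch of a GATK 3.5 UnifiedGenotyper invocation using the options above (file names and values are illustrative only):
java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R reference.fa -I sample.bam -o sample.vcf -stand_call_conf 30 -dcov 250 --output_mode EMIT_ALL_SITES --genotype_likelihoods_model SNP --sample_ploidy 2 --dbsnp known_sites.vcf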
Specify GATK HaplotypeCaller output mode.
string
Specify the type of sites that should be included in the output VCF after genotyping with GATK HaplotypeCaller (i.e. produce calls for every site or just confident sites).
Modifies GATK HaplotypeCaller parameter:
--output_mode
Specify HaplotypeCaller mode for emitting reference confidence calls.
string
Specify GATK HaplotypeCaller mode for emitting reference confidence calls.
Modifies GATK HaplotypeCaller parameter:
--emit-ref-confidence
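The corresponding GATK 4 HaplotypeCaller call looks roughly like this (placeholders throughout; output mode and reference-confidence values are examples):
gatk HaplotypeCaller -R reference.fa -I sample.bam -O sample.g.vcf.gz --sample-ploidy 2 --output-mode EMIT_ALL_ACTIVE_SITES --emit-ref-confidence GVCF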
Specify minimum required supporting observations of an alternate allele to consider a variant in FreeBayes.
integer
1
Specify the minimum count of observations supporting an alternate allele within a single individual in order to evaluate the position during genotyping with FreeBayes.
Modifies FreeBayes parameter:
-C
Specify to skip over regions of high depth by discarding alignments overlapping positions where total read depth is greater than specified in FreeBayes.
integer
Specify to skip over regions of high depth by discarding alignments overlapping positions where total read depth is greater than the specified value during genotyping with FreeBayes. This is set to 0 by default, which deactivates this behaviour.
Modifies FreeBayes parameter:
-g
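A minimal FreeBayes sketch with the options above (values are illustrative; -g is only added when a non-zero depth cap is requested):
freebayes -f reference.fa -p 2 -C 1 sample.bam > sample.vcf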
Specify which ANGSD genotyping likelihood model to use.
string
Specify which genotype likelihood model to use in ANGSD.
Modifies ANGSD parameter:
-GL
Specify the output file format for ANGSD genotype likelihood results.
string
Specify what type of genotype likelihood file format will be output by ANGSD.
The options refer to the following descriptions respectively:
binary: binary output of all 10 log genotype likelihoods
beagle_binary: beagle likelihood file
binary_three: binary 3 times likelihood
text: text output of all 10 log genotype likelihoods.
See the ANGSD documentation for more information on which to select for your downstream applications.
Modifies ANGSD parameter:
-doGlf
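As a rough guide, the ANGSD genotype-likelihood call has this shape (the numeric codes shown are examples; see the ANGSD documentation for the mapping of codes to models and formats):
angsd -i sample.bam -ref reference.fa -GL 1 -doGlf 2 -doMajorMinor 1 -out sample_genolik
# -GL 1 = SAMtools likelihood model; -doGlf 2 = beagle likelihood file; -doMajorMinor is needed for beagle output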
Options for the calculation of ratio of reads to one chromosome/FASTA entry against all others.
Specify to turn on mitochondrial to nuclear ratio calculation.
boolean
Specify to turn on estimation of the ratio of mitochondrial to nuclear reads.
Specify the name of the reference FASTA entry corresponding to the mitochondrial genome.
string
MT
Specify the FASTA entry in the reference file specified as --fasta
, which acts as the mitochondrial 'chromosome' to base the ratio calculation on. The tool only accepts the first section of the header before the first space. The default chromosome name is based on the hs37d5/GRCh37 human reference genome.
Options for the calculation of mapping statistics
Specify to turn off the computation of library complexity estimation with preseq.
boolean
Specify to turn off the computation of library complexity estimation.
Specify which mode of preseq to run.
string
Specify which mode of preseq to run.
From the preseq documentation:
c_curve is used to compute the expected complexity curve of a mapped read file with a hypergeometric formula
lc_extrap is used to generate the expected yield for theoretical larger experiments and bounds on the number of distinct reads in the library and the associated confidence intervals, which is computed by bootstrapping the observed duplicate counts histogram.
Specify the step size (i.e., sampling regularity) of preseq.
integer
1000
Specify the step size of preseq's c_curve and lc_extrap methods. This can be useful when few reads are present, and allows preseq to be used for extrapolation of shallow sequencing results.
Modifies preseq parameter:
-s
Specify the maximum number of terms that preseq's lc_extrap mode will use.
integer
100
Specify the maximum number of terms that preseq's lc_extrap mode will use.
Modifies preseq lc_extrap parameter:
-x
Specify the maximum extrapolation to use for preseq's lc_extrap mode.
integer
10000000000
Specify the maximum extrapolation that preseq's lc_extrap mode will perform.
Modifies preseq lc_extrap parameter:
-e
Specify number of bootstraps to perform in preseq's lc_extrap mode.
integer
100
Specify the number of bootstraps preseq's lc_extrap mode will perform to calculate confidence intervals.
Modifies preseq lc_extrap parameter:
-n
Specify confidence interval level for preseq's lc_extrap mode.
number
0.95
Specify the allowed level of confidence intervals used for preseq's lc_extrap mode.
Modifies preseq lc_extrap parameter:
-c
Specify to turn on preseq defects mode to extrapolate without testing for defects in lc_extrap mode.
boolean
Specify to activate defects mode of preseq lc_extrap
, which runs the extrapolation without testing for defects.
Modifies preseq lc_extrap parameter:
-D
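Using the defaults listed above, the two preseq modes are roughly equivalent to the following (the BAM name is a placeholder; -B tells preseq the input is a sorted BAM):
preseq c_curve -B -s 1000 -o sample.c_curve.txt sample.bam
preseq lc_extrap -B -s 1000 -e 10000000000 -n 100 -c 0.95 -x 100 -o sample.lc_extrap.txt sample.bam
# add -D to the lc_extrap call to run in defects mode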
Specify to turn off coverage calculation with Qualimap.
boolean
Specify path to SNP capture positions in BED format for coverage calculations with Qualimap.
string
Options for calculating and filtering for characteristic ancient DNA damage patterns.
Specify to turn off ancient DNA damage calculation.
boolean
Specify to turn off computation of DNA damage profiles.
Specify the tool to use for damage calculation.
string
Specify the tool to be used for damage calculation. DamageProfiler is generally faster than mapDamage2, but the latter has an option to limit the number of reads used. This can significantly speed up the processing of very large files, where the damage estimates are already accurate after processing only a fraction of the input.
Specify the maximum misincorporation frequency that should be displayed on damage plot.
number
0.3
Specify the maximum misincorporation frequency that should be displayed in the damage plot.
Modifies DamageProfiler parameter:
-yaxis_dp_max
or mapDamage2 parameter:--ymax
Specify number of bases of each read to be considered for plotting damage estimation.
integer
25
Specify the number of bases to be considered for plotting nucleotide misincorporations.
Modifies DamageProfiler parameter:
-t
or mapDamage2 parameter:-m
Specify the length filter for DamageProfiler.
integer
100
Specify the number of bases which are considered for frequency computations.
Modifies DamageProfiler parameter:
-l
Specify the maximum number of reads to consider for damage calculation with mapDamage.
integer
Specify the maximum number of reads used for damage calculation in mapDamage2. This can be used to significantly reduce the amount of time required for damage assessment. Note that too low a value can also lead to incorrect results.
Modifies mapDamage2 parameter:
-n
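For reference, hedged sketches of the two damage-calculation calls with the defaults above (paths are placeholders, and the exact invocation the pipeline makes may differ):
damageprofiler -i sample.bam -r reference.fa -o damageprofiler/ -t 25 -l 100 -yaxis_dp_max 0.30
mapDamage -i sample.bam -r reference.fa -m 25 --ymax 0.30 -n 100000   # -n caps the number of reads used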
Options for calculating reference annotation statistics (e.g. gene coverages)
Specify to turn on calculation of number of reads, depth and breadth coverage of features in reference with bedtools.
boolean
Specify to turn on the bedtools module, producing statistics for breadth (or percent coverage), and depth (or X fold) coverages.
Modifies bedtools coverage parameter:
-mean
Specify path to GFF or BED file containing positions of features in reference file for bedtools.
string
Specify the path to a GFF/BED containing the feature coordinates (or any acceptable input for bedtools coverage
). Must be in quotes.
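A sketch of the underlying bedtools coverage calls (feature and BAM file names are placeholders):
bedtools coverage -a features.gff -b sample.bam > sample.feature_breadth.tsv        # read counts and breadth per feature
bedtools coverage -a features.gff -b sample.bam -mean > sample.feature_depth.tsv    # mean (X-fold) depth per feature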
Options for removing host-mapped reads
Specify to turn on creation of pre-adapter-removal and/or read-pair-merging FASTQ files without reads that mapped to reference (e.g. for public upload of privacy sensitive non-host data).
boolean
Specify to recreate pre-adapter-removal and/or read-pair-merging FASTQ files but without reads that mapped to reference (e.g. for public upload of privacy-sensitive non-host data).
Specify the host-mapped read removal mode.
string
Specify the host-mapped read removal mode.
Modifies extract_map_reads.py parameter: -m
Options for the estimation of contamination in human data
Specify to turn on nuclear contamination estimation for genomes with ANGSD.
boolean
Specify to run nuclear DNA contamination estimation with ANGSD.
Specify the name of the chromosome to be used for contamination estimation with ANGSD.
string
X
Specify the name of the chromosome to be used for contamination estimation with ANGSD, as specified in your FASTA/BAM header, e.g. 'X' for hs37d5 or 'chrX' for hg19.
Specify the first position on the chromosome to be used for contamination estimation with ANGSD.
integer
5000000
Specify the beginning of the genetic range that should be utilised for nuclear contamination estimation with ANGSD.
Specify the last position on the chromosome to be used for contamination estimation with ANGSD.
integer
154900000
Specify the end of the genetic range that should be utilised for nuclear contamination estimation with ANGSD.
Specify the minimum mapping quality reads should have for contamination estimation with ANGSD.
integer
30
Specify the minimum mapping quality reads should have for contamination estimation with ANGSD.
Modifies ANGSD parameter:
-minMapQ
Specify the minimum base quality reads should have for contamination estimation with ANGSD.
integer
30
Specify the minimum base quality reads should have for contamination estimation with ANGSD.
Modifies ANGSD parameter:
-minQ
Specify path to HapMap file of chromosome for contamination estimation with ANGSD.
string
${projectDir}/assets/angsd_resources/HapMapChrX.gz
Specify a path to HapMap file of chromosome for contamination estimation with ANGSD. The haplotype map, or "HapMap", records the location of haplotype blocks and their tag SNPs.
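The estimation itself follows ANGSD's standard two-step X-chromosome workflow, roughly as below (file names are placeholders; 'contamination' is the helper program shipped in ANGSD's misc/ directory):
angsd -i sample.bam -r X:5000000-154900000 -doCounts 1 -iCounts 1 -minMapQ 30 -minQ 30 -out sample.angsd
contamination -a sample.angsd.icnts.gz -h HapMapChrX.gz > sample.X_contamination.txt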
Options for the calculation of genetic sex of human individuals.
Specify to turn on sex determination for genomes mapped to human reference genomes with Sex.DetERRmine.
boolean
Specify to run genetic sex determination.
Specify path to SNP panel in BED format for error bar calculation.
string
Specify a BED file with SNPs to be used for X-/Y-rate calculation. Running without this parameter will considerably increase runtime, and render the resulting error bars untrustworthy. Theoretically, any set of SNPs that are distant enough that two SNPs are unlikely to be covered by the same read can be used here. The programme was coded with the 1240k panel in mind.
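As a hedged sketch, assuming the piped interface described in the Sex.DetERRmine documentation (BED file, BAM list, and quality cut-offs are placeholders):
samtools depth -a -q 30 -Q 30 -b snp_panel.bed -f bamlist.txt | sexdeterrmine -f bamlist.txt > sexdet.json
# samtools depth streams per-position depths at the panel sites; the BAM list labels the samples in the output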