UCL Division of Biosciences

vcfProcess

- An R function for post-processing VCF SNP files of haploid organisms

Variant calling software typically writes its output in the Variant Call Format (VCF). VCF files present information on the genomic variants of the tested samples, such as genotype, read depth and quality scores, in a standardised format. Unfortunately, many frequently used phylogenetic and population genetic tools, such as BEAST and RAxML, do not accept VCF as input.

In addition, variant caller output often contains low-quality calls at some sites because of low coverage or sequencing error. These calls need to be removed or replaced with a character denoting 'unknown'. Variant calls in regions classified as repeats, or derived from mobile elements, may also be removed before further analysis.
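The filtering described above can be sketched as follows. This is an illustrative Python sketch, not the vcfProcess R code: the function names, the depth and quality thresholds, and the choice to drop excluded sites rather than mask them are all assumptions made for the example.

```python
# Hypothetical thresholds for illustration; the defaults used by
# vcfProcess are not stated on this page.
MIN_DEPTH = 10
MIN_QUAL = 20.0


def in_excluded(pos, regions):
    """Return True if a 1-based position lies inside any (start, end) region."""
    return any(start <= pos <= end for start, end in regions)


def mask_call(base, depth, qual, pos, excluded_regions):
    """Return the base, 'N' for a low-confidence call, or None to drop the site.

    Assumption: sites in excluded regions (e.g. repeats, mobile elements)
    are dropped, while low-depth or low-quality calls are kept as 'N'.
    """
    if in_excluded(pos, excluded_regions):
        return None
    if depth < MIN_DEPTH or qual < MIN_QUAL:
        return "N"
    return base
```

For example, with an excluded region of (200, 300), a well-supported call at position 100 is kept, a call with depth 3 at the same position becomes 'N', and any call at position 250 is dropped.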

Here we present an R function that converts variant caller output in VCF format into files that downstream analysis tools accept as input, such as FASTA. It also has built-in functionality to remove variants that lie in excluded regions or have low quality, or to assign them an uncertain-call character.
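To illustrate the target format, the sketch below writes per-sample SNP calls as a FASTA alignment. It is a Python illustration under assumed inputs, not the vcfProcess implementation: the function name `snps_to_fasta`, the dictionary input and the 60-character line width are choices made for the example.

```python
def snps_to_fasta(calls, line_width=60):
    """Format SNP calls as FASTA text.

    calls: dict mapping sample name -> list of single-character calls,
    one per retained SNP site (same site order for every sample).
    """
    lengths = {len(bases) for bases in calls.values()}
    if len(lengths) > 1:
        raise ValueError("every sample needs a call at every site")

    records = []
    for sample, bases in calls.items():
        seq = "".join(bases)
        # Wrap long sequences so each line has at most line_width characters.
        wrapped = [seq[i:i + line_width] for i in range(0, len(seq), line_width)]
        records.append(">" + sample + "\n" + "\n".join(wrapped))
    return "\n".join(records) + "\n"
```

Because every sample contributes one character per site, the sequences have equal length and form an alignment, which is the kind of input that phylogenetic tools generally expect.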

At present this function is intended for haploid organisms whose variants have been called with software primarily designed for diploid organisms, such as SAMtools or GATK. The function processes only single nucleotide polymorphisms (SNPs) and does not take insertions and deletions (indels) into account, although there is an option to output a file containing any indels for further analysis. The output is a set of files containing a high-confidence assigned nucleotide (or uncertain character) for each individual at each high-quality SNP position, in a format suitable for further analysis.
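One way to picture the haploid handling is the Python sketch below, which collapses a diploid-style genotype into a single base and sets indels aside. It is a sketch under stated assumptions rather than the rule vcfProcess applies: resolving heterozygous calls by the majority allele's share of reads, and the `min_fraction` threshold of 0.75, are hypothetical choices for illustration.

```python
BASES = {"A", "C", "G", "T"}


def is_snp(ref, alts):
    """True if REF and every ALT allele is a single uppercase nucleotide."""
    return all(allele in BASES for allele in [ref, *alts])


def haploid_call(ref, alts, gt, allele_depths, min_fraction=0.75):
    """Collapse a diploid-style genotype (e.g. '0/1') into one base.

    Returns None for non-SNP records (e.g. indels), 'N' when the call is
    missing or ambiguous, otherwise the assigned nucleotide.
    allele_depths: read counts per allele, REF first (as in the AD field).
    min_fraction: hypothetical threshold, not a vcfProcess default.
    """
    if not is_snp(ref, alts):
        return None
    alleles = [ref, *alts]
    indices = gt.replace("|", "/").split("/")
    if "." in indices:
        return "N"  # missing genotype
    if len(set(indices)) == 1:
        return alleles[int(indices[0])]  # homozygous call
    # Heterozygous call in a haploid genome: accept the majority allele
    # only if it is supported by a large enough share of the reads.
    total = sum(allele_depths)
    if total == 0:
        return "N"
    best = max(range(len(allele_depths)), key=lambda i: allele_depths[i])
    if allele_depths[best] / total >= min_fraction:
        return alleles[best]
    return "N"
```

Under these assumptions, a '0/1' call with allele depths of 2 and 28 is assigned the alternative base, whereas an even 15 and 15 split is assigned 'N'.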

For more information and user settings, please see the help documentation, or feel free to contact the author: b.sobkowiak.12@ucl.ac.uk