Estimate genome size.pl

scripts to estimate genome size and coverage from kmer distribution generated by jellyfish

View the Project on GitHub josephryan/estimate_genome_size.pl

Welcome to estimate_genome_size.pl GitHub Page

These scripts can be used to estimate genome size and coverage from fastq files containing sequencing reads

you will need JELLYFISH: http://www.cbcb.umd.edu/software/jellyfish/

If you want to use jellyplot.pl you will need GNUPLOT: http://www.gnuplot.info/

A typical session:

$ # count k-mers (see jellyfish documentation for options)
gzip -dc reads1.fastq.gz reads2.fastq.gz | jellyfish count -m 31 -o fastq.counts -C -s 10000000000 -U 500 -t 30 /dev/fd/0 

# generate a histogram
jellyfish histo fastq.counts_0 > fastq.counts_0.histo

# generate a pdf graph of the histogram
jellyplot.pl fastq.counts_0.histo

# look at fastq.counts_0.histo.pdf and identify the approximate peak

# use find_valleys.pl to help pinpoint the actual peak
find_valleys.pl fastq.counts_0.histo

# estimate the size and coverage
estimate_genome_size.pl --kmer=31 --peak=42 --fastq=reads1.fastq.gz reads2.fastq.gz

NOTES about the typical session:

1. it is helpful to run with multiple kmer sizes to see if this has an effect (change -m in first jellyfish command)

2. -s should be adjusted to the size of your RAM in the first jellyfish command (see jellyfish manual)

3. The first jellyfish command includes '-U 500' this tells jellyfish not to output k-mer with count > 500. It is recommended to run without this the first time and then rerun with an uppercount that includes the first peak. This makes the files much easier to manage.

See the following for the principle behind the script

https://banana-slug.soe.ucsc.edu/bioinformatic_tools:jellyfish

http://seqanswers.com/forums/archive/index.php/t-10988.html

https://banana-slug.soe.ucsc.edu/bioinformatic_tools:quake

If you don't believe this works

Simulate some next generation data based on an already-sequenced genome. I used the program ART to do this with the Hydra magnipapillata genome:

http://www.niehs.nih.gov/research/resources/software/biostatistics/art/

art_illumina --paired --in Hm_genomic.fa --out Hm_genomic_wildtype --len 100 --fcov 43 --mflen 180 --sdev 3.5

I also simulated perfect paired-end data with a sliding window. In both cases I recovered the correct coverage and size.

Authors and Contributors

Joseph F. Ryan, Ph.D. (@josephryan)

Support or Contact

There is Documentation on the wiki.

If you have a negative or positive experience with estimate_genome_size.pl, I am interested in hearing about it. E-Mail: josephryan@yahoo.com or leave something on the wiki.

Citing estimate_genome_size.pl

Ryan, J. F. (2013). estimate_genome_size.pl (version 0.03) [Computer software]. Bergen, Norway: Sars International Centre for Marine Molecular Biology. Retrieved from http://josephryan.github.com/estimate_genome_size.pl/