scripts to estimate genome size and coverage from kmer distribution generated by jellyfish
View the Project on GitHub josephryan/estimate_genome_size.pl
These scripts can be used to estimate genome size and coverage from fastq files containing sequencing reads
you will need JELLYFISH: http://www.cbcb.umd.edu/software/jellyfish/
If you want to use jellyplot.pl you will need GNUPLOT: http://www.gnuplot.info/
A typical session:
$ # count k-mers (see jellyfish documentation for options)
gzip -dc reads1.fastq.gz reads2.fastq.gz | jellyfish count -m 31 -o fastq.counts -C -s 10000000000 -U 500 -t 30 /dev/fd/0
# generate a histogram
jellyfish histo fastq.counts_0 > fastq.counts_0.histo
# generate a pdf graph of the histogram
jellyplot.pl fastq.counts_0.histo
# look at fastq.counts_0.histo.pdf and identify the approximate peak
# use find_valleys.pl to help pinpoint the actual peak
find_valleys.pl fastq.counts_0.histo
# estimate the size and coverage
estimate_genome_size.pl --kmer=31 --peak=42 --fastq=reads1.fastq.gz reads2.fastq.gz
NOTES about the typical session:
1. it is helpful to run with multiple kmer sizes to see if this has an effect (change -m in first jellyfish command)
2. -s should be adjusted to the size of your RAM in the first jellyfish command (see jellyfish manual)
3. The first jellyfish command includes '-U 500' this tells jellyfish not to output k-mer with count > 500. It is recommended to run without this the first time and then rerun with an uppercount that includes the first peak. This makes the files much easier to manage.
https://banana-slug.soe.ucsc.edu/bioinformatic_tools:jellyfish
http://seqanswers.com/forums/archive/index.php/t-10988.html
https://banana-slug.soe.ucsc.edu/bioinformatic_tools:quake
Simulate some next generation data based on an already-sequenced genome. I used the program ART to do this with the Hydra magnipapillata genome:
http://www.niehs.nih.gov/research/resources/software/biostatistics/art/
art_illumina --paired --in Hm_genomic.fa --out Hm_genomic_wildtype --len 100 --fcov 43 --mflen 180 --sdev 3.5
I also simulated perfect paired-end data with a sliding window. In both cases I recovered the correct coverage and size.
Joseph F. Ryan, Ph.D. (@josephryan)
There is Documentation on the wiki.
If you have a negative or positive experience with estimate_genome_size.pl, I am interested in hearing about it. E-Mail: josephryan@yahoo.com or leave something on the wiki.
Ryan, J. F. (2013). estimate_genome_size.pl (version 0.03) [Computer software]. Bergen, Norway: Sars International Centre for Marine Molecular Biology. Retrieved from http://josephryan.github.com/estimate_genome_size.pl/