These scripts can be used to estimate genome size and coverage from fastq files containing sequencing reads
you will need JELLYFISH: http://www.cbcb.umd.edu/software/jellyfish/
If you want to use jellyplot.pl you will need GNUPLOT: http://www.gnuplot.info/
A typical session:
$ # count k-mers (see jellyfish documentation for options) gzip -dc reads1.fastq.gz reads2.fastq.gz | jellyfish count -m 31 -o fastq.counts -C -s 10000000000 -U 500 -t 30 /dev/fd/0 # generate a histogram jellyfish histo fastq.counts_0 > fastq.counts_0.histo # generate a pdf graph of the histogram jellyplot.pl fastq.counts_0.histo # look at fastq.counts_0.histo.pdf and identify the approximate peak # use find_valleys.pl to help pinpoint the actual peak find_valleys.pl fastq.counts_0.histo # estimate the size and coverage estimate_genome_size.pl --kmer=31 --peak=42 --fastq=reads1.fastq.gz reads2.fastq.gz
NOTES about the typical session:
1. it is helpful to run with multiple kmer sizes to see if this has an effect (change -m in first jellyfish command)
2. -s should be adjusted to the size of your RAM in the first jellyfish command (see jellyfish manual)
3. The first jellyfish command includes '-U 500' this tells jellyfish not to output k-mer with count > 500. It is recommended to run without this the first time and then rerun with an uppercount that includes the first peak. This makes the files much easier to manage.
Simulate some next generation data based on an already-sequenced genome. I used the program ART to do this with the Hydra magnipapillata genome:
art_illumina --paired --in Hm_genomic.fa --out Hm_genomic_wildtype --len 100 --fcov 43 --mflen 180 --sdev 3.5
I also simulated perfect paired-end data with a sliding window. In both cases I recovered the correct coverage and size.
Joseph F. Ryan, Ph.D. (@josephryan)
There is Documentation on the wiki.
If you have a negative or positive experience with estimate_genome_size.pl, I am interested in hearing about it. E-Mail: email@example.com or leave something on the wiki.
Ryan, J. F. (2013). estimate_genome_size.pl (version 0.03) [Computer software]. Bergen, Norway: Sars International Centre for Marine Molecular Biology. Retrieved from http://josephryan.github.com/estimate_genome_size.pl/