Scripts to estimate genome size and coverage from the k-mer distribution generated by jellyfish

These scripts can be used to estimate genome size and coverage from fastq files containing sequencing reads.

You will need JELLYFISH.

If you want to generate the PDF graph of the histogram, you will also need GNUPLOT.

A typical session:

# count k-mers (see jellyfish documentation for options)
gzip -dc reads1.fastq.gz reads2.fastq.gz | jellyfish count -m 31 -o fastq.counts -C -s 10000000000 -U 500 -t 30 /dev/fd/0

# generate a histogram
jellyfish histo fastq.counts_0 > fastq.counts_0.histo

# generate a pdf graph of the histogram fastq.counts_0.histo

# look at fastq.counts_0.histo.pdf and identify the approximate peak

# use to help pinpoint the actual peak fastq.counts_0.histo

# estimate the size and coverage --kmer=31 --peak=42 --fastq=reads1.fastq.gz reads2.fastq.gz
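As a rough cross-check when picking the peak by eye, the histogram file can also be scanned directly. The sketch below is not one of the scripts in this package; `find_peak` is a hypothetical helper that assumes the two-column "count frequency" layout written by jellyfish histo. It skips the initial falling slope of low-count (mostly erroneous) k-mers and reports the count with the highest frequency after that. On noisy histograms, or when the last bin accumulates all high-count k-mers, it may pick the wrong peak, so compare the result with the graph.

```shell
# Hypothetical helper: print the k-mer count at the main peak of a histogram.
# Assumes each line is "<count> <number of k-mers with that count>",
# sorted by count.
find_peak() {
  awk '
    # The first time the frequency rises, we have left the error slope.
    NR > 1 && !valley && $2 > prev { valley = 1 }
    # After the valley, remember the count with the highest frequency.
    valley && $2 > best { best = $2; peak = $1 }
    { prev = $2 }
    END { if (valley) print peak; else exit 1 }
  ' "$1"
}

# Example use (file name taken from the session above):
# find_peak fastq.counts_0.histo
```

The value printed is a candidate for the --peak option used in the last step of the session.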

NOTES about the typical session:

1. It is helpful to run with multiple k-mer sizes to see whether this affects the estimate (change -m in the first jellyfish command).

2. In the first jellyfish command, -s should be adjusted to the size of your RAM (see the jellyfish manual).

3. The first jellyfish command includes '-U 500', which tells jellyfish not to output k-mers with count > 500. It is recommended to run without this option the first time, and then rerun with an upper count that includes the first peak. This makes the files much easier to manage.
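Notes 1 and 3 can be combined into a small sweep over k-mer sizes. The following is only a sketch: `kmer_sweep_commands` is a hypothetical helper that prints the jellyfish commands from the typical session for each k-mer size given, without -U (as suggested for a first run). The output names `fastq.k<K>.counts` are made up for illustration, and -s and -t are copied from the session above and should be adjusted to your machine.

```shell
# Hypothetical helper: print count and histo commands for each k-mer size.
# Nothing is executed here; review the output, then pipe it to sh if it
# looks right.
kmer_sweep_commands() {
  for k in "$@"; do
    printf 'gzip -dc reads1.fastq.gz reads2.fastq.gz | jellyfish count -m %s -o fastq.k%s.counts -C -s 10000000000 -t 30 /dev/fd/0\n' "$k" "$k"
    printf 'jellyfish histo fastq.k%s.counts_0 > fastq.k%s.counts_0.histo\n' "$k" "$k"
  done
}

# Print the commands for three k-mer sizes.
kmer_sweep_commands 21 25 31
```

Printing the commands instead of running them makes it easy to check the options before starting a long job.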

See the following for the principle behind the script
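A relation commonly used for this kind of estimate is M = N * (L - K + 1) / L, where M is the peak k-mer depth, N the base coverage, L the mean read length and K the k-mer size. The genome size is then the total number of bases divided by N. The sketch below works through that arithmetic; it assumes this relation, and it is not the code of the scripts in this package. `fastq_stats` and `estimate_size` are hypothetical helpers, and `fastq_stats` assumes four-line fastq records.

```shell
# Hypothetical helper: read fastq on stdin, print "<total bases> <reads>".
# Assumes each record is exactly four lines (sequence on the second line).
fastq_stats() {
  awk 'NR % 4 == 2 { bases += length($0); reads++ }
       END { printf "%.0f %.0f\n", bases, reads }'
}

# Hypothetical helper: estimate coverage and genome size.
# $1 = k-mer size, $2 = peak k-mer depth, $3 = total bases, $4 = number of reads
estimate_size() {
  awk -v k="$1" -v peak="$2" -v bases="$3" -v reads="$4" 'BEGIN {
    len = bases / reads                 # mean read length L
    cov = peak * len / (len - k + 1)    # N = M * L / (L - K + 1)
    printf "coverage\t%.1f\ngenome_size\t%.0f\n", cov, bases / cov
  }'
}

# Illustrative numbers only: 60 million reads of 100 bases, k = 31, peak = 42.
estimate_size 31 42 6000000000 60000000
```

With real data, the two inputs for `estimate_size` could come from something like `gzip -dc reads1.fastq.gz reads2.fastq.gz | fastq_stats`.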

If you don't believe this works

Simulate some next-generation sequencing data based on an already-sequenced genome. I used the program ART to do this with the Hydra magnipapillata genome:

art_illumina --paired --in Hm_genomic.fa --out Hm_genomic_wildtype --len 100 --fcov 43 --mflen 180 --sdev 3.5

I also simulated perfect paired-end data with a sliding window. In both cases I recovered the correct coverage and size.

Authors and Contributors

Joseph F. Ryan, Ph.D. (@josephryan)

Support or Contact

There is documentation on the wiki.

If you have a negative or positive experience with these scripts, I am interested in hearing about it. E-mail me or leave something on the wiki.


Ryan, J. F. (2013). (version 0.03) [Computer software]. Bergen, Norway: Sars International Centre for Marine Molecular Biology. Retrieved from