Tutorial

To begin, download the latest version of Boiler from https://github.com/jpritt/boiler. Add the main directory to your path and make sure you have Python 3 or later installed. You will also need SAMtools, which you can download from samtools.sourceforge.net.

Download the SAM dataset here and move it to your working directory.

Create an output directory and compress the dataset by running

mkdir compressed
python3 boiler.py compress --frag-len-z-cutoff 0.125 accepted_hits.sam compressed/compressed.bl

If all goes well, you should see something like this (exact output may change with future versions):

Set fragment length cutoff to z=0.125000 (33165) based on length distribution
0.84 % of pairs are longer than the cutoff
Using fragment length cutoff of 33165
Not splitting mates on different strands
Not splitting discordant
0 cross-bundle reads unmatched
Minimum bundle length: 12
Maximum bundle length: 206957
Average bundle length: 2514
1097 cross-bundle buckets
Compressed size: 29682
Approximately 3979761 / 6972093 = 57.081295% of compressed file is coverage
Finished compressing

You should now have a file compressed/compressed.bl roughly 4.3 MB in size.

Now let’s query all of the bundles that Boiler found in chromosome 2L:

python3 boiler.py query --bundles --chrom 2L compressed/compressed.bl bundles.txt

You should now have a file bundles.txt containing all of the bundles that Boiler found on chromosome 2L, one bundle per line, given as a start and an end coordinate. Type

head bundles.txt

to see the first few lines of this file:

7478        9485
9841        21430
21825       23108
23180       24034
24856       25219
25404       26251
26333       33987
34045       35094
36182       37317
37538       37931
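
For downstream scripting, the bundle list can be loaded into Python. The following is a minimal sketch, assuming each line holds exactly two whitespace-separated integers (start, then end) as in the listing above, and assuming each bundle is a half-open interval [start, end). The function name parse_bundles is hypothetical and is not part of Boiler.

```python
from typing import List, Tuple


def parse_bundles(text: str) -> List[Tuple[int, int]]:
    """Parse whitespace-separated 'start end' lines into (start, end) tuples."""
    bundles = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank lines or lines with an unexpected layout
        bundles.append((int(fields[0]), int(fields[1])))
    return bundles


# First three lines of the listing above, used as sample input.
sample = """7478        9485
9841        21430
21825       23108
"""

bundles = parse_bundles(sample)
for start, end in bundles:
    # Length assumes the half-open interval [start, end).
    print(start, end, end - start)
```

To read the real file instead of the sample, pass the contents of bundles.txt, for example parse_bundles(open("bundles.txt").read()).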

To query the coverage in the first bundle, run

python3 boiler.py query --coverage --chrom 2L --start 7478 --end 9485 compressed/compressed.bl coverage.txt
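
Once this command has written coverage.txt, the result can be read from Python. The following is a minimal sketch, assuming the file holds a single comma-separated vector with one value per base, starting at the --start coordinate. The function names and the sample values are hypothetical; the sample is made up for illustration and is not real Boiler output.

```python
from typing import Dict, List


def parse_coverage(text: str) -> List[float]:
    """Parse one comma-separated coverage vector into a list of floats."""
    return [float(tok) for tok in text.strip().split(",") if tok.strip()]


def coverage_by_position(values: List[float], start: int) -> Dict[int, float]:
    """Map each value to its coordinate, assuming values[0] is at `start`."""
    return {start + i: v for i, v in enumerate(values)}


# Made-up values for illustration only.
sample = "0,0,1,2,2,1\n"

values = parse_coverage(sample)
by_pos = coverage_by_position(values, start=7478)
print(len(values), max(values), by_pos[7480])
```

If the real file follows this layout, the vector for the interval [7478, 9485) would be expected to contain 9485 - 7478 = 2007 values, which makes a quick sanity check on the parsed length.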

coverage.txt should now contain a comma-separated vector giving the coverage at every base in the interval [7478, 9485). Next, to query the reads in the first bundle, run

python3 boiler.py query --reads --chrom 2L --start 7478 --end 9485 compressed/compressed.bl reads.sam

reads.sam is a SAM file with no header, containing all the aligned reads in the interval [7478, 9485). Type

head reads.sam

to see the first few reads in this bundle, which should look like this:

2L:0        0       2L      7772    50      76M     *       0       0       *       *       NH:i:1
2L:1        0       2L      7795    50      76M     *       0       0       *       *       NH:i:1
2L:2        0       2L      7808    50      76M     *       0       0       *       *       NH:i:1
2L:3        0       2L      7863    50      76M     *       0       0       *       *       NH:i:1
2L:4        0       2L      8073    50      44M112N32M      *       0       0       *       *       XS:A:+  NH:i:1
2L:5        0       2L      8595    50      76M     *       0       0       *       *       NH:i:1
2L:6        0       2L      8781    50      76M     *       0       0       *       *       NH:i:1
2L:7        0       2L      8852    50      76M     *       0       0       *       *       NH:i:1
2L:8        0       2L      8963    50      76M     *       0       0       *       *       NH:i:1
2L:9        0       2L      8969    50      76M     *       0       0       *       *       NH:i:1
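
The records above can be inspected without any SAM library. The following is a minimal sketch that extracts the name, position, and CIGAR string of each read and computes the last reference base it covers, using the standard SAM rule that the M, D, N, = and X operations consume reference bases. The names Read, parse_sam_line, and reference_length are hypothetical helpers, not part of Boiler or SAMtools.

```python
import re
from typing import NamedTuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


class Read(NamedTuple):
    name: str
    chrom: str
    pos: int  # 1-based leftmost aligned position, as in the SAM POS field
    cigar: str


def parse_sam_line(line: str) -> Read:
    """Extract QNAME, RNAME, POS and CIGAR from one SAM alignment line."""
    # SAM fields are tab-separated; split() also accepts the spaces shown above.
    f = line.split()
    return Read(name=f[0], chrom=f[2], pos=int(f[3]), cigar=f[5])


def reference_length(cigar: str) -> int:
    """Number of reference bases spanned by the alignment, introns included."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)


# Two records copied from the listing above.
sample = [
    "2L:3        0       2L      7863    50      76M     *       0       0       *       *       NH:i:1",
    "2L:4        0       2L      8073    50      44M112N32M      *       0       0       *       *       XS:A:+  NH:i:1",
]

for line in sample:
    read = parse_sam_line(line)
    last_base = read.pos + reference_length(read.cigar) - 1
    print(read.name, read.pos, last_base)
```

For the spliced read 2L:4, the CIGAR string 44M112N32M spans 44 + 112 + 32 = 188 reference bases, so the alignment covers positions 8073 through 8260.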

Finally, let’s decompress the compressed file by running

python3 boiler.py decompress compressed/compressed.bl expanded.sam

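The resulting SAM file is unsorted. To sort it and convert it to BAM, run

samtools view -bS expanded.sam | samtools sort - expanded

This writes expanded.bam using the legacy sort syntax, in which the last argument is an output prefix. SAMtools 1.3 and later dropped that syntax; there, the equivalent command is

samtools view -b expanded.sam | samtools sort -o expanded.bam -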