The 4DN Consortium will generate a large volume of Hi-C data, which will be processed and made available to the scientific community via the 4DN Data Portal. The raw sequencing reads will be made available as FASTQ files (or lossless BAM files), and the mapped reads will be made available as BAM files. However, one or more file formats are still needed to store and distribute pairwise contacts and contact matrices. These formats must be easy to use and should also be efficient in terms of disk space and access time.
The 4DN Omics Working Group has investigated a variety of candidate formats for the different stages of the processing pipeline. (See this document for details on desired features and considerations.) In particular, two general-purpose binary formats exist to describe contact matrices: the Juicer/Juicebox .hic format (https://github.com/theaidenlab/juicebox) and the Cooler .cool format (https://github.com/mirnylab/cooler). A small study comparing running times and disk usage found the two formats to be roughly comparable. The remaining question is whether they differ in terms of usability.
We are currently performing a study to evaluate the usability of the two formats. Note that the formats contain very similar information; what is being evaluated is the tooling around each format and how easy it will be to develop further tools around it (think BAM/SAM, samtools, Rsamtools, etc.). The focus of the study is on the end user who would like to perform various bioinformatics analyses on these files. In certain cases, however, users may also need to generate the files, and evaluation of such tasks is also welcome. This page provides a basic introduction to these file formats and pointers to files for the usability study.
When you complete the study, please share your experience with us using this Google Form.
Introduction and Documentation
The Juicer/Juicebox suite of tools and the .hic file format are described on the Aiden Lab website. The Juicer pipeline takes FASTQ files as input and creates .hic files containing normalized contact matrices at multiple resolutions, as well as domain and loop calls. The Juicebox tools read in .hic files and provide visualization and advanced customization of downstream analyses. To create .hic files, users can either start with FASTQ files and run the whole Juicer pipeline, or start with the intermediate text file of valid read pairs, the merged_nodups “pre” format. For advanced users, the .hic file schema is available here.
The .hic format features:
- Contact matrices in multiple resolutions and summary statistics stored in one file
- Java and C bindings
- Command line tools
- Extant suite of analysis tools
- Extant visualization tool.
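To make the idea of contact matrices at multiple resolutions concrete, here is a minimal Python sketch that bins a handful of read pairs at several bin sizes. It is not the .hic layout and does not use any Juicer code; the tuple layout for read pairs, the function names `bin_contacts` and `build_multires`, and the example coordinates are all assumptions made for illustration.

```python
from collections import Counter


def bin_contacts(pairs, resolution):
    """Count contacts per (bin1, bin2) at one resolution (bin size in bp).

    `pairs` holds (chrom1, pos1, chrom2, pos2) tuples. This tuple layout is
    an assumption for illustration, not a real file format.
    """
    counts = Counter()
    for chrom1, pos1, chrom2, pos2 in pairs:
        a = (chrom1, pos1 // resolution)
        b = (chrom2, pos2 // resolution)
        # The matrix is symmetric, so keep a single canonical ordering.
        key = (a, b) if a <= b else (b, a)
        counts[key] += 1
    return counts


def build_multires(pairs, resolutions):
    """Bin the same read pairs once per requested resolution."""
    pairs = list(pairs)
    return {res: bin_contacts(pairs, res) for res in resolutions}


# Hypothetical read pairs; the coordinates are made up.
pairs = [
    ("chr1", 1_200, "chr1", 48_000),
    ("chr1", 3_900, "chr1", 41_500),
    ("chr1", 12_000, "chr2", 7_000),
    ("chr2", 7_500, "chr1", 12_800),
]

matrices = build_multires(pairs, resolutions=[5_000, 10_000, 50_000])
for res, counts in sorted(matrices.items()):
    print(res, dict(counts))
```

Coarser resolutions merge neighboring bins, so the same four read pairs yield fewer, higher-count matrix entries as the bin size grows.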
.hic files for a number of papers can be downloaded from this link. Reviewers can use any sample of their choice, from this set or elsewhere. From this set, we recommend that reviewers use one, two, or three of the following samples to test different sequencing depths.
- GM12878 sample from Rao and Huntley et al. Cell 2014 with 4.9B contacts.
- K562 sample from Rao and Huntley et al. Cell 2014 with 930M contacts.
- HUVEC sample from Rao and Huntley et al. Cell 2014 with 460M contacts.
Introduction and Documentation
The cooler library and .cool file format are described in the Mirny Lab GitHub repository. The cooler Python library and command line tools take in a text file of read pairs or contact matrices; store the information in the .cool format; include utilities for performing out-of-core contact matrix balancing; and perform fast range queries on a contact matrix. The .cool format is a sparse, compressed, binary persistent storage format for Hi-C contact matrices based on HDF5. HDF5 is a general-purpose binary container format for large scientific datasets, with bindings in multiple languages. Therefore, .cool files can be read natively as HDF5 files in different languages.
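As a rough illustration of why sparse storage and fast range queries fit together, the sketch below keeps only the non-zero upper-triangle entries of a contact matrix, sorted by row, and uses binary search to extract a sub-matrix. It is an in-memory toy and should not be read as the cooler API or as the actual HDF5 layout of a .cool file; the class name `SparseContactMatrix`, its `fetch` method, and the example counts are hypothetical.

```python
from bisect import bisect_left


class SparseContactMatrix:
    """Symmetric contact matrix stored as sorted (bin1, bin2, count) rows."""

    def __init__(self, n_bins, contacts):
        self.n_bins = n_bins
        merged = {}
        for i, j, count in contacts:
            # Store each symmetric entry once, with bin1 <= bin2.
            key = (min(i, j), max(i, j))
            merged[key] = merged.get(key, 0) + count
        # Sorting by (bin1, bin2) makes every row a contiguous slice.
        self.pixels = sorted((i, j, c) for (i, j), c in merged.items())
        self._bin1 = [p[0] for p in self.pixels]

    def fetch(self, lo, hi):
        """Return the dense square sub-matrix for bins lo..hi-1."""
        size = hi - lo
        dense = [[0] * size for _ in range(size)]
        # Binary search finds the slice of rows lo..hi-1 without a full scan.
        start = bisect_left(self._bin1, lo)
        stop = bisect_left(self._bin1, hi)
        for i, j, count in self.pixels[start:stop]:
            if j < hi:  # j >= i >= lo already holds for stored entries
                dense[i - lo][j - lo] = count
                dense[j - lo][i - lo] = count
        return dense


# Hypothetical counts for a 4-bin matrix; (3, 1) and the repeated (0, 1)
# show that entries are mirrored and merged on the way in.
contacts = [(0, 0, 5), (0, 1, 3), (1, 2, 4), (2, 3, 7), (3, 1, 2), (0, 1, 1)]
m = SparseContactMatrix(4, contacts)
print(m.fetch(1, 4))
```

Only the five distinct non-zero entries are stored, and a query touches just the rows inside the requested window.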
The .cool format features:
- Flexibility to store one or multiple matrices with varying bin sizes
- Python library
- Command line tools
- HDF5 container, which has bindings in many programming languages
- Out-of-core iterative matrix balancing that can work on very large matrices.
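The balancing feature can be sketched as follows, assuming an iterative-correction style update in which each bin's weight is repeatedly divided by its normalized row sum. The point of the sketch is that each pass only needs the matrix entries one chunk at a time plus one weight per bin, so the full matrix never has to sit in memory. This is not cooler's implementation; the function names, the chunk size, the tolerance, and the example counts are assumptions, and the input is assumed to contain at least one non-zero entry.

```python
def iter_chunks(pixels, chunk_size):
    """Yield slices of the pixel list, standing in for chunked reads from disk."""
    for start in range(0, len(pixels), chunk_size):
        yield pixels[start:start + chunk_size]


def balanced_marginals(pixels, weights, n_bins, chunk_size=2):
    """Row sums of the weighted matrix, accumulated one chunk at a time."""
    marginals = [0.0] * n_bins
    for chunk in iter_chunks(pixels, chunk_size):
        for i, j, count in chunk:
            value = count * weights[i] * weights[j]
            marginals[i] += value
            if i != j:  # off-diagonal entries are stored once but count twice
                marginals[j] += value
    return marginals


def balance(pixels, n_bins, chunk_size=2, tol=1e-12, max_iter=500):
    """Return per-bin weights that make all non-empty row sums close to 1."""
    weights = [1.0] * n_bins
    mean = 1.0
    for _ in range(max_iter):
        marginals = balanced_marginals(pixels, weights, n_bins, chunk_size)
        covered = [m for m in marginals if m > 0]
        mean = sum(covered) / len(covered)
        variance = sum((m / mean - 1.0) ** 2 for m in covered) / len(covered)
        if variance < tol:
            break
        # Divide each weight by its normalized row sum; skip empty bins.
        weights = [w * mean / m if m > 0 else w
                   for w, m in zip(weights, marginals)]
    # Rescale so that the balanced row sums are close to 1.
    scale = mean ** 0.5
    return [w / scale for w in weights]


# Hypothetical upper-triangle entries (bin1, bin2, count) of a 3-bin matrix.
pixels = [(0, 0, 10), (0, 1, 4), (0, 2, 2), (1, 1, 8), (1, 2, 6), (2, 2, 12)]
weights = balance(pixels, n_bins=3)
print(balanced_marginals(pixels, weights, n_bins=3))
```

A balanced matrix entry is then `count * weights[i] * weights[j]`, which can be computed on the fly at query time instead of being stored.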
.cool files for a number of papers can be downloaded from this link. Reviewers can use any sample of their choice, from this set or elsewhere. From this set, we recommend that reviewers use one, two, or three of the following samples to test different sequencing depths.
Other reference files
Here are pointers to reference maps that can be of interest while studying the Hi-C contact matrices:
- GM12878 histone marks from ENCODE (bigWig tracks)
- GM12878 CTCF peak calls from ENCODE (narrowPeak files from 4 different labs processed separately)
- GM12878 CTCF peak calls with motif orientation (Aiden Lab) from http://aidenlab.org/data.html.
- GM12878 loop calls from Rao and Huntley et al. Cell 2014 (Aiden Lab) from http://aidenlab.org/data.html.
- K562 histone marks from ENCODE (bigWig tracks)
- K562 CTCF peak calls from ENCODE (narrowPeak files from 5 different labs processed separately)
- K562 loop calls from Rao and Huntley et al. Cell 2014 (Aiden Lab) from http://aidenlab.org/data.html.
- HUVEC reference epigenome series from ENCODE
- HUVEC CTCF peak calls from ENCODE
- HUVEC loop calls from Rao and Huntley et al. Cell 2014 (Aiden Lab) from http://aidenlab.org/data.html.