Correlation analysis

The CODEX similarity analysis tool allows users to compare the peak profiles of selected ChIP-Seq experiments. We opted to compute similarity using the Dice coefficient instead of the traditional Pearson correlation.

Binding events, or peaks, in a ChIP-Seq experiment represent the interaction of a transcription factor (TF) protein with DNA in the genome. When comparing multiple experiments, we transform each peak profile into a binary vector within a matrix, with columns as experiments and rows as genomic regions. If a TF binds to a particular region, the event is given 1; if there is no binding, it is given 0. As the number of experiments in the matrix increases, the number of regions bound by a single transcription factor increases disproportionately. Therefore, using the Pearson correlation coefficient to compute pairwise correlations on such data will give mostly negative coefficients (close to zero), due to the overwhelming number of zeros in the matrix.
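
As a rough illustration of the transformation described above, the following Python sketch builds the binary matrix from per-experiment peak sets and computes pairwise Pearson correlations with NumPy. It is not the CODEX implementation: the function name binary_matrix, the experiment names and the region identifiers are hypothetical, and real peaks would first have to be merged into shared genomic regions.

```python
import numpy as np


def binary_matrix(peaks_by_experiment):
    """Build a regions x experiments 0/1 matrix (hypothetical helper).

    peaks_by_experiment maps an experiment name to the set of region
    identifiers in which that experiment has a peak.
    """
    experiments = sorted(peaks_by_experiment)
    # Rows are the union of all regions bound in at least one experiment.
    regions = sorted(set().union(*peaks_by_experiment.values()))
    matrix = np.array(
        [[1 if region in peaks_by_experiment[exp] else 0 for exp in experiments]
         for region in regions],
        dtype=int,
    )
    return regions, experiments, matrix


# Toy data (made up): TF_A and TF_B share two regions, TF_C binds elsewhere.
peaks = {
    "TF_A": {"r1", "r2", "r3"},
    "TF_B": {"r2", "r3", "r4"},
    "TF_C": {"r5", "r6", "r7", "r8", "r9", "r10"},
}

regions, experiments, matrix = binary_matrix(peaks)

# Columns are experiments, so correlate columns (rowvar=False).
pearson = np.corrcoef(matrix, rowvar=False)

print(experiments)
print(np.round(pearson, 2))
```

In this toy matrix, TF_A and TF_C share no regions, yet their Pearson coefficient is negative rather than zero. In a realistic matrix with many more regions and far sparser columns, the value for such non-overlapping pairs moves towards zero while remaining negative.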

This observation led us to consider the meaning of negative correlation when dealing with ChIP-Seq data. In this case, negative correlation does not mean that the binding profiles are opposite, but rather that the transcription factors bind at different genomic locations. Hence, the Pearson correlation carries little information in this context. We therefore chose to measure the agreement between two experiments using the Dice coefficient, which is designed to measure similarity between asymmetric binary vectors.
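
For two binary vectors, the Dice coefficient is twice the number of regions bound in both experiments divided by the total number of bound regions in the two experiments, so regions bound in neither experiment do not contribute. The Python sketch below shows this under stated assumptions: the function name dice_coefficient and the example vectors are hypothetical, returning 0 when neither experiment has any peaks is a convention chosen here, and this is not the CODEX code itself.

```python
import numpy as np


def dice_coefficient(x, y):
    """Dice coefficient of two binary vectors: 2*|X and Y| / (|X| + |Y|)."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    total = x.sum() + y.sum()
    if total == 0:
        # Assumption: define the coefficient as 0 when neither vector has peaks.
        return 0.0
    shared = np.logical_and(x, y).sum()
    return 2.0 * shared / total


# Toy binary profiles over six regions (made up for illustration).
tf_a = [1, 1, 1, 0, 0, 0]
tf_b = [0, 1, 1, 1, 0, 0]  # shares two bound regions with tf_a
tf_c = [0, 0, 0, 0, 1, 1]  # shares none with tf_a

print(dice_coefficient(tf_a, tf_b))
print(dice_coefficient(tf_a, tf_c))

# Appending regions bound by neither experiment leaves the score unchanged.
padding = [0] * 100
print(dice_coefficient(tf_a + padding, tf_b + padding))
```

Because shared absences are ignored, the score for a pair of experiments does not change when further experiments add regions to the matrix that neither of the two binds, and experiments with no regions in common score exactly 0.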