ChExMix: the ChIP-exo mixture modelExample of ChExMix analysis

ChExMix aims to characterize protein-DNA binding subtypes in ChIP-exo experiments. ChExMix assumes that different regulatory complexes will result in different protein-DNA crosslinking signatures in ChIP-exo data, and thus analysis of ChIP-exo sequencing tag patterns should enable detection of multiple protein-DNA binding modes for a given regulatory protein. ChExMix uses a mixture modeling framework to probabilistically model the genomic locations and subtype membership of protein-DNA binding events, leveraging both ChIP-exo tag enrichment patterns and DNA sequence information. In doing so, ChExMix offers a more principled and robust approach to characterizing binding subtypes than simply clustering binding events using motif information.

|| citation || download || source || manual | prerequisites | running chexmix | options | example ||

Citation

Yamada, et al.“Characterizing protein-DNA binding event subtypes in ChIP-exo data”. BioRxiv 266536.
This paper will also be presented at RECOMB 2018.

Download

  • ChExMix version 0.51 (2021-06-03): JAR
  • ChExMix version 0.5 (2020-07-09): JAR
  • ChExMix version 0.45 (2019-12-11): JAR
  • ChExMix version 0.4 (2019-04-25): JAR
  • ChExMix version 0.3 (2018-01-15): JAR
  • ChExMix version 0.2 (2018-10-03): JAR
  • ChExMix version 0.1 (2018-02-14): JAR

Source

Open source code (released under the MIT license) is available from: https://github.com/seqcode/chexmix.
If you want to build the code yourself, you will need to first download and build the seqcode-core library (https://github.com/seqcode/seqcode-core) and add its build/classes and lib directories to your CLASSPATH.

Manual

Prerequisites

  1. ChExMix requires Java 8+.
  2. You need MEME installed and available in $PATH if you want to find subtypes using motifs (tested with MEME version 4.11.3).
  3. ChExMix loads all data to memory, so you will need a lot of available memory if you are running analysis over many conditions or large datasets.

Running ChExMix

Note that ChExMix performs a lot of EM optimization of binding events along the genome, and also integrates motif-finding via MEME. Therefore, expect it to be very time and memory intensive.

Running from a jar file:

java -Xmx20G -jar chexmix.jar <options - see below>

In the above, the “-Xmx20G” argument tells java to use up to 20GB of memory. If you have installed source code from github, and if all classes are in your CLASSPATH, you can run ChExMix as follows:

java -Xmx20G org.seqcode.projects.chexmix.ChExMix <options - see below>

Options

Required/important options are highlighted in red.

General Options

  • --out <prefix>: Output file prefix. All output will be put into a directory with the prefix name.
  • --threads <n>: Use n threads during binding event detection (default=1)
  • --verbose: Flag to print intermediate files and extra output.

Specifying the Genome:

  • --geninfo <genome info file>: This file should list the lengths of all chromosomes on separate lines using the format chrName\<tab\>chrLength. You can generate a suitable file from UCSC 2bit format genomes using the UCSC utility “twoBitInfo”. The chromosome names should be exactly the same as those used in your input list of genomic regions.
    The genome info files for some UCSC genome versions:
    | hg18 | hg19 | hg38 | mm8 | mm9 | mm10 | rn4 | rn5 | danRer6 | ce10 | dm3 | sacCer2 | sacCer3 |
  • --seq <fasta seq directory> : A directory containing fasta format files corresponding to every named chromosome is required if you want to find subtypes run motif-finding or use a motif-prior within ChExMix.
  • --back <path> : A file containing Markov background model for the genome is required if you want to run motif-finding or use a motif-prior within ChExMix.
    The Markov background model files for some species:
    | human | mouse | Drosophila | S. cerevisiae | E. coli |

Loading Data:

  • --exptCONDNAME-REPNAME <file>: Defines a file containing reads from a signal experiment. Replace CONDNAME and REPNAME with appropriate condition and replicate labels.
  • --ctrlCONDNAME-REPNAME <file>: Optional arguments. Defines a file containing reads from a control experiment. Replace CONDNAME and REPNAME with appropriate labels to match a signal experiment (i.e. to tell ChExMix which condition/replicate this is a control for). If you leave out a REPNAME, this file will be used as a control for all replicates of CONDNAME.
  • --format <SAM/BAM/BED/IDX>: Format of data files. All files must be the same format if specifying experiments on the command line. Supported formats are SAM/BAM, BED, and IDX index files.

Instead of using the above options to specify each and every ChIP-seq data file on the command-line, you can instead use a design file:

  • ‒‒design <file name> : A file that specifies the data files and their condition/replicate relationships. See here for an example design file. The file should be formatted to contain the following pieces of information for each data file, in this order and tab-separated:
    • File name
    • Label stating if this experiment is “signal” or “control”
    • File format (SAM/BAM/BED/IDX) – mixtures of formats are allowed in design files
    • Condition name
    • Replicate name (optional for control experiments – if used, the control will only be used for the corresponding named signal replicate)
    • Optional. Experiment type for this replicate: chipseq/chipexo.
    • Optional. Fixed per-base read count limit for this replicate. “P” gives a read count limit that varies along the genome according to how neighboring bases are distributed. Default is a global per-base limit that is estimated from a Poisson distribution.

Limits on how many reads can have their 5′ end at the same position in each replicate. Please use the options if your experiments contain many PCR duplicates:

  • --readfilter: Flag to turn on filtering reads. Default = no read filter
  • ‒‒fixedpb <value>: Fixed per-base limit.
  • ‒‒poissongausspb <value> : Filter per base using a Poisson threshold parameterized by a local Gaussian sliding window (i.e. look at neighboring positions to decide what the per-base limit should be).
  • Default behavior is to estimate a global per-base limit from a Poisson distribution parameterized by the number of reads divided by the number of mappable bases in the genome. The per-base limit is set as the count corresponding to the 10^-7 probability level from the Poisson.
  • ‒‒nonunique : Flag to use non-unique reads.
  • ‒‒mappability <value> : Fraction of the genome that is mappable for these experiments. Default=0.8.
  • ‒‒nocache : Flag to turn off caching of the entire set of experiments (i.e. run slower with less memory)

Scaling data:

Default behavior is to scale signal to corresponding controls using the Normalization of ChIP-seq (NCIS) method described by Liang & Keles, BMC Bioinformatics 2012.

  • ‒‒noscaling : Flag to turn off auto estimation of signal vs control scaling factor.
  • ‒‒medianscale : Flag to use scaling by median ratio of binned tag counts. Default = scaling by NCIS.
  • ‒‒regressionscale : Flag to use scaling by regression on binned tag counts. Default = scaling by NCIS.
  • ‒‒sesscale : Flag to use scaling by SES (Diaz, et al. Stat Appl Genet Mol Biol. 2012).
  • ‒‒fixedscaling <scaling factor> : Multiply control counts by total tag count ratio and then by this factor. Default: scaling by NCIS.
  • ‒‒scalewin <window size> : Window size for estimating scaling ratios. Default is 10Kbp. Use something much smaller if scaling via SES (e.g. 200bp).
  • ‒‒plotscaling : Flag to plot diagnostic information for the chosen scaling method.

Training ChExMix:

  • --round <int>: Max. model update rounds (default=3).
  • --nomodelupdate: Flag to turn off binding model updates.
  • --minmodelupdateevents <int>: Minimum number of events to support an update (default=50)
  • --prlogconf <value>: Poisson log threshold for potential region scanning (default=-6)
  • --fixedalpha <int>: Impose this alpha (default: set automatically). The alpha parameter is a sparse prior on binding events in the ChExMix model. It can be interpreted as a minimum number of reads that each binding event must be responsible for in the model.
  • --alphascale <value>: Alpha scaling factor (default=1.0). Increasing this parameter results in stricter binding event calls.
  • --betascale <value>: Beta scaling factor (default=0.05). The beta parameter is a sparse prior on binding event subtype assignment in the ChExMix model. Increasing this parameter may result in inaccurate subtype assignment.
  • --epsilonscale <value>: Epsilon scaling factor (default=0.2). The epsilon parameter control a balance between motif and read distribution in subtype assignment. Increasing this parameter will increase the contribution of motif and decreases the contribution of read distribution.
  • --peakf <file>: File of peaks to initialize component positions.
  • --galaxyhtml: Flag to produce a html output appropriate for galaxy.
  • --exclude <file>: File containing a set of regions to ignore during ChExMix training. It’s a good idea to exclude the mitochondrial genome and other ‘blacklisted’ regions that contain artifactual accumulations of reads in both ChIP-exo and control experiments. ChExMix will waste time trying to model binding events in these regions, even though they will not typically appear significantly enriched over the control (and thus will not be reported to the user). See the format of an exclude region file here (example for mm9).
  • --excludebed <file>: Bed file containing a set of regions to ignore during ChExMix training.

Binding event reporting mode:
Controls which events to put into .events file.

  • --standard: Report events that pass significance threshold in condition as a whole (default mode)
  • --lenient: Report events that pass significance in >=1 replicate or the condition as a whole.
  • --lenientplus: Report events that pass significance in condition OR (>=1 replicate AND no significant difference between replicates).

Finding ChExMix subtypes:

1. Using motifs:

  • --motfile <file>: File of motifs in transfac format to initialize subtype motifs.
  • --memepath <path>: Path to the meme bin dir (default: meme is in $PATH). MEME path is required for motif finding.
  • --nomotifs <value>: Flag to turn off motif-finding & motif priors
  • --nomotifprior <value>: Flag to turn off motif priors only
  • --mememinw <value>: minw arg for MEME (default=6).
  • --mememaxw <value>: maxw arg for MEME (default=18).
  • --memenmotifs <int>: Number of motifs MEME should find in each condition (default=3)
  • --memeargs <args>: Additional args for MEME (default: -dna -mod zoops -revcomp -nostatus)
  • --minroc <value>: Minimum motif ROC value (default=0.7)
  • --seqrmthres <value>: Filter out sequences with motifs below this threshold for recursively finding motifs (default=0.1)
  • --minmodelupdaterefs <int>: Minimum number of motif reference to support an subtype distribution update (default=25)

2. Using ChIP-exo tag distributions:

  • --noclustering: Flag to turn off read distribution clustering
  • --pref <value>: Preference value for read distribution clustering (default=-0.1)
  • --numcomps <int>: Number of components to cluster (default=500)
  • --win <int>: Read profile window size (default=150)

Reporting binding events:

  • --q <value>: Q-value minimum (default=0.01)
  • --minfold <value>: Minimum event fold-change vs scaled control (default=1.5)

Example

This example runs ChExMix v0.1 on a simulated dataset. The simulated data is made by mixing mutually-exclusive binding events from yeast Reb1 and Abf1 ChIP-exo datasets. This ensures that there are two distinct binding event subtypes in the dataset. The version of ChExMix and all files required to run this analysis are in this file: chexmix-yeast-example.tar.gz.

Let’s demonstrate how ChExMix works in two different modes. Note that you will need to install MEME to run the first example.

Defining and assigning subtypes using both DNA motif discovery and ChIP-exo tag distributions:

java -Xmx10G -jar chexmix.public.jar --geninfo sacCer3.info --threads 16 --expt Abf1_Reb1_merged_toppeaks.bam --ctrl masterNotag_20161101.bam --format BAM --back sacCer3.back --exclude sacCer3_exludes.txt --seq ~/group/genomes/sacCer3/ --memepath path_to_meme_bin/ --mememinw 6 --mememaxw 18 --out testchexmix_20180214_reb1_abf1_all > testchexmix_20180214_reb1_abf1_all.out 2>&1

Expected results can be found here: testchexmix_20180214_reb1_abf1_all

Defining and assigning subtypes using only ChIP-exo tag distributions:

java -Xmx10G -jar chexmix.public.jar --geninfo sacCer3.info --threads 16 --expt Abf1_Reb1_merged_toppeaks.bam --ctrl masterNotag_20161101.bam --format BAM --back sacCer3.back --exclude sacCer3_exludes.txt --nomotifs --out testchexmix_20180214_reb1_abf1_all_nomotifs > testchexmix_20180214_reb1_abf1_all_nomotifs.out 2>&1

Expected results can be found here: testchexmix_20180214_reb1_abf1_all_nomotifs

Note that due to some stochasticity between runs, your results may not match those above exactly, but should be broadly similar.