MultiGPS is a framework for analyzing collections of multi-condition ChIP-seq datasets and characterizing differential binding events between conditions. In analyzing multiple-condition ChIP-seq datasets, MultiGPS encourages consistency in the reported binding event locations across conditions and provides accurate estimation of ChIP enrichment levels at each event.
An integrated model of multiple-condition ChIP-seq data reveals predeterminants of Cdx2 binding
S Mahony, MD Edwards, EO Mazzoni, RI Sherwood, A Kakumanu, CA Morrison, H Wichterle, DK Gifford
PLoS Computational Biology (2014) 10(3):e1003501
Open source code (released under the MIT license) is available from: https://github.com/seqcode/multigps.
Simulated multi-condition ChIP-seq datasets for benchmarking differential binding detectors can be downloaded here.
- You need Java 8+ installed to run MultiGPS.
- MultiGPS loads all data to memory, so you will need a lot of available memory if you are running analysis over many conditions or large datasets.
- You need MEME installed if you want to use a motif prior (tested with MEME version 4.10.2).
- You need R, Bioconductor, and edgeR installed if you want to run analysis of differential binding events (tested with R version 3.2.0 and edgeR version 3.8.6).
- If you want to build the code yourself, you will need to first download build the seqcode-core library (https://github.com/seqcode/seqcode-core) and add its build/classes and lib directories to your CLASSPATH.
Note that MultiGPS performs a lot of EM optimization of binding events along the genome and across experimental conditions, and also integrates motif-finding via MEME. Therefore, expect it to be very time and memory intensive.
Run the JAR from the command-line as follows:
java -Xmx20G -jar multigps.jar <options - see below>
In the above, the “-Xmx20G” argument tells java to use up to 20GB of memory. If you have installed source code from github, and if all classes are in your CLASSPATH, you can run MultiGPS as follows:
java -Xmx20G org.seqcode.projects.multigps.MultiGPS <options - see below>
Required/important options are highlighted in red below.
‒‒out <prefix>: Output file prefix. All output will be put into a directory with the prefix name.
‒‒threads <n>: Use n threads during binding event detection. Default is 1 thread.
‒‒verbose: Flag to print intermediate files and some extra output.
‒‒config <config file>: All options can be specified in a name<space>value text file, where name is the name of the option without the “
‒‒“. Options specified in a config file are over-ridden by command-line args.
Specifying the genome:
‒‒geninfo <genome info file>: This file should list the lengths of all chromosomes on separate lines using the format chrName<tab>chrLength. You can generate a suitable file from UCSC 2bit format genomes using the UCSC utility “twoBitInfo”. The chromosome names should be exactly the same as those used in your ChIP-seq read files. See some examples below.
‒‒seq <fasta seq directory>: A directory containing fasta format files corresponding to every named chromosome is required if you want to run motif-finding or use a motif-prior within MultiGPS.
‒‒exptCONDNAME-REPNAME <file name>: Defines a file containing reads from a signal experiment. Replace CONDNAME and REPNAME with appropriate condition and replicate labels.
‒‒ctrlCONDNAME-REPNAME <file name>: Optional arguments. Defines a file containing reads from a control experiment. Replace CONDNAME and REPNAME with appropriate labels to match a signal experiment (i.e. to tell MultiGPS which condition/replicate this is a control for). If you leave out a REPNAME, this file will be used as a control for all replicates of CONDNAME.
‒‒format <SAM/BED/IDX>: Format of data files. All files must be the same format if specifying experiments on the command line. Supported formats are SAM/BAM, BED, and IDX index files.
Instead of using the above options to specify each and every ChIP-seq data file on the command-line, you can instead use a design file:
‒‒design <file name>: A file that specifies the data files and their condition/replicate relationships. See here for an example design file. The file should be formatted to contain the following pieces of information for each data file, in this order and tab-separated:
- File name
- Label stating if this experiment is “signal” or “control”
- File format (SAM/BAM/BED/IDX) – mixtures of formats are allowed in design files
- Condition name
- Replicate name (optional for control experiments – if used, the control will only be used for the corresponding named signal replicate)
- Optional. Experiment type for this replicate: chipseq/chipexo.
- Optional. Fixed per-base read count limit for this replicate. “P” gives a read count limit that varies along the genome according to how neighboring bases are distributed. Default is a global per-base limit that is estimated from a Poisson distribution.
Limits on how many reads can have their 5′ end at the same position in each replicate:
‒‒fixedpb <value>: Fixed per-base limit.
‒‒poissongausspb <value>: Filter per base using a Poisson threshold parameterized by a local Gaussian sliding window (i.e. look at neighboring positions to decide what the per-base limit should be).
- Default behavior is to estimate a global per-base limit from a Poisson distribution parameterized by the number of reads divided by the number of mappable bases in the genome. The per-base limit is set as the count corresponding to the 10^-7 probability level from the Poisson.
‒‒nonunique: Flag to use non-unique reads.
‒‒mappability <value>: Fraction of the genome that is mappable for these experiments. Default=0.8.
‒‒nocache: Flag to turn off caching of the entire set of experiments (i.e. run slower with less memory)
Default behavior is to scale signal to corresponding controls using the Normalization of ChIP-seq (NCIS) method described by Liang & Keles, BMC Bioinformatics 2012.
‒‒noscaling: Flag to turn off auto estimation of signal vs control scaling factor.
‒‒medianscale: Flag to use scaling by median ratio of binned tag counts. Default = scaling by NCIS.
‒‒regressionscale: Flag to use scaling by regression on binned tag counts. Default = scaling by NCIS.
‒‒sesscale: Flag to use scaling by SES (Diaz, et al. Stat Appl Genet Mol Biol. 2012).
‒‒fixedscaling <scaling factor>: Multiply control counts by total tag count ratio and then by this factor. Default: scaling by NCIS.
‒‒scalewin <window size>: Window size for estimating scaling ratios. Default is 10Kbp. Use something much smaller if scaling via SES (e.g. 200bp).
‒‒plotscaling: Flag to plot diagnostic information for the chosen scaling method.
‒‒d <file name>: Binding event read distribution file for initializing models. The true distribution of reads around binding events is estimated during MultiGPS training. See examples here for file format. A default initial distribution appropriate for ChIP-seq data is used if this option is not specified.
‒‒r <int value>: Maximum number of training rounds for updating binding event read distributions. Default = 3.
‒‒nomodelupdate: Flag to turn off binding model updates.
‒‒minmodelupdateevents <int value>: Minimum number of events to support an update of the read distribution. Default = 500.
‒‒nomodelsmoothing: Flag to turn off binding model smoothing (default = smooth with a cubic spline).
‒‒splinesmoothparam <value>: Smoothing parameter for smoothing cubic spline. Default = 30.
‒‒gaussmodelsmoothing: Flag to turn on Gaussian model smoothing (default = smooth with a cubic spline).
‒‒gausssmoothparam <value>: Gaussian smoothing std dev. Default = 3.
‒‒jointinmodel: Flag to allow joint events in model updates (default=do not).
‒‒fixedmodelrange: Flag to keep binding model range fixed to initial size. Default: automatically adapt range.
‒‒prlogconf <value>: Poisson log threshold for potential region scanning. Default = -6.
‒‒fixedalpha <value>: Impose this alpha. The alpha parameter is a sparse prior on binding events in the MultiGPS model. It can be interpreted as a minimum number of reads that each binding event must be responsible for in the model. Default: estimate alpha automatically.
‒‒alphascale <value>: Alpha scaling factor. Increasing this parameter results in stricter binding event calls. Default = 1.0.
‒‒mlconfignotshared: Flag to not share component configs in the ML step. This mainly affects the quantification of binding levels for binding events that are not shared but are located at nearby locations across experiments.
‒‒exclude <file name>: File containing a set of regions to ignore during MultiGPS training. It’s a good idea to exclude the mitochondrial genome and other ‘blacklisted’ regions that contain artifactual accumulations of reads in both ChIP-seq and control experiments. MultiGPS will waste time trying to model binding events in these regions, even though they will not typically appear significantly enriched over the control (and thus will not be reported to the user). See the format of an exclude region file here (example for mm9).
‒‒noposprior: Flag to turn off inter-experiment positional prior (default=on).
‒‒probshared <value>: Probability that events are shared across conditions (default=0.9).
‒‒nomotifs: Flag to turn off motif-finding & motif priors.
‒‒nomotifprior: Flag to turn off motif priors only.
‒‒memepath <path>: Path to the meme bin dir (default: meme is in $PATH).
‒‒memenmotifs <int value>: Number of motifs MEME should find for each condition (default=3).
‒‒mememinw <minw>: Minimum motif width arg for MEME (default=6).
‒‒mememaxw <maxw>: Maximum motif width arg for MEME (default=18).
‒‒memeargs <args>: Additional args for MEME (default: -dna -mod zoops -revcomp -nostatus).
Reporting binding events:
‒‒q <value>: Minimum Q-value (corrected p-value) of reported binding events. Default = 0.001.
‒‒minfold <value>: Minimum event fold-change vs scaled control. Default = 1.5.
‒‒nodifftests: Flag to turn off differential enrichment tests.
‒‒rpath <path>: Path to the R bin dir (default: R is in $PATH). Note that you need to install edgeR separately.
‒‒edgerod <value>: EdgeR overdispersion parameter. Default = 0.15.
‒‒diffp <value>: Minimum p-value for reporting differential enrichment. Default = 0.01.
This example runs multiGPS v0.5 on Cdx2 ChIP-seq data from three cell-lines. Note that while the ChIP-seq data is the same as used for our Cdx2 analysis in our PLoS Computational Biology paper, the numbers of discovered binding events and differential binding events are not the same as reported there. The differences are due to an updated version of MultiGPS, and updated version of edgeR, and a more permissive differential binding event p-value (p<10^-2).
java -Xmx28G -jar multigps.jar ‒‒geninfo ~/genomes/mm9/mm9.info ‒‒threads 4 ‒‒exclude mm9_excludes.txt ‒‒design cdx2.design ‒‒verbose ‒‒probshared 0.99 ‒‒memepath ~/software/meme_4.9.0/bin ‒‒mememinw 6 ‒‒mememaxw 16 ‒‒seq ~/genomes/mm9 ‒‒out Cdx2 >Cdx2.out 2>&1
Binding event read distribution files:
- Distribution file suitable for initializing transcription factor ChIP-seq analyses: ctcf_chipseq.distrib.txt
- Distribution file suitable for initializing transcription factor ChIP-exo analyses: reb1_chipexo.distrib.txt
Genome info files for some UCSC genome versions: