SeqUnwinder is a framework for characterizing class-discriminative motifs in a collection of genomic loci that have several (overlapping) annotation labels.
Kakumanu, Akshay, et al. “Deconvolving sequence features that discriminate between overlapping regulatory annotations.” bioRxiv (2017): 100511.
Open source code (released under the MIT license) is available from: https://github.com/seqcode/sequnwinder.
- SeqUnwinder requires Java 8+.
- SeqUnwinder implements a multi-threaded version of ADMM to train the model. Hence, when using large datasets (tens of thousands of genomic sites), it is advisable to run in a system that allows multiprocessing.
- SeqUnwinder depends on MEME (tested with MEME version 4.10.2).
Building from Source:
If you want to build the code yourself, you will need to first download and build the seqcode-core library (https://github.com/seqcode/seqcode-core) and add its build/classes and lib directories to your CLASSPATH.
On a typical dataset (~20,000 sites and ~8 annotation labels) SeqUnwinder takes a couple of hours to run.
Running from a jar file:
java -Xmx20G -jar sequnwinder.jar <options - see below>
In the above, the “-Xmx20G” argument tells java to use up to 20GB of memory. If you have installed source code from github, and if all classes are in your CLASSPATH, you can run SeqUnwinder as follows:
java -Xmx20G org.seqcode.projects.sequnwinder.SeqUnwinder <options - see below>
Required/important options are highlighted in bold.
--out <String>: Prefix for the output files.
--threads <value>: Number of threads to use. Default=4.
--debug: Flag to run in debug mode.
--memepath <path>: Path to the meme bin dir (default: meme in $PATH)
Specifying the Genome:
--geninfo <genome info file>: This file should list the lengths of all chromosomes on separate lines using the format chrName\<tab\>chrLength. You can generate a suitable file from UCSC 2bit format genomes using the UCSC utility “twoBitInfo”. The chromosome names should be exactly the same as those used in your input list of genomic regions.
--seq <fasta seq directory>: A directory containing fasta format files corresponding to every named chromosome is required.
Input Genomic Regions:
--genregs <Genomic regions with annotations filename>OR
--genseqs <Sequences with annotations filename>: A tab delimited file of a list of genomic points/sequences and corresponding annotations/labels. A simple example :
GenRegs file: chr10:100076604 enhancer;shared chr6:100316177 promoter;celltypeA GenSeqs file: ATTGC....TTA enhancer;shared CGTAA....GGT promoter;celltypeA
--win <integer value>: Size of the genomic regions in bp. Default = 150.
--makerandregs: Flag to make random genomic regions as an extra outgroup class in classification (only applicable when genome is provided).
SeqUnwinder Model Options:
--mink <value>: Minimum length of K-mer to consider. Default = 4.
--maxk <value>: Maximum length of K-mer to consider. Default = 5.
For most SeqUnwinder analysis described in the manuscript, K-mers of lengths 4 and 5 showed optimal performance. However, with larger datasets (with more data instances for training), maxk can be increased to 6 or 7.
--r <value>: Regularization co-efficient in the model. For most SeqUnwinder applications, with ~20k genomic sites and ~6 labels and K-mers of 4 and 5, a value of 10.0 has been very effective. However, the optimal value could change with datasets. One might want to use a range of values and choose the one that performs best (in terms of test accuracy).
--x <value>: Number of folds for cross validation. Default = 3.
--mergelow: Flag to merge subclasses with fewer than 200 sites with other relevant classes. By default, all subclasses with less than 200 sites are removed.
Other SeqUnwinder options: (highly recommend using default options)
--minscanlen <value>: Minimum length of the window to scan *K*-mer models. Default=8.
--maxscanlen <value>: Maximum length of the window to scan *K*-mer models. Default=14.
--hillsthresh <value>: Scoring threshold to identify hills. Default=0.1.
--mememinw <value>: minw arg for MEME. Default=6.
--mememaxw <value>: maxw arg for MEME. Default=13. This value should always be less than “maxScanLen”.
--memesearchwin <value>: Window around hills to search for discriminative motifs. Default=16
--a <int>: Maximum number of allowed ADMM iterations. Default=500.
This example runs SeqUnwinder v0.1.2 on simulated dataset used in the SeqUnwinder manuscript (Figure 2B). Simulated sequences to run this example can be found [here].
java -Xmx20G -jar sequnwinder.jar --out example --threads 10 --debug --memepath path-to-meme --geninfo mm10.info --seq path-to-genomes/mm10/ --genseqs simulateOverlap.seqs --win 150 --mink 4 --maxk 5 --r 10 --x 3 --maxscanlen 15
Results can be found [here].