Prerequisites
- Unix system with RepeatMasker 3.2.0 or higher installed
A Perl script included in the COSEG package depends on RepeatMasker
3.2.0 or higher. Installation instructions can be found on the
RepeatMasker website.
- Sequence Search Engine
COSEG requires a set of pre-aligned sequences of roughly equal length
in cross_match output format. In order to generate these alignments,
you could use Phil Green's cross_match program or RMBlast.
- GraphViz Program [optional]
COSEG outputs tree data in the GraphViz format. Installing the GraphViz
program will allow you to convert this format into many different
graphics file formats for viewing.
Installation
- Download COSEG
Latest Released Version: 7/30/2024:
https://github.com/rmhubley/coseg/archive/refs/tags/coseg-0.2.4.tar.gz
Previous Released Version: 10/16/2023:
https://github.com/rmhubley/coseg/archive/refs/tags/coseg-0.2.3.tar.gz
- Unpack Distribution
Unpack the distribution in your home directory or in a temporary location ( e.g. /tmp ).
- cd /mytmp/location/
- gunzip coseg-#.#.#.tar.gz
- tar xvf coseg-#.#.#.tar
- Build
To compile the C program run:
- Test
See tutorial below for running COSEG on the provided example
datasets.
- Install
- Edit the Makefile and change INSTALLDIR to the location where you
want the files installed.
- run: "make install"
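The unpack and install steps above can be scripted. The following is a minimal sketch and is not part of the COSEG distribution: the helper names unpack_coseg and set_installdir are hypothetical, and set_installdir assumes the Makefile defines INSTALLDIR on a single line of the form "INSTALLDIR = /some/path", which you should confirm in your copy before relying on it.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of COSEG.

# Unpack a downloaded release tarball into a destination directory.
# "tar -xzf" combines the separate gunzip and "tar xvf" steps.
unpack_coseg() {
    tarball=$1
    dest=${2:-.}
    if [ ! -f "$tarball" ]; then
        echo "unpack_coseg: $tarball not found" >&2
        return 1
    fi
    mkdir -p "$dest" && tar -xzf "$tarball" -C "$dest"
}

# Point INSTALLDIR in a Makefile at a new location.
# Assumption: one line starts with "INSTALLDIR" followed by "=".
# The new path must not contain "|" or "&" (special to sed here).
set_installdir() {
    makefile=$1
    installdir=$2
    tmp="$makefile.tmp.$$"
    sed "s|^INSTALLDIR[[:space:]]*=.*|INSTALLDIR = $installdir|" "$makefile" > "$tmp" &&
        mv "$tmp" "$makefile"
}

# Example usage (paths are placeholders):
#   unpack_coseg coseg-0.2.4.tar.gz /tmp
#   set_installdir /tmp/<unpacked directory>/Makefile "$HOME/coseg"
```

After set_installdir has run, build the program and run "make install" from the unpacked directory as described in the steps above.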
ALU Example Run
- Given the sample ALU dataset provided by Alkes Price in the
original distribution of his code ( files have been renamed ):
- ALU.seqs: Human sequence data from alignment to AluSx consensus
- ALU.ins : Insertion sequences from the alignments
- ALU.cons: AluSx consensus
Note that positions 116-135 (inclusive) are
lowercase while all remaining positions are uppercase.
This is how we encode positions within the consensus
that have lower quality and should not be considered
in this analysis.
- Run analysis:
runcoseg.pl -d -m 50 -c ALU.cons -s ALU.seqs -i ALU.ins
NOTE: In this run we must tell coseg to treat lower case
characters in the consensus as a blacklist designation
by using the "-d" flag.
- Create png/svg files:
Coseg outputs a visualization of the subfamily tree in
both GraphViz and SVG formats. The SVG format can be viewed
directly in any web browser. The GraphViz format needs to
be converted to a graphics format ( e.g. PNG ) using
the GraphViz command line tools ( http://www.graphviz.org/ )
or loaded using the GraphViz web app ( https://mdaines.github.io/viz.js/ ).
The SVG output may be resized by changing the first line
in the SVG file to reflect the output size you would like.
For example, the default is to scale the entire graph to
the current size of the web browser window:
<svg height="100%" width="100%" viewbox="0 0 90.625 146.25">
Changing height/width to fixed values displays the graph at
that size instead:
<svg height="147" width="91" viewbox="0 0 90.625 146.25">
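The GraphViz conversion mentioned above can be done with the "dot" command line tool. The following is a minimal sketch, assuming the *.tree.viz file is standard GraphViz input that "dot" can lay out; viz_to_png is a hypothetical helper name and is not part of COSEG.

```shell
#!/bin/sh
# Sketch only: viz_to_png is a hypothetical helper, not part of COSEG.

# Convert a GraphViz file to PNG with the "dot" layout tool, writing
# the result next to the input (foo.tree.viz -> foo.tree.png).
# Returns 2 without doing anything if GraphViz is not installed.
viz_to_png() {
    viz=$1
    png=${viz%.viz}.png
    if ! command -v dot >/dev/null 2>&1; then
        echo "viz_to_png: GraphViz 'dot' not found; skipping $viz" >&2
        return 2
    fi
    dot -Tpng "$viz" -o "$png"
}

# Example usage after the ALU run above:
#   viz_to_png ALU.seqs.tree.viz
```

Other output formats are selected the same way, for example "-Tpdf" in place of "-Tpng".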
Output Files
The following files are named after the aligned sequence file
name as a prefix.
- *.log -- A log of mutation sites found and the order of the
clustering.
- *.subfamilies.seq -- name, count, P-value and consensus sequence of
each subfamily found by our algorithm. (For subfamilies not in the
original scaffold, we also include in parentheses the P-value of
the scaffold subfamily from which it is derived).
- *.assign -- for each of the elements, lists the
subfamily to which the algorithm has assigned it.
- *.tree.svg -- evolutionary tree of the subfamilies, in SVG format.
- *.tree.viz -- evolutionary tree of the subfamilies, in GRAPHVIZ format.
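As a quick sanity check after a run, the file list above can be turned into a small script. This is a minimal sketch: check_coseg_outputs is a hypothetical helper, not part of COSEG, and it assumes the outputs are written to the current directory using the aligned sequence file name (for example ALU.seqs) as the prefix.

```shell
#!/bin/sh
# Sketch only: check_coseg_outputs is a hypothetical helper, not part of COSEG.

# Report which of the expected output files exist for a given prefix.
# Returns 0 if all five are present, 1 if any is missing.
check_coseg_outputs() {
    prefix=$1
    status=0
    for suffix in log subfamilies.seq assign tree.svg tree.viz; do
        f="$prefix.$suffix"
        if [ -e "$f" ]; then
            echo "found    $f"
        else
            echo "missing  $f"
            status=1
        fi
    done
    return $status
}

# Example usage after the ALU run:
#   check_coseg_outputs ALU.seqs
```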
Input Files
Running Using Your Own Data
- Cross_match a reference sequence against a genome or database.
cross_match line1copies consensus -M 25p41g.matrix
-gap_init -25 -gap_ext -5 -minscore 200
-minmatch 6 -alignments -bandwidth 50 -word_raw > LINE1
The example file LINE1 included in this distribution was created
using the command line above and can be used directly in the following
steps.
- Determine the consensus range to use for analysis ( e.g. 298 - 797 bp )
- Create input files for Alkes Price's programs:
preprocessAlignments.pl -maxEdgeGap 10
-minConsRange 298
-maxConsRange 797
-alignments LINE1
This will create 3 new files: LINE1.seqs
LINE1.ins
LINE1.cons
NOTE: Use the -w flag to preprocessAlignments if you use WUBlast
to perform the alignments.
- Run analysis:
runcoseg.pl -t -m 50 -c LINE1.cons -s LINE1.seqs
-i LINE1.ins
NOTE: In this example we use one new flag. The "-t" flag indicates
we want to use 3 bp co-segregating mutations as well as 2 bp
co-segregating mutations when developing subfamilies. The newer
pvalue calculation developed by Andy Siegel, formerly selected
with "-a", has been the default since coseg-0.2.1 ( see the
Release Notes ), so no flag is needed to select it.
- Open up a web browser and point it at the file LINE1.seqs.tree.svg.
Most browsers support zooming in on svg files. If you want to render
the SVG file larger by default simply edit the *.svg file and change the
line:
<svg height="100%" width="100%" viewbox="0 0 90.625 146.25">
to reflect a fixed size for the graph. The "viewbox" values give
the absolute size of the drawing, so a 1:1 scale would be:
<svg height="146.25" width="90.625" viewbox="0 0 90.625 146.25">
- If you would like to produce an SVG file without node labels, or
using the divergence, P-value or subfamily size as the
label, simply rerun ./postprocess.pl without the "-l" flag or
with "-l div", "-l pv", "-l c" accordingly.
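The SVG resizing step above can be automated rather than edited by hand. The following is a minimal sketch that reads the third and fourth viewbox values (the width and height of the drawing) from the opening svg tag and writes them into the height and width attributes, optionally multiplied by a scale factor. svg_fix_size is a hypothetical helper, not part of COSEG, and it assumes the svg tag sits on a single line with height, width and viewbox attributes as in the example shown earlier.

```shell
#!/bin/sh
# Sketch only: svg_fix_size is a hypothetical helper, not part of COSEG.

# Print the SVG file with a fixed height/width on the first <svg> tag.
# Usage: svg_fix_size FILE [SCALE]   (SCALE defaults to 1, i.e. 1:1)
svg_fix_size() {
    svg=$1
    scale=${2:-1}
    awk -v scale="$scale" '
        !fixed && /<svg / && match($0, /view[Bb]ox="[^"]*"/) {
            # Drop the 9-character attribute prefix and the closing quote,
            # leaving the four numbers: min-x min-y width height.
            n = split(substr($0, RSTART + 9, RLENGTH - 10), box, " ")
            if (n == 4) {
                sub(/height="[^"]*"/, "height=\"" (box[4] * scale) "\"")
                sub(/width="[^"]*"/, "width=\"" (box[3] * scale) "\"")
                fixed = 1
            }
        }
        { print }
    ' "$svg"
}

# Example usage (output file name is a placeholder):
#   svg_fix_size LINE1.seqs.tree.svg 2 > LINE1.seqs.tree.2x.svg
```

Writing to a new file rather than editing in place keeps the original output available if you want to rerun with a different scale.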
Experimental
Using the refiner contained in the RepeatModeler package one may
submit subfamily members for consensus refinement. There is an
experimental script called refineConsSeqs.pl that will run this
analysis:
- From the directory where your coseg results can be
found run:
refineConsSeqs.pl -subConsFile mycosegrun.subfamilies.seq
Where "mycosegrun" is the prefix of the coseg run.
Release Notes
coseg-0.2.4
- IUB codes in input sequences caused the code to segfault.
Coseg will now randomly choose a nucleotide to substitute
each time it encounters one in the input sequence. It
will also inform the user when doing so. Thanks to David
Ray for reporting this and suggesting the fix.
- Fixed the svg tag so that the files
will directly load in HTML5 web browsers.
- Calculation of divergence has been improved.
We now use Kimura substitution distance with CpG
site accounting modifications instead of the
mixed substitution and indel calculation.
- Produce a warning if emutfrac is not within expected
range, rather than crashing.
coseg-0.2.3
- Update to Makefile to support newer toolchains
coseg-0.2.2
- Create a *.svg graph file without the need
to download/use GraphViz. The layout is
handled by an adaptation of an algorithm
developed by Atze van der Ploeg. The SVG
file produced supports various labeling
options and subfamily details displayed
when a node is hovered over.
- Changed the default colormap for the graph
output. Now warm colors denote more diverged
subfamilies in the tree while cooler colors
represent younger subfamilies. To restore
the original color scheme use the new "-o"
flag to coseg.
- Added parameter to control the minimum distance
between diagnostic sites. Now the user can override
the historic value of 10 using the -u flag.
- Improved error reporting when there is a mismatch
between an individual sequence length and the
consensus length in the input files.
- Fixed a bug that caused coseg to segfault.
- Added experimental script refineConsSeqs.pl. This
script uses the RepeatModeler application to build
and refine the consensus sequences for each
subfamily.
coseg-0.2.1
- Improved code documentation
- Single mutation significance cutoff ( SIGMATHRESH ) was pre-calculated for Alkes Price's Alu analysis and hardcoded. This version calculates the correct sigma cutoff using the length of the input sequence.
- Fixed bug with implementation of Siegel's pValue calculation which caused a segfault -- found by Neal Platt.
- Switched default pvalue method to Andy Siegel's method and provided a new "-k" switch to use Alkes Price's method.
- Fixed bug where the program was exiting when calculations fell below the precision of the machine ( epsilon ). Message given was "Below epsilon..." and the runcoseg.pl script moved on even though coseg failed.
coseg-0.2.0