The Institute for Systems Biology COSEG Download

COSEG is a program which automatically identifies repeat subfamilies using significant co-segregating ( 2-3 bp ) mutations.

Prerequisites
  1. Unix system with RepeatMasker 3.2.0 or higher installed
    • A perl script included in the COSEG package depends on the RepeatMasker 3.2.0 (and higher) release. RepeatMasker installation instructions can be found here
  2. Sequence Search Engine
    COSEG uses alignment data generated by a sequence search engine. Cross_match and WUBlast are supported by COSEG.
  3. GraphViz Program [optional]
    COSEG outputs tree data in the GraphViz format. Installing the GraphViz program will allow you to convert this format into many different graphics file formats for viewing.
Installation
  1. Download COSEG
    Latest Released Version: 8/3/2010: coseg-0.2.1.tar.gz
    Previous Version: 8/8/2008: coseg-0.2.0.tar.gz
  2. Unpack Distribution
    Unpack the distribution in your home directory or in a temporary location ( ie. /tmp ).
    • cd /mytmp/location/
    • gunzip coseg-#.#.#.tar.gz
    • tar xvf coseg-#.#.#.tar
  3. Build
    To compile the C program run:
    • cd coseg
    • make
    • Edit preprocessAlignments.pl: Change line 114:
                 use lib "/usr/local/RepeatMasker";
               
      To point to your local RepeatMasker installation.
  4. Test
    See tutorial below for running COSEG on the provided example datasets.
  5. Install
    • Edit the Makefile and change INSTALLDIR to the location where you want the files installed.
    • run: "make install"
ALU Example Run
  1. Given the sample ALU dataset provided by Alkes Price in the original distribution of his code ( files have been renamed ):

    • ALU.seqs: Human sequence data from alignment to AluSx consensus )
    • ALU.ins : Insertion sequences from the alignments
    • ALU.cons: AluSx consensus

    Note: that positions 116-135 (inclusive) are lowercase while all remaining positions are uppercase. This is how we encode positions within the consensus that have lower quality and should not be considered in this analysis.

  2. Run analysis:
    runcoseg.pl -d -m 50 -c ALU.cons -s ALU.seqs -i ALU.ins
    NOTE: In this run we must tell coseg to treat lower case characters in the consensus as a blacklist designation by using the "-d" flag.

  3. Create png/svg files:

    The visualization output of coseg is in the GraphViz format. To convert this data into a graphics different graphics format you can use the "dot" program in the GraphViz package. The dot program will creae hierarchical graph from the data.

    ie:

             dot ALU.seqs.tree.viz -Tpng -o ALU.seqs.tree.viz.png
    
             dot ALU.seqs.tree.viz -Tsvg -o ALU.seqs.tree.viz.svg
           
    NOTE: Most modern browsers will display the SVG graphics format without a plugin.
Output Files
The following files are named after the aligned sequence file name as a prefix.
  • *.log -- A log of mutation sites found and the order of the clustering.
  • *.subfamililies.seq -- name, count, P-value and consensus sequence of each subfamily found by our algorithm. (For subfamilies not in the original scaffold, we also include in parentheses the P-value of the scaffold subfamily from which it is derived).
  • *.assign -- for each of the elements, lists the subfamily to which the algorithm has assigned it.
  • *.tree.viz -- evolutionary tree of the subfamilies, in GRAPHVIZ format.
Input Files
  • *.cons -- Consensus File: The consensus used to generate the aligned seqeunce file. This is in fasta format.
  • *.seqs -- Aligned Sequences File: Near full length aligned sequences to the consensus file ( one per line ). Deletions are represented by 1 or more "-" characters. Insertions of any length are represented by a single "+" character.
  • *.ins -- Inserts File: Inserts pulled from the alignments are stored in this file. The aligned sequence file ordering is used in this file. Each insert is encoded as:
                [<position>:<sequence>][<position>:<sequence>]..
    
        ie.
                 3:AA 10:AGTTTA
        
    A script is provided to build these files automatically given either wu-blast or cross_match alignment files -- see below.
Running Using Your Own Data
  1. Cross_match a reference sequence against a genome or database.
           cross_match line1copies consensus -M 25p41g.matrix
                       -gap_init -25 -gap_ext -5 -minscore 200
                       -minmatch 6 -alignments -bandwidth 50 -word_raw > LINE1
           
    The example file LINE1 included in this distribution was created using the command line above and can be used directly in the following steps.

  2. Determine consensus range to use for analysis ( ie. 298 - 797 bp )

  3. Create input files to alkes programs:
           preprocessAlignments.pl -maxEdgeGap 10
                                    -minConsRange 298
                                    -maxConsRange 797
                                    -alignments LINE1
    
           This will create 3 new files: LINE1.seqs
                                         LINE1.ins
                                         LINE1.cons
           
    NOTE: Use the -w flag to preprocessAlignments if you use WUBlast to perform the alignments.

  4. Run analysis:
           runcoseg.pl -t -m 50 -c LINE1.cons -s LINE1.seqs
                       -i LINE1.ins
           
    NOTE: In this example we use two new flags. The first "-a" indicates we want to use the newer pvalue calculation developed by Andy Siegel. The second flag "-t" indicates we want to use 3 bp co-segregating mutations as well as 2bp co-segregating mutations when developing subfamilies.

  5. Create png/svg files
           dot LINE1.seqs.tree.viz -Tpng -o LINE1.seqs.tree.viz.png
    
           dot LINE1.seqs.tree.viz -Tsvg -o LINE1.seqs.tree.viz.svg
           

  6. Open up a web browser and point it at the file LINE1.seqs.tree.viz.svg. Most browsers support zooming in on svg files. If you want to render .viz file larger by default simply edit the *.viz file and change the line:
                 size="8,10";
         
    to something larger. I.e:
                 size="16,20";
         
Release Notes
coseg-0.2.1
  • Improved code documentation
  • Single mutation significance cutoff ( SIGMATHRESH ) was pre-calculated for Alkes Alu analysis and hardcoded. This version calculates the correct sigma cutoff using the length of the input sequence.
  • Fixed bug with implementation of Siegel's pValue calculation which caused a segfault -- found by Neal Platt.
  • Switched default pvalue method to Andy Siegel's method and provided a new "-k" switch to use Alkes Price's method.
  • Fixed bug where the program was exiting when calculations fell below the precision of the machine ( epsilon ). Message given was "Below epsilon..." and the runcoseg.pl script moved on even though coseg failed.

coseg-0.2.0

  • First release of COSEG.
COSEG is licensed under the Open Source License v2.1.
Institute for Systems Biology
This server is made possible by funding from the National Human Genome Research Institute (NIGRI grant # RO1 HG002939-01) 2003.