RepeatMasker Frequently Asked Questions

General

What are genomic interspersed repeats?
What does RepeatMasker do?
What citation should I use for the software/website?
Why isn't there a 1-1 relationship between RM repeats and Repbase repeats?

What are genomic interspersed repeats?

In the mid-1960s, scientists discovered that many genomes contain stretches of highly repetitive DNA sequences (see Reassociation Kinetics Experiments and C-Value Paradox). These sequences were later characterized and placed into five categories:

Simple Repeats - Duplications of simple sets of DNA bases (typically 1-5bp) such as A, CA, CGG etc.
Tandem Repeats - Typically found at the centromeres and telomeres of chromosomes, these are duplications of more complex 100-200 base sequences.
Segmental Duplications - Large blocks of 10-300 kilobases that have been copied to another region of the genome.
Interspersed Repeats
1. Processed Pseudogenes, Retrotranscripts, SINES - Non-functional copies of RNA genes which have been reintegrated into the genome with the assistance of a reverse transcriptase.
2. DNA Transposons -
3. Retrovirus Retrotransposons -
4. Non-Retrovirus Retrotransposons ( LINES ) -

Currently, up to 50% of the human genome is known to be repetitive in nature, and this number is expected to increase as detection methods improve.

What does RepeatMasker do?

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns).
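The masking step described above can be illustrated with a minimal sketch. This is not RepeatMasker's own code: the function name `mask_repeats`, the 0-based half-open interval convention, and the `soft` parameter (lowercase instead of Ns) are assumptions made for illustration only.

```python
def mask_repeats(sequence, intervals, soft=False):
    """Return `sequence` with every annotated interval masked.

    `intervals` holds (start, end) pairs, assumed here to be 0-based and
    half-open. With soft=False each masked base becomes "N"; with
    soft=True it is lowercased instead (hypothetical option).
    """
    chars = list(sequence)
    for start, end in intervals:
        # Clamp to the sequence bounds so stray coordinates cannot raise.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)


if __name__ == "__main__":
    query = "ACGTACGTAC"
    print(mask_repeats(query, [(2, 5)]))             # hard masking with Ns
    print(mask_repeats(query, [(2, 5)], soft=True))  # lowercase masking
```

The annotation and the masked sequence carry the same information in two forms: the intervals say where the repeats are, and the masked sequence hides them from downstream tools.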

Sequence comparisons in RepeatMasker are performed by the program cross_match, an efficient implementation of the Smith-Waterman-Gotoh algorithm developed by Phil Green.
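For readers unfamiliar with the algorithm, here is a textbook sketch of Smith-Waterman local alignment with Gotoh's affine gap recurrence. It computes only the best score and says nothing about how cross_match is implemented. The function name and the scoring values (match, mismatch, gap open, gap extend) are assumptions chosen for illustration, not cross_match defaults.

```python
def sw_gotoh_score(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score of a and b with affine gap penalties.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # H: best score of an alignment ending at (i, j), floored at 0.
    # E: best score ending with a gap that consumes b[j-1] only.
    # F: best score ending with a gap that consumes a[i-1] only.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either extend an existing gap or open a new one.
            E[i][j] = max(E[i][j - 1] + gap_extend, H[i][j - 1] + gap_open)
            F[i][j] = max(F[i - 1][j] + gap_extend, H[i - 1][j] + gap_open)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a local alignment restart anywhere.
            H[i][j] = max(0, H[i - 1][j - 1] + sub, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(sw_gotoh_score("ACGTACGT", "ACGTTACGT"))
```

Separating gap opening from gap extension lets one long insertion or deletion cost less than several scattered ones.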

What citation should I use for the software/website?

Please use the following for both the website and the RepeatMasker software package:

Smit, AFA, Hubley, R & Green, P. RepeatMasker Open-4.0.
2013-2015 <http://www.repeatmasker.org>.

Smit, AFA, Hubley, R & Green, P. RepeatMasker Open-3.0.
1996-2010 <http://www.repeatmasker.org>.

Please use the following for the RepeatModeler software:

Smit, AFA, Hubley, R. RepeatModeler Open-1.0.
2008-2015 <http://www.repeatmasker.org>.

Why isn't there a 1-1 relationship between RM repeats and Repbase repeats?

Currently RepeatMasker's library and Repbase are not equivalent. RepeatMasker's library contains a further level of curation where sequences are optimized for searching with RepeatMasker. This optimization improves search time, selectivity, sensitivity, and clarity of annotation when used with RepeatMasker.

LINE fragmentation is one example of this divergence. Since LINE copies tend to be 5' truncated, full-length models of the detailed subfamily structure apparent in the well-represented and fast-evolving 3' end are difficult to obtain. Rather than comparing the query to a large number of full-length (6-8 kb) consensus sequences that are identical except for the very 3' end, we often fragment LINE models into domains (e.g. 5' end, ORF2-region, and 3' end) which are transparently merged in the RepeatMasker annotation as if the matching was done to a full-length consensus. Our nomenclature and fragmentation thus create a many-to-one relationship with Repbase full-length entries.
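The idea of merging domain-level matches into one annotation can be sketched as follows. This is a simplified illustration and not RepeatMasker's actual merging logic: the `Hit` record, the domain labels, the `max_gap` threshold, and the forward-strand-only assumption are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    start: int    # query start, 0-based inclusive (assumed convention)
    end: int      # query end, exclusive
    family: str   # e.g. "L1"
    domain: str   # hypothetical labels: "5end", "orf2", "3end"


# Expected 5'-to-3' order of the fragment models (illustrative labels).
DOMAIN_ORDER = {"5end": 0, "orf2": 1, "3end": 2}


def merge_domain_hits(hits, max_gap=50):
    """Join neighbouring hits of one family whose domains appear in order.

    Only forward-strand hits are considered in this sketch.
    """
    merged = []
    for hit in sorted(hits, key=lambda h: h.start):
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev["family"] == hit.family
                and hit.start - prev["end"] <= max_gap
                and DOMAIN_ORDER[hit.domain] > DOMAIN_ORDER[prev["domains"][-1]]):
            # Extend the previous annotation instead of starting a new one.
            prev["end"] = max(prev["end"], hit.end)
            prev["domains"].append(hit.domain)
        else:
            merged.append({"family": hit.family, "start": hit.start,
                           "end": hit.end, "domains": [hit.domain]})
    return merged


if __name__ == "__main__":
    hits = [Hit(100, 900, "L1", "orf2"), Hit(910, 1500, "L1", "3end"),
            Hit(5000, 5300, "L1", "3end")]
    for annotation in merge_domain_hits(hits):
        print(annotation)
```

In this example the ORF2 and 3' end hits that sit next to each other are reported as a single L1 element, while the distant 3' end fragment stays separate, as a 5' truncated copy would.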

RepeatMasker annotation may also include names not in Repbase due to a name generalization/specialization feature of the program. Name generalization is the renaming of elements when the alignment is contained in a subsequence shared with, or very similar to, that found in other models, or when the alignment quality is poor. The name chosen is a general name for the family and often is not a specific sequence in Repbase. Name specialization is a similar process by which fragment names are refined by RepeatMasker using information such as the flanking LTRs of a retrovirus-like element for which the internal sequence is not represented in the database.

Other cases in which a mapping isn't straightforward include differences in consensus sequence (both composition and length), and additional low-complexity/simple repeat models used only in RepeatMasker.

Institute for Systems Biology
This server is made possible by funding from the National Human Genome Research Institute (NHGRI grant # RO1 HG002939).