Computing Patterns in Strings
Francine Blanchet-Sadri
University of North Carolina
P.O. Box 26170
Greensboro, NC 27402-6170
Contact Information
Email: blanchet@uncg.edu
Home page: http://www.uncg.edu/~blanchet
Phone: (336) 256-1125, Fax: (336) 256-0439
Project URL
http://www.uncg.edu/~blanchet/nsf/computing-patterns.html
List of Supported Students
- Phaedra Agius
- Arundhati Anavekar
- Ajay Chriscoe
- Crystal Davis
- Stacy Duncan
- Firdouse Gama
- Liem Mai
- Margaret Moorefield
- Brian Shirey
- Lili Zhang
Project Award Information
- NSF Award Number: CCF-0207673.
- Duration: June 1, 2002 - September 30, 2005
- Title: RUI: Computing Patterns in Strings
Keywords
Strings, Patterns, Periods, Repetitions, Covers, Seeds, Approximate
Patterns, Distance Measures, String Algorithms, Dynamic Programming.
Project Summary
The problem of computing patterns in sequences or strings of characters
from a finite alphabet has important applications in numerous areas of
computer science, notably in data compression, information theory, and
pattern matching. This problem has also important applications in biology.
The stimulus for such recent works is the study of biological sequences
such as DNA and protein that play a central role in molecular biology.
DNA sequences can be viewed as quite long but finite strings of nucleotides
of 4 different types, while protein sequences can be viewed as finite strings
of amino acids of 20 possible types. Patterns such as periodicities and
repetitions make up a significant fraction of both DNA and protein sequences.
Although the functions of these patterns are not well understood, they
appear important for understanding the expression, regulation and evolution
of a biological sequence. These patterns can be used to identify the sequence
among other sequences, an application that plays a role in genetic fingerprinting.
Repetitions in biological sequences have been associated with human genetic
diseases. They also complicate multiple sequence alignment because matches
may be present in numerous places.
The literature has generally
considered problems in which a period u of a repetition is invariant.
It has been required that occurrences of u match each other exactly.
Due to the action of evolutionary mutation, patterns in biological strings
are seldom exact but rather approximate. It therefore becomes necessary
to recognize u' as an occurence of u if the distance between
u' and u is bounded by a certain threshold. Several definitions
of distance have been proposed like the Hamming distance which
counts the minimum number of character substitutions required to transform
u' into u, and the edit distance which counts the
minimum number of substitutions, insertions, and deletions of characters
required to transform u' into u. Although there is an enormous
literature dealing with approximate pattern matching according to these
and other definitions of distance, very little is known on approximate
repetitions, a version of repetitions where errors are allowed, and much
remains to be done.
Given the importance of patterns
in biological strings and the exponential growth in the size of the DNA
database, it is important to develop efficient methods for detecting these
patterns. This project studies patterns such as periodicities, repetitions,
covers, and seeds and their approximate versions built upon various commonly
used distance measures. It is natural to consider approximate string matching
techniques when developing algorithms for approximate patterns. Techniques
that might prove useful include recent combinatorial techniques related
to partial strings which are strings where a number of gap characters are
allowed. Techniques also include the cover array, the highest scoring paths
in weighted grid graphs, the probabilistic models that have been proposed
for repetitions, and the subtree max gap problem which seems to be a powerful
tool in parallel algorithm design. Several students, undergraduate and graduate,
were involved in different aspects of this project.
Publications
Please refer to http://www.uncg.edu/~blanchet
for a complete publications list.
- F. Blanchet-Sadri and Ajay Chriscoe, "
Local Periods and Binary Partial Words: An Algorithm.
" Theoretical Computer Science, Vol. 314 (2004) 189-216.
- F. Blanchet-Sadri, "Codes, Orderings, and Partial Words."
Theoretical Computer Science, Vol. 329 (2004) 177-202 .
- F. Blanchet-Sadri and S. Duncan, "
Partial Words and the Critical Factorization Theorem."
Journal of Combinatorial Theory, Series A,
Vol. 109 (2005) 221-245.
- F. Blanchet-Sadri, "Primitive Partial Words."
Discrete Applied Mathematics, Vol. 148 (2005) 195-213.
- Lili Zhang and F. Blanchet-Sadri,
"
Algorithms for Approximate k-Covering of Strings."
International Journal of Foundations of Computer Science,
Vol. 16 (2005) 1231-1251.
- F. Blanchet-Sadri and Arundhati R. Anavekar,
"
Testing Primitivity on Partial Words."
Discrete Applied Mathematics, Vol. 155 (2007) 279-287.
- F. Blanchet-Sadri and Margaret Moorefield,
"
Pcodes of Partial Words."
- F. Blanchet-Sadri, C.D. Davis, Joel Dodge, Robert Mercas and Margaret Moorefield,
"Unbordered
Partial Words." Discrete Applied Mathematics, to appear.
- F. Blanchet-Sadri and Brian Shirey,
"Periods,
Partial Words, and a Result of Guibas and Odlyzko."
Presentation
- Mots partiels et periodicite,
LIAFA, Laboratoire d'Informatique Algorithmique:
Fondements et Applications, University of Paris 7, Paris, France,
October 15, 2004.
Project Impact
The following students have been involved in various stages of
this project:
- Phaedra Agius, "Patterns in Strings." M.A. Thesis, University of North
Carolina at Greensboro, April 2003
(nominated for the UNCG Outstanding Thesis Award 2003,
and for the 2005 Master's Thesis Award of the Conference of
Southern Graduate Schools).
- Arundhati Anavekar,
"
Testing Primitivity on Partial Words."
(Independent Study), July 2004-January 2005.
- Cheng-Tao Chen,
"Periods and Binary
Partial Words with Two Holes."
M.S. Thesis,
University of North Carolina at Greensboro, April 2004.
- Ajay Chriscoe, "Local Periods
and Binary Partial Words: An Algorithm.
" (Undergraduate project), June 2002 - May 2003.
- Ajay Chriscoe, "Approximate
Patterns in Strings.
" (Undergraduate project), June 2003 - May 2004.
- Ajay Chriscoe, "Partial Words
and the Critical Factorization Theorem.
" (Undergraduate project), May 2004 - August 2004.
- Crystal Davis, "Unbordered Partial Words." M.A. Thesis, University
of North Carolina at Greensboro, April 2005
(nominated for the UNCG Outstanding Thesis Award 2005, and for the
2007 Master's Thesis Award of the Conference of Southern Graduate
Schools).
- Stacy Duncan, "Periods in Strings." M.A. Thesis, University of North
Carolina at Greensboro, April 2004
(nominated for the UNCG Outstanding Thesis Award 2004).
- Firdouse Gama,
"Periods and Binary Partial Words."
(REU Participant), December 2004-October 2005.
- Donald Luhmann, "Compatibility Between Strings." (Independent study),
June 2002 - July 2002.
- Liem Mai, "Approximate Covers of Strings." M.S. Thesis, University
of North Carolina at Greensboro, April 2003.
- Margaret Moorefield,
"Pcodes."
M.S. Thesis, University of North Carolina
at Greensboro, April 2005.
- Margaret Moorefield,
"Unbordered
Partial Words." May 2005.
- Baouyen Pham, "Approximate Periods of Strings." (Independent study),
June 2002 - August 2002.
- Brian Shirey,
"Periods, Partial Words,
and a Result of Guibas and Odlyzko."
(Undergraduate project), Summer 2003, Spring 2004 - Fall 2005.
- Lili Zhang, "
Algorithms for Approximate k-Covering."
M.S. Thesis, University of North Carolina at Greensboro,
March 2004.
|
|
Acknowledgement: This material is based upon work
supported
by the National Science Foundation under grant No. CCF-0207673.
Disclaimer: Any opinions, findings, and conclusions or
recommendations
expressed in this material are those of the authors and do not necessarily
reflect the views of the National Science Foundation.
|