Computing Patterns of Strings, NSF project, Francine Blanchet-Sadri

Computing Patterns in Strings
Francine Blanchet-Sadri
University of North Carolina
P.O. Box 26170
Greensboro, NC 27402-6170

Contact Information
Email: blanchet@uncg.edu
Home page: http://www.uncg.edu/~blanchet
Phone: (336) 256-1125, Fax: (336) 256-0439

Project URL
http://www.uncg.edu/~blanchet/nsf/computing-patterns.html

List of Supported Students

Phaedra Agius
Arundhati Anavekar
Ajay Chriscoe
Crystal Davis
Stacy Duncan
Firdouse Gama
Liem Mai
Margaret Moorefield
Brian Shirey
Lili Zhang

Project Award Information

NSF Award Number: CCF-0207673.
Duration: June 1, 2002 - September 30, 2005
Title: RUI: Computing Patterns in Strings

Keywords
Strings, Patterns, Periods, Repetitions, Covers, Seeds, Approximate Patterns, Distance Measures, String Algorithms, Dynamic Programming.

Project Summary

The problem of computing patterns in sequences or strings of characters from a finite alphabet has important applications in numerous areas of computer science, notably in data compression, information theory, and pattern matching. This problem has also important applications in biology. The stimulus for such recent works is the study of biological sequences such as DNA and protein that play a central role in molecular biology. DNA sequences can be viewed as quite long but finite strings of nucleotides of 4 different types, while protein sequences can be viewed as finite strings of amino acids of 20 possible types. Patterns such as periodicities and repetitions make up a significant fraction of both DNA and protein sequences. Although the functions of these patterns are not well understood, they appear important for understanding the expression, regulation and evolution of a biological sequence. These patterns can be used to identify the sequence among other sequences, an application that plays a role in genetic fingerprinting. Repetitions in biological sequences have been associated with human genetic diseases. They also complicate multiple sequence alignment because matches may be present in numerous places.

The literature has generally considered problems in which a period u of a repetition is invariant. It has been required that occurrences of u match each other exactly. Due to the action of evolutionary mutation, patterns in biological strings are seldom exact but rather approximate. It therefore becomes necessary to recognize u' as an occurence of u if the distance between u' and u is bounded by a certain threshold. Several definitions of distance have been proposed like the Hamming distance which counts the minimum number of character substitutions required to transform u' into u, and the edit distance which counts the minimum number of substitutions, insertions, and deletions of characters required to transform u' into u. Although there is an enormous literature dealing with approximate pattern matching according to these and other definitions of distance, very little is known on approximate repetitions, a version of repetitions where errors are allowed, and much remains to be done.

Given the importance of patterns in biological strings and the exponential growth in the size of the DNA database, it is important to develop efficient methods for detecting these patterns. This project studies patterns such as periodicities, repetitions, covers, and seeds and their approximate versions built upon various commonly used distance measures. It is natural to consider approximate string matching techniques when developing algorithms for approximate patterns. Techniques that might prove useful include recent combinatorial techniques related to partial strings which are strings where a number of gap characters are allowed. Techniques also include the cover array, the highest scoring paths in weighted grid graphs, the probabilistic models that have been proposed for repetitions, and the subtree max gap problem which seems to be a powerful tool in parallel algorithm design. Several students, undergraduate and graduate, were involved in different aspects of this project.

Publications
Please refer to http://www.uncg.edu/~blanchet for a complete publications list.

F. Blanchet-Sadri and Ajay Chriscoe, " Local Periods and Binary Partial Words: An Algorithm. " Theoretical Computer Science, Vol. 314 (2004) 189-216.
F. Blanchet-Sadri, "Codes, Orderings, and Partial Words." Theoretical Computer Science, Vol. 329 (2004) 177-202 .
F. Blanchet-Sadri and S. Duncan, " Partial Words and the Critical Factorization Theorem." Journal of Combinatorial Theory, Series A, Vol. 109 (2005) 221-245.
F. Blanchet-Sadri, "Primitive Partial Words." Discrete Applied Mathematics, Vol. 148 (2005) 195-213.
Lili Zhang and F. Blanchet-Sadri, " Algorithms for Approximate k-Covering of Strings." International Journal of Foundations of Computer Science, Vol. 16 (2005) 1231-1251.
F. Blanchet-Sadri and Arundhati R. Anavekar, " Testing Primitivity on Partial Words." Discrete Applied Mathematics, Vol. 155 (2007) 279-287.
F. Blanchet-Sadri and Margaret Moorefield, " Pcodes of Partial Words."
F. Blanchet-Sadri, C.D. Davis, Joel Dodge, Robert Mercas and Margaret Moorefield, "Unbordered Partial Words." Discrete Applied Mathematics, to appear.
F. Blanchet-Sadri and Brian Shirey, "Periods, Partial Words, and a Result of Guibas and Odlyzko."

Presentation

Mots partiels et periodicite, LIAFA, Laboratoire d'Informatique Algorithmique: Fondements et Applications, University of Paris 7, Paris, France, October 15, 2004.

Project Impact
The following students have been involved in various stages of this project:

Phaedra Agius, "Patterns in Strings." M.A. Thesis, University of North Carolina at Greensboro, April 2003 (nominated for the UNCG Outstanding Thesis Award 2003, and for the 2005 Master's Thesis Award of the Conference of Southern Graduate Schools).
Arundhati Anavekar, " Testing Primitivity on Partial Words." (Independent Study), July 2004-January 2005.
Cheng-Tao Chen, "Periods and Binary Partial Words with Two Holes." M.S. Thesis, University of North Carolina at Greensboro, April 2004.
Ajay Chriscoe, "Local Periods and Binary Partial Words: An Algorithm. " (Undergraduate project), June 2002 - May 2003.
Ajay Chriscoe, "Approximate Patterns in Strings. " (Undergraduate project), June 2003 - May 2004.
Ajay Chriscoe, "Partial Words and the Critical Factorization Theorem. " (Undergraduate project), May 2004 - August 2004.
Crystal Davis, "Unbordered Partial Words." M.A. Thesis, University of North Carolina at Greensboro, April 2005 (nominated for the UNCG Outstanding Thesis Award 2005, and for the 2007 Master's Thesis Award of the Conference of Southern Graduate Schools).
Stacy Duncan, "Periods in Strings." M.A. Thesis, University of North Carolina at Greensboro, April 2004 (nominated for the UNCG Outstanding Thesis Award 2004).
Firdouse Gama, "Periods and Binary Partial Words." (REU Participant), December 2004-October 2005.
Donald Luhmann, "Compatibility Between Strings." (Independent study), June 2002 - July 2002.
Liem Mai, "Approximate Covers of Strings." M.S. Thesis, University of North Carolina at Greensboro, April 2003.
Margaret Moorefield, "Pcodes." M.S. Thesis, University of North Carolina at Greensboro, April 2005.
Margaret Moorefield, "Unbordered Partial Words." May 2005.
Baouyen Pham, "Approximate Periods of Strings." (Independent study), June 2002 - August 2002.
Brian Shirey, "Periods, Partial Words, and a Result of Guibas and Odlyzko." (Undergraduate project), Summer 2003, Spring 2004 - Fall 2005.
Lili Zhang, " Algorithms for Approximate k-Covering." M.S. Thesis, University of North Carolina at Greensboro, March 2004.

Acknowledgement: This material is based upon work supported by the National Science Foundation under grant No. CCF-0207673.

Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.