HarambeNet06 Regex
Regular expressions are incredibly useful in searching large chunks of
text, whether the text is a work of literature or the genome of a plant
or animal.
Here's a Java program (via Java webstart) that provides a GUI for
querying a dictionary of English words using regular expressions. For
example, the query below finds any word with a three-letter repeat, e.g.,
assassin and alfalfa.
(...)\1
Here's a screenshot of the program finding the matches for the regular
expression.
If you could scroll the entire list of matches, which words would you
see? (Circle those that match the regex.)
- butterer
- impinging
- instantaneous
- memento
- murmur
- nonsense
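The query tool's behavior can be approximated in a few lines of Java. The sketch below is not the webstart program's source: the class name `WordQuery`, the method `matching`, and the three-word sample list are illustrative assumptions. It uses `Matcher.find()`, so a word is reported when the pattern matches anywhere inside it, which is consistent with assassin and alfalfa matching `(...)\1`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the dictionary query tool; not the original program.
class WordQuery {
    // Returns the words in which the regex matches somewhere (find, not a full match).
    static List<String> matching(String regex, List<String> words) {
        Pattern p = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String w : words) {
            if (p.matcher(w).find()) {
                hits.add(w);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // \1 refers back to whatever the first parenthesized group captured.
        // In a Java string literal the backslash itself must be escaped.
        List<String> sample = Arrays.asList("assassin", "alfalfa", "dictionary");
        System.out.println(matching("(...)\\1", sample));
    }
}
```

Swapping in other words or other patterns from this page is a quick way to check an answer by experiment.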
More than one part of a regexp can be tagged. For example,
(.)(.)(.)\3\2\1$
matches as shown in the screenshot below.
Note that the palindrome must end the word because the $ matches the end
of a word. If the caret symbol '^' is added at the front of the regexp
pattern, we get only two matches: "hannah" and "redder". Which of
the following do NOT match ^(.)(.)\2\1? For each that doesn't match,
provide a regexp that would match the palindromic part of the word.
- cassette
- essence
- millimeter
- oppose
- semester
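The effect of the two anchors can be checked directly. This is a minimal sketch: the class name `AnchorDemo`, the helper `found`, and the made-up string `xxhannah` are assumptions chosen for illustration, while `hannah` comes from the text above.

```java
import java.util.regex.Pattern;

// Illustrative helper; names are hypothetical.
class AnchorDemo {
    // True when the regex matches somewhere in the word.
    static boolean found(String regex, String word) {
        return Pattern.compile(regex).matcher(word).find();
    }

    public static void main(String[] args) {
        String endOnly = "(.)(.)(.)\\3\\2\\1$";   // palindrome must end the word
        String both = "^(.)(.)(.)\\3\\2\\1$";     // palindrome must be the whole word

        System.out.println(found(endOnly, "hannah"));    // true
        System.out.println(found(both, "hannah"));       // true
        System.out.println(found(endOnly, "xxhannah"));  // true: the string ends with hannah
        System.out.println(found(both, "xxhannah"));     // false: the palindrome does not start the string
    }
}
```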
Regex Thinking and Studying Questions
-
The regular expression (....)\1 generates exactly one match, the
word beriberi. A small modification, using the regular
expression (....).\1 generates two matches: bandstands
and hodgepodge. Explain why beriberi does not
match the second regular expression and why the second expression has
the two matches it does.
-
If the regular expression above is changed to
(....).*\1 then 13
matches are found as shown below.
- atherosclerosis
- bandstands
- beriberi
- hodgepodge
- kinnickinnic
- knickerbocker
- knickerbockers
- lightweight
- misunderstander
- misunderstanders
- nationalization
- rationalization
- rationalizations
Explain broadly why there are more matches, including all three
described in part A. Explain specifically why atherosclerosis
matches. Finally, circle the five out of the thirteen that match
(.....).*\1 (note: one extra dot in the parentheses).
-
The regular expression sp[ai]s generates twelve matches as
follows. Explain why despise and spasm match this regular
expression.
- despise
- despised
- despises
- despising
- dispassionate
- spasm
- spastic
- trespass
- trespassed
- trespasser
- trespassers
- trespasses
-
The regular expression ^(.{2,4})\1$ generates a list of 10 words
as follows. Based on your knowledge of regular expressions and your
ability to analyze data, explain how the regular expression generates
this list. Note that we did not discuss the {2,4} part of the
expression; you have to offer an explanation of it based on what you
know and what the data show.
- beriberi
- booboo
- coco
- dada
- isis
- mama
- mimi
- murmur
- papa
- toto
-
To find seven-letter palindromes, someone enters the regular
expression (.)(.)(.).\3\2\1$. This generates seven matches, but only
the last two are seven-letter palindromes. Explain why precipice
matches this regular expression and how to fix the regex so that only
palindromes match it.
- analyticity
- interpret
- precipice
- recognizing
- reinterpret
- reviver
- rotator
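When a word matches unexpectedly, it often helps to print the part of the word that the pattern actually covered. The sketch below reports where the first match begins and what text it spans, using `Matcher.start()` and `Matcher.group()`. The class name `MatchInspector` and the method `describe` are illustrative assumptions, not part of the original tool.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical debugging aid for exploring why a word matches.
class MatchInspector {
    // Describes the first match as "start:text", or returns "no match".
    static String describe(String regex, String word) {
        Matcher m = Pattern.compile(regex).matcher(word);
        if (!m.find()) {
            return "no match";
        }
        return m.start() + ":" + m.group();
    }

    public static void main(String[] args) {
        String regex = "(.)(.)(.).\\3\\2\\1$";
        for (String w : new String[] {"reviver", "precipice"}) {
            System.out.println(w + " -> " + describe(regex, w));
        }
    }
}
```

Comparing the reported start index for a true palindrome with the one for a word such as precipice is a reasonable starting point for the explanation asked for above.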
-
Recall that a start codon is ATG and that a stop codon
is any of the three TAG, TGA, or TAA. The Java
code below is an attempt to find start/stop codon pairs in a strand of
DNA. A run is shown for the strand indicated in the program.
Here's the code:
import java.util.regex.*;

public class Restrict
{
    static String dna = "ATGxxxTAG...ATGyyyyzzzTGA...ATGwwwwTAA...ATGaaaaaaa";
    //                   012345678901234567890123456789012345678901234567890123

    public static void main(String[] args){
        // Start codon, then any characters (reluctantly), then a stop codon.
        Pattern starter = Pattern.compile("(ATG).*?(TAG|TGA|TAA)");
        Matcher match = starter.matcher(dna);
        // Report each match: start and end offsets, then the matched text.
        while (match.find()){
            System.out.println(match.start()+ " "+match.end());
            System.out.println(match.group());
            System.out.println("---");
        }
    }
}
When this code is run, it generates the three matches shown below on the
left. However, when the question mark is removed from the regular
expression, so that it becomes "(ATG).*(TAG|TGA|TAA)", the output shows
only one match, as displayed on the right below (this is a separate run
of the code).
With ? in regex   | Without ? in regex
==================+========================================
0 9               | 0 38
ATGxxxTAG         | ATGxxxTAG...ATGyyyyzzzTGA...ATGwwwwTAA
---               | ---
12 25             |
ATGyyyyzzzTGA     |
---               |
28 38             |
ATGwwwwTAA        |
---               |
Provide an explanation of the different behavior based on your knowledge
of regular expressions and your ability to reason. We did not discuss
the question mark as part of regular expressions. When it's used, the
regular expression matches are called reluctant. When the
question mark is not used, a match is called greedy. These terms
may help in your explanation.
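The two runs can also be compared in a single program. The sketch below reuses the DNA strand and both patterns from the code above; the class name `QuantifierCompare` and the helper `spans` are hypothetical names introduced here. It collects the start and end offsets of every match so the reluctant and greedy results can be read next to each other.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical side-by-side comparison of the two patterns from the exercise.
class QuantifierCompare {
    static final String DNA = "ATGxxxTAG...ATGyyyyzzzTGA...ATGwwwwTAA...ATGaaaaaaa";

    // Collects "start end" for every match of regex in text.
    static List<String> spans(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.start() + " " + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        // .*? is the reluctant form, .* is the greedy form.
        System.out.println("reluctant: " + spans("(ATG).*?(TAG|TGA|TAA)", DNA));
        System.out.println("greedy:    " + spans("(ATG).*(TAG|TGA|TAA)", DNA));
    }
}
```

The printed offsets agree with the two runs shown above: three short spans for the reluctant pattern and one long span for the greedy one.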
Owen L. Astrachan
Last modified: Wed Jul 12 06:56:16 EDT 2006