HarambeNet06 Regex

Regular expressions are incredibly useful in searching large chunks of text, whether the text is a work of literature or the genome of a plant or animal.

Here's a Java program (via Java webstart) that provides a GUI that allows the user to query a dictionary of English words using regular expressions. For example, using the query below finds any word with a three letter repeat, e.g., assassin and alfalfa.

   (...)\1

Here's a screenshot of the program finding the matches for the reglar expression. assissin screen shot

If you could scroll the entire list of matches, which words would you see? (circle those that match the regex).

  1. butterer
  2. impinging
  3. instantaneous
  4. memento
  5. murmur
  6. nonsense

More than one part of a regexp can be tagged. For example.

  (.)(.)(.)\3\2\1$
matches as shown in the screenshot below.

three letter pals

Note that the palindrome must end the word because of the $ that matches a word-end. If the caret symbols '^' is used at the front of the regexp pattern we get only two matches: "hannah" and "redder". Which of the following do NOT match ^(.)(.)\2\1? For each that doesn't match, provide a regexp that would match the palindromic part of the word.

  1. cassette
  2. essence
  3. millimiter
  4. oppose
  5. semester


Regex Thinking and Studying Questions

  1. The regular expression (....)\1 generates exactly one match, the word beriberi. A small modification, using the regular expression (....).\1 generates two matches: bandstands and hodgepodge. Explain why beriberi does not match the second regular expression and why the second expression has the two matches it does.
    
    
    
    
    
    
    
    
  2. If the regular expression above is changed to (....).*\1 then 13 matches are found as shown below.

    1. atherosclerosis
    2. bandstands
    3. beriberi
    4. hodgepodge
    5. kinnickinnic
    6. knickerbocker
    7. knickerbockers
    8. lightweight
    9. misunderstander
    10. misunderstanders
    11. nationalization
    12. rationalization
    13. rationalizations

    Explain broadly why there are more matches, including all three described in part A. Explain specifically why atherosclerosis matches. Finally, circle the five out of the thirteen that match (.....).*\1 (note: one extra dot in the parentheses).

    
    
    
    
    
    
    
    
  3. The regular expression sp[ai]s generates twelve matches as follows. Explain why despise and spasm match this regular expression.

    1. despise
    2. despised
    3. despises
    4. despising
    5. dispassionate
    6. spasm
    7. spastic
    8. trespass
    9. trespassed
    10. trespasser
    11. trespassers
    12. trespasses
    
    
    
    
    
    
  4. This regular expression ^(.{2,4})\1$ generates a list of 10 words as follows. Based on your knowledge of regular expressions and your ability to analyze data, provide an explanation of how the regular expression works in generating this list. Note that we did not discuss the {2,4} part of the expression, you have to offer an explanation of this based on what you know and what the data show.

    1. beriberi
    2. booboo
    3. coco
    4. dada
    5. isis
    6. mama
    7. mimi
    8. murmur
    9. papa
    10. toto
    
    
  5. To find seven-letter palindromes, someone enters this regular expression (.)(.)(.).\3\2\1$. This generates seven matches, but only the last two are seven letter palindromes. Explain why precipice matches this regular expression and how to fix the regex so that only palindromes match it.

    1. analyticity
    2. interpret
    3. precipice
    4. recognizing
    5. reinterpret
    6. reviver
    7. rotator
    
    
    
    
    
    
    
    
    
  6. Recall that a start codon is ATG and that a stop codon is any of the three TAG, TGA, or TAA. The Java code below is an attempt to find start/stop codon pairs in a strand of DNA. A run is shown for the strand indicated in the program.

    Here's the code

    import java.util.regex.*; public class Restrict { static String dna = "ATGxxxTAG...ATGyyyyzzzTGA...ATGwwwwTAA...ATGaaaaaaa"; // 012345678901234567890123456789012345678901234567890123 public static void main(String[] args){ Pattern starter = Pattern.compile("(ATG).*?(TAG|TGA|TAA)"); Matcher match = starter.matcher(dna); while (match.find()){ System.out.println(match.start()+ " "+match.end()); System.out.println(match.group()); System.out.println("---"); } } } When this code is run it generates the three matches shown below on the left. However, when the regular expression is changed so that the question mark is removed, that is it becomes "(ATG).*(TAG|TGA|TAA)", then the output generated shows only one match as displayed on the right below (this is a separate run of the code). \begin{verbatim} Run/executed with ? in regex Run/executed without ? in regex

    With ? in Regex Without ? in Regex
    0 9                                     
    ATGxxxTAG                               
    ---                                     
    12 25
    ATGyyyyzzzTGA
    ---
    28 38
    ATGwwwwTAA
    ---
    
    0 38
    ATGxxxTAG...ATGyyyyzzzTGA...ATGwwwwTAA
    ---
    

    Provide an explanation of the different behavior based on your knowledge of regular expressions and your ability to reason. We did not discuss the question mark as part of regular expressions. When it's used, the regular expression matches are called reluctant. When the question mark is not used, a match is called greedy. These terms may help in your explanation.


    Owen L. Astrachan
    Last modified: Wed Jul 12 06:56:16 EDT 2006