Java Regex - Atomic Groups

[Updated: Jan 23, 2016, Created: Jan 17, 2016]

Atomic groups don't back off (backtrack) once a match is successful. Atomic groups basically discards/forgets the subsequent parts in the group once a token matches.

Syntax:

<?>X> where X can be literal(s) or any valid regular expression(s). Atomic groups are not capturing themselves but can have nested capturing groups


Examples:

To understand what atomic groups are and how they work, let's consider step by step examples.

/* Without atomic group.*/
Pattern.compile("p(ut|u)t")
.matcher("putt")
.find();//matches: 'putt' at 0-4 //'putt' /* This is also without atomic group. The literal part of the expression 'p' matches input starting part 'p'. Then the left site of the pipe, 'ut' will match the remaining input 'ut'. There's nothing left for the last literal part of the expression 't', so this attempt fails. The engine remembers that there's another part of the alteration so it will backtrack and will try with other alteration 'u' (right side of the pipe). This time it will have a match.*/ Pattern.compile("p(ut|u)t")
.matcher("put")
.find();//matches: 'put' at 0-3 //'put' /* Now declare the group as atomic by putting '?>' at the start of the group. Here we don't have a match. That's because, the engine doesn't backtrack for atomic groups. The first 'p' of input string matches the literal part 'p' of the expression. Then 'ut' of the input string matches with alteration part 'ut'. Now nothing left for the last literal 't' of the expression and the the engine won't backtrack (because the group is atomic) to try other alteration. This situation is only valid if engine has a match with current iteration but doesn't have overall match. In other words, The atomic group does not backtrack after the first match of alteration even though there might be an overall match*/ Pattern.compile("p(?>ut|u)t")
.matcher("put")
.find();//no matches /* . In this example the left alteration doesn't match because 'ut' of the input string doesn't satisfy the expression alteration part 'it' so it will backtrack, even though the group is atomic.*/ Pattern.compile("p(?>it|u)t")
.matcher("put")
.find();//matches: 'put' at 0-3 //'put' /* Let's rewrite our very first example, this time making the group atomic. It matches cause we still have one remaining 't' in the input string to match the last literal 't' of the expression*/ Pattern.compile("p(?>ut|u)t")
.matcher("putt")
.find();//matches: 'putt' at 0-4 //'putt' /* Atomic groups can have nested capturing group.*/ Pattern.compile("p(?>(ut|u))\\1")
.matcher("putut")
.find();//matches: 'putut' at 0-5 //'putut' Pattern.compile("p(?>(ut|u))\\1")
.matcher("puu")
.find();//matches: 'puu' at 0-3 //'puu'

As we saw in above examples how we can use the atomic groups. They are similar to Possessive Quantifier as both can be used to avoid further permutations, after failing at a certain point. We use them in the situations where remaining permutations are not likely to be successful either. The primary purpose of using them is to get some performance improvements.

Let's explore some more examples. Assume we want to scan our html pages and want to remove all invalid image tags. They are invalid because they don't have any attributes. So we are simply looking for <img>. Some pages also have old style image tags as well i.e. <image>. For simplicity and for not distracting from our main topic we are not going to focus on corresponding closing tags.

/* Let's start without declaring the group as atomic.*/
Pattern.compile("<(image|img)>")
.matcher("<image>")
.find();//matches: '<image>' at 0-7 //'<image>' Pattern.compile("<(image|img)>")
.matcher("<img>")
.find();//matches: '<img>' at 0-5 //'<img>' /* Now declare the group as atomic. The input '<img>' still matches. Because first alteration part 'image' doesn't match at all. Here we have to understand clearly that we are still able to use other alteration 'img' because the current alteration 'image' doesn't match.*/ Pattern.compile("<(?>image|img)>")
.matcher("<img>")
.find();//matches: '<img>' at 0-5 //'<img>' /* In this example the input string has a tag without closing '>' (it might be a valid html construct as it might have some attributes after that point, but we are not interested in that in our example). We are not declaring the group as atomic this time.*/ Pattern.compile("<(image|img)>")
.matcher("<image")
.find();//no matches /* In above example the overall match failed but the engine had a successful match of 'image' for the first alteration but it couldn't match last '>', so it had to backtrack and to try other alteration 'img'. Isn't that wasteful backtracking for other alteration? Given that first alteration matches and overall match is not successful then the current position of the input string is not satisfying the outside condition (in this example the closing tag '>').Let's make it atomic to avoid useless backtracking and improve some performance.*/ Pattern.compile("<(?>image|img)>")
.matcher("<image")
.find();//no matches

Example Project


Dependencies and Technologies Used :

  • JDK 1.8
  • Maven 3.0.4

Regex Atomic Groups Select All Download
  • atomic-groups
    • src
      • main
        • java
          • com
            • logicbig
              • example

See Also