Java Regex - Reluctant Quantifiers

[Updated: Jun 13, 2017, Created: Jan 12, 2016]

As mentioned in the last tutorial, we can modify the behavior of default greedy quantifiers (+, *, ? and { }) by appending another meta-character at the end. By doing so we are effectively turning the default behaviour into one of the two types of behaviors, which are termed as Reluctant and Possessive quantifiers. In this tutorial we are going to explore Reluctant quantifiers.

As opposed to Greedy quantifiers (which prefers to find the longest possible match over the shorter ones) , Reluctant Quantifier prefers shortest matches over a long match. That means they start out matching smaller parts, in case if smaller parts fail then they try longer ones. The last thing they try is the entire input string.

We can change default greedy quantifier into reluctant by appending ? at the end.

Let's explore details along with examples

Regex Construct/Terms Examples
The Plus Reluctant Quantifier:

It matches one or more times Reluctantly. It is done so by matching the smaller part of the input string instead of matching the entire or bigger parts first.

For example, consider the pattern "x+?" and the input string "xx xxx". The engine will start out matching smaller parts. So the match will be five "x"s instead of two "xx" and "xxx"

As this quantifier only allows matching the strings of lengths at least once, we will never see zero-length matches with them.

As opposed to greedy quantifiers, the possessive quantifiers don't have any notion of backing off, as they don't eat up the biggest possible part in the beginning of the process. Instead they start from the right side and match smaller portion of the input string first.

.find();//matches: 'a' at 0-1, 'b' at 1-2, 'c' at 2-3 //'abc' Pattern.compile("a+?")
.find();//matches: 'a' at 1-2, 'a' at 3-4, 'a' at 4-5 //'babaab' /* Comparing with Greedy +*/ Pattern.compile("a+")
.find();//matches: 'a' at 1-2, 'aa' at 3-5 //'babaab' Pattern.compile("(xyz)+?")
.find();//matches: 'xyz' at 0-3, 'xyz' at 3-6 //'xyzxyz' Pattern.compile("((abc)|(xyz))+?")
.matcher("abcxyz xyzabc")
.find();//matches: 'abc' at 0-3, 'xyz' at 3-6, 'xyz' at 7-10,
//'abc' at 10-13 //'abcxyz xyzabc' /* Comparing with Greedy +*/ Pattern.compile("((abc)|(xyz))+")
.matcher("abcxyz xyzabc")
.find();//matches: 'abcxyz' at 0-6, 'xyzabc' at 7-13 //'abcxyz xyzabc' Pattern.compile("[0-9]+?")
.matcher("9823 340")
.find();//matches: '9' at 0-1, '8' at 1-2, '2' at 2-3, '3' at 3-4,
//'3' at 5-6, '4' at 6-7, '0' at 7-8 //'9823 340' /* Comparing with Greedy +*/ Pattern.compile("[0-9]+")
.matcher("9823 340")
.find();//matches: '9823' at 0-4, '340' at 5-8 //'9823 340' Pattern.compile("d.+?o")
.find();//matches: 'doo' at 0-3, 'dodo' at 3-7 //'doodododo'
The Asterisk Reluctant Quantifier:

It matches zero or more times reluctantly.

It works same as +? except that they supports zero-length matches too. Please visit the last page for details on zero-length matches.

There's one important point to understand:
Consider the pattern, "x*?" and input string "xxx". We can see "xxx" matches the pattern but the engines has to decide on the length of the "xxx". Since the pattern is of reluctant type ("?" at the end), the engine will try to match smallest possible match. For the smallest possible match it will look at the regex pattern again to find out what is the smallest allowed by the quantifier. Here it will find "*" that allows to match "zero or more" so the final match per smallest criteria will be the zero-length match.

That means the pattern of type "X*?" or ".*?" will always return the zero-length matches no matter what the input string is. In practice, of course, we don't use patterns this way, there has to be criteria to avoid zero-length matches, e.g pattern of this kind X.*?E which won't give zero length matches.

Pattern.matches(".*?", ""); //true
.find();//matches: '' at 0-0 /* As Reluctant Quantifier find the shortest matches and for '*' shortest is zero, the next two examples gives all zero-length matches*/ Pattern.compile(".*?")
.find();//matches: '' at 0-0, '' at 1-1, '' at 2-2, '' at 3-3,
//'' at 4-4, '' at 5-5 Pattern.compile("x*?")
.find();//matches: '' at 0-0, '' at 1-1, '' at 2-2, '' at 3-3 Pattern.compile("x.*?")
.find();//matches: 'x' at 0-1, 'x' at 1-2, 'x' at 2-3 //'xxx' /* There's no difference between the following exmaple and the last example, as portion '.*?' in last example is interpreted as zero-lenght match */ Pattern.compile("x")
.find();//matches: 'x' at 0-1, 'x' at 1-2, 'x' at 2-3 //'xxx' Pattern.compile("x.*?e")
.matcher("xylose xylene xe ex")
.find();//matches: 'xylose' at 0-6, 'xyle' at 7-11, 'xe' at 14-16 //'xylose xylene xe ex' /* Here we replace * with + which doesn't allow zero-length*/ Pattern.compile("x.+?e")
.matcher("xylose xylene xe ex")
.find();//matches: 'xylose' at 0-6, 'xyle' at 7-11, 'xe e' at 14-18 //'xylose xylene xe ex' /* Making it greedy*/ Pattern.compile("x.+e")
.matcher("xylose xylene xe ex")
.find();//matches: 'xylose xylene xe e' at 0-18 //'xylose xylene xe ex' Pattern.compile("\\b\\w*?bb\\w*?\\b")
.matcher("bobby blub jibb lobby abba bb")
.find();//matches: 'bobby' at 0-5, 'jibb' at 11-15, 'lobby' at 16-21,
//'abba' at 22-26, 'bb' at 27-29 //'bobby blub jibb lobby abba bb' /* Matching all html tags*/ Pattern.compile("<(.|\n)*?>")
.matcher("<html><body>the page</body></html>")
.find();//matches: '<html>' at 0-6, '<body>' at 6-12,
//'</body>' at 20-27, '</html>' at 27-34 //'<html><body>the page</body></html>' /* Making it greedy, see the difference*/ Pattern.compile("<(.|\n)*>")
.matcher("<html><body>the page</body></html>")
.find();//matches: '<html><body>the page</body></html>' at 0-34 //'<html><body>the page</body></html>'
The Question Mark Reluctant Quantifier:

It matches zero or one, reluctantly.

This quantifier doesn't allow more than one length match. Since it also allows zero length, we will see zero length matches. It is also reluctant so matches of kind X?? will always give zero-length results, just like X*?

.find();//matches: '' at 0-0 Pattern.compile("x.??")
.find();//matches: 'x' at 0-1, 'x' at 1-2, 'x' at 2-3 //'xxx' /* silo is not a match cause the quantifier doesn't allow length more than one.*/ Pattern.compile("s.??o")
.matcher("so solo silo")
.find();//matches: 'so' at 0-2, 'so' at 3-5 //'so solo silo' /* Replacing ? with + still keeping it reluctant*/ Pattern.compile("s.+?o")
.matcher("so solo silo")
.find();//matches: 'so so' at 0-5, 'silo' at 8-12 //'so solo silo' /* This will also match 'My electric guitar'*/ Pattern.compile("My\\s+(electric)??\\s*guitar")
.matcher("My guitar")
.find();//matches: 'My guitar' at 0-9 //'My guitar'
The Curly Brackets Reluctant Quantifier: Using this quantifier, we define min and/or max number of matches. Here are details on the related constructs. Being reluctant, it only matches the smallest part of the provided range.
/* As it's reluctant, it only matches smallest part, so 'de' is not a match*/
.find();//matches: 'abc' at 0-3 //'abcde' /* Making it Greedy*/ Pattern.compile(".{3,}")
.find();//matches: 'abcde' at 0-5 //'abcde' Pattern.compile(".{4,}?")
.find();//matches: 'abcd' at 0-4, 'efgh' at 4-8 //'abcdefghij' Pattern.compile("(\\w{3,}?\\.){2}?\\w{3,}?")
.find();//matches: '' at 0-15 //'' /* This matches sentences that contain between one to ten words.*/ Pattern.compile("[A-Z](\\w*?\\s*?){1,10}?[.?!]")
.matcher("This is a paragraph. That's it.")
.find();//matches: 'This is a paragraph.' at 0-20 //'This is a paragraph. That's it.'

Example Project

Dependencies and Technologies Used :

  • JDK 1.8
  • Maven 3.0.4

Regex Reluctant Quantifier Select All Download
  • reluctant-quantifiers
    • src
      • main
        • java
          • com
            • logicbig
              • example

See Also