Java Regex - Modifier Flags

[Last Updated: Dec 4, 2017]

Java provides a set of flags that override certain default behaviors. These flags affect the way the Java regex engine matches a pattern.

We can set these flags when constructing a java.util.regex.Pattern by using the overloaded static method compile(String regex, int flags). The flags parameter is a bit mask that may combine any of the flag constants (public static int fields) provided by the Pattern class.

We can also embed these flags in the expression itself, as we will see in the examples below.
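The snippets in this article call find() once and list every match in a trailing comment. As a minimal sketch of how such output could be produced, and of how several flags combine into one bit mask, consider the following. The class name FlagBasics and the helper findAll are hypothetical and are not part of the article's example project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagBasics {

    // Hypothetical helper: collects every match as "'text' at start-end".
    static List<String> findAll(Pattern pattern, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            matches.add("'" + matcher.group() + "' at "
                    + matcher.start() + "-" + matcher.end());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Flags are int constants, so several can be OR-ed into one bit mask.
        Pattern viaConstants = Pattern.compile("^[a-z]+$",
                Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        // The same two flags embedded in the expression itself.
        Pattern viaEmbedded = Pattern.compile("(?im)^[a-z]+$");

        String input = "Stew\nPasta";
        System.out.println(findAll(viaConstants, input));
        System.out.println(findAll(viaEmbedded, input));

        // flags() returns the bit mask passed to compile().
        boolean multiline = (viaConstants.flags() & Pattern.MULTILINE) != 0;
        System.out.println("MULTILINE set: " + multiline);
    }
}
```

Both patterns should report 'Stew' at 0-4 and 'Pasta' at 5-10, since the constant form and the embedded form enable the same two modes.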

Case Insensitive Mode

Pattern class constant	Pattern#CASE_INSENSITIVE
Embedded flag expression	(?i)

Examples

/* The default case sensitive matching*/
Pattern.compile("\\b[a-z]+\\b")
       .matcher("Stew Pasta Twinkies")
       .find();//no matches

/* Enable case-insensitive matching*/
Pattern.compile("\\b[a-z]+\\b", Pattern.CASE_INSENSITIVE)
       .matcher("Stew Pasta Twinkies")
       .find();//matches:  'Stew' at 0-4, 'Pasta' at 5-10, 'Twinkies' at 11-19
               //'Stew Pasta Twinkies'

/* Using embedded flag.*/
Pattern.compile("(?i)\\b[a-z]+\\b")
       .matcher("Stew Pasta Twinkies")
       .find();//matches:  'Stew' at 0-4, 'Pasta' at 5-10, 'Twinkies' at 11-19
               //'Stew Pasta Twinkies'

/* An embedded flag can be placed anywhere in the expression; it applies from that point onward.*/
Pattern.compile("[0-9]*(?i)\\b[a-z]+\\b")
       .matcher("Stew Pasta Twinkies")
       .find();//matches:  'Stew' at 0-4, 'Pasta' at 5-10, 'Twinkies' at 11-19
               //'Stew Pasta Twinkies'

/* Turning off the flag with (?-i). Any embedded flag can be turned off this way.*/
Pattern.compile("(?-i)\\b[a-z]+\\b")
       .matcher("Stew Pasta Twinkies")
       .find();//no matches

/* Turning on/off the flags in the middle*/
Pattern.compile("(?i)\\b[a-z]+\\b(?-i)[-][A-Z]")
       .matcher("Stew-A Pasta-B Twinkies-t")
       .find();//matches:  'Stew-A' at 0-6, 'Pasta-B' at 7-14
               //'Stew-A Pasta-B Twinkies-t'
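The on/off toggling above can be tried with a small runnable sketch. The class CaseFlagScope and its findAll helper are hypothetical names introduced only for illustration. The second pattern uses the non-capturing group form (?i:X), which the Pattern documentation lists as an alternative way to limit a flag to part of an expression; it is not used elsewhere in this article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaseFlagScope {

    // Hypothetical helper: returns the text of every match, in order.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "Stew-A Pasta-B Twinkies-t";

        // (?i) switches the flag on and (?-i) switches it off again, so the
        // trailing [A-Z] is case-sensitive and rejects the final "-t".
        System.out.println(findAll("(?i)\\b[a-z]+\\b(?-i)-[A-Z]", input));

        // The group form (?i:...) limits the flag to the group's contents.
        System.out.println(findAll("(?i:\\b[a-z]+\\b)-[A-Z]", input));
    }
}
```

Both calls should print [Stew-A, Pasta-B].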

Multi-line Mode

Pattern class constant	Pattern#MULTILINE
Embedded flag expression	(?m)

When this mode is enabled, ^ and $ match at the start and end of each line. By default they match only at the start and end of the entire input sequence.

Examples

/* Default behavior*/
Pattern.compile("^T.*e")
       .matcher("The First line\nThe SecondLine")
       .find();//matches:  'The First line' at 0-14
               //'The First line\nThe SecondLine'

/* Adding $ at the end of the regex. Without MULTILINE, $ does not match before the \n in the middle of the input.*/
Pattern.compile("^T.*e$")
       .matcher("The First line\nThe SecondLine")
       .find();//no matches

/* Enable multiple line mode.*/
Pattern.compile("^T.*e$", Pattern.MULTILINE)
       .matcher("The First line\nThe SecondLine")
       .find();//matches:  'The First line' at 0-14, 'The SecondLine' at 15-29
               //'The First line\nThe SecondLine'

/* Using embedded flag.*/
Pattern.compile("(?m)^T.*e$")
       .matcher("The First line\nThe SecondLine")
       .find();//matches:  'The First line' at 0-14, 'The SecondLine' at 15-29
               //'The First line\nThe SecondLine'

/* Using \r*/
Pattern.compile("(?m)^T.*e$")
       .matcher("The First line\rThe SecondLine")
       .find();//matches:  'The First line' at 0-14, 'The SecondLine' at 15-29
               //'The First line\rThe SecondLine'

/* Using \r\n together*/
Pattern.compile("(?m)^T.*e$")
       .matcher("The First line\r\nThe SecondLine")
       .find();//matches:  'The First line' at 0-14, 'The SecondLine' at 16-30
               //'The First line\r\nThe SecondLine'
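A common use of multi-line mode is pulling selected lines out of a larger text. The following is a minimal sketch under that assumption; the class MultilineLogScan, the method errorMessages and the log format are all hypothetical and serve only to illustrate the flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineLogScan {

    // With MULTILINE, ^ and $ anchor to each line rather than the whole input.
    private static final Pattern ERROR_LINE =
            Pattern.compile("^ERROR (.*)$", Pattern.MULTILINE);

    // Hypothetical helper: returns the message of every line that starts with "ERROR ".
    static List<String> errorMessages(String log) {
        List<String> messages = new ArrayList<>();
        Matcher matcher = ERROR_LINE.matcher(log);
        while (matcher.find()) {
            messages.add(matcher.group(1));
        }
        return messages;
    }

    public static void main(String[] args) {
        // Lines are separated by a mix of \r\n and \n.
        String log = "INFO start\r\nERROR disk full\nINFO done\nERROR timeout";
        System.out.println(errorMessages(log)); // [disk full, timeout]
    }
}
```

Because the dot does not match line terminators by default, group 1 stops at the end of each line and never includes the \r or \n.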

Dot-All (Single Line) Mode

Pattern class constant	Pattern#DOTALL
Embedded flag expression	(?s)

When this mode is enabled, the dot (.) matches any character, including line terminators such as \n and \r. By default, line terminators are the only characters the dot doesn't match.

Examples

/* We want to match the given input string as a single line, but we cannot unless we enable DOTALL mode.*/
Pattern.compile("The.*sentence")
       .matcher("The is \n one sentence")
       .find();//no matches

Pattern.compile("The.*sentence", Pattern.DOTALL)
       .matcher("The is \n one sentence")
       .find();//matches:  'The is \n one sentence' at 0-21
               //'The is \n one sentence'

/* Pattern.MULTILINE does an entirely different thing: it makes ^ and $ match just after and just before each line terminator. Pattern.DOTALL, on the other hand, lets the dot (.) match line terminators like any other character. Enabling MULTILINE together with DOTALL doesn't affect the DOTALL result here, because this pattern doesn't use ^ or $.*/
Pattern.compile("The.*sentence", Pattern.MULTILINE | Pattern.DOTALL)
       .matcher("The is \n one sentence")
       .find();//matches:  'The is \n one sentence' at 0-21
               //'The is \n one sentence'

/* But using Pattern.MULTILINE with Pattern.DOTALL doesn't give the desired per-line matches for patterns that rely on '^' and '$'. That's because with DOTALL the greedy .* also consumes the line terminator, so a single match runs from the first ^ to the last 'e' that is followed by $.*/
Pattern.compile("^T.*e$", Pattern.MULTILINE | Pattern.DOTALL)
       .matcher("The First line\nThe SecondLine")
       .find();//matches:  'The First line\nThe SecondLine' at 0-29
               //'The First line\nThe SecondLine'

/* Removing DOTALL gives us the per-line result in the above case.*/
Pattern.compile("^T.*e$", Pattern.MULTILINE)
       .matcher("The First line\nThe SecondLine")
       .find();//matches:  'The First line' at 0-14, 'The SecondLine' at 15-29
               //'The First line\nThe SecondLine'

/* Using embedded flag for DOTALL.*/
Pattern.compile("(?s)The.*sentence")
       .matcher("The is a\n one sentence")
       .find();//matches:  'The is a\n one sentence' at 0-22
               //'The is a\n one sentence'
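The interaction between DOTALL and MULTILINE depends on the quantifier as well as on the flags. The sketch below, which assumes a hypothetical helper named ranges, compares the greedy .* with the reluctant .*? while both flags are enabled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllWithMultiline {

    // Hypothetical helper: returns each match as "start-end".
    static List<String> ranges(String regex, int flags, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex, flags).matcher(input);
        while (matcher.find()) {
            result.add(matcher.start() + "-" + matcher.end());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "The First line\nThe SecondLine";
        int both = Pattern.MULTILINE | Pattern.DOTALL;

        // Greedy .* runs across the \n and backtracks to the last "e" before $.
        System.out.println(ranges("^T.*e$", both, input));   // [0-29]

        // Reluctant .*? stops at the first "e" that is followed by $.
        System.out.println(ranges("^T.*?e$", both, input));  // [0-14, 15-29]
    }
}
```

So when both flags are genuinely needed, a reluctant quantifier is one way to get per-line matches back.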

Comments and White-Spaces Mode

Pattern class constant	Pattern#COMMENTS
Embedded flag expression	(?x)

This mode permits whitespace and comments in the pattern. Whitespace is ignored during matching, as if it were not there, and embedded comments starting with # are ignored until the end of the line.

Examples

/* By default we can use spaces as literals.*/
Pattern.compile("\\d+ ft")
       .matcher("2 ft, 5 ft")
       .find();//matches:  '2 ft' at 0-4, '5 ft' at 6-10
               //'2 ft, 5 ft'

/* After enabling comments mode, the space between \\d+ and ft is ignored, so the pattern no longer matches the space in the input.*/
Pattern.compile("\\d+ ft", Pattern.COMMENTS)
       .matcher("2 ft, 5 ft")
       .find();//no matches

/* Now we have to use \\s for a white-space.*/
Pattern.compile("\\d+\\sft", Pattern.COMMENTS)
       .matcher("2 ft, 5 ft")
       .find();//matches:  '2 ft' at 0-4, '5 ft' at 6-10
               //'2 ft, 5 ft'

/* Extra spaces won't make a difference. There has to be one \\s between \\d+ and ft.*/
Pattern.compile("  \\d+     \\s     ft  ", Pattern.COMMENTS)
       .matcher("2 ft, 5 ft")
       .find();//matches:  '2 ft' at 0-4, '5 ft' at 6-10
               //'2 ft, 5 ft'

/* This expression represents a product code*/
Pattern.compile("\\d{2,5}[a-z]")
       .matcher("22a 33c")
       .find();//matches:  '22a' at 0-3, '33c' at 4-7
               //'22a 33c'

/* Assume the expression has other parts as well. To make it more readable, we want to add a comment starting with #. But that doesn't work: by default, comments starting with # are not supported.*/
Pattern.compile("\\d{2,5}[a-z]#product code")
       .matcher("22a 33c")
       .find();//no matches

/* By default, the regex engine sees the comment as part of the regex. That's why we get a match here (this example is included only to make that clear).*/
Pattern.compile("\\d{2,5}[a-z]#product code")
       .matcher("22a 33c#product code")
       .find();//matches:  '33c#product code' at 4-20
               //'22a 33c#product code'

/* To fix the problem in the second-to-last example, we have to enable comments mode.*/
Pattern.compile("\\d{2,5}[a-z]#product code", Pattern.COMMENTS)
       .matcher("22a 33c")
       .find();//matches:  '22a' at 0-3, '33c' at 4-7
               //'22a 33c'

/* With multiple comments, each comment has to end with a line terminator (\n or \r). The last one doesn't need a line terminator.*/
Pattern.compile("\\d{2,5}[a-z]#product code\n\\s\\d{4,4}#customer id", Pattern.COMMENTS)
       .matcher("22a 3434 33c 6767")
       .find();//matches:  '22a 3434' at 0-8, '33c 6767' at 9-17
               //'22a 3434 33c 6767'

/* Using embedded flag.*/
Pattern.compile("(?x)\\d{2,5}[a-z]#product code\n\\s\\d{4,4}#customer id")
       .matcher("22a 3434 33c 6767")
       .find();//matches:  '22a 3434' at 0-8, '33c 6767' at 9-17
               //'22a 3434 33c 6767'
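Comments mode pays off most when the pattern is written across several source lines. The following sketch builds on the product code and customer id example above; the class CommentedPattern, the method parse and the capturing groups are additions made for illustration, not part of the original snippets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentedPattern {

    // Each source line ends with \n so that the # comment stops there.
    private static final Pattern ORDER = Pattern.compile(
              "(\\d{2,5}[a-z])   # group 1: product code\n"
            + "\\s               # exactly one whitespace character\n"
            + "(\\d{4})          # group 2: customer id",
            Pattern.COMMENTS);

    // Hypothetical helper: returns "product=..., customer=..." for each match.
    static List<String> parse(String input) {
        List<String> orders = new ArrayList<>();
        Matcher matcher = ORDER.matcher(input);
        while (matcher.find()) {
            orders.add("product=" + matcher.group(1)
                    + ", customer=" + matcher.group(2));
        }
        return orders;
    }

    public static void main(String[] args) {
        System.out.println(parse("22a 3434 33c 6767"));
        // [product=22a, customer=3434, product=33c, customer=6767]
    }
}
```

The padding spaces before each # are ignored, so they can be used freely to align the comments.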

Unicode-Aware Case Folding

Pattern class constant	Pattern#UNICODE_CASE
Embedded flag expression	(?u)

This mode enables Unicode-aware case folding. It works together with Pattern#CASE_INSENSITIVE to perform case-insensitive matching in a manner consistent with the Unicode Standard. Further information about Unicode case folding can be found here. By default, case-insensitive matching assumes that only US-ASCII characters are being matched, so for a case-insensitive match on non-ASCII characters we have to enable both Pattern#CASE_INSENSITIVE and Pattern#UNICODE_CASE.

Examples

/* Our regex contains the Latin-1 Supplement letter \\u00de (capital thorn, 'Þ') and our input string contains the corresponding lowercase letter \\u00fe ('þ'). They don't match because we haven't enabled case-insensitive mode yet.*/
Pattern.compile("\u00de")
       .matcher("\u00fe")
       .find();//no matches

/* Let's enable case-insensitive mode. We still don't have a match, because we haven't enabled Unicode case mode.*/
Pattern.compile("\u00de", Pattern.CASE_INSENSITIVE)
       .matcher("\u00fe")
       .find();//no matches

/* Let's enable unicode case mode too. We have a match now.*/
Pattern.compile("\u00de", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
       .matcher("\u00fe")
       .find();//matches:  'þ' at 0-1
               //'þ'

/* Using embedded flags, 'i' for case insensitive and 'u' for unicode case.*/
Pattern.compile("(?iu)\u00de")
       .matcher("\u00fe")
       .find();//matches:  'þ' at 0-1
               //'þ'
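The difference between ASCII-only and Unicode-aware case folding can be checked side by side. The sketch below assumes a hypothetical helper matchesIgnoringCase with a boolean switch for Pattern.UNICODE_CASE, and uses 'é' (\u00e9) and 'É' (\u00c9) as sample characters.

```java
import java.util.regex.Pattern;

class UnicodeCaseDemo {

    // Hypothetical helper: full-input match with CASE_INSENSITIVE,
    // optionally combined with UNICODE_CASE.
    static boolean matchesIgnoringCase(String regex, String input, boolean unicodeAware) {
        int flags = Pattern.CASE_INSENSITIVE;
        if (unicodeAware) {
            flags |= Pattern.UNICODE_CASE;
        }
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // ASCII letters fold without UNICODE_CASE.
        System.out.println(matchesIgnoringCase("stew", "STEW", false));     // true
        // Non-ASCII letters do not.
        System.out.println(matchesIgnoringCase("\u00e9", "\u00c9", false)); // false
        // With UNICODE_CASE they do.
        System.out.println(matchesIgnoringCase("\u00e9", "\u00c9", true));  // true
    }
}
```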

Literal Parsing

Pattern class constant	Pattern#LITERAL
Embedded flag expression	None

When this flag is specified, metacharacters and other constructs in the regex lose their special meaning and are treated as literal characters. That means the regex '.+' will not match one or more characters but will match exactly '.+' in the input string.

The flags CASE_INSENSITIVE and UNICODE_CASE retain their impact on matching when used along with this flag. The other flags have no effect.

Using this flag has the same effect as passing the regex through Pattern#quote.

There is no embedded flag character for enabling literal parsing.

Examples

/* Not using literal flag yet.*/
Pattern.compile("[a-z]+")
       .matcher("test")
       .find();//matches:  'test' at 0-4
               //'test'

/* Using literal flag now. The regex engine doesn't match input string based on normal metacharacter matching criteria*/
Pattern.compile("[a-z]+", Pattern.LITERAL)
       .matcher("test")
       .find();//no matches

/* This matches.*/
Pattern.compile("[a-z]+", Pattern.LITERAL)
       .matcher("[a-z]+")
       .find();//matches:  '[a-z]+' at 0-6
               //'[a-z]+'

/* Using Pattern#quote has the same effect.*/
Pattern.compile(Pattern.quote("[a-z]+"))
       .matcher("[a-z]+")
       .find();//matches:  '[a-z]+' at 0-6
               //'[a-z]+'

/* The CASE_INSENSITIVE flag still has its effect. Note that we have changed the case of 'z' in the input string but it still matches.*/
Pattern.compile("[a-z]+", Pattern.CASE_INSENSITIVE | Pattern.LITERAL)
       .matcher("[a-Z]+")
       .find();//matches:  '[a-Z]+' at 0-6
               //'[a-Z]+'

/* Embedding (?i) in the regex doesn't work here, because with Pattern.LITERAL the whole regex, including (?i), is treated as literal text, just as Pattern#quote would treat it.*/
Pattern.compile("(?i)[a-z]+", Pattern.LITERAL)
       .matcher("[a-Z]+")
       .find();//no matches

/* Just to confirm the above comment, we are adding (?i) to the input string and using a lowercase 'z'.*/
Pattern.compile("(?i)[a-z]+", Pattern.LITERAL)
       .matcher("(?i)[a-z]+")
       .find();//matches:  '(?i)[a-z]+' at 0-10
               //'(?i)[a-z]+'
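A typical reason to use Pattern.LITERAL is searching for text that comes from a user and may contain metacharacters. The sketch below assumes a hypothetical helper containsIgnoreCase; combining LITERAL with CASE_INSENSITIVE relies on the rule stated above that CASE_INSENSITIVE keeps its effect.

```java
import java.util.regex.Pattern;

class LiteralSearch {

    // Hypothetical helper: case-insensitive substring search in which the
    // needle is taken literally, whatever metacharacters it contains.
    static boolean containsIgnoreCase(String text, String needle) {
        return Pattern.compile(needle, Pattern.LITERAL | Pattern.CASE_INSENSITIVE)
                .matcher(text)
                .find();
    }

    public static void main(String[] args) {
        String text = "Total: 1+1=2 (approx.)";

        // As a regex, "1+1" means "one or more 1s followed by a 1".
        System.out.println(Pattern.compile("1+1").matcher(text).find()); // false

        // As a literal, it finds the characters 1, + and 1.
        System.out.println(containsIgnoreCase(text, "1+1"));             // true

        // Parentheses and the dot are literal too; letter case is ignored.
        System.out.println(containsIgnoreCase(text, "(APPROX.)"));       // true
    }
}
```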

Unix Lines

Pattern class constant	Pattern#UNIX_LINES
Embedded flag expression	(?d)

Enables Unix lines mode. In this mode, only '\n' is recognized as a line terminator in the behavior of ., ^ and $. '\r' is treated as an ordinary character and can be matched by the dot.

Examples

/* Input string with line terminators \r and \n without any flags.*/
Pattern.compile("^T.*e")
       .matcher("The First line\rThe Second line\nThe third Line")
       .find();//matches:  'The First line' at 0-14
               //'The First line\rThe Second line\nThe third Line'

Pattern.compile("^T.*e$")
       .matcher("The First line\rThe Second line\nThe third Line")
       .find();//no matches

/* Enabling multiline flag*/
Pattern.compile("^T.*e$", Pattern.MULTILINE)
       .matcher("The First line\rThe Second line\nThe third Line")
       .find();//matches:  'The First line' at 0-14, 'The Second line' at 15-30, 
               //'The third Line' at 31-45
               //'The First line\rThe Second line\nThe third Line'

/* Enabling unix lines flag without $. Notice this mode treats \r as a literal matched by the dot.*/
Pattern.compile("^T.*e", Pattern.UNIX_LINES)
       .matcher("The First line\rThe Second line\nThe third Line")
       .find();//matches:  'The First line\rThe Second line' at 0-30
               //'The First line\rThe Second line\nThe third Line'

/* Now using $ as well. Notice that this mode alone doesn't make ^ and $ match at each line (just like the default mode); for that we have to enable multiline mode too.*/
Pattern.compile("^T.*e$", Pattern.UNIX_LINES)
       .matcher("The First line\rThe Second line\nThe third Line")
       .find();//no matches

/* Enabling multiline flag along with unix lines flag.*/
Pattern.compile("^T.*e$", Pattern.MULTILINE | Pattern.UNIX_LINES)
       .matcher("The First line\rThe Second line\nThe third Line")
       .find();//matches:  'The First line\rThe Second line' at 0-30, 
               //'The third Line' at 31-45
               //'The First line\rThe Second line\nThe third Line'

/* Using the embedded flag (?d) instead of Pattern.UNIX_LINES.*/
Pattern.compile("(?d)^T.*e$", Pattern.MULTILINE)
       .matcher("The First line\rThe Second line\nThe third Line")
       .find();//matches:  'The First line\rThe Second line' at 0-30, 
               //'The third Line' at 31-45
               //'The First line\rThe Second line\nThe third Line'
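The effect of Unix lines mode on ^ can be seen by collecting the first word of every line. This is a minimal sketch; the class UnixLinesDemo and the helper firstWords are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnixLinesDemo {

    // Hypothetical helper: returns the word found at the start of each line,
    // where "line" is defined by the given flags.
    static List<String> firstWords(String input, int flags) {
        List<String> words = new ArrayList<>();
        Matcher matcher = Pattern.compile("^\\w+", flags).matcher(input);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String input = "one\rtwo\nthree";

        // Default line terminators: both \r and \n start a new line.
        System.out.println(firstWords(input, Pattern.MULTILINE));
        // [one, two, three]

        // Unix lines: only \n starts a new line, so "two" is mid-line.
        System.out.println(firstWords(input, Pattern.MULTILINE | Pattern.UNIX_LINES));
        // [one, three]
    }
}
```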

Unicode Canonical Equivalence

Pattern class constant	Pattern#CANON_EQ
Embedded flag expression	None

When this flag is specified then two characters will be considered to match if, and only if, their full canonical decompositions match.

We should use Pattern.CANON_EQ to ignore differences between canonically equivalent Unicode representations, unless we are sure our strings contain only US-ASCII characters. Note that specifying this flag may impose a performance penalty.

This matters when a sequence of two characters is equivalent to a single precomposed character. For example, \u006E (the Latin lowercase "n") followed by \u0303 (the combining tilde) is defined by Unicode to be canonically equivalent to the single character \u00F1 ('ñ', a lowercase letter of the Spanish alphabet).

More information about canonical equivalence here.

There is no embedded flag character for this mode.

Examples

/* Without canonical equivalence mode.*/
Pattern.compile("n\u0303")
       .matcher("\u00f1")
       .find();//no matches

/* With canonical equivalence mode.*/
Pattern.compile("n\u0303", Pattern.CANON_EQ)
       .matcher("\u00f1")
       .find();//matches:  'ñ' at 0-1
               //'ñ'

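/* The reverse also matches: a precomposed character in the regex and the decomposed sequence in the input.*/
Pattern.compile("\u00f1", Pattern.CANON_EQ)
       .matcher("n\u0303")
       .find();//matches:  'n\u0303' at 0-2
               //'n\u0303'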
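Canonical equivalence can also be checked outside the regex engine with java.text.Normalizer, which makes it easier to see why the two forms above are treated as the same text. The sketch below is an illustration only; the class CanonEqDemo and its two helpers are hypothetical.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class CanonEqDemo {

    // Hypothetical helper: true if both strings are equal after NFC normalization.
    static boolean sameAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    // Hypothetical helper: find() with canonical equivalence enabled.
    static boolean findsWithCanonEq(String regex, String input) {
        return Pattern.compile(regex, Pattern.CANON_EQ).matcher(input).find();
    }

    public static void main(String[] args) {
        String decomposed = "n\u0303";  // 'n' followed by a combining tilde
        String precomposed = "\u00f1";  // single character

        // The strings differ char by char...
        System.out.println(decomposed.equals(precomposed));            // false
        // ...but are equal once both are normalized.
        System.out.println(sameAfterNfc(decomposed, precomposed));     // true

        // CANON_EQ lets the regex engine treat them as equivalent, both ways.
        System.out.println(findsWithCanonEq(decomposed, precomposed)); // true
        System.out.println(findsWithCanonEq(precomposed, decomposed)); // true
    }
}
```

Normalizing both the pattern text and the input up front is a possible alternative when the performance cost of CANON_EQ is a concern, though it changes the reported match offsets.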

Example Project

Dependencies and Technologies Used:

JDK 1.8
Maven 3.0.4

Regex Modifier Flags

regex-modifier-flags
- src
  - main
    - java
      - com
        - logicbig
          - example
            - RegexModifierFlags.java
            - RegexUtil.java
- pom.xml
