Java Regex - Backreferences

[Updated: Jan 23, 2016, Created: Jan 15, 2016]

A backreference is a way to repeat the text matched by a capturing group. Unlike a reference to a captured group inside a replacement string, a backreference is used inside the regular expression itself, by writing the group number preceded by a single backslash, for example ([A-Za-z])[0-9]\1. Here the group ([A-Za-z]) is back-referenced as \1 (written as \\1 in a Java string literal). This is not the same as writing [A-Za-z][0-9][A-Za-z]: that expression reapplies the same pattern at the end, whereas ([A-Za-z])[0-9]\1 requires the same matched substring that group 1 captured at runtime.
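The difference described above can be checked with a minimal runnable sketch. The class name BackreferenceVsRepeatedPattern and the helper firstMatch are illustrative assumptions, not part of the original example project.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackreferenceVsRepeatedPattern {

    // Hypothetical helper: returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // \1 must repeat the exact text captured by group 1.
        System.out.println(firstMatch("([A-Za-z])[0-9]\\1", "a9a")); // a9a
        System.out.println(firstMatch("([A-Za-z])[0-9]\\1", "a9b")); // null

        // Repeating the character class only re-applies the pattern.
        System.out.println(firstMatch("[A-Za-z][0-9][A-Za-z]", "a9b")); // a9b

        // In a replacement string, a captured group is referenced as $1 instead.
        System.out.println("a9a".replaceAll("([A-Za-z])[0-9]\\1", "$1")); // a
    }
}
```

The last call shows the contrast mentioned in the text: \1 belongs inside the pattern, while $1 belongs inside the replacement string.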


Pattern.compile("([A-Za-z])[0-9]\\1")
.matcher("a9a")
.find();//matches: 'a9a' at 0-3
//'a9a'

/* Now let's repeat the same group at the end instead of backreferencing. We also changed the input string a little. It matches because the last 'b' in the input string satisfies the last expression [A-Za-z]. */
Pattern.compile("[A-Za-z][0-9][A-Za-z]")
.matcher("a9b")
.find();//matches: 'a9b' at 0-3
//'a9b'

/* It doesn't match because the substring captured by group 1 is not repeated for the last part of the expression, '\\1'. */
Pattern.compile("([A-Za-z])[0-9]\\1")
.find();//no matches

/* The capturing group must be defined before it is used as a backreference. That means there is no such thing as 'forward-referencing'. This example throws no exception, but the match fails. */
Pattern.compile("\\1[0-9]([A-Za-z])")
.find();//no matches

/* Find two or more consecutive identical word characters, using a backreference. */
Pattern.compile("(\\w)\\1+")
.find();//matches: 'SSS' at 7-10, 'YYY' at 17-20, 'GG' at 22-24,
/*Regex breakdown: (\\w)\\1+
(      Starting the first group. It will be numbered as 1.
 \\w   Exactly one word character.
)      Ending the first group.
\\1    Back-referencing group number 1.
+      One or more times. That means the back-referenced character must follow at least once and can repeat multiple times.

*/
/* This pattern looks for a group of one or more consecutive characters, followed by the same group of characters. */
Pattern.compile("([a-z]+)\\1")
.matcher("happiness www banana")
.find();//matches: 'pp' at 2-4, 'ss' at 7-9, 'ww' at 10-12, 'anan' at 15-19
//'happiness www banana'
/*Regex breakdown: ([a-z]+)\\1
(        Start the capturing group.
 [a-z]   Any character from a to z.
 +       Repeat the character class one or more times.
)        End the capturing group.
\\1      Back-referencing group number 1.

*/
/* This pattern looks for one or more consecutive characters, followed by a repeat of the last character of that sequence. Why the last character? Because the quantifier in '(....)+' is outside the group, the group itself is repeated: group 1 is captured again on every iteration and holds only one character each time, and the backreference \\1 refers to the most recent capture. When the backreference fails, the engine gives up the last iteration, so group 1 falls back to the previous character, and tries again. Because the quantifier is greedy, it tries the longest match first; notice the match 'www' rather than 'ww'. There we can imagine ([a-z])+ => ww, the last captured value being the second w, and \\1 => w. For 'happiness', ([a-z])+ => happines and \\1 => s. */
Pattern.compile("([a-z])+\\1")
.matcher("happiness www banana")
.find();//matches: 'happiness' at 0-9, 'www' at 10-13
//'happiness www banana'
/*Regex breakdown: ([a-z])+\\1
(        Start the capturing group.
 [a-z]   Any character from a to z.
)        End the capturing group.
 +       Repeat the capturing group one or more times.
\\1      Back-referencing group number 1.

*/
/* Changing the last example to make the quantifier reluctant by adding ?. Notice the difference this time: each match now ends at the first doubled letter. */
Pattern.compile("([a-z])+?\\1")
.matcher("happiness www banana")
.find();//matches: 'happ' at 0-4, 'iness' at 4-9, 'ww' at 10-12
//'happiness www banana'
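The effect of placing the quantifier inside or outside the group, and of making it reluctant, can be made visible by printing what group 1 holds for each match. The following is a minimal sketch; the class name RepeatedGroupBackreference and the helper matches are illustrative assumptions, and each result is formatted as match[group 1].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroupBackreference {

    // Hypothetical helper: collects each match in the form "match[group 1]".
    static List<String> matches(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group() + "[" + m.group(1) + "]");
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "happiness www banana";

        // Quantifier inside the group: group 1 holds the whole repeated run.
        System.out.println(matches("([a-z]+)\\1", input));
        // [pp[p], ss[s], ww[w], anan[an]]

        // Quantifier outside the group: group 1 holds only the last iteration.
        System.out.println(matches("([a-z])+\\1", input));
        // [happiness[s], www[w]]

        // Reluctant quantifier: each match ends at the first doubled letter.
        System.out.println(matches("([a-z])+?\\1", input));
        // [happ[p], iness[s], ww[w]]
    }
}
```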

Backreferencing Named Capturing Groups

Because a capturing group can be given a name, we can also backreference it by name. A group is named with the syntax (?<name>X) and back-referenced with the syntax \k<name> (written as \\k<name> in a Java string literal).
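A minimal runnable sketch of a named backreference follows. The group name letter, the class name NamedBackreference and the helper findAll are illustrative assumptions chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedBackreference {

    // Hypothetical helper: lists each match with its position and the text of the named group.
    static List<String> findAll(String input) {
        // "letter" is an illustrative group name; \k<letter> must repeat what it captured.
        Pattern pattern = Pattern.compile("(?<letter>[A-Za-z])[0-9]\\k<letter>");
        Matcher m = pattern.matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group() + " at " + m.start() + "-" + m.end()
                    + ", letter=" + m.group("letter"));
        }
        return found;
    }

    public static void main(String[] args) {
        findAll("a9a c0c d68").forEach(System.out::println);
        // a9a at 0-3, letter=a
        // c0c at 4-7, letter=c
    }
}
```

The captured text is read back with Matcher.group(String), which takes the same name used in the pattern.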


Pattern.compile("(?<myGroup>[A-Za-z])[0-9]\\k<myGroup>")
.matcher("a9a c0c d68")
.find();//matches: 'a9a' at 0-3, 'c0c' at 4-7
//'a9a c0c d68'
/*Regex breakdown: (?<myGroup>[A-Za-z])[0-9]\\k<myGroup>
(               Start the capturing group.
 ?<myGroup>     Naming this group 'myGroup'.
 [A-Za-z]       Any character from A to Z or a to z.
)               Closing the group.
[0-9]           Any digit from 0 to 9.
\\k<myGroup>    Backreferencing the group named 'myGroup'.

*/
/* We can still use the group number. */
Pattern.compile("(?<myGroup>[A-Za-z])[0-9]\\1")
.matcher("a9a c0c d68")
.find();//matches: 'a9a' at 0-3, 'c0c' at 4-7
//'a9a c0c d68'

Pattern.compile("(?<CHAR>(\\w){2,2})[\\S]*\\k<CHAR>")
.matcher("Cimicic galagala buffalo buffalo")
.find();//matches: 'icic' at 3-7, 'galaga' at 8-14
//'Cimicic galagala buffalo buffalo'
/*Regex breakdown: (?<CHAR>(\\w){2,2})[\\S]*\\k<CHAR>
(             Start the capturing group.
 ?<CHAR>      Naming this group 'CHAR'.
 (\\w){2,2}   Exactly two word characters.
)             Closing the group.
[\\S]*        Zero or more characters other than white space.
\\k<CHAR>     Backreferencing the group named 'CHAR'.
*/
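The last pattern nests an unnamed group inside the named one, so it may help to see what each group holds. The sketch below assumes the same pattern and input as the example above; the class name NestedNamedGroup and the helper describeMatches are illustrative. Group CHAR (also group number 1) keeps both characters, while the inner repeated group (number 2) keeps only its last iteration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedNamedGroup {

    // Hypothetical helper: reports each match with the named group and the inner numbered group.
    static List<String> describeMatches(String input) {
        Pattern pattern = Pattern.compile("(?<CHAR>(\\w){2,2})[\\S]*\\k<CHAR>");
        Matcher m = pattern.matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group() + " CHAR=" + m.group("CHAR") + " group2=" + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        describeMatches("Cimicic galagala buffalo buffalo").forEach(System.out::println);
        // icic CHAR=ic group2=c
        // galaga CHAR=ga group2=a
    }
}
```

Neither "buffalo" is reported: [\\S]* cannot cross the space between the two words, and no two-character sequence repeats within a single "buffalo".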


Example Project

Dependencies and Technologies Used:

  • JDK 1.8
  • Maven 3.0.4

Regex Back References
  • regex-backreferences
    • src
      • main
        • java
          • com
            • logicbig
              • example
