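- A regex pattern is applied to the input string from left to right, and a source character consumed by one match cannot be reused in another match.
- For example, the regex “121” matches “31212142121” only twice, as “_121____121”: the occurrence starting at index 3 overlaps the first match and is skipped.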
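The left-to-right, non-overlapping behaviour can be checked with a small sketch. The class and method names (`NonOverlappingMatches`, `matchStarts`) are illustrative assumptions, not part of these notes; the code simply collects the start index of every match that `Matcher.find()` reports.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class NonOverlappingMatches {

    // Returns the start index of each match found while scanning left to right.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            starts.add(matcher.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // "121" occurs at indexes 1, 3 and 8, but the occurrence at 3
        // shares a character with the match at 1, so find() skips it.
        System.out.println(matchStarts("121", "31212142121")); // prints [1, 8]
    }
}
```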
String str = "bbb"; // sample input (assumed here; the snippet does not define str)
System.out.println("Using String matches method: " + str.matches(".bb"));
System.out.println("Using Pattern matches method: " + Pattern.matches(".bb", str));
Pattern.matches("[a-e1-3].", "d#"); // true: 'd' is in the range a-e and . matches '#'
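Both `String.matches` and `Pattern.matches` succeed only when the whole input matches the pattern. The sketch below contrasts that with `Matcher.find()`, which looks for a match anywhere in the input. The class name, helper names, and the sample input "abbc" are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
class WholeInputMatch {

    // True only if the entire input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True if any part of the input matches the regex.
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole(".bb", "abb"));      // true
        System.out.println(matchesWhole(".bb", "abbc"));     // false: the trailing 'c' is left over
        System.out.println(containsMatch(".bb", "abbc"));    // true: "abb" is found inside the input
        System.out.println(matchesWhole("[a-e1-3].", "d#")); // true
        System.out.println("abb".matches(".bb"));            // true, same result as Pattern.matches
    }
}
```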
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class PatternExample {
    public static void main(String[] args) {
        try {
            Pattern pattern = Pattern.compile(".xx.");
            Matcher matcher = pattern.matcher("MxxY");
            System.out.println("Input String matches regex - " + matcher.matches());
            // bad regular expression: * has nothing to repeat, so compile() throws
            pattern = Pattern.compile("*xx*");
        } catch (PatternSyntaxException pse) {
            // print the description of the syntax error
            System.out.println(pse.getMessage());
        }
    }
}
Capturing Groups
- Parentheses () in a regex group multiple characters so that they are treated as a single unit.
- The portion of the input that matches a capturing group is saved in memory and can be recalled with a backreference.
- The matcher.groupCount() method returns the number of capturing groups in the pattern.
- For example, ((a)(bc)) contains 3 capturing groups: ((a)(bc)), (a) and (bc).
- A backreference is written in the regular expression as a backslash (\) followed by the number of the group to be recalled.
System.out.println(Pattern.matches("(\\w\\d)\\1", "a2a2")); // true: \1 is a2
System.out.println(Pattern.matches("(\\w\\d)\\1", "a2b2")); // false: \1 is a2, but the input continues with b2
System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B2AB")); // true: \1 is AB, \2 is B2
System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B3AB")); // false: \2 is B2, but the input continues with B3
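Group numbering and `groupCount()` can be inspected directly on a `Matcher`. A minimal sketch, assuming a hypothetical helper `capturedGroups` that lists what each numbered group captured; groups are numbered by the position of their opening parenthesis, from left to right.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
class CapturingGroupsDemo {

    // Returns the text captured by groups 1..groupCount(),
    // or an empty list if the whole input does not match.
    static List<String> capturedGroups(String regex, String input) {
        List<String> groups = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (matcher.matches()) {
            for (int i = 1; i <= matcher.groupCount(); i++) {
                groups.add(matcher.group(i));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        Matcher matcher = Pattern.compile("((a)(bc))").matcher("abc");
        System.out.println(matcher.groupCount());                  // 3
        System.out.println(capturedGroups("((a)(bc))", "abc"));    // [abc, a, bc]
        // Group 1 captured "a2", so \1 requires "a2" to appear again.
        System.out.println(capturedGroups("(\\w\\d)\\1", "a2a2")); // [a2]
    }
}
```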
RegEx Keywords
| Regex Basics | Description |
|---|---|
| ^ | The start of a string |
| $ | The end of a string |
| . | Wildcard which matches any character, except newline (\n). |
| \| | Alternation: matches the expression on either side (e.g. a\|b matches a or b) |
| \ | Used to escape a special character |
| a | The character “a” |
| ab | The string “ab” |
| Quantifiers | Description |
|---|---|
| * | Matches 0 or more of the previous element (e.g. xy*z matches “xz”, “xyz”, “xyyz”, etc.) |
| ? | Matches 0 or 1 of the previous element |
| + | Matches 1 or more of the previous element |
| {5} | Matches exactly 5 of the previous element |
| {5,} | Matches 5 or more of the previous element |
| {5,10} | Matches between 5 and 10 of the previous element (written without spaces inside the braces) |
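The quantifier rows can be tried out with `Pattern.matches`, which tests the whole input against the regex. This is a small sketch; the class name and sample inputs are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
class QuantifierDemo {

    // True only if the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("xy*z", "xz"));               // true: zero 'y' is allowed
        System.out.println(matches("xy*z", "xyyz"));             // true
        System.out.println(matches("xy?z", "xyyz"));             // false: at most one 'y'
        System.out.println(matches("xy+z", "xz"));               // false: at least one 'y' is required
        System.out.println(matches("\\d{5}", "12345"));          // true
        System.out.println(matches("\\d{5}", "1234"));           // false: only 4 digits
        System.out.println(matches("\\d{5,}", "1234567"));       // true
        System.out.println(matches("\\d{5,10}", "12345678901")); // false: 11 digits
    }
}
```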
| Character Classes | Description |
|---|---|
| \s | Matches a whitespace character |
| \S | Matches a non-whitespace character |
| \w | Matches a word character |
| \W | Matches a non-word character |
| \d | Matches one digit |
| \D | Matches one non-digit |
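| [\b] | A backspace character in some regex flavors; not supported by java.util.regex inside a character class (use \x08 instead) |
| \cX | The control character corresponding to X (e.g. \cM is a carriage return) |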
| Special Characters | Description |
|---|---|
| \n | Matches a newline |
| \t | Matches a tab |
| \r | Matches a carriage return |
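| \0ZZZ | Matches the character with octal value ZZZ (Java requires the leading 0, because \1 to \9 are backreferences) |
| \xZZ | Matches the character with hex value ZZ |
| \0 | A null character in some flavors; in Java write \x00, since \0 must be followed by octal digits |
| \v | A vertical tab; since Java 8, \v matches any vertical whitespace character |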
| Groups | Description |
|---|---|
| (xyz) | Grouping of characters |
| (?:xyz) | Non-capturing group of characters |
| [xyz] | Matches any one of the listed characters (x, y, or z) |
| [^xyz] | Matches a character other than x or y or z |
| [a-q] | Matches a character from within a specified range |
| [0-7] | Matches a digit from within a specified range |
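The difference between capturing and non-capturing groups shows up in `groupCount()` and in group numbering. A minimal sketch, with hypothetical names and sample inputs chosen for illustration; it also exercises the character-class rows from the table.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
class GroupKindsDemo {

    // Number of capturing groups declared in the regex.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(ab)+(\\d)"));   // 2
        System.out.println(groupCount("(?:ab)+(\\d)")); // 1: (?:ab) groups without capturing

        Matcher matcher = Pattern.compile("(?:ab)+(\\d)").matcher("abab7");
        if (matcher.matches()) {
            System.out.println(matcher.group(1));       // 7: the digit is now group 1
        }

        System.out.println(Pattern.matches("[xyz]", "y"));       // true
        System.out.println(Pattern.matches("[^xyz]", "y"));      // false
        System.out.println(Pattern.matches("[a-q][0-7]", "k5")); // true
    }
}
```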
| String Replacements | Description |
|---|---|
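| $` | Insert the text before the match (JavaScript/Perl style; not supported in Java replacement strings) |
| $' | Insert the text after the match (not supported in Java replacement strings) |
| $+ | Insert the last matched group (not supported in Java replacement strings) |
| $& | Insert the entire match (in Java, use $0) |
| $n | Insert the nth captured group |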
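In Java, replacement strings understand numbered references (`$n`, with `$0` for the whole match) and named references (`${name}`). A short sketch under those rules; the class, the helper names, and the date example are assumptions chosen for illustration.

```java
import java.util.regex.Matcher;

// Hypothetical class name, for illustration only.
class ReplacementDemo {

    // Reorders an ISO date (yyyy-MM-dd) as dd/MM/yyyy using numbered group references.
    static String toDayFirst(String isoDate) {
        return isoDate.replaceAll("(\\d{4})-(\\d{2})-(\\d{2})", "$3/$2/$1");
    }

    // Same reordering, using named groups and ${name} references.
    static String toDayFirstNamed(String isoDate) {
        return isoDate.replaceAll("(?<y>\\d{4})-(?<m>\\d{2})-(?<d>\\d{2})", "${d}/${m}/${y}");
    }

    // Wraps every word in brackets; $0 stands for the entire match.
    static String bracketWords(String text) {
        return text.replaceAll("\\w+", "[$0]");
    }

    public static void main(String[] args) {
        System.out.println(toDayFirst("2024-01-15"));      // 15/01/2024
        System.out.println(toDayFirstNamed("2024-01-15")); // 15/01/2024
        System.out.println(bracketWords("one two"));       // [one] [two]
        // quoteReplacement() escapes $ and \ so the replacement is inserted literally.
        System.out.println("cost: X".replaceAll("X", Matcher.quoteReplacement("$5"))); // cost: $5
    }
}
```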
| Assertions | Description |
|---|---|
| (?=xyz) | Positive lookahead |
| (?!xyz) | Negative lookahead |
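| (?<=xyz) | Positive lookbehind |
| (?<!xyz) | Negative lookbehind |
| \b | Word boundary (usually a position between \w and \W) |
| (?#comment) | Inline comment in Perl-style flavors; java.util.regex instead allows # comments when the Pattern.COMMENTS flag is set |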
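Lookahead and lookbehind check the surrounding text without including it in the match. The sketch below assumes a hypothetical helper `firstMatch` and made-up sample inputs; it returns the first matched text, or null when nothing matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
class LookaroundDemo {

    // Returns the text of the first match, or null if the regex is not found.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String prices = "100 USD, 200 EUR";
        // Positive lookahead: digits that are followed by " USD".
        System.out.println(firstMatch("\\d+(?= USD)", prices));        // 100
        // Negative lookahead: a whole number that is not followed by " USD".
        System.out.println(firstMatch("\\b\\d+\\b(?! USD)", prices));  // 200

        String text = "cost: $45 and 30";
        // Positive lookbehind: digits that are preceded by a dollar sign.
        System.out.println(firstMatch("(?<=\\$)\\d+", text));          // 45
        // Negative lookbehind: a number that does not start right after a dollar sign.
        System.out.println(firstMatch("(?<!\\$)\\b\\d+", text));       // 30
    }
}
```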
Pattern and Matcher
- A Pattern object can be compiled with flags; for example, Pattern.CASE_INSENSITIVE enables case-insensitive matching.
- The Pattern class also provides a split(String) method that is similar to the String class split() method.
- Pattern.toString() returns the regex string from which the pattern was compiled.
- The Matcher class has start() and end() methods that show precisely where the match was found in the input string.
- The Matcher class also provides replaceAll(String replacement) and replaceFirst(String replacement).
package com.journaldev.util;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExamples {
    public static void main(String[] args) {
        // using pattern with flags
        Pattern pattern = Pattern.compile("ab", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher("ABcabdAb");
        // using Matcher find(), group(), start() and end() methods
        while (matcher.find()) {
            System.out.println("Found the text \"" + matcher.group() +
                    "\" starting at " + matcher.start() +
                    " index and ending at index " + matcher.end());
        }
        // using Pattern split() method
        pattern = Pattern.compile("\\W");
        String[] words = pattern.split("one@two#three:four$five");
        for (String s : words) {
            System.out.println("Split using Pattern.split(): " + s);
        }
        // using Matcher.replaceFirst() and replaceAll() methods
        pattern = Pattern.compile("1*2");
        matcher = pattern.matcher("11234512678");
        System.out.println("Using replaceAll: " + matcher.replaceAll("_"));
        System.out.println("Using replaceFirst: " + matcher.replaceFirst("_"));
    }
}
The output of the above Java regex example program is:
Found the text "AB" starting at 0 index and ending at index 2
Found the text "ab" starting at 3 index and ending at index 5
Found the text "Ab" starting at 6 index and ending at index 8
Split using Pattern.split(): one
Split using Pattern.split(): two
Split using Pattern.split(): three
Split using Pattern.split(): four
Split using Pattern.split(): five
Using replaceAll: _345_678
Using replaceFirst: _34512678
Common Matchings
Matching an Email Address
/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,5})$/
- Group 1 ([a-z0-9_.-]+)
  - This section of the expression matches one or more lowercase letters a-z, digits 0-9, underscores, periods, and hyphens. It is followed by an @ sign.
- Group 2 ([\da-z.-]+)
  - Next comes the domain name, which consists of one or more digits, letters a-z, periods, and hyphens. The domain name is followed by a literal period (\.).
- Group 3 ([a-z.]{2,5})
  - Lastly, the third group matches the top-level domain. This section looks for any run of letters or dots that is 2-5 characters long, which also accounts for region-specific top-level domains.
- The regex above therefore matches many commonly used email addresses, such as firstname.lastname@domain.com.
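The email expression can be tried in Java as follows. This is a sketch: the surrounding / delimiters are dropped because they are not part of Java regex syntax, backslashes are doubled for the string literal, and the class and method names are assumptions. Because the character classes list only lowercase letters, uppercase input does not match unless a flag such as Pattern.CASE_INSENSITIVE is added.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
class EmailPatternDemo {

    // The pattern from the text, without the / delimiters.
    private static final Pattern EMAIL =
            Pattern.compile("^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,5})$");

    // Returns {local part, domain, top-level domain}, or null if the input does not match.
    static String[] parts(String email) {
        Matcher matcher = EMAIL.matcher(email);
        if (!matcher.matches()) {
            return null;
        }
        return new String[] { matcher.group(1), matcher.group(2), matcher.group(3) };
    }

    public static void main(String[] args) {
        // prints: firstname.lastname | domain | com
        System.out.println(String.join(" | ", parts("firstname.lastname@domain.com")));
        System.out.println(parts("not-an-email") == null);    // true: no @ sign
        System.out.println(parts("User@Domain.com") == null); // true: uppercase letters are not in the classes
    }
}
```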
Matching a Phone Number
/^\b\d{3}[-.]?\d{3}[-.]?\d{4}\b$/
- Section 1 \b\d{3}
  - This section begins with a word boundary, which asserts that the digits start at the edge of a word. It then matches 3 digits (0-9), followed by a hyphen, a period, or nothing ([-.]?).
- Section 2 \d{3}
  - The second section is similar to the first: it matches 3 digits (0-9), followed by another hyphen, period, or nothing ([-.]?).
- Section 3 \d{4}\b
  - Lastly, this section matches 4 digits instead of 3. The word boundary assertion is used again at the end of the expression, and $ marks the end of the string.
- The regex above therefore identifies phone numbers in the formats 123-123-1234, 123.123.1234, and 1231231234.
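A quick way to check the phone expression in Java is sketched below. The / delimiters are dropped and backslashes are doubled for the string literal; the class and method names are assumptions. Since each separator is matched independently by [-.]?, mixed separators are also accepted.

```java
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
class PhonePatternDemo {

    // The pattern from the text, without the / delimiters.
    private static final Pattern PHONE =
            Pattern.compile("^\\b\\d{3}[-.]?\\d{3}[-.]?\\d{4}\\b$");

    // True if the whole input is a phone number in one of the accepted formats.
    static boolean isPhoneNumber(String input) {
        return PHONE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPhoneNumber("123-123-1234")); // true
        System.out.println(isPhoneNumber("123.123.1234")); // true
        System.out.println(isPhoneNumber("1231231234"));   // true
        System.out.println(isPhoneNumber("123-123.1234")); // true: separators may differ
        System.out.println(isPhoneNumber("123 123 1234")); // false: spaces are not allowed
        System.out.println(isPhoneNumber("123-123-123"));  // false: last section needs 4 digits
    }
}
```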