Find Overlapping Matches
Given a string (a line from a text file), I would like to find all substrings built like this: [[ words ]]. For example, [[foo [[ bar ]] should return both [[foo [[ bar ]] and [[ bar ]].
Solution 1:
Overlapping Matches: Use Lookahead
For a lazy overlap, use this regex:
(?=(\[\[.*?\]\]))
In Python:
import re
pattern = r"(?=(\[\[.*?\]\]))"
print(re.findall(pattern, "[[foo [[ bar ]]"))
print(re.findall(pattern, "[[foo]] and [[bar]]"))
Output:
['[[foo [[ bar ]]', '[[ bar ]]']
['[[foo]]', '[[bar]]']
For a "greedy overlap", use (?=(\[\[.*\]\]))
Output:
['[[foo [[ bar ]]', '[[ bar ]]']
['[[foo]] and [[bar]]', '[[bar]]']
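To make the difference between the two variants easier to see, here is a small sketch on a third input that is not from the original answer (the sample string and the variable names are made up for illustration). Both patterns report at most one match per starting [[, so neither enumerates every possible start/end pairing; they differ only in which closing ]] each start is paired with.

```python
import re

# Hypothetical sample line with three bracketed items
text = "[[a]] [[b]] [[c]]"

# Lazy: each [[ is paired with the nearest following ]]
lazy = re.findall(r"(?=(\[\[.*?\]\]))", text)

# Greedy: each [[ is paired with the last ]] on the line
greedy = re.findall(r"(?=(\[\[.*\]\]))", text)

print(lazy)    # ['[[a]]', '[[b]]', '[[c]]']
print(greedy)  # ['[[a]] [[b]] [[c]]', '[[b]] [[c]]', '[[c]]']
```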
Explanation
- The lookahead (?= ... ) asserts that what is inside the parentheses can be matched, but doesn't consume it, so that we can find overlapping matches
- The parentheses around \[\[.*?\]\] capture the matched string to Group 1
- \[\[ matches [[
- .* greedily matches any chars (except newlines, by default)
- The star quantifier in .*? is made "lazy" by the ? so that the dot only matches as many characters as needed to allow the next token to match (shortest match). Without the ?, the .* first matches the whole string, then backtracks only as far as needed to allow the next token to match (longest match).
- \]\] matches ]]
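To see why the lookahead matters, it may help to compare it with the same pattern written without one. This is only a sketch; the variable names are made up for illustration.

```python
import re

text = "[[foo [[ bar ]]"

# Consuming pattern: the first match swallows the whole string,
# so the inner [[ bar ]] is never reported.
consuming = re.findall(r"\[\[.*?\]\]", text)

# Lookahead pattern: each match is zero-width, so the scan resumes
# one character later and can still find the inner [[ bar ]].
overlapping = re.findall(r"(?=(\[\[.*?\]\]))", text)

print(consuming)    # ['[[foo [[ bar ]]']
print(overlapping)  # ['[[foo [[ bar ]]', '[[ bar ]]']
```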
Solution 2:
This uses a Positive Lookahead assertion for capturing, returning your overlapping matches:
>>> import re
>>> re.findall(r'(?=(\[\[.*?\]\]))', '[[foo [[ bar ]]')
# ['[[foo [[ bar ]]', '[[ bar ]]']
>>> re.findall(r'(?=(\[\[.*?\]\]))', '[[foo]] and [[bar]]')
# ['[[foo]]', '[[bar]]']
Note the ? following the * quantifier, which makes the match non-greedy.
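If you also need to know where each substring begins in the line, re.finditer exposes the match position. A minimal sketch, assuming a helper named find_overlapping (a hypothetical name, not part of either answer):

```python
import re

# Same lazy lookahead pattern as above, compiled once for reuse
OVERLAP = re.compile(r"(?=(\[\[.*?\]\]))")

def find_overlapping(line):
    """Return (start_offset, substring) pairs for each [[ ... ]] found."""
    # The lookahead match itself is empty, so m.start() is the position
    # of the opening [[ and group 1 holds the captured text.
    return [(m.start(), m.group(1)) for m in OVERLAP.finditer(line)]

print(find_overlapping("[[foo [[ bar ]]"))
# [(0, '[[foo [[ bar ]]'), (6, '[[ bar ]]')]
```

Since the question reads lines from a text file, the helper could be called once per line.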