Find Overlapping Matches
Given a string (a line from a text file), I would like to find all substrings built like this: [[ words ]]. For example, [[foo [[ bar ]] should return both [[foo [[ bar ]] and [[ bar ]].
Solution 1:
Overlapping Matches: Use Lookahead
For a lazy overlap, use this regex:
(?=(\[\[.*?\]\]))
In Python:
import re
pattern = r"(?=(\[\[.*?\]\]))"
print(re.findall(pattern, "[[foo [[ bar ]]"))
print(re.findall(pattern, "[[foo]] and [[bar]]"))
Output:
['[[foo [[ bar ]]', '[[ bar ]]']
['[[foo]]', '[[bar]]']
For a "greedy overlap", use (?=(\[\[.*\]\]))
Output:
['[[foo [[ bar ]]', '[[ bar ]]']
['[[foo]] and [[bar]]', '[[bar]]']
Explanation
- The lookahead (?= ... ) asserts that what is inside the parentheses can be matched, but does not consume it, so that we can find overlapping matches.
- The parentheses around \[\[.*?\]\] capture the matched string to Group 1.
- \[\[ matches [[
- .* greedily matches any chars (except newlines)
- The star quantifier in .*? is made "lazy" by the ? so that the dot only matches as many characters as needed to allow the next token to match (shortest match). Without the ?, the .* first matches the whole string, then backtracks only as far as needed to allow the next token to match (longest match).
- \]\] matches ]]
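To make the "asserts but does not consume" point concrete, here is a minimal sketch comparing the same lazy pattern with and without the lookahead wrapper. The variable names consuming and overlapping are illustrative only.

```python
import re

text = "[[foo [[ bar ]]"

# Without a lookahead, the first match consumes the text up to the
# closing ]], so the inner "[[ bar ]]" is never reported.
consuming = re.findall(r"\[\[.*?\]\]", text)

# With the capture inside a lookahead, each match is zero-width, so the
# scan resumes one character later and also finds the inner match.
overlapping = re.findall(r"(?=(\[\[.*?\]\]))", text)

print(consuming)    # ['[[foo [[ bar ]]']
print(overlapping)  # ['[[foo [[ bar ]]', '[[ bar ]]']
```

Because the lookahead itself matches an empty string, re.findall returns the contents of Group 1 rather than the (empty) overall match.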
Solution 2:
This uses a Positive Lookahead assertion for capturing, returning your overlapping matches:
>>> import re
>>> re.findall(r'(?=(\[\[.*?\]\]))', '[[foo [[ bar ]]')
# ['[[foo [[ bar ]]', '[[ bar ]]']
>>> re.findall(r'(?=(\[\[.*?\]\]))', '[[foo]] and [[bar]]')
# ['[[foo]]', '[[bar]]']
Note the ? following the * quantifier, which makes the match non-greedy.
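If the position of each match within the line is also needed, one possible approach is re.finditer with the same pattern. This is only a sketch: the helper name find_overlapping and the constant BRACKETED are hypothetical and do not come from the answers above.

```python
import re

# Same lazy lookahead pattern as above, compiled once for reuse.
BRACKETED = re.compile(r"(?=(\[\[.*?\]\]))")

def find_overlapping(line):
    """Return (start, substring) pairs for every [[ ... ]] candidate in line."""
    # The overall match is empty, so read the offset and text from Group 1.
    return [(m.start(1), m.group(1)) for m in BRACKETED.finditer(line)]

print(find_overlapping("[[foo [[ bar ]]"))
# [(0, '[[foo [[ bar ]]'), (6, '[[ bar ]]')]
```

Since the dot does not match newlines by default, this works line by line, which fits the "line from a text file" case in the question.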