
Find Overlapping Matches

Given a string (a line from a text file), I would like to find all substrings of the form [[ words ]]. For example, [[foo [[ bar ]] should return both [[foo [[ bar ]] and [[ bar ]].

Solution 1:

Overlapping Matches: Use Lookahead

For a lazy overlap, use this regex:

(?=(\[\[.*?\]\]))

In Python:

import re
pattern = r"(?=(\[\[.*?\]\]))"
print(re.findall(pattern, "[[foo [[ bar ]]"))
print(re.findall(pattern, "[[foo]] and [[bar]]"))

Output:

['[[foo [[ bar ]]', '[[ bar ]]']
['[[foo]]', '[[bar]]']

For a "greedy overlap", use (?=(\[\[.*\]\]))

Output:

['[[foo [[ bar ]]', '[[ bar ]]']
['[[foo]] and [[bar]]', '[[bar]]']

Explanation

  • The lookahead (?= ... ) asserts that what is inside the parentheses can be matched at the current position, but it does not consume any characters, which is what lets us find overlapping matches
  • The inner parentheses in (\[\[.*?\]\]) capture the matched string to Group 1
  • \[\[ matches [[
  • .* greedily matches any characters (except a newline)
  • The star quantifier in .*? is made "lazy" by the ? so that the dot only matches as many characters as needed to allow the next token to match (shortest match). Without the ?, the .* first matches the whole string, then backtracks only as far as needed to allow the next token to match (longest match).
  • \]\] matches ]]
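
To make the role of the lookahead concrete, here is a small sketch comparing the same lazy pattern with and without it. The variable names (text, consuming, overlapping) are chosen for illustration only and do not come from the original answer.

```python
import re

text = "[[foo [[ bar ]]"

# Without the lookahead, the first match consumes the whole string,
# so the inner [[ bar ]] is never reported.
consuming = re.findall(r"\[\[.*?\]\]", text)

# With the lookahead, every match is zero-width, so the scan resumes
# one character later and also finds the inner substring.
overlapping = re.findall(r"(?=(\[\[.*?\]\]))", text)

print(consuming)    # ['[[foo [[ bar ]]']
print(overlapping)  # ['[[foo [[ bar ]]', '[[ bar ]]']
```

Because the pattern contains exactly one capturing group, re.findall returns the contents of that group rather than the (empty) overall match.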

Solution 2:

This uses a Positive Lookahead assertion for capturing, returning your overlapping matches:

>>> re.findall(r'(?=(\[\[.*?\]\]))', '[[foo [[ bar ]]')
# ['[[foo [[ bar ]]', '[[ bar ]]']
>>> re.findall(r'(?=(\[\[.*?\]\]))', '[[foo]] and [[bar]]')
# ['[[foo]]', '[[bar]]']

Note the ? following the * quantifier, which makes the match non-greedy.
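
If you also need to know where each overlapping match begins, one possible variation is to use re.finditer instead of re.findall. This is only a sketch: the helper name overlapping_spans is hypothetical and not part of either answer.

```python
import re

# Same lookahead pattern as above, compiled once for reuse.
PATTERN = re.compile(r"(?=(\[\[.*?\]\]))")

def overlapping_spans(text):
    """Return (start_index, substring) pairs for every [[ ... ]] match.

    The lookahead match itself is zero-width, so m.start() is the
    position where the captured substring (Group 1) begins.
    """
    return [(m.start(), m.group(1)) for m in PATTERN.finditer(text)]

print(overlapping_spans("[[foo [[ bar ]]"))
# [(0, '[[foo [[ bar ]]'), (6, '[[ bar ]]')]
```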
