
Match C++ Strings And String Literals Using Regex In Python

I am trying to match Strings (both between double & single quotes) and String Literals in C++ source files. I am using the re library in Python. I have reached the point where

Solution 1:

You can grab all the string literals with the following regex:

r'(?P<prefix>(?:\bu8|\b[LuU])?)(?:"(?P<dbl>[^"\\]*(?:\\.[^"\\]*)*)"|\'(?P<sngl>[^\'\\]*(?:\\.[^\'\\]*)*)\')|R"([^"(]*)\((?P<raw>.*?)\)\4"'

See the regex demo

Explanation:

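  • (?P<prefix>(?:\bu8|\b[LuU])?) - (Group named "prefix") the optional encoding prefix: u8, L, u or U. The \b before each alternative requires a word boundary in front of the prefix, so the trailing u or L of an identifier is not mistaken for one.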
  • (?:"(?P<dbl>[^"\\]*(?:\\.[^"\\]*)*)" - a double-quoted string literal, with the contents between the quotes captured into the group named "dbl". It matches ", then 0+ characters other than \ and ", then any number (0+) of repetitions of an escape sequence (\\.) followed by 0+ characters other than \ and ". This is an unrolled version of (?:[^"\\]|\\.)*. The leading (?: opens the group that wraps both quoted alternatives and is closed after the single-quoted one.
  • | - or
  • \'(?P<sngl>[^\'\\]*(?:\\.[^\'\\]*)*)\') - a single quoted string literal, with the contents between ' captured into Group named "sngl". See details on how it works above.
  • | - or
  • R"([^"(]*)\((?P<raw>.*?)\)\4" - a raw string literal, with the contents captured into the group named "raw". First, R" is matched. Then 0+ characters other than " and ( are captured into Group 4 as the delimiter (named groups are also numbered, so "prefix", "dbl" and "sngl" are Groups 1 to 3). After the opening (, the contents are matched lazily with .*? (pass re.S if the literal can span multiple lines) up to the first ) that is followed by the contents of Group 4 and the closing ".
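To see the two less obvious pieces in isolation, here is a minimal sketch. The names unrolled, plain and raw and the sample strings are made up for illustration. It checks that the unrolled form and the plain alternation capture the same contents, and that the raw-literal branch stops only at the ) that is followed by the delimiter. Taken out of the full pattern, the delimiter is Group 1, so the backreference is \1 rather than \4.

```python
import re

# Unrolled form used in the answer vs. the plain alternation it replaces.
unrolled = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"')
plain = re.compile(r'"((?:[^"\\]|\\.)*)"')

sample = r'x = "a \"quoted\" word"; y = "";'
unrolled_hits = unrolled.findall(sample)
plain_hits = plain.findall(sample)
print(unrolled_hits == plain_hits)

# Raw-literal branch on its own: group 1 is the delimiter, reused via \1.
# re.S lets . match the newline inside the literal.
raw = re.compile(r'R"([^"(]*)\((?P<raw>.*?)\)\1"', re.S)
m = raw.search('auto s = R"xy(line 1\nhas )" inside)xy";')
print(repr(m.group(1)), repr(m.group("raw")))
```

Both forms accept the same strings. The unrolled one usually needs fewer backtracking steps on long literals, which is the usual reason for choosing it.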

Sample Python demo:

import re

p = re.compile(r'(?P<prefix>(?:\bu8|\b[LuU])?)(?:"(?P<dbl>[^"\\]*(?:\\.[^"\\]*)*)"|\'(?P<sngl>[^\'\\]*(?:\\.[^\'\\]*)*)\')|R"([^"(]*)\((?P<raw>.*?)\)\4"')
s = "\"text'\\\"here\"\nL'text\\'\"here'\nu8\"text'\\\"here\"\nu'text\\'\"here'\nU\"text'\\\"here\"\nR\"delimiter(text\"'\"here)delimiter\""
print(s)
print('---------Regex works below ---------')
for x in p.finditer(s):
    # Compare with None so that empty literals ("" or '') are still reported
    if x.group("dbl") is not None:
        print(x.group("dbl"))
    elif x.group("sngl") is not None:
        print(x.group("sngl"))
    else:
        print(x.group("raw"))
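
If the matches are needed as data rather than printed, the same pattern can be wrapped in a small helper. This is only a sketch: find_literals and CPP_LITERAL are hypothetical names, the pattern is compiled with re.S on the assumption that raw literals may span lines, and C++ comments are not skipped.

```python
import re

# Same pattern as in the answer, split over several lines for readability
# and compiled with re.S so that raw literals may contain newlines.
CPP_LITERAL = re.compile(
    r'(?P<prefix>(?:\bu8|\b[LuU])?)'
    r'(?:"(?P<dbl>[^"\\]*(?:\\.[^"\\]*)*)"'
    r'|\'(?P<sngl>[^\'\\]*(?:\\.[^\'\\]*)*)\')'
    r'|R"([^"(]*)\((?P<raw>.*?)\)\4"',
    re.S,
)

def find_literals(source):
    """Return (kind, prefix, contents) for each literal found in source."""
    found = []
    for m in CPP_LITERAL.finditer(source):
        for kind in ("dbl", "sngl", "raw"):
            # "is not None" keeps empty literals such as "" in the result.
            if m.group(kind) is not None:
                found.append((kind, m.group("prefix") or "", m.group(kind)))
                break
    return found

code = 'auto a = u8"hi \\"x\\""; char c = \'\\n\'; auto e = ""; auto r = R"d(a\n"b)d";'
literals = find_literals(code)
for item in literals:
    print(item)
```

Note that the prefix group belongs only to the quoted branch of the pattern, so a prefix in front of a raw literal (for example u8R"(...)") is not captured by this sketch.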
