
How To Match Abbreviations With Their Meaning With Regex?

I'm looking for a regex pattern that matches the following string: Some example text (SET) that demonstrates what I'm looking for. Energy system models (ESM) are used to find spec

Solution 1:

I suggest using

import re
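def contains_abbrev(abbrev, text):
    text = text.lower()
    # Only all-uppercase strings count as abbreviations
    if not abbrev.isupper():
        return False
    cnt = 0
    for c in abbrev.lower():
        if text.find(c) > -1:
            # Drop everything before the matched character and keep scanning from there
            text = text[text.find(c):]
            cnt += 1
            continue
    # Every abbreviation character must have been found, in order
    return cnt == len(abbrev)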
 
text= "Some example text (SET) that demonstrates what I'm looking for. Energy system models (ESM) are used to find specific optima (SCO). Some say computer systems (CUST) are cool. In the summer playing outside (OUTS) should be preferred. Stupid example(s) Stupid example(S) Not stupid example (NSEMPLE), bad example (Bexle)"
abbrev_rx = r'\b(([A-Z])\w*(?:\s+\w+)*?)\s*\((\2[A-Z]*)\)'
print( [x.group() for x in re.finditer(abbrev_rx, text, re.I) if contains_abbrev(x.group(3), x.group(1))] )


The regex used is

(?i)\b(([A-Z])\w*(?:\s+\w+)*?)\s*\((\2[A-Z]*)\)

Details:

  • \b - word boundary
  • (([A-Z])\w*(?:\s+\w+)*?) - Group 1 (text): an ASCII letter captured into Group 2, then 0+ word chars, followed by zero or more occurrences of 1+ whitespace chars plus 1+ word chars, as few as possible
  • \s* - 0+ whitespaces
  • \( - a ( char
  • (\2[A-Z]*) - Group 3 (abbrev): same value as in Group 2 and then 0 or more ASCII letters
  • \) - a ) char.
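To make the group numbering above concrete, here is a small sketch that runs the same pattern against one phrase from the question and prints each group. The `sample` string and the use of `re.compile` are choices made for this illustration only; the answer itself passes the pattern string to `re.finditer`.

```python
import re

# Same pattern as in the answer; re.I replaces the inline (?i) flag
abbrev_rx = re.compile(r'\b(([A-Z])\w*(?:\s+\w+)*?)\s*\((\2[A-Z]*)\)', re.I)

# Illustrative sample taken from the question text
sample = "Energy system models (ESM) are used"
m = abbrev_rx.search(sample)

print(m.group(1))  # Group 1: candidate long form
print(m.group(2))  # Group 2: first letter of the long form
print(m.group(3))  # Group 3: the abbreviation inside the parentheses
```

For this sample, Group 1 is "Energy system models", Group 2 is "E", and Group 3 is "ESM". The backreference \2 is what forces the abbreviation to start with the same letter as the first word of the candidate text.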

Once there is a match, Group 3 is passed as abbrev and Group 1 as text to the contains_abbrev(abbrev, text) function, which checks that abbrev is an uppercase string and that all of its characters occur in text in the same order.
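The ordered-character check can also be written with an iterator. The sketch below is a variant, not the code from the answer: `is_ordered_subsequence` is a hypothetical name, and it behaves slightly differently because it consumes each matched character, whereas the slicing in contains_abbrev keeps the matched character at the start of the remaining text, so a doubled letter can match the same character twice there.

```python
def is_ordered_subsequence(abbrev: str, text: str) -> bool:
    """Return True if abbrev is uppercase and its letters occur in text in order."""
    # Only all-uppercase strings count as abbreviations, as in the answer
    if not abbrev.isupper():
        return False
    remaining = iter(text.lower())
    # `c in remaining` advances the iterator until it finds c, so every
    # match consumes the text up to and including that character
    return all(c in remaining for c in abbrev.lower())


print(is_ordered_subsequence("ESM", "Energy system models"))
print(is_ordered_subsequence("OUTS", "outside"))
print(is_ordered_subsequence("Bexle", "bad example"))
```

The first two calls return True and the third returns False because "Bexle" is not all uppercase. Whether a doubled letter should be allowed to reuse one character of the text depends on what you want to accept as an abbreviation.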

Solution 2:

Regex alone won't be enough; it looks like you need a Python script for this. This should handle all your scenarios:

import re
a="Some example text (SET) that demonstrates what I'm looking for. Energy system models (ESM) are used to find specific optima (SCO). Some say computer systems (CUST) are cool. In the summer playing outside (OUTS) should be preferred.";
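# Capture each parenthesised token twice: with the parentheses (c[0]) and without (c[1])
b=re.findall(r"(\((.*?)\))",a)
a=a.replace(".","")
i=a.split(' ')
for c in b:
    cont=0
    m=[]
    s=i.index(c[0])   # position of "(ABBR)" in the word list
    l=len(c[1])       # look back at most one word per abbreviation letter
    al=s-l
    for j in range(al,s+1):
        # Start collecting at the first word whose initial matches the abbreviation's first letter
        if i[j][0].lower() == c[1][0].lower():
            cont=1
        if cont == 1:
            m.append(i[j])
    print(' '.join(m))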

Output:

Some example text (SET)

Energy system models (ESM)

specific optima (SCO)

computer systems (CUST)

outside (OUTS)
