How To Match Abbreviations With Their Meaning With Regex?
I'm looking for a regex pattern that matches the following string: Some example text (SET) that demonstrates what I'm looking for. Energy system models (ESM) are used to find spec
Solution 1:
I suggest using
import re

def contains_abbrev(abbrev, text):
    text = text.lower()
    if not abbrev.isupper():
        return False
    cnt = 0
    for c in abbrev.lower():
        if text.find(c) > -1:
            text = text[text.find(c):]
            cnt += 1
            continue
    return cnt == len(abbrev)

text = "Some example text (SET) that demonstrates what I'm looking for. Energy system models (ESM) are used to find specific optima (SCO). Some say computer systems (CUST) are cool. In the summer playing outside (OUTS) should be preferred. Stupid example(s) Stupid example(S) Not stupid example (NSEMPLE), bad example (Bexle)"
abbrev_rx = r'\b(([A-Z])\w*(?:\s+\w+)*?)\s*\((\2[A-Z]*)\)'
print([x.group() for x in re.finditer(abbrev_rx, text, re.I) if contains_abbrev(x.group(3), x.group(1))])
The regex used is
(?i)\b(([A-Z])\w*(?:\s+\w+)*?)\s*\((\2[A-Z]*)\)
Details:
\b - word boundary
(([A-Z])\w*(?:\s+\w+)*?) - Group 1 (text): an ASCII letter captured into Group 2, then 0+ word chars followed by 0 or more occurrences of 1+ whitespaces and 1+ word chars, as few as possible
\s* - 0+ whitespaces
\( - a ( char
(\2[A-Z]*) - Group 3 (abbrev): same value as in Group 2 and then 0 or more ASCII letters
\) - a ) char.
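To make the breakdown above easier to follow next to the pattern itself, here is a minimal sketch of the same regex written with re.VERBOSE, so each part carries its own comment. The name ABBREV_RX and the short sample sentence are only illustrative assumptions; the pattern is the one from the answer.

```python
import re

# Same pattern as in the answer; re.VERBOSE ignores the layout whitespace and
# the trailing comments. ABBREV_RX is a hypothetical name used for this sketch.
ABBREV_RX = re.compile(
    r"""
    \b                 # word boundary
    (                  # Group 1 (text): the candidate long form
      ([A-Z])          #   Group 2: its first letter
      \w*              #   rest of the first word
      (?:\s+\w+)*?     #   more whitespace-separated words, as few as possible
    )
    \s*                # optional whitespace before the parenthesis
    \(                 # a literal opening parenthesis
    (\2[A-Z]*)         # Group 3 (abbrev): Group 2's letter, then 0+ letters
    \)                 # a literal closing parenthesis
    """,
    re.VERBOSE | re.IGNORECASE,
)

m = ABBREV_RX.search("Energy system models (ESM) are used")
print(m.group(1))  # candidate long form (Group 1)
print(m.group(3))  # candidate abbreviation (Group 3)
```

Because of re.IGNORECASE, the backreference \2 matches the first letter in either case, which is why the extra uppercase check on Group 3 is still needed afterwards.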
Once there is a match, Group 3 is passed as abbrev and Group 1 is passed as text to the contains_abbrev(abbrev, text) function, which makes sure that abbrev is an uppercase string and that all the chars in abbrev are present in text, in the same order.
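One caveat with the order check: contains_abbrev slices text starting at the matched char rather than after it, so a repeated letter in abbrev can be satisfied by a single char in text. If that matters for your data, a stricter variant could look like the sketch below. The name is_ordered_subsequence is hypothetical and not part of the answer; it uses an iterator so that every matched char is consumed.

```python
def is_ordered_subsequence(abbrev, text):
    """Return True if abbrev is uppercase and its letters occur in text,
    in order, with each letter matching a different char of text."""
    if not abbrev.isupper():
        return False
    remaining = iter(text.lower())
    # `c in remaining` advances the iterator up to and including the first
    # match, so the next letter can only match chars that come after it.
    return all(c in remaining for c in abbrev.lower())


print(is_ordered_subsequence("SET", "Some example text"))
print(is_ordered_subsequence("SS", "set"))
```

It can be dropped in as the filter in the list comprehension in place of contains_abbrev, with the same argument order.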
Solution 2:
A regex alone won't be enough; it looks like you need a Python script for this. The following should handle all your scenarios:
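import re

a = "Some example text (SET) that demonstrates what I'm looking for. Energy system models (ESM) are used to find specific optima (SCO). Some say computer systems (CUST) are cool. In the summer playing outside (OUTS) should be preferred."
# Each match is a tuple: (abbreviation with parentheses, abbreviation without them)
b = re.findall(r"(\((.*?)\))", a)
a = a.replace(".", "")
i = a.split(' ')
for c in b:
    cont = 0
    m = []
    s = i.index(c[0])      # position of the "(ABBR)" token
    l = len(c[1])          # look back at most one word per letter
    al = max(0, s - l)     # clamp so a negative index cannot wrap around
    for j in range(al, s + 1):
        # start collecting at the first word that begins with the abbreviation's first letter
        if i[j][0].lower() == c[1][0].lower():
            cont = 1
        if cont == 1:
            m.append(i[j])
    print(' '.join(m))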
Output:
Some example text (SET)
Energy system models (ESM)
specific optima (SCO)
computer systems (CUST)
outside (OUTS)