Python Regex Split Any \w+ With Some Exceptions
Solution 1:
The key is to use a negative lookahead. I think this covers all the examples on your list, but let me know if there's something I missed.
In [549]: re.split(r'(?u)(?![\+&\/@\d+\.\d+Mr\.])\W+', "Mr.Jones says This is@a&test example_cool man+right more/fun 43.35")
Out[549]: ['Mr.Jones', 'says', 'This', 'is@a&test', 'example_cool', 'man+right', 'more/fun', '43.35']
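The negative lookahead (?!...) stops a split from starting at any character listed in its character class. Note that inside [...] the entries \d+ and Mr are read as individual characters, not as sequences, so the class just means +, &, /, @, ., any digit, M, and r. Let me know if I understood the question correctly.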
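To make the character-class point concrete, here is a small sketch in Python 3 (an assumption, since the session above does not say which version it ran on). It compares the pattern from this answer with a shortened class; the name `simplified` is only illustrative. The digits and letters in the class can be dropped because \W+ can never start on a word character anyway.

```python
import re

text = "Mr.Jones says This is@a&test example_cool man+right more/fun 43.35"

# Pattern from the answer; (?u) is omitted because Python 3 str patterns
# are Unicode-aware by default.
original = r'(?![\+&\/@\d+\.\d+Mr\.])\W+'

# Hypothetical shortened form: only the non-word characters in the class
# matter, since \W+ cannot begin at a digit, 'M', or 'r'.
simplified = r'(?![+&/@.])\W+'

tokens_original = re.split(original, text)
tokens_simplified = re.split(simplified, text)

print(tokens_original)
print(tokens_original == tokens_simplified)
```

The lookahead only inspects the character where a split would begin, so a full stop at the end of a sentence stays attached to the preceding word ("stop. Go" yields "stop." and "Go").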
Solution 2:
I don't think you want to split e-mail addresses like jones@gmail.com into jones@gmail and com, so I changed your exception from full stops surrounded by digits to full stops followed by an alphanumeric character.
re.split(r'(?u)(?![_/&@.])\W+|(?<!Mr|Dr)\.(?!\w)\W*', unicode_text)
[u'Mr.', u'Jones', u'email', u'jones@gmail.com', u'12.455', u'12', u'254.25', u'says', u'This', u'is@a&test', u'example_cool', u'man', u'right', u'more/fun', u'43.35', u'And', u'so', u'we', u'stopped', u'And', u'then', u'we', u'started', u'again', u'\u0648\u0628\u0639\u062f\u0647\u0627', u'\u0631\u062c\u0639\u0646\u0627', u'\u0625\u0644\u0649', u'\u0627\u0644\u0645\u0646\u0632\u0644', u'\u0648\u0642\u0627\u0628\u0644\u0646\u0627', u'\u0623\u0635\u062f\u0642\u0627\u0621\u0646\u0627', u'\u0648\u0634\u0631\u0628\u0646\u0627', u'\u0627\u0644\u0634\u0627\u064a', u'']
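The input string for the call above is not shown, so the following is a self-contained sketch on a made-up sentence (`sample` is an assumption, not the original `unicode_text`). It is written for Python 3, where str is already Unicode, so the u'' prefixes and the (?u) flag are not needed. It also drops the empty string that a sentence-final full stop leaves at the end of the list.

```python
import re

# Same pattern as in the answer, minus the (?u) flag (redundant in Python 3).
pattern = r'(?![_/&@.])\W+|(?<!Mr|Dr)\.(?!\w)\W*'

# Hypothetical sample text; the original input is not shown in the answer.
sample = "Mr. Jones email jones@gmail.com 12.455 says man+right. And so we stopped."

raw_tokens = re.split(pattern, sample)

# A match at the very end of the string produces a trailing '' token,
# so filter out empty strings.
tokens = [tok for tok in raw_tokens if tok]

print(raw_tokens)
print(tokens)
```

Note that, unlike Solution 1, this pattern does split on +, which is why man+right comes out as two tokens here, as in the output above.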