Regex Tokenizer To Split A Text Into Words, Digits And Punctuation Marks
What I want to do is split a text into its ultimate elements. For example:
from nltk.tokenize import *
txt = 'A sample sentences with digits like 2.119,99 or 2,99 are awesome.'
How can I tokenize this text into words, digits, and punctuation marks?
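As an illustration of the problem, NLTK's stock wordpunct_tokenize (which matches r'\w+|[^\w\s]+') breaks such numbers apart at every period and comma, which is roughly what a custom pattern needs to avoid:
from nltk.tokenize import wordpunct_tokenize
txt = 'A sample sentences with digits like 2.119,99 or 2,99 are awesome.'
# every '.' and ',' becomes its own token, so 2.119,99 falls apart
wordpunct_tokenize(txt)
['A', 'sample', 'sentences', 'with', 'digits', 'like', '2', '.', '119', ',', '99', 'or', '2', ',', '99', 'are', 'awesome', '.']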
Solution 1:
I created a pattern that tries to keep periods and commas occurring inside words and numbers as part of the token. Hope this helps:
txt = "Today it's 07.May 2011. Or 2.999."
regexp_tokenize(txt, pattern=r'\w+(?:[.,]\w+)*|\S+')
['Today', 'it', "'s", '07.May', '2011', '.', 'Or', '2.999', '.']
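Applying the same pattern to the sentence from the question keeps the decimal and thousands-separated numbers as single tokens (a quick sketch; the last line is the output the pattern should produce):
from nltk.tokenize import regexp_tokenize
txt = 'A sample sentences with digits like 2.119,99 or 2,99 are awesome.'
# \w+(?:[.,]\w+)* glues word/digit runs joined by '.' or ',' into one token; \S+ picks up stray punctuation
regexp_tokenize(txt, pattern=r'\w+(?:[.,]\w+)*|\S+')
['A', 'sample', 'sentences', 'with', 'digits', 'like', '2.119,99', 'or', '2,99', 'are', 'awesome', '.']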