
Regex Tokenizer To Split A Text Into Words, Digits And Punctuation Marks

What I want to do is split a text into its smallest elements. For example:

from nltk.tokenize import *
txt = 'A sample sentences with digits like 2.119,99 or 2,99 are awesome.'

Solution 1:

I created a pattern that keeps periods and commas inside words and numbers attached to their token, and splits off any other punctuation. Hope this helps:

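from nltk.tokenize import regexp_tokenize

txt = "Today it's 07.May 2011. Or 2.999."
# (?:...) is a non-capturing group, so whole matches are returned as tokens
regexp_tokenize(txt, pattern=r'\w+(?:[.,]\w+)*|\S+')
['Today', 'it', "'s", '07.May', '2011', '.', 'Or', '2.999', '.']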
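
For comparison, here is a minimal sketch of the same idea using only the standard-library re module, applied to the sentence from the question. The names TOKEN_PATTERN and tokenize are made up for this illustration and are not part of NLTK. It assumes the goal is to keep separators inside numbers and words (as in 2.119,99) while splitting off trailing punctuation.

```python
import re

# Hypothetical helper names, chosen for this sketch only.
# \w+(?:[.,]\w+)*  -> a run of word characters, optionally continued by
#                     '.' or ',' followed by more word characters
# \S+              -> anything else that is not whitespace (punctuation etc.)
TOKEN_PATTERN = re.compile(r"\w+(?:[.,]\w+)*|\S+")


def tokenize(text):
    """Return the list of tokens found in text, in order."""
    # findall returns whole matches because the group is non-capturing.
    return TOKEN_PATTERN.findall(text)


txt = 'A sample sentences with digits like 2.119,99 or 2,99 are awesome.'
print(tokenize(txt))
# ['A', 'sample', 'sentences', 'with', 'digits', 'like', '2.119,99', 'or', '2,99', 'are', 'awesome', '.']
```

Because the first alternative only consumes a period or comma when a word character follows it, the final period of a sentence is left over and picked up by \S+ as its own token.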
