How Can I Get Words After And Before A Specific Token?
I currently work on a project which is simply creating basic corpus databases and tokenizes texts. But it seems I am stuck in a matter. Assume that we have those things: import os,
Solution 1:
I think what you want is:
- (Optionally) a word and a space;
- (Always)
'blue'
; - (Optionally) a space and a word.
Therefore one appropriate regex would be:
r'(?i)((?:\w+\s)?blue(?:\s\w+)?)'
For example:
>>> import re
>>> text = """My blue car
the blue jacket
their blue phone
a blue alien
End sentence. Blue is
is blue.""">>> re.findall(r'(?i)((?:\w+\s)?{0}(?:\s\w+)?)'.format('blue'), text)
['My blue car', 'the blue jacket', 'their blue phone', 'a blue alien', 'Blue is', 'is blue']
See demo and token-by-token explanation here.
Solution 2:
Let's say token is test.
(?=^test\s+.*|.*?\s+test\s+.*?|.*?\s+test$).*
You can use lookahead.It will not eat up anything and at the same time validate as well.
Solution 3:
Regex can be sometimes slow (if not implemented correctly) and moreover accepted answer did not work for me in several cases.
So I went for the brute force solution (not saying it is the best one), where keyword can be composed of several words:
@staticmethoddeffind_neighbours(word, sentence):
prepost_map = []
if word notin sentence:
return prepost_map
split_sentence = sentence.split(word)
for i inrange(0, len(split_sentence) - 1):
prefix = ""
postfix = ""
prefix_list = split_sentence[i].split()
postfix_list = split_sentence[i + 1].split()
iflen(prefix_list) > 0:
prefix = prefix_list[-1]
iflen(postfix_list) > 0:
postfix = postfix_list[0]
prepost_map.append([prefix, word, postfix])
return prepost_map
Empty string before or after the keyword indicates that keyword was the first or the last word in the sentence, respectively.
Post a Comment for "How Can I Get Words After And Before A Specific Token?"