
How Can I Get Words After And Before A Specific Token?

I am currently working on a project that builds basic corpus databases and tokenizes texts, but I am stuck on one point. Assume that we have the following: import os,

Solution 1:

I think what you want is:

  1. (Optionally) a word and a space;
  2. (Always) 'blue';
  3. (Optionally) a space and a word.

Therefore one appropriate regex would be:

r'(?i)((?:\w+\s)?blue(?:\s\w+)?)'

For example:

>>> import re
>>> text = """My blue car
... the blue jacket
... their blue phone
... a blue alien
... End sentence. Blue is
... is blue."""
>>> re.findall(r'(?i)((?:\w+\s)?{0}(?:\s\w+)?)'.format('blue'), text)
['My blue car', 'the blue jacket', 'their blue phone', 'a blue alien', 'Blue is', 'is blue']

See demo and token-by-token explanation here.
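If the neighbouring words are needed as separate values rather than as one string, the same pattern can carry one capturing group per part. The following is a minimal sketch, not part of the original answer: the helper name `neighbours` and the tuple layout are assumptions chosen for illustration, and `re.escape` is added so that tokens containing regex metacharacters are treated literally.

```python
import re

def neighbours(token, text):
    """Return (before, match, after) tuples for each occurrence of token.

    Hypothetical helper: the name and return shape are illustrative.
    A missing neighbour is reported as an empty string.
    """
    pattern = re.compile(
        r'(?:(\w+)\s)?({0})(?:\s(\w+))?'.format(re.escape(token)),
        re.IGNORECASE,
    )
    # With several capturing groups, findall returns one tuple per match.
    return pattern.findall(text)

text = "My blue car\nthe blue jacket\nEnd sentence. Blue is\nis blue."
print(neighbours("blue", text))
# [('My', 'blue', 'car'), ('the', 'blue', 'jacket'), ('', 'Blue', 'is'), ('is', 'blue', '')]
```

Matches do not overlap, so a word that is consumed as the "after" neighbour of one occurrence cannot also be the "before" neighbour of the next.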

Solution 2:

Let's say the token is test.

        (?=^test\s+.*|.*?\s+test\s+.*?|.*?\s+test$).*

You can use a lookahead. It does not consume any characters, and it validates the line at the same time.

http://regex101.com/r/wK1nZ1/2
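As a rough illustration of how this lookahead pattern might be used from Python, the sketch below keeps only the lines in which the token appears as a whitespace-delimited word. The helper name `lines_containing` is hypothetical, and wrapping the token in `re.escape` is an addition that the answer does not mention.

```python
import re

def lines_containing(token, lines):
    """Keep the lines where token appears as a whitespace-delimited word.

    Hypothetical helper; the name and signature are illustrative.
    """
    t = re.escape(token)
    # The lookahead checks three cases: token at the start, in the middle,
    # or at the end of the line. The trailing .* then matches the line.
    pattern = re.compile(rf'(?=^{t}\s+.*|.*?\s+{t}\s+.*?|.*?\s+{t}$).*')
    return [line for line in lines if pattern.match(line)]

sample = ["test one", "a test here", "end test", "testing one", "contest here", "test"]
print(lines_containing("test", sample))
# ['test one', 'a test here', 'end test']
```

Note that a line consisting of the token alone ("test") is not matched, because each of the three alternatives requires whitespace on at least one side of the token.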

Solution 3:

Regex can sometimes be slow (if not written carefully), and moreover the accepted answer did not work for me in several cases.

So I went for a brute-force solution (not saying it is the best one), where the keyword can be composed of several words:

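# Intended to live inside a class; drop the decorator to use it as a plain function.
@staticmethod
def find_neighbours(word, sentence):
    # Collects [previous word, keyword, next word] for every occurrence.
    prepost_map = []

    if word not in sentence:
        return prepost_map

    # Splitting on the keyword leaves the text before and after each occurrence.
    split_sentence = sentence.split(word)
    for i in range(len(split_sentence) - 1):
        prefix = ""
        postfix = ""

        prefix_list = split_sentence[i].split()
        postfix_list = split_sentence[i + 1].split()

        # Last word before the keyword, if there is one.
        if len(prefix_list) > 0:
            prefix = prefix_list[-1]

        # First word after the keyword, if there is one.
        if len(postfix_list) > 0:
            postfix = postfix_list[0]

        prepost_map.append([prefix, word, postfix])

    return prepost_map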

An empty string before or after the keyword indicates that the keyword was the first or the last word in the sentence, respectively.
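To show what the function returns, here is a small usage sketch. So that it runs on its own, it uses a condensed restatement of the function as a plain module-level function (no decorator); the sample sentences are made up for illustration.

```python
def find_neighbours(word, sentence):
    # Condensed restatement of the brute-force approach above.
    if word not in sentence:
        return []
    parts = sentence.split(word)
    result = []
    # Each adjacent pair of pieces surrounds one occurrence of the keyword.
    for before, after in zip(parts, parts[1:]):
        prefix = before.split()[-1:] or [""]
        postfix = after.split()[:1] or [""]
        result.append([prefix[0], word, postfix[0]])
    return result

print(find_neighbours("blue", "My blue car"))
# [['My', 'blue', 'car']]
print(find_neighbours("blue jacket", "the blue jacket is warm"))
# [['the', 'blue jacket', 'is']]
print(find_neighbours("blue", "blue is"))
# [['', 'blue', 'is']]
print(find_neighbours("blue", "a blueberry pie"))
# [['a', 'blue', 'berry']]
```

The last call shows a limitation of splitting on a plain substring: the keyword is also found inside longer words. The search is case-sensitive as well, so "Blue is" yields no result for the keyword "blue".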
