
Separate Keywords And @ Mentions From Dataset

I have a large dataset with several columns and about 10k rows spread across more than 100 CSV files. For now I am concerned with only one column, which is in message format, and from them I

Solution 1:

I suspect you want to use a regular expression. I don't know the exact format that your @ mentions and # keywords are allowed to take, but I would guess that something of the form @([a-zA-Z0-9]+)[^a-zA-Z0-9] would work.

#!/usr/bin/env python3
import re

test_string = """Text
"Let's Bounce!😉
Loving the energy & Microphonic Mayhem while…"
RT @IVijayboi: etc etc"""

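# Capture the letters/digits after "@". The trailing [^a-zA-Z0-9] needs one more
# character after the name, so a mention at the very end of the text is missed.
mention_match = re.compile(r'@([a-zA-Z0-9]+)[^a-zA-Z0-9]')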
for match in mention_match.finditer(test_string):
    print(match.group(1))

# Same idea for "#" keywords; the sample text above happens to contain none.
hashtag_match = re.compile(r'#([a-zA-Z0-9]+)[^a-zA-Z0-9]')
for match in hashtag_match.finditer(test_string):
    print(match.group(1))
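
One possible variation, as a sketch only: dropping the trailing character class lets a mention at the very end of the text match, and stops one match from swallowing the "@" or "#" that starts the next one. The helper name extract_tags and the decision to allow underscores via \w are my assumptions; the question does not say which characters a name may contain.

```python
import re

# Assumption: a name is a run of "word" characters (letters, digits, underscore).
# In Python 3, \w on str patterns also matches non-ASCII letters and digits.
MENTION_RE = re.compile(r'@(\w+)')
HASHTAG_RE = re.compile(r'#(\w+)')


def extract_tags(text):
    """Return (mentions, hashtags) found in text, in order of appearance."""
    return MENTION_RE.findall(text), HASHTAG_RE.findall(text)


mentions, hashtags = extract_tags("RT @IVijayboi: great set #live @dj_x")
print(mentions)  # ['IVijayboi', 'dj_x']
print(hashtags)  # ['live']
```

Note that a pattern this loose also picks up the part after "@" in an email address, so it may need tightening for messages that contain those.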

Hopefully that gives you enough to get started with.
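
To carry the same idea over to a column spread across many CSV files, something along these lines might do. It is a rough sketch under stated assumptions: the column name "message", the data/*.csv location, the UTF-8 encoding and the helper name collect_tags are all hypothetical, since the question does not give them.

```python
import csv
import glob
import re
from collections import Counter

# Assumption: names are runs of word characters (letters, digits, underscore).
MENTION_RE = re.compile(r'@(\w+)')
HASHTAG_RE = re.compile(r'#(\w+)')


def collect_tags(paths, column="message"):
    """Count mentions and hashtags in one column across several CSV files."""
    mentions, hashtags = Counter(), Counter()
    for path in paths:
        # newline="" lets the csv module handle line breaks inside quoted fields.
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                text = row.get(column) or ""  # tolerate a missing or empty cell
                mentions.update(MENTION_RE.findall(text))
                hashtags.update(HASHTAG_RE.findall(text))
    return mentions, hashtags


if __name__ == "__main__":
    # Hypothetical location of the CSV files.
    mentions, hashtags = collect_tags(glob.glob("data/*.csv"))
    print(mentions.most_common(10))
    print(hashtags.most_common(10))
```

Counting with Counter is just one choice; if you need the tags per row instead, call findall on each message and store the resulting lists alongside the row.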
