How to Apply pos_tag_sents() to a Pandas DataFrame Efficiently
When you want to POS-tag a column of text stored in a pandas DataFrame, with one sentence per row, the majority of implementations on SO use the apply method dfData['POS
Solution 1:
Input
$ cat test.csv
ID,Task,label,Text
1,Collect Information,no response,cozily married practical athletics Mr. Brown flat
2,New Credit,no response,active married expensive soccer Mr. Chang flat
3,Collect Information,response,healthy single expensive badminton Mrs. Green flat
4,Collect Information,response,cozily married practical soccer Mr. Brown hierachical
5,Collect Information,response,cozily single practical badminton Mr. Brown flat
TL;DR
>>> from nltk import word_tokenize, pos_tag, pos_tag_sents
>>> import pandas as pd
>>> df = pd.read_csv('test.csv', sep=',')
>>> df['Text']
0 cozily married practical athletics Mr. Brown flat
1 active married expensive soccer Mr. Chang flat
2 healthy single expensive badminton Mrs. Green ...
3 cozily married practical soccer Mr. Brown hier...
4 cozily single practical badminton Mr. Brown flat
Name: Text, dtype: object
>>> texts = df['Text'].tolist()
>>> tagged_texts = pos_tag_sents(map(word_tokenize, texts))
>>> tagged_texts
[[('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('athletics', 'NNS'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')], [('active', 'JJ'), ('married', 'VBD'), ('expensive', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Chang', 'NNP'), ('flat', 'JJ')], [('healthy', 'JJ'), ('single', 'JJ'), ('expensive', 'JJ'), ('badminton', 'NN'), ('Mrs.', 'NNP'), ('Green', 'NNP'), ('flat', 'JJ')], [('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('hierachical', 'JJ')], [('cozily', 'RB'), ('single', 'JJ'), ('practical', 'JJ'), ('badminton', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')]]
>>> df['POS'] = tagged_texts
>>> df
ID Task label \
0 1 Collect Information no response
1 2 New Credit no response
2 3 Collect Information response
3 4 Collect Information response
4 5 Collect Information response
Text \
0 cozily married practical athletics Mr. Brown flat
1 active married expensive soccer Mr. Chang flat
2 healthy single expensive badminton Mrs. Green ...
3 cozily married practical soccer Mr. Brown hier...
4 cozily single practical badminton Mr. Brown flat
POS
0 [(cozily, RB), (married, JJ), (practical, JJ),...
1 [(active, JJ), (married, VBD), (expensive, JJ)...
2 [(healthy, JJ), (single, JJ), (expensive, JJ),...
3 [(cozily, RB), (married, JJ), (practical, JJ),...
4 [(cozily, RB), (single, JJ), (practical, JJ), ...
In Long:
First, you can extract the Text column to a list of strings:
texts = df['Text'].tolist()
Then you can apply the word_tokenize function:
map(word_tokenize, texts)
Note that @Boud's suggestion is almost the same, using df.apply:
df['Text'].apply(word_tokenize)
Then you dump the tokenized text into a list of lists of strings:
df['Text'].apply(word_tokenize).tolist()
Then you can use pos_tag_sents:
pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )
Then you add the column back to the DataFrame:
df['POS'] = pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )
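The steps above can be collected into one small helper. This is only a sketch: the name add_pos_column and its tokenize / tag_sents parameters are made up for illustration (they let you check the wiring with stand-in functions), and the fillna("") step is an added assumption about how you want missing text handled, since the tokenizer expects strings. With the defaults it needs the NLTK tokenizer and tagger data to be installed.

```python
import pandas as pd


def add_pos_column(df, text_col="Text", pos_col="POS", tokenize=None, tag_sents=None):
    """Return a copy of df with a column holding a list of (token, tag) pairs per row.

    tokenize and tag_sents default to NLTK's word_tokenize and pos_tag_sents.
    They are parameters only so the helper can be tried with stand-ins
    (a hypothetical convenience, not part of NLTK or pandas).
    """
    if tokenize is None or tag_sents is None:
        # Imported lazily; requires the NLTK tokenizer and tagger data.
        from nltk import word_tokenize, pos_tag_sents
        tokenize = tokenize or word_tokenize
        tag_sents = tag_sents or pos_tag_sents

    # Assumption: missing values are treated as empty text.
    texts = df[text_col].fillna("").astype(str).tolist()
    tokenized = [tokenize(text) for text in texts]

    out = df.copy()
    # One call for the whole column instead of one call per row.
    out[pos_col] = pd.Series(tag_sents(tokenized), index=out.index)
    return out
```

With the CSV from the Input section, usage would look like df = add_pos_column(pd.read_csv('test.csv')). Returning a copy keeps the original DataFrame untouched, which is a design choice you may not need.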
Solution 2:
By applying pos_tag on each row, the Perceptron model is loaded each time (a costly operation, as it reads a pickle from disk).
If you instead collect all the rows and send them to pos_tag_sents (which takes a list(list(str))), the model is loaded once and used for all of them.
See the source.
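To make the difference concrete without needing the NLTK data, here is a toy illustration. ToyTagger, tag_one and tag_many are hypothetical stand-ins, not NLTK code: the constructor just counts how often the "model" would be loaded, and the tagging rule is deliberately trivial.

```python
class ToyTagger:
    """Stand-in for a tagger whose constructor is expensive (hypothetical)."""

    loads = 0  # how many times the "model" has been loaded

    def __init__(self):
        # A real tagger would read its model from disk here.
        ToyTagger.loads += 1

    def tag(self, tokens):
        # Trivial rule, only so there is something to return.
        return [(tok, "NNP" if tok[:1].isupper() else "NN") for tok in tokens]


def tag_one(tokens):
    # Per-row pattern: a fresh tagger for every sentence.
    return ToyTagger().tag(tokens)


def tag_many(sentences):
    # Batch pattern: one tagger shared by all sentences.
    tagger = ToyTagger()
    return [tagger.tag(tokens) for tokens in sentences]


sentences = [["Mr.", "Brown", "flat"], ["Mr.", "Chang", "flat"], ["Mrs.", "Green", "flat"]]

ToyTagger.loads = 0
per_row = [tag_one(tokens) for tokens in sentences]
loads_per_row = ToyTagger.loads  # one load per sentence

ToyTagger.loads = 0
batched = tag_many(sentences)
loads_batched = ToyTagger.loads  # a single load

print(loads_per_row, loads_batched)  # 3 1
```

Both approaches produce the same tags; only the number of loads differs, and that number grows with the row count in the per-row version.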
Solution 3:
Assign this to your new column instead:
dfData['POSTags'] = pos_tag_sents(dfData['SourceText'].apply(word_tokenize).tolist())