Gensim Phrases Usage To Filter N-grams
Solution 1:
Phrases has a configurable threshold parameter which adjusts the statistical cutoff for promoting word pairs into phrases. (Larger thresholds mean fewer pairs become phrases.)
You can tune that to make a greater proportion of the promoted phrases match your own ad hoc intuition about "interesting" phrases – but the class still uses a fairly crude statistical method, with no awareness of grammar or domain knowledge beyond what's in the corpus. So any value that captures all or most of the phrases you want will likely also include many uninteresting ones, and vice versa.
If you have a priori knowledge that certain word groups are important, you could preprocess the corpus yourself to combine those into single tokens, before (or instead of) the collocation-statistics-based Phrases process.
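A minimal sketch of that preprocessing step, assuming a hand-curated phrase list (the phrase table and token lists here are made up for illustration):

```python
# Known multi-word terms, mapped to their single-token replacements.
KNOWN_PHRASES = {
    ("new", "york"): "new_york",
    ("machine", "learning"): "machine_learning",
}

def merge_known_phrases(tokens, phrases=KNOWN_PHRASES):
    """Greedily replace known bigrams in a token list with single tokens."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in phrases:
            out.append(phrases[pair])
            i += 2  # skip both words of the merged pair
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_known_phrases(["i", "love", "new", "york", "and", "machine", "learning"]))
# -> ['i', 'love', 'new_york', 'and', 'machine_learning']
```

Run this over every document before feeding the corpus to any downstream model, so the known phrases are already atomic tokens.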
Solution 2:
If I understand what you're trying to do, you could compare TF-IDF scores on your corpus against TF-IDF scores from a larger, standard reference corpus (Wikipedia, for example).
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=500, min_df=0.2,
                                   stop_words='english', use_idf=True,
                                   ngram_range=(1, 2))
X = tfidf_vectorizer.fit_transform(docs)  # fit on the corpus, then transform the documents to their TF-IDF vectors
Then look only at the n-grams whose scores differ sharply between your corpus and the reference corpus; this will, of course, only work if you have a large enough number of documents.