
Gensim Phrases Usage To Filter N-grams

I am using Gensim Phrases to identify important n-grams in my text as follows:

bigram = Phrases(documents, min_count=5)
trigram = Phrases(bigram[documents], min_count=5)
for sent ...

Solution 1:

Phrases has a configurable threshold parameter which adjusts the statistical cutoff for promoting word-pairs into phrases. (Larger thresholds mean fewer pairs become phrases.)

You can adjust that to try to make a greater proportion of its promoted phrases match your own ad hoc intuition about "interesting" phrases – but this class is still using a fairly crude method, without any awareness of grammar or domain knowledge beyond what's in the corpus. So any value that gets all/most of the phrases you want will likely include many uninteresting ones, or vice-versa.
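For example, a stricter cutoff (a minimal sketch; documents is assumed to be an iterable of token lists, as in the question, and 20.0 is an arbitrary illustrative value, the default threshold being 10.0):

from gensim.models.phrases import Phrases

# A higher threshold than the default of 10.0 promotes fewer,
# more strongly associated word-pairs into phrases.
bigram = Phrases(documents, min_count=5, threshold=20.0)

# Apply the model: promoted pairs come out as single tokens
# joined by '_' (e.g. 'new_york').
phrased_docs = [bigram[doc] for doc in documents]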

If you have a priori knowledge that certain word-groups are of importance, you could preprocess the corpus yourself to combine those into single tokens, before (or instead of) the collocation-statistics-based Phrases process.
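A minimal sketch of that kind of preprocessing (the known_phrases set and the merging helper are hypothetical, not part of Gensim; documents is again assumed to be lists of tokens):

# Hand-picked word-groups to merge, independent of corpus statistics.
known_phrases = {('machine', 'learning'), ('new', 'york')}

def merge_known_phrases(tokens):
    # Replace each known adjacent word-pair with one '_'-joined token.
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in known_phrases:
            out.append(tokens[i] + '_' + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

documents = [merge_known_phrases(doc) for doc in documents]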

Solution 2:

If I understand what you're trying to do, you could compare the TF-IDF scores of n-grams in your corpus against their TF-IDF scores in a larger, standard reference corpus (Wikipedia, say).

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=500, min_df=0.2,
                                   stop_words='english', use_idf=True,
                                   ngram_range=(1, 2))
X = tfidf_vectorizer.fit_transform(docs)  # fit, then map each document to its TF-IDF vector

Look only at the n-grams whose scores differ sharply between the two corpora. This will, of course, only work if you have a large enough number of documents.
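One way to make that comparison concrete (a sketch, not the only approach: reference_docs is a hypothetical variable holding the larger background corpus, and fitting the vocabulary and IDF weights on it before scoring both corpora is an assumption on my part):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Learn vocabulary and IDF weights from the big reference corpus,
# then score both corpora with the same vectorizer.
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
ref_X = vectorizer.fit_transform(reference_docs)
our_X = vectorizer.transform(docs)

# Mean TF-IDF per n-gram in each corpus.
ref_mean = np.asarray(ref_X.mean(axis=0)).ravel()
our_mean = np.asarray(our_X.mean(axis=0)).ravel()

# N-grams scoring much higher in your corpus than in the reference
# are candidates for "interesting" phrases.
diff = our_mean - ref_mean
terms = vectorizer.get_feature_names_out()  # sklearn >= 1.0
for idx in np.argsort(diff)[::-1][:20]:
    print(terms[idx], diff[idx])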
