Nuts & Bolts of Machine Learning


Keyword Extraction

Image credit: www.adaringadventure.com

I. Candidate Selection

II. Property Calculation

III. Scoring of potential keywords and final selection

Use punctuation and stopwords as boundaries. Keyword phrases are assumed to lie between these boundaries (ideally, tokens should also be stemmed before this step). The resulting words and phrases are called candidate words/phrases.
RAKE computes a score for each candidate as the sum of the scores of the words in the phrase. Each word is scored from its frequency and the typical length of the candidate phrases in which it appears.
Rank the candidates by RAKE score and select the top-scoring ones as keywords.
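The candidate selection step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `candidate_phrases`, the tiny `STOPWORDS` set, and the tokenizing regular expressions are all assumptions made for the example, and stemming is skipped.

```python
import re

# Tiny illustrative stopword list (an assumption; real use needs a fuller list).
STOPWORDS = {"a", "an", "and", "are", "as", "for", "in", "is", "of", "on", "the", "to", "with"}

def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into candidate phrases, using punctuation and stopwords as boundaries."""
    phrases = []
    # Punctuation always ends a phrase, so split on it first.
    for fragment in re.split(r"[.,;:!?()\[\]\"\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9][a-z0-9'-]*", fragment):
            if word in stopwords:
                # A stopword closes the phrase collected so far.
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

example = candidate_phrases(
    "Keyword extraction is the task of finding important phrases in a document."
)
print(example)
```

Each candidate is kept as a list of words rather than a joined string, which makes the per-word scoring in the next step straightforward.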

I. Domain Independence

II. Good Precision

Score(word) = Degree(word) / Frequency(word)

Frequency(word) = Number of times word occurs in document

Degree(word) = Analogous to the degree of a node in a graph: the number of times the word co-occurs with other words in candidate phrases
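To make the formula concrete, here is a small sketch of the scoring and ranking steps. It assumes the candidates are already available as lists of words, and it follows the common RAKE convention that a word's degree also counts the word itself, so each phrase contributes its full length to the degree of every word in it. The function name `rake_scores` and the sample phrases are hypothetical.

```python
from collections import defaultdict

def rake_scores(phrases):
    """Score candidate phrases with Score(word) = Degree(word) / Frequency(word)."""
    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            # Assumed convention: degree counts co-occurrences including the word itself,
            # so a phrase of n words adds n to the degree of each of its words.
            degree[word] += len(phrase)
    word_score = {w: degree[w] / frequency[w] for w in frequency}
    # A phrase's score is the sum of the scores of its words.
    phrase_score = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(phrase_score.items(), key=lambda kv: kv[1], reverse=True)

sample = [
    ["keyword", "extraction"],
    ["keyword"],
    ["rapid", "automatic", "keyword", "extraction"],
]
ranked = rake_scores(sample)
for phrase, score in ranked:
    print(f"{score:6.2f}  {phrase}")
```

In this sample, "keyword" has frequency 3 and degree 2 + 1 + 4 = 7, so its score is 7/3. Words that appear mostly inside longer phrases get a higher degree relative to their frequency, which is why the four-word phrase ranks first.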