News Category Classifier
- Aunsh Arekar
- Apr 18, 2022
- 2 min read
Overview
This project uses a dataset of news articles collected over time, including their headlines, short descriptions, dates of publication and other details. Every article belongs to a category such as Politics, Sports, Art and so on.
Algorithm Used
To build this classifier, we use Naive Bayes, which is based on conditional probabilities. As a brief explanation of Naive Bayes, we have:

Naive Bayes formula
From this formula we can see that the classifier is based on the conditional probability of the hypothesis given the evidence. The "naive" assumption is that the features (here, the words) are independent of each other given the class.
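Written out, the formula in the figure above is Bayes' theorem for a hypothesis A (the category) and evidence B (the words):

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```

With the naive independence assumption, the score of a class $c$ for a document made of words $w_1, \dots, w_n$ is proportional to $P(c) \prod_{i=1}^{n} P(w_i \mid c)$; the denominator $P(B)$ is the same for every class, so it can be ignored when picking the most likely category.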
Step-by-step Implementation
Now we will look at the implementation of Naive Bayes step by step. Python libraries such as scikit-learn already provide Naive Bayes implementations, but we will not use them; instead, we build the classifier from scratch.
Dataset preview
Let's take a quick look at the data to understand its features.

As mentioned in the overview, the dataset has columns like category, headline, authors, link, short description and date.
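Loading the data might look like the sketch below. The column names follow the preview above; the inline two-row sample is a stand-in for the real file (the Kaggle News Category Dataset ships as JSON lines, so `pd.read_json(..., lines=True)` is the assumed loader).

```python
import io
import pandas as pd

# Two-row stand-in for the real JSON-lines file; in practice you would pass
# the dataset's file path to pd.read_json instead of this StringIO buffer.
sample = io.StringIO(
    '{"category": "POLITICS", "headline": "Senate Passes Bill", '
    '"authors": "A. Writer", "link": "http://example.com/1", '
    '"short_description": "The bill passed late Tuesday.", "date": "2018-05-26"}\n'
    '{"category": "SPORTS", "headline": "Team Wins Final", '
    '"authors": "B. Writer", "link": "http://example.com/2", '
    '"short_description": "A dramatic overtime victory.", "date": "2018-05-25"}\n'
)
df = pd.read_json(sample, lines=True)
print(df[["category", "headline", "short_description"]])
```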
Let's also look at the number of news articles in each category.

Unique categories and their counts
From this we can see that there are 41 unique categories. Politics has the most articles, with about 32,379 headlines, and Education has the fewest, with about 1,004.
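The counts shown in the figure are exactly what pandas' `value_counts` produces; a minimal sketch on a toy stand-in for the `category` column:

```python
import pandas as pd

# Small stand-in for the dataset's 'category' column
categories = pd.Series(
    ["POLITICS", "POLITICS", "SPORTS", "EDUCATION", "POLITICS", "SPORTS"]
)

# value_counts sorts categories from most to least frequent
counts = categories.value_counts()
print(counts)
print(f"{counts.size} unique categories")
```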
Now we are only concerned with the headline and short description columns. We first preprocess these columns by dropping non-alphabetic characters and digits and converting everything to lower case, after which we combine both columns into a new column called 'preprocess_combo'.

Preprocessed Combined column
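A minimal sketch of this preprocessing step, on a one-row toy frame (the exact cleaning regex used by the post is not shown, so this keep-letters-only version is an assumption):

```python
import re
import pandas as pd

df = pd.DataFrame({
    "headline": ["Senate Passes Bill 2022!"],
    "short_description": ["The bill passed late Tuesday."],
})

def preprocess(text: str) -> str:
    # Lower-case, keep only letters and spaces, collapse repeated whitespace
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Combine headline and short description into one preprocessed column
df["preprocess_combo"] = (df["headline"] + " " + df["short_description"]).map(preprocess)
print(df["preprocess_combo"].iloc[0])
# -> senate passes bill the bill passed late tuesday
```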
Making a vocabulary of words
Now we build a vocabulary of the words in this combined column along with their counts, dropping rare words that occur fewer than 10 times.

Vocabulary of words with count
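A sketch of the vocabulary step on three toy documents. The post's cut-off is 10 occurrences; the sample here is tiny, so a threshold of 2 is used for illustration:

```python
from collections import Counter

docs = [
    "senate passes bill",
    "team wins final",
    "senate bill fails",
]

# Count every word across all preprocessed documents
vocab = Counter()
for doc in docs:
    vocab.update(doc.split())

# Drop rare words (the post uses >= 10 occurrences; 2 fits this tiny sample)
MIN_COUNT = 2
vocab = {word: count for word, count in vocab.items() if count >= MIN_COUNT}
print(vocab)
# -> {'senate': 2, 'bill': 2}
```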
Calculating P(A) or Prior Probability
Here we calculate P(A) for each class as given in the formula. We also use 'alpha', the smoothing parameter for Laplace smoothing.

Calculating P(A)
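The prior for each class is just its share of the training documents. A sketch on toy labels, stored as log-probabilities since many small probabilities will be multiplied later (the log-space trick is an assumption; the post may work with raw probabilities):

```python
import math
from collections import Counter

# Toy stand-in for the training labels
labels = ["POLITICS", "POLITICS", "SPORTS", "EDUCATION"]

# Prior P(A) for each class = class frequency / total documents.
class_counts = Counter(labels)
total = len(labels)
log_prior = {c: math.log(n / total) for c, n in class_counts.items()}

print({c: round(math.exp(lp), 3) for c, lp in log_prior.items()})
# -> {'POLITICS': 0.5, 'SPORTS': 0.25, 'EDUCATION': 0.25}
```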
Calculating P(B|A)
Next we calculate the conditional probability P(B|A), as shown in the formula.

Calculating P(B|A)
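With Laplace smoothing, the per-class word likelihood is (count + alpha) / (total words in class + alpha * vocabulary size), so unseen words never get probability zero. A sketch on toy data:

```python
import math
from collections import Counter, defaultdict

# Toy (label, preprocessed text) pairs standing in for the training data
docs = [("politics", "senate passes bill"),
        ("politics", "senate bill fails"),
        ("sports", "team wins final")]

vocab = {"senate", "passes", "bill", "fails", "team", "wins", "final"}
ALPHA = 1.0  # Laplace smoothing parameter

# Word counts per class
word_counts = defaultdict(Counter)
for label, text in docs:
    word_counts[label].update(w for w in text.split() if w in vocab)

# P(w | c) = (count(w, c) + alpha) / (total words in c + alpha * |vocab|)
log_likelihood = {}
for label, counts in word_counts.items():
    total = sum(counts.values())
    denom = total + ALPHA * len(vocab)
    log_likelihood[label] = {
        w: math.log((counts[w] + ALPHA) / denom) for w in vocab
    }

p = math.exp(log_likelihood["politics"]["senate"])
print(round(p, 3))  # (2 + 1) / (6 + 7) = 3/13 -> 0.231
```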
Calculating the accuracy on training data
Here we make predictions and measure the accuracy on the training data set.
On the training set we get an accuracy of about 67%. We also generate a report giving the precision, recall, F1-score and support for each category.

Training set accuracy with report
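Putting the pieces together, prediction scores each class with log P(c) plus the summed log-likelihoods of the document's in-vocabulary words, and takes the argmax. A minimal end-to-end sketch on four toy documents (on the real data the post reports about 67% training accuracy; this tiny sample is trivially separable):

```python
import math
from collections import Counter, defaultdict

# Toy training set: (label, preprocessed text)
train = [("politics", "senate passes bill"),
         ("politics", "senate bill fails"),
         ("sports", "team wins final"),
         ("sports", "team wins again")]

ALPHA = 1.0
vocab = {w for _, t in train for w in t.split()}
class_counts = Counter(label for label, _ in train)
word_counts = defaultdict(Counter)
for label, text in train:
    word_counts[label].update(text.split())

def predict(text):
    # Score each class with log P(c) + sum of log P(w | c); argmax wins
    best, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / len(train))
        denom = sum(word_counts[c].values()) + ALPHA * len(vocab)
        for w in text.split():
            if w in vocab:
                score += math.log((word_counts[c][w] + ALPHA) / denom)
        if score > best_score:
            best, best_score = c, score
    return best

preds = [predict(t) for _, t in train]
accuracy = sum(p == y for p, (y, _) in zip(preds, train)) / len(train)
print(f"training accuracy: {accuracy:.0%}")
```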
Calculating the accuracy on testing data
Here we make predictions and measure the accuracy on the testing data set.
On the testing set we get an accuracy of about 56%. We also generate a report giving the precision, recall, F1-score and support for each category.

Testing set accuracy with report
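The precision/recall/F1/support numbers in the report above can be computed per class directly from the predictions. A sketch on toy labels standing in for the test-set output (how the post actually generates its report is not shown, so this hand-rolled version is an assumption):

```python
# Toy true labels and predictions standing in for the test-set output
y_true = ["politics", "sports", "politics", "education"]
y_pred = ["politics", "sports", "sports", "education"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"test accuracy: {accuracy:.0%}")  # 3 of 4 correct -> 75%

# Per-class precision, recall, F1 and support -- the same quantities
# the classification report in the figure summarises
for c in sorted(set(y_true)):
    tp = sum(t == p == c for t, p in zip(y_true, y_pred))
    fp = sum(p == c and t != c for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = sum(t == c for t in y_true)
    print(f"{c:>10}  precision={precision:.2f}  recall={recall:.2f}  "
          f"f1={f1:.2f}  support={support}")
```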
References
Source Code