News Category Classifier
- Aunsh Arekar
- Apr 18, 2022
- 2 min read
Overview
This project uses a dataset of news articles collected over time, including their headlines, short descriptions, dates of publication and other details. Every article belongs to a category such as Politics, Sports, Art and so on.
Algorithm Used
To build this classifier, we use Naive Bayes, which is based on conditional probabilities. As a brief explanation of Naive Bayes, we have:

Naive Bayes formula
From this formula we can see that the classifier is based on the conditional probability of the hypothesis given the evidence. The "naive" assumption is that the features (here, the words) are independent of each other given the class.
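Written out, the formula in the figure above is Bayes' theorem for a hypothesis A (the category) and evidence B (the words):

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```

With the naive independence assumption, the score of a class $c$ for a document made of words $w_1, \dots, w_n$ is proportional to $P(c) \prod_{i=1}^{n} P(w_i \mid c)$; the denominator $P(B)$ is the same for every class, so it can be ignored when picking the most likely category.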
Step-by-step Implementation
Now we will look at the implementation of Naive Bayes step by step. Python libraries such as scikit-learn already provide Naive Bayes implementations, but we will not use them; instead, we build the classifier from scratch.
Dataset preview
Let's take a quick look at the data to understand its features.

As mentioned in the overview, the dataset has columns like category, headline, authors, link, short description and date.
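Loading the data might look like the sketch below. The column names follow the preview above; the inline two-row sample is a stand-in for the real file (the Kaggle News Category Dataset ships as JSON lines, so `pd.read_json(..., lines=True)` is the assumed loader).

```python
import io
import pandas as pd

# Two-row stand-in for the real JSON-lines file; in practice you would pass
# the dataset's file path to pd.read_json instead of this StringIO buffer.
sample = io.StringIO(
    '{"category": "POLITICS", "headline": "Senate Passes Bill", '
    '"authors": "A. Writer", "link": "http://example.com/1", '
    '"short_description": "The bill passed late Tuesday.", "date": "2018-05-26"}\n'
    '{"category": "SPORTS", "headline": "Team Wins Final", '
    '"authors": "B. Writer", "link": "http://example.com/2", '
    '"short_description": "A dramatic overtime victory.", "date": "2018-05-25"}\n'
)
df = pd.read_json(sample, lines=True)
print(df[["category", "headline", "short_description"]])
```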
Let's also look at the number of news articles in each category.

Unique categories and their counts
From this we can see that there are 41 unique categories. Politics has the most articles, with about 32,379 headlines, and Education has the fewest, with about 1,004.
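The counts shown in the figure are exactly what pandas' `value_counts` produces; a minimal sketch on a toy stand-in for the `category` column:

```python
import pandas as pd

# Small stand-in for the dataset's 'category' column
categories = pd.Series(
    ["POLITICS", "POLITICS", "SPORTS", "EDUCATION", "POLITICS", "SPORTS"]
)

# value_counts sorts categories from most to least frequent
counts = categories.value_counts()
print(counts)
print(f"{counts.size} unique categories")
```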
Now we are only concerned with the headline and short description columns. We first preprocess these columns by dropping non-alphabetic characters and digits and converting everything to lower case, after which we combine both columns into a new column called 'preprocess_combo'.

Preprocessed Combined column
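A minimal sketch of this preprocessing step, on a one-row toy frame (the exact cleaning regex used by the post is not shown, so this keep-letters-only version is an assumption):

```python
import re
import pandas as pd

df = pd.DataFrame({
    "headline": ["Senate Passes Bill 2022!"],
    "short_description": ["The bill passed late Tuesday."],
})

def preprocess(text: str) -> str:
    # Lower-case, keep only letters and spaces, collapse repeated whitespace
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Combine headline and short description into one preprocessed column
df["preprocess_combo"] = (df["headline"] + " " + df["short_description"]).map(preprocess)
print(df["preprocess_combo"].iloc[0])
# -> senate passes bill the bill passed late tuesday
```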
Making a vocabulary of words
Now we build a vocabulary of the words in this combined column along with their counts, dropping rare words that occur fewer than 10 times.

Vocabulary of words with count
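A sketch of the vocabulary step on three toy documents. The post's cut-off is 10 occurrences; the sample here is tiny, so a threshold of 2 is used for illustration:

```python
from collections import Counter

docs = [
    "senate passes bill",
    "team wins final",
    "senate bill fails",
]

# Count every word across all preprocessed documents
vocab = Counter()
for doc in docs:
    vocab.update(doc.split())

# Drop rare words (the post uses >= 10 occurrences; 2 fits this tiny sample)
MIN_COUNT = 2
vocab = {word: count for word, count in vocab.items() if count >= MIN_COUNT}
print(vocab)
# -> {'senate': 2, 'bill': 2}
```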
Calculating P(A) or Prior Probability
Here we calculate P(A) for each class as given in the formula. We also use 'alpha', the smoothing parameter for Laplace smoothing.

Calculating P(A)
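The prior for each class is just its share of the training documents. A sketch on toy labels, stored as log-probabilities since many small probabilities will be multiplied later (the log-space trick is an assumption; the post may work with raw probabilities):

```python
import math
from collections import Counter

# Toy stand-in for the training labels
labels = ["POLITICS", "POLITICS", "SPORTS", "EDUCATION"]

# Prior P(A) for each class = class frequency / total documents.
class_counts = Counter(labels)
total = len(labels)
log_prior = {c: math.log(n / total) for c, n in class_counts.items()}

print({c: round(math.exp(lp), 3) for c, lp in log_prior.items()})
# -> {'POLITICS': 0.5, 'SPORTS': 0.25, 'EDUCATION': 0.25}
```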
Calculating P(B|A)
Next we calculate the conditional probability P(B|A), as shown in the formula.

Calculating P(B|A)
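With Laplace smoothing, the per-class word likelihood is (count + alpha) / (total words in class + alpha * vocabulary size), so unseen words never get probability zero. A sketch on toy data:

```python
import math
from collections import Counter, defaultdict

# Toy (label, preprocessed text) pairs standing in for the training data
docs = [("politics", "senate passes bill"),
        ("politics", "senate bill fails"),
        ("sports", "team wins final")]

vocab = {"senate", "passes", "bill", "fails", "team", "wins", "final"}
ALPHA = 1.0  # Laplace smoothing parameter

# Word counts per class
word_counts = defaultdict(Counter)
for label, text in docs:
    word_counts[label].update(w for w in text.split() if w in vocab)

# P(w | c) = (count(w, c) + alpha) / (total words in c + alpha * |vocab|)
log_likelihood = {}
for label, counts in word_counts.items():
    total = sum(counts.values())
    denom = total + ALPHA * len(vocab)
    log_likelihood[label] = {
        w: math.log((counts[w] + ALPHA) / denom) for w in vocab
    }

p = math.exp(log_likelihood["politics"]["senate"])
print(round(p, 3))  # (2 + 1) / (6 + 7) = 3/13 -> 0.231
```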
Calculating the accuracy on training data
Here we make predictions and measure the accuracy on the training data set.
On the training set we get an accuracy of about 67%. We also generate a report giving the precision, recall, F1-score and support for each category.

Training set accuracy with report
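Putting the pieces together, prediction scores each class with log P(c) plus the summed log-likelihoods of the document's in-vocabulary words, and takes the argmax. A minimal end-to-end sketch on four toy documents (on the real data the post reports about 67% training accuracy; this tiny sample is trivially separable):

```python
import math
from collections import Counter, defaultdict

# Toy training set: (label, preprocessed text)
train = [("politics", "senate passes bill"),
         ("politics", "senate bill fails"),
         ("sports", "team wins final"),
         ("sports", "team wins again")]

ALPHA = 1.0
vocab = {w for _, t in train for w in t.split()}
class_counts = Counter(label for label, _ in train)
word_counts = defaultdict(Counter)
for label, text in train:
    word_counts[label].update(text.split())

def predict(text):
    # Score each class with log P(c) + sum of log P(w | c); argmax wins
    best, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / len(train))
        denom = sum(word_counts[c].values()) + ALPHA * len(vocab)
        for w in text.split():
            if w in vocab:
                score += math.log((word_counts[c][w] + ALPHA) / denom)
        if score > best_score:
            best, best_score = c, score
    return best

preds = [predict(t) for _, t in train]
accuracy = sum(p == y for p, (y, _) in zip(preds, train)) / len(train)
print(f"training accuracy: {accuracy:.0%}")
```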
Calculating the accuracy on testing data
Here we make predictions and measure the accuracy on the testing data set.
On the testing set we get an accuracy of about 56%. We also generate a report giving the precision, recall, F1-score and support for each category.

Testing set accuracy with report
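The precision/recall/F1/support numbers in the report above can be computed per class directly from the predictions. A sketch on toy labels standing in for the test-set output (how the post actually generates its report is not shown, so this hand-rolled version is an assumption):

```python
# Toy true labels and predictions standing in for the test-set output
y_true = ["politics", "sports", "politics", "education"]
y_pred = ["politics", "sports", "sports", "education"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"test accuracy: {accuracy:.0%}")  # 3 of 4 correct -> 75%

# Per-class precision, recall, F1 and support -- the same quantities
# the classification report in the figure summarises
for c in sorted(set(y_true)):
    tp = sum(t == p == c for t, p in zip(y_true, y_pred))
    fp = sum(p == c and t != c for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = sum(t == c for t in y_true)
    print(f"{c:>10}  precision={precision:.2f}  recall={recall:.2f}  "
          f"f1={f1:.2f}  support={support}")
```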
References
Source Code