
News Category Classifier

  • Writer: Aunsh Arekar
  • Apr 18, 2022
  • 2 min read

Overview


This project uses a dataset of news articles published over a period of time, along with their headlines, short descriptions, publication dates, and other details. Each article belongs to a category such as Politics, Sports, Art, and so on.


Algorithm Used


For building this classifier, we use the Naive Bayes classifier, which is based on conditional probabilities. To give a brief explanation of Naive Bayes, consider the following formula.



Naive Bayes formula
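
For reference, Bayes' theorem can be written as:

P(A|B) = P(B|A) * P(A) / P(B)

Here A is the hypothesis (the news category) and B is the evidence (the words of the article). Under the naive independence assumption, the score for a document with words w1, ..., wn becomes:

P(category | w1, ..., wn) ∝ P(category) * P(w1 | category) * ... * P(wn | category)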


From this formula we can see that the classifier is based on the conditional probability of the hypothesis given the evidence. The "naive" part is the assumption that the individual pieces of evidence (here, the words) are independent of each other given the class.


Step-by-step Implementation


Now we will look at the implementation of Naive Bayes step by step. Python already has library implementations of Naive Bayes, but we will not use them here; instead, we will build the classifier from scratch.


Dataset preview


Let's take a quick look at the data to understand its features.


As mentioned in the overview, the dataset has columns like category, headline, authors, link, short description and date.
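
As a rough sketch, and assuming the data is the JSON-lines file from the Kaggle News Category Dataset (the exact filename and column names are assumptions here), loading and previewing it with pandas might look like this:

import pandas as pd

# Load the news dataset. The filename is an assumption; the Kaggle
# "News Category Dataset" ships as a JSON-lines file, one article per line.
df = pd.read_json('News_Category_Dataset_v2.json', lines=True)

# Preview the first few rows and the available columns
print(df.head())
print(df.columns.tolist())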


Let's also take a look at the number of news articles in each category.


Unique categories and their counts

From this we can see that there are 41 unique categories. Politics has the highest number of articles, with about 32,379 headlines, and Education has the lowest, with about 1,004.
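
Assuming the dataframe from the sketch above, these counts can be obtained with pandas:

# Number of articles per category
category_counts = df['category'].value_counts()

print(len(category_counts))    # number of unique categories (41 in this dataset)
print(category_counts.head())  # most frequent categories, with Politics at the top
print(category_counts.tail())  # least frequent categories, with Education at the bottom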


Now we are only concerned with the headline and short description columns. We first preprocess these columns by removing special characters and digits and converting everything to lower case, after which we combine both columns into a new column called 'preprocess_combo'.
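
A minimal sketch of this preprocessing step, assuming the column names 'headline' and 'short_description' from the dataset:

import re

def preprocess(text):
    # Lower-case the text and keep only letters and spaces,
    # dropping digits and special characters
    text = str(text).lower()
    text = re.sub(r'[^a-z\s]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

# Combine the headline and short description into one preprocessed column
df['preprocess_combo'] = (df['headline'] + ' ' + df['short_description']).apply(preprocess)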


Preprocessed Combined column


Making a vocabulary of words

Now we make a vocabulary of the words in this combined column with their respective counts. We also drop rare words, such as those that occur fewer than 10 times.
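
One way to build this vocabulary is with collections.Counter (a sketch, not necessarily the author's exact code):

from collections import Counter

# Count every word across the combined, preprocessed text
word_counts = Counter()
for text in df['preprocess_combo']:
    word_counts.update(text.split())

# Drop rare words: keep only those occurring at least 10 times
vocabulary = {word: count for word, count in word_counts.items() if count >= 10}
print(len(vocabulary))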


Vocabulary of words with count


Calculating P(A) or Prior Probability

Here we calculate P(A), the prior probability of each class, as given in the formula. We also use 'alpha', the smoothing parameter for Laplace smoothing.
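
In code, the prior of a class is simply its share of the training documents. Whether alpha is applied to the priors as well as to the likelihoods is an assumption in this sketch:

alpha = 1.0  # Laplace smoothing parameter

classes = df['category'].unique()
class_counts = df['category'].value_counts()
total_docs = len(df)

# P(A): smoothed prior probability of each class
priors = {c: (class_counts[c] + alpha) / (total_docs + alpha * len(classes)) for c in classes}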


Calculating P(A)


Calculating P(B|A)

Next we calculate the conditional probability P(B|A), i.e. the probability of each word given a class, as shown in the formula.
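
A sketch of the likelihood computation with Laplace smoothing, reusing the vocabulary and alpha from the previous steps:

# Word counts per class, restricted to words in the vocabulary
word_counts_per_class = {c: Counter() for c in classes}
for text, c in zip(df['preprocess_combo'], df['category']):
    word_counts_per_class[c].update(w for w in text.split() if w in vocabulary)

total_words_per_class = {c: sum(word_counts_per_class[c].values()) for c in classes}
vocab_size = len(vocabulary)

def word_likelihood(word, c):
    # Laplace-smoothed P(word | class):
    # (count of word in class + alpha) / (total words in class + alpha * vocabulary size)
    return (word_counts_per_class[c][word] + alpha) / (total_words_per_class[c] + alpha * vocab_size)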


Calculating P(B|A)


Calculating the accuracy on training data

Here we make the predictions and find the accuracy on the training data set.

On the training set we get an accuracy of about 67%. We also generate a classification report giving the precision, recall, F1-score, and support for each class.
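
Prediction picks the class that maximises log P(A) plus the sum of log P(word|A) over the words of a document; working in log space avoids numerical underflow. Here is a sketch, assuming the data has been split into train_df and test_df (the split, and fitting the counts on only the training portion, are not shown above), with scikit-learn used only for the metrics:

import numpy as np
from sklearn.metrics import accuracy_score, classification_report

def predict(preprocessed_text):
    words = [w for w in preprocessed_text.split() if w in vocabulary]
    best_class, best_score = None, float('-inf')
    for c in classes:
        # Log prior plus sum of log likelihoods for the words in the document
        score = np.log(priors[c]) + sum(np.log(word_likelihood(w, c)) for w in words)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

train_preds = [predict(t) for t in train_df['preprocess_combo']]
print(accuracy_score(train_df['category'], train_preds))          # roughly 67% here
print(classification_report(train_df['category'], train_preds))   # precision, recall, F1, support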


Training set accuracy with report



Calculating the accuracy on testing data

Here we make the predictions and find the accuracy on the testing data set.

On the testing set we get an accuracy of about 56%. We also generate a classification report giving the precision, recall, F1-score, and support for each class.
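
The same scoring applied to the held-out split (again assuming a test_df with the same preprocessing):

test_preds = [predict(t) for t in test_df['preprocess_combo']]
print(accuracy_score(test_df['category'], test_preds))            # roughly 56% here
print(classification_report(test_df['category'], test_preds))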


Testing set accuracy with report



