Fake Job Posting Detection
- Aunsh Arekar
- Apr 29, 2022
- 3 min read
Updated: Apr 29, 2022
Overview
Here we build a system that can tell real job postings apart from fake ones. Job aspirants often apply to a job after seeing its posting on a website, only for the posting to turn out to be fake or fraudulent, and as a result they end up getting scammed. To avoid this, we are developing a system that can overcome this problem and thus benefit job aspirants.
Algorithm Used
To build this system we will use the Naive Bayes classifier, which works on the concept of conditional probabilities of events. The Naive Bayes classifier is based on the formula given in fig(a) below:

fig(a): Naive Bayes formula
As we can see, the terms P(class|data) and P(data|class) are conditional probabilities: the probability of 'class' given 'data', and the probability of 'data' given 'class' respectively.
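Written out in plain notation, the formula shown in fig(a) is Bayes' theorem:

P(class | data) = P(data | class) × P(class) / P(data)

where P(class) is the prior probability of the class and P(data) is the probability of the data. The "naive" part of the name comes from assuming the individual features (here, the words in a posting) are independent of one another given the class.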
Now that we have a basic understanding of the algorithm's concept and the formula it is based on, we can move on to its actual implementation.
Step-by-step Implementation
First, let's take a look at the dataset.

fig(b): Dataset
From this dataset we can see that it has columns with various details of each job posting, such as the title, location, department, and requirements, as well as a column indicating whether the posting is real or fraudulent, represented by 0 (real) or 1 (fraudulent).
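As a minimal sketch, loading and inspecting this dataset with pandas might look like the following (the filename fake_job_postings.csv and the exact column names are assumptions based on fig(b)):

```python
import pandas as pd

# Load the job postings dataset (filename is an assumption; adjust to your copy).
df = pd.read_csv("fake_job_postings.csv")

# Inspect the available columns and the real/fake label distribution.
print(df.columns.tolist())
print(df["fraudulent"].value_counts())  # 0 = real, 1 = fraudulent
```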
Combining all text data in one column
To get started, we first combine all the columns containing text into a single column called 'combo text', and also create another column called 'combo text length' holding the length of the text in this new column, as shown below:

fig(c): combined text column with length
We also drop all the columns that we just combined, since they are no longer relevant to us.
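A rough sketch of this step, assuming the text columns visible in fig(b), could be:

```python
# Text columns to merge (names are assumptions based on the dataset figure).
text_cols = ["title", "location", "department", "company_profile",
             "description", "requirements", "benefits"]

# Replace missing values so string concatenation works row-wise.
df[text_cols] = df[text_cols].fillna("")

# Combine everything into one 'combo text' column and record its length.
df["combo text"] = df[text_cols].agg(" ".join, axis=1)
df["combo text length"] = df["combo text"].str.len()

# Drop the original text columns, which are no longer needed.
df = df.drop(columns=text_cols)
```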
Cleaning the data
Next, we write a function to clean the combined text column for easier handling and mathematical processing. This function splits each line into individual words, converts them to lower case, and removes stopwords, special characters, and alphanumeric characters. This is shown in fig(d):

fig(d): cleaned data column
As we can see, the combo text column is cleaned using the text-cleaning function we have written.
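A possible version of such a cleaning function, using NLTK's English stopword list (and treating "alphanumeric characters" as tokens containing digits), is sketched below:

```python
import re
from nltk.corpus import stopwords
# nltk.download("stopwords")  # run once if the stopword list is missing

stop_words = set(stopwords.words("english"))

def clean_text(text):
    """Lower-case the text, strip non-letter characters, drop stopwords."""
    cleaned = []
    for word in text.lower().split():
        word = re.sub(r"[^a-z]", "", word)  # remove special chars and digits
        if word and word not in stop_words:
            cleaned.append(word)
    return " ".join(cleaned)

df["combo text"] = df["combo text"].apply(clean_text)
```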
We also do a few things for data visualization, such as generating word clouds for the real job postings and the fake ones, as shown below in fig(e) and fig(f):

fig(e): Real Jobs Word Cloud

fig(f): Fake Jobs Word Cloud
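These word clouds can be generated with the wordcloud library; a minimal sketch:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# One word cloud per class (0 = real, 1 = fraudulent).
for label, name in [(0, "Real Jobs"), (1, "Fake Jobs")]:
    text = " ".join(df.loc[df["fraudulent"] == label, "combo text"])
    cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(name + " Word Cloud")
    plt.show()
```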
Train Test Split
Now we perform the train-test split as usual. We take the combined text column as X and the target 'fraudulent' column as y.

fig(g): Train Test Split
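A sketch of this split with sklearn (the 80/20 ratio and random seed are assumptions, not taken from the figure):

```python
from sklearn.model_selection import train_test_split

X = df["combo text"]
y = df["fraudulent"]

# Hold out 20% of the postings for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```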
TF-IDF
Next we perform TF-IDF vectorization. TF-IDF stands for Term Frequency-Inverse Document Frequency, a measure used to quantify how important or relevant a term is within a larger collection of documents. We do this for both the train and test datasets, as shown below:

fig(h): TF-IDF
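With sklearn, this step might look as follows. Note that the vectorizer is fitted on the training text only, so the test set is transformed using the training vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training set...
X_train_tfidf = vectorizer.fit_transform(X_train)
# ...then apply the same transformation to the test set.
X_test_tfidf = vectorizer.transform(X_test)
```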
Applying Naive Bayes
Finally, we apply Naive Bayes to our training dataset. We use the naive_bayes module from sklearn, specifically its MultinomialNB class, which performs the actual Naive Bayes computation and produces the predictions. This is shown below:

fig(i): Applying Naive Bayes
From fig(i) we can see that after applying Naive Bayes, we get an accuracy of about 95%.
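A minimal sketch of this fit-and-predict step with sklearn's MultinomialNB:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Fit Multinomial Naive Bayes on the TF-IDF features of the training set.
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

# Predict on the held-out test set and measure accuracy.
y_pred = model.predict(X_test_tfidf)
print("Accuracy:", accuracy_score(y_test, y_pred))
```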
For further evaluation of the model, we also generate the classification report, which gives us values such as precision, recall, F1-score, and support. This is shown below:

fig(j): Classification report
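The report itself comes from sklearn's classification_report helper:

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, F1-score, and support.
print(classification_report(y_test, y_pred))
```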
Contribution
My main contribution came in the data pre-processing stage, particularly in cleaning the data, since that is what improved the accuracy. The data cleaning was made more refined by adding more steps, such as the removal of stopwords, which made the data more precise for our purpose.
References
- Source Code
- YouTube video: https://youtu.be/YxwU2zOSmyI