Fake Job Posting Detection
- Aunsh Arekar
- Apr 29, 2022
- 3 min read
Updated: Apr 29, 2022
Overview
Here we build a system that can tell real job postings apart from fake ones. Job aspirants often apply to a job after seeing its posting on a website, only for the posting to turn out to be fake or fraudulent, and as a result they end up getting scammed. To avoid this, we are developing a system that can overcome this problem and thus benefit job aspirants.
Algorithm Used
To build this system we will use the Naive Bayes classifier, which works on the concept of conditional probabilities of events. The Naive Bayes classifier is based on the formula given in fig(a) below:

fig(a): Naive Bayes formula
As we can see, the terms P(class|data) and P(data|class) are conditional probabilities: the probability of 'class' given 'data', and the probability of 'data' given 'class' respectively.
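Written out in plain notation, the formula shown in fig(a) is Bayes' theorem:

P(class | data) = P(data | class) × P(class) / P(data)

where P(class) is the prior probability of the class and P(data) is the probability of the data. The "naive" part of the name comes from assuming the individual features (here, the words in a posting) are independent of one another given the class.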
Now that we have a basic understanding of the algorithm's concept and the formula it is based on, we can move on to its actual implementation.
Step-by-step Implementation
First, let's take a look at the dataset.

fig(b): Dataset
From this dataset we can see that it has columns with various details of each job posting, such as the title, location, department, and requirements, as well as a column indicating whether the posting is real or fraudulent, represented by 0 (real) or 1 (fraudulent).
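As a minimal sketch, loading and inspecting this dataset with pandas might look like the following (the filename fake_job_postings.csv and the exact column names are assumptions based on fig(b)):

```python
import pandas as pd

# Load the job postings dataset (filename is an assumption; adjust to your copy).
df = pd.read_csv("fake_job_postings.csv")

# Inspect the available columns and the real/fake label distribution.
print(df.columns.tolist())
print(df["fraudulent"].value_counts())  # 0 = real, 1 = fraudulent
```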
Combining all text data in one column
To get started, we first combine all the columns containing text into a single column called 'combo text', and also create another column called 'combo text length' holding the length of the text in this new column, as shown below:

fig(c): combined text column with length
We also drop all the columns that we just combined, since they are no longer relevant to us.
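A rough sketch of this step, assuming the text columns visible in fig(b), could be:

```python
# Text columns to merge (names are assumptions based on the dataset figure).
text_cols = ["title", "location", "department", "company_profile",
             "description", "requirements", "benefits"]

# Replace missing values so string concatenation works row-wise.
df[text_cols] = df[text_cols].fillna("")

# Combine everything into one 'combo text' column and record its length.
df["combo text"] = df[text_cols].agg(" ".join, axis=1)
df["combo text length"] = df["combo text"].str.len()

# Drop the original text columns, which are no longer needed.
df = df.drop(columns=text_cols)
```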
Cleaning the data
Next, we write a function to clean the combined text column for easier handling and mathematical processing. This function splits each line into individual words, converts them to lower case, and removes stopwords, special characters, and alphanumeric characters. This is shown in fig(d):

fig(d): cleaned data column
As we can see, the combo text column is cleaned using the text-cleaning function we have written.
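A possible version of such a cleaning function, using NLTK's English stopword list (and treating "alphanumeric characters" as tokens containing digits), is sketched below:

```python
import re
from nltk.corpus import stopwords
# nltk.download("stopwords")  # run once if the stopword list is missing

stop_words = set(stopwords.words("english"))

def clean_text(text):
    """Lower-case the text, strip non-letter characters, drop stopwords."""
    cleaned = []
    for word in text.lower().split():
        word = re.sub(r"[^a-z]", "", word)  # remove special chars and digits
        if word and word not in stop_words:
            cleaned.append(word)
    return " ".join(cleaned)

df["combo text"] = df["combo text"].apply(clean_text)
```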
We also do a few things for data visualization, such as generating word clouds for the real job postings and the fake ones, as shown below in fig(e) and fig(f):

fig(e): Real Jobs Word Cloud

fig(f): Fake Jobs Word Cloud
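These word clouds can be generated with the wordcloud library; a minimal sketch:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# One word cloud per class (0 = real, 1 = fraudulent).
for label, name in [(0, "Real Jobs"), (1, "Fake Jobs")]:
    text = " ".join(df.loc[df["fraudulent"] == label, "combo text"])
    cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(name + " Word Cloud")
    plt.show()
```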
Train Test Split
Now we perform the train-test split as usual. We take the combined text column as X and the target 'fraudulent' column as y.

fig(g): Train Test Split
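A sketch of this split with sklearn (the 80/20 ratio and random seed are assumptions, not taken from the figure):

```python
from sklearn.model_selection import train_test_split

X = df["combo text"]
y = df["fraudulent"]

# Hold out 20% of the postings for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```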
TF-IDF
Next we perform TF-IDF vectorization. TF-IDF stands for Term Frequency-Inverse Document Frequency, a measure used to quantify how important or relevant a term is within a larger collection of documents. We do this for both the train and test datasets, as shown below:

fig(h): TF-IDF
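With sklearn, this step might look as follows. Note that the vectorizer is fitted on the training text only, so the test set is transformed using the training vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training set...
X_train_tfidf = vectorizer.fit_transform(X_train)
# ...then apply the same transformation to the test set.
X_test_tfidf = vectorizer.transform(X_test)
```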
Applying Naive Bayes
Finally, we apply Naive Bayes to our training dataset. We use the naive_bayes module from sklearn, specifically its MultinomialNB class, which performs the actual Naive Bayes computation and produces the predictions. This is shown below:

fig(i): Applying Naive Bayes
From fig(i) we can see that after applying Naive Bayes, we get an accuracy of about 95%.
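A minimal sketch of this fit-and-predict step with sklearn's MultinomialNB:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Fit Multinomial Naive Bayes on the TF-IDF features of the training set.
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

# Predict on the held-out test set and measure accuracy.
y_pred = model.predict(X_test_tfidf)
print("Accuracy:", accuracy_score(y_test, y_pred))
```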
For further evaluation of the model, we also generate the classification report, which gives us values such as precision, recall, F1-score, and support. This is shown below:

fig(j): Classification report
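The report itself comes from sklearn's classification_report helper:

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, F1-score, and support.
print(classification_report(y_test, y_pred))
```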
Contribution
My main contribution came in the data pre-processing stage, particularly in cleaning the data, since that is what improved the accuracy. The data cleaning was made more refined by adding more steps, such as the removal of stopwords, which made the data more precise for our purpose.
References
- Source Code
- YouTube video: https://youtu.be/YxwU2zOSmyI