Disclaimer: In this post, I share what I have learned so far; it does not mean that my code is best practice. I am still in the process of learning.
The next step in this project is to clean the data that was scraped from the website. Before any analysis, it is necessary to clean your data: correct any issues, delete duplicates, look for missing values, and get rid of the data you do not need. These mistakes might lead any algorithm to produce misleading outcomes.
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. - Tableau
The whole process of data cleaning is not strictly defined; it varies based on the structure and type of the dataset. You might put extra effort into cleaning when some values are missing. In this case, the web scraping process gathered all the data needed.
Even though the process is not strictly defined, there are some common steps that need to be done to clean the data properly:
Deleting duplicate values/rows or unnecessary data,
Missing data or error correction,
Getting rid of all outliers that might cause the algorithm to provide misleading outcomes,
Verifying and questioning data.
Removing duplicate values or unnecessary data:
The current digital world is full of raw data, and most of the time the dataset you get is pretty much raw and unpolished. That data might be duplicated in some cases, or it might contain records that are not valid for the analyzed case. Getting rid of unnecessary data might not seem like the most important part of this task. In most cases, companies combine data from different sources to enrich it and get more useful insights.
However, this opens up a high possibility of duplicating the data, which is being gathered on a daily basis (in the case of some companies). This risk very much applies to this project as well, because I gathered the data with daily web scraping. Most job postings stay published for a long period of time, ranging from one to two months (in some cases even longer, due to a lack of labor force, etc.).
In this case, I used two techniques to get rid of such data. Firstly, I used keywords in the job postings to filter out unwanted postings, and then I dropped the data using the pandas drop function.
pandas.DataFrame.drop: Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level. See the user guide for more information about the now unused levels. - pandas
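As an illustration, here is a minimal sketch of how such keyword filtering combined with drop could look; the DataFrame, column name, and keywords are made up for this example and are not the exact ones from the project.

```python
import pandas as pd

# Illustrative data only; the column name and keywords are assumptions
jobs = pd.DataFrame({
    "title": ["Data Scientist", "Sales Manager", "Data Analyst", "Truck Driver"],
    "company": ["A", "B", "C", "D"],
})

# Keywords marking postings that are not relevant to the analysis
unwanted_keywords = ["Sales", "Driver"]

# Build a mask of rows whose title contains any unwanted keyword...
mask = jobs["title"].str.contains("|".join(unwanted_keywords), case=False)

# ...and remove those rows with pandas.DataFrame.drop
jobs = jobs.drop(index=jobs[mask].index)
print(jobs)
```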
Secondly, I used drop_duplicates to get rid of all the job postings that had been scraped multiple times because of how long they stay published on LinkedIn. I wanted to ensure that the data is not affected by any job posting that has been web scraped more than once.
pandas.DataFrame.drop_duplicates: Return DataFrame with duplicate rows removed. Considering certain columns is optional. Indexes, including time indexes, are ignored. - pandas
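A minimal sketch of how drop_duplicates can handle this, assuming a posting is identified by its title and company (the column names are illustrative, not the ones from the project):

```python
import pandas as pd

# Illustrative data: the same posting scraped on two different days
jobs = pd.DataFrame({
    "title": ["Data Scientist", "Data Scientist", "Data Analyst"],
    "company": ["A", "A", "B"],
    "scraped_on": ["2021-05-01", "2021-05-02", "2021-05-01"],
})

# Keep only the first occurrence of each posting; the scrape date is ignored
# because only the columns identifying a unique posting are listed in `subset`
jobs = jobs.drop_duplicates(subset=["title", "company"], keep="first")
print(jobs)
```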
Missing data or error correction:
After the data exploration, you might find errors in the dataset in the form of typos, unwanted naming, or incorrect capitalization. If you want to categorize the data properly, the naming should be the same for every record in the same category, so it can be very useful to rename/relabel it. In this case, I renamed all the job postings accordingly, since most recruiters tend to make job titles more attractive.
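A small sketch of this kind of renaming/relabeling; the titles and the mapping below are hypothetical and only show the idea:

```python
import pandas as pd

# Illustrative titles; recruiters often dress the same role up in different ways
jobs = pd.DataFrame({
    "title": ["Data Science Ninja", "Sr. Data Scientist", "data scientist"],
})

# Hypothetical mapping from "attractive" titles to one canonical label
title_map = {
    "Data Science Ninja": "Data Scientist",
    "Sr. Data Scientist": "Senior Data Scientist",
}

# Fix the capitalization first, then apply the explicit relabeling
jobs["title"] = jobs["title"].str.title().replace(title_map)
print(jobs)
```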
Also, in most cases you will work with a dataset that is partially incomplete. During the exploration of the data, and based on your thesis, you can decide whether to drop the missing data or replace it. Generally, it is not a good idea to run algorithms on data with missing values, so you should consider one of those two options.
There are risks associated with both options, so be very mindful. If you drop all the rows with missing data, you risk losing useful information, so make sure the dropped data is not relevant to the analyzed case.
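For the dropping option, a minimal sketch using pandas dropna; the column names are illustrative, and the point is to drop only rows missing a value the analysis actually needs:

```python
import numpy as np
import pandas as pd

# Illustrative frame with a missing salary and a missing location
jobs = pd.DataFrame({
    "title": ["Data Scientist", "Data Analyst", "ML Engineer"],
    "salary": [60000, np.nan, 55000],
    "location": ["Prague", "Brno", np.nan],
})

# Drop only rows missing a value the analysis actually needs,
# instead of every row that has any NaN at all
jobs = jobs.dropna(subset=["salary"])
print(jobs)
```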
Another option might be filling in the missing data. Usually, the values are filled in based on their similarity to other data. In some cases you can fill the missing data with a negative value, label it as "nan", fill it with the mean or median, use a group-by to fill it in, or use machine learning to fill it. Be mindful when using this technique, because you might distort the data and bias the results.
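A sketch of two of those filling options, the overall median and a group-by median, on an illustrative salary column (not the real project data):

```python
import numpy as np
import pandas as pd

# Illustrative salaries with gaps
jobs = pd.DataFrame({
    "title": ["Data Scientist", "Data Scientist", "Data Analyst", "Data Analyst"],
    "salary": [60000, np.nan, 45000, np.nan],
})

# Option 1: fill with the overall median of the column
jobs["salary_overall"] = jobs["salary"].fillna(jobs["salary"].median())

# Option 2: fill with the median of similar rows (grouped by job title)
jobs["salary_by_title"] = jobs["salary"].fillna(
    jobs.groupby("title")["salary"].transform("median")
)
print(jobs)
```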
Getting rid of all outliers that might cause the algorithm to provide misleading outcomes:
During the exploratory data analysis you can spot some outliers. Outliers are defined as data points that lie an abnormal distance from the rest of the data.
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal. Before abnormal observations can be singled out, it is necessary to characterize normal observations. - NIST
As you can see in this graph, there are a few values in the dataset that lie an abnormal distance from the rest of the data in the population. In this case, it might be relevant to get rid of the outliers. Deleting them can improve the quality of the dataset you work with and also the results of the algorithm you run on it.
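One common way to single out such values is the interquartile range (IQR) rule; this is only a sketch on made-up salary data, not necessarily the exact method used in the project:

```python
import pandas as pd

# Illustrative salary column with one value far away from the rest
jobs = pd.DataFrame({"salary": [42000, 45000, 47000, 50000, 52000, 250000]})

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers
q1, q3 = jobs["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows inside the "normal" range
jobs = jobs[jobs["salary"].between(lower, upper)]
print(jobs)
```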
Verifying and questioning data:
Last but not least, you should always verify and question the data. The data you use should make sense; if it does not, you might draw wrong conclusions from the following analysis or from business decisions based on it. If the data is not insightful, does not support your thesis, or does not show any trends, you probably have the wrong set of data.
If you keep verifying and questioning the data, you can save yourself a lot of trouble and headaches and, most importantly, embarrassing situations during team meetings.
I am sharing a link to my GitHub here in case you want to see my code.
Hopefully, it can also help others who want to learn more about Data Science.
Matěj Srna