Web Scraping for the job posting analysis

srnamatej
May 25, 2022
3 min read

Updated: Sep 28, 2023

Disclaimer: In this post, I share the knowledge I have acquired; I do not claim that my code is best practice. I am still in the process of learning.

When I was deciding which project to take on to improve my skills and knowledge, I realized that Data Science can be used in so many ways. While going through all the ideas, I wondered which skills I would need to be able to do those projects.

Suddenly the idea struck me: let's analyze the skill set that I should possess and the skill set that employers require of data scientists.

For this analysis I needed current raw data to make sure the information in the job postings was up to date. To acquire this data I used web scraping.

Websites can be divided into two groups: static websites and dynamic websites. A static website serves fixed HTML (HyperText Markup Language) and CSS (Cascading Style Sheets) files, so every visitor receives the same content. A dynamic website generates or changes its content automatically, often with JavaScript running in the browser or with code running on the server, so it can be updated frequently and customized for each visitor.

HTML (HyperText Markup Language) is the most basic building block of the Web. It defines the meaning and structure of web content. Other technologies besides HTML are generally used to describe a web page's appearance/presentation (CSS) or functionality/behavior (JavaScript).

Web Scraping:

It is a method of extracting information from the HTML of websites. This method can be used if the website does not provide an API (Application Programming Interface), which lets you get the data in a structured way upon request.

For web scraping I used the Python library Beautiful Soup 4. This library uses a parser to help navigate the page and search for the key elements that I specified in order to get the necessary data. I used Python's html.parser for this project.

For web scraping, some articles recommend Selenium and its WebDriver, which drives a web browser to go through the websites. However, after some time digging through articles on web scraping, I found that most users recommend the Requests library to get the data from the website and avoid the issues with JavaScript. For that reason I decided to use the Requests library.
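As a rough illustration of that choice, here is a minimal sketch of downloading a page with Requests. The function name fetch_html, the User-Agent string, and the timeout value are my own assumptions for this example and are not taken from the original script.

```python
import requests

# Assumed header value for illustration; some sites reject requests that
# do not send a browser-like User-Agent.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; job-posting-study)"}


def fetch_html(url, session=None, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    # Reuse a session if one is passed in, otherwise create a new one.
    http = session or requests.Session()
    response = http.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text
```

Requests only returns the HTML that the server sends; it does not execute JavaScript, so content that a page loads later in the browser will not be in the result.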

Lastly, I had to determine which elements of the job postings were relevant for my future analysis. I was looking for the job name, the company name, the time when the job posting was posted on LinkedIn, the type of contract, the job location, the job description with the job requirements, and the link to the particular job posting. I also added the scraping date to make sure I can track my scrapings.

HTML uses markup tags, and I had to find the right ones, with their specific classes or IDs, and pass them to BeautifulSoup, which finds the matching tag and gets the text from that part of the website.

HTML uses "markup" to annotate text, images, and other content for display in a Web browser. HTML markup includes special "elements" such as <head>, <title>, <body>, <header>, <footer>, <article>, <section>, <p>, <div>, <span>, <img>, <aside>, <audio>, <canvas>, <datalist>, <details>, <embed>, <nav>, <output>, <progress>, <video>, <ul>, <ol>, <li> and many others.

The example below shows the markup from one job posting on LinkedIn. As you can see, the page is divided into elements, such as a div with the class top-card-layout__entity-info-container flex flex-wrap and so on.

The heading <h1> contains the job name (Staff Data Scientist), which is essential for my analysis.

<div class="top-card-layout__entity-info-container flex flex-wrap papabear:flex-nowrap">
<div class="top-card-layout__entity-info flex-grow flex-shrink-0 basis-0 babybear:flex-none babybear:w-full babybear:flex-none babybear:w-full">
<h1 class="top-card-layout__title font-sans text-lg papabear:text-xl font-bold leading-open text-color-text mb-0 topcard__title">Staff Data Scientist </h1>
<!-- -->
<!-- -->
<h4 class="top-card-layout__second-subline font-sans text-sm leading-open text-color-text-low-emphasis mt-0.5">
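To show how a job name could be pulled out of markup like this, here is a small sketch with Beautiful Soup and html.parser. The sample string is a shortened copy of the markup above with closing tags added so that it is complete, and the function name extract_job_title is only an illustration, not necessarily how my script does it.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup above; the closing tags are added here
# only so that the fragment is complete.
SAMPLE_HTML = """
<div class="top-card-layout__entity-info-container flex flex-wrap">
  <div class="top-card-layout__entity-info flex-grow">
    <h1 class="top-card-layout__title topcard__title">Staff Data Scientist </h1>
  </div>
</div>
"""


def extract_job_title(html):
    """Return the text of the job-title heading, or None if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches a tag that has this class among its other classes.
    heading = soup.find("h1", class_="top-card-layout__title")
    return heading.get_text(strip=True) if heading else None


print(extract_job_title(SAMPLE_HTML))
```

Returning None when the heading is missing keeps the scraper from crashing if a posting uses a different layout.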

HINTS:

LinkedIn shows only 25 jobs per search page. Therefore, to scrape more than 25 jobs each day, I built a list of search pages. I then looped through those pages to collect the links to the actual job posts, and finally scraped each post to extract the information.
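The list of search pages can be sketched roughly like this. The base URL and the query parameter names keywords and start are assumptions made for this example; they are not a documented LinkedIn API and the real search URL may look different.

```python
from urllib.parse import urlencode

JOBS_PER_PAGE = 25  # page size mentioned above


def build_search_urls(base_url, keywords, pages):
    """Return one search URL per result page, stepping the offset by 25."""
    urls = []
    for page in range(pages):
        # "keywords" and "start" are assumed parameter names for illustration.
        query = urlencode({"keywords": keywords, "start": page * JOBS_PER_PAGE})
        urls.append(f"{base_url}?{query}")
    return urls


# Placeholder base URL; three pages would cover up to 75 postings.
for url in build_search_urls("https://example.com/jobs/search", "data scientist", 3):
    print(url)
```

Each URL from this list would then be downloaded and parsed for the links to the individual job posts.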

To structure this data I used a dictionary, which I transformed into a DataFrame using Pandas. From there it is easy to save the data as CSV (comma-separated values). I also kept the raw data scraped each day in separate files in case I make an error and need to restore the data. In the last part of the code, I create one big summary file that merges all the raw data CSVs.
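A minimal sketch of this storage step could look like the following. The file naming scheme (jobs_<date>.csv and summary.csv), the column name scraped_on, and both function names are assumptions made for the example, so treat it as an outline and not as a copy of my script.

```python
from datetime import date
from pathlib import Path

import pandas as pd


def save_daily_csv(records, folder, day=None):
    """Write one day's scraped records to their own CSV and return its path."""
    day = day or date.today()
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    frame = pd.DataFrame(records)  # dictionary of columns -> DataFrame
    frame["scraped_on"] = day.isoformat()  # track when the row was scraped
    path = folder / f"jobs_{day.isoformat()}.csv"
    frame.to_csv(path, index=False)
    return path


def merge_daily_csvs(folder, output_name="summary.csv"):
    """Combine every daily CSV in the folder into one summary file."""
    folder = Path(folder)
    # The summary file name does not match this pattern, so it is never re-read.
    daily_files = sorted(folder.glob("jobs_*.csv"))
    if not daily_files:
        raise FileNotFoundError(f"no daily CSV files found in {folder}")
    merged = pd.concat((pd.read_csv(f) for f in daily_files), ignore_index=True)
    merged.to_csv(folder / output_name, index=False)
    return merged
```

Keeping one raw file per day means the summary can always be rebuilt from scratch if something goes wrong in a later step.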

This script can be modified (to increase the number of jobs scraped each day) and it is reusable, meaning that I might put this code on the website and have it executed on a daily basis.

Here is a link to my GitHub in case you want to see my code.

Hopefully, it can also help others who want to learn more about Data Science.

Matěj Srna
