Accelerate your data journey. Join our AI Community!

Publication

Scraping Medium Stories with Selenium
Web Scraping

Scraping Medium Stories with Selenium

Author(s): Eugenia Anello 

How to extract data from Medium Search results

photo by Nathan da Silva on Unsplash

Disclaimer: This article is only for educational purposes. We do not encourage anyone to scrape websites, especially those web properties that may have terms and conditions against such actions.

Web scraping is the process to extract data from websites. There are many use cases. We can apply it to scrape posts from social networks, products from Amazon, apartments from Airbnb or Medium posts as I will show.

Medium is a platform, where people can bring new ideas to the surface, spread them and learn new things every day. When I search one topic, there are a lot of articles as results and I would like to use web scraping to get the details of each of the Medium stories. Moreover, I would like to change the order of the articles, to see the most recent stories, instead of having an order based on claps.

I decided to build a crawler using Selenium, that is an open-source project used for browser automation. Selenium supports bindings for all major programming languages, including the language we’ll use in the tutorial: Python. The Selenium API uses the WebDriver protocol to control a web browser, like Chrome, Firefox, Microsoft Edge. Here we will use Chrome as browser.

Content:

  1. Prerequisites
  2. Introduction to Selenium
  3. Getting started with Selenium
  4. Interact with elements within the page
  5. Create DataFrame and export it into CSV file

1. Prerequisites

Before we begin this tutorial, set up Python environment in your machine. Install Anaconda in their official page here if you didn’t do it yet. In this story, I will use Jupiter as environment in Windows 10, but feel free to decide your IDE of choice.

Let’s install selenium:

https://medium.com/media/4e92d145c9301bed7a64f08848f1da48/href

We also need to install Chrome webdriver, it’s important to use the selenium library:

Downloads – ChromeDriver – WebDriver for Chrome

Choose the version that matches the version of your browser. Once downloaded, you need to add the chromedriver.exe’s directory to your path. Don’t skip this last step, otherwise the program will give errors.

There are other supported browsers, that have their own drivers available:

2.Introduction of Selenium

The proper documentation about selenium can be found here.

The following methods help to find multiple elements in a web page and return a list:

  • find_elements_by_name
  • find_elements_by_xpath
  • find_elements_by_link_text
  • find_elements_by_partial_link_text
  • find_elements_by_tag_name
  • find_elements_by_class_name
  • find_elements_by_css_selector

In this tutorial, I will only use find_elements_by_xpath function, that extract the elements in a webpage specifying the path using XPath. Xpath is a language that is used for locating nodes in an XML document. The most useful path expressions are listed below:

Credit: w3schools

3. Getting Started with Selenium

Let’s import the libraries:

https://medium.com/media/4e0ff0ceabd4d40d703f514bd9489b24/href

Now we can create an instance of Chrome WebDriver, using the direct path of the location of your downloaded webdriver.

https://medium.com/media/07247214772f0d98a7a2619f6828c3cc/href

After we can create a function, that takes as input a string, that represents the topic you write in the search bar.

https://medium.com/media/82c05a48f5235afe2e3655558e0c1b84/href

In this tutorial, I focused on searching articles about neural networks. The first thing to do with the WebDriver is navigate to the link created through the function defined before. The normal way to do this is by calling get method:

https://medium.com/media/d6bef7b775a34b8d4d6757a5335bed79/href

WebDriver will wait until the page has fully loaded before returning control to our test or script. The loaded search will show only the first ten results. So, we need to scroll to the bottom of the page:

https://medium.com/media/1d45ddc74c02fb54fbb05fc5d73e2a65/href

4. Interact with elements within the page

Now, need to interact with the HTML elements within a page. In the medium search page, we’ll scrape:

  1. story’s title
  2. story’s link
  3. date
  4. number of claps
  5. number of responses
  6. author’s link

For example, we can make a right-click in the title of the first story and click inspect element. At right we see the HTML of the web page. In particular, the highlighted part represents the code of the title.

We select the title of each article using the function find_elements_by_xpath shown before. // is used to select the nodes in the web page from the tag <div> that belongs to the class “section-content”. We write the entire Xpath to extract the titles, which are specified by the tag elements <h3>, used to define headings.

https://medium.com/media/7312addbb57d7806dcb456ad953eecd6/href

Now we extract the links from each story.

The specified path begins from the tag <div> that belong to the class “postArticle-content” and ends with the tag <a>, that is used to define hyperlinks. In order to extract the URL, we need to get the property of the tag <a>, href, that contains the link of the specific article.

https://medium.com/media/131013b04b83ce293aed4c67ab3469a6/href

We’ll do the same procedure for other pieces of information:

https://medium.com/media/838d0de5ba2ea6b19360060d54e33547/href

These variables contain the list of the corresponding elements: dates, number of claps, number of responses and author’s URL.

5. Create DataFrame and export it into CSV file

Now the scraping is finished and we want to fill the values of the dictionary “each_story”. Once the dictionary is complete, we transform the dictionary into a DataFrame.

https://medium.com/media/0ce792973e0bd07a2b74eae270667fc8/href

We want the more recent results at the top. So, we sort the DataFrame by date in descending order.

https://medium.com/media/e26d539cb4a673c62286e61243b9af4a/href

Let’s export the DataFrame into a CSV file:

https://medium.com/media/2dc9d738a9be79d5d82c85be27bfecdb/href

Congratulations! You extracted Medium Search results using Python!

Final thoughts:

Until now we interacted with many languages, Python, HTML, XPath and many other more. If you had some difficulties with some languages, such as HTML and XPath, I suggest you to relook the basics in w3school and do many “inspect” in the webpages to make practice. I hope you found useful this tutorial to extract the various types of information from websites on your own. The code is in github.


Scraping Medium Stories with Selenium was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI

Feedback ↓