Scraping Medium Stories with Selenium
Last Updated on January 6, 2023 by Editorial Team
Author(s): Eugenia Anello
How to extract data from Medium Search results
Disclaimer: This article is only for educational purposes. We do not encourage anyone to scrape websites, especially those web properties that may have terms and conditions against such actions.
Web scraping is the process of extracting data from websites. It has many use cases: we can scrape posts from social networks, products from Amazon, apartments from Airbnb, or, as I will show, Medium posts.
Medium is a platform where people can bring new ideas to the surface, spread them, and learn new things every day. When I search for a topic, many articles come back as results, and I would like to use web scraping to get the details of each Medium story. I would also like to reorder the articles to see the most recent stories first, instead of an order based on claps.
I decided to build a crawler using Selenium, an open-source project for browser automation. Selenium provides bindings for all major programming languages, including the one we'll use in this tutorial: Python. The Selenium API uses the WebDriver protocol to control a web browser such as Chrome, Firefox, or Microsoft Edge. Here we will use Chrome.
Content:
- Prerequisites
- Introduction to Selenium
- Getting started with Selenium
- Interact with elements within the page
- Create DataFrame and export it into CSV file
1. Prerequisites
Before we begin this tutorial, set up a Python environment on your machine. If you haven't already, install Anaconda from its official page here. In this story, I will use Jupyter on Windows 10, but feel free to use the IDE of your choice.
Let's install Selenium:
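Selenium can be installed with pip from your Anaconda prompt or terminal:

```shell
pip install selenium
```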
We also need to install the Chrome WebDriver, which the Selenium library needs in order to drive the browser:
Downloads – ChromeDriver – WebDriver for Chrome
Choose the version that matches the version of your browser. Once it is downloaded, add the directory containing chromedriver.exe to your PATH. Don't skip this last step, or the program will raise errors.
There are other supported browsers, each with its own driver available:
2. Introduction to Selenium
The official Selenium documentation can be found here.
The following methods help to find multiple elements in a web page and return a list:
- find_elements_by_name
- find_elements_by_xpath
- find_elements_by_link_text
- find_elements_by_partial_link_text
- find_elements_by_tag_name
- find_elements_by_class_name
- find_elements_by_css_selector
In this tutorial, I will only use the find_elements_by_xpath function, which extracts the elements of a web page matching a specified XPath. XPath is a language for locating nodes in an XML document. The most useful path expressions are listed below:
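As a quick illustration of such expressions, Python's standard-library ElementTree supports a subset of XPath. In the snippet below, `.//h3` plays the role of `//h3`, selecting every `<h3>` descendant; the tiny document is a made-up stand-in for a real page:

```python
import xml.etree.ElementTree as ET

# A miniature stand-in for a search results page
html = """
<div class="section-content">
  <h3>First story</h3>
  <div class="postArticle-content"><h3>Second story</h3></div>
</div>
"""
root = ET.fromstring(html)

# './/h3' selects all <h3> descendants, like '//h3' in full XPath
titles = [h3.text for h3 in root.findall(".//h3")]
print(titles)  # ['First story', 'Second story']
```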
3. Getting Started with Selenium
Let's import the libraries:
Now we can create an instance of the Chrome WebDriver, pointing it at the location of the driver you downloaded.
Then we can create a function that takes as input a string representing the topic you would type in the search bar.
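The function can be a simple string builder; the URL pattern below is what Medium's search page used at the time of writing:

```python
def get_search_url(topic):
    """Build the Medium search URL for a given topic."""
    # Spaces must be percent-encoded in the query string
    return "https://medium.com/search?q=" + topic.replace(" ", "%20")

print(get_search_url("neural networks"))
# https://medium.com/search?q=neural%20networks
```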
In this tutorial, I focus on searching for articles about neural networks. The first thing to do with the WebDriver is to navigate to the link built by the function defined above. The normal way to do this is by calling the get method:
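A sketch of the navigation step, wrapped in a small helper so the driver and the search topic stay together (the URL pattern is the same illustrative one as above):

```python
def open_search(driver, topic):
    """Navigate the browser to the Medium search results for `topic`."""
    url = "https://medium.com/search?q=" + topic.replace(" ", "%20")
    driver.get(url)  # get() returns only after the page has loaded
    return url
```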
WebDriver waits until the page has fully loaded before returning control to our script. The loaded search page shows only the first ten results, so we need to scroll to the bottom of the page:
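One common pattern is to compare the page height before and after each scroll and stop once it no longer grows; the pause gives the next batch of results time to load:

```python
import time

def scroll_to_bottom(driver, pause=2.0):
    """Scroll until the page height stops growing, i.e. no more results load."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # wait for the next batch of results
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new loaded: we reached the bottom
        last_height = new_height
```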
4. Interact with elements within theΒ page
Now we need to interact with the HTML elements within the page. From the Medium search page, we'll scrape:
- story's title
- story's link
- date
- number of claps
- number of responses
- author's link
For example, we can right-click the title of the first story and click Inspect. On the right, we see the HTML of the web page; the highlighted part is the code of the title.
We select the title of each article using the find_elements_by_xpath function shown before. // selects matching nodes anywhere in the page, starting here from the tag <div> that belongs to the class "section-content". We write the full XPath down to the titles, which sit in <h3> tags, used to define headings.
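A sketch of the call, using the selector just described (Medium's class names change over time, so treat "section-content" as an example to verify in your own Inspect panel):

```python
TITLE_XPATH = "//div[@class='section-content']//h3"

def get_titles(driver):
    """Return the text of every story title on the loaded results page."""
    return [el.text for el in driver.find_elements_by_xpath(TITLE_XPATH)]
```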
Now we extract the link of each story.
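Again with find_elements_by_xpath, this time reading the href attribute of each matched <a> tag (the class name is the one seen at the time of writing):

```python
LINK_XPATH = "//div[@class='postArticle-content']//a"

def get_links(driver):
    """Return the URL of every story on the loaded results page."""
    return [el.get_attribute("href")
            for el in driver.find_elements_by_xpath(LINK_XPATH)]
```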
The specified path begins at the tag <div> that belongs to the class "postArticle-content" and ends at the tag <a>, which defines hyperlinks. To extract the URL, we read the <a> tag's href attribute, which contains the link to the specific article.
We'll follow the same procedure for the other pieces of information:
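A sketch of the remaining extractions. The XPath strings below are illustrative placeholders: Medium's markup varies over time, so copy the real class names from your Inspect panel before running this:

```python
# Illustrative selectors: replace with the real ones from your Inspect panel
FIELD_XPATHS = {
    "date":      "//div[@class='postArticle-content']//time",
    "claps":     "//div[@class='postArticle-content']//button",
    "responses": "//div[@class='postArticle-content']//a[2]",
    "author":    "//div[@class='postMetaInline']//a",
}

def get_texts(driver, xpath):
    """Collect the visible text of every element matching `xpath`."""
    return [el.text for el in driver.find_elements_by_xpath(xpath)]

# e.g. dates = get_texts(driver, FIELD_XPATHS["date"])
```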
These variables contain the lists of the corresponding elements: dates, numbers of claps, numbers of responses, and authors' URLs.
5. Create DataFrame and export it into CSV file
Now that the scraping is finished, we fill in the values of the dictionary "each_story". Once the dictionary is complete, we transform it into a DataFrame.
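With the lists in hand, building the dictionary and DataFrame is straightforward. The sample values below are placeholders standing in for the scraped lists:

```python
import pandas as pd

# Placeholder values standing in for the scraped lists
each_story = {
    "title":     ["Story A", "Story B"],
    "link":      ["https://medium.com/a", "https://medium.com/b"],
    "date":      ["2021-01-05", "2021-03-02"],
    "claps":     ["120", "55"],
    "responses": ["3", "1"],
    "author":    ["https://medium.com/@author_a", "https://medium.com/@author_b"],
}

df = pd.DataFrame(each_story)
print(df.shape)  # (2, 6)
```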
We want the most recent results at the top, so we sort the DataFrame by date in descending order.
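Converting the date column to real datetimes first makes the sort reliable; the two-row frame below is just a placeholder for the scraped one:

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Older", "Newer"],
    "date":  ["2021-01-05", "2021-03-02"],
})

df["date"] = pd.to_datetime(df["date"])          # parse strings into datetimes
df = df.sort_values(by="date", ascending=False)  # most recent first
print(df.iloc[0]["title"])  # Newer
```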
Let's export the DataFrame into a CSV file:
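For example (the filename is arbitrary; index=False leaves pandas' row index out of the file):

```python
import pandas as pd

df = pd.DataFrame({"title": ["Story A"], "link": ["https://medium.com/a"]})
df.to_csv("medium_stories.csv", index=False)  # writes to the working directory

# Reading it back confirms the round trip
print(pd.read_csv("medium_stories.csv").shape)  # (1, 2)
```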
Congratulations! You extracted Medium Search results using Python!
Final thoughts:
So far we have worked with several languages: Python, HTML, XPath, and more. If you had difficulty with some of them, such as HTML and XPath, I suggest reviewing the basics on W3Schools and practicing by inspecting web pages. I hope you found this tutorial useful for extracting information from websites on your own. The code is on GitHub.
Scraping Medium Stories with Selenium was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.