
Scraping Medium Stories with Selenium

Last Updated on January 6, 2023 by Editorial Team

Author(s): Eugenia Anello

How to extract data from Medium search results

Photo by Nathan da Silva on Unsplash

Disclaimer: This article is only for educational purposes. We do not encourage anyone to scrape websites, especially those web properties that may have terms and conditions against such actions.

Web scraping is the process of extracting data from websites. It has many use cases: we can use it to scrape posts from social networks, products from Amazon, apartments from Airbnb or, as I will show here, Medium stories.

Medium is a platform where people can bring new ideas to the surface, spread them, and learn something new every day. When I search for a topic, I get a long list of articles as results, and I would like to use web scraping to collect the details of each Medium story. Moreover, I would like to change the order of the articles to see the most recent stories first, instead of an order based on claps.

I decided to build a crawler using Selenium, an open-source project for browser automation. Selenium provides bindings for all major programming languages, including the one we’ll use in this tutorial: Python. The Selenium API uses the WebDriver protocol to control a web browser, such as Chrome, Firefox, or Microsoft Edge. Here we will use Chrome as the browser.

Contents:

  1. Prerequisites
  2. Introduction to Selenium
  3. Getting Started with Selenium
  4. Interact with elements within the page
  5. Create a DataFrame and export it into a CSV file

1. Prerequisites

Before we begin this tutorial, set up a Python environment on your machine. If you haven’t done so yet, install Anaconda from its official page here. In this story, I use a Jupyter notebook on Windows 10, but feel free to use the IDE of your choice.

Let’s install selenium:
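With pip, run the following in a terminal (or Anaconda prompt):

```
pip install selenium
```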

We also need to install ChromeDriver, which is required for the selenium library to control Chrome:

Downloads – ChromeDriver – WebDriver for Chrome

Choose the version that matches the version of your browser. Once it is downloaded, add the directory containing chromedriver.exe to your PATH. Don’t skip this last step, or the program will raise errors.

Other supported browsers, such as Firefox and Microsoft Edge, also have their own drivers available.

2. Introduction to Selenium

The official Selenium documentation can be found here.

The following methods help to find multiple elements in a web page and return a list:

  • find_elements_by_name
  • find_elements_by_xpath
  • find_elements_by_link_text
  • find_elements_by_partial_link_text
  • find_elements_by_tag_name
  • find_elements_by_class_name
  • find_elements_by_css_selector

In this tutorial, I will only use the find_elements_by_xpath function, which extracts elements from a webpage given a path expressed in XPath. XPath is a language used for locating nodes in an XML document. The most useful path expressions are listed below:

Credit: w3schools
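To make the notation concrete, here is a small illustrative sketch (the class name "example" is a placeholder, and driver refers to the WebDriver instance created in the next section):

```python
# Illustrative XPath patterns with Selenium's legacy find_elements_by_xpath API:
#   //h3                         -> every <h3> node anywhere in the page
#   //div[@class='example']      -> every <div> whose class attribute is 'example'
#   //div[@class='example']//a   -> every <a> nested inside such a <div>
headings = driver.find_elements_by_xpath("//h3")
nested_links = driver.find_elements_by_xpath("//div[@class='example']//a")
```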

3. Getting Started with Selenium

Let’s import the libraries:
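A minimal sketch of the imports used in the steps below (pandas for the DataFrame and time for pausing between scrolls are assumptions based on what follows):

```python
import time                      # used later to pause while the page loads more results

import pandas as pd              # used in section 5 to build and export the DataFrame
from selenium import webdriver   # Selenium's browser-automation entry point
```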

Now we can create an instance of the Chrome WebDriver, passing the path to the webdriver you downloaded.
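A minimal sketch, with a placeholder path and the Selenium 3-style API that matches the find_elements_by_xpath calls used later (in Selenium 4 you would pass a Service object and use find_elements(By.XPATH, ...) instead):

```python
# Point Selenium at the ChromeDriver executable downloaded earlier.
# If its folder is already on your PATH, webdriver.Chrome() with no argument also works.
driver = webdriver.Chrome(executable_path=r"C:\path\to\chromedriver.exe")  # placeholder path
```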

Then we can create a function that takes as input a string representing the topic you would type into the search bar.
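A sketch of what such a function could look like, assuming Medium’s standard search URL format (the function name is my own):

```python
def build_search_url(topic):
    """Return the Medium search URL for the given topic string."""
    # Spaces must be encoded in the query string.
    return "https://medium.com/search?q=" + topic.replace(" ", "+")
```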

In this tutorial, I focus on searching for articles about neural networks. The first thing to do with the WebDriver is to navigate to the link built by the function defined above. The normal way to do this is by calling the get method:
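For the "neural networks" search used in this tutorial, that looks like:

```python
url = build_search_url("neural networks")  # helper sketched above
driver.get(url)                            # navigate to the search results page
```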

WebDriver waits until the page has fully loaded before returning control to our script. The loaded search page shows only the first ten results, so we need to scroll to the bottom of the page:
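A common way to do this is to run a short JavaScript snippet in a loop; a sketch (the number of scrolls and the delay are arbitrary choices):

```python
# Scroll down repeatedly so Medium loads more than the first ten stories.
for _ in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the page a moment to load the newly revealed results
```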

4. Interact with elements within the page

Now we need to interact with the HTML elements within the page. From the Medium search page, we’ll scrape:

  1. story’s title
  2. story’s link
  3. date
  4. number of claps
  5. number of responses
  6. author’s link

For example, we can right-click on the title of the first story and click Inspect Element. On the right, we see the HTML of the web page; the highlighted part corresponds to the code of the title.

We select the title of each article using the find_elements_by_xpath function shown before. // selects matching nodes anywhere in the page, here starting from the <div> tags that belong to the class "section-content". We write the full XPath to extract the titles, which are contained in <h3> tags, used to define headings.
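A sketch of that step, using the class described above (Medium’s markup changes over time, so verify the class name with the inspector):

```python
# Every <h3> nested inside a <div class="section-content"> is a story title.
title_elements = driver.find_elements_by_xpath("//div[@class='section-content']//h3")
titles = [element.text for element in title_elements]
```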

Now we extract the links from each story.

The specified path begins from the <div> tags that belong to the class "postArticle-content" and ends with the <a> tag, which is used to define hyperlinks. To extract the URL, we get the href attribute of the <a> tag, which contains the link to the specific article.
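A sketch along the same lines:

```python
# Read the href attribute of the <a> tags inside each <div class="postArticle-content">.
# If a story yields more than one link, refine the XPath accordingly.
link_elements = driver.find_elements_by_xpath("//div[@class='postArticle-content']//a")
links = [element.get_attribute("href") for element in link_elements]
```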

We’ll follow the same procedure for the other pieces of information:

These variables contain the lists of the corresponding elements: dates, numbers of claps, numbers of responses, and author URLs.
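The pattern is always the same; the XPaths below are illustrative placeholders rather than Medium’s real class names, so replace them with the paths you find via the inspector:

```python
# Placeholder XPaths -- adjust them to the markup you see when inspecting the page.
dates = [e.text for e in driver.find_elements_by_xpath("//time")]
claps = [e.text for e in driver.find_elements_by_xpath("//button[contains(@class, 'clap')]")]
responses = [e.text for e in driver.find_elements_by_xpath("//a[contains(@class, 'response')]")]
author_urls = [e.get_attribute("href")
               for e in driver.find_elements_by_xpath("//div[@class='postMetaInline']//a")]
```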

5. Create a DataFrame and export it into a CSV file

Now that the scraping is finished, we fill in the values of the dictionary "each_story". Once the dictionary is complete, we transform it into a DataFrame.
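A sketch of that step (the key names are my own; every list must have the same length, one entry per story):

```python
each_story = {
    "title": titles,
    "link": links,
    "date": dates,
    "claps": claps,
    "responses": responses,
    "author_url": author_urls,
}
df = pd.DataFrame(each_story)  # one row per scraped story
```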

We want the most recent results at the top, so we sort the DataFrame by date in descending order.
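Since the dates are scraped as plain text, one way to do this is to parse them first (the conversion step is my addition):

```python
df["date"] = pd.to_datetime(df["date"], errors="coerce")  # unparseable dates become NaT
df = df.sort_values(by="date", ascending=False)           # most recent stories first
```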

Let’s export the DataFrame into a CSV file:
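For example (the file name is a placeholder):

```python
df.to_csv("medium_stories.csv", index=False)  # write the results without the index column
```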

Congratulations! You extracted Medium search results using Python!

Final thoughts:

Until now, we have worked with several languages: Python, HTML, XPath, and more. If you had difficulties with some of them, such as HTML and XPath, I suggest reviewing the basics on w3schools and doing plenty of "inspect" on web pages for practice. I hope you found this tutorial useful for extracting various types of information from websites on your own. The code is on GitHub.


Scraping Medium Stories with Selenium was originally published in Towards AI on Medium.
