One of the benefits of Ruby's developer-friendly syntax is that it's easy to quickly build scripts to automate tasks, and web scraping with Ruby is both fun and useful. In this article, we'll explore using HTTParty to pull a web page and check it for a given string. To be specific, we'll build a cron job in Ruby that checks whether a product is in stock on a website!

Have you ever wanted to be alerted to a product restock before everyone else who submitted their email address in the "notify me when in stock" form? By the end of this article, you'll have built your own software that checks a website for given text and notifies you when it changes.

This step-by-step tutorial goes beyond learning to scrape a web page - you'll create a practical, real-world application with Ruby. We'll start by exploring the necessary dependencies and then move on to writing the core functionality.

Finally, we'll show you how to deploy your Ruby web scraper so it runs on a recurring schedule, keeping you up to date with changes to your target web pages. Whether you're a seasoned Ruby developer or just getting started, this guide will provide you with valuable insights and techniques to enhance your web scraping projects.

Is Ruby a good choice for web scraping?

When it comes to web scraping, Ruby might not be the first language that comes to mind. Python, with its extensive libraries, has long been the go-to choice for many developers, especially those who are already comfortable with the language.

Similarly, JavaScript, with tools like Puppeteer and Selenium that drive headless browsers and interact directly with the DOM, offers popular options for scraping dynamic content. There are also complete no-code options that let you scrape websites without writing your own web scraper in Ruby or any other language.

By building your own tool for web scraping using Ruby, you retain control over the scraping process. This flexibility lets you customize the scraper to meet your specific needs, whether that's handling edge cases, managing request frequency, or even integrating the scraper with other tools or APIs like Twilio to enhance the functionality.

Ruby's strong developer ecosystem makes it straightforward to call APIs and integrate with other services like Twilio, which makes it easy to extend the functionality of your scraper. For instance, you can quickly set up notifications or automate follow-up actions based on the data you've scraped. You could even turn the scraper into a fully-fledged application, letting users target web pages for keywords and get their own notifications.

Dependencies for web scraping in Ruby

Before we dive into the code itself, let's explore the Ruby gems we'll use to make our web scraping project more efficient and powerful. For this project, we'll need to rely on two gems: HTTParty and twilio-ruby.

HTTParty

First, we’ll need HTTParty, which is a gem that makes it straightforward to make web requests. We'll use HTTParty to make an HTTP GET request to a given URL, and then we'll interpret the response. The gem can also perform other HTTP methods and comes with a lot of tools for handling edge cases, but we'll stick with simple GET requests for our project.

HTTParty provides a clean and easy-to-use API that makes writing HTTP requests more straightforward than using Ruby's Net::HTTP directly. It can also automatically parse JSON and XML responses, saving us the extra step of parsing response bodies manually.
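
For example, if you ever point HTTParty at a JSON endpoint, the parsed body is available without any extra work. Here's a minimal sketch, using a made-up API URL:

require "httparty"

# Hypothetical JSON endpoint; replace it with a real API URL.
response = HTTParty.get("https://api.example.com/products.json")

# HTTParty parses the body based on the response's Content-Type, so
# parsed_response returns a Ruby hash or array instead of a raw string.
puts response.parsed_response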

While this project will only use GET requests, the gem supports all HTTP methods, including POST, PUT, and DELETE, making it a good choice in case the requirements of what you're building expand beyond just fetching web pages.

While this gem is great for lightweight web scraping tasks, you should consider alternatives like headless browser automation with tools such as Selenium for larger-scale web scraping projects or those requiring more advanced features like JavaScript rendering and page interaction. This example only requires us to parse a static HTML page and search it for text, so HTTParty is plenty powerful.

Our parsing of the HTML for this example is limited to a simple string search, but you may wish to do more advanced parsing. The nokogiri Ruby gem is the most popular library for manipulating HTML and XML documents and makes advanced parsing much easier.
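
For instance, if you wanted to pull the text out of one specific element instead of searching the whole page, Nokogiri makes that short work. Here's a minimal sketch; the .stock-status selector is made up, so inspect your target page to find the element that actually holds the availability text:

require "httparty"
require "nokogiri"

html = HTTParty.get("https://www.example.com/replace-this").body

# Parse the raw HTML into a document we can query with CSS selectors.
document = Nokogiri::HTML(html)

# at_css returns the first matching element, or nil if there isn't one.
status = document.at_css(".stock-status")&.text
puts status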

Twilio

The second gem we will lean on is twilio-ruby, which is the official Ruby gem for Twilio. Twilio is a comprehensive communications platform that provides a number of products, but we're interested in SMS messaging. We'll use their provided gem to make it straightforward to send ourselves a text when the website text changes.

Twilio's infrastructure ensures high deliverability rates for your notifications, making it a scalable choice to turn a simple web scraper into a commercial product. While this project focuses on SMS/text messages, Twilio supports other communication channels like email, WhatsApp, and even voice, giving you flexibility in how you want to receive your communications.

Twilio offers a free tier for testing projects, which is plenty generous enough for this example. If you intend to send lots of messages, your costs will grow in proportion to the number of messages you're sending. If you're looking for another way to get notifications from a Ruby project, you could send notifications via email using Ruby's built-in net/smtp library, or Rails' ActionMailer.

Building blocks for web scraping in Ruby

Ruby apps commonly use gems and Bundler to manage dependencies, so we'll set that up first.

Defining gem dependencies

First, create a new folder for the project and add a main.rb file in its root directory.

In that file, add the following code, which requires the two dependencies and prints "Hello world!":

require "twilio-ruby"
require "httparty"

puts "Hello world!"

Next, add a Gemfile for your dependencies. The file should be named simply Gemfile, with no extension, so that Bundler can find it. The Gemfile doesn't need to be complicated. We'll set a source for the gems, define the gems we need, and set the Ruby version. It should look like this:

source "https://rubygems.org"

gem "httparty"
gem "twilio-ruby"

ruby File.read(".ruby-version").strip

The last line of this Gemfile points to a file labeled .ruby-version to set which version of Ruby to use for the project, so now you will need to create that file and set the version.

Create a new file called .ruby-version in the root of the project and give it a single line that defines which Ruby version to use by adding the following:

3.0.0

Finally, you're ready to install dependencies. Just run the following command to install the gems from the Gemfile:

bundle install

This will install the dependencies and also create Gemfile.lock, a lockfile specifying which versions were used at install time.

Now, you can run the Hello World script once locally on your machine by running the following command:

ruby main.rb

Understanding the scraper

To meet our project requirements, we need to do a few things.

First, we'll need to get the contents of a given webpage. We'll use HTTParty to make a GET HTTP request to a given endpoint and accept the response as an HTML page in the response body.

Then, we'll check the contents of the HTML document returned in the response body for the string that we're searching for. If that string is not present, we'll send a text message using Twilio to a specified number. This web scraper doesn't need to be super dynamic, so we'll hardcode the URL, the search query, and the phone number. To recap, we'll build a Ruby web scraping class called RestockChecker that does three main things:

  • Get the contents of a given website
  • Check the website for the presence of a string, something like "Out-of-stock" for this example
  • If the string is not present, send a text to a specified number using Twilio

Fetch a target webpage with HTTParty

First, we'll write a simple method to check if an item is in stock using HTTParty. It's simple to make a web request with HTTParty, so start with the following code for this method implementation:

def check_stock
  response = HTTParty.get("https://www.example.com/replace-this")
  unless response.body.include?("Out of stock")
    send_in_stock_notification
  end
end

This method assumes you have HTTParty installed, which we'll get to later when we implement the rest of the class. The method specifically checks for the "Out of stock" string on the site, so if the text on the particular product you're searching for is different, just swap it out.

The check_stock method retrieves the webpage and checks it for text, but calls the nonexistent send_in_stock_notification method to handle the notification.

Splitting the responsibilities of the methods up like this follows Ruby's conventions for small methods that are explicit about what they do.

Send a text message notification from the script

You're probably wondering how we'll implement the send_in_stock_notification method, so we'll get to that next. Paste the following code in the same file as your check_stock method as the implementation of our send_in_stock_notification method:

def send_in_stock_notification
  # Ruby doesn't allow constants to be assigned inside a method, so we use
  # local variables here; the class version later defines them as constants.
  account_sid = ENV["TWILIO_ACCOUNT_SID"]
  auth_token = ENV["TWILIO_AUTH_TOKEN"]
  sms_client = Twilio::REST::Client.new(account_sid, auth_token)

  message = sms_client.messages.create(
    from: ENV["FROM_NUMBER"],
    to: ENV["TO_NUMBER"],
    body: "NINJA CREAMI IS IN STOCK"
  )
end

This method has only one job - send a text that the item is in stock. First, it reads the Twilio account SID and auth token from environment variables into local variables (assigning constants inside a method isn't allowed in Ruby), then initializes the Twilio client with those values.

Next, it sends a text message to the provided phone number with the text: "NINJA CREAMI IS IN STOCK". This method depends on a number of things unique to your Twilio account. First, create a Twilio account. The free tier is plenty for this project's needs in the short term.

In your Twilio account, take note of an "Account SID", an "Auth Token", and a "From Number".

[Screenshot: Twilio setup showing the Account SID, Auth Token, and phone number in the Twilio console]

When running the code on your machine, you'll need to set these environment variables to those values. You'll also set the TO_NUMBER environment variable to your own phone number, as that's where the text message will be sent. Before you run this script on your computer or a server, you'll need to set:

  • TWILIO_ACCOUNT_SID (this is the Account SID from your Twilio account)
  • TWILIO_AUTH_TOKEN (this is the Auth Token from your Twilio account)
  • FROM_NUMBER (this is the phone number from your Twilio account)
  • TO_NUMBER (this is the number you want the notification to go to)

We could set up the script to accept these as command-line arguments, but using environment variables will make it easier to set them dynamically when we deploy this as a cron job on Render.
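
If you'd like the script to fail loudly when one of these variables is missing, rather than quietly sending nothing, you could add a small guard near the top of main.rb. This is an optional addition, not part of the original script:

# Fail fast if any required environment variable is missing.
%w[TWILIO_ACCOUNT_SID TWILIO_AUTH_TOKEN FROM_NUMBER TO_NUMBER].each do |name|
  abort("Missing required environment variable: #{name}") if ENV[name].to_s.empty?
end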

Together, the two methods make it easy to understand the intended behavior at a glance:

def check_stock
  response = HTTParty.get("https://www.example.com/replace-this")
  unless response.body.include?("Out of stock")
    send_in_stock_notification
  end
end

def send_in_stock_notification
  account_sid = ENV["TWILIO_ACCOUNT_SID"]
  auth_token = ENV["TWILIO_AUTH_TOKEN"]
  sms_client = Twilio::REST::Client.new(account_sid, auth_token)

  message = sms_client.messages.create(
    from: ENV["FROM_NUMBER"],
    to: ENV["TO_NUMBER"],
    body: "NINJA CREAMI IS IN STOCK"
  )
end

Finally, swap out the message body for anything you like. I first built this to track down a Ninja Creami, so my message body is set as a clear indication that it's in stock.

You can get more advanced with parsing HTML to get more useful results, but for this example, we'll just search the HTML document returned for the given text.

Turning the web scraper into a proper Ruby class

These two methods are essentially all we need to meet our goals, but we can put them inside a class to make things cleaner and easier to reuse in case you want to modify or expand the project in the future.

You can make a RestockChecker class and move these methods inside it. I put this, along with instantiating the class, requiring dependencies, and using the methods in a main.rb file that looks like this:

require "twilio-ruby"
require "httparty"

class RestockChecker
  include HTTParty

  ACCOUNT_SID = ENV["TWILIO_ACCOUNT_SID"]
  AUTH_TOKEN = ENV["TWILIO_AUTH_TOKEN"]
  SMS_CLIENT = Twilio::REST::Client.new(ACCOUNT_SID, AUTH_TOKEN)

  def initialize(url_to_check, out_of_stock_text)
    @url_to_check = url_to_check
    @out_of_stock_check = out_of_stock_text
  end

  def check_stock
    response = HTTParty.get(@url_to_check)
    unless response.body.include?(@out_of_stock_check)
      send_in_stock_notification
    end
  end

  def send_in_stock_notification
    message = SMS_CLIENT.messages.create(
      from: ENV["FROM_NUMBER"],
      to: ENV["TO_NUMBER"],
      body: "NINJA CREAMI IS IN STOCK: #{@url_to_check}"
    )
  end
end

out_of_stock_text = "Out of Stock"
url = "https://www.ninjakitchen.com/products/ninja-creami-breeze-7-in-1-ice-cream-maker-zidNC201"
restock_checker = RestockChecker.new(url, out_of_stock_text)
restock_checker.check_stock

This changes a few notable things. First, some Ruby code has to instantiate the RestockChecker class before it can do anything. When creating a new RestockChecker ruby object, the caller has to provide both the URL and the text to search for. Given that these are dynamic but not sensitive, we can set them as part of creating the object as opposed to environment variables.

Providing these arguments at object instantiation gives you a single point of control for both the URL and the out-of-stock text, and associating them with a given instance of the class reinforces that the object's job is to check that URL for that text.

You can create more instances of the class and provide new arguments, and those objects would check their URL for their given text. The class defines the behavior and each object instantiated from the class is responsible for executing that behavior given the data provided at creation.
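
For example, checking a second product only takes another instantiation. The URLs and search strings below are placeholders:

# Each instance owns its own URL and out-of-stock text.
creami_checker = RestockChecker.new("https://www.example.com/product-one", "Out of Stock")
console_checker = RestockChecker.new("https://www.example.com/product-two", "Sold out")

creami_checker.check_stock
console_checker.check_stock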

This object-oriented approach makes it easy to expand this Ruby script into a fully-featured Ruby app, but that's beyond the scope of this article.

Cron jobs and deploying on Render

Running this code just once is hardly useful. If it only ran a single time, you may as well have just checked the site manually yourself. In Ruby, there are several ways to run this script on a schedule. One such way is to run it as a cron job, also known as a scheduled job.

Render is a popular platform for hosting web applications, including hosting Ruby on Rails apps. Render is a Platform as a Service (PaaS), like Heroku, so it abstracts away the underlying infrastructure from running web services. You can deploy a web app to Render and do not have to understand or maintain the server on which it runs.

Beyond hosting web apps, the Render platform has options for hosting a cron job in the cloud with affordable pricing. Cron jobs are tasks created with cron, a Linux tool for scheduling code to run. You could run a cron job locally, but then you'd need to leave your computer on and awake indefinitely to run your web scraper. Deploying a cron job on Render lets you run the web scraper periodically without having to maintain the underlying infrastructure yourself.

Render also supports environment variables, which our code needs in order to read the sensitive information it depends on. You can create a new cron job on Render's platform, link it to the repository that holds the code we wrote, and configure it to run at your specified schedule using a cron expression, which indicates how often the job should run. For example, the cron expression */10 * * * * runs the job every 10 minutes.

When Render prompts you for an environment, specify Ruby. When prompted for a build command, use bundle install. Finally, when prompted for a command, use bundle exec ruby main.rb. You will then have to manually set environment variables in Render just as you did on your machine for the script to run successfully.

Of course, this is not the only option for running your script regularly. Other platforms support cron jobs, and you could use Sidekiq or Solid Queue to run it as a scheduled background job, or even integrate it into a Rails application and use ActiveJob.
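
As a rough sketch of the Rails route, the scraper could be wrapped in an ActiveJob job. This assumes the RestockChecker class lives inside a Rails app where it can be autoloaded, and you'd still need something like Sidekiq's scheduling features or Solid Queue's recurring tasks to enqueue it periodically:

class CheckStockJob < ApplicationJob
  queue_as :default

  def perform(url, out_of_stock_text)
    # Reuse the same class we built earlier in the article.
    RestockChecker.new(url, out_of_stock_text).check_stock
  end
end

# Enqueue a run (a scheduler would do this on an interval):
# CheckStockJob.perform_later(url, out_of_stock_text)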

Setting request headers to unblock the scraper

When building a web scraper in Ruby or any other programming language, it's likely you'll eventually encounter websites that block or limit requests from data-scraping scripts like the one we're building. This is often done to prevent automated bots from overwhelming the server.

One of the simplest and most effective ways to avoid a web scraper getting blocked is to set appropriate HTTP request headers, which can make your HTTP requests look more like they’re coming from a normal web browser rather than an automated script. It's important to exercise caution when doing this - be a good citizen of the web and don't send GET requests more frequently than a typical web browser user might.

The example we've written to check a website for a product listing is plenty effective when running once every 10 minutes, so there's no need to run it often enough to cause problems for the host. With that, let's get into the two HTTP headers you should consider setting in your GET requests.

User-Agent header

The User-Agent request header is likely the most critical header to set when scraping. This header tells the server what type of device, operating system, and browser is making the request. Without a User-Agent header, you're sending the default User-Agent provided by HTTParty. This makes your request stand out from other, normal traffic, increasing the chances the request gets blocked.

With HTTParty, setting the User-Agent header is easy. You could modify the check_stock method to look like this:

def check_stock
  response = HTTParty.get(@url_to_check, headers: { "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36" })

  unless response.body.include?(@out_of_stock_check)
    send_in_stock_notification
  end
end

Accept header

Another potentially useful header for web scraping with Ruby is the Accept header. The Accept header lets the server know what types of content the caller (client) can accept as a response. This is particularly important when scraping, as it can influence the format of the data returned by the server.

For instance, if a website offers both HTML and JSON responses, setting an appropriate Accept header can ensure that you get the format you need. If you want your scraper to receive responses the way a web browser would, copy the Accept header a browser sends. We can set the Accept header in our check_stock method much like we did for the User-Agent header:

def check_stock
  response = HTTParty.get(@url_to_check, headers: { "Accept" => "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8" })

  unless response.body.include?(@out_of_stock_check)
    send_in_stock_notification
  end
end

Referer header

Another useful header for scraping is the Referer header, which indicates the page the browser was on before requesting the current one. While less common, some websites use this to verify that requests are coming from within their own site.

Setting this header to another page in the given site is a good way for a request from a script to look more like that of a browser. You can set the Referer header in the check_stock method just like the other headers:

def check_stock
  response = HTTParty.get(@url_to_check, headers: { "Referer" => "https://www.yoursite.com/some-other-page" })

  unless response.body.include?(@out_of_stock_check)
    send_in_stock_notification
  end
end
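
These headers don't have to be set one at a time. If you want to send all three together, pass them in a single hash. Here's a sketch using the same example values from above:

def check_stock
  headers = {
    "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Accept" => "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Referer" => "https://www.yoursite.com/some-other-page"
  }

  response = HTTParty.get(@url_to_check, headers: headers)

  unless response.body.include?(@out_of_stock_check)
    send_in_stock_notification
  end
end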

Better web scraping with Ruby

In this tutorial, we've demonstrated the beginnings of web scraping using Ruby. We have explored the strengths of two powerful gems, HTTParty and twilio-ruby, and leveraged their capabilities to create a pragmatic solution for monitoring product availability. As with any project, there's always room for enhancement.

For starters, the code as written will send a text message every time it runs and doesn't find the given text, regardless of how many times you've already been notified. A useful exercise would be determining how the project could be modified to send only one notification.
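
One simple approach, sketched below rather than prescribed, is to record that a notification has already been sent and skip future sends until you reset it. The already_notified.txt file name is made up, and a deployed cron job on Render gets a fresh filesystem on each run, so there you'd want a persistent store (a database, Redis, or similar) instead:

# Lives at the class level, alongside the other constants.
NOTIFIED_FLAG = "already_notified.txt"

def check_stock
  # Stay quiet if we've already sent a notification.
  return if File.exist?(NOTIFIED_FLAG)

  response = HTTParty.get(@url_to_check)
  unless response.body.include?(@out_of_stock_check)
    send_in_stock_notification
    # Record that we've notified so future runs skip the text message.
    File.write(NOTIFIED_FLAG, Time.now.to_s)
  end
end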

If you extend this project's functionality, you'll want to handle exceptions and the potential for a status code other than 200. Expanding the project into a fully functional application would be made easier by first adding tests for the existing functionality.
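
As a rough sketch of what that error handling might look like (again, an addition rather than part of the original script), you could check the response code and rescue common network errors inside check_stock:

def check_stock
  response = HTTParty.get(@url_to_check)

  # Bail out unless the site returned a successful response.
  return unless response.code == 200

  unless response.body.include?(@out_of_stock_check)
    send_in_stock_notification
  end
rescue HTTParty::Error, SocketError, Timeout::Error => e
  # Log the failure and move on; the next scheduled run will try again.
  puts "Request to #{@url_to_check} failed: #{e.message}"
end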

Beyond this, perhaps the project could be expanded to check multiple sites! If it had a user interface, these sites and the text for which to search could be configured by a non-technical user. If your next steps include more advanced parsing of the HTML response, consider introducing the nokogiri gem to make things more streamlined.

Ruby web scraping, while powerful, should be used judiciously and ethically, respecting your target site's terms of service. As we mentioned earlier in the article, it's important not to send more than an ordinary human user would send to avoid overwhelming the server. I hope you've enjoyed learning more about web scraping and that this tutorial helped you build useful skills. If you'd like to get notified when we publish more content like this, sign up for the Honeybadger newsletter.

Happy scraping!

Jeffery Morhous

Jeff is a Software Engineer working in healthcare technology using Ruby on Rails, React, and plenty more tools. He loves making things that make life more interesting and learning as much as he can on the way. In his spare time, he loves to play guitar, hike, and tinker with cars.
