How to Scrape Data from a Website Using Roborabbit (Part 1)

Data scraping has become essential for businesses and organizations of all sizes, serving purposes such as e-commerce price comparison, real estate data analysis, data gathering, and market research. In this article, we'll learn how to scrape data using Roborabbit.

by Josephine Loo · February 2023 · Updated July 2024

Data scraping, also known as web scraping, is a technique for extracting large amounts of data from websites and other sources and transforming unstructured data into a structured format. The process can be automated, which makes collecting information more efficient.

To make it more efficient, you can use a cloud-based data scraping tool. It provides greater accessibility, scalability, and flexibility. One such tool is Roborabbit, a cloud-based browser automation tool that you can use to automate various browser tasks, including scraping data.

In this tutorial, we will learn how to scrape data from a website using Roborabbit. As an example, we will use this Roborabbit playground to show how the job title, company, location, salary, and link on a job board can be saved as structured data.

scraping a job board (Browserbear playground)

What is Roborabbit

Roborabbit is a scalable, cloud-based browser automation tool that helps you to automate any browser task. From automated website testing, scraping data, taking scheduled screenshots for archiving, to automating other repetitive browser tasks, you can do it with Roborabbit.

Similar to Bannerbear, the featured automated image and video generation tool from the same company, you can easily create no-code automated workflows by integrating it with other apps like Google Sheets, Airtable, and more on Zapier. Besides that, you can also use the REST API to trigger tasks to run in the cloud and receive data from the completed tasks.

Scraping a Website Using Roborabbit

You will need a Roborabbit account to follow this tutorial. If you don't have one, you can create a free account.

1. Create a Task

After logging in to your account, go to Tasks and create a new task.

create a task

2. Enter the Starting URL

Enter the starting URL for your task and click the “Save” button.

The "Go" action will be automatically added as the first step of the task to visit the starting URL:

3. Add Steps - Save Structured Data

Click “Add Step” to add the second step. For this step, we will need to specify:

i) A parent container (in Helper)

ii) The children HTML elements that hold the data (in Data Picker)

the parent container and children HTML elements of the Browserbear playground job board

Note: The children HTML elements must be contained within the parent container.

To locate the parent container and the children HTML elements, we will need to pass a helper config to Roborabbit. The helper config is a JSON object that contains the XPath and the HTML snippet of the elements.

We will use the Roborabbit Helper Chrome extension to help us get the helper config. Install and pin it on your browser.

Browserbear Helper Chrome extension

In Roborabbit, select “Save Structured Data” from the Action selection and you will see a textbox for Element. This is where we will add the helper config that points to the parent container.

step 2 (save structured data)

On the job board, click on the extension icon and hover your mouse over the card. You should see a blue box surrounding it.

using the Browserbear Helper extension

Click on the card to get the helper config. Copy and paste it into the Element textbox in your Roborabbit dashboard.

the helper config that points to the parent container

Roborabbit should list the potential tags that can be used to locate the selected element, with the recommended tag highlighted (article[class="job"]):

pasting the helper config into the Helper

Select a tag from the list:

This will look for every HTML element with the article tag and the job class (parent container). However, it doesn’t tell Roborabbit which HTML element, in particular, to retrieve the data from.

Therefore, we need to further specify the children HTML elements in the Data Picker. On the job board, click on the job title to get the helper config that refers to it.

selecting the children HTML element (job title) using Browserbear Helper extension

Then, enter a name for the data (job_title), paste the config, and select a tag for the element from the tag list:

Click “Add Data” to add it to your structured data’s result (as shown in Data Preview):

Repeat the same process for the other text fields: company, location, and salary. For the link to the job, choose href to retrieve the URL.

retrieving the link from an HTML element

After selecting all the data, click the “Save” button to save the step.

4. Run the Task

Click “Run Task” to start scraping data from the URL. You will see the task running when you scroll to the bottom of the page.

the Browserbear task running

5. View the Log/Result

When the task has finished, you can click on the “Log” button to view the result.

task finished running

The result will be a Run Object that contains the outputs from various steps in your task. All outputs will be saved in the outputs object and named in the format of [step_id]_[action_name]:

the result (structured data) in the log
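The naming format above can be turned into a small lookup helper. This is a minimal sketch, not part of the Roborabbit API: the function names `outputKey` and `getStepOutput` are made up for illustration, and the example Run Object is hand-written with only the fields needed here.

```javascript
// Build the key under which a step's output is stored: [step_id]_[action_name].
function outputKey(stepId, actionName) {
  return `${stepId}_${actionName}`;
}

// Return the output of one step from a Run Object, or undefined if it is absent.
function getStepOutput(run, stepId, actionName) {
  const outputs = run.outputs ?? {};
  return outputs[outputKey(stepId, actionName)];
}

// Hand-written Run Object for illustration only (not a real API response).
const exampleRun = {
  outputs: {
    '4JN3vLbM3MV6BpV9Ya_save_structured_data': [
      { job_title: 'Education Representative', company: 'Wildebeest Group' },
    ],
  },
};

const jobs = getStepOutput(exampleRun, '4JN3vLbM3MV6BpV9Ya', 'save_structured_data');
console.log(jobs.length); // 1
```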

Congrats, you have created your first web scraping task on Roborabbit and run it successfully!

Scraping Multiple Pages

If the website that you want to scrape is paginated, you can add a few more steps to the task to scrape data from the next page.

Add a new step to the task and select the “Click” action. As before, use the Roborabbit Helper extension and click on the button that brings you to the next page to get the config.

adding a "click" step

Next, we will add a new step for repeating the “Save Structured Data” step.

Select the “Go To Step” action and choose “save_structured_data”. Then, enter how many times you want to repeat the selected step.

adding "go_to_step"

The task will now look like this:

an overview of the steps of the task

Running the Task Using REST API

Once you have created a task and run it successfully from your Roborabbit dashboard, you can trigger it by making a POST request to the Roborabbit API.

You will need the API Key and Task ID:

retrieving the Browserbear API Key

retrieving the Task ID

After getting the API Key and Task ID, you can make a POST request to the API:

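// Placeholders: replace with the API Key and Task ID from your dashboard.
const API_KEY = 'your_api_key';
const TASK_UID = 'your_task_id';

// Trigger a run of the task. `body` must be a JSON string.
async function runTask(body) {
  const res = await fetch(`https://api.roborabbit.com/v1/tasks/${TASK_UID}/runs`, {
    method: 'POST',
    body: body,
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${API_KEY}`,
    },
  });

  // The response is the Run Object for the newly started run.
  return await res.json();
}

const run = await runTask(body);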

The Roborabbit API is asynchronous. To receive the result, you can use either webhook or polling.
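Polling can be sketched as a loop that keeps fetching the run until it is no longer running. This is a minimal sketch under assumptions: `waitForRun`, `sleep`, and the `fetchRun` callback are illustrative names, not part of the Roborabbit API, and the loop only relies on the 'running' status shown in this article rather than assuming the names of other statuses.

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Poll until the run leaves the 'running' status or the attempts run out.
// `fetchRun` is a caller-supplied async function: (runId) => Run Object,
// for example a function that sends a GET request for the run.
async function waitForRun(fetchRun, runId, { intervalMs = 2000, maxAttempts = 30 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const run = await fetchRun(runId);
    if (run.status !== 'running') return run;
    // Wait before the next attempt, but not after the last one.
    if (attempt < maxAttempts) await sleep(intervalMs);
  }
  throw new Error(`Run ${runId} is still running after ${maxAttempts} attempts`);
}
```

The interval and attempt limit are arbitrary defaults; choose values that match how long your task usually takes.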

You will need the ID of the run to get the result via API. As the result of the ‘save_structured_data’ step will be saved as an array with its name following the format of [step_id]_save_structured_data, you need to get the step’s ID too.

Both of them can be retrieved from the POST request’s response:

{
  created_at: '2024-06-18T07:58:00.717Z',
  video_url: null,
  webhook_url: null,
  metadata: null,
  uid: 'Oje6ZEm5pa1jQxLMwV',
  steps: [
    {
      action: 'go',
      enabled: true,
      uid: 'lYXA8Kz1LMo5BLV94N',
      config: [Object]
    },
    {
      action: 'save_structured_data',
      enabled: true,
      uid: '4JN3vLbM3MV6BpV9Ya',
      config: [Object]
    }
  ],
  status: 'running',
  finished_in_seconds: null,
  percent_complete: 0,
  self: 'https://api.roborabbit.com/v1/tasks/a9RAJYEbXapBo3D7q4/runs/Oje6ZEm5pa1jQxLMwV',
  task: 'a9RAJYEbXapBo3D7q4',
  outputs: []
}
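Both IDs can be read from the response with a few lines of code. The sketch below uses a trimmed copy of the response above; `findStepIds` is a hypothetical helper name, not an API function.

```javascript
// Return the uid of every step in a run that uses the given action.
function findStepIds(run, action) {
  return (run.steps ?? [])
    .filter((step) => step.action === action)
    .map((step) => step.uid);
}

// Trimmed copy of the POST response shown above.
const postResponse = {
  uid: 'Oje6ZEm5pa1jQxLMwV',
  steps: [
    { action: 'go', uid: 'lYXA8Kz1LMo5BLV94N' },
    { action: 'save_structured_data', uid: '4JN3vLbM3MV6BpV9Ya' },
  ],
};

const runId = postResponse.uid;
const [stepId] = findStepIds(postResponse, 'save_structured_data');
console.log(runId, `${stepId}_save_structured_data`);
// prints: Oje6ZEm5pa1jQxLMwV 4JN3vLbM3MV6BpV9Ya_save_structured_data
```

If the task contains more than one "Save Structured Data" step, `findStepIds` returns all of their IDs in step order.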

To get the result, send a GET request to the API:

async function getRun(runId) {
  const res = await fetch(`https://api.roborabbit.com/v1/tasks/${TASK_UID}/runs/${runId}`, {
    method: 'GET',
    headers: {
      Authorization: `Bearer ${API_KEY}`,
    },
  });

  return await res.json();
}

const runResult = await getRun(run.uid);
const structuredData = runResult.outputs[`${SAVE_STRUCTURED_DATA_STEP_ID}_save_structured_data`];

When you log structuredData, the array will contain items like these:

[
  {
    job_title: 'Education Representative',
    company: 'Wildebeest Group',
    location: 'Angola',
    link: '/jobs/P6AAxc_iWXY-education-representative/',
    salary: '$51,000 / year'
  },
  {
    job_title: 'International Advertising Supervisor',
    company: 'Fix San and Sons',
    location: "Democratic People's Republic of Korea",
    link: '/jobs/_j_CPB1RFk0-international-advertising-supervisor/',
    salary: '$13,000 / year'
  },
  {
    job_title: 'Farming Strategist',
    company: 'Y Solowarm LLC',
    location: 'Poland',
    link: '/jobs/ocXapDzGUOA-farming-strategist/',
    salary: '$129,000 / year'
  }
  ...
]
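The link values in the result are relative paths. If absolute URLs are needed, they can be resolved against the site's origin with the standard URL constructor. This is a minimal sketch: `withAbsoluteLinks` is an illustrative helper, and `https://example.com` is a placeholder for the origin of the site you scraped.

```javascript
// Convert each job's relative `link` into an absolute URL.
// `baseUrl` is the origin of the scraped site (a placeholder is used below).
function withAbsoluteLinks(jobs, baseUrl) {
  return jobs.map((job) => ({ ...job, link: new URL(job.link, baseUrl).href }));
}

const sample = [
  { job_title: 'Farming Strategist', link: '/jobs/ocXapDzGUOA-farming-strategist/' },
];

const resolved = withAbsoluteLinks(sample, 'https://example.com');
console.log(resolved[0].link);
// prints: https://example.com/jobs/ocXapDzGUOA-farming-strategist/
```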

🐻 View the full code that shows you how to receive the result using both webhook and API polling on GitHub.

Conclusion

Congratulations! You have successfully learned how to scrape data from a website using Roborabbit. With the steps outlined above, you can scrape data from a website with a few clicks and minimal coding. To learn more about the Roborabbit API, refer to the API Reference.

🐻 If you'd like to learn the advanced method of using Roborabbit to scrape a website, continue with How to Scrape Data from a Website Using Roborabbit (Part 2).
