How to Use Puppeteer in AWS Lambda for Web Scraping via Serverless

In this article, we’ll explore how AWS Lambda’s serverless architecture can simplify the deployment of your Puppeteer tasks and walk you through the process of setting up Puppeteer in AWS Lambda using the Serverless framework.

by Josephine Loo · August 2024

Puppeteer is a versatile Node.js library created by Google that offers a high-level API for controlling Chrome or Chromium browsers. It's especially useful for automating browser-based tasks, handling repetitive actions, and extracting data from websites.

With Puppeteer, you can programmatically interact with web pages, fill out forms, navigate through websites, and extract information with ease. It offers a robust suite of features, including running Chrome in headless mode (without a graphical user interface), simulating user interactions, capturing screenshots, and handling dynamic content on JavaScript-heavy sites.

Integrating Puppeteer with AWS Lambda lets you run automation tasks, including web scraping, in a serverless environment, which helps you scale and manage costs more effectively. In this guide, we'll show you how to use Puppeteer in AWS Lambda to streamline and enhance your automation processes, using the Serverless Framework.

What is AWS Lambda

AWS Lambda is a serverless computing service offered by Amazon Web Services (AWS). It lets you run code in response to events but eliminates the need to manage your own infrastructure. All you need to do is write a script and upload it as a .zip file or container image to AWS Lambda.

AWS Lambda.png

Here's what AWS Lambda offers:

Serverless - You don't need to manage the underlying infrastructure. Your application still runs on servers, but all the server management is done by AWS.
Automatic scaling - AWS Lambda automatically scales up or down based on the number of incoming requests. There's no need to manually adjust server capacity.
Pay-as-you-go pricing - You only pay for the number of requests and the compute time you use, billed by the millisecond. There are no charges for idle time.
Wide language support - You can write your code in Java, Go, PowerShell, Node.js, C#, Python, and Ruby. If you need to use a different language, the Runtime API allows for that as well.

The code you run on AWS Lambda is uploaded as a “Lambda function”. Each function has its own configuration details, including its name, description, entry point, and resource requirements.

🐰 Hare Hint: The code must be written in a “stateless” style, meaning it should not rely on any previous interactions or stored data from prior executions.
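To make the stateless idea concrete, here is a minimal sketch. The `previousVisits` field and all function names are hypothetical and chosen only for illustration. The second version takes everything it needs from the incoming event instead of relying on memory left over from an earlier run.

```javascript
// Stateful (avoid): a module-level counter only lives as long as the current
// execution environment, which Lambda may reuse or replace at any time.
let visitCount = 0;
function nextVisitCountStateful() {
  visitCount += 1;
  return visitCount;
}

// Stateless: the result depends only on the event that was passed in.
// `previousVisits` is a hypothetical field supplied by the caller.
function nextVisitCount(event) {
  const previous = Number(event.previousVisits ?? 0);
  return previous + 1;
}

// Hypothetical handler built on the stateless helper
// (in index.js this would be assigned to module.exports.handler).
const handler = async (event) => ({
  statusCode: 200,
  body: JSON.stringify({ visits: nextVisitCount(event) }),
});
```

If a task genuinely needs to remember something between runs, that data would normally live in an external store rather than in the function itself.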

Why Use Puppeteer in AWS Lambda

AWS Lambda’s serverless architecture makes deploying and scaling your Puppeteer tasks easier by eliminating the need to manage the underlying infrastructure. This speeds up development cycles and helps new features to be shipped faster. Additionally, AWS Lambda automatically scales to accommodate high volumes of concurrent requests and charges only for the compute time you use. This can be more cost-effective than maintaining a traditional server, especially for sporadic tasks like web scraping, price tracking, and capturing website screenshots.

Beyond cost savings, Lambda functions can also run on a schedule. You can trigger functions at regular intervals (e.g., every hour) or at specific times (e.g., 8:00 a.m. PST) using Amazon EventBridge (formerly CloudWatch Events). We’ll dive deeper into this in a later section.

Pre-requisites

For this tutorial, you will need:

Basic knowledge of AWS Lambda and the Serverless Framework
An AWS account
Node.js, npm, and serverless installed on your local machine

🐰 Hare Hint: Follow the official guide to link your Serverless and AWS accounts.

How to Use Puppeteer in AWS Lambda via Serverless

Step 1. Set Up Your Serverless Project

Create a Serverless project by running the command below in your terminal/command line:

serverless

You will be prompted to choose one of the Serverless templates. Navigate to AWS - Node.js - HTTP API and press enter:

Serverless CLI - choosing the type of project.png

Enter a name for the project:

Serverless CLI - entering a name for the project.png

Add the project to your organization in Serverless:

Serverless CLI - adding the name to an organization.png

Finally, add the project to an existing app that you have created from your Serverless dashboard or create a new one:

Serverless CLI - adding the project to an app.png

Choose either of the options (Y/n) when you’re prompted to deploy the project to Serverless.

A folder should be created for the project, with .gitignore, index.js, README.md, and serverless.yml files inside:

Serverless project structure screenshot.png

Next, navigate to your project folder, initialize a new Node.js project, and install the puppeteer-core and @sparticuz/chromium libraries:

cd puppeteer-lambda-serverless
npm init
npm install puppeteer-core @sparticuz/chromium@126

🐰 Hare Hint: Check Puppeteer’s documentation to find the compatible version of @sparticuz/chromium for your Puppeteer version.

Step 2. Configure Your Lambda Function

The serverless.yml file is the main configuration file for the Serverless Framework. It defines the service, functions, and resources for your project. Update this file by setting a timeout of 10 seconds (default is 6) and including files from the node_modules folder in the deployment package:

org: your-organization-name
app: puppeteer-lambda-serverless
service: puppeteer-lambda-serverless
frameworkVersion: '3'

provider:
  name: aws
  runtime: nodejs18.x
  timeout: 10 
  
package:
  include:
    - node_modules/**
  
functions:
  api:
    handler: index.handler
    events:
      - httpApi:
          path: /
          method: get

🐰 Hare Hint: The node_modules folder holds all the dependencies required for your Lambda function.

In the functions section of the configuration file, you'll define your Lambda function and set up its trigger. The handler specifies which function from the index.js file will run when the Lambda function is invoked. In the events section, you'll specify the event that triggers the Lambda function.

In our configuration file, a Lambda function named api is defined. This function will execute the handler() function from index.js whenever a GET request is made to the / endpoint.

functions:
  api:
    handler: index.handler
    events:
      - httpApi:
          path: /
          method: get
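It can help to see roughly what the handler receives when that GET request arrives. The sketch below assumes the HTTP API payload format 2.0 and uses a trimmed-down sample event (a real event carries many more fields). `describeRequest` is a hypothetical helper, not part of any library.

```javascript
// Summarize an incoming HTTP API event as "<METHOD> <path>".
function describeRequest(event) {
  const http = event?.requestContext?.http ?? {};
  return `${http.method ?? "UNKNOWN"} ${event?.rawPath ?? "/"}`;
}

// Trimmed-down sample of the event for a GET request to "/"
const sampleEvent = {
  version: "2.0",
  rawPath: "/",
  queryStringParameters: { url: "https://example.com" },
  requestContext: { http: { method: "GET", path: "/" } },
};

console.log(describeRequest(sampleEvent)); // prints "GET /"
```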

Step 3. Write the Handler Script

The index.js file serves as the entry point for your AWS Lambda functions, also known as the handler script. It contains the code that runs when your function is triggered.

At the beginning of the file, start by importing @sparticuz/chromium and puppeteer-core. Then, in the handler() function, add the following code to perform a simple task with Puppeteer—visiting a website and returning its title:

const chromium = require("@sparticuz/chromium");
const puppeteer = require('puppeteer-core');

module.exports.handler = async (event) => {
  let browser = null;
  let page = null;
  try {
    
    // Assign to the outer `browser` (no `const`) so the finally block can close it
    browser = await puppeteer.launch({
      args: chromium.args,
      defaultViewport: chromium.defaultViewport,
      executablePath: await chromium.executablePath(),
      headless: chromium.headless,
    });
    
    page = await browser.newPage();
    await page.goto('https://www.roborabbit.com');
    const title = await page.title();
    
    console.log(title);

    return {
      statusCode: 200,
      body: JSON.stringify(
        {
          result: title,
          input: event,
        },
        null,
        2
      ),
    };
  } catch (error) {
    return {
      statusCode: 500,
      body: JSON.stringify({ error: error.message }),
    };
  } finally {
    if (page) {
      await page.close();
    }
    if (browser) {
      await browser.close();
    }
  }
};
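The handler above always visits the same page. One possible way to make it reusable is to read the target from a query string parameter, as sketched below. The `url` parameter name and the `resolveTargetUrl` helper are assumptions made for this example. Input is validated with Node's built-in `URL` class, and anything that is not a plain http(s) address falls back to the default.

```javascript
const DEFAULT_URL = "https://www.roborabbit.com";

// Pick the page to visit from ?url=..., falling back to a default.
function resolveTargetUrl(event, fallback = DEFAULT_URL) {
  const candidate = event?.queryStringParameters?.url;
  if (!candidate) return fallback;

  let parsed;
  try {
    parsed = new URL(candidate); // throws on malformed input
  } catch {
    return fallback;
  }

  // Only allow web pages, not file:, data:, or other schemes
  return ["http:", "https:"].includes(parsed.protocol) ? parsed.href : fallback;
}
```

Inside the handler, `await page.goto(resolveTargetUrl(event))` would then replace the hard-coded address.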

Step 4. Deploy the Function

Run the following command to deploy your function to AWS Lambda using Serverless:

serverless deploy

The output will include details about your deployed service, including a link to your app on the Serverless dashboard and the function’s API Gateway endpoint:

serverless deploy output copy.png

Step 5. Test Your Function

Use the endpoint URL to test your Lambda function. You can make a GET request to the endpoint using tools like curl, Postman, or even a web browser:

curl https://your-api-id.execute-api.region.amazonaws.com/

You should receive a JSON response containing the result:

AWS Lambda function run result.png
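If you would rather call the endpoint from a script, the sketch below shows one way to pull the title out of the response. It relies on the body shape returned by the handler in Step 3 (`result` on success, `error` on failure). The helper names are hypothetical, and `fetchTitle` assumes Node.js 18 or later for the global `fetch`.

```javascript
// Extract the page title from the JSON body returned by the handler.
function extractTitle(responseBody) {
  const parsed = JSON.parse(responseBody);
  if (typeof parsed.result !== "string") {
    throw new Error(parsed.error ?? "Response did not contain a result");
  }
  return parsed.result;
}

// Call the deployed endpoint and return the scraped title.
async function fetchTitle(endpoint) {
  const response = await fetch(endpoint); // global fetch, Node.js 18+
  return extractTitle(await response.text());
}

// Example usage (replace the placeholder with your own endpoint):
// fetchTitle("https://your-api-id.execute-api.region.amazonaws.com/").then(console.log);
```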

That’s how to use Puppeteer in AWS Lambda via Serverless! You can customize the code to perform additional tasks with Puppeteer, such as scraping data from multiple pages, capturing website screenshots, downloading images, and more.

🐰 Hare Hint: If you encounter an error, check your logs on AWS CloudWatch or your Serverless dashboard for debugging.
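Anything the function writes with `console.log` or `console.error` ends up in CloudWatch Logs, so it can be worth logging errors before returning them. Below is a small sketch of a JSON log line. `formatLogLine` and its field names are hypothetical, not an AWS convention.

```javascript
// Build a single-line JSON log entry that is easy to search in CloudWatch.
function formatLogLine(level, message, details = {}) {
  return JSON.stringify({ level, message, ...details });
}

// Example: log a failure before returning the 500 response
console.error(
  formatLogLine("error", "Navigation failed", { url: "https://www.roborabbit.com" })
);
```

In the handler from Step 3, a call like this could go at the top of the `catch` block.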

Automating Puppeteer in AWS Lambda

AWS lets you create rules or scheduled events that trigger Lambda functions at specified intervals or times. You can set your function to run at a regular rate by defining a rate expression:

rate(value unit)

…or at a specific time by defining a cron expression:

cron(fields)
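As a rough illustration of both formats, the helpers below build the expression strings. The function names are hypothetical. The sketch relies on two details of EventBridge schedules: rate units are singular when the value is 1 and plural otherwise, and cron expressions take six fields (minutes, hours, day-of-month, month, day-of-week, year) evaluated in UTC.

```javascript
const RATE_UNITS = ["minute", "hour", "day"];

// Build a rate expression such as "rate(5 minutes)" or "rate(1 hour)".
function rateExpression(value, unit) {
  if (!Number.isInteger(value) || value < 1) {
    throw new RangeError("value must be a positive integer");
  }
  if (!RATE_UNITS.includes(unit)) {
    throw new RangeError(`unit must be one of: ${RATE_UNITS.join(", ")}`);
  }
  return `rate(${value} ${value === 1 ? unit : `${unit}s`})`;
}

// Build a cron expression that fires once a day at the given UTC time.
// "?" is used for day-of-week because day-of-month is already specified.
function dailyCronUtc(hour, minute = 0) {
  if (!Number.isInteger(hour) || hour < 0 || hour > 23) {
    throw new RangeError("hour must be an integer from 0 to 23");
  }
  if (!Number.isInteger(minute) || minute < 0 || minute > 59) {
    throw new RangeError("minute must be an integer from 0 to 59");
  }
  return `cron(${minute} ${hour} * * ? *)`;
}

console.log(rateExpression(5, "minute")); // prints "rate(5 minutes)"
console.log(dailyCronUtc(16)); // prints "cron(0 16 * * ? *)", i.e. 8:00 a.m. PST
```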

To set up a scheduled event, set the trigger event to schedule: rate() or schedule: cron() in your configuration file:

functions:
  api:
    handler: index.handler
    events:
      - schedule: rate(5 minutes)

The configuration above will trigger the AWS Lambda function every 5 minutes, unlike the previous setup, which only triggered the function in response to a GET request to the specified endpoint.

Here’s a screenshot of the CloudWatch log showing the function running at 5-minute intervals:

AWS CloudWatch log.png
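One consequence of switching triggers is that the handler no longer receives an HTTP request. A scheduled invocation delivers an event whose `source` is `aws.events` and whose `detail-type` is `Scheduled Event`. The sketch below (with a hypothetical helper name and a trimmed-down sample event) shows how a handler could tell the two apart.

```javascript
// Return true when the event came from a schedule rather than an HTTP request.
function isScheduledEvent(event) {
  return event?.source === "aws.events" && event?.["detail-type"] === "Scheduled Event";
}

// Trimmed-down sample of a scheduled event
const scheduledSample = {
  version: "0",
  "detail-type": "Scheduled Event",
  source: "aws.events",
  detail: {},
};

console.log(isScheduledEvent(scheduledSample)); // prints true
console.log(isScheduledEvent({ rawPath: "/" })); // prints false
```

Since nobody reads the return value of a scheduled run, this is usually where the function would store or forward its scraped data instead of returning it.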

Limitations of Using Puppeteer in AWS Lambda

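AWS Lambda makes deploying a Puppeteer script straightforward and handles scaling automatically. However, it limits .zip deployment packages to 50 MB for direct uploads (250 MB unzipped). That can be restrictive when the package has to bundle a Chromium binary, and it becomes a real limitation for larger or more complex tasks.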
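To see how close a project is to the 50 MB limit, you could check the size of the packaged .zip before deploying. The sketch below assumes the artifact was built with `serverless package` and sits at `.serverless/puppeteer-lambda-serverless.zip`. That path and the helper name are assumptions, so adjust them to your project.

```javascript
const fs = require("node:fs");

const DIRECT_UPLOAD_LIMIT_BYTES = 50 * 1024 * 1024; // the 50 MB limit mentioned above

// Report the package size in MB and whether it fits under the limit.
function checkPackageSize(sizeBytes, limitBytes = DIRECT_UPLOAD_LIMIT_BYTES) {
  return {
    sizeMb: Number((sizeBytes / (1024 * 1024)).toFixed(1)),
    withinLimit: sizeBytes <= limitBytes,
  };
}

// Assumed artifact location; only checked if the file exists
const zipPath = ".serverless/puppeteer-lambda-serverless.zip";
if (fs.existsSync(zipPath)) {
  console.log(checkPackageSize(fs.statSync(zipPath).size));
}
```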

An alternative is to use a tool like Roborabbit, which offers similar capabilities to Puppeteer in AWS Lambda but with an easier setup. It is a scalable, AWS-powered browser automation tool designed to simplify various browser-based tasks.

Roborabbit features a user-friendly interface that lets you create automated workflows and interact with web pages effortlessly. With Roborabbit, you can automate website testing, scrape website data, take scheduled screenshots for archiving, and more—without writing any code.

For example, here’s how you would set up a web scraping task in Roborabbit:

While Roborabbit provides a user-friendly no-code interface, it also offers REST APIs for developers who want to trigger their task programmatically or integrate it with their existing application or workflow. If you’re interested in trying it out, just click on the task above to sign up and add it to your project!
