
Beautiful Soup Tutorial: How To Use the find_all() and find() Methods

Beautiful Soup provides simple methods like find_all() and find() for navigating, searching, and modifying an HTML/XML parse tree. This article will show you how to use them to extract information from HTML/XML.
by Josephine Loo


    Web scraping is a useful technique for extracting information from the internet, especially when the amount of data is huge. If you’re scraping websites with Python, you should learn how to use Beautiful Soup, one of the most popular Python libraries for web scraping.

    In this article, we’ll learn how to use Beautiful Soup’s find_all() and find() methods, which are essential for locating elements and extracting data in the web scraping process.

    What is Beautiful Soup

    Beautiful Soup is a Python library for scraping data from HTML and XML files. It transforms complex HTML/XML documents into a Python object tree and provides simple methods to help you navigate, search, and modify the tree:

    • Navigate - Down (.head, .title, .body, etc.), up (.parent, .parents), sideways (.next_sibling, .previous_sibling, etc.), back and forth (.next_element, .previous_element, etc.)
    • Search - find_all(), find(), find_next(), etc.
    • Modify - append(), extend(), insert(), clear(), etc.

    You can pass HTML/XML code as a string into the BeautifulSoup constructor to make a BeautifulSoup object:

    from bs4 import BeautifulSoup
    
    html = """
    <html>
      <body>
        <p>This is an example of an HTML file.</p>
      </body>
    </html>
    """
    
    soup = BeautifulSoup(html, 'html.parser')
    

    or use requests to retrieve the HTML code from a URL:

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://www.roborabbit.com/blog/'
    response = requests.get(url)
    
    if response.status_code == 200:
        html = response.content
    else:
        raise SystemExit(f"Failed to retrieve the page. Status code: {response.status_code}")
    
    soup = BeautifulSoup(html, 'html.parser')
    

    By default, Beautiful Soup parses documents as HTML. To parse a document as XML, you need to have the lxml parser installed and pass in “xml” as the second argument to the BeautifulSoup constructor:

    soup = BeautifulSoup(xml, "xml")
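    
    The “xml” parser is provided by the lxml library, which you can install with pip:
    
    pip install lxml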
    

    Then, you can use any of the methods provided to traverse the document and retrieve any information from it.
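    
    For example, here’s a quick sketch of navigating the parse tree with some of the attributes mentioned above:
    
    from bs4 import BeautifulSoup
    
    html = "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
    soup = BeautifulSoup(html, 'html.parser')
    
    first_p = soup.body.p             # navigate down to the first <p>
    print(first_p.text)               # First paragraph.
    print(first_p.parent.name)        # body (navigate up)
    print(first_p.next_sibling.text)  # Second paragraph. (navigate sideways)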

    How to Install Beautiful Soup

    You can install the latest version of Beautiful Soup (Beautiful Soup 4) in your Python project using pip by running the command below in your project directory:

    pip install beautifulsoup4
    

    If you’re using Python 3.x, replace pip in the command above with pip3.

    ❗️Note: Beautiful Soup 3 is no longer developed or supported.

    Then, import the library into your Python project and create a BeautifulSoup object:

    from bs4 import BeautifulSoup
    
    html = """
    <html>
      <body>
        <p>This is an example of an HTML file.</p>
      </body>
    </html>
    """
    
    soup = BeautifulSoup(html, 'html.parser')
    

    Using the find_all() Method

    The find_all() method searches the parse tree for elements and returns a list containing all the results. To find specific elements, you can pass a filter to the method to match elements by their tag names, attributes, text content, or a combination of these.

    Let’s find some elements from the HTML code below:

    from bs4 import BeautifulSoup
    
    html = """
    <html>
      <body>
        <p class="first" id="first_p">This is the first paragraph.</p>
        <p class="second" id="second_p">This is the second paragraph.</p>
        <div>
          <p class="third">Here is the third paragraph.</p>
        </div>
        <input name="email"/>
      </body>
    </html>
    """
    
    soup = BeautifulSoup(html, 'html.parser')
    

    By Tag Name

    You can find elements by their tag names by passing the name to the find_all() method. The code below will find all <p> elements from the HTML code and return them as a list:

    results = soup.find_all('p')
    
    for result in results:
      print(result.text)
    
    # This is the first paragraph.
    # This is the second paragraph.
    # Here is the third paragraph.
    

    To find multiple different elements, pass in the tag names as a list:

    results = soup.find_all(['div', 'input'])
    
    for result in results:
        if result.text:
            print(result.text)
        else:
            print(result)
    
    # Here is the third paragraph.
    # <input name="email"/>
    

    🐰 Hare Hint: As find_all() is the most popular method in the Beautiful Soup search API, you can use a shortcut to find elements by treating the BeautifulSoup object as a function, e.g. soup('p').
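    
    For example, these two lines are equivalent:
    
    results = soup.find_all('p')
    results = soup('p')  # calling the object directly is a shortcut for find_all()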

    By Class Name

    You can also find elements using the class name. As “class” is a reserved word in Python, you need to find elements by their class names using the keyword argument class_:

    results = soup.find_all(class_='first')
    
    for result in results:
      print(result.text)
    
    # This is the first paragraph.
    

    🐰 Hare Hint: If you want to search for elements that match two or more CSS classes, use a CSS selector (e.g. soup.select(".first-class.second-class")) instead.
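    
    For example (a quick sketch; the class names here are hypothetical):
    
    # matches only elements that have BOTH classes in their class attribute
    results = soup.select('.first-class.second-class')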

    By ID

    To find elements based on the ID, use the keyword argument id:

    results = soup.find_all(id='second_p')
    
    for result in results:
      print(result.text)
    
    # This is the second paragraph.
    

    By Name

    You can also find elements using their name attribute, with the keyword argument attrs. As Beautiful Soup uses the name argument to contain the tag name, you can’t pass name as a keyword argument. Instead, pass "name" as a key in the attrs dictionary:

    results = soup.find_all(attrs={"name": "email"})
    
    for result in results:
      print(result)
    
    # <input name="email"/>
    

    By String

    HTML elements can also be found using their text content. To find elements that contain certain text, pass it in the string argument. However, searching with the string argument alone returns only the matching text (the element’s inner content), not the HTML element itself:

    results = soup.find_all(string="Here is the third paragraph.")
    
    for result in results:
      print(result)
    
    # Here is the third paragraph.
    

    To find the actual HTML element, you need to use it with other filters:

    results = soup.find_all('p', string="Here is the third paragraph.")
    
    for result in results:
      print(result)
    
    # <p class="third">Here is the third paragraph.</p>
    

    The string argument matches the text content exactly. To match the text content partially, pass in the string as a regular expression (this requires Python’s built-in re module):

    results = soup.find_all("p", string=re.compile("third"))
    
    for result in results:
      print(result.text)
    
    # Here is the third paragraph.
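    
    Besides strings and regular expressions, find_all() also accepts a function as a filter. The function receives a tag and should return True if the tag matches. For example, here’s a quick sketch that finds <p> elements whose text contains the word “third” without using re:
    
    results = soup.find_all(lambda tag: tag.name == 'p' and 'third' in tag.text)
    
    for result in results:
      print(result.text)
    
    # Here is the third paragraph.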
    

    Combining Multiple Filters

    If you want to retrieve only a certain number of the matched elements, you can specify it using the limit argument. For example, the code below will return only the first <p> element:

    results = soup.find_all('p', limit=1)
    
    for result in results:
      print(result.text)
    
    # This is the first paragraph.
    

    You can also combine tag name and class name to find elements that match both filters:

    results = soup.find_all('p', 'first')
    # or 
    # results = soup.find_all('p', class_='first')
    
    for result in results:
      print(result.text)
    
    # This is the first paragraph.
    

    The same goes for ID:

    # Find all <p> elements whose id is 'second_p'.
    results = soup.find_all('p', id='second_p')
    
    for result in results:
      print(result.text)
    
    # This is the second paragraph.
    
    # Find all <p> elements that have an id attribute.
    results = soup.find_all('p', id=True)
    
    for result in results:
      print(result.text)
    
    # This is the first paragraph.
    # This is the second paragraph.
    

    Using the find() Method

    The find_all() method scans the entire document to look for matching elements. If you only want a single element, besides using the limit argument in the find_all() method, you can also use the find() method:

    result = soup.find('p', class_='second')
    
    print(result.text)
    
    # This is the second paragraph.
    

    🐰 Hare Hint: Although both approaches return a single element, the find_all() method returns it in a list while the find() method returns the element directly.
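    
    Another difference: when nothing matches, find_all() returns an empty list, while find() returns None:
    
    print(soup.find_all('h1'))  # []
    print(soup.find('h1'))      # None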

    Scraping Websites with Beautiful Soup

    Using Beautiful Soup’s find_all() and find() methods, you can scrape information from a website by extracting the text content from the found elements. For example, the code below scrapes the job title, URL, company name, salary, and location from this job board:

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://playground.roborabbit.com/jobs/'
    response = requests.get(url)
    
    if response.status_code == 200:
        html = response.content
    else:
        raise SystemExit(f"Failed to retrieve the page. Status code: {response.status_code}")
        
    soup = BeautifulSoup(html, 'html.parser')
    
    # Find all job listings on the page
    job_listings = soup.find_all(class_='job card')
    
    for job_listing in job_listings:
    
        job_title = job_listing.find('a').text
        job_link = job_listing.find('a')['href']
        company_name = job_listing.find(class_='company').text
        salary = job_listing.find(class_='salary').text
        location = job_listing.find(class_='location').text
    
        print(f"Job Title: {job_title}")
        print(f"Job Link: {job_link}")
        print(f"Company Name: {company_name}")
        print(f"Salary: {salary}")
        print(f"Location: {location}")
        print("\n")
    

    Result:

    Job Title: Farming Executive
    Job Link: /jobs/7dJRGX9wgRQ-farming-executive/
    Company Name: Job Inc
    Salary: $113,000 / year
    Location: Fiji
    
    Job Title: Accounting Orchestrator
    Job Link: /jobs/9XxqTZH25WQ-accounting-orchestrator/
    Company Name: Mat Lam Tam Group
    Salary: $15,000 / year
    Location: Israel
    
    Job Title: Central Accounting Developer
    Job Link: /jobs/0fIJzIQwf0c-central-accounting-developer/
    Company Name: Home Ing and Sons
    Salary: $10,000 / year
    Location: Afghanistan
    
    ...
    

    Beautiful Soup is designed for parsing and navigating static HTML or XML content. If you want to scrape data from websites that load content dynamically based on user action, Beautiful Soup alone may not be sufficient. You need to use a browser automation tool such as Selenium, Playwright, or Roborabbit to navigate the website and perform browser actions like clicking, selecting dropdown options, and submitting forms.
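    
    For example, here’s a minimal sketch that uses Selenium to render the page in a real browser before handing the HTML to Beautiful Soup (it assumes the selenium package and Chrome are installed):
    
    from selenium import webdriver
    from bs4 import BeautifulSoup
    
    driver = webdriver.Chrome()
    driver.get('https://playground.roborabbit.com/jobs/')
    
    # the browser executes JavaScript, so dynamically loaded content is included
    html = driver.page_source
    driver.quit()
    
    soup = BeautifulSoup(html, 'html.parser')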

    Using Roborabbit for Web Scraping

    Unlike other browser automation tools, Roborabbit allows you to automate the browser and scrape data without coding. All you need to do is create an automation task in Roborabbit and add actions like navigating, clicking, saving data, etc. to the task to scrape data from a website.

    Here’s an example of a Roborabbit web scraping task that scrapes data from the previous job board:

    Although creating such a task doesn’t require coding, you can also integrate the task into your coding project using Roborabbit’s API. By making a POST request to the API, you can run the task from your code:

    import requests
    
    # task_uid and api_key are the task's UID and your API key from the Roborabbit dashboard
    post_url = f"https://api.roborabbit.com/v1/tasks/{task_uid}/runs"
    
    headers = {
      'Authorization' : f"Bearer {api_key}"
    }
    
    response = requests.post(post_url, headers=headers)
    

    Not only that, you can also run your task with some modifications by passing in a data object to overwrite the original configurations. For example, you can overwrite the "Go" step's URL by passing in a new URL when calling the API:

    data = {
      "steps": [
        {
          "uid": "GNqV2ngBmly7O9dPRe",
          "action": "go",
          "config": {
            "url": "https://newurl.com" # new URL
          }
        }
      ]
    }
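    
    Then, include the data object in the request body when making the POST request:
    
    response = requests.post(post_url, headers=headers, json=data)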
    

    This makes it convenient for web scraping tasks that require flexibility and dynamic adjustments.

    🐰 Hare Hint: To learn more about how to create a task like the example above, you can refer to How to Scrape Data from a Website Using Roborabbit.

    Conclusion

    The find_all() and find() methods make Beautiful Soup very easy to use for web scraping. However, when it comes to scraping websites that load content dynamically, combining it with a tool that can interact with the browser, like Selenium, Playwright, or Roborabbit, is recommended. So, make sure to identify your needs and requirements, and choose the tool that best suits them!

    P.S. If you would like to try out Roborabbit, click here to register a free account. 😉

    About the author: Josephine Loo
    Josephine is an automation enthusiast. She loves automating stuff and helping people to increase productivity with automation.

