How to Clean Scraped Data with Roborabbit

Scrubbing your scraped information improves accuracy, streamlines downstream processes, and produces a high-quality data set. Here's how you can clean your extracted data with Roborabbit.
by Julianne Youngberg · July 2024

    Raw data often needs cleaning and formatting before it can be used effectively. Depending on where you source the information, your data might be excessive, incomplete, or poorly organized. That’s why when you’re extracting data online, transforming the output to better suit your needs is a typical part of processing.

    Cleaned and formatted data is important because it:

    • Improves accuracy
    • Prepares information for analysis
    • Integrates more seamlessly with other systems and tools
    • Looks more presentable and user-friendly

    When working with data extracted using Roborabbit, you may find that simple transformations can improve the quality of your output and prepare it for use. This guide will explain when and how you can use Roborabbit's built-in features to scrub your output feed.

    When to Clean Structured Data

    Scrubbing structured data can be largely automated at different stages of the process, whether it's within a database, through task automation, or at point of extraction. Here's a simplified overview of when you might consider each option:

    • Database : When complex data processing is required
    • Task automation tool : When a few tweaks are needed for a specific automated task
    • Web scraping tool : When standard output needs to be transformed slightly before being stored or used in a final product

    Knowing when to format extracted data is a matter of understanding your use case and the impact the decision has on workflow efficiency. For example, performing too many transformations in a single task automation may be more costly and time-consuming than a nested database formula. Similarly, using a database transformation for a minor adjustment could require more storage space than handling the data at point of extraction. The best decision depends on the volume and specifics of your data cleaning requirements.

    How to Use Roborabbit to Scrub Data

    To clean data with Roborabbit, you have to create a custom feed that transforms your output based on your specifications, then delivers it along with the rest of your task run results so you can route it to the storage or task process of your choice.

    Step 1. Navigate to Feeds

    Let’s start from a task with output that needs cleaning.

    From the task page, scroll down to Feeds, which is listed under the Integrations section. Click Settings.

    Step 2. Create a Custom Feed

    Now, you should be on the Feeds page, where you can manage one or more output feeds from your task data.

    Click Create a New Feed.

    Step 3. Add Fields to Builder

    You should now be on a page where you can set up your feed’s custom transformations and view the output.

    To apply transformations, you first need to add the fields that will be modified to the builder. Do this by specifying each field and choosing between text or number types. You can also add a target name if desired. Then, click Add Field to load it into the builder.

    Alternatively, you can click Add All Fields to auto-add all of your standard output components to the builder.

    Step 4. Apply Transformation(s)

    Now, you can make adjustments to your output. Click on the transformation counter next to the field you want to modify.

    Set up your transformation by choosing the transformation type.

    Then, fill out any specifications that might be required. Click Add Transformation.

    You can stack multiple transformations until the custom output is exactly what you need.
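
    Conceptually, a stack of transformations behaves like a pipeline: each step receives the output of the step before it, so order matters. The following is a minimal Python sketch of that idea, not Roborabbit's implementation. The name `apply_transformations` and the example steps are hypothetical.

```python
from functools import reduce
from typing import Callable, Iterable

Transformation = Callable[[str], str]

def apply_transformations(value: str, steps: Iterable[Transformation]) -> str:
    """Run each step in order, feeding every result into the next step."""
    return reduce(lambda current, step: step(current), steps, value)

# Hypothetical stack: trim whitespace, title-case, then prepend a label.
steps = [str.strip, str.title, lambda text: "Product: " + text]
cleaned = apply_transformations("  wireless MOUSE  ", steps)
print(cleaned)
```

    Swapping the order of the steps would change the result, which is why it helps to check the custom output after each transformation you add.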

    Hare Hint 🐰: Learn all about Roborabbit’s data transformation types and examples of when to use them in the next section of this guide!

    Step 5. Check Output

    It’s important to make sure all of your structured data is being processed correctly, so inspect your output by returning to the feed page and viewing the custom output section at the bottom.

    Another way to access this data is from the task’s Feeds page, where you will find your Field URLs.

    Clicking a URL opens a JSON array of your custom output. The same feed is also accessible via the Roborabbit API, allowing you to retrieve your cleaned data instead of the raw extracted data.
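
    Once you have a feed URL, reading the output from a script might look like the sketch below. It assumes the feed is a JSON array of objects and that the URL can be read with a plain GET request; neither the sample field names nor the helper names `parse_feed` and `fetch_feed` come from Roborabbit, and any authentication your account requires is not covered here.

```python
import json
from urllib.request import urlopen

def parse_feed(payload: str) -> list[dict]:
    """Decode a feed payload and check that it is a JSON array of objects."""
    records = json.loads(payload)
    if not isinstance(records, list) or not all(isinstance(r, dict) for r in records):
        raise ValueError("expected a JSON array of objects")
    return records

def fetch_feed(feed_url: str) -> list[dict]:
    """Download a feed URL copied from the Feeds page and parse it.

    Assumption: the URL answers a plain GET request with JSON text.
    """
    with urlopen(feed_url) as response:
        return parse_feed(response.read().decode("utf-8"))

# Hypothetical payload; real field names depend on the fields in your feed.
sample = '[{"title": "Wireless Mouse", "price": 24.0}]'
records = parse_feed(sample)
print(records[0]["title"])
```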

    Roborabbit Data Transformation Types

    Roborabbit currently supports 16 different data transformation types, as follows:

    Append

    This transformation option adds something to the end of your data, such as a description or reference number.

    Convert to Float

    This transformation option converts an integer to a float, which can be helpful when you’re working with monetary amounts or want to maintain a consistent number of characters.
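
    As a rough illustration of the idea in Python (a sketch only; `convert_to_float` is a hypothetical name, and accepting numeric text is an assumption):

```python
def convert_to_float(value) -> float:
    """Turn an integer (or numeric text) into a float."""
    return float(value)

price = convert_to_float(25)
print(price)
print(f"{price:.2f}")  # fixed two decimal places, handy for monetary amounts
```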

    Convert to Integer

    This transformation option converts a float to an integer, making for cleaner output if decimals aren’t important.
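
    A float can become an integer by dropping the decimal part or by rounding, and this guide doesn't say which one Roborabbit applies, so it's worth checking your output. The Python sketch below shows both behaviors; `convert_to_integer` and its `rounding` flag are hypothetical.

```python
def convert_to_integer(value: float, rounding: bool = False) -> int:
    """Drop the decimal part, or round to the nearest whole number."""
    return round(value) if rounding else int(value)

truncated = convert_to_integer(19.99)               # decimal part dropped
rounded = convert_to_integer(19.99, rounding=True)  # nearest whole number
print(truncated, rounded)
```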

    Prepend

    This transformation option adds something to the beginning of your data, such as a reference number or currency symbol.
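
    Prepend and its counterpart, Append, both amount to joining two strings. A minimal Python sketch with hypothetical helper names and sample values:

```python
def prepend(value: str, prefix: str) -> str:
    """Put `prefix` in front of the value."""
    return prefix + value

def append(value: str, suffix: str) -> str:
    """Put `suffix` after the value."""
    return value + suffix

print(prepend("24.00", "$"))
print(append("Wireless Mouse", " (Ref 1042)"))
```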

    Find and Replace

    This transformation option locates all mentions of a string and replaces them with something else, helping to standardize formatting or remove unwanted terms.
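
    In Python terms, this is comparable to `str.replace`, which swaps every occurrence of a substring. The sketch below is illustrative; `find_and_replace` and the sample text are hypothetical.

```python
def find_and_replace(value: str, find: str, replacement: str) -> str:
    """Replace every occurrence of `find` with `replacement`."""
    return value.replace(find, replacement)

scraped = "Salt &amp; Pepper &amp; Spice"
print(find_and_replace(scraped, "&amp;", "&"))
```

    Passing an empty replacement removes the term entirely, which covers the "remove unwanted terms" case.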

    Find Email

    This transformation type locates strings following typical email formats, making it easy to extract contact information and compile mailing lists.
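
    Matching "typical email formats" is usually done with a regular expression. The pattern below is a deliberately simple assumption, not the one Roborabbit uses, and returning every match as a list is also an assumption.

```python
import re

# Simplified pattern: local part, "@", domain, and a top-level domain.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_emails(text: str) -> list[str]:
    """Return every substring that looks like an email address."""
    return EMAIL_PATTERN.findall(text)

text = "Contact sales@example.com or support@example.org."
print(find_emails(text))
```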

    Find Phone Number

    This transformation type locates strings following typical phone number formats, assisting you in extracting contact information.
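
    Phone numbers vary far more than emails, so any pattern is a trade-off. The Python sketch below uses a loose, assumed pattern (an optional "+", then digits mixed with spaces, dots, dashes, or parentheses); Roborabbit's actual matching rules may differ.

```python
import re

# Loose pattern: optional "+", a digit, at least seven digits or
# separators, and a final digit.
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def find_phone_numbers(text: str) -> list[str]:
    """Return every substring that looks like a phone number."""
    return PHONE_PATTERN.findall(text)

text = "Call us at +1 (555) 010-9999 or 555-010-8888 today."
print(find_phone_numbers(text))
```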

    Lowercase

    This transformation type transforms all text into lowercase letters, ensuring consistent formatting.

    Uppercase

    This transformation type transforms all text into uppercase letters, ensuring consistent formatting.

    Titlecase

    This transformation type transforms all text into title case, which can clear up inconsistencies and prepare headers.
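
    The three case transformations map closely onto Python's built-in string methods, shown here as a rough illustration. Note that `str.title` also lowercases the remaining letters of each word and treats apostrophes as word breaks, which may not match how Roborabbit handles the same input.

```python
headline = "WIRELESS mouse pad"

print(headline.lower())
print(headline.upper())
print(headline.title())

# Python-specific quirk: the letter after an apostrophe is capitalized.
print("don't stop".title())
```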

    Split By

    This transformation type splits text using a separator, then returns the segment you specify, such as a first name or a last name.
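
    A minimal Python sketch of the same idea, assuming the piece to keep is chosen by a zero-based position (the `split_by` helper and its `index` parameter are hypothetical):

```python
def split_by(value: str, separator: str, index: int) -> str:
    """Split on `separator` and return the piece at `index` (0-based)."""
    parts = value.split(separator)
    return parts[index]

full_name = "Julianne Youngberg"
print(split_by(full_name, " ", 0))  # first name
print(split_by(full_name, " ", 1))  # last name
```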

    Strip

    This transformation type removes whitespace from the beginning and end of a string, cleaning it up for use.

    Truncate

    This transformation type shortens text to be within a certain character limit, which can keep product titles or descriptions within display constraints or trim lengthy URLs.
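
    In Python, cutting text to a character limit is a simple slice. This sketch assumes a hard cut with no ellipsis added; whether Roborabbit appends anything to truncated text isn't stated here. The `truncate` helper is hypothetical.

```python
def truncate(value: str, limit: int) -> str:
    """Keep at most `limit` characters from the start of the text."""
    return value[:limit]

title = "Ergonomic Wireless Mouse with USB-C Charging"
short = truncate(title, 24)
print(short)
```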

    Remove All Spaces

    This transformation type removes all spaces from the string, which can be helpful when forming values like SKUs or unique identifiers that aren’t meant to have any whitespace.
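
    The difference from Strip is easiest to see side by side: Strip only trims the ends, while Remove All Spaces also closes the gaps inside the value. A small Python sketch with hypothetical helper names:

```python
def strip_ends(value: str) -> str:
    """Trim whitespace at the start and end only."""
    return value.strip()

def remove_all_spaces(value: str) -> str:
    """Delete every space character, including those inside the value."""
    return value.replace(" ", "")

raw = "  AB 12 34 CD  "
print(repr(strip_ends(raw)))
print(repr(remove_all_spaces(raw)))
```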

    Remove URL

    This transformation type removes any URL strings within your data, which can clean up extraneous links and keep core content clean and structured.
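
    One way to picture this is a regular expression that deletes anything starting with `http://` or `https://`. The pattern and the whitespace cleanup afterwards are assumptions made for this Python sketch; Roborabbit may recognize other URL forms or leave spacing untouched.

```python
import re

# Assumed pattern: a scheme followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")

def remove_urls(text: str) -> str:
    """Delete URLs, then collapse the whitespace left behind."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())

review = "Great value, see https://example.com/deals for more"
print(remove_urls(review))
```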

    Split Into Array

    This transformation type splits a string into items that make up an array, such as lists, comma-separated values, or names.
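
    A comparable operation in Python is `str.split`, sketched below. Trimming each item and skipping empty ones are choices made for this example, not documented Roborabbit behavior, and `split_into_array` is a hypothetical name.

```python
def split_into_array(value: str, separator: str = ",") -> list[str]:
    """Split on `separator`, trim each item, and skip empty items."""
    return [item.strip() for item in value.split(separator) if item.strip()]

colors = split_into_array("red, green , blue,")
print(colors)
```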

    Cheat Sheet

    Command             Description
    Append              Add something to the end of your data
    Convert to Float    Converts an integer to a float
    Convert to Integer  Converts a float to an integer
    Prepend             Add something to the beginning of your data
    Find and Replace    Find a string and replace it with something else
    Find Email          Filters data matching typical email format
    Find Phone Number   Filters data matching typical phone number format
    Lowercase           Transform all text to lowercase
    Uppercase           Transform all text to uppercase
    Titlecase           Transform the first letter of every word to uppercase
    Split By            Use a separator to split some text
    Strip               Removes white spaces from the beginning and end of text
    Truncate            Shortens the text to a specific number of characters
    Remove All Spaces   Remove all spaces from your data
    Remove URL          Removes URLs from your data
    Split Into Array    Splits your data into an array using a separator

    Conclusion

    Properly cleaning and formatting your web-scraped data is a critical step in the data extraction process. By leveraging Roborabbit's built-in data transformation features, you can take messy, unstructured output and transform it into clean, usable data that is ready for analysis, integration, and presentation.

    From removing unwanted URLs to standardizing text formatting, Roborabbit has the tools to handle basic data cleaning requirements. By taking the time to properly scrub your scraped information, you'll improve accuracy, streamline downstream processes, and end up with a high-quality data set.

    About the author: Julianne Youngberg (@paradoxicaljul)
    Julianne is a technical content specialist fascinated with digital tools and how they can optimize our lives. She enjoys bridging product-user gaps using the power of words.

