How to Clean Scraped Data with Roborabbit
Raw data often needs cleaning and formatting before it can be used effectively. Depending on where you source the information, your data might be excessive, incomplete, or poorly organized. That’s why transforming the output to better suit your needs is a typical part of processing data extracted online.
Cleaned and formatted data is important because it:
- Improves accuracy
- Prepares information for analysis
- Integrates more seamlessly with other systems and tools
- Looks more presentable and user-friendly
When working with data extracted using Roborabbit, you may find that simple transformations can improve the quality of your output and prepare it for use. This guide will explain when and how you can use Roborabbit's built-in features to scrub your output feed.
When to Clean Structured Data
Scrubbing structured data can be largely automated at different stages of the process, whether it's within a database, through task automation, or at point of extraction. Here's a simplified overview of when you might consider each option:
- Database: When complex data processing is required
- Task automation tool: When a few tweaks are needed for a specific automated task
- Web scraping tool: When standard output needs to be transformed slightly before being stored or used in a final product
Knowing when to format extracted data is a matter of understanding your use case and the impact the decision has on workflow efficiency. For example, performing too many transformations in a single task automation may be more costly and time-consuming than a nested database formula. Similarly, using a database transformation for a minor adjustment could require more storage space than handling the data at point of extraction. The best decision depends on the volume and specifics of your data cleaning requirements.
How to Use Roborabbit to Scrub Data
To clean data with Roborabbit, you have to create a custom feed that transforms your output based on your specifications, then delivers it along with the rest of your task run results so you can route it to the storage or task process of your choice.
Step 1. Navigate to Feeds
Let’s start from a task with output that needs cleaning.
From the task page, scroll down to Feeds, which is listed under the Integrations section. Click Settings.
Step 2. Create a Custom Feed
Now, you should be on the Feeds page, where you can manage one or more output feeds from your task data.
Click Create a New Feed.
Step 3. Add Fields to Builder
You should now be on a page where you can set up your feed’s custom transformations and view the output.
To apply transformations, you first need to add the fields that will be modified to the builder. Do this by specifying each field and choosing between text or number types. You can also add a target name if desired. Then, click Add Field to load it into the builder.
Alternatively, you can click Add All Fields to auto-add all of your standard output components to the builder.
Step 4. Apply Transformation(s)
Now, you can make adjustments to your output. Click on the transformation counter next to the field you want to modify.
Set up your transformation by choosing the transformation type.
Then, fill out any specifications that might be required. Click Add Transformation.
You can stack multiple transformations until the custom output is exactly what you need.
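To picture how stacked transformations behave, here is a minimal Python sketch that runs a value through a list of steps, each one working on the result of the step before it. This is a local illustration only, not Roborabbit's implementation: the `apply_transformations` helper, the sample value, and the assumption that steps run in the order they are listed are all hypothetical.

```python
from functools import reduce
from typing import Callable, List

# A transformation is any function that takes text and returns text.
Transform = Callable[[str], str]


def apply_transformations(value: str, steps: List[Transform]) -> str:
    """Run each step on the result of the previous one, in order."""
    return reduce(lambda current, step: step(current), steps, value)


# Hypothetical stack: strip whitespace, lowercase, then prepend a label.
steps: List[Transform] = [
    str.strip,
    str.lower,
    lambda text: "sku-" + text,
]

cleaned = apply_transformations("  AB-1042  ", steps)
print(cleaned)  # sku-ab-1042
```

Order matters in a stack like this: prepending the label before lowercasing would give the same result here, but prepending before stripping would leave the padding inside the value.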
Hare Hint 🐰: Learn all about Roborabbit’s data transformation types and examples of when to use them in the next section of this guide!
Step 5. Check Output
It’s important to make sure all of your structured data is being processed correctly, so inspect your output by returning to the feed page and viewing the custom output section at the bottom.
Another way to access this data is from the task’s Feeds page, where you will find your Field URLs.
Clicking a URL opens a JSON array of your custom output. The same data is accessible via the Roborabbit API, so you can retrieve your cleaned feed instead of the raw extracted data.
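If you want to pull that JSON array into a script, a rough Python sketch might look like the following. It assumes you paste in a feed URL copied from the Feeds page and that the URL returns a JSON array, as described above. Whether the request needs authentication is not covered in this guide, so `fetch_feed` is an assumption rather than a documented client, and the sample field names are made up.

```python
import json
from urllib.request import urlopen


def parse_feed(raw: str) -> list:
    """Decode a feed body and check that it is a JSON array."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of records")
    return data


def fetch_feed(feed_url: str) -> list:
    """Download a feed URL copied from the Feeds page and parse it.

    Assumption: the URL is readable with a plain GET request.
    """
    with urlopen(feed_url, timeout=30) as response:
        return parse_feed(response.read().decode("utf-8"))


# Offline example with a made-up payload; the field names are illustrative.
sample = '[{"title": "Blue Mug", "price": 12.0}]'
records = parse_feed(sample)
print(records[0]["title"])  # Blue Mug
```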
Roborabbit Data Transformation Types
Roborabbit currently supports 16 different data transformation types, as follows:
Append String to Scraped Data
The Append transformation option adds something to the end of your data, such as a description or reference number.
Convert Integer to Float
The Convert to Float transformation option converts an integer to a float, which can be helpful when you’re working with monetary amounts or wanting to maintain a consistent number of characters.
Convert Float to Integer
The Convert to Integer transformation option converts a float to an integer, making for cleaner output if decimals aren’t important.
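For a feel of what the two numeric conversions do, here is a small Python sketch. It is a local approximation, not Roborabbit's code. In particular, Python's `int()` drops the decimal part rather than rounding, and this guide does not say which of the two Roborabbit does, so treat that behavior as an assumption.

```python
def to_float(value) -> float:
    """Convert an integer (or numeric string) to a float."""
    return float(value)


def to_integer(value) -> int:
    """Convert a float (or numeric string) to an integer.

    Assumption: decimals are dropped (truncated), not rounded.
    """
    return int(float(value))


print(to_float(25))       # 25.0
print(to_integer(19.99))  # 19
print(to_integer("7.0"))  # 7
```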
Prepend String to Scraped Data
The Prepend transformation option adds something to the beginning of your data, such as a reference number or currency symbol.
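Prepend, like the Append option described earlier, comes down to joining two strings. The Python sketch below shows the idea with a price field; the function names and sample values are hypothetical and only meant to illustrate the result.

```python
def prepend(value: str, prefix: str) -> str:
    """Add text to the beginning of a value."""
    return prefix + value


def append(value: str, suffix: str) -> str:
    """Add text to the end of a value."""
    return value + suffix


price = prepend("19.99", "$")
label = append(price, " USD")
print(label)  # $19.99 USD
```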
Find and Replace Text in Scraped Data
The Find and Replace transformation option locates all mentions of a string and replaces them with something else, helping to standardize formatting or remove unwanted terms.
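In Python terms, this behaves much like `str.replace`, as in the sketch below. Note that `str.replace` is case-sensitive; whether Roborabbit's version is case-sensitive is not stated here, so that detail is an assumption. The sample strings are made up.

```python
def find_and_replace(value: str, find: str, replacement: str) -> str:
    """Replace every occurrence of `find` with `replacement`."""
    return value.replace(find, replacement)


# Standardize a term everywhere it appears.
print(find_and_replace("In Stock - In Stock online", "In Stock", "Available"))
# Available - Available online

# Replacing with an empty string removes the unwanted term.
print(find_and_replace("Model X (refurbished)", " (refurbished)", ""))
# Model X
```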
Find Email in Scraped Data
The Find Email transformation type locates strings following typical email formats, making it easy to extract contact information and compile mailing lists.
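As an illustration of matching "typical email formats", here is a Python sketch using a deliberately simple regular expression. The pattern is an assumption made for this example and is not the one Roborabbit uses; real-world email validation is more involved.

```python
import re
from typing import Optional

# Simplified pattern: local part, "@", domain, and a top-level domain.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_email(text: str) -> Optional[str]:
    """Return the first email-like string in the text, or None."""
    match = EMAIL_PATTERN.search(text)
    return match.group(0) if match else None


print(find_email("Contact us at support@example.com for help."))
# support@example.com
```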
Find Phone Number in Scraped Data
The Find Phone Number transformation type locates strings following typical phone number formats, assisting you in extracting contact information.
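Phone numbers come in many shapes, so any pattern is a compromise. The Python sketch below uses a loose regular expression that accepts digits separated by spaces, dots, dashes, or parentheses, with an optional leading plus sign. The pattern is an assumption for illustration, not Roborabbit's matching logic, and it will happily match other long digit sequences too.

```python
import re
from typing import Optional

# Loose pattern: optional "+", a digit, at least six digits or separators,
# and a final digit.
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def find_phone(text: str) -> Optional[str]:
    """Return the first phone-like string in the text, or None."""
    match = PHONE_PATTERN.search(text)
    return match.group(0) if match else None


print(find_phone("Call +1 (555) 010-9999 today"))
# +1 (555) 010-9999
```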
Transform Scraped Data into Lowercase
The Lowercase transformation type transforms all text into lowercase letters, ensuring consistent formatting.
Transform Scraped Data into Uppercase
The Uppercase transformation type transforms all text into uppercase letters, ensuring consistent formatting.
Transform Scraped Data into Titlecase
The Titlecase transformation type transforms all text into title case, which can clear up inconsistencies and prepare headers.
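The three case transformations map neatly onto Python's built-in string methods, as this small sketch with a made-up product name shows. One caveat: Python's `str.title()` capitalizes after every non-letter character, so "it's" becomes "It'S". How Roborabbit treats such edge cases is not stated here.

```python
text = "wireless NOISE-cancelling headphones"

lowered = text.lower()
uppered = text.upper()
titled = text.title()

print(lowered)  # wireless noise-cancelling headphones
print(uppered)  # WIRELESS NOISE-CANCELLING HEADPHONES
print(titled)   # Wireless Noise-Cancelling Headphones
```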
Split Scraped Data String By Separator
The Split transformation type splits text using a separator, then returns the segment you specify, such as a first name or a last name.
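A rough Python equivalent is to split the text and pick one piece by position. In this sketch the `index` parameter is zero-based and an out-of-range index returns an empty string. Both choices are assumptions made for the example, since this guide does not describe how Roborabbit numbers the pieces.

```python
def split_by(value: str, separator: str, index: int) -> str:
    """Split on the separator and return the piece at `index` (zero-based)."""
    parts = value.split(separator)
    return parts[index] if 0 <= index < len(parts) else ""


full_name = "Ada Lovelace"
print(split_by(full_name, " ", 0))  # Ada
print(split_by(full_name, " ", 1))  # Lovelace
```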
Strip White Space from Scraped Data
The Strip transformation type removes whitespace from the beginning and end of a string, cleaning it up for use.
Truncate Scraped Data
The Truncate transformation type shortens text to be within a certain character limit, which can keep product titles or descriptions within display constraints or trim lengthy URLs.
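Truncating to a character limit is a simple slice in Python, as sketched below with a made-up product title. The sketch cuts the text off at the limit and adds nothing; whether Roborabbit appends an ellipsis or trims at word boundaries is not stated in this guide.

```python
def truncate(value: str, limit: int) -> str:
    """Keep at most `limit` characters from the start of the text."""
    return value[:limit]


title = "Stainless Steel Insulated Water Bottle, 750 ml"
print(truncate(title, 15))  # Stainless Steel
```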
Remove All Spaces from Scraped Data
The Remove All Spaces transformation type removes all spaces from the string, which can be helpful when forming values like SKUs or unique identifiers that aren’t meant to have any whitespace.
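It helps to see this side by side with Strip, since the two are easy to confuse. The Python sketch below applies both to the same made-up value: stripping only touches the ends, while removing all spaces also closes the gaps in the middle. Treat it as an illustration rather than Roborabbit's exact behavior.

```python
raw = "  AB 1042 X  "

# Strip: only leading and trailing whitespace is removed.
stripped = raw.strip()

# Remove all spaces: every space character is removed.
no_spaces = raw.replace(" ", "")

print(repr(stripped))   # 'AB 1042 X'
print(repr(no_spaces))  # 'AB1042X'
```

If tabs or line breaks should go too, `"".join(raw.split())` removes every kind of whitespace in Python.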
Remove URL from Scraped Data
The Remove URL transformation type removes any URL strings within your data, which can clean up extraneous links and keep core content clean and structured.
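One way to approximate this in Python is with a regular expression that drops anything starting with `http://`, `https://`, or `www.` up to the next whitespace. The pattern and the extra step that tidies leftover double spaces are assumptions for this sketch, not Roborabbit's logic. Punctuation attached directly to a URL is removed along with it.

```python
import re

# Matches http://..., https://..., or www.... up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def remove_urls(text: str) -> str:
    """Delete URL-like strings, then collapse the gaps they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r"\s{2,}", " ", without_urls).strip()


print(remove_urls("Great mug! See https://example.com/mug?id=7 for details."))
# Great mug! See for details.
```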
Split Scraped Data String Into Array
The Split Into Array transformation type splits a string into items that make up an array, such as lists, comma-separated values, or names.
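In Python, the closest match is `str.split`, which turns one string into a list of items. The sketch below also trims stray whitespace around each item. That trimming is a convenience added for this example and an assumption, since this guide does not say whether Roborabbit does the same.

```python
from typing import List


def split_into_array(value: str, separator: str) -> List[str]:
    """Split the text on the separator and trim each resulting item."""
    return [item.strip() for item in value.split(separator)]


print(split_into_array("red, green, blue", ","))
# ['red', 'green', 'blue']
```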
Cheat Sheet
Command | Description |
---|---|
Append | Adds something to the end of your data |
Convert to Float | Converts an integer to a float |
Convert to Integer | Converts a float to an integer |
Prepend | Adds something to the beginning of your data |
Find and Replace | Finds a string and replaces it with something else |
Find Email | Filters data matching typical email formats |
Find Phone Number | Filters data matching typical phone number formats |
Lowercase | Transforms all text to lowercase |
Uppercase | Transforms all text to uppercase |
Titlecase | Transforms the first letter of every word to uppercase |
Split By | Splits text using a separator and returns the specified segment |
Strip | Removes whitespace from the beginning and end of text |
Truncate | Shortens the text to a specific number of characters |
Remove All Spaces | Removes all spaces from your data |
Remove URL | Removes URLs from your data |
Split Into Array | Splits your data into an array using a separator |
Conclusion
Properly cleaning and formatting your web-scraped data is a critical step in the data extraction process. By leveraging Roborabbit's built-in data transformation features, you can take messy, unstructured output and transform it into clean, usable data that is ready for analysis, integration, and presentation.
From removing unwanted URLs to standardizing text formatting, Roborabbit has the tools to handle basic data cleaning requirements. By taking the time to properly scrub your scraped information, you'll improve accuracy, streamline downstream processes, and end up with a high-quality data set.