Filtering Invalid Entries Using Bold Data Hub

In this article, we will demonstrate how to import tables from a CSV file, remove invalid data using transformations, and migrate the cleaned data into the destination database using Bold Data Hub. Follow the step-by-step process below.

Sample Data Source:

Sample CSC Data


Creating Pipeline

Learn about Pipeline Creation

Applying Transformation

Learn more about transformation here

Removing Invalid or Inconsistent Data in DuckDB

Overview

Data quality issues such as invalid phone numbers or negative ticket costs can impact analysis and decision-making. This guide explains how to filter out inconsistent records using DuckDB queries.

Approach

We use regular expressions (regexp_matches) to validate phone numbers and apply conditional filtering (WHERE ticket_cost >= 0) to ensure only valid data is retained.

SQL Queries

Filter Out Invalid Phone Numbers

Ensures phone numbers contain exactly 10 digits (numeric only).

SELECT 
    Ticket_ID, 
    Customer_ID, 
    Customer_Name, 
    Ticket_Category, 
    Issue_Description, 
    Ticket_Status, 
    Priority, 
    Agent_ID, 
    Agent_Name, 
    Customer_Satisfaction, 
    Ticket_Creation_Date, 
    Ticket_Resolution_Date, 
    Ticket_Comments, 
    Region, 
    City, 
    Country 
FROM {pipeline_name}.sample_csc_data 
WHERE regexp_matches(CAST(Phone AS VARCHAR), '^[0-9]{10}$');
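
The ten-digit rule enforced by the query can be checked outside DuckDB as well. The sketch below applies the same `^[0-9]{10}$` pattern with Python's built-in `re` module; the sample phone values are illustrative, not taken from the dataset.

```python
import re

# Same rule as the DuckDB query: exactly ten digits, nothing else.
TEN_DIGITS = re.compile(r"^[0-9]{10}$")

def is_valid_phone(phone) -> bool:
    """Return True when the value is exactly 10 numeric digits."""
    return bool(TEN_DIGITS.fullmatch(str(phone)))

# Illustrative values only.
phones = ["9876543210", "12345", "98-7654-3210", "1234567890"]
valid = [p for p in phones if is_valid_phone(p)]
```

Casting to a string first (as the query does with `CAST(Phone AS VARCHAR)`) ensures the pattern also works when the column is stored numerically.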

Transformation Use Case

Filter Out Negative Costs

Removes records where ticket_cost is negative, ensuring only valid financial data is retained.

SELECT 
    Ticket_ID, 
    Customer_ID, 
    Customer_Name, 
    Ticket_Category, 
    Issue_Description, 
    Ticket_Status, 
    Priority, 
    Agent_ID, 
    Agent_Name, 
    Customer_Satisfaction, 
    Ticket_Creation_Date, 
    Ticket_Resolution_Date, 
    Ticket_Comments, 
    Region, 
    City, 
    Country 
FROM {pipeline_name}.sample_csc_data 
WHERE ticket_cost >= 0;
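
The non-negative cost filter is plain SQL, so it can be sketched with Python's built-in sqlite3 module; the table name and sample rows below are illustrative and not part of the Bold Data Hub pipeline.

```python
import sqlite3

# In-memory table standing in for the pipeline's staged data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (Ticket_ID INTEGER, ticket_cost REAL)")
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?)",
    [(1, 25.0), (2, -10.0), (3, 0.0)],  # illustrative rows
)

# Keep only rows with a non-negative cost, as in the DuckDB query.
rows = conn.execute(
    "SELECT Ticket_ID, ticket_cost FROM tickets "
    "WHERE ticket_cost >= 0 ORDER BY Ticket_ID"
).fetchall()
```

Note that a cost of exactly zero passes the filter; only strictly negative values are dropped.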

Transformation Use Case