Removing Duplicates and Transforming Data Using Bold Data Hub

In this article, we will demonstrate how to import tables from a CSV file, remove duplicate records using transformations, and migrate the cleaned data into the destination database using Bold Data Hub. Follow the step-by-step process below.

Sample Data Source:

Sample CSC Data


Creating Pipeline

Learn about [Pipeline Creation](https://help.boldbi.com/working-with-data-sources/working-with-bold-data-hub/working-with-pipelines/)

Applying Transformation

  • Go to the Transform tab and click Add Table.

  • Enter a table name to create a transform table for the customer satisfaction summary.

Transformation Use Case

Note: The data is first loaded into the DuckDB database under the designated {pipeline_name} schema before being transformed and moved into the target databases. For example, for a pipeline named "customer_service_data", the data will be placed under the customer_service_data schema.
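As a quick illustration of the placeholder substitution described in the note, the snippet below builds the schema-qualified table name from the example pipeline name. The table name sample_csc_data is taken from the query later in this article; the substitution mechanics are a sketch of the behavior described above, not Bold Data Hub internals.

```python
# The {pipeline_name} placeholder resolves to the pipeline's schema,
# so queries reference tables as <pipeline_name>.<table_name>.
pipeline_name = "customer_service_data"  # example pipeline from the note
qualified_table = f"{pipeline_name}.sample_csc_data"
print(qualified_table)  # customer_service_data.sample_csc_data
```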


Learn more about transformation here

Removing Duplicate Records

Overview

Duplicate records in a dataset can lead to inconsistencies and inaccurate analysis. To ensure data integrity, we can remove duplicates based on unique identifiers such as Customer_ID and Ticket_ID while retaining the most recent entry.

Approach

We use the DISTINCT ON clause to retain only one record per Customer_ID and Ticket_ID, prioritizing the latest entry based on Ticket_Creation_Date DESC.

SQL Query for Removing Duplicates

-- Keep one row per (Customer_ID, Ticket_ID); because of the ORDER BY,
-- DISTINCT ON retains the row with the latest Ticket_Creation_Date.
WITH Unique_Tickets AS (
    SELECT DISTINCT ON (Customer_ID, Ticket_ID) *
    FROM {pipeline_name}.sample_csc_data
    ORDER BY Customer_ID, Ticket_ID, Ticket_Creation_Date DESC
)
SELECT * FROM Unique_Tickets;
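To see the deduplication logic in action outside of Bold Data Hub, the sketch below reproduces it with Python's built-in sqlite3 module. Since SQLite does not support DISTINCT ON, the equivalent ROW_NUMBER() window-function pattern is used instead; the column names mirror the article's table, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical rows: two entries share (Customer_ID, Ticket_ID),
# so only the one with the latest Ticket_Creation_Date should survive.
rows = [
    ("C001", "T100", "2024-01-05", "Open"),
    ("C001", "T100", "2024-01-10", "Closed"),  # duplicate; newer entry wins
    ("C002", "T200", "2024-02-01", "Open"),
]

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE sample_csc_data (
    Customer_ID TEXT, Ticket_ID TEXT,
    Ticket_Creation_Date TEXT, Status TEXT)""")
con.executemany("INSERT INTO sample_csc_data VALUES (?, ?, ?, ?)", rows)

# ROW_NUMBER() ranks rows within each (Customer_ID, Ticket_ID) group by
# newest Ticket_Creation_Date; keeping rank 1 emulates DISTINCT ON.
deduped = con.execute("""
    SELECT Customer_ID, Ticket_ID, Ticket_Creation_Date, Status FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY Customer_ID, Ticket_ID
            ORDER BY Ticket_Creation_Date DESC) AS rn
        FROM sample_csc_data)
    WHERE rn = 1
""").fetchall()

for row in deduped:
    print(row)
```

The duplicate ticket collapses to its most recent entry, leaving two rows. The same ROW_NUMBER() pattern also works in DuckDB if you prefer it over DISTINCT ON.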

Transformation Use Case