In this article, we will demonstrate how to import tables from a CSV file, remove duplicate records using transformations, and migrate the cleaned data into the destination database using Bold Data Hub. Follow the step-by-step process below.
Sample Data Source:
Learn about Pipeline Creation
Learn more about transformation here
Duplicate records in a dataset can lead to inconsistencies and inaccurate analysis. To ensure data integrity, we can remove duplicates based on unique identifiers such as Customer_ID
and Ticket_ID
while retaining the most recent entry.
We use the DISTINCT ON
clause to retain only one record per Customer_ID
and Ticket_ID
, prioritizing the latest entry based on Ticket_Creation_Date DESC
.
WITH Unique_Tickets AS (
SELECT DISTINCT ON (Customer_ID, Ticket_ID) *
FROM {pipeline_name}.sample_csc_data
ORDER BY Customer_ID, Ticket_ID, Ticket_Creation_Date DESC
)
SELECT * FROM Unique_Tickets;