Wednesday, October 13, 2010

SSIS Eliminating Duplicate Rows using The PARTITION BY clause

Recently I reviewed a post from  SQLServerCentral about how to eliminate the duplicated rows. We used this kind technique in the ETL process to de-duplicate rows from source tables.

Here is the sample code snippet used in OLE DB source :

SELECT
ROW_NUMBER() OVER (PARTITION BY LocationCode ORDER BY SourceID DESC) RowNumber,
SourceID,
LocationCode,
Column1,
Column2
FROM dbo.DimLocation WITH(NOLOCK) WHERE ......

The Row_Number() function will return a number started from 1 based on the PARTITION BY clause and ORDER BY clause. Whenever the column RowNumber is greater than 1 in the result set above is a duplicate row. We could then add a conditional split based on this column to send those rows that having RowNumber > 1 to error flow for review or drop them on the floor.


image


Should you use the source primary key order ( SourceID column as example above) or others to make the decision in your query?  This is a very import question to ask. You need to know your business rules well  in order to pick the correct data to load into the destination table, otherwise you may drop the right data on the floor during the de-duplicate process.


Here is the post Eliminating Duplicate Rows using The PARTITION BY clause by Suresh Maganti which I encourage you to check out.

No comments:

Post a Comment