Every once in a while I encounter a data migration scenario where the destination table doesn’t enforce a Unique Index and when re-running data migration code if you forget to purge records in the destination prior to inserting without the unique index you could end up with duplicates.

Today’s posting focuses on one way of deleting the duplicates – using the Row_Number() OVER (PARTITION BY ) ORDER BY  syntax in SQL Server to assign row numbers by what I want to be the Unique records and then delete those that are duplicates.

In the example below, I have a table that stores salesperson assignments – it populates a table called the CommissionSalespersonDefaults.

The inserts in this example were run 3 times over a period of months.  But since the delete was omitted, every time the statements were run I ended up with duplicates/triplicates, etc. in that table, like this:

CommissionSalespersonDefaultsID

CustomerCode

ShipToCode

Commissionable

Salesperson1

22

20THC

 

TRUE

AA1

9538

20THC

 

TRUE

PrI

21971

20THC

 

TRUE

PrI

There should only be one record for this CustomerCode/ShipToCode combination in this table – I need to delete two of them, but in case some attribute is different like in this case the sales rep changed from the 1st migration to the 2nd/3rd I only want to keep the last one – the one with ID 21971.

There are a number of ways to get this done, but for today’s article I’m focusing on the Row_Number() OVER (PARTITION BY ) ORDER BY feature in SQL Server.

Here’s a Select statement showing just the first 9 records with this in action:

I like to see what I’m about to do when it comes to delete statements so I usually start with a select statement giving me a preview of what’s about to happen:

SELECT TOP 9 * FROM CommissionSalespersonDefaults s 

INNER JOIN (

  SELECT *

  , ROW_NUMBER()

    OVER (

           PARTITION BY CustomerCode, ISNULL(ShipToCode, )

           ORDER BY CommissionSalespersonDefaultsID DESC

         ) AS RowNumber

  FROM CommissionSalespersonDefaults

) Numbered ON s.CommissionSalespersonDefaultsID = Numbered.CommissionSalespersonDefaultsID

What the query does is add the RowNumber, with it resetting after each change in whatever you put in the Partition By part – in this case, in each change in CustomerCode+ShipToCode.  The ORDER BY Clause tells SQL Server the order in which to order the records prior to assigning the row numbers.  Here’s a sample of the output

RowNumber

CommissionSalespersonDefaultsID

CustomerCode

ShipToCode

Commissionable

Salesperson1

1

21971

20THC

 

TRUE

PrI

2

9538

20THC

 

TRUE

PrI

3

22

20THC

 

TRUE

AA1

1

21972

212

 

TRUE

SCS

2

9539

212

 

TRUE

SCS

3

23

212

 

TRUE

SCS

1

21973

280T

 

TRUE

TSW

2

9540

280T

 

TRUE

TSW

3

24

280T

 

TRUE

TSW

So it’s easy to see here that what we want to delete are any records where the RowNumber is > 1.

DELETE CommissionSalespersonDefaults

FROM CommissionSalespersonDefaults

INNER JOIN (

  SELECT *

  , ROW_NUMBER()

    OVER (

            PARTITION BY CustomerCode, ISNULL(ShipToCode, )

            ORDER BY CommissionSalespersonDefaultsID DESC

          ) AS RowNumber

  FROM CommissionSalespersonDefaults

) Numbered ON CommissionSalespersonDefaults.CommissionSalespersonDefaultsID = Numbered.CommissionSalespersonDefaultsID

WHERE RowNumber > 1

Now I have a table containing only the most recently imported records.

 

Advertisement