TransactSQL: Finding Duplicates

Finding Duplicates

If we group records together by certain identifying fields we can then use a Count function to extract those that are duplicated. This query utilizes a sub-query and, based on their first and last names, returns a list of DelegateIDs for those delegates that have duplicate records.


       SELECT Delegate.DelegateID, Delegate.FirstName, Delegate.LastName 
       FROM Delegate INNER JOIN 
             (SELECT FirstName, LastName 
                    FROM Delegate 
                    GROUP BY FirstName, LastName 
                    HAVING Count(DelegateID) > 1 
             )     AS Duplicates 
             ON Delegate.FirstName = Duplicates.FirstName 
             AND Delegate.LastName = Duplicates.LastName

Removing Duplicates

The following query removes all duplicates from the delegate table, leaving only the originals of the duplicated record. Here, we define a record as having a duplicate if there is another record with matching FirstName and LastName fields. This query relies on a view being set up that returns a list of duplicate delegates. The view is called DupDels.


       DELETE FROM Delegate
       WHERE DelegateID =
       ANY ( SELECT DupDels.DelegateID 
             FROM DupDels LEFT JOIN 
                    (SELECT Min(DelegateID) AS DelegateID, FirstName, LastName 
                          FROM DupDels GROUP BY FirstName, LastName) AS FirstDup 
                    ON DupDels.DelegateID = FirstDup.DelegateID 
             WHERE FirstDup.DelegateID IS NULL 
             )

This is general discussion. You should translate it to your case by yourself.


CREATE TABLE Dups ( ID INT IDENTITY(1,1) PRIMARY KEY, A VARCHAR(30), B VARCHAR(30), C VARCHAR(30) )
go INSERT INTO Dups(A,B,C) VALUES ('hi','there','all') INSERT INTO Dups(A,B,C) VALUES
('hi','there','all') INSERT INTO Dups(A,B,C) VALUES ('bye','from','me') INSERT INTO Dups(A,B,C)
VALUES ('repeated','three','times') INSERT INTO Dups(A,B,C) VALUES ('repeated','three','times')
INSERT INTO Dups(A,B,C) VALUES ('repeated','three','times')

-- Duplicate rows - one from each group SELECT * FROM Dups T WHERE ID > ( SELECT MAX(ID) FROM Dups I
WHERE I.A = T.A AND I.B = T.B AND I.C = T.C AND I.ID <> T.ID )

-- Non-duplicated view - one row from each group SELECT * FROM Dups T WHERE ID >= ALL ( SELECT ID
FROM Dups I WHERE I.A = T.A AND I.B = T.B AND I.C = T.C )

-- Rows to be deleted - usefull for COUNT(*) and DELETE 
SELECT * FROM Dups T 
 WHERE ID < ( 
  SELECT MAX(ID) FROM Dups I 
   WHERE I.A = T.A AND I.B = T.B AND I.C = T.C AND I.ID <> T.ID 
  )

-- Delete the duplicates 
DELETE Dups 
 WHERE ID < ( 
  SELECT MAX(ID) FROM Dups I 
   WHERE I.A = Dups.A AND I.B = Dups.B AND I.C = Dups.C AND I.ID <> Dups.ID 
  )

--
Ivan Arjentinski