Finding Duplicates with SQL
By Pete Freitag
Let's suppose you coded an email newsletter signup form but forgot to check that the email address was not already in the database. We can write a query to find all the emails in our table that are duplicates, that is, that occur in more than one row.
The following SQL query works great for finding duplicate values in a table.
SQL To Find Duplicates
SELECT email, COUNT(email) AS NumOccurrences FROM users GROUP BY email HAVING ( COUNT(email) > 1 )
By using GROUP BY and then HAVING a count greater than one, the SQL above finds the email addresses that appear in more than one row.
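To try the query without touching a real database, here is a minimal sketch that runs it against an in-memory SQLite table through Python's standard `sqlite3` module. The sample rows and the `find_duplicate_emails` helper are made up for illustration, and the `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3


def find_duplicate_emails(conn):
    """Return (email, occurrences) for every email stored in more than one row."""
    query = """
        SELECT email, COUNT(email) AS NumOccurrences
        FROM users
        GROUP BY email
        HAVING COUNT(email) > 1
        ORDER BY email  -- not in the original query; added for stable output
    """
    return conn.execute(query).fetchall()


# Hypothetical sample data: a@ appears three times, b@ twice, c@ once.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",), ("a@example.com",),
     ("c@example.com",), ("b@example.com",), ("a@example.com",)],
)

print(find_duplicate_emails(conn))
# [('a@example.com', 3), ('b@example.com', 2)]
```

Note that `COUNT(email)` skips NULLs, so rows with a NULL email are never reported as duplicates of each other.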
Find Rows that Occur Once
As the opposite of our duplicates query, we can use a variant of that SQL statement to find values that occur exactly once:
SELECT email FROM users GROUP BY email HAVING ( COUNT(email) = 1 )
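The same kind of quick check works for the "occurs once" variant. This is a sketch using Python's `sqlite3` module with an in-memory table; the sample data and the `find_unique_emails` helper name are assumptions for illustration.

```python
import sqlite3


def find_unique_emails(conn):
    """Return the emails that appear in exactly one row."""
    query = """
        SELECT email
        FROM users
        GROUP BY email
        HAVING COUNT(email) = 1
        ORDER BY email  -- added for stable output
    """
    return [row[0] for row in conn.execute(query)]


# Hypothetical sample data: only b@ and c@ occur once.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",), ("a@example.com",), ("c@example.com",)],
)

print(find_unique_emails(conn))
# ['b@example.com', 'c@example.com']
```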
Original Sources: Huajun Zhai's Blog, How to remove duplicate rows from a table - Microsoft Knowledge base article (139444).
Finding Duplicates with SQL was first published on October 06, 2004.
Comments
murali is a table that has a name field. The name 'mani' is repeated in it ten times, and I want to get just nine of those 'mani' rows from the table.
Can someone help me with that?
Regarding query:
SELECT email,
COUNT(email) AS NumOccurrences
FROM users
GROUP BY email
HAVING ( COUNT(email) > 1 )
I did:
SELECT email,
COUNT(email) AS NumOccurrences
FROM users
GROUP BY email
HAVING ( NumOccurrences > 1 )
I tried it and it seemed to work. Would it reduce the COUNT time because I don't need to use the COUNT() function again or is it the same thing?
Cheers,
Mark
adcoil.com
Query
SELECT a.customer_id,a.name,a.email from customer a inner join customer b on a.name=b.name and a.email=b.email and a.customer_id<>b.customer_id
I think this will work
Which is the best method (in terms of performance)?
I found my dups but how do I display the different columns (fields) of the duplicated records (ie:name, address, phone, etc). I tried to use "where exists" along with the count stmt but that displays all recs
dave
--select all fields literally
select empno,ename,job,mgr,hiredate,sal,comm,deptno, count(*)
from tempemp
--group by all fields literally
group by empno,ename,job,mgr,hiredate,sal,comm,deptno
having count(*) > 1
order by count(*) desc,sal asc;
I hope this helps someone else!
elements which are common in both tables in the following format
element name Table name
m A
m B
thanks for ur help
Tx for the help in advance.
In fact, I went further and found not only the dupes, but just *how* duplicated they are. This is my query:
SELECT field3, COUNT (field3) AS count
FROM table2
GROUP BY field3
HAVING ( COUNT (field3) > 1)
ORDER BY count DESC, field3
The results start with the entries duplicated the *most*, and continue on to the ones duplicated only twice.
Anyway, yer post helped me out a lot, and next time I'm in New York, I ought to buy you a drink.
Where KEY_ID IN (Select MIN(KEY_ID)
FROM table1 Group by REPEATED FIELDS
Having count(REPEATED FIELDS) > 1)
Here is a query that displays all records in a table where two selected field values are the same in Oracle syntax. SELECT * FROM table WHERE same_value_field1 = same_value_field2; I hope that is what you are looking for.
I want to first show the version in the target language... and if not, fall back on English (this way, the Wiki starts out filled in for the user of any language).
It's like taking the following two queries and merging them together so that I have distinct topic_id's and don't miss any topics... and always making sure the current language wins.
SELECT topic_id, title FROM revisions WHERE topic_id=:parent_id AND language_id=:current_language
SELECT topic_id, title FROM revisions WHERE topic_id=:parent_id AND language_id=1
# in this example, language 1 is english
Any ideas? Performance is important since there could be a large # of topics.
Thanks,
Pete
Could I do this?
SELECT email,
COUNT(email) AS NumOccurrences
FROM users
GROUP BY email
ORDER BY created_date desc
HAVING ( COUNT(email) > 1 )
Will it work with ORDER BY? I need to start from latest record.
If you have a key based on multiple columns, the correct statement is:
SELECT ColKey1, ColKey2, ... , COUNT(*)
FROM TableName
GROUP BY ColKey1, ColKey2, ...
HAVING COUNT(*) > 1
This will return rows having duplicates.
Regards,
Stan Daymond, London, UK
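The multi-column form can be checked the same way. Below is a minimal sketch with Python's `sqlite3` module; the `people` table, its columns and the sample rows are hypothetical stand-ins for `TableName`, `ColKey1` and `ColKey2`.

```python
import sqlite3


def find_duplicate_keys(conn):
    """Return (first_name, last_name, occurrences) for repeated key combinations."""
    query = """
        SELECT first_name, last_name, COUNT(*) AS occurrences
        FROM people
        GROUP BY first_name, last_name
        HAVING COUNT(*) > 1
        ORDER BY first_name, last_name  -- added for stable output
    """
    return conn.execute(query).fetchall()


# Hypothetical data: only the (Ann, Lee) combination is repeated.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT, last_name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Ann", "Lee", "Leeds"), ("Ann", "Lee", "York"),
     ("Ann", "Ray", "Leeds"), ("Bob", "Lee", "Bath")],
)

print(find_duplicate_keys(conn))
# [('Ann', 'Lee', 2)]
```

Sharing only one of the key columns (here "Ann" or "Lee") is not enough to be reported; the whole combination has to repeat.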
1. Name
2. Email
And we have to find out the common email with their all fields value.
So this should be a solution
select n.* from newsletter n where
(select count(email) from
newsletter where email=n.email)>1
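As a rough check of the correlated-subquery approach above, here is a sketch that runs it in SQLite through Python's `sqlite3` module. The two-column `newsletter` table follows the comment; the sample rows and the helper name are assumptions.

```python
import sqlite3


def rows_sharing_an_email(conn):
    """Return every full row whose email also appears in at least one other row."""
    query = """
        SELECT n.*
        FROM newsletter n
        WHERE (SELECT COUNT(email) FROM newsletter WHERE email = n.email) > 1
        ORDER BY n.email, n.name  -- added for stable output
    """
    return conn.execute(query).fetchall()


# Hypothetical data: Ann and Anna signed up with the same address.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE newsletter (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO newsletter VALUES (?, ?)",
    [("Ann", "a@example.com"), ("Bob", "b@example.com"),
     ("Anna", "a@example.com"), ("Cy", "c@example.com")],
)

print(rows_sharing_an_email(conn))
# [('Ann', 'a@example.com'), ('Anna', 'a@example.com')]
```

Unlike the GROUP BY query, this returns each duplicated row in full, one line per row, rather than one line per email.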
SELECT distinct email
FROM users
The logic is simply 3 select statements using the word EXISTS. Can anyone help me? Let me know if you need the same logic to be posted.
Please help me with this.
I have a table with 4 columns: A, B, C, D.
The table has maybe around 5 rows in which two rows have exactly the same values in all the 4 columns. What is the SQL with which I can pickup the duplicate row?
Thx,
Srikanth
IFILN, IBOOKN, IDTBOK, inmate_id, visit_no.
inmate_id is a unique identifier attached to each person, and I need to be able to count the number of times that person appears and put that count into visit_no. Example:
inmate_id visit_no
A 1
B 1
A 2
C 1
A 3
B 2
so that it shows this was the 1st visit, this was the 2nd, etc. Can I use a form of this? If so (or if not), any ideas on how to implement it?
A, B, C, and the data in columns A and B is duplicated. I need to remove the duplicated records, but before I remove a record I have to check column 'B', which has a condition: if the column 'B' data is 0, I have to remove all the other duplicated rows; otherwise I have to keep the records. In other words, the first priority is to keep the 0.
Example:
A B C
1 3 0
1 3 2
I need to have the result of
A B C
1 3 0
Please any idea?
Thanks,
Daniel
I have a similar issue -- let's say I have two tables, A and B, with two columns in each, table A is a table where users upload data, and B is a final table. A has duplicate data, as well as updated records. I've been trying to figure out a sql query which would go through table A, find which entries are not in table B and then copy them over. Further, it should check to see if any records have been updated in Table A and replace the data in Table B with the new data.
Any ideas?
example:
A(Temp Table)
1 1
2 22
3 3
B(Final Table) (Before addition of A)
1 1
2 2
4 4
B(Final Table) (After addition of A)
1 1
2 22
3 3
4 4
That should do the trick...
SELECT trading_id,trading_name
FROM schema.marketer_tbl a
WHERE ROWID >
(SELECT MIN (ROWID)
FROM schema.marketer_tbl b
WHERE b.trading_id = a.trading_id AND b.trading_name= a.trading_name);
And if we put DISTINCT after the first SELECT, we'll get exactly the values that are repeated.
Thanks,
Sinoy Xavier
Infosys, Bangalore
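The ROWID comparison above is written for Oracle. As a sketch only, the same idea can be tried in SQLite, which also gives ordinary tables an implicit `rowid`; this is a swapped-in stand-in and the two are not the same thing (Oracle's ROWID is a physical row address). The sample rows and both helper names are hypothetical, and the DELETE variant is an addition that keeps the lowest-rowid copy of each pair.

```python
import sqlite3


def extra_copies(conn):
    """Return every row that has an earlier row with the same id and name."""
    query = """
        SELECT a.trading_id, a.trading_name
        FROM marketer_tbl a
        WHERE a.rowid > (SELECT MIN(b.rowid)
                         FROM marketer_tbl b
                         WHERE b.trading_id = a.trading_id
                           AND b.trading_name = a.trading_name)
        ORDER BY a.rowid
    """
    return conn.execute(query).fetchall()


def delete_extra_copies(conn):
    """Delete all but the first copy of each (id, name) pair; return rows removed."""
    cur = conn.execute("""
        DELETE FROM marketer_tbl
        WHERE rowid NOT IN (SELECT MIN(rowid)
                            FROM marketer_tbl
                            GROUP BY trading_id, trading_name)
    """)
    return cur.rowcount


# Hypothetical data: (1, 'Acme') is stored three times.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marketer_tbl (trading_id INTEGER, trading_name TEXT)")
conn.executemany(
    "INSERT INTO marketer_tbl VALUES (?, ?)",
    [(1, "Acme"), (2, "Bolt"), (1, "Acme"), (1, "Acme"), (2, "Core")],
)

print(extra_copies(conn))
# [(1, 'Acme'), (1, 'Acme')]
```

Calling `delete_extra_copies(conn)` afterwards removes those two extra rows and leaves one copy of each pair. On a real table, run the SELECT first and check its output before deleting anything.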
I want to select users who are belonging to more than one group.
I am looking to display all duplicate records in my table but in two fields.
tried this but it's not working. Could anyone tell me what's wrong with it?
SELECT NAME, Address1 COUNT(NAME) AS NumOccurName,
COUNT(Address1) AS NumOccurAddress1,
FROM general_table
GROUP BY Address1
HAVING ( COUNT(NAME) > 1 ) AND ( COUNT(Address1) > 1)
Great solution
I need to filter the records based on the unique combination 3 fields
eg: in source
fld1 fld2 fld3
a 1 1
a 1 1 dup record
a 2 1
x x 1
x x 1 dup record
i need filter the duplicates
so my output should be
fld1 fld2 fld3
a 1 1
a 2 1
x x 1
So I need a query to get this output; I need the first occurrence of each unique record.
Please help me, thanks in advance.
Thank you.
Serj
that is having values like a,b,c,d
i need to take status based on the priority wise(a-2,b-3,c-1,d-4)
eg:
Emp status
1 a
1 b
1 c
2 b
2 c
3 d
3 b
in the above table
output should be
emp status
1 c
2 c
3 b
please help me out to get that above output
thanks in advance
training abc jan april
training def march
Any idea how to do that? I don't want to repeat the same training; I just want to view it as one training but with different dates. Thank you so much.
"Hi Padam. retrieve all columns from duplicate records like this: SELECT * from tbl where tbl.col in ( SELECT tbl.col FROM tbl GROUP BY tbl.col HAVING ( COUNT(*) > 1 )) order by tbl.col Ed"
What will happen if the data is like below? In the structure below, INDEX 1 and 2 are duplicates. In this case I need to identify one of the indexes as a duplicate.
NAME NAME_1 VALUE DATE INDEX
SURI SE 275 13/12/2005 1
SURI SE 375 1
SURI SE 475 1
SURI SE 275 13/12/2005 2
SURI SE 375 2
SURI SE 475 2
Now Table A has
Last Name, First Name, Country, New Table
Now Table B has
Last Name, First Name, Country, Old Table
I want to remove the duplicates across these 2 tables and have my result as a new Table C which has no duplicates.
i.e C= a-b
plz help me..
Plz
create table "Table C" as
select "Last Name", "First Name", "Country", "New Table" from "Table A"
union
select "Last Name", "First Name", "Country", "Old Table" from "Table B";
This will give you all the unique records in a and b
create table "Table C" as select "Last Name", "First Name", "Country", "New Table" from "Table A" intersect select "Last Name", "First Name", "Country", "Old Table" from "Table B";
This should give you only the commonalities of a and b.
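To see how the set operators differ, here is a small sketch in SQLite via Python's `sqlite3` module. The table and column names are simplified, hypothetical versions of the ones in the question, and the extra "New Table" / "Old Table" column is left out. EXCEPT is included because it may be what "C = a - b" was asking for, though that reading is a guess.

```python
import sqlite3

COLUMNS = "last_name, first_name, country"


def combine(conn, operator):
    """Combine table_a and table_b with UNION, INTERSECT or EXCEPT."""
    if operator not in ("UNION", "INTERSECT", "EXCEPT"):
        raise ValueError("unsupported set operator: " + operator)
    query = (
        f"SELECT {COLUMNS} FROM table_a "
        f"{operator} "
        f"SELECT {COLUMNS} FROM table_b "
        "ORDER BY 1, 2"  # sort by last name, then first name
    )
    return conn.execute(query).fetchall()


# Hypothetical data: John Doe is in both tables (and twice in table_a).
conn = sqlite3.connect(":memory:")
for table in ("table_a", "table_b"):
    conn.execute(f"CREATE TABLE {table} (last_name TEXT, first_name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?, ?)",
    [("Doe", "John", "US"), ("Roe", "Jane", "UK"), ("Doe", "John", "US")],
)
conn.executemany(
    "INSERT INTO table_b VALUES (?, ?, ?)",
    [("Doe", "John", "US"), ("Poe", "Ed", "CA")],
)

print(combine(conn, "UNION"))      # every distinct row from either table
print(combine(conn, "INTERSECT"))  # rows present in both tables
print(combine(conn, "EXCEPT"))     # rows in table_a that are not in table_b
```

All three operators also remove duplicate rows from their result, which is why John Doe shows up at most once.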
the solutions above list the records only once.
I have a table where the duplicate is based on 5 columns but the remaining column may be different.
So I want to query and bring back only the dupes. I don't want a count.
I have a table with duplicate records, but the duplicate records are based on all fields with the exception of the key field. All records have an ID which is the pk, so technically, the records are unique, but I need to delete duplicate records based on the other fields.
Example:
ID Name Number City
1 John Doe Nashville
2 John Doe Nashville
I want to keep one of them and remove the other. Each has a unique pk, so selecting which records to delete is difficult. Doesn't matter which one is deleted, as long as only one remains.
Any help would be greatly appreciated!! Thank you so much!
persionID FullName
01245 Donkey kong
01245 Donkey kongKing
Both are the same person's name. I cannot get the distinct name out of it because the name is stored differently, even though I use DISTINCT on the person ID.
And joining with other tables gives me even worse results. I also don't see any condition that I could apply for this selection.
Help
SELECT * FROM [your_table_name] ORDER BY [your_table_name].[date] ASC
For example: empno, empname. When retrieved, I want them to come back as a single field. Can anyone provide me an SQL query for this?
No, a simple select statement will not do. The example that I have given only has three IDs, but might have millions in real life. I need a generic solution that will ONLY list IDs which have records that ONLY contain a status of 'n'. Any ID that has even one record with a different status MUST be excluded.
The way that I have done this is to use a cursor, first summing the number of entries for each ID, then secondly summing the number of entries for each ID that has a status of 'n', and then comparing the two results. If an ID has a "total record" count that equals the "'n' status record" count of x, it is included in the end result set. If the "'n' status" count is less than the "total" count, it is obvious that there are other statuses involved, and the ID is excluded from the result set.
The above works fine, but I am sure there must be a more elegant way of doing this. I just don't know how!
Kind regards
Francois
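One possible cursor-free version of the two-count comparison described above is a single grouped query that computes both counts in the HAVING clause. This is only a sketch: the `records` table, its `id` and `status` columns and the sample rows are assumptions, run here in SQLite through Python's `sqlite3` module.

```python
import sqlite3


def ids_with_only_status_n(conn):
    """Return the IDs whose records all have status 'n'."""
    query = """
        SELECT id
        FROM records
        GROUP BY id
        -- total rows for the ID must equal its rows with status 'n'
        HAVING COUNT(*) = SUM(CASE WHEN status = 'n' THEN 1 ELSE 0 END)
        ORDER BY id
    """
    return [row[0] for row in conn.execute(query)]


# Hypothetical data: IDs 1 and 4 have only 'n' records; 2 is mixed; 3 has none.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(1, "n"), (1, "n"), (2, "n"), (2, "y"), (3, "y"), (4, "n")],
)

print(ids_with_only_status_n(conn))
# [1, 4]
```

A NULL status counts as "not 'n'" here, so an ID with any NULL status row is excluded as well.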
i'm trying to distinctly count the number of entries by date by sales contact - this query works but only returns 1 column with the months i specify, i would like to iterate through all months:
select distinct [sales contact],
COUNT([sent to customer date])
as 'jan'
from quotes
where [sent to customer date]
like '%feb-10%' group by [Sales Contact]
and of course changing it to:
select distinct [sales contact],
COUNT([sent to customer date])
as 'jan',
COUNT([sent to customer date])
as 'feb'
from quotes
where [sent to customer date]
like '%feb-10%' group by [Sales Contact]
adds the second column as feb, but fills it with the results from jan.
Formid FormDate company profile
In that, FormDate contains the dates on which the records were created. It contains dates from 2008 to 2010, and formid is unique. I have created a web page with a From Date and To Date selection list. If I select a particular From Date and To Date and click the submit button, it needs to show the form ids of the records within that date range. And when I click on a particular form id, it needs to show the record for that form id.
Should I join the crowd with a comment on the lines of "I want an SQL statement that cleans my teeth, puts my babies to bed, and goes to the supermarket for me"?
I think not. ;)
COUNT(email) AS NumOccurrences
FROM users
GROUP BY email
HAVING ( COUNT(email) > 1 )
===========
Where do I go to do all of this?
I would like to learn DATABASE concepts. Could anyone provide good URLs for DATABASE concepts? I would be grateful.
I have been trying to write a Sybase query to check for duplicate records and to execute conditions when the number of duplicate records is more than 5, and also when it is less than 5. I have the query below; please review it and give feedback:
IF (select A,B,C from TEMP_TABLE
group by A,B,C
having count(*) > 1 ) > 5
Print "Error"
else
((select A,B,C from TEMP_TABLE
group by A,B,C
having count(*) > 1) <5
INSERT INTO TABLE_1 (A,B,C) SELECT A,B,C FROM TEMP_TABLE )
I have one table with sales data spanning multiple years. This table includes all buying customers (CustNo) with sales month (Date) and shop area (ShopNo)
What I need is a result displaying a unique count of customers, starting in January and for each month adding the customers that haven't bought yet this year (first buyers). This will be calculated for each area too.
Example: In January 500 customers bought articles, in February it was 530, of which 40 were there for the first time this year, in March 490 customers bought articles, of which 25 were there for the first time this year.
Smaller figures will occur for individual areas. Example: In one sub area 350 customers bought something in January; 360 in February (15 first buyers) and 340 in March (20 first buyers).
My increasing counting table would then look like this:
Year:Month:Area:Count:
2010:January:AllAreas:500
2010:February:AllAreas:540
2010:march:AllAreas:565
2010:January:SubArea:350
2010:February:SubArea:365
2010:march:SubArea:385
How do I achieve that without creating multiple temp tables for each month and merging them manually, grouping and counting customers after each merger?
Thanks for your help!
insert into table b (a,b,c,d,e,f) values (1,2,3,4,5,6) Where a,b,c,d,e,f Not in (select a,b,c,d,e,f from table b)
I cant seem to make it work.. is this possible
Create table tableCleanDup
(idfield int, field1 varchar(30), field2 varchar(30))
Create unique index removeduplicates on tableCleanDup (field1,field2) with IGNORE_DUP_KEY
insert tableCleanDup
select * from tableOriginal
it will send a message "duplicate key was ignored" but that is fine.
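`IGNORE_DUP_KEY` is an index option in SQL Server and Sybase, so the statement above will not run everywhere. As a sketch of the same idea in SQLite, which has no such option, a unique index combined with `INSERT OR IGNORE` gives a similar effect. The table names follow the comment; the sample rows and the helper name are hypothetical, and this is a swapped-in technique rather than the commenter's exact method.

```python
import sqlite3


def copy_without_duplicates(conn):
    """Copy tableOriginal into tableCleanDup, skipping repeated (field1, field2)."""
    conn.execute("""
        INSERT OR IGNORE INTO tableCleanDup (idfield, field1, field2)
        SELECT idfield, field1, field2
        FROM tableOriginal
        ORDER BY idfield  -- lowest idfield of each pair is inserted first
    """)
    return conn.execute(
        "SELECT idfield, field1, field2 FROM tableCleanDup ORDER BY idfield"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableOriginal (idfield INTEGER, field1 TEXT, field2 TEXT)")
conn.execute("CREATE TABLE tableCleanDup (idfield INTEGER, field1 TEXT, field2 TEXT)")
# The unique index is what makes a second copy of a pair a conflict.
conn.execute("CREATE UNIQUE INDEX removeduplicates ON tableCleanDup (field1, field2)")

# Hypothetical data: rows 2 and 5 repeat the pairs of rows 1 and 3.
conn.executemany(
    "INSERT INTO tableOriginal VALUES (?, ?, ?)",
    [(1, "a", "x"), (2, "a", "x"), (3, "b", "y"), (4, "a", "z"), (5, "b", "y")],
)

print(copy_without_duplicates(conn))
# [(1, 'a', 'x'), (3, 'b', 'y'), (4, 'a', 'z')]
```

Unlike the SQL Server version, SQLite skips the conflicting rows silently, without a "duplicate key was ignored" message.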