CTE to Remove Duplicates « darkblue blog

CTE to Remove Duplicates

23 March 2013, 9:00 pm

Consider for a moment, a table of records including addresses and data, such that most every address occurs once, but for unknown reasons, also contains more than one duplicate address (656 in 55,279 below). How to remove duplicate addresses? let us count the ways.. Here is one:

pg_logo


copy (

 select a_name, address, city, state, zip,

      attribute_1, attribute_2, attribute_3,

      latitude, longitude

 from

 (select *,

  rank() over (partition by (address,city,state,zip)

    order by attribute_1)

   from data1_test) a

 where a.rank = 1

)

 to '/home/shared/daminifthisworks.csv' with CSV header;
-- COPY 55279   of 55935
truncate data1_test ;
copy data1_test from '/home/shared/daminifthisworks.csv' with CSV header;

-- COPY 55279

The subselect adds one more term, the rank, which is used in the outer expression to write out only the rank one result. Also, notice that SELECT * obviates the need to write all of these column names twice. So this is useful both in the case of true duplication of the entire row, and in other cases where the lines are different in a way that does not pertain to the situation at hand. This example differentiates with an ORDER rank() on one variable to choose which to keep – other methods could be devised.

No doubt someone will say, why dont you do this in one step, in place? The two step here suits me well enough. You can hold the .csv file as intermediary, and also backup the original table if you like.

Compare Two Tables

Given two tables, one of them is somewhat a subset of another, in certain columns..
e.g. address,city,state,zip.
How might we determine which entries in table A are not included in table B ? One way is to use EXCEPT.
select c1,c2,c3,c4 from A except select c1,c2,c3,c4 from B;

An alternative to that is to use WHERE NOT EXISTS. Here is an example using a correlated subquery, and makes use of a rule to shorten the AND comparison sytnax, as follows (I believe both forms run in similar times)
select c1,c2,c3 from A where not exists (select * from B where (B.c1,B.c2,B.c3)=(A.c1,A.c2,A.c3) );

As a first try, before I had seen EXCEPT, I solved this by doing a RIGHT JOIN to the smaller table A, including another field from B and then checking for NULL. This works but was about 10% slower.

thanks to RhodiumToad for guidance

UPDATE via robe2
“I have duplicate city rows in my table , with the same city name and state code but different geopoints, I am storing points for each of the city, I would like to delete cities that have the same data and are within a certain distance from each other, cities with same names but with a reasonable distance I would like to keep.”

DELETE FROM cities USING cities As ct WHERE ct.gid > cities.gid AND ST_DWithin(ct.geom, cities.geom, 0.025) AND cities.name = ct.name;

Filed under Postgres

Comments are closed | Permalink

darkblue blog

CTE to Remove Duplicates

Compare Two Tables

Recent Posts

Archives

Categories

Blogroll

Meta