Skip to content

CTE to Remove Duplicates

Consider for a moment, a table of records including addresses and data, such that most every address occurs once, but for unknown reasons, also contains more than one duplicate address (656 in 55,279 below). How to remove duplicate addresses? let us count the ways.. Here is one:

pg_logo

copy (
 select a_name, address, city, state, zip,
      attribute_1, attribute_2, attribute_3,
      latitude, longitude
 from
 (select *,
  rank() over (partition by (address,city,state,zip)
    order by attribute_1)
   from data1_test) a
 where a.rank = 1
)
 to '/home/shared/daminifthisworks.csv' with CSV header;

-- COPY 55279 of 55935

truncate data1_test ;

copy data1_test from '/home/shared/daminifthisworks.csv' with CSV header;
-- COPY 55279

The subselect adds one more term, the rank, which is used in the outer expression to write out only the rank one result. Also, notice that SELECT * obviates the need to write all of these column names twice. So this is useful both in the case of true duplication of the entire row, and in other cases where the lines are different in a way that does not pertain to the situation at hand. This example differentiates with an ORDER rank() on one variable to choose which to keep – other methods could be devised.

No doubt someone will say, why dont you do this in one step, in place? The two step here suits me well enough. You can hold the .csv file as intermediary, and also backup the original table if you like.

Compare Two Tables

Given two tables, one of them is somewhat a subset of another, in certain columns..
   e.g.   address,city,state,zip.
How might we determine which entries in table A are not included in table B ? One way is to use EXCEPT.

  select c1,c2,c3,c4 from A except select c1,c2,c3,c4 from B;

An alternative to that is to use WHERE NOT EXISTS. Here is an example using a correlated subquery, and makes use of a rule to shorten the AND comparison sytnax, as follows (I believe both forms run in similar times)

  select c1,c2,c3 from A where not exists
    (select * from B where (B.c1,B.c2,B.c3)=(A.c1,A.c2,A.c3) );

As a first try, before I had seen EXCEPT, I solved this by doing a RIGHT JOIN to the smaller table A, including another field from B and then checking for NULL. This works but was about 10% slower.

thanks to RhodiumToad for guidance

UPDATE via robe2
“I have duplicate city rows in my table , with the same city name and state code but different geopoints, I am storing points for each of the city, I would like to delete cities that have the same data and are within a certain distance from each other, cities with same names but with a reasonable distance I would like to keep.”

  DELETE FROM cities USING cities As ct
    WHERE ct.gid > cities.gid AND ST_DWithin(ct.geom, cities.geom, 0.025) AND cities.name = ct.name;

Slice a Polygon by a Grid

Sometimes there are polygons with too many vertices to process comfortably. For that or other reasons, you may want to cut one or more polygons with a reference grid. Here is an example using the US 2010 Census and a set of grid cells in EPSG:3310:

pg_logo

--------------------------------------------------------------------
-- take a single census place polygon and make a table of it
-- transform from LonLat to EPSG:3310
--------------------------------------------------------------------
create table davis_place3310 as
SELECT
 gid, statefp10,placefp10,placens10,geoid10,
 name10,namelsad10,lsad10,classfp10,pcicbsa10,pcinecta10,
 mtfcc10,funcstat10,aland10,awater10,intptlat10,intptlon10,

 ST_SetSRID( ST_Transform( the_geom, 3310), 3310) as wkb_geometry
FROM tl_2010_06_place10 where namelsad10 = 'Davis city';
--------------------------------------------------------------------
alter table davis_place3310 add primary key (gid);
create index tdavis_geom_idx on davis_place3310 using GIST (wkb_geometry);
---------------------------------------------------------------------

--------------------------------------------------------------------
-- now make a sliced result with a grid and input poly
--------------------------------------------------------------------
create table davis_sliced as
SELECT
  g.ogc_fid, -- the pkey of the reference grid

  ST_Intersection(o.wkb_geometry, g.wkb_geometry) as wkb_geometry,

  ST_Area(ST_Intersection(o.wkb_geometry, g.wkb_geometry)) as shape_area

FROM

  davis_place3310 as o,
  davis_grid as g

WHERE
  ST_INTERSECTS(o.wkb_geometry, g.wkb_geometry);
--------------------------------------------------------------------------
alter table davis_sliced add primary key(ogc_fid);
create index tds_geom_idx on davis_sliced using GIST(wkb_geometry);
--------------------------------------------------------------------------

 A Single Input Polygon (Davis city) Partially Overlaps the Reference Grid


Next, More Than One Polygon Gridded:


drop table if exists davis_place3310 cascade;
create table davis_place3310 as
SELECT
 gid, statefp10,placefp10,placens10,geoid10,
 name10,namelsad10,lsad10,classfp10,pcicbsa10,pcinecta10,
 mtfcc10,funcstat10,aland10,awater10,intptlat10,intptlon10,

 ST_SetSRID( ST_Transform( the_geom, 3310), 3310) as wkb_geometry
FROM tl_2010_06_place10
WHERE namelsad10 in ( 'Davis city', 'West Sacramento city' );
--------------------------------------------------------------------
alter table davis_place3310 add primary key (gid);
create index tdavis_geom_idx on davis_place3310 using GIST (wkb_geometry);
---------------------------------------------------------------------
-- N Polygons can be intersected in one statement calling GROUP BY
-- a spatial aggregate is defined and invoked
---------------------------------------------------------------------
drop table if exists davis_sliced cascade;
create table davis_sliced as
SELECT

  g.ogc_fid, -- the pkey of the reference grid

  ST_Union(ST_Intersection(o.wkb_geometry, g.wkb_geometry)) as wkb_geometry,

  sum( ST_Area(ST_Intersection(o.wkb_geometry, g.wkb_geometry))) as shape_area

FROM

  davis_place3310 as o,

  davis_grid as g

WHERE
  ST_INTERSECTS(o.wkb_geometry, g.wkb_geometry)

GROUP BY
  g.ogc_fid, g.wkb_geometry;
--------------------------------------------------------------------------
alter table davis_sliced add primary key(ogc_fid);
create index tds_geom_idx on davis_sliced using GIST(wkb_geometry);
--------------------------------------------------------------------------

screenshot

PostGIS 2.0 Intro Presentation

UC

at UC Berkeley Soda Hall, for the San Francisco PostgreSQL Users Group

and again at the College of Natural Resources for GIS Day 2012

main slides and dustymugs’ raster slides

errata:
* The ST_MakeValid(geom) example could have been expanded to include ST_IsValidDetail()

* It is arguably wrong to say that Lat/Lng (EPSG:4326) data is one of the projections. I meant to present that there are spatial reference systems and you use one per table, so calling Lat/Lng one of them made one less turn in the directions for use. But so it goes.

SF GeoMeetup / AGU 2012 slides

KNN Search Speed Test

pg_logo
SYNOPSIS: Using one million randomly generated points, time finding the ten nearest points to the center of the range by using the KNN distance operator versus using a call to PostGIS ST_Distance().


## Use ST_Distance()

explain analyze 
  SELECT ST_Distance(the_geom, ST_GeomFromText('POINT(500 500)')) 
  FROM tgeom ORDER BY ST_Distance(the_geom, ST_GeomFromText('POINT(500 500)')) 
    asc limit 10; 

Total runtime: 1568.499 ms

 QUERY PLAN                                                            
--------------------------------------------------------
 Limit  (cost=289943.64..289943.67 rows=10 width=128) (actual time=1568.452..1568.457 rows=10 loops=1)
   ->  Sort  (cost=289943.64..292443.64 rows=1000000 width=128) (actual time=1568.444..1568.447 rows=10 loops=1)
         Sort Key: (st_distance(the_geom, '01010000000000000000407F400000000000407F40'::geometry))
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on tgeom  (cost=0.00..268334.00 rows=1000000 width=128) (actual time=0.102..1268.725 rows=1000000 loops=1)

## Use KNN Sorting

explain analyze 
  SELECT the_geom < -> ST_GeomFromText('POINT(500 500)') 
  FROM tgeom ORDER BY the_geom < -> ST_GeomFromText('POINT(500 500)') asc limit 10;

Total runtime: 1.847 ms

 QUERY PLAN                                                            
--------------------------------------------------------

 Limit  (cost=0.00..0.74 rows=10 width=128) (actual time=1.476..1.754 rows=10 loops=1)
   ->  Index Scan using tgeom_idx on tgeom  (cost=0.00..74124.46 rows=1000000 width=128) (actual time=1.474..1.748 rows=10 loops=1)
         Order By: (the_geom < -> '01010000000000000000407F400000000000407F40'::geometry)


roughly 850 times faster !!


Complete Setup Instructions for this Example:
Base System: debian 6.0 ‘squeeze’
Add Repositories: backports testing


sudo apt-get -t squeeze-backports install build-essential postgresql-contrib-9.1 postgresql-client-9.1 postgresql-server-dev-9.1 postgresql-9.1 autoconf automake1.9 libtool flex bison gdb

## setup postgres however you are used to working with it..

svn co http://svn.osgeo.org/postgis/branches/2.0/ postgis_20_branch

svn co http://svn.osgeo.org/geos/branches/3.3/ geos_33_branch

svn co http://svn.osgeo.org/metacrs/proj/trunk/proj proj_trunk

sudo apt-get install libexpat1 libexpat1-dev libxslt1.1 libxml2-dev libjson0-dev libcurl4-openssl-dev python2.6-dev

svn checkout http://libkml.googlecode.com/svn/trunk/ libkml_trunk

svn co http://svn.osgeo.org/gdal/branches/1.9/ gdal_19_branch
./configure –with-python –with-curl –without-libtool

## create database testdb ##
psql testdb -f postgis/postgis.sql
psql testdb -f spatial_ref_sys.sql
psql testdb -f raster/rt_pg/rtpostgis.sql

>psql
psql (9.1.6)
Type “help” for help.

testdb=# select postgis_full_version();
POSTGIS=”2.0.2SVN r10643″ GEOS=”3.3.6dev-CAPI-1.7.6″ PROJ=”Rel. 4.8.0, 6 March 2012″ GDAL=”GDAL 1.9.2, released 2012/10/08″ LIBXML=”2.7.8″ LIBJSON=”UNKNOWN” RASTER

CREATE TABLE tgeom (gid serial PRIMARY KEY,the_geom geometry);
INSERT into tgeom(the_geom) select st_geomfromtext('POINT('|| ((random()*1000)::integer)::text || ' ' ||  ((random()*1000)::integer)::text || ')' ) from generate_series(1,1000000);
CREATE INDEX tgeom_idx on tgeom using GIST(the_geom);

## Done Setup

based on
http://www.depesz.com/2010/12/11/waiting-for-9-1-knngist/

UPDATE for debian/ubuntu
   there is now a repository at apt.postgresql.org

Check for Validity

SELECT 
  name, state_name, fips, 
  st_summary(the_geom), 
  st_isvalidreason(the_geom) 
FROM usa_counties 
  WHERE not st_isvalid(the_geom);
NOTICE:  Ring Self-intersection at or near point -70.82466 42.26050
-[ RECORD 1 ]----+-------------------------------------------------
name             | Plymouth
state_name       | Massachusetts
fips             | 25023
st_summary       | MultiPolygon[B] with 1 elements
                 |   Polygon[] with 1 rings
                 |    ring 0 has 31 points
st_isvalidreason | Ring Self-intersection[-70.8246609058028 42.2605069326982]

Time: 268.654 ms
UPDATE usa_counties 
  set the_geom = ST_MakeValid(the_geom) where name='Plymouth' and fips = '25023'; 

SELECT 
  name, state_name, fips, 
  st_summary(the_geom), st_isvalidreason(the_geom) 
FROM usa_counties where not st_isvalid(the_geom); 

 name | state_name | fips | st_summary | st_isvalidreason
------+------------+------+------------+------------------ 
(0 rows) 


select postgis_version();
            postgis_version            
---------------------------------------
 2.0 USE_GEOS=1 USE_PROJ=1 USE_STATS=1
(1 row)


  examples of invalid multipolygon

Distance to Transit in SACOG

parcels

In California, there are multi-county, regional government agencies that perform some functions like planning public transportation systems. The capital city of California is Sacramento, and the regional council is called the Sacramento Area Council of Governments (SACOG).

Using a database of train and light rail stations in a projection where distance measurement is accurate, and the 2010 US Census data, it is possible to find distance from a train station to a census tract or, better yet a census block, like so:

-- return the distance of each transit station from census tract 1768
SELECT 
  b.ogc_fid, 
  st_distance( t.wkb_geometry, b.wkb_geometry)::integer as dist
FROM
  transit_stations_updated t,
  census_2010_tracts b
WHERE
  t.region = 'SACOG' and b.ogc_fid = 1768
ORDER BY
  st_distance( t.wkb_geometry, b.wkb_geometry);


 

But wait, if we want to get the distance to a transit stop, it is far more useful to find the distance to the nearest transit stop.. and getting distance to one transit stop is straightforward, but what about all transit stops in the district, using one SQL statement for performance?
  For this, we can use the handy WINDOW FUNCTIONS introduced in PostgreSQL 8.4. Let’s break the problem into parts.
  First, list the distances from one census block, in order, using a RANK() function too:
 

-- return the ranked distance of each transit station from census block 433144
SELECT 
  b.ogc_fid as census_block_id,
  t.ogc_fid as transit_station_id,
  st_distance( t.wkb_geometry, b.wkb_geometry)::integer as dist,
  rank() OVER 
   (partition by b.ogc_fid ORDER BY st_distance( t.wkb_geometry, b.wkb_geometry))
FROM
  transit_stations_updated t,
  census_2010_blocks b
WHERE
   t.region = 'SACOG' and b.ogc_fid = 433144
ORDER BY 
   st_distance( t.wkb_geometry, b.wkb_geometry)::integer;

--
 census_block_id | transit_station_id | dist  | rank 
-----------------+--------------------+-------+------
          433144 |                758 |   176 |    1
          433144 |                715 |   369 |    2
          433144 |                716 |  1152 |    3
          433144 |               1137 |  1377 |    4
...

 
Windowing functions have been written to be very general, so the scope of any description is far larger than this post. What we can observe here is that a partition has been declared by census block, and the order within the partition is by distance between census block and transit stop. Now we have almost all of the components in place to solve: for every census block, return only the nearest transit stop. Here I have written a solution using a subselect – there are definitely other ways to write this! Note that a judicious use of indexing can speed things up quite a bit.

-- return only the nearest transit station to each census block and if less than 100 meters
SELECT census_block_id, transit_station_id, dist
FROM
(
  SELECT 
    b.ogc_fid as census_block_id,
    t.ogc_fid as transit_station_id,
    st_distance( t.wkb_geometry, b.wkb_geometry)::integer as dist,
    rank() OVER 
      (partition by b.ogc_fid ORDER BY st_distance( t.wkb_geometry, b.wkb_geometry)) as rank
  FROM
    transit_stations_updated t,
    census_2010_blocks b
  WHERE 
   st_distance(t.wkb_geometry, b.wkb_geometry) < 100 AND
   b.countyfp10 = '067' AND
   t.region = 'SACOG'
) t1

WHERE t1.rank = 1;


 

Finally, a variation of this problem is to take all locations within a certain buffer distance of at least one transit stop, and record them along with the distance to the nearest transit stop.

plr median

pg_logo


----------------------------------------- install ------------
create or replace function r_median(_float8) returns float as '
median(arg1)
' language 'plr';

CREATE AGGREGATE median (
sfunc = plr_array_accum,
basetype = float8,
stype = _float8,
finalfunc = r_median
);


----------------------------------------- test ---------------
select max(a.res),avg(a.res),median(a.res),min(a.res)
from
( select (st_area(wkb_geometry) - shape_area) as res
from landtypes_090911_marin_parcel_adjusted) a;

thuban1 Time: 21255 ms
r900 Time: 13880 ms
ps Time: 9337 ms
macpro Time: 8856 ms
i7-960+ Time: 6980 ms

ps- add standard deviation sd too
pps- just noticed this post

Calling Bluff

Two years ago a hard disk arrived on the desk of a colleague, from persons claiming to have some kind of parcel data “for every state in the US.” Naturally, being very skeptical and at the same time, just a bit eager to show off open source tools on linux, I whipped up a script to make this visualization — the result was many gigabytes of shapefiles represented by one 135K png.

What was the difference between loading potentially half of North America as parcels into a database, or simply gathering what is needed and moving on without loading anything?

=> recursively find each shapefile; use ogrinfo to get just the shapefile layer’s BBOX; use a regular expression to emit an INSERT statement into a spatial postgis table defined for this purpose; use psycopg2 to execute each INSERT, periodically COMMIT; COMMIT again at the end of the loop.

On completion you have a table of BBOX, one for each shapefile layer. For the visualization, I used QGis. All shapefile BBOX’s are displayed with transparency to show the density and the metro areas background.

Even given the excess inherent in using its BBOX to represent each shapefile, the graphic easily showed that the collection of data was not exactly as it was described.

Grid 150m classification

This visualization shows two quantitative dimensions at once, on a 150 meter square grid in a study area in California. The base color is a category of land use, and the gray shading is a percentage of developable acreage. The number on each grid cell, drawn using QGis, is synonymous with the gray shade.

grid150m classifications

Stage of Summarization

Here grid cells are classified into three types by landuse, and county parcels are split along grid cell boundaries, in preparation for computing a result stored into a database table of grid cells as table rows.

Note that roadways, water bodies and some public lands are correctly absent in net calculations, but since the area of a grid cell is a well defined constant, acreage and ratios are trivially computed.