Numeric Stats on Bay Area Intersection Counts

5 August 2014, 11:24 am

pg_logo

In preparing for an upcoming Datathon, a column of data in PostgreSQL numeric format needed formatting for presentation. “Intersection Count” intersection_density_sqkm is a count of street intersections per unit area – a quick way to measure density of the built environment. A table of grid cells (covering the nine-county San Francisco Bay Area) that the column comes from consists of roughly 814,000 cells. How to quickly characterize the data contents? Use SQL and the psql quantile extension to look at ranges, with and without the zeroes.

SELECT 
   min(intersection_density_sqkm), 
    quantile(intersection_density_sqkm,ARRAY[0.02,0.25,0.5,0.75,0.92]), 
   max(intersection_density_sqkm) 

 FROM uf_singleparts.ba_intersection_density_sqkm

-[ RECORD 1 ]-----------------------------------------------------------------------------
min      | 0.0
quantile | {0.0, 0.0, 0.683937...,3.191709...,25.604519...}
max      | 116.269430...

Psql extension quantile takes as arguments a column name and an ARRAY for N positional elements by percentage, e.g. above

How Many Gridcells Have Non-zero Data ?

select count(*) from ba_intersection_density_sqkm;
count => 814439

select count(*) from ba_intersection_density_sqkm where intersection_density_sqkm <> 0;
count => 587504

Select stats on non-zero data

 SELECT 
   min(intersection_density_sqkm), 
    quantile(intersection_density_sqkm,ARRAY[0.02,0.25,0.5,0.75,0.92]), 
   max(intersection_density_sqkm) 

 FROM uf_singleparts.ba_intersection_density_sqkm

 where intersection_density_sqkm <> 0;

-[ RECORD 1 ]-----------------------------------------------------------------------------
min      | 0.227979...
quantile | {0.227979...,0.455958...,1.367875...,7.751295...,31.461139...}
max      | 116.269430...

and, what does the high-end of the range look like ? Use SQL for a quick visual inspection for either outliers or smooth transitions:

SELECT intersection_density_sqkm 
FROM ba_intersection_density_sqkm 
 ORDER BY  intersection_density_sqkm   desc limit 12;

 intersection_density_sqkm 
---------------------------
      116.2694300518134736
      115.5854922279792768
      115.3575129533678764
      115.1295336787564760
      114.9015544041450756
      114.9015544041450756
      114.4455958549222792
      113.7616580310880824
      112.6217616580310892
      112.6217616580310892
      112.1658031088082884
      112.1658031088082884

So, recall that a natural log e of 1.0 is 0; a natural log of 116 is slightly over 4.75; a natural log of a number less than 1 is a negative number. To simplify the range for visualization, add a float column called data, set the new data column to the natural log of (intersection_density_sqkm + 1); use a simple multiply-then-divide technique to limit the precision to two digits (screenshot from an IPython Notebook session using psycopg2).

 select quantile(data,ARRAY[0.02,0.25,0.5,0.75,0.92]) from ba_intersection_density_sqkm;
  { 0, 0, 0.52, 1.43, 3.28 }

SELECT
  min(data), 
    quantile(data,ARRAY[0.02,0.25,0.5,0.75,0.92]), 
  max(data) 
FROM ba_intersection_density_sqkm
WHERE data <> 0;

 min  |          quantile          | max  
------+----------------------------+------
 0.21 | {0.21,0.38,0.86,2.17,3.48} | 4.76
(1 row)

Final Results in GeoServer 2.5 CSS styler:

ps- a full sequential scan on this table takes about four seconds, on a Western Digital Black Label 1TB disk, ext4 filesystem, on Linux.

Filed under BigData, Postgres

Comments are closed | Permalink

darkblue blog