In preparing for an upcoming Datathon, a column of data in PostgreSQL `numeric`

format needed formatting for presentation. “Intersection Count” `intersection_density_sqkm`

is a count of street intersections per unit area – a quick way to measure density of the built environment. A table of grid cells (covering the nine-county San Francisco Bay Area) that the column comes from consists of roughly 814,000 cells. How to quickly characterize the data contents? Use SQL and the psql **quantile** extension to look at ranges, with and without the zeroes.

SELECT min(intersection_density_sqkm), quantile(intersection_density_sqkm,ARRAY[0.02,0.25,0.5,0.75,0.92]), max(intersection_density_sqkm) FROM uf_singleparts.ba_intersection_density_sqkm

-[ RECORD 1 ]----------------------------------------------------------------------------- min | 0.0 quantile | {0.0, 0.0, 0.683937...,3.191709...,25.604519...} max | 116.269430...

Psql extension `quantile`

takes as arguments a column name and an ARRAY for *N* positional elements by percentage, *e.g. above*

#### How Many Gridcells Have Non-zero Data ?

select count(*) from ba_intersection_density_sqkm; count => 814439 select count(*) from ba_intersection_density_sqkm where intersection_density_sqkm <> 0; count => 587504

Select stats on non-zero data

SELECT min(intersection_density_sqkm), quantile(intersection_density_sqkm,ARRAY[0.02,0.25,0.5,0.75,0.92]), max(intersection_density_sqkm) FROM uf_singleparts.ba_intersection_density_sqkm where intersection_density_sqkm <> 0;

-[ RECORD 1 ]----------------------------------------------------------------------------- min | 0.227979... quantile | {0.227979...,0.455958...,1.367875...,7.751295...,31.461139...} max | 116.269430...

and, what does the high-end of the range look like ? Use SQL for a quick visual inspection for either outliers or smooth transitions:

SELECT intersection_density_sqkm FROM ba_intersection_density_sqkm ORDER BY intersection_density_sqkm desc limit 12; intersection_density_sqkm --------------------------- 116.2694300518134736 115.5854922279792768 115.3575129533678764 115.1295336787564760 114.9015544041450756 114.9015544041450756 114.4455958549222792 113.7616580310880824 112.6217616580310892 112.6217616580310892 112.1658031088082884 112.1658031088082884

So, recall that a natural log *e* of 1.0 is 0; a natural log of 116 is slightly over 4.75; a natural log of a number less than 1 is a negative number. To simplify the range for visualization, add a float column called *data*, set the new *data* column to the natural log of *(intersection_density_sqkm + 1);* use a simple multiply-then-divide technique to limit the precision to two digits (screenshot from an **IPython Notebook** session using **psycopg2**).

select quantile(data,ARRAY[0.02,0.25,0.5,0.75,0.92]) from ba_intersection_density_sqkm; { 0, 0, 0.52, 1.43, 3.28 }

SELECT min(data), quantile(data,ARRAY[0.02,0.25,0.5,0.75,0.92]), max(data) FROM ba_intersection_density_sqkm WHERE data <> 0; min | quantile | max ------+----------------------------+------ 0.21 | {0.21,0.38,0.86,2.17,3.48} | 4.76 (1 row)

#### Final Results in GeoServer 2.5 CSS styler:

ps- *a full sequential scan on this table takes about four seconds, on a Western Digital Black Label 1TB disk, ext4 filesystem, on Linux.
*