This visualization shows two quantitative dimensions at once, on a 150-meter square grid over a study area in California. The base color is a land-use category, and the gray shading is a percentage of developable acreage. The number on each grid cell, drawn using QGIS, encodes the same value as the gray shade.
Stage of Summarization
Here grid cells are classified into three types by land use, and county parcels are split along grid-cell boundaries, in preparation for computing a result stored in a database table with one row per grid cell.
Note that roadways, water bodies and some public lands are correctly absent from the net calculations; and since the area of a grid cell is a well-defined constant, acreages and ratios are trivially computed.
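Since the cell area is a constant, the conversions are one-liners. A quick sketch in Python (my numbers, using the 150 m cell from above):

# Constant-area grid-cell arithmetic for a 150 m x 150 m cell.
SQM_PER_ACRE = 4046.8564224                # square meters per acre
CELL_SQM     = 150.0 * 150.0               # 22,500 m^2 per cell
CELL_ACRES   = CELL_SQM / SQM_PER_ACRE     # ~5.56 acres per cell

def developable_acres(net_developable_sqm):
    """Net developable area of a cell, in acres."""
    return net_developable_sqm / SQM_PER_ACRE

def developable_ratio(net_developable_sqm):
    """Fraction of the cell that is developable (the gray shading)."""
    return net_developable_sqm / CELL_SQM

print(round(CELL_ACRES, 2))         # 5.56
print(developable_ratio(11250.0))   # 0.5 -- half the cell is developable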
Local Blockgroups 2010
As a way to learn about the geographic constructs in U.S. Census data, I built a few visualizations of the area where I live.
Elementary Constructs
The heavy green lines are census tracts, with their tract IDs shown in large type. The pink lines are census block groups. Census blocks are shown, along with the centroid of each. This data comes directly from PostGIS and is styled in QGIS.
This next image shows the San Francisco Bay Area, Napa Valley to the north, the greater metropolitan area of the state capital Sacramento east to the southern tip of Lake Tahoe, a portion of the Central Valley down to a corner of Yosemite National Park, then back across Silicon Valley to the Pacific coast. Each county is divided into census county subdivisions, labeled by name in yellow. Census places are shown in orange. The heavy pink lines are major highways from OpenStreetMap. Images are served via GeoServer with SLD styles and again displayed in QGIS.
Census data is not the same data set as TIGER (Topologically Integrated Geographic Encoding and Referencing) data, but the two are directly matchable via embedded key fields; see the TIGER documentation.
Here is the somewhat backwards wording from TGRSHP2013_TechDoc saying that blocks are the indivisible component of all census geometries:
Blocks never cross county or census tract boundaries. They do not cross the boundaries of any entity for which the Census Bureau tabulates data, including American Indian, Alaska Native, and Native Hawaiian areas, congressional districts, county subdivisions, places, state legislative districts, urbanized areas, urban clusters, school districts, voting districts, or ZIP Code Tabulation Areas (ZCTAs) or some special administrative areas such as military installations, and national parks and monuments.
Topology of Hex Cells
A year ago I modified code from here to make a set of about thirty million hex cells across the continental United States, using the projection from the National Map, EPSG:2163.
Today, topology wizard strk was building a set of hex cells to test conversion to topology in PostGIS 2.0. The question arose: would the hex cells I had built with another method convert cleanly to a topology, without specifying a tolerance distance?
So I tried converting a subset of the US cells into a topology.
First step, pick a few thousand contiguous hex cells:
-- create one Alameda County multipolygon
DROP TABLE IF EXISTS alameda_tracts CASCADE;
CREATE TABLE alameda_tracts AS
  SELECT 1 AS ogc_fid, st_union(wkb_geometry) AS wkb_geometry
    FROM census_2010_tracts
   WHERE census_2010_tracts.countyfp10 = '001';
ALTER TABLE alameda_tracts ADD PRIMARY KEY ( ogc_fid );

-- transform and compare county to hex cells
DROP TABLE IF EXISTS some_hex_cells CASCADE;
CREATE TABLE some_hex_cells AS
  SELECT DISTINCT ON (hex_grid_cells.id, hex_grid_cells.wkb_geometry)
         hex_grid_cells.id, hex_grid_cells.wkb_geometry
    FROM public.alameda_tracts, public.hex_grid_cells
   WHERE ST_Intersects( hex_grid_cells.wkb_geometry,
                        ST_Transform(alameda_tracts.wkb_geometry, 2163) );

ALTER TABLE some_hex_cells ADD PRIMARY KEY (id);
CREATE INDEX shc_geom_idx ON some_hex_cells USING GIST (wkb_geometry);

SELECT count(*) FROM some_hex_cells;
 count
-------
  5272
(1 row)
An EXPLAIN ANALYZE of the hex-cell collection query above showed the Postgres planner doing a sequential scan of the table being transformed, which is one record long, and an index scan on the 30 million hex cells – perfect! Next step: create a PostGIS 2.0 topology in EPSG:2163:
SELECT CreateTopology('hexagon_topo', 2163);
-- use DropTopology('hexagon_topo') to start over..
Now, do the conversion to topology, time it, and validate:
\timing
SELECT TopoGeo_AddPolygon( 'hexagon_topo', h.wkb_geometry )
  FROM some_hex_cells h;
-- thuban1:  Time: 636187 ms
-- i7-960+:  Time: 239265 ms

SELECT TopologySummary('hexagon_topo');
                        topologysummary
---------------------------------------------------------------
 Topology hexagon_topo (2), SRID 2163, precision 0             +
 10608 nodes, 15879 edges, 5272 faces, 0 topogeoms in 0 layers+
It strikes me as odd that the summary lists “0 topogeoms”, but if I recall correctly that means there are no *hierarchical* topologies. Since TopologySummary() returned OK, perhaps my one topology is valid. But ValidateTopology() returns zero rows?? I guess there is more to learn about topology, still…
SELECT ValidateTopology('hexagon_topo');
 validatetopology
------------------
(0 rows)
PS – just noticed this post
PostGIS Geocoder behavior
I became interested in the practical behavior of the PostGIS geocoder when I started doing statistical profiling of the results of geocoding 100,000 standard addresses in my county, and got results that looked like what you see in the image at right. That inquiry morphed into this short paper (PDF).
Why statistical profiling? I was experimenting with geocoding a subset of Alameda County parcel records as an example data set, after sitting in on most of a graduate data-mining class at UC Berkeley in Spring 2011. In data mining, any set of attributes that can be reduced to a ‘distance’ measure can be characterized in various ways. With geo-coordinates, distance is obvious, but lexical and numerical relationships can be summarized as a ‘distance’ too. Examples of all three occur in geocoding and its data sets.
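As a toy illustration of those three kinds of ‘distance’ (my example, not taken from the paper):

# Three kinds of 'distance' that show up in geocoding work:
# spatial, lexical, and numeric.
import math

def coord_distance(p, q):
    """Planar distance between two projected coordinates, in map units."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def lexical_distance(a, b):
    """Levenshtein edit distance between two strings, e.g. street names."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def numeric_distance(n, m):
    """Distance between two house numbers."""
    return abs(n - m)

print(coord_distance((0.0, 0.0), (3.0, 4.0)))     # 5.0
print(lexical_distance('MacArthur', 'McArthur'))  # 1
print(numeric_distance(1520, 1536))               # 16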
ST_DWithin on a Grid
As part of a larger inquiry, I wanted to know how ST_DWithin() behaved on a regular grid – that is, what is the relationship between the input radius and the number of grid cells that could be selected in the result? The manual implies that it is an intersects relationship, not a contains relationship, and in this case the center is not a point but an entire grid cell.
Empirically I came up with a formula that does a good enough job of determining the maximum number of grid cells that could be selected (the logic for the inner case of a small radius is rough, but that is almost inconsequential). Here is a SQL query as executed in OpenJUMP, and a screenshot:
SELECT st_asbinary(r.wkb_geometry)
FROM grid150m_county r
JOIN grid150m_county a
  ON ST_DWithin( r.wkb_geometry, a.wkb_geometry, tRadius )
 AND a.ogc_fid = tCellID
GROUP BY r.wkb_geometry
A possible Python function to get the maximum count of grid cells:
##-----------------------------------------------------------
import math

def get_gridcell_count( inRadius, inCellSize=150.0 ):
    ## for a given radius in meters, get the max number of 150m grid cells
    ## that could be included in the search results.
    ## Empirically determined to be a factorial-like function,
    ## adding a new ring at each new grid-cell-size increment.
    ##
    ## inRadius is in meters
    if inRadius < inCellSize:
        return 1
    tGridCellsSum = 9    ## the 3x3 block around the center cell
    tR = int(math.ceil( inRadius / inCellSize ))
    for tInd in range(tR):
        ## each new ring contributes roughly its circumference in cells
        tGridCellsSum += int(math.ceil( 2 * (tInd + 2) * math.pi ))
    return tGridCellsSum
California in Green and Rust
I set out to create a map showing urban sprawl in the Central Valley of California.
GeoServer, OpenLayers and PostGIS are the technologies in this demo.
(The Linux server itself is being relocated.)
gdal_retile retooled
Recently the task of combining and tiling some large GeoTIFFs came up, and so began an investigation of adding threading support to gdal_retile.py.
Unfortunately, it is widely known that Python is not the best environment for multithreaded work, due to the global interpreter lock (GIL). However, while building threading support for some routine PostGIS tasks recently, I found that dispatching a series of computationally intensive PostGIS tasks via psycopg2 in Python threads can work quite well. In that case the bulk of the work is done by a large, external engine (PostGIS), and Python just acts as the brain to the brawn. I started on this gdal_retile.py experiment with the idea that the CPU-intensive parts of a gdal_retile.py job, perhaps color-space conversion and JPEG compression, might tip the balance such that the Python threading itself was no longer the bottleneck.
I wanted to find a straightforward threading pattern in Python suitable for breaking up gdal_retile.py’s inner loops.
A Threading Pattern in Python
After some experimentation, the program sketched below worked well enough:
- Thread_Tiny is a threading.Thread subclass, whose __init__ takes two Queue objects as arguments
- thread_queue maintains the threads themselves
- results_queue is where each thread pushes its result before being disposed
- tD is a dict containing anything the main program wants to pass to a new Thread
- put() an args dict into thread_queue; a live thread will pick it up. If all threads are busy working, put() waits internally until a slot becomes available before adding the queue element
- the app waits for all threads to complete
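A minimal sketch of the pattern (my reconstruction, not the post’s exact program; do_work() is a hypothetical stand-in for the real payload):

import threading
import queue   # Queue.Queue in Python 2, as used in the post

def do_work(tD):
    """Hypothetical payload; anything CPU- or IO-heavy goes here."""
    return tD['value'] * 2

class Thread_Tiny(threading.Thread):
    """Worker: repeatedly pull an args dict, push the result."""
    def __init__(self, thread_queue, results_queue):
        threading.Thread.__init__(self)
        self.thread_queue = thread_queue
        self.results_queue = results_queue
        self.daemon = True                 # don't block interpreter exit

    def run(self):
        while True:
            tD = self.thread_queue.get()   # blocks until work arrives
            try:
                self.results_queue.put(do_work(tD))
            finally:
                self.thread_queue.task_done()

n_threads = 4
thread_queue = queue.Queue(maxsize=n_threads)  # one slot per live thread
results_queue = queue.Queue()

for _ in range(n_threads):
    Thread_Tiny(thread_queue, results_queue).start()

for i in range(100):
    thread_queue.put({'value': i})   # blocks while all slots are busy

thread_queue.join()                  # app waits for all work to complete
while not results_queue.empty():
    print(results_queue.get())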
This construct seemed sufficiently complete to be a template for breaking up a moderately complex utility like gdal_retile.py.
Reworking gdal_retile.py
I found that there were really two places where an inner-loop repetition was being done: first where the input is split into tiles, as TileResult_0, and second when building a plane of a pyramid, as TileResult_N. (In gdal_retile.py either operation can be omitted.)
Solution: a utility function Count_Cores() returns the number of cores in the Linux operating environment. Define a thread_queue and a result_queue – each an instance of Queue.Queue – for each case. The twist is that a thread_queue is created with only one slot per desired active thread; this way the thread_queue runs Threads in an orderly fashion. The number of active Threads is limited to “cores minus two”, or one, by limiting the number of slots available in a new thread_queue. Instead of possibly clogging execution with an arbitrarily large set of new Threads at initialization time, only N threads can run at any given time.
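In outline, that setup might look like this (Count_Cores() here is a stand-in built on the standard library; the original read the Linux environment directly):

import multiprocessing
import queue   # Queue.Queue in Python 2

def Count_Cores():
    """Number of CPU cores visible to the OS."""
    return multiprocessing.cpu_count()

n_slots = max(Count_Cores() - 2, 1)          # "cores minus two", floor of one
thread_queue = queue.Queue(maxsize=n_slots)  # one slot per active thread
result_queue = queue.Queue()                 # unbounded; results accumulate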
Next, a basic unit of work has to be defined that can be turned over to Threads. In tileImage() there is a call to createTile(), which takes as arguments: a local object called mosaic_info; offsets X,Y; height and width; a tile name; and an object representing an OGR datasource. That datasource is open and has state, and is therefore not shareable. However, I discovered that inside createTile() the only thing the OGR datasource is used for is a single, write-only append of a feature describing the output tile’s BBOX and filename. What if a result_queue replaced the non-shareable datasource passed into createTile(), and each thread simply pushed its result into the result_queue? When all threads have completed, iterate and retrieve each thread’s result, adding each to the OGR datasource as a feature. Here is some code…
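A minimal sketch of that shape (create_tile and the job dicts here are illustrative stand-ins, not gdal_retile.py’s actual internals):

import threading
import queue   # Queue.Queue in Python 2

def create_tile(job, result_queue):
    """Cut one tile; push (tilename, bbox) instead of writing a feature
    to the shared OGR datasource."""
    bbox = (job['x'], job['y'], job['x'] + job['w'], job['y'] + job['h'])
    ## ... the GDAL read/write of the tile itself would happen here ...
    result_queue.put((job['tilename'], bbox))

result_queue = queue.Queue()
jobs = [{'tilename': 'tile_%d.tif' % i, 'x': i * 1024, 'y': 0,
         'w': 1024, 'h': 1024} for i in range(4)]

threads = [threading.Thread(target=create_tile, args=(j, result_queue))
           for j in jobs]
for t in threads:
    t.start()
for t in threads:
    t.join()

## only the main thread touches the OGR datasource, after all work is done
while not result_queue.empty():
    tilename, bbox = result_queue.get()
    ## ogr_ds.CreateFeature(...) with tilename and bbox would go here
    print(tilename, bbox)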
A Second Comparison
Fortunately, while discussing this on #gdal, Robert Coup (rcoup) happened to be online. In his organization there had been another attempt at making gdal_retile.py parallel, but the results were not satisfying. rcoup tossed the code over, and I ran that as well.
Some Results?
time gdal_retile.py -s_srs EPSG:900913 -co "TILED=YES" -co "COMPRESS=JPEG" -ps 1024 1024 -r "bilinear" -co PHOTOMETRIC=YCBCR -targetDir out_dir_stock Fresno_city_900913.tif
In the end, the version rcoup sent, the threaded version described here, and the stock version all ran against a 77,000 x 84,000 pixel GeoTIFF (18 GB, three bands) in about the same 50 minutes each on a local machine (each producing a matching 6,309 tiles). I suspect that something in the GDAL Python interfaces (aside from the GIL) is preventing real parallelism from kicking in, though there are a lot of corners for problems to hide in even this small example.
Long ago, hobu had this to say
Variable Buffers in PostGIS
Some problems in PostGIS are naturally parallelizable. Many problems involving a grid as a base or reference data set fall into this category: when each grid cell is visited and some operation is done there, and each grid cell is uniquely referenced in its own set of calculations and is invariant, then the problem is likely naturally parallelizable. If the computation can be done on more than one cell at a time, so much the better.
Search Radii
There are many variations on this problem, and many approaches to solving any given one. Consider the following: a database table representing a regular grid is “loaded” directly with data per cell. The table definition might look like:
create table data_buffers_variable (
    id150               integer PRIMARY KEY,
    prod_mil_meters     double precision,
    wkb_geometry        geometry,
    acres_red_total     double precision,
    acres_yellow_total  double precision,
    qt1                 double precision,
    qt2                 double precision,
    qt3                 double precision
)
Each cell will be visited in turn, and references a number, prod_mil_meters, that after conversion to units of the current projection is used as a radius. The calculation is then to collect all cells within the determined radius and sum three quantities: qt1_total, qt2_total, qt3_total. Here is such a query on a single cell cur_id with a radius of search_radius, written as a self-join:
SELECT b.id150, search_radius, b.wkb_geometry,
       sum(r.qt1) as qt1_total,
       sum(r.qt2) as qt2_total,
       sum(r.qt3) as qt3_total
FROM data_buffers_variable b
INNER JOIN data_buffers_variable r
   ON ST_DWithin( b.wkb_geometry, r.wkb_geometry, search_radius )
  AND b.id150 = cur_id
GROUP BY b.id150, b.wkb_geometry
Many Hands Make Light Work
We might define a function in PostgreSQL as follows:
CREATE OR REPLACE FUNCTION get_buffered_vals_out(
    in_id_grid   int,
    in_distance  float,
    in_geom      geometry,
    OUT id_grid          int,
    OUT search_distance  float,
    OUT wkb_geometry     geometry,
    OUT qt1_total        float,
    OUT qt2_total        float,
    OUT qt3_total        float )
AS $$
    select $1 as id_grid,
           $2 as search_distance,
           $3 as wkb_geometry,
           sum(r.qt1) as qt1_total,
           sum(r.qt2) as qt2_total,
           sum(r.qt3) as qt3_total
    FROM data_buffers_variable r
    WHERE st_dwithin( $3, r.wkb_geometry, $2 );
$$
COST 10000
LANGUAGE SQL STABLE STRICT;
Now to run this in a parallel fashion. As detailed in this post, do the following in Python: decide on a number of threads to execute; divide the work into buckets, with an upper and lower id_grid as bounds; define the basic unit of work as a threading.Thread subclass; and pass a parameter dict into a thread_queue for each thread. Each threaded unit of work will visit each cell within its bounds, call get_buffered_vals_out(), and store the results.
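One thread’s unit of work might look like the following sketch (my reconstruction: the target table buffered_vals_out is hypothetical, and the INSERT uses the starter syntax and OFFSET 0 trick discussed below):

import psycopg2

def process_bucket(dsn, start_id, end_id):
    """Visit every grid cell in [start_id, end_id] and store the sums."""
    conn = psycopg2.connect(dsn)
    curs = conn.cursor()
    curs.execute("""
        INSERT INTO buffered_vals_out
        SELECT (f).*
        FROM ( SELECT get_buffered_vals_out( id150, prod_mil_meters,
                                             wkb_geometry ) AS f
               FROM data_buffers_variable
               WHERE id150 >= %s AND id150 <= %s
               OFFSET 0 ) s;""", (start_id, end_id))
    conn.commit()
    conn.close()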
Note: although it looks more impressive, double precision is in fact a synonym for float in Postgres.
The syntax for INSERTing the result of a function call like that into another table in one statement can be rather challenging. Here is some useful starter syntax:
select (f).*
from ( select get_buffered_vals_out( ogc_fid, dist, wkb_geometry ) as f
       from data_buffers_variable
       where ogc_fid >= 1656690 and ogc_fid <= 1656690
       offset 0 ) s;
Why the offset 0?!? It is supposedly a no-op, yet in one test the runtime was 6x faster with it than without. More answers yield more questions…
Addendum –
whunt: how do you reference a value from the select clause in your where clause?
RhodiumToad: whunt: you don't
whunt: as in: SELECT my_func(bar) as baz from foo where baz is not null;
RhodiumToad: whunt: not allowed per spec, though you can add another nesting level
whunt: i could do it with a with
RhodiumToad: whunt: select * from (select myfunc(...) as baz from ...) s where baz is not null;
RhodiumToad: whunt: note, that'll often call the function twice; to prevent that, add offset 0 just before the )
RhodiumToad: whunt: i.e. select * from (select ... offset 0) s where ...
xocolatl: RhodiumToad: the offset 0 trick is something I think needs improvement
xocolatl: we shouldn't have to do that
RhodiumToad: I agree
RhodiumToad: but it's not at all trivial
xocolatl: oh, I don't doubt
whunt: why does the function get run twice?
RhodiumToad: whunt: because the planner will flatten out the subselect in most cases, so it's equivalent to doing
RhodiumToad: select myfunc(...) from ... where myfunc(...) is not null;
whunt: i imagine the inner query executes the function and creates a new table that the outter query queries
whunt: oh
whunt: lol - that's exactly what i wanted to avoid
xocolatl: whunt: then use the offset 0 trick
RhodiumToad: the offset 0 is a general technique to nobble the planner and stop it from flattening subqueries
PostGIS subtraction
Threading in Python and calling out to PostGIS in each thread can work well. Here is an example of using Python’s Queue and threading.Thread to break up a large spatial workload into pieces. It turns out that many spatial data problems are naturally parallelizable, and this is one of them.
Per the PostGIS manual, subtraction of one geometry from another, ST_Difference( geomA, geomB ), might be described as

GeometryA - ST_Intersection(A,B)
A table of geometries can be processed easily if either geomA or geomB is a constant. But what if there are many thousands of geomA, and hundreds of thousands of geomB? Our example is land type classifications (geomA) and California county land parcels (geomB).
Here is a look at a finished result:
Areas colored green, dark red or gray are the sum of the area of parcel property in this county, classified into three categories (greenfield, constrained, urban). The remaining yellowish areas are largely right-of-way (roads, paths, freeways, etc.). The analysis is done by bringing together two very large input tables: a land classification of the county divided into three classes (14,000 polygons in this example), and the set of parcel properties (~800,000 records).
Start by subtracting all parcels p that touch a given classification polygon l, as the inner diff query:
INSERT into landtypes_minus_parcels
SELECT l.id_lt,
       l.landtype as landtype_seamless_orig,
       l.other_class,
       CASE when st_difference( l.wkb_geometry, st_union(p.wkb_geometry) ) is null
            then l.wkb_geometry
            else st_difference( l.wkb_geometry, st_union(p.wkb_geometry) )
       end as tgeom
FROM %s l
INNER JOIN %s p ON st_intersects( l.wkb_geometry, p.wkb_geometry )
WHERE l.id_lt = %s
GROUP BY l.id_lt, l.wkb_geometry;
Get a list of primary keys in Python from the classification table l:
tSQL = ''' select id_lt from %s order by id_lt; ''' % (targ_landtypes)
try:
    gCurs.execute( tSQL )
    gDB.commit()
except Exception as E:
    print(str(E))
    raise

idsA = []
idsTA = gCurs.fetchall()
for tElem in idsTA:
    idsA.append( tElem[0] )
A psycopg2 query cursor returns a list of tuples; for convenience, take the one and only result per tuple and build a simple list from them. In this analysis, we take the length of the list of primary keys of classification polygons and divide by the desired number of threads, giving the number of classification polygons to be handled per thread. Since we have an ordered list of primary keys, it is straightforward to pick two keys at a time – an upper and a lower pkey – and pass them into each thread as dictionary arguments.
tBaseDiv = len( idsA ) // gThreadCount   ## integer division
tBaseOffset = 0
for i in range( gThreadCount ):
    if i == gThreadCount - 1:
        ## last bucket gets any remainder, too
        tEndInd = len( idsA ) - 1
    else:
        tEndInd = tBaseOffset + tBaseDiv - 1
    tD = { 'start_id': idsA[ tBaseOffset ], 'end_id': idsA[ tEndInd ] }
    tBaseOffset += tBaseDiv
    queue.put( tD )
Next, a unit of work has to be described. A threading.Thread subclass could be defined as follows: upon running, retrieve a parameter dict from the thread queue, and from it the upper and lower primary-key bounds for this thread; for each primary key in this range, execute the inner diff query; when completed, commit() to the database and signal that this thread is complete.
class Thread_ComputeCTWUnit(threading.Thread):
    """Threaded unit of work"""
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            ## grab the pkey bounds for this unit of work from the queue
            jobD = self.queue.get()
            tQueuedJobStr = '''
                select id_lt from %s
                 where id_lt >= %s and id_lt <= %s ''' % (
                 targ_landtypes, jobD['start_id'], jobD['end_id'] )
            ##---------------------------
            ## each thread has its own psycopg2 connection
            try:
                tLocalConn = psycopg2.connect( gDSN )
            except Exception as E:
                print(str(E), ' start_id=', jobD['start_id'])
                self.queue.task_done()
                return
            tLocalCurs = tLocalConn.cursor()
            try:
                tLocalCurs.execute( tQueuedJobStr )
            except Exception as E:
                print(str(E))
            tTA_local = tLocalCurs.fetchall()
            ## run the inner diff query for each classification polygon
            for tInd in tTA_local:
                try:
                    tLocalCurs.execute( tSQL_primediff % (targ_landtypes, src_parcels, tInd[0]) )
                except Exception as E:
                    print(str(E))
                    raise
            tLocalConn.commit()
            tLocalConn.close()
            ##---------------------------
            ## signal to the queue that this job is done
            self.queue.task_done()
            return
Given 14,000 classified polygons as input, parcel records for a county, a 16-core Linux machine, and 14 threads, each thread will run 1,000 queries and then commit() back to the database. The runtime in practice is about 15%–30% of the time it would take without threading. One reason the speedup is not greater is that even though each thread has the same number of classification polygons to work on, not all polygon subtractions are created equal, so inevitably some sets of queries take longer than others.