{"id":358,"date":"2012-03-31T14:43:00","date_gmt":"2012-03-31T21:43:00","guid":{"rendered":"http:\/\/blog.light42.com\/wordpress\/?p=358"},"modified":"2013-10-25T16:25:32","modified_gmt":"2013-10-25T23:25:32","slug":"variable-buffers-in-postgis","status":"publish","type":"post","link":"http:\/\/blog.light42.com\/wordpress\/?p=358","title":{"rendered":"Variable Buffers in PostGIS"},"content":{"rendered":"<p><img decoding=\"async\" align=\"right\" src=\"wp-content\/uploads\/2012\/04\/grid8.png\"\/> Some problems in PostGIS are naturally parallelizable. Many problems involving a <strong>grid<\/strong> as a base or reference data set fall into this category; when each grid cell is visited and some operation is done there, each grid cell is uniquely referenced in its own set of calculations, and invariant, then it is likely that the problem is naturally parallelizable. If the computation could be done on more than one cell at a time, so much the better. <\/p>\n<h4>Search Radii<\/h4>\n<p>There are many variations on this problem, and many approaches to solving any given one. Consider the following: a database table representing a regular grid is &#8220;loaded&#8221; directly with data per cell. The table definition might look like:<\/p>\n<blockquote>\n<pre>\r\ncreate table data_buffers_variable (\r\n    id150 integer PRIMARY KEY, \r\n    prod_mil_meters double precision,\r\n    wkb_geometry geometry,\r\n    acres_red_total double precision,\r\n    acres_yellow_total double precision,\r\n    qt1 double precision,\r\n    qt2 double precision,\r\n    qt3 double precision\r\n)\r\n<\/pre>\n<\/blockquote>\n<p>&nbsp;<br \/>\nEach cell will be visited in turn, and reference a number that is a distance <code>prod_mil_meters<\/code>, which after conversion to units in the current projection, is used as a radius. The calculation is then to collect all cells within the determined radius, and sum three quantities.. <code>qt1_total, qt2_total, qt3_total<\/code>. Here is a query on a single cell <code>cur_id<\/code> in a radius of <code>search_radius<\/code>, and a self-join like so:<\/p>\n<blockquote>\n<pre>\r\nSELECT\r\n      b.id150, search_radius, b.wkb_geometry,\r\n  \tsum(r.qt1) as qt1_total,\r\n  \tsum(r.qt2) as qt2_total,\r\n  \tsum(r.qt3) as qt3_total\r\nFROM \r\ndata_buffers_variable b\r\n  INNER JOIN data_buffers_variable r\r\n  ON ST_DWithin(b.wkb_geometry,r.wkb_geometry, search_radius)  AND\r\n  b.id150 = cur_id\r\n  \r\nGROUP BY b.id150, b.wkb_geometry\r\n<\/pre>\n<\/blockquote>\n<p>&nbsp;<\/p>\n<h4>Many Hands Make Light Work<\/h4>\n<p>We might define a function call in <strong>PostgreSQL<\/strong> as follows:<\/p>\n<blockquote>\n<pre>\r\nCREATE OR REPLACE FUNCTION get_buffered_vals_out(\r\n  in_id_grid int, \r\n  in_distance float,\r\n  in_geom geometry,\r\n  OUT id_grid int, \r\n  OUT search_distance float,\r\n  OUT wkb_geometry geometry,\r\n  OUT qt1_total float,\r\n  OUT qt2_total float,\r\n  OUT qt3_total float\r\n  ) \r\nAS \r\n$$\r\n  select $1 as id_grid, \r\n  $2 as search_distance,\r\n  $3 as wkb_geometry, \r\n  \tsum(r.qt1) as qt1_total,\r\n  \tsum(r.qt2) as qt2_total,\r\n  \tsum(r.qt3) as qt3_total\r\n  FROM data_buffers_variable r WHERE st_dwithin( $3, r.wkb_geometry, $2);\r\n$$ \r\nCOST 10000\r\nlanguage SQL STABLE strict;\r\n<\/pre>\n<\/blockquote>\n<p>&nbsp;<br \/>\nNow to run this in a parallel fashion. As is detailed in <a href=\"http:\/\/blog.light42.com\/wordpress\/?p=23\">this post<\/a>, do the following in python:  decide on a number of <strong>threads<\/strong> to execute, divide the work into buckets with an upper and lower id_grid as bounds, define the basic unit of work as a <strong>threading.Thread<\/strong> subclass, and pass a parameter dict into a <strong>thread_queue<\/strong> for each thread. Each threaded unit of work will visit each cell within its bounds, call <code>get_buffered_vals_out()<\/code> and store the results.<\/p>\n<p> <em>note: although it looks more impressive, <strong>double precision<\/strong> is in fact a synonym for <strong>float<\/strong> in postgres<\/em><\/p>\n<p>The syntax for INSERTing the result of a function call like that into another table in one statement can be rather challenging. Here is some useful starter syntax:<\/p>\n<blockquote>\n<pre>\r\nselect (f).* from (\r\n    select get_buffered_vals_out( ogc_fid, dist, wkb_geometry) as f\r\n    from data_buffers_variable \r\n    where ogc_fid >= 1656690 and ogc_fid < = 1656690\r\n    offset 0) s;\r\n<\/pre>\n<\/blockquote>\n<p>  Why the offset 0 ?!? Its is supposedly a no-op, yet in one test, the runtime is 6x faster with it than without it. More answers yield more questions&#8230;<\/p>\n<p>addendum-<\/p>\n<pre>\r\nwhunt\thow do you reference a value from the select clause in your where clause?\r\nRhodiumToad\twhunt: you don't\r\nwhunt\tas in: SELECT my_func(bar) as baz from foo where baz is not null;\r\nRhodiumToad\twhunt: not allowed per spec, though you can add another nesting level\r\nwhunt\ti could do it with a with\r\nRhodiumToad\twhunt: select * from (select myfunc(...) as baz from ...) s where baz is not null;\r\nRhodiumToad\twhunt: note, that'll often call the function twice; to prevent that, add offset 0 just before the )\r\nRhodiumToad\twhunt: i.e. select * from (select ... offset 0) s where ...\r\nxocolatl\tRhodiumToad: the offset 0 trick is something I think needs improvement we shouldn't have to do that\r\nRhodiumToad\tI agree\r\nRhodiumToad\tbut it's not at all trivial\r\nxocolatl\toh, I don't doubt\r\nwhunt\twhy does the function get run twice?\r\nRhodiumToad\twhunt: because the planner will flatten out the subselect in most cases, so it's equivalent to doing\r\nwhunt\ti imagine the inner query executes the function and creates a new table that the outter query queries\r\nwhunt\toh\r\nRhodiumToad\tselect myfunc(...) from ... where myfunc(...) is not null;\r\nwhunt\tlol - that's exactly what i wanted to avoid\r\nxocolatl\twhunt: then use the offset 0 trick\r\nRhodiumToad\tthe offset 0 is a general technique to nobble the planner and stop it from flattening subqueries\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Some problems in PostGIS are naturally parallelizable. Many problems involving a grid as a base or reference data set fall into this category; when each grid cell is visited and some operation is done there, each grid cell is uniquely referenced in its own set of calculations, and invariant, then it is likely that the [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[],"_links":{"self":[{"href":"http:\/\/blog.light42.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/358"}],"collection":[{"href":"http:\/\/blog.light42.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/blog.light42.com\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/blog.light42.com\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/blog.light42.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=358"}],"version-history":[{"count":61,"href":"http:\/\/blog.light42.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/358\/revisions"}],"predecessor-version":[{"id":1468,"href":"http:\/\/blog.light42.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/358\/revisions\/1468"}],"wp:attachment":[{"href":"http:\/\/blog.light42.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=358"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/blog.light42.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=358"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/blog.light42.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=358"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}