For those who have been following the Climate Change story over the years, this satellite imagery tells the story quite vividly; no modelling uncertainty involved.
AmpCamp 2014
BDAS, the Berkeley Data Analytics Stack
Suffice it to say I participated online in roughly twelve hours of lecture and lab on November 20 and 21, 2014, at AmpCamp 5 (I also attended one in Fall 2012), with an emphasis on Python, IPython Notebook, and SQL.
Once again this year, the camp mechanics went very smoothly: readable and succinct online exercises, and good Spark docs. Spark's Python interface, called pyspark, is advancing, although some interfaces may not be available to Python yet; Spark SQL appears to be usable.
To set up on my own Linux box, I unzipped the following files:
ampcamp5-usb.zip
ampcamp-pipelines.zip
training-downloads.zip
The resulting directories provided a pre-built Spark 1.1:
Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_65)
The lab exercises are almost all available in both Scala and Python. Tools for the first labs:
$SPARK_HOME/bin/spark-shell
$SPARK_HOME/bin/pyspark
and for extra practice
$SPARK_HOME/bin/spark-submit
$SPARK_HOME/bin/run-example
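From my notes, a first-lab style check inside the pyspark shell looks roughly like the sketch below; the input path is a placeholder rather than the actual lab dataset:

# pyspark provides `sc`, a SparkContext, at startup
lines = sc.textFile("data/pagecounts")      # placeholder path; any text file works
print(lines.count())                        # total number of records
for rec in lines.take(5):                   # peek at the first few lines
    print(rec)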
IPython Notebook
An online teaching assistant (TA) suggested a command line to launch the Notebook – here are my notes:
##-- TA suggestion
IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark --master "local[4]"
##-- a server already setup with a Notebook, options
--matplotlib inline --ip=192.168.1.200 --no-browser --port=8888
##-- COMBINE
IPYTHON_OPTS="notebook --matplotlib inline --ip=192.168.1.200 --no-browser --port=8888" $SPARK_HOME/bin/pyspark --master "local[4]"
The IPython Notebook worked! Lots of conveniences, interactivity, and viz potential were immediately available against the pyspark environment. I created several Notebooks in short order to test and explore, for example SQL.
The SQL exercise reads data in a format new to me, called Parquet.
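In pyspark, the Spark SQL flow as I noted it looks roughly like this; the Parquet path and table name are placeholders for the lab data:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)                          # sc comes from the pyspark shell
pages = sqlContext.parquetFile("data/wiki.parquet")  # load a Parquet file as a SchemaRDD
pages.registerTempTable("pages")                     # make it visible to SQL
print(sqlContext.sql("SELECT COUNT(*) FROM pages").collect())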
Part 1.2
After rest and recuperation, I wanted to try Python on the almost-ready Spark 1.2 branch. It turned out to build and run easily. First, get the Spark code:
https://github.com/apache/spark/tree/branch-1.2
Make sure Maven is installed on your system, then run
./make-distribution.sh
Afterwards, I set $SPARK_HOME to this directory and launched IPython Notebook again. All the examples and experiments I had built worked without modification. Success!
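A quick way to confirm that a new session is actually running the fresh build (sc is the SparkContext that pyspark creates at startup):

print(sc.version)     # expect a 1.2.x string when $SPARK_HOME points at the new build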
Other Links
http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
http://spark-summit.org/2014/training
https://github.com/amplab-extras
http://www.planetscala.com/
experimental: https://github.com/ooyala/spark-jobserver
JSONb First Looks
PostgreSQL 9.4 beta 3 on Linux
-- Simple JSON/JSONb compare, by Oleg
-- json:  text storage, as is
-- jsonb: whitespace dissolved, no duplicate keys (last in wins), keys sorted
SELECT '{"c":0, "a":2, "a":1}'::json,
       '{"c":0, "a":2, "a":1}'::jsonb;

          json           |      jsonb
-------------------------+------------------
 {"c":0, "a":2, "a":1}   | {"a": 1, "c": 0}
(1 row)

-- emit JSON text from Census corpus --
SELECT json_agg(row_to_json(p)) from (
    select gid, fullname, 'feat' as ftype
    from tiger_data.ca_featnames
    where fullname ~ '^Az'
) as p;

 json_agg  (formatting added)
-------------------------------------------
 [
  {"gid":5048,"fullname":"Aztec Way","ftype":"feat"},
  {"gid":9682,"fullname":"Azalea Ct","ftype":"feat"},
  ...
  {"gid":4504601,"fullname":"Azure Pl","ftype":"feat"}
 ]

##-- return a dict with metadata fields, and an array of dicts
select row_to_json(a.*) from (
    select 'census_acs_2013' as origin,
           'ca'              as state,
           'ca_featnames'    as table,
           ( SELECT json_agg(row_to_json(p)) from (
                 select gid, fullname, 'feat' as ftype
                 from tiger_data.ca_featnames
                 where fullname ~ '^Az'
             ) as p
           ) as rows
) a;

 row_to_json  (formatting added)
---------------------------------------------------------
 {
  "origin":"census_acs_2013",
  "state":"ca",
  "table":"ca_featnames",
  "rows": [
   {"gid":5048,"fullname":"Aztec Way","ftype":"feat"},
   ...
   {"gid":4519032,"fullname":"Azalea Way","ftype":"feat"}
  ]
 }
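Pulling those aggregates into Python is straightforward; a minimal psycopg2 sketch follows. The connection string is an assumption, and recent psycopg2 versions adapt json columns to Python objects automatically:

import psycopg2
conn = psycopg2.connect("dbname=census")       # assumed database name
cur = conn.cursor()
cur.execute("""
    SELECT json_agg(row_to_json(p))
    FROM ( SELECT gid, fullname, 'feat' AS ftype
           FROM tiger_data.ca_featnames
           WHERE fullname ~ '^Az' ) AS p;
""")
features = cur.fetchone()[0]                   # a Python list of dicts
print(len(features), features[0])
cur.close()
conn.close()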
GeoPandas and NaturalEarth2 tryout
Things are looking good with GeoPandas.
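The tryout itself was simple; a minimal sketch, assuming a locally downloaded Natural Earth layer (the file name is a placeholder, not necessarily the layer shown here):

import geopandas as gpd
world = gpd.read_file("ne_110m_admin_0_countries.shp")   # placeholder Natural Earth layer
print(world.crs)          # coordinate reference system as read from the file
print(world.head())       # attribute table preview
world.plot()              # quick matplotlib rendering of the geometries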
Census Tract and 150 Meter Grids Compare
In this screenshot of central Silicon Valley, Census tracts have been combined with a constraints layer and then cut with a 150 meter grid in the EPSG:3310 (California Albers) projection. Using imputation tables and external sources, values are then computed for each grid cell. The result is a statistically defensible, higher-resolution, and readily applicable set of grid cells.
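A rough sketch of the tract-by-grid cut in GeoPandas terms, not the production workflow itself; the file names and the pre-built 150 meter grid layer are assumptions, and the constraints step is omitted:

import geopandas as gpd
tracts = gpd.read_file("ca_tracts.shp").to_crs(epsg=3310)   # reproject to California Albers
grid   = gpd.read_file("grid_150m.shp")                     # assumed to already be in EPSG:3310
cells  = gpd.overlay(tracts, grid, how="intersection")      # cut tracts by grid cells
print(len(cells), "grid-cell pieces")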
ACS 5yr Viz Processing
A systematic way to choose, extract, and visualize data from the massive American Community Survey 5-Year census product is a challenge. I have written Python code to ingest the raw inputs into tables, and a small relational engine to handle the verbose naming.
An extraction and visualization process is underway… something like the following:
0) bulk tables in all geographies for all states
1a) define a batch of tables to extract by table_id
1b) choose a state or territory
1c) choose a geographic summary level
for example:
STATE  California (FIPS 06)
TABLE  ('B01001', 'SEX BY AGE', 'Age-Sex', 'Universe: Total population')
GEO    Tracts (Summary level 140 - State-County-Census Tract)
Once the choice is made, SQL + Python is executed, either as a standalone program on Linux or in the IPython Notebook. The code creates a working schema in PostgreSQL, copies table subsets into the new schema, and JOINs them with TIGER geometry to produce spatial data. A preliminary, working version looks something like this:
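In outline, with psycopg2 and illustrative schema, table, and column names (the real names in my pipeline differ), that step looks roughly like this:

import psycopg2
conn = psycopg2.connect("dbname=census")          # assumed connection
cur = conn.cursor()
cur.execute("CREATE SCHEMA IF NOT EXISTS acs_work;")
cur.execute("""
    CREATE TABLE acs_work.b01001_ca_tracts AS
    SELECT s.*, g.geom
    FROM   acs5yr.b01001 s                        -- extracted ACS table subset (name assumed)
    JOIN   tiger_data.ca_tracts g                 -- TIGER tract geometry (name assumed)
      ON   g.geoid = s.geoid                      -- join key is an assumption
    WHERE  s.sumlevel = '140';                    -- State-County-Census Tract
""")
conn.commit()
cur.close()
conn.close()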
graphical browsing of the results in QGis:
The geographic summary levels are defined in the ACS_2008-2012_SF_Tech_Doc:
Appendix F: ACS 5-year Summary Levels/Components for Detailed Tables
Numeric Stats on Bay Area Intersection Counts
In preparing for an upcoming Datathon, a column of data in PostgreSQL numeric format needed formatting for presentation. The "Intersection Count" column, intersection_density_sqkm, is a count of street intersections per unit area, a quick way to measure density of the built environment. The table of grid cells (covering the nine-county San Francisco Bay Area) that the column comes from contains roughly 814,000 cells. How to quickly characterize the data contents? Use SQL and the PostgreSQL quantile extension to look at ranges, with and without the zeroes.
SELECT min(intersection_density_sqkm),
       quantile(intersection_density_sqkm, ARRAY[0.02,0.25,0.5,0.75,0.92]),
       max(intersection_density_sqkm)
FROM uf_singleparts.ba_intersection_density_sqkm;
-[ RECORD 1 ]------------------------------------------------------
min      | 0.0
quantile | {0.0, 0.0, 0.683937..., 3.191709..., 25.604519...}
max      | 116.269430...
The PostgreSQL extension quantile takes as arguments a column name and an ARRAY of N percentile positions, as in the example above.
How Many Gridcells Have Non-zero Data?
select count(*) from ba_intersection_density_sqkm;
  count => 814439

select count(*) from ba_intersection_density_sqkm
 where intersection_density_sqkm <> 0;
  count => 587504
Select stats on non-zero data
SELECT min(intersection_density_sqkm),
       quantile(intersection_density_sqkm, ARRAY[0.02,0.25,0.5,0.75,0.92]),
       max(intersection_density_sqkm)
FROM uf_singleparts.ba_intersection_density_sqkm
WHERE intersection_density_sqkm <> 0;
-[ RECORD 1 ]------------------------------------------------------
min      | 0.227979...
quantile | {0.227979..., 0.455958..., 1.367875..., 7.751295..., 31.461139...}
max      | 116.269430...
And what does the high end of the range look like? Use SQL for a quick visual inspection, checking for either outliers or smooth transitions:
SELECT intersection_density_sqkm
FROM ba_intersection_density_sqkm
ORDER BY intersection_density_sqkm DESC LIMIT 12;

 intersection_density_sqkm
---------------------------
 116.2694300518134736
 115.5854922279792768
 115.3575129533678764
 115.1295336787564760
 114.9015544041450756
 114.9015544041450756
 114.4455958549222792
 113.7616580310880824
 112.6217616580310892
 112.6217616580310892
 112.1658031088082884
 112.1658031088082884
So, recall that the natural log of 1.0 is 0; the natural log of 116 is slightly over 4.75; and the natural log of a number less than 1 is negative. To simplify the range for visualization, add a float column called data and set it to the natural log of (intersection_density_sqkm + 1); use a simple multiply-then-divide technique to limit the precision to two digits (screenshot from an IPython Notebook session using psycopg2).
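The Notebook step itself, reconstructed as a psycopg2 sketch; the connection string is an assumption, and round() is my guess at the exact multiply-then-divide rounding used:

import psycopg2
conn = psycopg2.connect("dbname=uf")              # assumed database name
cur = conn.cursor()
cur.execute("ALTER TABLE ba_intersection_density_sqkm ADD COLUMN data float;")
cur.execute("""
    UPDATE ba_intersection_density_sqkm
       SET data = round( ln(intersection_density_sqkm + 1) * 100 ) / 100;
""")
conn.commit()
cur.close()
conn.close()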
select quantile(data, ARRAY[0.02,0.25,0.5,0.75,0.92])
from ba_intersection_density_sqkm;

 { 0, 0, 0.52, 1.43, 3.28 }
SELECT min(data),
       quantile(data, ARRAY[0.02,0.25,0.5,0.75,0.92]),
       max(data)
FROM ba_intersection_density_sqkm
WHERE data <> 0;

 min  |          quantile          | max
------+----------------------------+------
 0.21 | {0.21,0.38,0.86,2.17,3.48} | 4.76
(1 row)
Final Results in GeoServer 2.5 CSS styler:
PS: a full sequential scan on this table takes about four seconds, on a Western Digital Black Label 1TB disk, ext4 filesystem, on Linux.
Debian 7 Repos Added
A small email went by today, bringing big news…
From: Jerome Villeneuve Larouche
To: "ubuntu@lists.osgeo.org"
Subject: [Ubuntu] DebianGIS Repo

Hello everyone,

This is a small message to tell you that the DebianGIS repo for Debian Wheezy is up and that every package on it is up-to-date! Everything is built against stable so you don't need to add any unstable repo to use the latest GIS packages.

To add it on your machine, edit "/etc/apt/sources.list" and add the line

deb http://debian.mapgears.com/repos/apt/debian/ wheezy main

You should also add the public key here
http://debian.mapgears.com/repos/apt/debian/debgis.gpg.key

Enjoy, as always if you have any questions about the repo, be it this one or UbuntuGIS, send a message on the mailing list!

PS: I'm also currently updating UbuntuGIS to Saucy!
Worldwide Forestry Inventory Published, Nov13
Dozens of major news outlets posted articles yesterday profiling a paper, published along with extensive data in the journal 'Science', by a team led by Matthew Hansen, a remote sensing scientist at the University of Maryland.
‘Published by Hansen, Potapov, Moore, Hancher et al. * Powered by Google Earth Engine‘
NLCD 06 Landcover, San Francisco Bay Area
A colleague pointed out the National Land Cover Database (NLCD) imagery today, which is not new, but it is useful. Here is a simple treatment of the San Francisco Bay Area, with city center markers matching the red urban coloring used in the base map. Click for the larger image; you can see Lake Tahoe to the east and, unmarked, Yosemite National Park almost due south of it. ('06' in the blog post title refers to both the publication year of this base map, 2006, and the FIPS code for the State of California.)