
2013 Nottingham Workshops Begin

[image: live7_ukF]

Earthscapes USPS

[image: earthscapes_kit_1]

Data Characterization and the Live

People may already know about the OSGeo Live project. It’s a great base as a VM since a) it is stable and very well tested, and b) it has a lot of software installed, but in a way that is transparent through install scripts, so customization is as straightforward as it gets.

[image: some_7_pkgs_image]

I was faced with a bit of “data mining” prep, so I pulled out the OSGeo Live and set up a VM. I use VirtualBox and its GUI on my desktop i5 quite a bit. I made a new VM, gave it two cores and 1 GB of RAM, chose bridged networking instead of NAT, and pointed the CD storage at a nice new Live .iso.
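I used the GUI, but for reference the same VM can be scripted with VBoxManage; a rough sketch (the VM name, host NIC and .iso filename are placeholders):

VBoxManage createvm --name "osgeolive" --register
VBoxManage modifyvm "osgeolive" --cpus 2 --memory 1024 --nic1 bridged --bridgeadapter1 eth0
VBoxManage storagectl "osgeolive" --name "IDE" --add ide
VBoxManage storageattach "osgeolive" --storagectl "IDE" --port 0 --device 0 --type dvddrive --medium osgeo-live.iso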

There are quite a few PostgreSQL 9.1 pl-language extensions packaged for Ubuntu, but apparently not much for 9.2 (precise, raring and the Xubuntu precise Live all use 9.1), and of course little for 9.3 as of yet.

Since a rockin’ R environment is already pre-built on the OSGeo Live, I didn’t have to think about that. I installed this package and used my formula from a previous post. Don’t forget to set Postgres to use more RAM (this is the generic formula I use to set System V memory limits on Linux in /etc/sysctl.conf):


#!/bin/bash
mem_bytes=`awk '/MemTotal:/ { printf "%0.f",$2 * 1024}' /proc/meminfo`
mem_max=`expr $mem_bytes / 2`
page_size=`getconf PAGE_SIZE`
shmall=`expr $mem_bytes / $page_size`
echo \# Maximum shared segment size in bytes
echo kernel.shmmax = $mem_max
echo \# Maximum number of shared memory segments in pages
echo kernel.shmall = $shmall
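To apply it, I just append the script’s output to /etc/sysctl.conf and reload; something like this (the script name shmsetup.sh is only a placeholder):

./shmsetup.sh | sudo tee -a /etc/sysctl.conf
sudo sysctl -p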



## /etc/postgresql/9.1/main/postgresql.conf

# - Connection Settings -
.....

max_connections = 24 # (change requires restart)

.....

# - Memory -

shared_buffers = 240MB # min 128kB
# (change requires restart)
temp_buffers = 16MB # min 800kB

work_mem = 160MB # min 64kB
maintenance_work_mem = 128MB # min 1MB
max_stack_depth = 3MB # min 100kB


But after waiting 160 seconds for some descriptive stats on a column of 900,000 floats, a dozen times over, I went searching for better performance.


SELECT max(a.res),avg(a.res),median(a.res),min(a.res),sd(a.res)
FROM
( SELECT (pop_hs_diploma_pct *100)::integer as res
  FROM my_excellent_adventure_table
  WHERE pop_hs_diploma_pct > 0.01 ) a;

 max |         avg         | median | min |        sd        
-----+---------------------+--------+-----+------------------
  63 | 12.5609578591567519 |     12 |   1 | 6.88020337480763

SnowMan pointed out a short rant from a few years ago, but actually, PL/R is still slow!

Then I found this post and the corresponding project. I built and installed it, and it worked just fine. Thx InterWeb!
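For the record, the build was the usual PGXS routine; a rough sketch, assuming the project is Tomas Vondra’s quantile extension and the PostgreSQL 9.1 dev headers are installed (mydb is a placeholder database):

git clone https://github.com/tvondra/quantile.git
cd quantile
make
sudo make install
psql -d mydb -c "CREATE EXTENSION quantile;"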


--
SELECT
  min(pop_age65_up::integer),
  quantile( pop_age65_up, ARRAY[0.25, 0.5, 0.75]),
  avg(pop_age65_up::integer),
  max(pop_age65_up::integer)
FROM
  my_excellent_adventure_table
WHERE pop_age65_up > 1;

 min |      quantile          |   avg  | max  
-----+------------------------+--------+------
   1 | {2.4556,4.7930,8.5264} | 8.6491 | 3878

AmpCamp 3 – HDFS


As described in a previous post, I used Cloudera .debs to install Hadoop/HDFS on an Ubuntu 12.04 ‘Precise’ single node. Now, to put some data into the HDFS system and use it. (Note: this did not work in a 32-bit VM.)

The Hadoop/HDFS install consisted of two steps: obtain and install the cdh4-repository .deb, which enables a suite of other packages (perhaps auto-magically kept up to date); then use those packages to install the features you want, roughly as sketched below.
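A sketch of the two steps, assuming the ‘precise’ amd64 one-click repository package (the exact .deb name and URL may differ for your release):

## step 1: add the Cloudera CDH4 repository
wget http://archive.cloudera.com/cdh4/one-click-install/precise/amd64/cdh4-repository_1.0_all.deb
sudo dpkg -i cdh4-repository_1.0_all.deb
sudo apt-get update

## step 2: install the pieces you want, e.g. a single-node HDFS
sudo apt-get install hadoop-hdfs-namenode hadoop-hdfs-secondarynamenode hadoop-hdfs-datanode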

apt-get update shows the Cloudera repository in the list.

Hit http://archive.cloudera.com precise-cdh4 Release.gpg
Hit http://archive.cloudera.com precise-cdh4 Release
Hit http://archive.cloudera.com precise-cdh4/contrib Sources
Hit http://archive.cloudera.com precise-cdh4/contrib amd64 Packages
Ign http://archive.cloudera.com precise-cdh4/contrib TranslationIndex
Ign http://archive.cloudera.com precise-cdh4/contrib Translation-en_US
Ign http://archive.cloudera.com precise-cdh4/contrib Translation-en

Packages look like this:
[image: CDH4-all-pkgs]

Some quick reading shows I have at least two choices for easily interfacing with HDFS. One is the Hue suite, and another is an HDFS FUSE mount. More options are listed on this MountableHDFS page.

Cloudera supplies an HDFS Fuse mount with their system. Instructions on how to use the FUSE extension are here.
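In short, the FUSE route looks something like this; a sketch assuming the CDH4 hadoop-hdfs-fuse package (mount point and namenode host:port are placeholders):

sudo apt-get install hadoop-hdfs-fuse
sudo mkdir -p /mnt/hdfs
sudo hadoop-fuse-dfs dfs://localhost:8020 /mnt/hdfs
ls /mnt/hdfs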

When HDFS is running with this system (with or without FUSE), you can view a web interface at port 50070. A port list is here.

After mucking with the environment a bit, I was getting an error when trying to start the secondarynamenode, saying that the namenode address was invalid:

Exception in thread "main" java.lang.IllegalArgumentException: Invalid URI for NameNode address (check fs.defaultFS):

So, investigating:

update-alternatives --get-selections | grep hadoop
hadoop-conf auto /etc/hadoop/conf.empty

less /etc/hadoop/conf.empty/core-site.xml

core-site.xml turned out to be missing the actual namenode address, which was not apparent in the doc I was reading. So, edit /etc/hadoop/conf.empty/core-site.xml and add name and value property XML tags. The system starts up after this change.

<configuration>
 <property>
  <name>fs.default.name</name>
  <value>hdfs://127.0.1.1:50070</value>
 </property>
</configuration>
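With the namenode answering, putting some data in is the usual routine; a quick sketch (the user name dbb and the file name are placeholders):

## make an HDFS home directory, then copy a local file in
sudo -u hdfs hadoop fs -mkdir -p /user/dbb
sudo -u hdfs hadoop fs -chown dbb /user/dbb
hadoop fs -put some_local_file.csv /user/dbb/
hadoop fs -ls /user/dbb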

AmpCamp 3 – The Stack


The Berkeley Data Analytics Stack (BDAS) was the central subject at AmpCamp 3.

Spark is the core of the stack. It has been recently adopted for incubation as an Apache Project. True to form for a fast-moving OSS project, we actually used the 0.8.0 git repo version, rather than the 0.7.3 that you will find on the Apache site.


Spark is built on Scala, which runs in a Java VM. All the lab exercises used OpenJDK 7, and they ran very, very well. (We moved all of the OSGeo Live Java projects to OpenJDK 7 last December or so, and have never looked back.)

Among the components of BDAS, I got the sense that Spark and Mesos were the most stable, with Shark (the SQL interface) and PySpark (the Python interface) also strong. The in-memory filesystem Tachyon was presented as clearly being in the early stages, as were the very interesting GraphX, MLBase and BlinkDB.

All the lab exercises were executed in an Amazon Web-Services (AWS) cluster. Thanks to excellent tech and teamwork, the labs flowed smoothly both days. However, I was interested in bringing up the BDAS stack on my own machines. Here is what I did to do that:

* make a working directory, I called mine amplab3

* I found that java -version showed 1.6 even though I had installed 1.7. So I used these two steps to change it:

dbb@i7c:~/amplab3$ update-java-alternatives -l
java-1.6.0-openjdk-amd64 1061 /usr/lib/jvm/java-1.6.0-openjdk-amd64
java-1.7.0-openjdk-amd64 1051 /usr/lib/jvm/java-1.7.0-openjdk-amd64

sudo apt-get install icedtea-7-plugin
sudo update-java-alternatives -s java-1.7.0-openjdk-amd64

* Install Hadoop from Cloudera (CDH4) via .debs, following these instructions

* verify the Hadoop install by starting it

sudo service hadoop-hdfs-namenode start
sudo service hadoop-hdfs-secondarynamenode start

* I did not use an Ubuntu package for Scala. I used OpenJDK 7 from the repository, then got Scala here and unpacked scala-2.9.3.tgz in the working directory.

* git clone https://github.com/mesos/spark.git

* cd spark; cp conf/spark-env.sh.template conf/spark-env.sh

* add SCALA_HOME=/path/to/scala-2.9.3 to conf/spark-env.sh

* sbt/sbt assembly

You should be ready to go! Of course, I did a few extra things, and I took many detours along the way, but that was about it. In particular, I found that the hadoop data directory in this setup is under /var/lib/hadoop-*, so I created an alias for that and pointed to a fast local disk with plenty of free space. You may use the conf files in /etc to get the same effect, but I did not want to change things at a fine level yet.
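A quick smoke test, assuming the 0.8-era layout where the spark-shell script sits at the top of the repo:

cd spark
MASTER=local[2] ./spark-shell
## at the scala> prompt, something like
##   sc.parallelize(1 to 100000).filter(_ % 2 == 0).count()
## should come back with 50000 almost immediately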

Once spark-shell or pyspark is running, you can hit port 4040 with a web browser for an interface to the engine (it was 3030 in earlier releases). Other useful docs:

http://spark.incubator.apache.org/docs/latest/configuration.html
http://ampcamp.berkeley.edu/exercises-strata-conf-2013/index.html

AmpCamp 3 – Intro


I cannot say enough about AmpCamp 3, a two-day workshop at UC Berkeley that has just wrapped up. The Berkeley AMP Lab (Algorithms, Machines and People) put on another great open source community-building event, with state-of-the-art tech, precision execution, and the sort of fun that comes from a job well done.

Many interesting people, conversations, and of course, big data. This post is perhaps the first of several, because it will take several to even touch on all the angles presented at this intense and very productive camp.

Patterns of communication and publication are emerging and strengthening all the time on the ‘net, and the AMP Lab is in fine form, using a myriad of tech to reach broader audiences and encourage participation.

The AMP Lab Github repository

PostGIS 2.1 on Debian 7


Newly released PostGIS 2.1 rocks. How can I install it on a fresh Debian 7 ‘wheezy’ to try it out?
Here is a recipe that worked for me:


Final Results
debian_version 7.1
PostgreSQL 9.1.9 on x86_64-unknown-linux-gnu
-----------------------------------------------
myrole=# select postgis_full_version();

POSTGIS="2.1.1dev r11823"
GEOS="3.4.2dev-CAPI-1.8.0 r3897"
PROJ="Rel. 4.7.1, 23 September 2009"
GDAL="GDAL 1.11dev, released 2013/04/13"
LIBXML="2.8.0"
LIBJSON="UNKNOWN"
RASTER


Shell Steps

df -h
sudo apt-get update
mkdir srcs_local
cd srcs_local
sudo apt-get install build-essential flex autoconf libtool gdb gfortran subversion htop
sudo apt-get install postgresql-plpython-9.1 postgresql-server-dev-all python-psycopg2
sudo apt-get install xsltproc libproj-dev python2.7-dev

## GEOS
svn co http://svn.osgeo.org/geos/branches/3.4 geos_34_branch
cd geos_34_branch/
./autogen.sh
./configure
make -j6
sudo make install
ls -l /usr/local/lib
cd ..

## GDAL
sudo apt-get install libcurl4-openssl-dev libgcrypt11-doc
sudo apt-get install libpcre3-dev libjson0-dev libxerces-c-dev libexpat1-dev libxml2-dev
sudo apt-get install libkml-dev

svn co http://svn.osgeo.org/gdal/trunk gdal_trunk
cd gdal_trunk/gdal
./autogen.sh
./configure --with-python --with-curl
make -j6
sudo make install
sudo ldconfig # for good luck

cd ../autotest/
##-- test out the new gdal
python run_all.py

## PostGIS
cd ../..
svn co http://svn.osgeo.org/postgis/branches/2.1 postgis_21_branch
cd postgis_21_branch/

./autogen.sh
./configure
make -j4
sudo make install
sudo ldconfig # for good luck

sudo su - postgres
psql
##  in psql:  create role myrole superuser createdb login;
##  then \q to leave psql
exit   ## out of the linux user postgres shell
createdb myrole   ## back as your own user
psql

sudo service postgresql restart
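One step the shell log above does not show: with PostGIS 2.x the extension still has to be enabled in the target database, presumably something like

psql -d myrole -c "CREATE EXTENSION postgis;"
psql -d myrole -c "select postgis_full_version();"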


The libraries in /usr/local/lib are symbolically linked. It ends up looking like this:
----------------------

libgdal.a
libgdal.la
libgdal.so -> libgdal.so.1.17.0
libgdal.so.1 -> libgdal.so.1.17.0
libgdal.so.1.17.0
libgeos-3.4.2dev.so
libgeos.a
libgeos_c.a
libgeos_c.la
libgeos_c.so -> libgeos_c.so.1.8.0
libgeos_c.so.1 -> libgeos_c.so.1.8.0
libgeos_c.so.1.8.0
libgeos.la
libgeos.so -> libgeos-3.4.2dev.so
liblwgeom-2.1.1dev.so
liblwgeom.a
liblwgeom.la
liblwgeom.so -> liblwgeom-2.1.1dev.so

Sizing Up California – Homes


I live in California, and it’s a big place. I was reviewing some records regarding residential homes. Using some simple stats, I broke the records into partitioned tables in PostgreSQL by county, and then let the rest fall into a general bucket. There is no one correct answer for this kind of analysis setup, but here is one possibility.

[image: CA_tables0]
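For the curious, partitioning in PostgreSQL 9.x means inheritance plus CHECK constraints; a minimal sketch with hypothetical table and column names (routing inserts into the right child still needs a trigger, or direct inserts into the children):

psql -d homes <<'SQL'
-- parent table
CREATE TABLE res_homes (
  id     serial,
  county text,
  price  numeric
);

-- one child per high-volume county, constrained so the planner can prune
CREATE TABLE res_homes_los_angeles (
  CHECK ( county = 'Los Angeles' )
) INHERITS (res_homes);

-- everything else falls into a general bucket
CREATE TABLE res_homes_other () INHERITS (res_homes);
SQL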

PostgreSQL banter on ‘record pseudotype’

dmg_83 Okay. I notice that if you create a table like CREATE TABLE tbl (x int, y int) and do SELECT (tbl).x, (tbl).y FROM tbl WHERE tbl = ROW(1,2) you’ll indeed get back 1, 2 (assuming such a row exists). That seems a bit weird, since the left side of the expression “tbl = ROW(1,2)” does not seem to be the same as the right, for we can do (tbl).x, but can’t do (ROW(1,2)).x; tbl contains some additional data that ROW(1,2) does not

dmg_83 I’d like to understand the exact rules PG uses for these. I’m guessing it’s similar to something like how python compares a namedtuple and an ordinary tuple (disregarding the names and treating both as tuples for comparison sakes)

RhodiumToad row(1,2) actually uses column names “f1” and “f2” internally but that’s quite difficult to determine (and usually impossible to make use of)
  you can’t do (row(1,2)).f1 for example
  anyway, tbl = row(1,2) works because there’s a generic (record = record) operator defined
  and any composite type can implicitly cast to the “record” pseudotype
  the operator actually looks up the structure of the records at runtime (cached between calls from a single call site)
  the number and types of fields has to be acceptable, but the column names are ignored
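The behaviour is easy to reproduce; a quick sketch of the experiment (temp table, same names as the chat):

psql <<'SQL'
CREATE TEMP TABLE tbl (x int, y int);
INSERT INTO tbl VALUES (1, 2), (3, 4);

-- whole-row comparison goes through the generic (record = record) operator
SELECT (tbl).x, (tbl).y FROM tbl WHERE tbl = ROW(1, 2);

-- field access on a bare ROW() constructor fails; its columns are f1, f2 internally
-- SELECT (ROW(1,2)).f1;   -- ERROR
SQL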

PostgreSQL banter on ‘with’

RhodiumToad with foo(x) as (values ('blah blah blah')) insert into wherever select x from foo;
merlinm or, with vars as (select 'a' as a, 2 as b) insert into foo values (select a from vars), select b from vars);
merlinm oops, missing a parent there
merlinm gah
RhodiumToad with vars(a,b) as (values ('a', 2)) …
merlinm ah, touche
darkblue_b with .. is a ‘materialized subselect’ under the hood ?
RhodiumToad yes
strk not really materialized, but yes
RhodiumToad it is materialized
strk doesn’t persist, is what I mean
strk not after the statement
darkblue_b ah – it’s a temporary table… perhaps
RhodiumToad no
darkblue_b ok – well I don’t want to waste your time.. just grasping at things.
strk darkblue_b: only exists during the single statement
darkblue_b yes I read that
RhodiumToad it exists only within the scope of the query and it can’t have indexes, but it is materialized in that it is evaluated at most once (lazily)
darkblue_b writes that down
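Putting RhodiumToad’s one-liner into runnable form (the names foo and wherever are from the chat; the setup is mine):

psql <<'SQL'
CREATE TEMP TABLE wherever (x text);

-- the CTE is evaluated at most once, then feeds the INSERT
WITH foo(x) AS (VALUES ('blah blah blah'))
INSERT INTO wherever SELECT x FROM foo;

SELECT * FROM wherever;
SQL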