Data Characterization and the Live

People may already know about the OSGeo Live project. It’s a great base for a VM since a) it is stable and very well tested, and b) it has a lot of software installed, but in a way that is transparent through install scripts, so customization is as straightforward as it gets.

[image: a sampling of the installed packages]

I was faced with a bit of “data mining” prep, so I pulled out the OSGeo Live and set up a VM. I use VirtualBox and its GUI quite a bit on my desktop i5. I made a new VM, gave it two cores and 1 GB of RAM, chose bridged networking instead of NAT, and pointed the CD storage at a fresh Live .iso.
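
The same VM setup can be scripted instead of clicked through; here is a rough VBoxManage equivalent (a sketch, not what I actually ran — the VM name, bridge adapter, and .iso path are placeholders):

VBoxManage createvm --name osgeolive --register
VBoxManage modifyvm osgeolive --cpus 2 --memory 1024 --nic1 bridged --bridgeadapter1 eth0
VBoxManage storagectl osgeolive --name IDE --add ide
VBoxManage storageattach osgeolive --storagectl IDE --port 0 --device 0 \
  --type dvddrive --medium osgeo-live.iso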

There are quite a few PostgreSQL 9.1 PL-language extensions packaged for Ubuntu, but apparently not many for 9.2 (precise, raring, and the Xubuntu precise Live all use 9.1), and of course little for 9.3 as of yet.

Since a rockin’ R environment is already pre-built on the OSGeo Live, I didn’t have to think about that. I installed this package and used my formula from a previous post. Don’t forget to set Postgres to use more RAM; this is the generic formula I use to set System V memory limits on Linux in /etc/sysctl.conf:


#!/bin/bash
# Emit kernel.shmmax / kernel.shmall settings for /etc/sysctl.conf,
# sized from this machine's physical RAM
mem_bytes=`awk '/MemTotal:/ { printf "%0.f",$2 * 1024}' /proc/meminfo`
mem_max=`expr $mem_bytes / 2`          # one segment may use up to half of RAM
page_size=`getconf PAGE_SIZE`
shmall=`expr $mem_bytes / $page_size`  # total shared memory, counted in pages
echo \# Maximum shared segment size in bytes
echo kernel.shmmax = $mem_max
echo \# Maximum total shared memory in pages
echo kernel.shmall = $shmall
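
Applying it might look like this (assuming the script above is saved as pg_shm.sh): append the output to /etc/sysctl.conf, then reload.

./pg_shm.sh | sudo tee -a /etc/sysctl.conf
sudo sysctl -p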



## /etc/postgresql/9.1/main/postgresql.conf

# - Connection Settings -
.....

max_connections = 24 # (change requires restart)

.....

# - Memory -

shared_buffers = 240MB # min 128kB
# (change requires restart)
temp_buffers = 16MB # min 800kB

work_mem = 160MB # min 64kB
maintenance_work_mem = 128MB # min 1MB
max_stack_depth = 3MB # min 100kB
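
The settings marked “change requires restart” only take effect after a restart, e.g. on Ubuntu:

sudo service postgresql restart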


But after waiting 160 seconds for some descriptive stats on a column of 900,000 floats, a dozen times over, I went searching for better performance.


SELECT max(a.res),avg(a.res),median(a.res),min(a.res),sd(a.res)
FROM
( SELECT (pop_hs_diploma_pct *100)::integer as res
  FROM my_excellent_adventure_table
  WHERE pop_hs_diploma_pct > 0.01 ) a;

 max |         avg         | median | min |        sd        
-----+---------------------+--------+-----+------------------
  63 | 12.5609578591567519 |     12 |   1 | 6.88020337480763
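
For context: median() and sd() are not built-in Postgres 9.1 aggregates. With PL/R they are typically defined along the lines of the example in the PL/R docs (sd() follows the same pattern):

CREATE OR REPLACE FUNCTION r_median(_float8)
RETURNS float8 AS 'median(arg1)' LANGUAGE 'plr';

CREATE AGGREGATE median (
  sfunc = plr_array_accum,   -- accumulate every row into an R vector
  basetype = float8,
  stype = _float8,
  finalfunc = r_median       -- hand the whole vector to R at the end
);

Accumulating 900,000 rows into an array and shipping it to R goes a long way toward explaining those 160 seconds.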

SnowMan pointed out a short rant from a few years ago, but actually, PL/R is still slow!

Then I found this post and the corresponding project. I built it from source and it worked just fine. Thx InterWeb!
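
For a typical PGXS-based extension the build is short; a sketch (the extension name, quantile, is my guess from the aggregate used below, and your database name will differ):

make
sudo make install
psql -d mydb -c 'CREATE EXTENSION quantile;'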


--
SELECT
  min(pop_age65_up::integer),
  quantile( pop_age65_up, ARRAY[0.25, 0.5, 0.75]),
  avg(pop_age65_up::integer),
  max(pop_age65_up::integer)
FROM
  my_excellent_adventure_table
WHERE pop_age65_up > 1;

 min |      quantile          |   avg  | max  
-----+------------------------+--------+------
   1 | {2.4556,4.7930,8.5264} | 8.6491 | 3878