Data Characterization and the Live

People may already know about the OSGeo Live project. It’s a great base for a VM since a) it is stable and very well tested, and b) it has a lot of software installed, but in a way that is transparent through install scripts, so customization is as straightforward as it gets.

[image: a sampling of the installed packages]

I was faced with a bit of “data mining” prep, so I pulled out the OSGeo Live and set up a VM. I use VirtualBox and its GUI quite a bit on my desktop i5. I made a new VM, gave it two cores and 1 GB of RAM, chose bridged networking instead of NAT, and pointed the CD storage at a fresh Live .iso.
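
The same VM setup can be scripted instead of clicked through; here is a rough VBoxManage equivalent (a sketch, not what I actually ran — the VM name, bridge adapter, and .iso path are placeholders):

VBoxManage createvm --name osgeolive --register
VBoxManage modifyvm osgeolive --cpus 2 --memory 1024 --nic1 bridged --bridgeadapter1 eth0
VBoxManage storagectl osgeolive --name IDE --add ide
VBoxManage storageattach osgeolive --storagectl IDE --port 0 --device 0 \
  --type dvddrive --medium osgeo-live.iso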

There are quite a few PostgreSQL 9.1 PL-language extensions packaged for Ubuntu, but apparently not many for 9.2 (precise, raring, and the Xubuntu precise Live all use 9.1), and of course little for 9.3 as of yet.

Since a rockin’ R environment is already pre-built on the OSGeo Live, I didn’t have to think about that. I installed this package and used my formula from a previous post. Don’t forget to set Postgres to use more RAM; this is the generic formula I use to set System V memory limits on Linux in /etc/sysctl.conf:


#!/bin/bash
# Emit kernel.shmmax / kernel.shmall settings for /etc/sysctl.conf,
# sized from this machine's physical RAM
mem_bytes=`awk '/MemTotal:/ { printf "%0.f",$2 * 1024}' /proc/meminfo`
mem_max=`expr $mem_bytes / 2`          # one segment may use up to half of RAM
page_size=`getconf PAGE_SIZE`
shmall=`expr $mem_bytes / $page_size`  # total shared memory, counted in pages
echo \# Maximum shared segment size in bytes
echo kernel.shmmax = $mem_max
echo \# Maximum total shared memory in pages
echo kernel.shmall = $shmall
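
Applying it might look like this (assuming the script above is saved as pg_shm.sh): append the output to /etc/sysctl.conf, then reload.

./pg_shm.sh | sudo tee -a /etc/sysctl.conf
sudo sysctl -p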



## /etc/postgresql/9.1/main/postgresql.conf

# - Connection Settings -
.....

max_connections = 24 # (change requires restart)

.....

# - Memory -

shared_buffers = 240MB # min 128kB
# (change requires restart)
temp_buffers = 16MB # min 800kB

work_mem = 160MB # min 64kB
maintenance_work_mem = 128MB # min 1MB
max_stack_depth = 3MB # min 100kB
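
The settings marked “change requires restart” only take effect after a restart, e.g. on Ubuntu:

sudo service postgresql restart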


But after waiting 160 seconds for some descriptive stats on a column of 900,000 floats, a dozen times over, I went searching for better performance.


SELECT max(a.res),avg(a.res),median(a.res),min(a.res),sd(a.res)
FROM
( SELECT (pop_hs_diploma_pct *100)::integer as res
  FROM my_excellent_adventure_table
  WHERE pop_hs_diploma_pct > 0.01 ) a;

 max |         avg         | median | min |        sd        
-----+---------------------+--------+-----+------------------
  63 | 12.5609578591567519 |     12 |   1 | 6.88020337480763
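
For context: median() and sd() are not built-in Postgres 9.1 aggregates. With PL/R they are typically defined along the lines of the example in the PL/R docs (sd() follows the same pattern):

CREATE OR REPLACE FUNCTION r_median(_float8)
RETURNS float8 AS 'median(arg1)' LANGUAGE 'plr';

CREATE AGGREGATE median (
  sfunc = plr_array_accum,   -- accumulate every row into an R vector
  basetype = float8,
  stype = _float8,
  finalfunc = r_median       -- hand the whole vector to R at the end
);

Accumulating 900,000 rows into an array and shipping it to R goes a long way toward explaining those 160 seconds.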

SnowMan pointed out a short rant from a few years ago, but actually, PL/R is still slow!

Then I found this post and the corresponding project. I built it from source and it worked just fine. Thx InterWeb!
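
For a typical PGXS-based extension the build is short; a sketch (the extension name, quantile, is my guess from the aggregate used below, and your database name will differ):

make
sudo make install
psql -d mydb -c 'CREATE EXTENSION quantile;'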


--
SELECT
  min(pop_age65_up::integer),
  quantile( pop_age65_up, ARRAY[0.25, 0.5, 0.75]),
  avg(pop_age65_up::integer),
  max(pop_age65_up::integer)
FROM
  my_excellent_adventure_table
WHERE pop_age65_up > 1;

 min |      quantile          |   avg  | max  
-----+------------------------+--------+------
   1 | {2.4556,4.7930,8.5264} | 8.6491 | 3878