{"id":1307,"date":"2013-09-08T13:26:40","date_gmt":"2013-09-08T20:26:40","guid":{"rendered":"http:\/\/blog.light42.com\/wordpress\/?p=1307"},"modified":"2013-09-18T17:25:51","modified_gmt":"2013-09-19T00:25:51","slug":"data-characterization-and-the-live","status":"publish","type":"post","link":"http:\/\/blog.light42.com\/wordpress\/?p=1307","title":{"rendered":"Data Characterization and the Live"},"content":{"rendered":"<p>People may already know about the <a href=\"http:\/\/live.osgeo.org\" title=\"osgeo-live\" target=\"_blank\"><strong>OSGeo Live project<\/strong><\/a>. Its a great base as a VM since <strong>a)<\/strong> it is stable and very well tested, and <strong>b)<\/strong> it has <a href=\"http:\/\/live.osgeo.org\/en\/overview\/overview.html\" title=\"osgeo live contents\" target=\"_blank\">much software installed<\/a>, but in a way that is transparent through <a href=\"http:\/\/svn.osgeo.org\/osgeo\/livedvd\/gisvm\/branches\/osgeolive_7_0\/bin\/\" title=\"install_scripts\" target=\"_blank\">install scripts<\/a>, so customization is as straightforward as it gets..<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/blog.light42.com\/wordpress\/wp-content\/uploads\/2013\/09\/some_7_pkgs_image-165x300.png\" alt=\"some_7_pkgs_image\" width=\"150\" height=\"270\" class=\"alignright size-medium wp-image-1317\" \/><\/p>\n<p>I was faced with a bit of &#8220;data mining&#8221; prep, so I pulled out the <strong>OSGeo Live<\/strong> and set up a VM. I use VirtualBox and its GUI on my desktop i5 quite a bit. I made a new VM, gave it two cores and 1G of RAM, bridged networking instead of NAT, and set the CD storage to point to a nice new <a href=\"http:\/\/sourceforge.net\/projects\/osgeo-live\/files\/7.0\/\" title=\"live-7.0\" target=\"_blank\">Live .iso<\/a>.<\/p>\n<p>There are <a href=\"http:\/\/packages.ubuntu.com\/precise\/database\/\" title=\"some precise pkgs\" target=\"_blank\">quite a few PostgreSQL 9.1 pl-language extensions as packages for Ubuntu<\/a>, but not much for 9.2 apparently, (precise, raring and the XUbuntu precise Live all use 9.1), and of course little for the 9.3 as of yet.<\/p>\n<p>Since a rockin&#8217; <strong>R<\/strong> environment is already <a href=\"http:\/\/ct.light42.com\/www_shared\/R_all_packages_70.txt\" title=\"R_list\" target=\"_blank\">pre-built on the OSGeo Live<\/a>, I didnt have to think about that.. I installed <a href=\"http:\/\/packages.ubuntu.com\/precise\/database\/postgresql-9.1-plr\" title=\"pl-r\" target=\"_blank\">this package<\/a> and used my <a href=\"http:\/\/blog.light42.com\/wordpress\/?p=712\" title=\"pl-r post\" target=\"_blank\">formula from a previous post<\/a>.  <em>Dont forget to set Postgres to use more RAM <\/em>(this is the generic formula I use to set System V memory limits on Linux in <code>\/etc\/sysctl.conf<\/code> )<\/p>\n<p><code><br \/>\n#!\/bin\/bash<br \/>\nmem_bytes=`awk '\/MemTotal:\/ { printf \"%0.f\",$2 * 1024}' \/proc\/meminfo`<br \/>\nmem_max=`expr $mem_bytes \/ 2`<br \/>\npage_size=`getconf PAGE_SIZE`<br \/>\nshmall=`expr $mem_bytes \/ $page_size`<br \/>\necho \\# Maximum shared segment size in bytes<br \/>\necho kernel.shmmax = $mem_max<br \/>\necho \\# Maximum number of shared memory segments in pages<br \/>\necho kernel.shmall = $shmall<br \/>\n<\/code><\/p>\n<hr \/>\n<p><code><br \/>\n## \/etc\/postgresql\/9.1\/main\/postgresql.conf<\/p>\n<p># - Connection Settings -<br \/>\n .....<\/p>\n<p>max_connections = 24                    # (change requires restart)<\/p>\n<p> .....<\/p>\n<p># - Memory -<\/p>\n<p>shared_buffers = 240MB                  # min 128kB<br \/>\n                                        # (change requires restart)<br \/>\ntemp_buffers = 16MB                     # min 800kB<\/p>\n<p>work_mem = 160MB                        # min 64kB<br \/>\nmaintenance_work_mem = 128MB            # min 1MB<br \/>\nmax_stack_depth = 3MB                   # min 100kB<\/p>\n<p><\/code><\/p>\n<hr \/>\n<p>But, after waiting <em>160 seconds <\/em>for some descriptive stats on a column of 900,000 floats, a dozen times.. I was searching around for better performance..<\/p>\n<p><code><br \/>\nSELECT max(a.res),avg(a.res),median(a.res),min(a.res),sd(a.res)<br \/>\nFROM<br \/>\n( SELECT (pop_hs_diploma_pct *100)::integer as res<br \/>\n&nbsp;&nbsp;FROM my_excellent_adventure_table<br \/>\n&nbsp;&nbsp;WHERE pop_hs_diploma_pct > 0.01 ) a;<\/p>\n<pre>\r\n max |         avg         | median | min |        sd        \r\n-----+---------------------+--------+-----+------------------\r\n  63 | 12.5609578591567519 |     12 |   1 | 6.88020337480763\r\n<\/pre>\n<p><\/code><\/p>\n<p><a href=\"http:\/\/www.youtube.com\/watch?v=w6zOkxU9uRY\" title=\"SnowMan rocks out\" target=\"_blank\"><strong>SnowMan<\/strong><\/a> pointed out a <a href=\"http:\/\/grokbase.com\/t\/postgresql\/pgsql-hackers\/06a62n2pe1\/array-accum-aggregate\" title=\"grokbase rocks too\" target=\"_blank\">short rant from a few years ago<\/a>, but actually, pl-R is still slow!<\/p>\n<p>Then I found <a href=\"http:\/\/www.fuzzy.cz\/en\/articles\/calculating-quantiles-and-trimmed-aggregates\/\" title=\"cz rocks quantiles\" target=\"_blank\">this post<\/a> and <a href=\"https:\/\/github.com\/tvondra\/quantile\" title=\"srcs for quantile\" target=\"_blank\">the corresponding project<\/a>. I compiled, built and it worked just fine. <strong>Thx InterWeb!<\/strong><\/p>\n<p><code><br \/>\n--<br \/>\nSELECT<br \/>\n&nbsp;&nbsp;min(pop_age65_up::integer),<br \/>\n&nbsp;&nbsp;quantile( pop_age65_up, ARRAY[0.25, 0.5, 0.75]),<br \/>\n&nbsp;&nbsp;avg(pop_age65_up::integer),<br \/>\n&nbsp;&nbsp;max(pop_age65_up::integer)<br \/>\nFROM<br \/>\n&nbsp;&nbsp;my_excellent_adventure_table<br \/>\nWHERE pop_age65_up > 1;<\/p>\n<pre>\r\n min |      quantile          |   avg  | max  \r\n-----+------------------------+--------+------\r\n   1 | {2.4556,4.7930,8.5264} | 8.6491 | 3878\r\n<\/pre>\n<p><\/code><\/p>\n","protected":false},"excerpt":{"rendered":"<p>People may already know about the OSGeo Live project. Its a great base as a VM since a) it is stable and very well tested, and b) it has much software installed, but in a way that is transparent through install scripts, so customization is as straightforward as it gets.. I was faced with a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11,12,5],"tags":[],"_links":{"self":[{"href":"http:\/\/blog.light42.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/1307"}],"collection":[{"href":"http:\/\/blog.light42.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/blog.light42.com\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/blog.light42.com\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/blog.light42.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1307"}],"version-history":[{"count":28,"href":"http:\/\/blog.light42.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/1307\/revisions"}],"predecessor-version":[{"id":1337,"href":"http:\/\/blog.light42.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/1307\/revisions\/1337"}],"wp:attachment":[{"href":"http:\/\/blog.light42.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1307"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/blog.light42.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1307"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/blog.light42.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1307"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}