The Berkeley Data Analytics Stack (BDAS) was the central subject at AmpCamp 3.
Spark is the core of the stack. It was recently accepted into the Apache Incubator. True to form for a fast-moving OSS project, we actually used the 0.8.0 version from the git repo, rather than the 0.7.3 release you will find on the Apache site.
Spark is built on Scala, which runs in a Java VM. All the lab exercises used openJDK7, and they ran very well. (We moved all of the OSGeo Live Java projects to openJDK7 last December or so, and have never looked back.)
Among the components of BDAS, I got the sense that Spark and Mesos were the most stable, with Shark (the SQL interface) and pySpark (the Python interface) also strong. The in-memory filesystem Tachyon was presented as clearly in its early stages, as were the very interesting GraphX, MLBase and BlinkDB.
All the lab exercises were executed on an Amazon Web Services (AWS) cluster. Thanks to excellent tech and teamwork, the labs flowed smoothly both days. However, I was interested in bringing up the BDAS stack on my own machines. Here is what I did:
* Make a working directory; I called mine amplab3
* I found that java -version showed 1.6 even though I had installed 1.7, so I used these steps to change it:
dbb@i7c:~/amplab3$ update-java-alternatives -l
java-1.6.0-openjdk-amd64 1061 /usr/lib/jvm/java-1.6.0-openjdk-amd64
java-1.7.0-openjdk-amd64 1051 /usr/lib/jvm/java-1.7.0-openjdk-amd64
sudo apt-get install icedtea-7-plugin
sudo update-java-alternatives -s java-1.7.0-openjdk-amd64
* Install Hadoop from Cloudera (CDH4) via .debs, following these instructions (a rough sketch of what that looked like for me is just after this list)
* verify the Hadoop install by starting it
sudo service hadoop-hdfs-namenode start
sudo service hadoop-hdfs-secondarynamenode start
* I did not use an Ubuntu package for Scala. I used openJDK7 from the Ubuntu repositories, then got Scala here and unpacked scala-2.9.3.tgz in the working directory.
* git clone https://github.com/mesos/spark.git
* cd spark; cp conf/spark-env.sh.template conf/spark-env.sh
* add SCALA_HOME=/path/to/scala-2.9.3 to conf/spark-env.sh
* sbt/sbt assembly
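Since I only linked the Cloudera instructions in the Hadoop step above, here is a rough sketch of what that looked like on Ubuntu. Treat it as a sketch rather than gospel: the one-click repository .deb URL and the exact package names are from memory of the CDH4 docs, so verify them against the linked instructions (the packages do line up with the hadoop-hdfs-* services started above):
wget http://archive.cloudera.com/cdh4/one-click-install/precise/amd64/cdh4-repository_1.0_all.deb
sudo dpkg -i cdh4-repository_1.0_all.deb
sudo apt-get update
sudo apt-get install hadoop-hdfs-namenode hadoop-hdfs-secondarynamenode
# format HDFS once, before starting the namenode for the first time
sudo -u hdfs hdfs namenode -format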
You should be ready to go! Of course, I did a few extra things, and I took many detours along the way, but that was about it. In particular, I found that the Hadoop data directory in this setup is under /var/lib/hadoop-*, so I created an alias for it pointing at a fast local disk with plenty of free space. You could edit the conf files in /etc to get the same effect, but I did not want to change things at that fine a level yet.
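If you want to do the same, a symlink is the simplest form of that alias. A minimal sketch, with the caveats that /srv/fastdisk is a stand-in for whatever fast volume you have, and hadoop-hdfs is assumed to be the particular /var/lib/hadoop-* directory the CDH4 HDFS packages created (stop the services first):
sudo service hadoop-hdfs-namenode stop
sudo service hadoop-hdfs-secondarynamenode stop
sudo mv /var/lib/hadoop-hdfs /srv/fastdisk/hadoop-hdfs
sudo ln -s /srv/fastdisk/hadoop-hdfs /var/lib/hadoop-hdfs
sudo service hadoop-hdfs-namenode start
sudo service hadoop-hdfs-secondarynamenode start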
Once spark or pyspark is running, you can hit port 3030 or 4040 (depending on your Spark version) with a web browser for an interface to the engine. Other useful docs here:
http://spark.incubator.apache.org/docs/latest/configuration.html
http://ampcamp.berkeley.edu/exercises-strata-conf-2013/index.html
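Finally, a quick way to smoke-test the build, assuming the spark-shell launcher that sits at the top of the 0.8 source tree (pyspark is next to it if you prefer Python). With the shell open, the web UI mentioned above should answer in your browser:
cd spark   # from the working directory, if you are not already there
./spark-shell
# at the scala> prompt, run a tiny job:
#   sc.parallelize(1 to 1000).count()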