BDAS, the Berkeley Data Analytics Stack
Suffice it to say I participated online in roughly twelve hours of lecture and lab at AmpCamp 5 on Nov 20 and 21, 2014 (I also attended one in Fall 2012). I put my emphasis on Python, the IPython Notebook, and SQL.
Once again this year, the camp mechanics went very smoothly: the online exercises were readable and succinct, the Spark docs were solid, Spark's Python API (called pyspark) is advancing, although some interfaces are not yet available from Python, and Spark SQL appears to be usable.
To set up on my own Linux box, I unzipped the following files:
ampcamp5-usb.zip ampcamp-pipelines.zip training-downloads.zip
The resulting directories provided a pre-built Spark 1.1, whose shell banner reports:
Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_65)
The lab exercises are almost all available in both Scala and Python. The tools for the first labs:
$SPARK_HOME/bin/spark-shell
$SPARK_HOME/bin/pyspark
and for extra practice
$SPARK_HOME/bin/spark-submit
$SPARK_HOME/bin/run-example
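To make the first lab concrete, here is a minimal word-count sketch of the kind those tools run. It assumes the interactive pyspark shell, where the SparkContext sc is predefined; the input path is just an assumption, any local text file will do.

# inside $SPARK_HOME/bin/pyspark; `sc` is predefined by the shell
lines = sc.textFile("README.md")   # hypothetical input; any text file works
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.take(5))              # a few (word, count) pairs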
IPython Notebook
An online teaching assistant (TA) suggested a command line to launch the Notebook; here are my notes:
##-- TA suggestion
IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark --master "local[4]"
##-- a server already set up with a Notebook, options
--matplotlib inline --ip=192.168.1.200 --no-browser --port=8888
##-- COMBINE
IPYTHON_OPTS="notebook --matplotlib inline --ip=192.168.1.200 --no-browser --port=8888" $SPARK_HOME/bin/pyspark --master "local[4]"
The IPython Notebook worked! Lots of conveniences, interactivity, and visualization potential were immediately available against the pyspark environment. I created several Notebooks in short order to test and explore, for example with SQL.
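As a flavor of those experiments, here is a minimal Spark SQL sketch against the 1.1-era pyspark.sql API; the table name and rows are made up for illustration.

from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)       # sc is predefined in pyspark / the Notebook
people = sc.parallelize([Row(name="alice", age=34), Row(name="bob", age=29)])
schemaPeople = sqlContext.inferSchema(people)   # Spark 1.1 API; later versions use createDataFrame
schemaPeople.registerTempTable("people")
print(sqlContext.sql("SELECT name FROM people WHERE age > 30").collect())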
The SQL exercise reads data from a format that was new to me, called Parquet, a columnar on-disk storage format.
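Reading Parquet from pyspark is straightforward; a minimal sketch, assuming the Spark 1.1 SQLContext API and a hypothetical path to a Parquet directory:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)                          # sc is predefined by pyspark
wiki = sqlContext.parquetFile("data/wiki_parquet")   # hypothetical path
wiki.registerTempTable("wiki")
print(sqlContext.sql("SELECT COUNT(*) FROM wiki").collect())

Writing goes the other way, via SchemaRDD.saveAsParquetFile(path).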
Part 1.2
After rest and recuperation, I wanted to try Python on the almost-ready Spark 1.2 branch. It turned out to build and run easily. First, get the Spark code:
https://github.com/apache/spark/tree/branch-1.2
Make sure Maven is installed on your system, then run
./make-distribution.sh
Afterwards, I set $SPARK_HOME to the resulting directory and launched the IPython Notebook again. All the examples and experiments I had built worked without modification. Success!
Other Links
http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
http://spark-summit.org/2014/training
https://github.com/amplab-extras
http://www.planetscala.com/
experimental: https://github.com/ooyala/spark-jobserver