AmpCamp 2014


BDAS: the Berkeley Data Analytics Stack

I participated online in roughly twelve hours of lecture and lab on November 20 and 21, 2014 at AmpCamp 5 (I also attended one in Fall 2012). I put an emphasis on Python, IPython Notebook, and SQL.

Once again this year, the camp mechanics went very smoothly: the online exercises were readable and succinct, and the Spark docs were good. Spark's Python API, called pyspark, is advancing, although some interfaces are not yet available in Python; Spark SQL appears to be usable.

To set up on my own Linux box, I unzipped the following files:
ampcamp5-usb.zip ampcamp-pipelines.zip training-downloads.zip
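
That is, roughly:

##-- unpack the camp materials
unzip ampcamp5-usb.zip
unzip ampcamp-pipelines.zip
unzip training-downloads.zip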

The resulting directories provided a pre-built Spark 1.1:

Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_65)

The lab exercises are almost all available in both Scala and Python. Tools for the first labs:

$SPARK_HOME/bin/spark-shell
$SPARK_HOME/bin/pyspark

and for extra practice

$SPARK_HOME/bin/spark-submit
$SPARK_HOME/bin/run-example
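
To give the flavor of those first labs, here is a minimal pyspark sketch of the kind the exercises walk through; the input path is illustrative, not taken from the camp materials:

# run inside bin/pyspark, where sc (the SparkContext) is pre-created
lines = sc.textFile("data/pagecounts")                   # illustrative path
words = lines.flatMap(lambda line: line.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.take(5)                                           # peek at a few (word, count) pairs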

IPython Notebook

An online teaching assistant (TA) suggested a command for launching the Notebook. Here are my notes:

##-- TA suggestion
IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark --master "local[4]"

##-- options for a server already set up with a Notebook
--matplotlib inline --ip=192.168.1.200 --no-browser --port=8888

##-- COMBINE
IPYTHON_OPTS="notebook --matplotlib inline --ip=192.168.1.200 --no-browser --port=8888" $SPARK_HOME/bin/pyspark --master "local[4]"

The IPython Notebook worked! Lots of conveniences, interactivity, and viz potential were immediately available against the pyspark environment. I created several Notebooks in short order to test and explore, for example SQL.

The SQL exercise reads data from a format new to me, called Parquet.
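
As a sketch of what those Notebook cells looked like, assuming Spark 1.1's pyspark API and an illustrative Parquet path:

# run in a pyspark shell/Notebook where sc already exists
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
wiki = sqlContext.parquetFile("data/wiki_parquet")    # illustrative path
wiki.registerTempTable("wiki")                        # expose the SchemaRDD to SQL
sqlContext.sql("SELECT COUNT(*) FROM wiki").collect()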

 
Part 1.2

After rest and recuperation, I wanted to try Python on the almost-ready Spark 1.2 branch. It turned out to build and run easily. First, get the Spark code:

 https://github.com/apache/spark/tree/branch-1.2
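
One way to fetch it:

##-- clone and switch to the 1.2 branch
git clone https://github.com/apache/spark.git
cd spark
git checkout branch-1.2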

Make sure Maven is installed on your system, then run:

./make-distribution.sh
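
The script also passes Maven profile flags through to the build; for example (the Hadoop profile here is illustrative, so pick one matching your environment):

##-- illustrative flags; choose the profile for your Hadoop version
./make-distribution.sh --tgz -Phadoop-2.4 -Dhadoop.version=2.4.0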

Afterwards, I set $SPARK_HOME to this directory and launched IPython Notebook again. All the examples and experiments I had built worked without modification! Success.

Other Links

http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
http://spark-summit.org/2014/training
https://github.com/amplab-extras

http://www.planetscala.com/

experimental: https://github.com/ooyala/spark-jobserver