Tuesday, March 31, 2015

Building Apache Spark Applications In Visual Studio

There are several applications named Spark. This post refers to Apache Spark, the in-memory cluster computing framework that is part of the Hadoop reference architecture. Generally speaking, the documentation is oriented toward developers coding on a Linux laptop using sbt or Maven for CI support. I am working on a mixed team which prefers the Microsoft tooling (as do I). Our goal was to find a language with built-in Visual Studio and TFS support that would also deploy to the Hadoop cluster using the supported deployment tooling (Ambari and spark-submit) with only minor environment reconfiguration.

Tooling:
Team Foundation Server 2013
Visual Studio 2013
Python Tools for Visual Studio (UnitTest, pip, python environment support, REPL integration, and MSBuild support for Python setup tools)
Spark 1.3
Python 2.7 (3.N, PyPy, and Anaconda are not tested with Spark yet)
CentOS
Windows 8.1
The magical Hortonworks Sandbox
PowerShell 3
GitHub for pulling Spark


Windows Setup
Install VS, PTVS, Python 2.7, IPython, and GitHub to the default paths.
Install Spark to C:\Spark.
Add the environment variables below (super important).

SPARK_HOME    C:\Spark
PYSPARK_HOME  C:\Spark\python
PY4J_HOME     C:\Spark\python\lib\py4j-0.8.2.1-src.zip
PYTHONPATH    C:\Python27;C:\Python27\Scripts;%SPARK_HOME%;%PYSPARK_HOME%;%PY4J_HOME%

These are optional:
PYTHON2   C:\python27\python
PYTHON3   C:\python3\python
ANACONDA  C:\Users\ealdinger\AppData\Local\Continuum\Anaconda
GIT_HOME  <wherever you dump your files>

You can set these from PowerShell or from System > Advanced system settings > Environment Variables.
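
Before going further, confirm that Python itself can see the new variables. Here is a quick sanity check to run from any Python 2.7 prompt (a minimal sketch; the names match the table above):

import os

# Each line should print a real path, not None
for name in ("SPARK_HOME", "PYSPARK_HOME", "PY4J_HOME", "PYTHONPATH"):
    print name, "=", os.environ.get(name)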

Testing Setup
Open PowerShell and confirm the relevant directories are present. Each of the four checks below should print True (they test Path, so this assumes the Spark and Python directories were also appended there; if you only set PYTHONPATH, test $env:PYTHONPATH the same way):
$py      = $env:Path | Select-String -Pattern "c:\\python27"
$spark   = $env:Path | Select-String -Pattern "c:\\spark"
$pyspark = $env:Path | Select-String -Pattern "c:\\spark\\python"
$py4j    = $env:Path | Select-String -Pattern "c:\\spark\\python\\lib\\py4j-0.8.2.1-src.zip"
$py -ne $null; $spark -ne $null; $pyspark -ne $null; $py4j -ne $null

Start IPython and test that pyspark can be imported. Paste in the lines below:
from pyspark import SparkContext
logFile = r"c:\spark\README.md"  # Should be some file on your system
sc = SparkContext("local", "SimpleApp")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print "_-_-_-_-_-_-_-_-_-_-_-_-_-_-"
print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
print "_-_-_-_-_-_-_-_-_-_-_-_-_-_-"
sc.stop()

Look for output like this:
In [32]: print "_-_-_-_-_-_-_-_-_-_-_-_-_-_-"
_-_-_-_-_-_-_-_-_-_-_-_-_-_-
In [33]: print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
Lines with a: 60, lines with b: 29
In [34]: print "_-_-_-_-_-_-_-_-_-_-_-_-_-_-"
_-_-_-_-_-_-_-_-_-_-_-_-_-_-

Start Visual Studio and create a new Python Application project.
Right-click Search Paths in Solution Explorer and choose Add PYTHONPATH to Search Path. You should see the Spark, Spark\python, and py4j entries appear.
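If the entries do not appear, the same directories can be appended in code at the top of the script instead. This is a fallback sketch, assuming the install paths and environment variables from the setup above:

import os
import sys

# Resolve Spark's location from the environment, falling back to the
# install path used in this post
spark_home = os.environ.get("SPARK_HOME", r"C:\Spark")
sys.path.append(os.path.join(spark_home, "python"))
sys.path.append(os.path.join(spark_home, "python", "lib", "py4j-0.8.2.1-src.zip"))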
Add a file to test with and paste in:
import re
for test_string in ['555-1212', 'ILL-EGAL']:
    if re.match(r'^\d{3}-\d{4}$', test_string):
        print test_string, 'is a valid US local phone number'
    else:
        print test_string, 'rejected'
print 'end of test'
Save and Start with Debugging.
Add another file or change the first one.
from pyspark import SparkContext

logFile = r"c:\spark\README.md"  # Should be some file on your system
sc = SparkContext("local", "Simple App")

logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print "_-_-_-_-_-_-_-_-_-_-_-_-_-_-"
print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
print "_-_-_-_-_-_-_-_-_-_-_-_-_-_-"

sc.stop()
Save and Start with Debugging. The script should run with a lot of log output. The final line should be:
Lines with a: 60, lines with b: 29
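
One refinement worth making before the script grows: wrap the work in try/finally so the SparkContext is always stopped, even when a transformation fails partway through. A minimal sketch of the same job:

from pyspark import SparkContext

sc = SparkContext("local", "Simple App")
try:
    logData = sc.textFile(r"c:\spark\README.md").cache()
    numAs = logData.filter(lambda s: 'a' in s).count()
    numBs = logData.filter(lambda s: 'b' in s).count()
    print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
finally:
    # Release the context even if a step above throws
    sc.stop()

When the job later moves to the cluster, note that a master hard-coded in the constructor takes precedence over spark-submit's --master flag, so the "local" argument is the main thing to revisit.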

REF:
http://mund-consulting.com/Blog/using-ipython-and-visual-studio-with-apache-spark/
