Tuesday, March 31, 2015

Building Apache Spark Applications In Visual Studio

There are several applications named Spark. This post refers to Apache Spark, the in-memory cluster computing framework that is part of the Hadoop reference architecture. Generally speaking, the documentation is oriented toward developers coding on a Linux laptop using sbt or Maven for CI support. I am working on a mixed team which prefers the Microsoft tooling (as do I). Our goal was to find a language with built-in Visual Studio and TFS support that would also deploy to the Hadoop cluster using the supported deployment tooling (Ambari and spark-submit) with only minor environment reconfiguration.

Tooling:
Team Foundation Server 2013
Visual Studio 2013
Python Tools for Visual Studio (UnitTest, pip, python environment support, REPL integration, and MSBuild support for Python setup tools)
Spark 1.3
Python 2.7 (3.N, PyPy, and Anaconda are not tested with Spark yet)
CentOS
Windows 8.1
The magical Hortonworks Sandbox
PowerShell 3
GitHub for pulling Spark


Windows Setup
Install VS, PTVS, Python 2.7, IPython, and GitHub to the default paths.
Install Spark to C:\Spark.
Add the environment variables below (super important).

SPARK_HOME    C:\Spark
PYSPARK_HOME  C:\Spark\python
PY4J_HOME     C:\Spark\python\lib\py4j-0.8.2.1-src.zip
PYTHONPATH    C:\Python27;C:\Python27\Scripts;%SPARK_HOME%;%PYSPARK_HOME%;%PY4J_HOME%

These are optional:
PYTHON2   C:\python27\python
PYTHON3   C:\python3\python
ANACONDA  C:\Users\ealdinger\AppData\Local\Continuum\Anaconda
GIT_HOME  <wherever you dump your files>

You can set these from PowerShell or from System > Advanced system settings > Environment Variables.
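
Before going further, confirm that Python itself can see the new variables. Here is a quick sanity check to run from any Python 2.7 prompt (a minimal sketch; the names match the table above):

import os

# Each line should print a real path, not None
for name in ("SPARK_HOME", "PYSPARK_HOME", "PY4J_HOME", "PYTHONPATH"):
    print name, "=", os.environ.get(name)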

Testing Setup
Open PowerShell and confirm the relevant directories are present. Each of the four checks below should print True (they test Path, so this assumes the Spark and Python directories were also appended there; if you only set PYTHONPATH, test $env:PYTHONPATH the same way):
$py      = $env:Path | Select-String -Pattern "c:\\python27"
$spark   = $env:Path | Select-String -Pattern "c:\\spark"
$pyspark = $env:Path | Select-String -Pattern "c:\\spark\\python"
$py4j    = $env:Path | Select-String -Pattern "c:\\spark\\python\\lib\\py4j-0.8.2.1-src.zip"
$py -ne $null; $spark -ne $null; $pyspark -ne $null; $py4j -ne $null

Start IPython and test that pyspark can be imported. Paste in the lines below:
from pyspark import SparkContext
logFile = r"c:\spark\README.md"  # Should be some file on your system
sc = SparkContext("local", "SimpleApp")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print "_-_-_-_-_-_-_-_-_-_-_-_-_-_-"
print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
print "_-_-_-_-_-_-_-_-_-_-_-_-_-_-"
sc.stop()

Look for output like this:
In [32]: print "_-_-_-_-_-_-_-_-_-_-_-_-_-_-"
_-_-_-_-_-_-_-_-_-_-_-_-_-_-
In [33]: print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
Lines with a: 60, lines with b: 29
In [34]: print "_-_-_-_-_-_-_-_-_-_-_-_-_-_-"
_-_-_-_-_-_-_-_-_-_-_-_-_-_-

Start Visual Studio and create a new Python Application project.
Right-click Search Paths in Solution Explorer and choose Add PYTHONPATH to Search Path. You should see the Spark, Spark\python, and py4j entries appear.
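If the entries do not appear, the same directories can be appended in code at the top of the script instead. This is a fallback sketch, assuming the install paths and environment variables from the setup above:

import os
import sys

# Resolve Spark's location from the environment, falling back to the
# install path used in this post
spark_home = os.environ.get("SPARK_HOME", r"C:\Spark")
sys.path.append(os.path.join(spark_home, "python"))
sys.path.append(os.path.join(spark_home, "python", "lib", "py4j-0.8.2.1-src.zip"))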
Add a file to test with and paste in:
import re
for test_string in ['555-1212', 'ILL-EGAL']:
    if re.match(r'^\d{3}-\d{4}$', test_string):
        print test_string, 'is a valid US local phone number'
    else:
        print test_string, 'rejected'
print 'end of test'
Save and Start with Debugging.
Add another file or change the first one.
from pyspark import SparkContext

logFile = r"c:\spark\README.md"  # Should be some file on your system
sc = SparkContext("local", "Simple App")

logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print "_-_-_-_-_-_-_-_-_-_-_-_-_-_-"
print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
print "_-_-_-_-_-_-_-_-_-_-_-_-_-_-"

sc.stop()
Save and Start with Debugging. The script should run with a lot of log output. The final line should be:
Lines with a: 60, lines with b: 29
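
One refinement worth making before the script grows: wrap the work in try/finally so the SparkContext is always stopped, even when a transformation fails partway through. A minimal sketch of the same job:

from pyspark import SparkContext

sc = SparkContext("local", "Simple App")
try:
    logData = sc.textFile(r"c:\spark\README.md").cache()
    numAs = logData.filter(lambda s: 'a' in s).count()
    numBs = logData.filter(lambda s: 'b' in s).count()
    print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
finally:
    # Release the context even if a step above throws
    sc.stop()

When the job later moves to the cluster, note that a master hard-coded in the constructor takes precedence over spark-submit's --master flag, so the "local" argument is the main thing to revisit.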

REF:
http://mund-consulting.com/Blog/using-ipython-and-visual-studio-with-apache-spark/
