There are several applications named Spark. This post refers to Apache Spark, the in-memory data processing engine that is part of the Hadoop reference architecture. Generally speaking, the documentation is oriented towards developers coding on a Linux laptop using sbt or Maven for CI support. I am working on a mixed team which prefers the Microsoft tooling (as do I). Our goal was to find a language with built-in Visual Studio and TFS support that would also deploy to HDFS using the Hadoop/Spark supported deployment tooling (Ambari and spark-submit) with only minor environment reconfiguration.
Tooling:
Team Foundation Server 2013
Visual Studio 2013
Python Tools for Visual Studio (UnitTest, pip, python environment support, REPL integration, and MSBuild support for Python setup tools)
Spark 1.3
Python 2.7 (3.N, PyPy, and Anaconda are not tested with Spark yet)
CentOS
Windows 8.1
Hortonworks magical sandbox.
Powershell 3
GitHub for pulling Spark
Windows Setup
Install VS, PTVS, Python 2.7, IPython, and GitHub to the default paths.
Install Spark in C:\Spark
Add a ton of environment variables (super important).
SPARK_HOME C:\Spark
PYSPARK_HOME C:\Spark\Python
PY4J_HOME C:\Spark\python\lib\py4j-0.8.2.1-src.zip
PYTHONPATH C:\python27;C:\python27\scripts;%SPARK_HOME%;%PYSPARK_HOME%;%PY4J_HOME%
These are optional
PYTHON2 C:\python27\python
PYTHON3 C:\python3\python
ANACONDA C:\Users\ealdinger\AppData\Local\Continuum\Anaconda
GIT_HOME <wherever you dump your files>
You can set these from PowerShell or from System - Advanced - Environment Variables.
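If the variables are not picked up by a particular interpreter, the same three locations can be derived from SPARK_HOME at the top of a script. This is only a minimal sketch, assuming the C:\Spark layout and the py4j version named above. spark_python_paths is a hypothetical helper for illustration, not part of Spark.

```python
from __future__ import print_function

import ntpath  # applies Windows path rules on any OS


def spark_python_paths(spark_home, py4j_zip="py4j-0.8.2.1-src.zip"):
    """Return SPARK_HOME, PYSPARK_HOME and PY4J_HOME derived from spark_home.

    Hypothetical helper; py4j_zip defaults to the file name listed above.
    """
    pyspark_home = ntpath.join(spark_home, "python")
    py4j_home = ntpath.join(pyspark_home, "lib", py4j_zip)
    return [spark_home, pyspark_home, py4j_home]


for entry in spark_python_paths("C:\\Spark"):
    print(entry)

# In a real script, before importing pyspark, something like:
#   import os, sys
#   sys.path.extend(spark_python_paths(os.environ["SPARK_HOME"]))
```

Setting the environment variables is still the simpler route for Visual Studio, since the Search Paths step later reads PYTHONPATH.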
Testing Setup
Open Powershell
$py = $env:Path|select-string -pattern "c:\\python27"
$spark = $env:Path|select-string -pattern "c:\\spark"
$pyspark = $env:Path|select-string -pattern "c:\\spark\\python"
$py4j = $env:Path|select-string -pattern "C:\\Spark\\python\\lib\\py4j-0.8.2.1-src.zip"
$py -ne $null;$spark -ne $null;$pyspark -ne $null;$py4j -ne $null;
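The PowerShell check looks at Path, while Python resolves imports through PYTHONPATH, so a similar check from the Python side may be useful. The sketch below is one way to do it. missing_entries is a hypothetical helper, and the sample value is made up for the demonstration.

```python
from __future__ import print_function


def missing_entries(path_value, required):
    """Return the required entries not found in a ';'-separated path string.

    The comparison ignores case and trailing backslashes, as Windows does.
    """
    def normalize(entry):
        return entry.strip().rstrip("\\").lower()

    present = set(normalize(p) for p in path_value.split(";") if p.strip())
    return [r for r in required if normalize(r) not in present]


required = [
    "C:\\python27",
    "C:\\Spark",
    "C:\\Spark\\python",
    "C:\\Spark\\python\\lib\\py4j-0.8.2.1-src.zip",
]

# Made-up sample value with the py4j zip left out on purpose.
sample = "c:\\python27;c:\\python27\\scripts;C:\\Spark;C:\\Spark\\Python\\"
print(missing_entries(sample, required))
```

To check the real environment, pass os.environ.get("PYTHONPATH", "") instead of the sample string. An empty list means every required entry was found.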
Start IPython and test that pyspark can be imported
Paste the lines below
from pyspark import SparkContext
logFile = r"c:\spark\README.md" # Should be some file on your system; raw string keeps the backslashes literal
sc = SparkContext("local", "SimpleApp")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print "_-_-_-_-_-_-_-_-_-_-_-_-_-_-"
print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
print "_-_-_-_-_-_-_-_-_-_-_-_-_-_-"
sc.stop()
Look for
In [32]: print "_-_-_-_-_-_-_-_-_-_-_-_-_-_-"
_-_-_-_-_-_-_-_-_-_-_-_-_-_-
In [33]: print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
Lines with a: 60, lines with b: 29
In [34]: print "_-_-_-_-_-_-_-_-_-_-_-_-_-_-"
_-_-_-_-_-_-_-_-_-_-_-_-_-_-
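The exact counts depend on which README.md you point at. To cross-check what Spark reports, the same two counts can be computed in plain Python with no Spark involved. This is a rough sketch. count_lines_containing is a hypothetical helper and the sample lines are made up.

```python
from __future__ import print_function


def count_lines_containing(lines, needle):
    """Count the lines that contain needle, like filter(...).count() above."""
    return sum(1 for line in lines if needle in line)


# Made-up sample standing in for the file contents.
sample_lines = ["alpha", "beta", "gamma", "xyz"]
print("Lines with a: %i, lines with b: %i" % (
    count_lines_containing(sample_lines, 'a'),
    count_lines_containing(sample_lines, 'b')))

# Against the real file, something like:
#   with open(r"c:\spark\README.md") as handle:
#       lines = handle.readlines()
#   print(count_lines_containing(lines, 'a'))
```

If the numbers from this and from the Spark run disagree, the two are probably not reading the same file.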
Start Visual Studio - Create a new Python Application project
Right click Search Paths in the solution. Add PYTHONPATH to Search Path. You should see spark, spark\python and py4j
Add a file to test with and add the following
import re
for test_string in ['555-1212', 'ILL-EGAL']:
    if re.match(r'^\d{3}-\d{4}$', test_string):
        print test_string, 'is a valid US local phone number'
    else:
        print test_string, 'rejected'
print 'end of test'
Save and Start with Debugging
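Since PTVS lists UnitTest support, the same smoke test could also be written as a test case that shows up in the Visual Studio Test Explorer. The sketch below assumes nothing beyond the standard library. The names is_local_phone and LocalPhoneTest are made up for the example.

```python
from __future__ import print_function

import re
import unittest

# Same pattern as the script above: three digits, a dash, four digits.
LOCAL_PHONE = re.compile(r'^\d{3}-\d{4}$')


def is_local_phone(text):
    """Return True when text looks like a US local phone number."""
    return LOCAL_PHONE.match(text) is not None


class LocalPhoneTest(unittest.TestCase):
    def test_accepts_valid_number(self):
        self.assertTrue(is_local_phone('555-1212'))

    def test_rejects_letters(self):
        self.assertFalse(is_local_phone('ILL-EGAL'))


for test_string in ['555-1212', 'ILL-EGAL']:
    print(test_string, is_local_phone(test_string))

# To run the test case from a command line instead of Test Explorer:
#   python -m unittest <module name>
```

Keeping the check in a function makes it callable both from the debugger and from the test runner.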
Add another file or change the first one.
from pyspark import SparkContext
logFile = r"c:\spark\README.md" # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print "_-_-_-_-_-_-_-_-_-_-_-_-_-_-"
print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
print "_-_-_-_-_-_-_-_-_-_-_-_-_-_-"
sc.stop()
Save and Start with Debugging. The script should run with a lot of output. The final output should be
Lines with a: 60, lines with b: 29
REF:
http://mund-consulting.com/Blog/using-ipython-and-visual-studio-with-apache-spark/