Monday, October 17, 2016

Spark Setup and Installation on Windows

1. Download R
https://cran.r-project.org/mirrors.html
2. Download Scala
http://www.scala-lang.org/download/2.10.6.html
3. Install JDK 1.7 or later
4. Download Python

5. After saving the file, I downloaded the Hadoop binary file winutils.exe. Even though Spark runs independently of Hadoop, there is a bug on Windows where Spark searches for winutils.exe, which Hadoop needs, and throws an error when it cannot find it.

I downloaded the file from the link below:

http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe

I created a folder named winutils in C:\, created a bin directory inside it, and placed the winutils.exe file there. The file location is as follows.

C:\winutils\bin\winutils.exe

6. Set up environment variables
Press WIN + R to open the Run dialog and enter sysdm.cpl
I then clicked on the Advanced tab and then on Environment Variables. I clicked New under the user variables and added the following:

variable name HADOOP_HOME with its value as C:\winutils
variable name SPARK_HOME with its value as C:\spark
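
To confirm the two variables took effect, a small check can be run from a new Python session. This is only a sketch under the assumptions of this post: the function name check_spark_env is made up for illustration, and it expects winutils.exe under the bin folder of HADOOP_HOME as described in step 5.

```python
import ntpath  # Windows path rules, so the check behaves the same on any OS
import os


def check_spark_env(env, path_exists=os.path.exists):
    """Return a list of problems found; an empty list means the setup looks fine.

    `env` is any mapping of variable names to values (for example os.environ).
    `path_exists` is injectable so the check can be tried without touching disk.
    """
    problems = []
    for name in ("HADOOP_HOME", "SPARK_HOME"):
        if not env.get(name):
            problems.append(name + " is not set")

    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home:
        # Spark looks for winutils.exe inside the bin folder of HADOOP_HOME.
        winutils = ntpath.join(hadoop_home, "bin", "winutils.exe")
        if not path_exists(winutils):
            problems.append("winutils.exe not found at " + winutils)
    return problems


# Report against the real environment of the current process.
print(check_spark_env(os.environ) or "environment looks fine")
```

Environment variables are read when a process starts, so the check should run in a console or notebook opened after the variables were saved.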

7. Install Jupyter
https://jupyter.readthedocs.io/en/latest/install.html

8. Run jupyter notebook
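
The steps above do not say how the notebook finds the pyspark module. One common approach, sketched here as an assumption and not as part of the original setup, is to add the python folder under SPARK_HOME and the bundled py4j zip to sys.path at the top of the notebook. The helper names are hypothetical, and the py4j file name differs between Spark releases, so it is matched with a wildcard.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the folders/zips that must be on sys.path to import pyspark."""
    python_dir = os.path.join(spark_home, "python")
    # The py4j source zip ships with Spark; its version number varies by release.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def add_spark_to_sys_path(spark_home):
    """Prepend the Spark Python paths to sys.path if they are not already there."""
    for path in spark_python_paths(spark_home):
        if path not in sys.path:
            sys.path.insert(0, path)


# Only touch sys.path when SPARK_HOME is actually defined (see step 6).
spark_home = os.environ.get("SPARK_HOME")
if spark_home:
    add_spark_to_sys_path(spark_home)
```

After this cell runs, import pyspark should work in the same notebook, provided SPARK_HOME points at an unpacked Spark distribution.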

*************************

Continue reading:
http://spark.apache.org/docs/latest/programming-guide.html#overview

************************


Troubleshooting steps

Wrong (Python 2 only; tuple unpacking in lambda parameters is a syntax error in Python 3) => trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
Right (Python 3) => testErr = labelsAndPredictions.filter(lambda v: v[0] != v[1]).count() / float(testData.count())
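
The same fix can be tried without a Spark cluster. The sketch below uses a plain Python list as a stand-in for the RDD of (label, prediction) pairs; the sample values and the names mismatch and error_rate are made up for illustration.

```python
# Stand-in for the (label, prediction) pairs that the RDD would hold.
labels_and_preds = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

# Python 2 allowed:  lambda (v, p): v != p
# Python 3 removed tuple parameters, so index into the pair instead.
mismatch = lambda vp: vp[0] != vp[1]


def error_rate(pairs):
    """Fraction of pairs whose label and prediction differ."""
    pairs = list(pairs)
    wrong = sum(1 for pair in pairs if mismatch(pair))
    return wrong / float(len(pairs))


print(error_rate(labels_and_preds))  # 2 of the 4 pairs differ -> 0.5
```

The indexing form works in both Python 2 and Python 3, so it is the safer choice when a notebook may run under either.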

model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={}, impurity='gini', maxDepth=5, maxBins=32)



Error
[1] "The data can be downloaded from: http://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv "

  java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
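
This error means the spark-csv package is not on Spark's classpath. The log above looks like it came from a SparkR session; assuming the same problem shows up in a PySpark notebook, one possible fix is to ask Spark to fetch the package through the PYSPARK_SUBMIT_ARGS environment variable before the SparkContext is created. The package coordinates below are an assumption (a spark-csv build for Scala 2.10, matching step 2) and should be checked against spark-packages.org for the Spark version in use. The helper name pyspark_submit_args is hypothetical.

```python
import os


def pyspark_submit_args(packages):
    """Build a PYSPARK_SUBMIT_ARGS value that loads the given packages.

    The value must end with 'pyspark-shell' for PySpark to start correctly.
    """
    return "--packages " + ",".join(packages) + " pyspark-shell"


# Assumed coordinates: group:artifact_scalaVersion:version.
SPARK_CSV = "com.databricks:spark-csv_2.10:1.5.0"

# Must be set before the first SparkContext is created in this process.
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args([SPARK_CSV])
```

Spark downloads the package on first start, so the machine needs internet access the first time this runs.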