GPU621/Apache Spark

Spark vs Hadoop Wordcount Performance
[[File:Googlecloud-setup-9.jpg]]
 
'''To view the individual nodes in the cluster, go to Menu -> Virtual Machines -> VM Instances'''
 # Map each word to a (word, 1) pair, then sum the counts for each unique word
 wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda count1, count2: count1 + count2)
 # Write the resulting (word, count) pairs to the output path given as the second argument
 wordCounts.saveAsTextFile(sys.argv[2])
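
For context, here is a minimal sketch of the complete PySpark script these two lines come from (assumed here to be saved as wordcount.py and uploaded to the bucket; the file name and app name are placeholders, not taken from the original):

 import sys
 from pyspark import SparkContext
 
 if __name__ == "__main__":
     # argv[1] is the input path and argv[2] the output path, typically gs:// URIs on Dataproc
     sc = SparkContext(appName="wordcount")
 
     # Read every line of every file under the input path
     lines = sc.textFile(sys.argv[1])
 
     # Split each line into individual words
     words = lines.flatMap(lambda line: line.split())
 
     # Map each word to (word, 1) and sum the counts per unique word
     wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda count1, count2: count1 + count2)
 
     # Write the (word, count) pairs to the output path
     wordCounts.saveAsTextFile(sys.argv[2])
 
     sc.stop()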
 
'''Finally, add the input files containing the text that the wordcount jobs will process'''
Now that we have our project code, input files, and Dataproc cluster set up, we can proceed to run the Hadoop MapReduce and Spark wordcount jobs.
 
'''Run the Hadoop MapReduce Job'''
The job takes 3 arguments: the name of the example program to run (wordcount), the input folder, and the output folder:
 wordcount gs://<myBucketName>/inputFolder gs://<myBucketName>/output
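
This walkthrough submits the jobs through the Dataproc console, but as a rough sketch the same wordcount jobs can also be submitted from a terminal with the Cloud SDK installed. The cluster name, region, and bucket below are placeholder assumptions, and the jar path is the examples jar normally shipped on Dataproc images; gcloud is wrapped in Python here to match the rest of the code in this article:

 import subprocess
 
 CLUSTER = "my-dataproc-cluster"   # assumed cluster name
 REGION = "us-central1"            # assumed region
 BUCKET = "myBucketName"           # assumed bucket name
 
 # Hadoop MapReduce wordcount, using the example jar included on Dataproc nodes
 subprocess.run([
     "gcloud", "dataproc", "jobs", "submit", "hadoop",
     f"--cluster={CLUSTER}", f"--region={REGION}",
     "--jar=file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar",
     "--", "wordcount", f"gs://{BUCKET}/inputFolder", f"gs://{BUCKET}/output",
 ], check=True)
 
 # Spark wordcount, running the PySpark script sketched above
 subprocess.run([
     "gcloud", "dataproc", "jobs", "submit", "pyspark", f"gs://{BUCKET}/wordcount.py",
     f"--cluster={CLUSTER}", f"--region={REGION}",
     "--", f"gs://{BUCKET}/inputFolder", f"gs://{BUCKET}/output",
 ], check=True)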
 
'''Note: running a job will create the output folder. For subsequent jobs, be sure to delete the output folder first, or Hadoop and Spark will not run. This limitation exists to prevent existing output from being overwritten.'''
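
For example, the previous output folder can be cleared before re-running a job with gsutil (a minimal sketch, reusing the placeholder bucket name from above):

 import subprocess
 
 BUCKET = "myBucketName"  # assumed bucket name
 
 # -m parallelizes the delete; -r removes the output folder and all part files in it
 subprocess.run(["gsutil", "-m", "rm", "-r", f"gs://{BUCKET}/output"], check=True)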
When the jobs have completed and all the input files have been processed, Hadoop provides '''counters''': statistics about the executed job.
You can also navigate back to the '''Jobs''' tab to see the total elapsed time of the job.
 
'''Some counters of note:'''
[[File:Dataproc-hadoop-2.jpeg]]
 
'''To combine the output files into a single .txt file'''
# Open an SSH session to the master VM node: '''Menu -> Compute -> Compute Engine -> VM Instances -> SSH (of the 'm' master node) -> Open in Browser Window'''
# Run the following command in the shell to aggregate the results into an 'output.txt' file:
 gsutil cat gs://rinsereduce/output/* > output.txt
You can then download the output file from the VM's local storage to your local machine: press the dropdown from the Gear icon in the SSH window and select '''Download File'''.