GPU621/Apache Spark

=== Setting up Dataproc and Google Cloud Storage ===
Copy the Hadoop wordcount example, which is available on every Dataproc cluster, from the master node VM to our Cloud Storage bucket:
# Open Secure Shell (SSH) from the VM Instances list: Menu -> Compute -> Compute Engine.
# To copy the jar from the VM's local disk to the Cloud Storage bucket, enter the following command in the shell (a listing command to verify the uploads follows these steps):
<Code> gsutil cp /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar gs://<myBucketName>/ </Code>

Save the Spark wordcount example into the Cloud Storage bucket by dragging and dropping it into the Storage browser:
# To open the browser: Menu -> Storage -> Browser.
# Drag and drop the word-count.py below into the browser, or use 'UPLOAD FILES' to upload it.
<Code>
#!/usr/bin/env python
import pyspark
import sys

if len(sys.argv) != 3:
    raise Exception("Exactly 2 arguments are required: <inputUri> <outputUri>")

inputUri = sys.argv[1]
outputUri = sys.argv[2]

# Split the input text into words, count each occurrence, and write the totals to the output URI
sc = pyspark.SparkContext()
lines = sc.textFile(sys.argv[1])
words = lines.flatMap(lambda line: line.split())
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda count1, count2: count1 + count2)
wordCounts.saveAsTextFile(sys.argv[2])
</Code>
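Both files should now be in the bucket. As a quick check, the bucket contents can be listed from the SSH session or Cloud Shell; this is only a sketch, and <myBucketName> is the same placeholder bucket name used above:
<Code> gsutil ls gs://<myBucketName>/ </Code>
The listing should include hadoop-mapreduce-examples.jar and word-count.py.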
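Once word-count.py is in the bucket, it is later submitted to the Dataproc cluster as a PySpark job. The command below is only a sketch of that step: the cluster name, region, and input/output URIs are placeholders, not values taken from this walkthrough:
<Code>
gcloud dataproc jobs submit pyspark gs://<myBucketName>/word-count.py \
    --cluster=<myClusterName> --region=<myRegion> \
    -- gs://<myBucketName>/input/ gs://<myBucketName>/output/
</Code>
The two arguments after the lone "--" are passed to word-count.py as its <inputUri> and <outputUri>.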
=== Results ===