GPU621/Apache Spark

'''Running the Jobs in Dataproc'''
# Select your cluster
# Specify Hadoop as Job Type
# Specify the JAR which contains the Hadoop MapReduce algorithm, give the three arguments to wordcount, and submit the job (a sketch of the WordCount algorithm follows the note below).
gs://<myBucketName>/hadoop-mapreduce-examples.jar
'''note: Running the job will create the output folder. However, for subsequent jobs be sure to delete the output folder first, otherwise Hadoop or Spark will not run. This limitation exists to prevent existing output from being overwritten.'''
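The examples JAR referenced above contains the stock WordCount implementation. As a point of reference for what that MapReduce algorithm looks like, here is a minimal sketch (illustrative only, not the exact code shipped in hadoop-mapreduce-examples.jar): the mapper emits a (word, 1) pair for every word in its input split, and the reducer sums the counts for each word. The input and output directories are taken from the job arguments.

<syntaxhighlight lang="java">
// Minimal WordCount sketch; the real implementation lives inside
// hadoop-mapreduce-examples.jar, this is only an illustration.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: split each input record into words and emit (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce task: sum the counts for each word and emit (word, total).
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    // args[0] = input directory (e.g. a gs:// path), args[1] = output directory.
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
</syntaxhighlight>

Reusing the reducer as the combiner lets partial sums be computed on the map side, which reduces the amount of data shuffled to the reduce tasks.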
 
 
'''Retrieve the Results'''
You can observe the progress of each of the map and reduce tasks in the '''Job output''' console.
When the job has completed and all the input files have been processed, Hadoop provides '''counters''': statistics on the executed job.
Some counters of note (a sketch for reading these counters programmatically follows the list):
# number of splits: a split is the amount of data in one map task (Hadoop default block size is 128 MB)
# Launched map tasks: number of total map tasks. Note that it matches the number of splits
# Launched reduce tasks: number of total reduce tasks
# GS: Number of bytes read: the total number of bytes read from Google Cloud Storage by both map and reduce tasks
# GS: Number of bytes written: the total number of bytes written to Google Cloud Storage by both map and reduce tasks
# Map input records: number of input records processed by all map tasks
# Reduce output records: number of records (unique words) output by all reduce tasks
# CPU time spent (ms): total CPU processing time across all tasks
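Besides reading them in the Dataproc console, these counters can also be retrieved programmatically from the driver once the job finishes, via Job.getCounters(). Below is a minimal sketch under the assumption that you have the Job reference from a driver like the WordCount sketch above; the PrintCounters helper name is illustrative, not part of the Dataproc workflow.

<syntaxhighlight lang="java">
// Sketch: reading some built-in counters from a completed Job.
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;
import org.apache.hadoop.mapreduce.TaskCounter;

public class PrintCounters {
  public static void printJobCounters(Job job) throws Exception {
    Counters counters = job.getCounters();

    // Launched map and reduce tasks
    long mapTasks = counters.findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();
    long reduceTasks = counters.findCounter(JobCounter.TOTAL_LAUNCHED_REDUCES).getValue();

    // Records in/out and total CPU time across all tasks
    long mapInputRecords = counters.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
    long reduceOutputRecords = counters.findCounter(TaskCounter.REDUCE_OUTPUT_RECORDS).getValue();
    long cpuMillis = counters.findCounter(TaskCounter.CPU_MILLISECONDS).getValue();

    System.out.println("Launched map tasks:    " + mapTasks);
    System.out.println("Launched reduce tasks: " + reduceTasks);
    System.out.println("Map input records:     " + mapInputRecords);
    System.out.println("Reduce output records: " + reduceOutputRecords);
    System.out.println("CPU time spent (ms):   " + cpuMillis);
  }
}
</syntaxhighlight>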
[[File:Dataproc-hadoop.jpg]]