GPU621/Apache Spark

3 bytes removed, 13:32, 30 November 2020

== Methodology ==

[[File:Google-cloud-dataproc.png]]

# We use Google Cloud Platform '''Dataproc''' to deploy a six-node virtual machine (VM) cluster (1 master, 5 workers) that is automatically configured for both Hadoop and Spark.
# We use the '''Google Cloud Storage Connector''', which is compatible with the Apache HDFS file system, instead of storing data on the local disks of the VMs.
# We run '''Dataproc''' Hadoop MapReduce and Spark jobs that count the words in large text files, then compare the execution time of Hadoop and Spark.
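
The word-count workload in step 3 can be illustrated locally before any cluster is involved. The following is a minimal sketch in plain Python, not the Hadoop or Spark job itself: the names <code>map_phase</code>, <code>reduce_phase</code> and <code>word_count</code> and the sample lines are hypothetical, and the tokenisation rule (lowercase runs of letters and apostrophes) is an assumption. It only shows the map-then-reduce shape that both frameworks distribute across the worker nodes.

```python
import re
from collections import Counter
from functools import reduce


def map_phase(line):
    """Map step: turn one line of text into partial word counts."""
    # Assumed tokenisation: lowercase runs of letters and apostrophes.
    return Counter(re.findall(r"[a-z']+", line.lower()))


def reduce_phase(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(lines):
    """Apply the map step to every line, then fold the results together."""
    return reduce(reduce_phase, map(map_phase, lines), Counter())


# Hypothetical sample input standing in for a large text file.
sample = ["the quick brown fox", "the lazy dog", "The fox"]
counts = word_count(sample)
print(counts.most_common(2))
```

On a cluster, the map step runs in parallel on the workers and the reduce step combines their partial results, so the comparison in step 3 measures how quickly each framework performs this same computation on much larger input.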

=== Setup ===

DanielPark
