GPU621/Apache Spark

964 bytes added, 13:23, 30 November 2020

== Analysis: Spark vs Hadoop ==

=== Methodology ===

Hadoop and Spark clusters can be deployed in cloud environments such as the Google Cloud Platform or Amazon EMR.

These clusters are managed, scalable, and billed per usage, and they are comparatively easier to set up and manage than a cluster built locally on commodity hardware.

We will use the Google Cloud Platform managed service to run experiments and observe the performance differences between Hadoop and Spark.

[[File:Google-cloud-dataproc.png]]

1. We will use the Google Cloud Platform '''Dataproc''' service to deploy a cluster of 6 virtual machine (VM) nodes (1 master, 5 workers) that is automatically configured for both Hadoop and Spark.

2. Use the '''Google Cloud Storage Connector''', which is compatible with the Apache HDFS file system, instead of storing data on the local disks of the VMs.

3. Run '''Dataproc''' Hadoop MapReduce and Spark jobs that count the number of words in large text files, and compare the execution time of Hadoop and Spark.
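
The word-count logic that both jobs in step 3 implement can be sketched on a single machine in plain Python. This is only an illustration of the map and reduce phases and of timing one run; it is not Hadoop or Spark code, it does not touch Dataproc or Cloud Storage, and the names <code>map_phase</code>, <code>reduce_phase</code>, <code>word_count</code> and <code>timed</code> are hypothetical helpers chosen for this example.

```python
import re
import time
from collections import Counter
from itertools import chain

# Hypothetical helper names; this is NOT Hadoop or Spark API code.
# Assumption: a "word" is a run of lowercase letters, digits, or apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of text."""
    return [(word, 1) for word in WORD_RE.findall(line.lower())]


def reduce_phase(pairs):
    """Sum the counts for each distinct word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


def word_count(lines):
    """Run the map phase over every line, then reduce the combined pairs."""
    return reduce_phase(chain.from_iterable(map_phase(line) for line in lines))


def timed(job, *args):
    """Return (result, elapsed_seconds) for a single run of job."""
    start = time.perf_counter()
    result = job(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "The end"]
    counts, seconds = timed(word_count, sample)
    print(counts.most_common(3), f"{seconds:.6f}s")
```

On the cluster, the same two phases run in parallel across the worker nodes, and the execution time being compared is the wall-clock time of the whole job rather than of a single local function call.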

=== Setup ===

Using a registered Google account, navigate to the Google Cloud Console at https://console.cloud.google.com/ and activate the free-trial credits.

DanielPark

