Difference between revisions of "GPU621/ApacheSpark"

Revision as of 08:37, 26 November 2018

Team Members

  1. Shreena Athia
  2. Wang Pan

Introduction

What is Apache Spark?

Apache Spark is an open-source, distributed, general-purpose cluster-computing framework for Big Data.

History of Apache Spark

2009: A distributed system framework initiated at UC Berkeley's AMPLab by Matei Zaharia
2010: Open sourced under a BSD license
2013: The project was donated to the Apache Software Foundation and the license was changed to Apache 2.0
2014: Became an Apache Top-Level Project. Used by Databricks to set a world record in large-scale sorting in November
2014-present: Exists as a next-generation real-time and batch processing framework

Why Apache Spark

Data has exploded in volume, velocity, and variety
The need for faster analytic results is increasingly important
Spark supports near-real-time analytics to answer business questions

Features

Easy to use: Supports Python, Java, and Scala; includes libraries for SQL, machine learning (ML), and streaming
General-purpose: Includes MapReduce-style batch processing, iterative algorithms, and interactive queries and streaming, which return results immediately
Speed: In-memory computations; faster than MapReduce for complex applications on disk
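
The speed point above can be sketched in plain Python. This is only an illustration of the idea, not the Spark API: the loader, the data values, and the function names are all hypothetical. It compares an iterative computation that re-reads its input on every pass (as a chain of disk-based jobs would) with one that reads the input once and keeps it in memory.

```python
# Plain-Python illustration (NOT the Spark API) of why keeping data in memory
# helps iterative algorithms. All names and values here are hypothetical.


def make_loader():
    """Return a stand-in 'slow' loader plus a counter of how often it ran."""
    calls = {"count": 0}

    def load():
        calls["count"] += 1  # stands in for an expensive read from disk
        return [1.0, 2.0, 3.0, 4.0]

    return load, calls


def refine(estimate, data):
    """One iteration: move the estimate halfway toward the mean of the data."""
    mean = sum(data) / len(data)
    return estimate + 0.5 * (mean - estimate)


def run_reloading(load, steps):
    """Re-read the input from storage on every pass."""
    estimate = 0.0
    for _ in range(steps):
        estimate = refine(estimate, load())
    return estimate


def run_cached(load, steps):
    """Read the input once, then reuse the in-memory copy on every pass."""
    data = load()
    estimate = 0.0
    for _ in range(steps):
        estimate = refine(estimate, data)
    return estimate


if __name__ == "__main__":
    load_a, calls_a = make_loader()
    load_b, calls_b = make_loader()
    print(run_reloading(load_a, 5), "reads:", calls_a["count"])
    print(run_cached(load_b, 5), "reads:", calls_b["count"])
```

Both versions compute the same estimate, but the cached version touches the slow source only once regardless of the number of iterations. In Spark itself, the analogous step is asking the framework to keep a dataset in memory across passes, for example by calling `cache()` on it.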

Examples