

GPU621/Apache Spark

101 bytes added, 12:54, 30 November 2020
Apache Spark
== Architecture ==
One of Spark's distinguishing features is that it processes data in RAM using Resilient Distributed Datasets (RDDs). An RDD is an immutable, distributed collection of objects that can hold any type of Python, Java, or Scala object, including instances of user-defined classes. Each dataset is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs serve as the working set for distributed programs and offer a restricted form of distributed shared memory.
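The two properties described above, immutability and partitioning, can be illustrated with a small toy model in plain Python. This is only a sketch and not Spark's API: the class <code>ToyRDD</code> and its methods are hypothetical names invented for this example, and the contiguous chunking rule is an assumption rather than Spark's actual partitioner.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable and split into partitions."""
    partitions: Tuple[tuple, ...]

    @staticmethod
    def from_list(items, num_partitions):
        # Split the items into contiguous chunks (at most num_partitions of them).
        # This chunking rule is an assumption for illustration only.
        size = max(1, -(-len(items) // num_partitions))  # ceiling division
        parts = [tuple(items[i:i + size]) for i in range(0, len(items), size)]
        return ToyRDD(tuple(parts))

    def map(self, fn):
        # Each partition is transformed independently of the others, which is
        # what would allow different nodes to compute different partitions.
        # A new ToyRDD is returned; the original is left unchanged.
        return ToyRDD(tuple(tuple(fn(x) for x in part) for part in self.partitions))

    def collect(self):
        # Gather the elements of every partition back into one list.
        return [x for part in self.partitions for x in part]


numbers = ToyRDD.from_list([1, 2, 3, 4, 5, 6], num_partitions=3)
squares = numbers.map(lambda x: x * x)
```

Here <code>numbers</code> holds three partitions of two elements each, and <code>map</code> produces a second dataset without modifying the first. The dataclass is frozen, so attempting to reassign <code>numbers.partitions</code> raises an error, loosely mirroring the fact that an RDD is never changed in place.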
[[File: Cluster-overview.png|thumb|upright=1|right|alt=Spark cluster|4.1 Spark Cluster components]]
 
At a fundamental level, an Apache Spark application consists of two main components: a driver, which converts the user's code into multiple tasks that can be distributed across worker nodes, and executors, which run on those nodes and execute the tasks assigned to them.
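The division of labour between driver and executors can be sketched with Python's standard library. This is a simplified analogy under stated assumptions, not how Spark is implemented: the function names <code>driver</code> and <code>run_task</code> are hypothetical, and a local thread pool stands in for executors, which in a real cluster are separate processes on worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(partition):
    # "Executor" side: process one slice of the data and return a partial result.
    return sum(x * x for x in partition)


def driver(data, num_tasks, num_executors):
    # "Driver" side: convert the job into several tasks, one per slice of data.
    tasks = [data[i::num_tasks] for i in range(num_tasks)]
    # A thread pool stands in for the executors (an assumption for illustration).
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(run_task, tasks))
    # The driver combines the partial results into the final answer.
    return sum(partial_results)


# Sum of the squares of 1..10, computed as 4 tasks on 2 stand-in executors.
total = driver(list(range(1, 11)), num_tasks=4, num_executors=2)
```

The number of tasks and the number of executors are independent: with four tasks and two executors, each executor works through more than one task, much as a Spark job typically has more tasks than executors.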
