Difference between revisions of "GPU621/Apache Spark Fall 2022"

From CDOT Wiki
Jump to: navigation, search
(RDD Overview)
(Create And Set Up Spark)
Line 31: Line 31:
  
 
===Create And Set Up Spark===
 
===Create And Set Up Spark===
Spark needs to be set up in a cluster. But you can also run it locally and act as a cluster. We will talk about how to set up a spark in a cluster later. Now let's try to create a spark locally. To do that, we will need the following code:
+
Spark needs to be set up in a cluster so first we need to create a JavaSparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application. We will talk about how to set up a spark in a cluster later. Now let's try to create a spark locally. To do that, we will need the following code:  
  
 
   //create and set up spark
 
   //create and set up spark

Revision as of 16:32, 30 November 2022

Apache Spark

Apache Spark Core API

RDD Overview

One of the most important concepts in Spark is a resilient distributed dataset (RDD). RDD is a collection of elements partitioned across the nodes of the cluster that can be operated in parallel. RDDs are created by starting with a file, or an existing Java collection in the driver program, and transforming it. We will introduce some key APIs provided by Spark Core 2.2.1 using Java 8. You can find more information about the RDD here. https://spark.apache.org/docs/2.2.1/rdd-programming-guide.html

Spark Library Installation Using Maven

An Apache Spark application can be easily instantiated using Maven. To add the required libraries, you can copy and paste the following code into the "pom.xml".

   <properties>
       <maven.compiler.source>8</maven.compiler.source>
       <maven.compiler.target>8</maven.compiler.target>
   </properties>
   <dependencies>
       <dependency>
           <groupId>org.apache.spark</groupId>
           <artifactId>spark-core_2.10</artifactId>
           <version>2.2.0</version>
       </dependency>
       <dependency>
           <groupId>org.apache.spark</groupId>
           <artifactId>spark-sql_2.10</artifactId>
           <version>2.2.0</version>
       </dependency>
       <dependency>
           <groupId>org.apache.hadoop</groupId>
           <artifactId>hadoop-hdfs</artifactId>
           <version>2.2.0</version>
       </dependency>
   </dependencies>

Create And Set Up Spark

Spark needs to be set up in a cluster so first we need to create a JavaSparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application. We will talk about how to set up a spark in a cluster later. Now let's try to create a spark locally. To do that, we will need the following code:

  //create and set up spark
  SparkConf conf = new SparkConf().setAppName("HelloSpark").setMaster("local[*]");
  JavaSparkContext sc = new JavaSparkContext(conf);
  sc.setLogLevel("WARN");

Deploy Apache Spark Application On AWS