Unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing
Started as a research project at UC Berkeley in 2009
APIs: Scala, Java, Python, R, and SQL
Built by more than 1,400 developers from more than 200 companies
One Driver and many Executor JVMs
RDD
DataFrame
Dataset
Resilient: Fault-tolerant
Distributed: Computed across multiple nodes
Dataset: Collection of partitioned data
Immutable once constructed
Tracks lineage information so lost partitions can be recomputed
Operations on collection of elements in parallel
Transformations | Actions
---|---
Filter | Count
Sample | Take
Union | Collect
Data with named columns (built on RDDs)
Improved performance via Catalyst query optimization
User-friendly API
dataRDD = sc.parallelize([("Jim", 20), ("Anne", 31), ("Jim", 30)])

# RDD: average age per name
# (tuple-unpacking lambdas like `lambda (x, y): ...` are Python 2 only)
(dataRDD.map(lambda kv: (kv[0], (kv[1], 1)))
        .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
        .map(lambda kv: (kv[0], kv[1][0] / kv[1][1])))

# DataFrame: the same query, shorter and open to optimization
# (requires: from pyspark.sql.functions import avg)
dataDF = dataRDD.toDF(["name", "age"])
dataDF.groupBy("name").agg(avg("age"))
Benefits:
User-friendly API
Wrapper that builds a logical plan, which Spark optimizes before execution
Scale out: Process a model or dataset too large for a single machine
Speed up: Get results faster by running work in parallel