Sunday 14 May 2023

Spark 3.3 Notes

1. The Spark Driver is initialized first. The number of Executors, and the memory and CPU cores to be assigned to each Executor, are specified via configuration supplied to the driver (see the sketch below).
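
A minimal sketch of supplying that configuration when building the session. The values and app name are illustrative, not from these notes; spark.executor.instances is honored by cluster managers such as YARN and Kubernetes:

    import org.apache.spark.sql.SparkSession

    // Executor sizing is passed as configuration to the driver (session).
    val spark = SparkSession.builder()
      .appName("executor-sizing-demo")            // illustrative name
      .config("spark.executor.instances", "5")    // number of Executors
      .config("spark.executor.memory", "4g")      // memory per Executor JVM
      .config("spark.executor.cores", "5")        // CPU cores per Executor
      .getOrCreate()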

2. The Spark Driver (session) requests the Cluster Manager to provision a container for each Executor with the required number of CPU cores and amount of memory. Once the containers are started, the Spark Driver starts an Executor process within each container. Each Executor is simply a JVM process, just like the Spark Driver, and it can use the CPU cores and memory allocated to its container (see the sketch below).
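
Once the Executors are up, the driver can enumerate them; a small sketch, assuming the spark session from the previous snippet (the returned list also includes an entry for the driver itself):

    // List the Executor JVMs the driver launched via the Cluster Manager.
    val executors = spark.sparkContext.statusTracker.getExecutorInfos
    executors.foreach(e => println(s"executor at ${e.host}:${e.port}"))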

3. The Spark Driver then works with the storage manager to determine the number of data partitions to create and the Executor to which each partition should be assigned. E.g. if the data files are distributed over 10 storage nodes and there are 5 Executors with 5 CPU cores each, the cluster can run 25 tasks in parallel, so the data stored across the 10 storage nodes can be split into 25 smaller partitions, 5 of which are assigned to each Executor, letting every Executor work on its 5 partitions in parallel. The actual assignment of a partition to an Executor also takes into account the proximity of the Executor container to the storage node (data locality), as greater proximity implies lower network latency. See the sketch below.
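
A small sketch of inspecting the partition count Spark chose for a dataset (the path is hypothetical, and 25 matches the scenario above only if the data and configuration split that way):

    // Read a distributed dataset and check how many partitions were created.
    val df = spark.read.parquet("hdfs:///data/events")   // hypothetical path
    println(df.rdd.getNumPartitions)                     // e.g. 25

With 5 Executors x 5 cores, 25 partitions would give every core exactly one task.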

4. Each Action in Spark triggers the creation of a new Spark Job. Each Job is a series of one or more Stages, and a new Stage is created each time Spark encounters a wide transformation. Each Stage has one or more Tasks that perform narrow transformations on a data partition, so all Tasks within a Stage can be executed in parallel, each operating on its own partition. Stages themselves are serial, i.e. Stage (i+1) must wait for the completion of Stage (i). See the sketch below.
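
A sketch of this breakdown using the RDD API (the input path is hypothetical). The wide reduceByKey forces a shuffle boundary, so the single Job triggered by collect() runs as two Stages:

    val sc = spark.sparkContext
    val pairs = sc.textFile("hdfs:///data/words.txt")  // hypothetical input
      .flatMap(_.split("\\s+"))                        // narrow -> Stage 1
      .map(w => (w, 1))                                // narrow -> still Stage 1
    val counts = pairs.reduceByKey(_ + _)              // wide -> shuffle, Stage 2
    counts.collect()                                   // Action -> triggers 1 Job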