BigData Ecosystem Architecture

Internal working of Big Data and its ecosystems, covering:

  • The background processes of resource allocation and database connections.
  • How the data is distributed across the nodes.
  • Execution life-cycle on submitting a Job.

**Note: Refer to the links mentioned below under each ecosystem for a detailed explanation.**

1. HDFS 🐘

The various underlying processes that take place when storing a file in HDFS, such as:

  • Type of scheduler

  • Block & Rack information

  • File size

  • File location

  • Replication information about the file (over-replicated blocks, under-replicated blocks, ...)

  • Health status of the file
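The block and replication information above can be illustrated with a back-of-the-envelope sketch (plain Python, not HDFS code), assuming the default 128 MB block size and a replication factor of 3:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size (128 MB)
REPLICATION = 3                  # HDFS default replication factor

def block_report(file_size_bytes):
    """Estimate the block count and replicated footprint for one file."""
    blocks = max(1, math.ceil(file_size_bytes / BLOCK_SIZE))
    return {
        "blocks": blocks,                               # NameNode metadata entries
        "replicas": blocks * REPLICATION,               # block copies across DataNodes
        "raw_storage_bytes": file_size_bytes * REPLICATION,
    }

# A 300 MB file spans 3 blocks (128 + 128 + 44 MB) and 9 block replicas.
print(block_report(300 * 1024 * 1024))
```

A block with fewer live replicas than the target factor is what `hdfs fsck` reports as under-replicated; more, and it is over-replicated.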

Please click on the link below to see the execution and flow process:

🔗 HDFS Architecture in Depth

2. SQOOP :octocat:

Sqoop is used to perform two main operations:

  • Sqoop Import:

    • To ingest data from a source such as a traditional relational database into the Hadoop file system (HDFS)
  • Sqoop Export:

    • To export data from the Hadoop file system (HDFS) back to a traditional relational database
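Typical invocations of the two operations look like the following (the host, database, table, and path names are hypothetical; a real run needs Sqoop, the JDBC driver, and a Hadoop cluster on hand):

```shell
# Import: pull the "customers" table from MySQL into HDFS with 4 parallel map tasks
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl \
  --table customers \
  --target-dir /user/hadoop/customers \
  --num-mappers 4

# Export: push HDFS output files back into a relational table
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username etl \
  --table customers_agg \
  --export-dir /user/hadoop/output
```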

Internally, a CodeGen step supports both of the above operations.

  • Sqoop CodeGen:

    • To compile table metadata and other related information into a Java class file and package it as a JAR
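The idea behind CodeGen can be sketched in a few lines (a toy generator with hypothetical table metadata, not Sqoop's actual implementation): column metadata is turned into Java source for a record class, which is then compiled and bundled into the JAR the map tasks use.

```python
def generate_record_class(table, columns):
    """Emit a skeletal Java record class for one table row. Sqoop's real
    CodeGen also emits JDBC read/write and serialization methods."""
    fields = "\n".join(f"    private {jtype} {name};" for name, jtype in columns)
    return f"public class {table} {{\n{fields}\n}}\n"

java_src = generate_record_class("Customers", [("id", "Integer"), ("name", "String")])
print(java_src)
```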

Please click on the link below to see the execution and flow process:

🔗 SQOOP Architecture in Depth

3. HIVE 🐝

Hive has four main components:

  • Hadoop core components (HDFS, MapReduce)

  • Metastore

  • Driver

  • Hive Clients
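How the four components interact can be modelled with a toy sketch (plain Python, illustrative names only): a client hands HiveQL to the Driver, the Driver consults the Metastore for the table's schema and HDFS location, and the compiled plan runs as MapReduce jobs over those files.

```python
# Toy Metastore: maps table names to schema and HDFS warehouse location.
metastore = {
    "sales": {"columns": ["id", "amount"], "location": "/warehouse/sales"},
}

def drive(query, table):
    """Toy Driver: look up the table in the Metastore and build a plan
    that would execute as MapReduce jobs over files in HDFS."""
    schema = metastore[table]                    # Metastore lookup
    return {
        "query": query,
        "input": schema["location"],             # HDFS path to read
        "engine": "MapReduce",                   # Hadoop core execution
    }

print(drive("SELECT SUM(amount) FROM sales", "sales"))
```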

Please click on the link below to see the execution and flow process:

🔗 HIVE Architecture in Depth

4. SPARK 💥

The various phases involved before and during the execution of a Spark job:

  • Spark Context

    • It is the heart of a Spark application.
  • Yarn Resource Manager, Application Master & launching of executors (containers).

  • Setting up environment variables, job resources.

  • CoarseGrainedExecutorBackend & Netty-based RPC.

  • SparkListeners.

    • LiveListenerBus
    • StatsReportListener
    • EventLoggingListener
  • Execution of a job

    • Logical Plan (Lineage)
    • Physical Plan (DAG)
  • Spark-WebUI.
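The logical-plan/physical-plan split above comes from Spark's lazy evaluation: transformations only record lineage, and an action triggers the DAG scheduler to actually run it. A toy imitation in plain Python (not Spark's API) makes the mechanism concrete:

```python
# Toy lazy "RDD": transformations append to the lineage (the logical plan);
# only the collect() action walks the recorded chain, which is what the
# DAG scheduler turns into stages and tasks in real Spark.
class ToyRDD:
    def __init__(self, data, lineage=()):
        self.data, self.lineage = data, lineage

    def map(self, f):                       # lazy: nothing is computed here
        return ToyRDD(self.data, self.lineage + (("map", f),))

    def filter(self, p):                    # lazy: nothing is computed here
        return ToyRDD(self.data, self.lineage + (("filter", p),))

    def collect(self):                      # the action: work happens now
        out = self.data
        for op, fn in self.lineage:
            out = [fn(x) for x in out] if op == "map" else [x for x in out if fn(x)]
        return out

rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(rdd.collect())  # → [20, 30, 40]
```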

Please click on the link below to see the execution and flow process:

🔗 SPARK Architecture in Depth

4.1 SPARK Abstraction Layers & Internal Optimization Techniques used 💥

It has three different variants:

  • RDD (Resilient Distributed Datasets)

    • Lineage Graph
    • DAG Scheduler
  • DataFrames

    • Catalyst Optimizer
    • Tungsten Engine
    • Default source or Base relation
  • Datasets

    • Optimized Tungsten Engine - V2
    • Whole Stage Code Generation

5. HBASE 🐋