BigData Ecosystem Architecture

Internal working of Big Data and its ecosystems, covering:

  • The background processes of resource allocation and database connections.
  • How the data is distributed across the nodes.
  • Execution life-cycle on submitting a Job.

**Note: Refer to the links mentioned below under each ecosystem for a detailed explanation.**

1. HDFS 🐘

The various underlying processes that take place when storing a file in HDFS, such as:

  • Type of scheduler

  • Block & Rack information

  • File size

  • File location

  • Replication information about the file (over-replicated blocks, under-replicated blocks, ...)

  • Health status of the file
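The block and replication information above can be illustrated with a back-of-the-envelope sketch (plain Python, not HDFS code), assuming the default 128 MB block size and a replication factor of 3:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size (128 MB)
REPLICATION = 3                  # HDFS default replication factor

def block_report(file_size_bytes):
    """Estimate the block count and replicated footprint for one file."""
    blocks = max(1, math.ceil(file_size_bytes / BLOCK_SIZE))
    return {
        "blocks": blocks,                               # NameNode metadata entries
        "replicas": blocks * REPLICATION,               # block copies across DataNodes
        "raw_storage_bytes": file_size_bytes * REPLICATION,
    }

# A 300 MB file spans 3 blocks (128 + 128 + 44 MB) and 9 block replicas.
print(block_report(300 * 1024 * 1024))
```

A block with fewer live replicas than the target factor is what `hdfs fsck` reports as under-replicated; more, and it is over-replicated.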

Please click on the link below to see the execution and flow process:

🔗 HDFS Architecture in Depth

2. SQOOP :octocat:

Sqoop is used to perform two main operations:

  • Sqoop Import:

    • To ingest data from a source such as a traditional relational database into the Hadoop file system (HDFS)
  • Sqoop Export:

    • To export data from the Hadoop file system (HDFS) back to a traditional relational database
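Typical invocations of the two operations look like the following (the host, database, table, and path names are hypothetical; a real run needs Sqoop, the JDBC driver, and a Hadoop cluster on hand):

```shell
# Import: pull the "customers" table from MySQL into HDFS with 4 parallel map tasks
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl \
  --table customers \
  --target-dir /user/hadoop/customers \
  --num-mappers 4

# Export: push HDFS output files back into a relational table
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username etl \
  --table customers_agg \
  --export-dir /user/hadoop/output
```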

Internally, a CodeGen step supports both of the above operations.

  • Sqoop CodeGen:

    • To compile table metadata and other related information into a Java class file and package it as a JAR
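The idea behind CodeGen can be sketched in a few lines (a toy generator with hypothetical table metadata, not Sqoop's actual implementation): column metadata is turned into Java source for a record class, which is then compiled and bundled into the JAR the map tasks use.

```python
def generate_record_class(table, columns):
    """Emit a skeletal Java record class for one table row. Sqoop's real
    CodeGen also emits JDBC read/write and serialization methods."""
    fields = "\n".join(f"    private {jtype} {name};" for name, jtype in columns)
    return f"public class {table} {{\n{fields}\n}}\n"

java_src = generate_record_class("Customers", [("id", "Integer"), ("name", "String")])
print(java_src)
```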

Please click on the link below to see the execution and flow process:

🔗 SQOOP Architecture in Depth

3. HIVE 🐝

Hive has four main components:

  • Hadoop core components (HDFS, MapReduce)

  • Metastore

  • Driver

  • Hive Clients
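How the four components interact can be modelled with a toy sketch (plain Python, illustrative names only): a client hands HiveQL to the Driver, the Driver consults the Metastore for the table's schema and HDFS location, and the compiled plan runs as MapReduce jobs over those files.

```python
# Toy Metastore: maps table names to schema and HDFS warehouse location.
metastore = {
    "sales": {"columns": ["id", "amount"], "location": "/warehouse/sales"},
}

def drive(query, table):
    """Toy Driver: look up the table in the Metastore and build a plan
    that would execute as MapReduce jobs over files in HDFS."""
    schema = metastore[table]                    # Metastore lookup
    return {
        "query": query,
        "input": schema["location"],             # HDFS path to read
        "engine": "MapReduce",                   # Hadoop core execution
    }

print(drive("SELECT SUM(amount) FROM sales", "sales"))
```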

Please click on the link below to see the execution and flow process:

🔗 HIVE Architecture in Depth

4. SPARK 💥

The various phases involved before and during the execution of a Spark job:

  • Spark Context

    • It is the heart of a Spark application.
  • Yarn Resource Manager, Application Master & launching of executors (containers).

  • Setting up environment variables, job resources.

  • CoarseGrainedExecutorBackend & Netty-based RPC.

  • SparkListeners.

    • LiveListenerBus
    • StatsReportListener
    • EventLoggingListener
  • Execution of a job

    • Logical Plan (Lineage)
    • Physical Plan (DAG)
  • Spark-WebUI.
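The logical-plan/physical-plan split above comes from Spark's lazy evaluation: transformations only record lineage, and an action triggers the DAG scheduler to actually run it. A toy imitation in plain Python (not Spark's API) makes the mechanism concrete:

```python
# Toy lazy "RDD": transformations append to the lineage (the logical plan);
# only the collect() action walks the recorded chain, which is what the
# DAG scheduler turns into stages and tasks in real Spark.
class ToyRDD:
    def __init__(self, data, lineage=()):
        self.data, self.lineage = data, lineage

    def map(self, f):                       # lazy: nothing is computed here
        return ToyRDD(self.data, self.lineage + (("map", f),))

    def filter(self, p):                    # lazy: nothing is computed here
        return ToyRDD(self.data, self.lineage + (("filter", p),))

    def collect(self):                      # the action: work happens now
        out = self.data
        for op, fn in self.lineage:
            out = [fn(x) for x in out] if op == "map" else [x for x in out if fn(x)]
        return out

rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(rdd.collect())  # → [20, 30, 40]
```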

Please click on the link below to see the execution and flow process:

🔗 SPARK Architecture in Depth

4.1 SPARK Abstraction Layers & Internal Optimization Techniques used 💥

It has three different variants:

  • RDD (Resilient Distributed Datasets)

    • Lineage Graph
    • DAG Scheduler
  • DataFrames

    • Catalyst Optimizer
    • Tungsten Engine
    • Default source or Base relation
  • Datasets

    • Optimized Tungsten Engine - V2
    • Whole Stage Code Generation

5. HBASE 🐋