HBase Architecture

After working with HBase for the past year and a half, I decided to share my understanding. In this post I will describe the high-level functioning of HBase and the different components involved.

HBase – The Basics:
HBase is an open-source, distributed, column-oriented NoSQL data store modeled after Google's BigTable. It was developed as part of Apache’s Hadoop project and runs on top of HDFS (Hadoop Distributed File System). HBase aims to provide BigTable-like capabilities on top of Hadoop. It is better described as a “data store” than a “database”, because it lacks many of the features of a traditional database, such as typed columns, secondary indexes, triggers, and advanced query languages.
The data model consists of table name, row key, column family, column, and timestamp. Each row in a table is uniquely identified by its row key, and each value within a row is addressed by column family, column, and timestamp. In this data model the column families are static (they are declared when the table is defined), whereas columns are dynamic and can be added at write time. Now let us look at the HBase architecture.
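Before moving on, the data model described above can be pictured as a nested map. The following is only a toy sketch in plain Python, not the HBase client API; the class `ToyTable`, the table name `users`, and the sample values are all made up for illustration.

```python
# Toy model only (not the HBase API):
# table -> row key -> column family -> column -> {timestamp: value}
class ToyTable:
    def __init__(self, name, column_families):
        self.name = name
        # Column families are fixed when the table is defined.
        self.column_families = set(column_families)
        self.rows = {}

    def put(self, row_key, family, column, value, timestamp):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        versions = (
            self.rows.setdefault(row_key, {})
            .setdefault(family, {})
            .setdefault(column, {})
        )
        versions[timestamp] = value

    def get(self, row_key, family, column):
        """Return the value with the newest timestamp, or None."""
        versions = self.rows.get(row_key, {}).get(family, {}).get(column, {})
        if not versions:
            return None
        return versions[max(versions)]


table = ToyTable("users", ["info"])
table.put("row1", "info", "email", "a@example.com", timestamp=1)
table.put("row1", "info", "email", "b@example.com", timestamp=2)
# A new column can appear at write time without any schema change.
table.put("row1", "info", "nickname", "sree", timestamp=2)
```

Writing to a column family that was not declared fails in this sketch, while new columns inside an existing family are accepted, which mirrors the static/dynamic distinction above.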

HBase Architecture:
[Figure: HBase architecture]
HBase Architecture Components:

  1. HMaster: The HBase HMaster is a lightweight process responsible for assigning regions to RegionServers in the Hadoop cluster to achieve load balancing.
  2. RegionServer: HBase RegionServers are the worker nodes that handle read, write, update, and delete requests from clients. The RegionServer process typically runs on each Hadoop node in the cluster.
  3. ZooKeeper: ZooKeeper keeps track of the region servers in the HBase cluster: how many there are and which of them are currently available. The HMaster gets the details of the region servers by contacting ZooKeeper.
  4. MemStore: The MemStore is an in-memory write buffer that uses the memory of the RegionServer to hold recent writes (the durable log of edits is kept separately, in the write-ahead log). Rows are written to the MemStore, and the data in the MemStore is kept sorted. When certain thresholds are met, the MemStore data is flushed to disk as HFiles. Each flush creates one HFile per column family.
  5. HFile: HFiles are the actual storage files, i.e. the physical representation of HBase data on disk, created specifically to store HBase’s data quickly and efficiently. Clients do not read HFiles directly but go through region servers to get to the data.

HBase Table Operations:

  1. Reads: Clients find the proper RegionServer for a read with the help of ZooKeeper, which points them to the catalog (meta) table that maps regions to servers; the read itself is then served by that RegionServer. Clients can read all columns for a given row, or read a single column or column family for a range of rows.
  2. Writes: An HBase write to a single row is atomic, meaning the whole operation either succeeds or fails, even if the write spans column families. A write to multiple rows, however, is not atomic: some row writes may succeed while others fail.
  3. Updates: Each cell in HBase can store multiple values, each with an associated version (a timestamp corresponding to the time the value was written). Users can specify a time to live (TTL), instructing HBase to remove cells once they exceed that age.
  4. Deletes: When an HBase client deletes a row, the row is not immediately removed from the table. Instead, HBase writes a tombstone marker for the row, which hides the deleted data from reads. The data is permanently removed from storage during the next major compaction.
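How versions, time to live, and tombstones interact on a read can be sketched roughly as follows. This is a simplified toy model in plain Python, not HBase code: the function `read_latest`, the tuple layout, and the rule that a tombstone hides every put with an equal or older timestamp are assumptions chosen for illustration.

```python
# Toy model only: a cell's history is a list of (timestamp, kind, value).
PUT, DELETE = "put", "delete"


def read_latest(entries, now, ttl=None):
    """Return the newest visible value, or None if nothing is visible."""
    # Newest tombstone, if any; it hides puts at or before its timestamp.
    tombstone_ts = max(
        (ts for ts, kind, _ in entries if kind == DELETE), default=None
    )
    visible = []
    for ts, kind, value in entries:
        if kind != PUT:
            continue
        if tombstone_ts is not None and ts <= tombstone_ts:
            continue  # masked by a delete marker
        if ttl is not None and now - ts > ttl:
            continue  # expired
        visible.append((ts, value))
    if not visible:
        return None
    return max(visible, key=lambda pair: pair[0])[1]


history = [(10, PUT, "v1"), (20, PUT, "v2")]
before_delete = read_latest(history, now=25)

history.append((30, DELETE, None))   # the old values stay stored, but hidden
after_delete = read_latest(history, now=35)

history.append((40, PUT, "v3"))      # a newer write is visible again
after_rewrite = read_latest(history, now=45)
expired = read_latest(history, now=45, ttl=3)
```

Note that the delete only appends a marker; nothing is physically removed in this sketch, which matches the idea that cleanup is deferred to a major compaction.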

HBase Table Maintenance:

  1. Minor Compactions: When data is written to HBase, it is first written to an in-memory structure called a memstore for performance. When the memstore reaches a certain size, the data is written to a storefile on disk and marked read-only. When the number of storefiles reaches a configured threshold, a minor compaction occurs to merge multiple storefiles into fewer, larger ones.
  2. Major Compactions: Periodically, a major compaction runs to merge all storefiles of a store (one column family of one region) into a single storefile. The interval is configurable via hbase.hregion.majorcompaction; the default was 24 hours in older releases and, to my knowledge, 7 days in later ones. In addition, any data that was marked with a tombstone, meaning a delete was requested, is purged at this time.

Compactions, especially major compactions, can take a toll on utilization of a RegionServer. Client requests made during a compaction will experience latency and jitter as a result of resource contention.
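The difference between the two compaction types can be illustrated with a small sketch. It is a toy model under stated assumptions, not HBase's implementation: a storefile is a Python list of `(row_key, timestamp, kind, value)` tuples sorted by row key and then newest first, timestamps are non-negative integers, and a tombstone applies to a whole row.

```python
import heapq


def sort_key(entry):
    row_key, timestamp, _kind, _value = entry
    return (row_key, -timestamp)  # row order, newest version first


def minor_compact(storefiles):
    """Merge sorted storefiles into one; keeps everything, tombstones too."""
    return list(heapq.merge(*storefiles, key=sort_key))


def major_compact(storefiles):
    """Merge all storefiles and drop tombstones plus the data they hide."""
    merged = minor_compact(storefiles)
    newest_tombstone = {}
    for row_key, timestamp, kind, _value in merged:
        if kind == "delete":
            previous = newest_tombstone.get(row_key, timestamp)
            newest_tombstone[row_key] = max(previous, timestamp)
    return [
        entry
        for entry in merged
        if entry[2] == "put" and entry[1] > newest_tombstone.get(entry[0], -1)
    ]


older = [("a", 1, "put", "a1"), ("b", 1, "put", "b1")]
newer = [("a", 2, "put", "a2"), ("b", 3, "delete", None)]

after_minor = minor_compact([older, newer])
after_major = major_compact([older, newer])
```

After the minor compaction all four entries are still present in a single sorted file; after the major compaction row `b` and its tombstone are gone. Real major compactions do more work (for example, dropping versions beyond the configured maximum), which this sketch leaves out.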

[Figure: HBase table architecture, showing an HBase table and its region servers]

In HBase it works something like this:

  1. Edits (Puts, etc.) are collected and sorted in memory (using a skip list, specifically). HBase calls this the “memstore”.
  2. When the memstore reaches a certain size (hbase.hregion.memstore.flush.size), it is written (or flushed) to disk as a new “HFile”.
  3. There is one memstore per region and column family.
  4. Upon read, HBase performs a merge sort across the current memstore and all the memstore disk images (i.e. the HFiles), each of which is sorted on its own.
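The four steps above can be sketched in a few lines. This is a toy model only, with several simplifying assumptions: the class `ToyStore` is made up, the flush threshold counts entries rather than bytes, a plain dict sorted at flush time stands in for the skip list, and each key holds a single value.

```python
import heapq


class ToyStore:
    """Toy model of one store: one region plus one column family."""

    def __init__(self, flush_size):
        # Stand-in for hbase.hregion.memstore.flush.size (entries, not bytes).
        self.flush_size = flush_size
        self.memstore = {}
        self.hfiles = []  # sorted lists of (row_key, value), newest file last

    def put(self, row_key, value):
        self.memstore[row_key] = value
        if len(self.memstore) >= self.flush_size:
            self.flush()

    def flush(self):
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def scan(self):
        """Merge memstore and HFiles in key order; the newest source wins."""
        sources = [sorted(self.memstore.items())] + self.hfiles[::-1]
        # Tag entries with the age of their source (0 = memstore = newest).
        tagged = [
            [(key, age, value) for key, value in source]
            for age, source in enumerate(sources)
        ]
        result = []
        for key, _age, value in heapq.merge(*tagged):
            if not result or result[-1][0] != key:
                result.append((key, value))
        return result


store = ToyStore(flush_size=2)
store.put("b", "b1")
store.put("a", "a1")   # second entry triggers a flush
store.put("a", "a2")
store.put("c", "c1")   # triggers another flush
store.put("d", "d1")   # stays in the memstore
rows = store.scan()
```

Because every source is already sorted, the read never has to sort the full data set; it only merges sorted streams, which is why the number of HFiles directly affects read cost.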

HBase stores rows of data in tables. Tables are split into chunks of rows called “regions”. Those regions are distributed across the cluster, hosted and made available to client processes by the RegionServer process. A region is a contiguous range within the key space, meaning all rows in the table that sort between the region’s start key and end key are stored in the same region. Regions are non-overlapping, i.e. a single row key belongs to exactly one region at any point in time. A region is only served by a single region server at any point in time, which is how HBase guarantees strong consistency within a single row.
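Since regions are contiguous, non-overlapping key ranges, finding the region for a row key amounts to a search over sorted start keys. The sketch below is a toy illustration, not how the HBase client is implemented; `ToyRegionMap` and the server names `rs1` to `rs3` are hypothetical.

```python
import bisect


class ToyRegionMap:
    """Maps a row key to the single region (and server) that holds it."""

    def __init__(self, regions):
        # regions: list of (start_key, server). Each region covers the keys
        # from its start key up to, but not including, the next start key.
        # The first region is assumed to start at the empty key "".
        self.regions = sorted(regions)
        self.start_keys = [start for start, _server in self.regions]

    def locate(self, row_key):
        index = bisect.bisect_right(self.start_keys, row_key) - 1
        return self.regions[index]


region_map = ToyRegionMap([("", "rs1"), ("g", "rs2"), ("p", "rs3")])
first = region_map.locate("apple")
boundary = region_map.locate("g")
last = region_map.locate("zebra")
```

Every key lands in exactly one region, and a key equal to a start key belongs to the region that begins there, which is the non-overlapping property described above.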


