HBase tutorial

Since the 1970s, relational databases have been used to solve problems of data storage and maintenance. After the emergence of big data, many companies realized the benefits of processing it and began to adopt solutions like Hadoop.

Hadoop uses a distributed file system to store big data and uses MapReduce to process it. Hadoop is good at storing huge volumes of data in arbitrary formats, including unstructured data.

Limits of Hadoop

Hadoop can only perform batch processing, and data can only be accessed sequentially. This means that the entire data set must be scanned even for the simplest search task.

A huge data set, once processed, produces another huge data set, which must also be processed sequentially. What is needed instead is a solution that can reach any point in the data in a single unit of time (random access).

Hadoop Random Access Database

Applications such as HBase, Cassandra, CouchDB, Dynamo and MongoDB are databases that store large amounts of data and allow the data to be accessed in a random manner.

What is HBase?

HBase is a distributed column-oriented database built on the Hadoop file system. It is an open source project and scales horizontally.

HBase is a data model, similar to Google's Bigtable, designed to provide fast random access to massive amounts of structured data. It takes advantage of the fault tolerance provided by the Hadoop Distributed File System (HDFS).

It is the part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop file system.

Data can be stored in HDFS either directly or through HBase; data consumers then read and access that data randomly using HBase. HBase sits on top of the Hadoop file system and provides read and write access.


HBase and HDFS

  • HDFS is a distributed file system suitable for storing large files; HBase is a database built on top of HDFS.
  • HDFS does not support fast lookup of individual records; HBase provides fast lookups in large tables.
  • HDFS provides high-latency batch processing; HBase provides low-latency access to single rows out of billions of records (random access), with no concept of batch processing.
  • Data in HDFS can only be accessed sequentially; HBase internally uses hash tables to provide random access, and it stores its data in indexed HDFS files for faster lookup.

HBase storage mechanism

HBase is a column-oriented database, and the tables in it are sorted by row. The table schema defines only column families, which are key-value pairs. A table has multiple column families, and each column family can have any number of columns. Column values are stored contiguously on disk, and each cell value in the table has a timestamp. In short, in HBase:

  • A table is a collection of rows.
  • A row is a collection of column families.
  • A column family is a collection of columns.
  • A column is a collection of key-value pairs.

The table given below is an example of an HBase schema.

Rowid   Column Family     Column Family     Column Family     Column Family
        col1 col2 col3    col1 col2 col3    col1 col2 col3    col1 col2 col3
1
2
3
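The hierarchy above (table → row → column family → column → versioned cell) can be sketched in Python. This is purely an illustrative in-memory model of HBase's logical layout, not the actual HBase API; a real HBase table lives in HDFS.

```python
# Illustrative in-memory model of HBase's logical data layout:
# table -> row key -> column family -> column qualifier -> {timestamp: value}.
import time

table = {}

def put(row, family, qualifier, value, ts=None):
    """Store a versioned cell value, keyed by timestamp."""
    ts = ts if ts is not None else int(time.time() * 1000)
    cell = table.setdefault(row, {}).setdefault(family, {}).setdefault(qualifier, {})
    cell[ts] = value

def get(row, family, qualifier):
    """Return the latest version of a cell, or None if absent."""
    versions = table.get(row, {}).get(family, {}).get(qualifier, {})
    if not versions:
        return None
    return versions[max(versions)]  # the newest timestamp wins

put("row1", "personal", "name", "Alice", ts=1)
put("row1", "personal", "name", "Alicia", ts=2)  # a newer version of the same cell
put("row1", "professional", "role", "engineer", ts=1)

print(get("row1", "personal", "name"))   # latest version: Alicia
print(get("row2", "personal", "name"))   # missing row: None
```

Note how a cell keeps multiple timestamped versions and a read returns the newest one, which mirrors how HBase versions cell values.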

Column-oriented and row-oriented


A column-oriented database stores data tables as sections of columns of data, rather than as rows of data. In short, it has column families.

  • A row-oriented database is suitable for online transaction processing (OLTP); a column-oriented database is suitable for online analytical processing (OLAP).
  • Row-oriented databases are designed for a small number of rows and columns; column-oriented databases are designed for huge tables.
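The contrast can be sketched with a toy example in Python. This is illustrative only; real database engines use far more elaborate on-disk formats.

```python
# Toy illustration of row-oriented vs. column-oriented storage
# of the same logical table.
rows = [
    {"id": 1, "name": "a", "amount": 10},
    {"id": 2, "name": "b", "amount": 20},
    {"id": 3, "name": "c", "amount": 30},
]

# Row-oriented: whole records are stored together -- good for OLTP,
# where a transaction touches all columns of a few rows.
row_store = [(r["id"], r["name"], r["amount"]) for r in rows]

# Column-oriented: each column is stored contiguously -- good for OLAP,
# where a query scans one column across many rows.
col_store = {
    "id":     [r["id"] for r in rows],
    "name":   [r["name"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

# An analytical aggregate reads a single contiguous column:
total = sum(col_store["amount"])
print(total)  # 60
```

The aggregate only touches the "amount" column, so a column store reads far less data for such a query than a row store would.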

The following figure shows the column family in a column-oriented database:


HBase and RDBMS

  • HBase is schema-less; it has no concept of a fixed-column schema and defines only column families. An RDBMS has a schema that describes the overall structure of its tables and their constraints.
  • HBase was built for wide tables and scales horizontally. RDBMS tables are thin and built for small data sets, and are difficult to scale.
  • HBase has no transactions. An RDBMS is transactional.
  • HBase stores denormalized data. An RDBMS stores normalized data.
  • HBase works well with both semi-structured and structured data. An RDBMS works well with structured data.

Characteristics of HBase

  • HBase is linearly scalable.
  • It has automatic failover support.
  • It provides consistent reads and writes.
  • It integrates with Hadoop, both as a source and as a destination.
  • It has an easy-to-use Java API for clients.
  • It provides data replication across clusters.

Where can I use HBase?

  • Apache HBase is used for random, real-time read/write access to big data.
  • It hosts very large tables on top of clusters of commodity hardware.
  • Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable operates on the Google File System, Apache HBase works on top of Hadoop and HDFS.

HBase application

  • It is used in write-heavy applications.
  • HBase is used when we need to provide fast random access to data.
  • Many companies, such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.

HBase history

Year       Event
Nov 2006   Google released the Bigtable paper.
Feb 2007   The first HBase prototype was created as a Hadoop contribution.
Oct 2007   The first usable HBase was released along with Hadoop 0.15.0.
Jan 2008   HBase became a subproject of Hadoop.
Oct 2008   HBase 0.18.1 was released.
Jan 2009   HBase 0.19 was released.
Sept 2009  HBase 0.20.0 was released.
May 2010   HBase became an Apache top-level project.

The original text is from [e-Buy Tutorial]. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please keep the original link: https://www.yiibai.com/hbase/
