HBase Integration

Apache Tajo™ storage supports integration with Apache HBase™. This integration allows Tajo to access all tables used in Apache HBase.

In order to use this feature, you need to add some configs to conf/tajo-env.sh and then add some properties to the CREATE TABLE statement.

This section describes how to set up HBase integration.

First, you need to set your HBase home directory to the environment variable HBASE_HOME in conf/tajo-env.sh as follows:

export HBASE_HOME=/path/to/your/hbase/directory

Once this variable is set, Tajo adds the HBase library files to its classpath.

Next, you must configure a tablespace for HBase. See Tablespaces for more information.
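Before restarting Tajo, it can help to confirm that the directory really looks like an HBase installation. The following is a minimal sketch, not part of Tajo: the function name check_hbase_home is hypothetical, and it assumes the HBase jar files live under the lib subdirectory, as in the standard HBase binary distribution.

```shell
# Hypothetical helper: succeed only if <dir>/lib exists and holds at least one jar.
# Assumption: HBase jars are located under <dir>/lib.
check_hbase_home() {
  dir="$1"
  if [ -z "$dir" ] || [ ! -d "$dir/lib" ]; then
    echo "not an HBase home (no lib directory): '$dir'" >&2
    return 1
  fi
  # The glob stays unexpanded when nothing matches, so test each candidate.
  for jar in "$dir"/lib/*.jar; do
    [ -e "$jar" ] && return 0
  done
  echo "no jar files found under $dir/lib" >&2
  return 1
}
```

For example, running check_hbase_home "$HBASE_HOME" after sourcing conf/tajo-env.sh returns a non-zero status and prints a message when the path is wrong.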

CREATE TABLE

Synopsis

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] <table_name> [(<column_name> <data_type>, ... )]
USING hbase
WITH ('table'='<hbase_table_name>'
, 'columns'=':key,<column_family_name>:<qualifier_name>, ...'
, 'hbase.zookeeper.quorum'='<zookeeper_address>'
, 'hbase.zookeeper.property.clientPort'='<zookeeper_client_port>')
[LOCATION 'hbase:zk://<hostname>:<port>/'] ;

IF NOT EXISTS prevents the CREATE [EXTERNAL] TABLE statement from raising an error when the table already exists.

To create an EXTERNAL table, you must specify the LOCATION clause.
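As one concrete instance of the synopsis, the sketch below prints a complete CREATE EXTERNAL TABLE statement from a shell function, so that the ZooKeeper host and port are supplied once. The function name external_blog_table_sql and any host or port you pass to it are hypothetical. The blog schema and column mapping are the ones used in the Usage section.

```shell
# Hypothetical helper: print a CREATE EXTERNAL TABLE statement for the
# HBase table 'blog', given a ZooKeeper host and client port.
external_blog_table_sql() {
  zk_host="$1"
  zk_port="$2"
  # Unquoted heredoc: ${zk_host} and ${zk_port} are expanded, quotes are literal.
  cat <<EOF
CREATE EXTERNAL TABLE IF NOT EXISTS blog (rowkey text, author text, register_date text, title text)
USING hbase
WITH ('table'='blog'
, 'columns'=':key,info:author,info:date,content:title'
, 'hbase.zookeeper.quorum'='${zk_host}'
, 'hbase.zookeeper.property.clientPort'='${zk_port}')
LOCATION 'hbase:zk://${zk_host}:${zk_port}/';
EOF
}
```

The printed statement can be pasted into a Tajo SQL session. Note that the LOCATION clause is present because the table is EXTERNAL.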

Options

  • table : Sets the name of the underlying HBase table. If you create an external table, this table must already exist on HBase. Conversely, if you create a managed table, this table must not exist on HBase.
  • columns : Maps Tajo columns to HBase columns. :key denotes the HBase row key. The number of entries in columns must equal the number of columns in the Tajo table.
  • hbase.zookeeper.quorum : Sets the ZooKeeper quorum address. Tables in the same Tajo database can use different ZooKeeper clusters. If you do not set the ZooKeeper address, Tajo uses the value of this property in the hbase-site.xml file.
  • hbase.zookeeper.property.clientPort : Sets the ZooKeeper client port. If you do not set the port, Tajo uses the value of this property in the hbase-site.xml file.
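The rule for the columns option can be checked before a statement is submitted. The sketch below is an illustration only, not something Tajo provides: the function names count_mapping_entries and check_mapping are hypothetical, and it assumes that column family and qualifier names contain no commas.

```shell
# Count the comma-separated entries in a 'columns' mapping string.
# Assumption: family and qualifier names contain no commas.
count_mapping_entries() {
  printf '%s\n' "$1" | awk -F',' '{ print NF }'
}

# Return 0 when the mapping has exactly <ncols> entries and one of them is :key.
check_mapping() {
  mapping="$1"
  ncols="$2"
  [ "$(count_mapping_entries "$mapping")" -eq "$ncols" ] || return 1
  # Wrap in commas so that :key matches only as a whole entry.
  case ",$mapping," in
    *,:key,*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For instance, check_mapping ':key,info:author,info:date,content:title' 4 succeeds for a four-column Tajo table, while the same mapping fails against a three-column table.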

DROP TABLE

Synopsis

DROP TABLE [IF EXISTS] <table_name> [PURGE]

IF EXISTS prevents the DROP TABLE statement from raising an error when the table does not exist. The DROP TABLE statement removes a table from the Tajo catalog, but it does not remove the contents on the HBase cluster. If the PURGE option is given, the DROP TABLE statement removes the entry in the catalog as well as the contents on the HBase cluster.
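The two forms of the statement differ only in the trailing PURGE keyword, which is easy to overlook in scripts. As a small sketch, the hypothetical helper drop_table_sql below makes the destructive variant an explicit opt-in. It only prints the statement and does not contact Tajo.

```shell
# Hypothetical helper: print a DROP TABLE statement.
# Pass "purge" as the second argument to also remove the contents on HBase.
drop_table_sql() {
  table="$1"
  mode="${2:-keep}"
  if [ "$mode" = "purge" ]; then
    printf 'DROP TABLE IF EXISTS %s PURGE;\n' "$table"
  else
    # Default: only the Tajo catalog entry is removed.
    printf 'DROP TABLE IF EXISTS %s;\n' "$table"
  fi
}
```

Calling drop_table_sql blog_backup prints the catalog-only form, and drop_table_sql blog_backup purge prints the form that also deletes the HBase data.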

INSERT (OVERWRITE) INTO

The INSERT OVERWRITE statement overwrites the data of an existing table. Tajo’s INSERT OVERWRITE statement follows the INSERT INTO SELECT statement of SQL. Examples are as follows:

-- when a target table schema and output schema are equivalent to each other
INSERT OVERWRITE INTO t1 SELECT l_orderkey, l_partkey, l_quantity FROM lineitem;
-- or
INSERT OVERWRITE INTO t1 SELECT * FROM lineitem;

-- when the output schema is smaller than the target table schema
INSERT OVERWRITE INTO t1 SELECT l_orderkey FROM lineitem;

-- when you want to specify certain target columns
INSERT OVERWRITE INTO t1 (col1, col3) SELECT l_orderkey, l_quantity FROM lineitem;

Note

If you do not set the row key option, you cannot use your table data, because Tajo needs key columns for sorting before it creates the result data.

Usage

To map an existing HBase table to a Tajo external table, use the USING clause of CREATE EXTERNAL TABLE:

CREATE EXTERNAL TABLE blog (rowkey text, author text, register_date text, title text)
USING hbase WITH (
  'table'='blog'
  , 'columns'=':key,info:author,info:date,content:title')
LOCATION 'hbase:zk://<hostname>:<port>/';

Because blog is an external table, the underlying HBase table must already exist. It can be created and populated in the HBase shell as follows:

$ hbase shell
create 'blog', {NAME=>'info'}, {NAME=>'content'}
put 'blog', 'hyunsik-02', 'content:title', 'Getting started with Tajo on your desktop'
put 'blog', 'hyunsik-02', 'info:author', 'Hyunsik Choi'
put 'blog', 'hyunsik-02', 'info:date', '2014-12-03'
put 'blog', 'blrunner-01', 'content:title', 'Apache Tajo: A Big Data Warehouse System on Hadoop'
put 'blog', 'blrunner-01', 'info:author', 'Jaehwa Jung'
put 'blog', 'blrunner-01', 'info:date', '2014-10-31'
put 'blog', 'jhkim-01', 'content:title', 'APACHE TAJO™ v0.9 HAS ARRIVED!'
put 'blog', 'jhkim-01', 'info:author', 'Jinho Kim'
put 'blog', 'jhkim-01', 'info:date', '2014-10-22'

Once the table has been created in Tajo, query the table metadata with the \d command:

default> \d blog;

table name: default.blog
table path:
store type: HBASE
number of rows: unknown
volume: 0 B
Options:
        'columns'=':key,info:author,info:date,content:title'
        'table'='blog'

schema:
rowkey  TEXT
author  TEXT
register_date   TEXT
title   TEXT

And then query the table as follows:

default> SELECT * FROM blog;
rowkey,  author,  register_date,  title
-------------------------------
blrunner-01,  Jaehwa Jung,  2014-10-31,  Apache Tajo: A Big Data Warehouse System on Hadoop
hyunsik-02,  Hyunsik Choi,  2014-12-03,  Getting started with Tajo on your desktop
jhkim-01,  Jinho Kim,  2014-10-22,  APACHE TAJO™ v0.9 HAS ARRIVED!

default> SELECT * FROM blog WHERE rowkey = 'blrunner-01';
Progress: 100%, response time: 2.043 sec
rowkey,  author,  register_date,  title
-------------------------------
blrunner-01,  Jaehwa Jung,  2014-10-31,  Apache Tajo: A Big Data Warehouse System on Hadoop

Here’s how to insert data into an HBase table:

CREATE TABLE blog_backup(rowkey text, author text, register_date text, title text)
USING hbase WITH (
  'table'='blog_backup'
  , 'columns'=':key,info:author,info:date,content:title');
INSERT OVERWRITE INTO blog_backup SELECT * FROM blog;

Use the HBase shell to verify that the data was actually loaded:

hbase(main):004:0> scan 'blog_backup'
 ROW          COLUMN+CELL
 blrunner-01  column=content:title, timestamp=1421227531054, value=Apache Tajo: A Big Data Warehouse System on Hadoop
 blrunner-01  column=info:author, timestamp=1421227531054, value=Jaehwa Jung
 blrunner-01  column=info:date, timestamp=1421227531054, value=2014-10-31
 hyunsik-02   column=content:title, timestamp=1421227531054, value=Getting started with Tajo on your desktop
 hyunsik-02   column=info:author, timestamp=1421227531054, value=Hyunsik Choi
 hyunsik-02   column=info:date, timestamp=1421227531054, value=2014-12-03
 jhkim-01     column=content:title, timestamp=1421227531054, value=APACHE TAJO\xE2\x84\xA2 v0.9 HAS ARRIVED!
 jhkim-01     column=info:author, timestamp=1421227531054, value=Jinho Kim
 jhkim-01     column=info:date, timestamp=1421227531054, value=2014-10-22
3 row(s) in 0.0470 seconds