ORC

ORC (Optimized Row Columnar) is a columnar storage format that originated in Hive. ORC improves performance for reading, writing, and processing data. For more details, please refer to ORC Files at the Hive wiki.

How to Create an ORC Table?

If you are not familiar with the CREATE TABLE statement, please refer to Data Definition Language.

To specify a file format for your table, use the USING clause in your CREATE TABLE statement. Below is an example statement for creating a table stored as ORC files.

CREATE TABLE table1 (
  id int,
  name text,
  score float,
  type text
) USING orc;

Physical Properties

Some table storage formats provide parameters for enabling or disabling features and adjusting physical parameters. The WITH clause in the CREATE TABLE statement allows users to set those parameters.

Currently, the ORC format provides the following physical properties.

  • orc.max.merge.distance: When an ORC file is read, stripes that are closer together than this distance are merged and read in a single operation. Default is 1MB.
  • orc.stripe.size: The size of each stripe. Default is 64MB.
  • orc.compression.kind: The compression algorithm used when writing data. It must be one of none, snappy, or zlib. Default is none.
  • orc.buffer.size: The size of the write buffer. Default is 256KB.
  • orc.rowindex.stride: The ORC index stride, that is, the number of rows each index entry represents. Default is 10000.
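
As an illustration of how these properties might be set, the sketch below passes two of them through the WITH clause. It assumes the quoted 'key'='value' form that Tajo's WITH clause uses for other storage formats. The table name table1_compressed and the chosen values are hypothetical. The byte-sized properties are left out because this page does not state the unit their values must be written in.

```sql
-- Hypothetical table; columns mirror the earlier example.
CREATE TABLE table1_compressed (
  id int,
  name text,
  score float,
  type text
) USING orc WITH (
  'orc.compression.kind' = 'snappy',  -- one of: none, snappy, zlib
  'orc.rowindex.stride'  = '20000'    -- rows per index entry (default 10000)
);
```

Any property omitted from the WITH clause keeps the default listed above.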

Compatibility Issues with Apache Hive™

At the moment, Tajo only supports flat relational tables. We are currently working on adding support for nested schemas and non-scalar types (TAJO-710).