Friday 10 August 2012

Column-oriented storage in Hadoop using RCFile

Hadoop stores large amounts of data on a distributed cluster using HDFS. The data is normally stored as delimited text files, for example web server logs or CSV exports from RDBMS systems. Such large files are processed with the MapReduce paradigm, which scales through the massive parallel processing capacity of the cluster.

During processing, every record is read in full, including all of its columns. But what if I need to process only a few columns out of many? For example, if I want the sum of a single column, why should all the columns be read and add to the disk IO?

There are column-oriented databases like Cassandra, HBase, Hypertable and Vertica. For details on column-oriented storage, see the Wikipedia page.

Column-oriented databases have many advantages; here, though, we look at storing data in the Hadoop file system in a column-oriented format using RCFile.

The RCFile (Record Columnar File) format partitions the data horizontally (into row groups) and vertically (into columns), so a query can fetch only the specific columns it needs and avoid the disk IO penalty of reading every column.

The simplest way to create an RCFile-format table is through Hive:

CREATE TABLE USER_RC(
  id int,
  name string,
  age int,
  manager string,
  salary int
)
STORED AS RCFILE;
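
If you want to double-check that Hive really created the table with RCFile storage, DESCRIBE FORMATTED (assuming a Hive release recent enough to support it; DESCRIBE EXTENDED works on older ones) prints the table metadata, and the InputFormat/OutputFormat entries should reference the RCFile classes:

-- The InputFormat / OutputFormat lines in the output show the storage format of the table
DESCRIBE FORMATTED USER_RC;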


To load data into this table in RCFile format, follow these steps:

CREATE TABLE USER_TEXT(
  id int,
  name string,
  age int,
  manager string,
  salary int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/tmp/user.txt' INTO TABLE USER_TEXT;

INSERT OVERWRITE TABLE USER_RC SELECT * FROM USER_TEXT;
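
The detour through a text-format staging table is needed because LOAD DATA only moves the file into the table's directory without converting it; the INSERT ... SELECT is what actually runs a MapReduce job and rewrites the rows as RCFile. As a quick sanity check on the small sample used here, the row counts of the two tables should match:

-- Both counts should be identical once the INSERT OVERWRITE has finished
SELECT COUNT(*) FROM USER_TEXT;
SELECT COUNT(*) FROM USER_RC;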

Now run a Hive query to sum the salary column against both tables:

select sum(salary) from USER_TEXT;
select sum(salary) from USER_RC;

Each query starts a MapReduce job; watch the HDFS_BYTES_READ counter of the job to see the difference in the number of bytes read from HDFS. The difference is huge: the RCFile table reads only the salary column, while the text-format table reads the complete data to execute the query.
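
Note that the saving comes entirely from column pruning, so a query that needs every column gains little from the RCFile layout. A query such as the following (a made-up filter over the same table) would still read essentially the whole table from either storage format:

-- All five columns are required in the output, so there is nothing to skip
SELECT * FROM USER_RC WHERE salary > 880;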

For example, the following file in text format (/tmp/user.txt):

1,Charry,30,Joe,878
2,Roy,31,Joe,879
3,Robert,32,Joe,880
4,Barry,33,Joe,881

would be stored in RCFile format as follows; the salary query reads only the last line (the salary column) and skips all the other data.
1,2,3,4
Charry,Roy,Robert,Barry
30,31,32,33
Joe,Joe,Joe,Joe
878,879,880,881

To confirm this layout, you can browse the file /user/hive/warehouse/user_rc/00000_0 that Hive wrote for the table in HDFS.
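
If you want to locate and peek at that file without leaving Hive, the dfs command in the Hive CLI runs ordinary HDFS commands (the exact file name under the warehouse directory may differ on your cluster):

-- List the files Hive wrote for the USER_RC table
dfs -ls /user/hive/warehouse/user_rc/;
-- Dump the raw bytes; the values of each column sit next to each other instead of row by row
dfs -cat /user/hive/warehouse/user_rc/00000_0;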


Comments:

  1. You start seeing the difference when you have millions or billions of rows.

    Row Store
    =========
    HDFS_BYTES_READ 951,418,972

    RCFile
    ======
    HDFS_BYTES_READ 387,101,184

    Roughly 2.5 times less data is read from HDFS.
    Table size: 48 million rows, each with 5 columns.

  2. Is it possible to separate out the RCFile storage format from Hive and use it with an MR job, by specifying it as the OutputFormat when writing and the InputFormat when reading? If yes, can you share a library and sample code to do so?

  3. Unable to run Hive queries.
    When I run select sum(salary) from USER_TEXT;
    it gives java.io.FileNotFoundException: File does not exist: ../hive-0.9.0-cdh4.1.2/lib/hive-builtins-0.9.0-cdh4.1.2.jar
    even though the jar is present at that location.
    I am using Cloudera 4.1.2.
    Can you tell me the solution, or am I missing some Hive-Hadoop configuration?
