Saturday 11 August 2012

Performance Tip - Select only required columns from HBase

HBase is a column-oriented database and lets you select only the required columns or column families using Scan objects. For a query on specific columns, add only those columns to the Scan object to get better performance. Performance improves because less data has to be read from disk, so the scan completes faster.

To select a specific column or column family, use the following methods of the Scan class.

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

Scan scan = new Scan();
// Select a single column (qualifier) of a column family:
// scan.addColumn(Bytes.toBytes(<family>), Bytes.toBytes(<qualifier>));
scan.addColumn(Bytes.toBytes("data"), Bytes.toBytes("firstName"));

// Or select the complete column family instead:
// scan.addFamily(Bytes.toBytes(<family>));
// scan.addFamily(Bytes.toBytes("data"));

Also restrict the scan to the required rows by using the setStartRow and setStopRow methods of the Scan object. The start row is inclusive and the stop row is exclusive.
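
To make the row-range behaviour concrete, here is a minimal sketch that does not use HBase at all. It models a table as a sorted map of row keys and returns the keys that a scan bounded by a start row and a stop row would visit. The class RowRangeSketch, its methods, and the sample row keys are all hypothetical names made up for this illustration. It assumes plain ASCII row keys, for which Java's String ordering matches the byte-wise ordering HBase uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-in for a table: row keys are kept in sorted order,
// similar to how HBase keeps rows sorted by key. This is not HBase API.
class RowRangeSketch {

    // Returns the row keys in the half-open range [startRow, stopRow).
    static List<String> rowsInRange(TreeMap<String, String> table,
                                    String startRow, String stopRow) {
        // subMap(from, true, to, false): start is inclusive, stop is exclusive.
        return new ArrayList<>(table.subMap(startRow, true, stopRow, false).keySet());
    }

    // Builds a small sample table with made-up row keys and values.
    static TreeMap<String, String> sampleTable() {
        TreeMap<String, String> table = new TreeMap<>();
        String[] keys = {"row-001", "row-002", "row-003", "row-004", "row-010"};
        for (String key : keys) {
            table.put(key, "firstName-of-" + key);
        }
        return table;
    }

    public static void main(String[] args) {
        // Prints [row-002, row-003]: the stop row itself is not included.
        System.out.println(rowsInRange(sampleTable(), "row-002", "row-004"));
    }
}
```

Because the keys are compared as text and not as numbers, zero-padded row keys such as row-001 keep the ranges predictable. With a real Scan, the same two boundaries would be passed to setStartRow and setStopRow as byte arrays.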

By selecting only the required columns, the processing of 1 billion records described in my previous blog post could be much faster.
