Thursday 9 August 2012

How to verify the record size of HBase?

In my previous blog, "How to calculate the record size of HBase?", we saw how to calculate the record size of HBase and estimate the storage requirements. But how do we verify it during the testing phase and in the production environment?

The HFile tool shipped with HBase can be used to find the average key size, average value size, and number of records per store file in HDFS. It can also be used to see the actual records in the store file.
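The tool is run through the hbase launcher script; a minimal invocation (using only the -m and -f options explained further below) looks like this:

 hbase org.apache.hadoop.hbase.io.hfile.HFile -m -f <path to store file in HDFS>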

To use it, first browse HDFS with "hadoop fs -lsr /hbase/<table name>" and find the store file of the table, as shown below for the 'employee' table:

 Prafulls-MacBook-Pro:~ prafkum$ hadoop fs -lsr /hbase/employee  
 -rw-r--r--  1 prafkum supergroup    521 2012-08-10 06:22 /hbase/employee/.tableinfo.0000000001  
 drwxr-xr-x  - prafkum supergroup     0 2012-08-10 06:22 /hbase/employee/.tmp  
 drwxr-xr-x  - prafkum supergroup     0 2012-08-10 06:24 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7  
 drwxr-xr-x  - prafkum supergroup     0 2012-08-10 06:22 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/.oldlogs  
 -rw-r--r--  1 prafkum supergroup    124 2012-08-10 06:22 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/.oldlogs/hlog.1344559953739  
 -rw-r--r--  1 prafkum supergroup    231 2012-08-10 06:22 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/.regioninfo  
 drwxr-xr-x  - prafkum supergroup     0 2012-08-10 06:24 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/.tmp  
 drwxr-xr-x  - prafkum supergroup     0 2012-08-10 06:24 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/data  
 -rw-r--r--  1 prafkum supergroup    722 2012-08-10 06:24 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/data/770ababfda4643f59956b90da2dc3d3f  

Look for the files in the "data" directory and choose any one, e.g. /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/data/770ababfda4643f59956b90da2dc3d3f in the above output.
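If the listing is long, a quick way to pick out just the store files is to grep for the column family directory, which happens to be named "data" for this employee table (a simple convenience; adjust the pattern to your own column family name):

 hadoop fs -lsr /hbase/employee | grep '/data/'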

Now use the HFile tool on the store file as follows. In production, use only the -m option, which prints the metadata. Options like -p (print the actual records in the file) and -s (statistics) are also used in the run below, but avoid them in production as the data in the store file might be huge.
 Prafulls-MacBook-Pro:bin prafkum$ hbase org.apache.hadoop.hbase.io.hfile.HFile -v -p -s -m -f /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/data/770ababfda4643f59956b90da2dc3d3f  
 Scanning -> /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/data/770ababfda4643f59956b90da2dc3d3f  
 12/08/10 06:32:42 INFO hfile.CacheConfig: Allocating LruBlockCache with maximum size 246.9m  
 K: row1/data:employeeId/1344560028049/Put/vlen=3 V: 123  
 K: row1/data:firstName/1344560042111/Put/vlen=3 V: Joe  
 K: row1/data:lastName/1344560058448/Put/vlen=6 V: Robert  
 Block index size as per heapsize: 416  
 reader=/hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/data/770ababfda4643f59956b90da2dc3d3f,  
   compression=none,  
   cacheConf=CacheConfig:enabled [cacheDataOnRead=true] [cacheDataOnWrite=false] [cacheIndexesOnWrite=false] [cacheBloomsOnWrite=false] [cacheEvictOnClose=false] [cacheCompressed=false],  
   firstKey=row1/data:employeeId/1344560028049/Put,  
   lastKey=row1/data:lastName/1344560058448/Put,  
   avgKeyLen=29,  
   avgValueLen=4,  
   entries=3,  
   length=722  
 Trailer:  
   fileinfoOffset=241,  
   loadOnOpenDataOffset=150,  
   dataIndexCount=1,  
   metaIndexCount=0,  
   totalUncomressedBytes=655,  
   entryCount=3,  
   compressionCodec=NONE,  
   uncompressedDataIndexSize=43,  
   numDataIndexLevels=1,  
   firstDataBlockOffset=0,  
   lastDataBlockOffset=0,  
   comparatorClassName=org.apache.hadoop.hbase.KeyValue$KeyComparator,  
   version=2  
 Fileinfo:  
   KEY_VALUE_VERSION = \x00\x00\x00\x01  
   MAJOR_COMPACTION_KEY = \x00  
   MAX_MEMSTORE_TS_KEY = \x00\x00\x00\x00\x00\x00\x00\x00  
   MAX_SEQ_ID_KEY = 10  
   TIMERANGE = 1344560028049....1344560058448  
   hfile.AVG_KEY_LEN = 29  
   hfile.AVG_VALUE_LEN = 4  
   hfile.LASTKEY = \x00\x04row1\x04datalastName\x00\x00\x019\x0E\x06PP\x04  
 Mid-key: \x00\x04row1\x04dataemployeeId\x00\x00\x019\x0E\x05\xD9\x91\x04  
 Bloom filter:  
   Not present  
 Stats:  
 Key length: count: 3     min: 28     max: 30     mean: 29.0  
 Val length: count: 3     min: 3     max: 6     mean: 4.0  
 Row size (bytes): count: 1     min: 123     max: 123     mean: 123.0  
 Row size (columns): count: 1     min: 3     max: 3     mean: 3.0  
 Key of biggest row: row1  
 Scanned kv count -> 3  

The above output can be used to validate the record size using the formula:
(8 + avgKeyLen + avgValueLen) * columns per record

So for the above output:
(8 + 29 + 4) * 3 = 123

which matches the "Row size (bytes)" reported in the Stats section above and equals our calculation in the previous blog...

Please note that the result is approximate as it is calculated from the average key and value lengths. Also, the first parameter, 8, is actually the "Key Length" and "Value Length" fields of 4 bytes each, as described in the last blog.
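As a quick sanity check, the same arithmetic can be pulled straight out of the -m output. The snippet below is only a sketch: it assumes the avgKeyLen/avgValueLen line format shown above and takes the number of columns per record (3 for this employee table) as a hand-supplied value:

 COLUMNS=3
 hbase org.apache.hadoop.hbase.io.hfile.HFile -m -f /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/data/770ababfda4643f59956b90da2dc3d3f | \
   awk -F'[=,]' -v cols=$COLUMNS '/avgKeyLen/ {k=$2} /avgValueLen/ {v=$2} END {print "approx record size:", (8+k+v)*cols, "bytes"}'

For the store file above this prints "approx record size: 123 bytes", matching the manual calculation.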
