From my previous blog "How to calculate the record size of HBase?" it's easy to calculate the record size of HBase and estimate the storage requirements. But how do you verify it during the testing phase and in the production environment?
The HFile tool of HBase can be used to find the average key size, average value size and number of entries (key/value pairs) per store file in HDFS. It can also be used to see the actual records in the store file.
To use it, first list the table's directory in HDFS with "hadoop fs -lsr /hbase/<table name>" and find the store file of the table, as follows for the 'employee' table:
Prafulls-MacBook-Pro:~ prafkum$ hadoop fs -lsr /hbase/employee
-rw-r--r-- 1 prafkum supergroup 521 2012-08-10 06:22 /hbase/employee/.tableinfo.0000000001
drwxr-xr-x - prafkum supergroup 0 2012-08-10 06:22 /hbase/employee/.tmp
drwxr-xr-x - prafkum supergroup 0 2012-08-10 06:24 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7
drwxr-xr-x - prafkum supergroup 0 2012-08-10 06:22 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/.oldlogs
-rw-r--r-- 1 prafkum supergroup 124 2012-08-10 06:22 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/.oldlogs/hlog.1344559953739
-rw-r--r-- 1 prafkum supergroup 231 2012-08-10 06:22 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/.regioninfo
drwxr-xr-x - prafkum supergroup 0 2012-08-10 06:24 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/.tmp
drwxr-xr-x - prafkum supergroup 0 2012-08-10 06:24 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/data
-rw-r--r-- 1 prafkum supergroup 722 2012-08-10 06:24 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/data/770ababfda4643f59956b90da2dc3d3f
Look for the files in the "data" directory (the directory named after the column family, which is "data" for this table) and choose any one, e.g. /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/data/770ababfda4643f59956b90da2dc3d3f in the above output.
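On a table with many regions, picking the store files out of the listing by eye gets tedious. The following is a minimal Python sketch of the same selection step, assuming the eight-column listing layout shown in this post and a column family directory called "data". The function name store_files and its family parameter are made up for illustration; they are not part of Hadoop or HBase.

```python
def store_files(listing, family="data"):
    """Return the paths of regular files whose parent directory is `family`.

    `listing` is the text printed by `hadoop fs -lsr /hbase/<table name>`,
    assumed to have 8 whitespace-separated columns per line:
    permissions, replication, owner, group, size, date, time, path.
    """
    paths = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip short lines and directories (permissions starting with 'd').
        if len(fields) < 8 or not fields[0].startswith("-"):
            continue
        path = fields[-1]
        # Split off the last two components: .../<parent>/<file name>
        parts = path.rsplit("/", 2)
        if len(parts) == 3 and parts[1] == family:
            paths.append(path)
    return paths


if __name__ == "__main__":
    sample = (
        "-rw-r--r-- 1 prafkum supergroup 231 2012-08-10 06:22 "
        "/hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/.regioninfo\n"
        "-rw-r--r-- 1 prafkum supergroup 722 2012-08-10 06:24 "
        "/hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/data/770ababfda4643f59956b90da2dc3d3f\n"
    )
    for path in store_files(sample):
        print(path)
```

For a table with a different column family, pass its name as the family argument.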
Now run the HFile tool on the store file as follows. In production, please use only the -m option, which prints the metadata. Other options, such as -p to print the actual contents of the file and -s for statistics, should not be used in production, as the store file might hold a huge amount of data. The example below uses all of them on a tiny test file.
Prafulls-MacBook-Pro:bin prafkum$ hbase org.apache.hadoop.hbase.io.hfile.HFile -v -p -s -m -f /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/data/770ababfda4643f59956b90da2dc3d3f
Scanning -> /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/data/770ababfda4643f59956b90da2dc3d3f
12/08/10 06:32:42 INFO hfile.CacheConfig: Allocating LruBlockCache with maximum size 246.9m
K: row1/data:employeeId/1344560028049/Put/vlen=3 V: 123
K: row1/data:firstName/1344560042111/Put/vlen=3 V: Joe
K: row1/data:lastName/1344560058448/Put/vlen=6 V: Robert
Block index size as per heapsize: 416
reader=/hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/data/770ababfda4643f59956b90da2dc3d3f,
    compression=none,
    cacheConf=CacheConfig:enabled [cacheDataOnRead=true] [cacheDataOnWrite=false] [cacheIndexesOnWrite=false] [cacheBloomsOnWrite=false] [cacheEvictOnClose=false] [cacheCompressed=false],
    firstKey=row1/data:employeeId/1344560028049/Put,
    lastKey=row1/data:lastName/1344560058448/Put,
    avgKeyLen=29,
    avgValueLen=4,
    entries=3,
    length=722
Trailer:
    fileinfoOffset=241,
    loadOnOpenDataOffset=150,
    dataIndexCount=1,
    metaIndexCount=0,
    totalUncomressedBytes=655,
    entryCount=3,
    compressionCodec=NONE,
    uncompressedDataIndexSize=43,
    numDataIndexLevels=1,
    firstDataBlockOffset=0,
    lastDataBlockOffset=0,
    comparatorClassName=org.apache.hadoop.hbase.KeyValue$KeyComparator,
    version=2
Fileinfo:
    KEY_VALUE_VERSION = \x00\x00\x00\x01
    MAJOR_COMPACTION_KEY = \x00
    MAX_MEMSTORE_TS_KEY = \x00\x00\x00\x00\x00\x00\x00\x00
    MAX_SEQ_ID_KEY = 10
    TIMERANGE = 1344560028049....1344560058448
    hfile.AVG_KEY_LEN = 29
    hfile.AVG_VALUE_LEN = 4
    hfile.LASTKEY = \x00\x04row1\x04datalastName\x00\x00\x019\x0E\x06PP\x04
Mid-key: \x00\x04row1\x04dataemployeeId\x00\x00\x019\x0E\x05\xD9\x91\x04
Bloom filter:
    Not present
Stats:
Key length: count: 3 min: 28 max: 30 mean: 29.0
Val length: count: 3 min: 3 max: 6 mean: 4.0
Row size (bytes): count: 1 min: 123 max: 123 mean: 123.0
Row size (columns): count: 1 min: 3 max: 3 mean: 3.0
Key of biggest row: row1
Scanned kv count -> 3
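Only three figures from this output matter for sizing: avgKeyLen, avgValueLen and entries. If you want to collect them from many store files, a small script can pull them out of the text. The following Python sketch assumes the output contains them as name=number pairs, exactly as printed above. The function name parse_hfile_meta is invented for this example and is not an HBase API.

```python
import re


def parse_hfile_meta(output):
    """Extract avgKeyLen, avgValueLen and entries from HFile tool output.

    Assumes each value appears as `name=<digits>`, as in the `-m` output
    shown in this post. Raises ValueError if any of them is missing.
    """
    stats = {}
    for name in ("avgKeyLen", "avgValueLen", "entries"):
        match = re.search(r"\b" + name + r"=(\d+)", output)
        if match is None:
            raise ValueError(name + " not found in HFile output")
        stats[name] = int(match.group(1))
    return stats


if __name__ == "__main__":
    sample = (
        "lastKey=row1/data:lastName/1344560058448/Put, "
        "avgKeyLen=29, avgValueLen=4, entries=3, length=722"
    )
    print(parse_hfile_meta(sample))
```

The text to parse would be whatever the hbase command prints when run with -m on one store file.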
The above output can be used to validate the record size using the formula:
(8+avgKeyLen+avgValueLen)*columns per record
So for the above output
(8+29+4)*3=123
and it's equal to our calculation in the previous blog.
Please note that the result is approximate, as it's calculated from the average values. Also, the first parameter, 8, is the combined size of the "Key Length" and "Value Length" fields, 4 bytes each, as described in the last blog.
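The same arithmetic as a small Python sketch, in case you want to repeat the check for several tables. The function names are made up for this example. The per-cell key lengths 28, 29 and 30 are inferred from the min, max and mean in the Stats block above, and which key goes with which value is an assumption that does not change the sum.

```python
# 4-byte "Key Length" field + 4-byte "Value Length" field stored with every cell
KV_LENGTH_FIELDS = 8


def estimated_record_size(avg_key_len, avg_value_len, columns_per_record):
    """Apply (8 + avgKeyLen + avgValueLen) * columns per record."""
    return (KV_LENGTH_FIELDS + avg_key_len + avg_value_len) * columns_per_record


def exact_record_size(cells):
    """Sum the size of each cell, given (key length, value length) pairs."""
    return sum(KV_LENGTH_FIELDS + key_len + value_len
               for key_len, value_len in cells)


if __name__ == "__main__":
    # Averages reported by the HFile tool for the 'employee' store file
    print(estimated_record_size(29, 4, 3))                  # 123
    # Per-cell lengths (assumed pairing, see the note above)
    print(exact_record_size([(30, 3), (29, 3), (28, 6)]))   # 123
```

Both figures agree with the "Row size (bytes)" of 123 in the Stats block because this store file holds a single row. With many rows of differing widths, the average-based figure is only an estimate for any individual record.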