My previous blog,
"How to calculate the record size of HBase?", showed how to calculate the record size of an HBase table and estimate its storage requirements. But how do we verify it during the testing phase and in the production environment?
HBase's HFile tool can be used to find the average key size, average value size and number of records per store file in HDFS. It can also be used to see the actual records in a store file.
To use it, first browse HDFS with "hadoop fs -lsr /hbase/<table name>" and find the store file of the table, as shown below for the 'employee' table:
Prafulls-MacBook-Pro:~ prafkum$
hadoop fs -lsr /hbase/employee
-rw-r--r-- 1 prafkum supergroup 521 2012-08-10 06:22 /hbase/employee/.tableinfo.0000000001
drwxr-xr-x - prafkum supergroup 0 2012-08-10 06:22 /hbase/employee/.tmp
drwxr-xr-x - prafkum supergroup 0 2012-08-10 06:24 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7
drwxr-xr-x - prafkum supergroup 0 2012-08-10 06:22 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/.oldlogs
-rw-r--r-- 1 prafkum supergroup 124 2012-08-10 06:22 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/.oldlogs/hlog.1344559953739
-rw-r--r-- 1 prafkum supergroup 231 2012-08-10 06:22 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/.regioninfo
drwxr-xr-x - prafkum supergroup 0 2012-08-10 06:24 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/.tmp
drwxr-xr-x - prafkum supergroup 0 2012-08-10 06:24 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/data
-rw-r--r-- 1 prafkum supergroup 722 2012-08-10 06:24 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/data/770ababfda4643f59956b90da2dc3d3f
Look for the files in the "data" directory and choose any one, e.g. /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/data/770ababfda4643f59956b90da2dc3d3f in the above output.
Now run the HFile tool on the store file as follows. The -m option prints the metadata; other options such as -p print the actual records in the file and -s prints statistics. (In production, please use only -m; don't use -p or -s there, as the store file might contain a huge amount of data.)
Prafulls-MacBook-Pro:bin prafkum$ hbase org.apache.hadoop.hbase.io.hfile.HFile -v -p -s -m -f /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/data/770ababfda4643f59956b90da2dc3d3f
Scanning -> /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/data/770ababfda4643f59956b90da2dc3d3f
12/08/10 06:32:42 INFO hfile.CacheConfig: Allocating LruBlockCache with maximum size 246.9m
K: row1/data:employeeId/1344560028049/Put/vlen=3 V: 123
K: row1/data:firstName/1344560042111/Put/vlen=3 V: Joe
K: row1/data:lastName/1344560058448/Put/vlen=6 V: Robert
Block index size as per heapsize: 416
reader=/hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/data/770ababfda4643f59956b90da2dc3d3f,
compression=none,
cacheConf=CacheConfig:enabled [cacheDataOnRead=true] [cacheDataOnWrite=false] [cacheIndexesOnWrite=false] [cacheBloomsOnWrite=false] [cacheEvictOnClose=false] [cacheCompressed=false],
firstKey=row1/data:employeeId/1344560028049/Put,
lastKey=row1/data:lastName/1344560058448/Put,
avgKeyLen=29,
avgValueLen=4,
entries=3,
length=722
Trailer:
fileinfoOffset=241,
loadOnOpenDataOffset=150,
dataIndexCount=1,
metaIndexCount=0,
totalUncomressedBytes=655,
entryCount=3,
compressionCodec=NONE,
uncompressedDataIndexSize=43,
numDataIndexLevels=1,
firstDataBlockOffset=0,
lastDataBlockOffset=0,
comparatorClassName=org.apache.hadoop.hbase.KeyValue$KeyComparator,
version=2
Fileinfo:
KEY_VALUE_VERSION = \x00\x00\x00\x01
MAJOR_COMPACTION_KEY = \x00
MAX_MEMSTORE_TS_KEY = \x00\x00\x00\x00\x00\x00\x00\x00
MAX_SEQ_ID_KEY = 10
TIMERANGE = 1344560028049....1344560058448
hfile.AVG_KEY_LEN = 29
hfile.AVG_VALUE_LEN = 4
hfile.LASTKEY = \x00\x04row1\x04datalastName\x00\x00\x019\x0E\x06PP\x04
Mid-key: \x00\x04row1\x04dataemployeeId\x00\x00\x019\x0E\x05\xD9\x91\x04
Bloom filter:
Not present
Stats:
Key length: count: 3 min: 28 max: 30 mean: 29.0
Val length: count: 3 min: 3 max: 6 mean: 4.0
Row size (bytes): count: 1 min: 123 max: 123 mean: 123.0
Row size (columns): count: 1 min: 3 max: 3 mean: 3.0
Key of biggest row: row1
Scanned kv count -> 3
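For automated checks during testing, the same metadata can also be read with the HBase client API instead of the CLI. The sketch below is only an illustration written against the 0.92/0.94-era HFile API (the createReader signature and method names may differ in other releases); it prints the avgKeyLen, avgValueLen and entry count for a given store file path:

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.util.Bytes;

public class StoreFileMeta {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // Store file path found under /hbase/<table>/<region>/<family>/, passed as the first argument
        Path storeFile = new Path(args[0]);

        HFile.Reader reader = HFile.createReader(fs, storeFile, new CacheConfig(conf));
        Map<byte[], byte[]> fileInfo = reader.loadFileInfo();

        // The same values the HFile tool prints as hfile.AVG_KEY_LEN / hfile.AVG_VALUE_LEN
        int avgKeyLen = Bytes.toInt(fileInfo.get(Bytes.toBytes("hfile.AVG_KEY_LEN")));
        int avgValueLen = Bytes.toInt(fileInfo.get(Bytes.toBytes("hfile.AVG_VALUE_LEN")));

        System.out.println("avgKeyLen=" + avgKeyLen
                + ", avgValueLen=" + avgValueLen
                + ", entries=" + reader.getEntries()
                + ", length=" + reader.length());
        reader.close();
    }
}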
The avgKeyLen, avgValueLen and entries values in this output can be used to validate the record size with the formula
(8 + avgKeyLen + avgValueLen) * columns per record
So for the above output:
(8 + 29 + 4) * 3 = 123
which is equal to our calculation in the previous blog.
Please note that the result is approximate, as it is calculated from the average values. Also, the first parameter, 8, is actually the "Key Length" and "Value Length" fields of 4 bytes each, as described in the last blog.
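For completeness, here is a tiny Java snippet (the class and method names are made up purely for illustration) that applies the same formula to the values reported by the HFile tool:

public class RecordSizeEstimate {
    // (8 + avgKeyLen + avgValueLen) * columnsPerRow
    // where 8 = 4-byte "Key Length" field + 4-byte "Value Length" field of each KeyValue
    static long estimateRowSize(int avgKeyLen, int avgValueLen, int columnsPerRow) {
        return (long) (8 + avgKeyLen + avgValueLen) * columnsPerRow;
    }

    public static void main(String[] args) {
        // Values taken from the HFile output above: avgKeyLen=29, avgValueLen=4, 3 columns per row
        System.out.println(estimateRowSize(29, 4, 3)); // prints 123, matching "Row size (bytes)" in Stats
    }
}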