Saturday, 11 August 2012

Comparison of RCFile and Column-oriented Databases

This post lists the differences and synergies between Hadoop's RCFile format and column-oriented databases (like Cassandra, HBase, Hypertable). What parameters should be considered when deciding between them? These are my thoughts so far, and comments suggesting other factors are welcome.

  • Static vs Dynamic Data: Application data can be static (append-only) and never change, like web server events and other fact data. On the other hand, application data can be dynamic and need to reflect the latest state, like user profiles, bank transactions, etc. For dynamic data we must go for a column-oriented database, as the data constantly needs to be updated or deleted, but for static data the RCFile format seems a better fit (of course, depending on the other use cases as well).
  • Point Query: A point query retrieves the exact record the application asks for. Point queries are easy with a column-oriented database, as it keeps the data sorted by row key and lets the user look up a specific row by that key. Point queries can be a challenge with RCFile, as finding a specific row requires starting a MapReduce job that scans the data.
  • Range Query: A range query retrieves all the records matching some criteria, e.g. "All users in a specific group". Range queries are easily possible with both the RCFile format and column-oriented databases. Performance is the important factor here and should be evaluated against the range-query use cases.
  • Storage Requirements: Storage is an important factor, and the storage requirements should be calculated for every column-oriented option. Different compression algorithms should be tried to find the best one.
  • Performance: Depending on the application's needs, this could be the most critical requirement, so each option needs to be tested and evaluated against different parameters, e.g. compression. Queries should be faster when they require only a few columns rather than all of them.
  • Others: Column-oriented databases provide many features, like multiple versions, column families, and storing binary data, which should be considered before making a final decision.
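
To make the point-query and "fewer columns" arguments above a bit more concrete, here is a small Python sketch. It is only a toy model under my own assumptions: the class names (SortedKeyStore, RowGroupFile), the sample rows, and the columns_read counter are all made up for illustration and are not the API of HBase, Cassandra, or Hive's RCFile. The first class looks up a row by binary search on sorted row keys; the second keeps rows in row groups stored column by column, so every query is a scan, but a scan that only touches the columns it needs.

```python
from bisect import bisect_left

# Hypothetical sample data: (row_key, group, country).
COLUMNS = ("row_key", "group", "country")
ROWS = [
    ("user1", "admin", "IN"),
    ("user2", "staff", "US"),
    ("user3", "admin", "DE"),
    ("user4", "guest", "IN"),
]


class SortedKeyStore:
    """Toy stand-in for a store that keeps rows sorted by row key."""

    def __init__(self, rows):
        self._rows = sorted(rows)
        self._keys = [row[0] for row in self._rows]

    def get(self, row_key):
        # Binary search on the sorted keys, so no full scan is needed.
        i = bisect_left(self._keys, row_key)
        if i < len(self._keys) and self._keys[i] == row_key:
            return self._rows[i]
        return None


class RowGroupFile:
    """Toy stand-in for an RCFile-like layout: row groups stored column by column."""

    def __init__(self, rows, group_size=2):
        self.groups = []
        for start in range(0, len(rows), group_size):
            chunk = rows[start:start + group_size]
            # Within a row group, each column is kept as its own list.
            self.groups.append(
                {name: [row[i] for row in chunk] for i, name in enumerate(COLUMNS)}
            )
        self.columns_read = 0  # number of column chunks touched so far

    def scan(self, wanted, where_col, where_val):
        """Return the wanted columns of every row where where_col == where_val."""
        needed = set(wanted) | {where_col}
        result = []
        for group in self.groups:
            # Only the needed column chunks are read; the rest are skipped.
            cols = {name: group[name] for name in needed}
            self.columns_read += len(needed)
            for idx, value in enumerate(cols[where_col]):
                if value == where_val:
                    result.append(tuple(cols[name][idx] for name in wanted))
        return result


store = SortedKeyStore(ROWS)
print(store.get("user3"))            # point query by row key

rc = RowGroupFile(ROWS)
print(rc.scan(["row_key"], "group", "admin"))   # "all users in a specific group"
print(rc.columns_read)               # column chunks touched by that scan
```

In this toy, the group query touches only two of the three columns in each row group, while fetching one full row from RowGroupFile has to walk every row group and read all three columns. A real system adds indexes, compression, and distribution on top, so treat this only as an illustration of the access patterns and not as a benchmark.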

This is still a work in progress, and comments are highly welcome.
