Native batch ingestion and multi-dimension range partitioning in 0.23.0

Multi-dimension range partitioning decreases storage size by 40% while improving query speed by 75%.

Partitioning motivation:

  • Parallelism: Multiple processes can operate on different portions of data simultaneously.
  • Distributed storage: Data can be spread across several data servers, allowing even commodity hardware to meet memory and disk requirements.
  • Improved I/O management: Reasonably sized partitions make reading from and writing to disk easier, and make transmission over the network less prone to failure.
  • Granular replication: With the same replication factor, say 2, replicating at the partition level gives better fault tolerance than replicating the unpartitioned data as a whole.

This feature is available for native batch ingestion and compaction. If you ingest your data with Hadoop, you can have a compaction job rewrite it with range partitioning. For a bit more context, check here and here.
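
As a rough illustration of what enabling range partitioning might look like, the sketch below builds the tuningConfig fragment of a native batch (index_parallel) ingestion spec. The field names (type "range", partitionDimensions, targetRowsPerSegment, forceGuaranteedRollup, maxNumConcurrentSubTasks) reflect our understanding of the Druid documentation and should be checked against the docs for your version. The helper functions, the dimension names countryName and cityName, and the numeric values are hypothetical and chosen only for the example.

```python
import json


def range_partitions_spec(dimensions, target_rows_per_segment=5_000_000):
    """Hypothetical helper: build a partitionsSpec for multi-dimension range partitioning."""
    if not dimensions:
        raise ValueError("range partitioning needs at least one dimension")
    return {
        "type": "range",
        # Order matters: list the dimension you filter on most often first.
        "partitionDimensions": list(dimensions),
        # Illustrative value, not a recommendation.
        "targetRowsPerSegment": target_rows_per_segment,
    }


def parallel_tuning_config(partitions_spec, max_subtasks=4):
    """Hypothetical helper: wrap the partitionsSpec in an index_parallel tuningConfig."""
    return {
        "type": "index_parallel",
        # Assumption: range partitioning requires perfect rollup to be enabled.
        "forceGuaranteedRollup": True,
        "maxNumConcurrentSubTasks": max_subtasks,
        "partitionsSpec": partitions_spec,
    }


spec = range_partitions_spec(["countryName", "cityName"])
tuning = parallel_tuning_config(spec)
print(json.dumps(tuning, indent=2))
```

The same partitionsSpec object should be usable in the tuningConfig of a compaction job, which is how data originally ingested with Hadoop could be rewritten with range partitioning.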