I am evaluating Druid (0.12.0) as a storage for the financial tick data.
Currently all the processing revolves around Vertica database.
We export the data in monthly intervals, each export producing 40 GB of compressed CSV files.
The whole data set is around 100 000 000 000 rows with monthly growth of 2 000 000 000.
Each monthly batch consists of around 20 CSV files, one file per financial instrument.
In the short term I would continue using Vertica, but switch the export of the data to daily intervals and load it into Druid.
How should I organize loading all the historical data? I would like to do it in parallel if possible.
How should I approach structuring the data? Is it better to create a separate datasource per instrument?
Whether Druid is suitable, and how to structure your data, depends on your use case.
It’s been two years since I looked at Druid, but one of its key foundations at the time was that it aggregated data. For ad data this is perfect, as it saves a lot of storage while losing little of value. However, with most tick data you want a record of every single event, typically for running backtests and the like.
I did an outline of some alternative data stores here:
Some omissions from that list would now be ClickHouse, BigQuery, and Redshift.
If anyone can highlight where my knowledge is outdated, I welcome the feedback.
It sounds like Druid would be good for this. In general Druid is designed to handle big datasources (rather than a huge number of small ones), so if you end up having 20 datasources (one per instrument), that sounds fine to me. Modeling data in Druid the same way you’d model tables in Vertica would probably work out.
The best ways to load a bunch of historical data are:
If it’s from flat files in S3 or local disk, then split them up by time (splitting by day usually works) and load each day as a separate ingestion task with segmentGranularity DAY. The separate ingestions can run in parallel.
If it’s from Hadoop, then Hadoop indexing is automatically parallelized no matter how much data you are loading in a single job.
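As a rough sketch of the per-day approach, a native batch index task spec for one day of local CSV files might look like the following. All names here (datasource, paths, columns) are placeholders I’ve made up for illustration, not from the original post, and the exact fields may differ slightly across Druid versions:

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "ticks",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "csv",
          "timestampSpec": { "column": "timestamp", "format": "auto" },
          "columns": ["timestamp", "instrument", "price", "volume"],
          "dimensionsSpec": { "dimensions": ["instrument"] }
        }
      },
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2018-01-01/2018-01-02"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/data/ticks",
        "filter": "ticks-2018-01-01*.csv"
      }
    }
  }
}
```

You would submit one such task per day, and since each task covers a disjoint interval, they can be submitted concurrently.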
One thing that may have changed since you took a look is that ingestion-time aggregation is now an option that you can disable. For this kind of data you’d probably want to disable it, so you just store all the ticks as-is. In this case you’d set "rollup" to false when you do your ingestion. In this mode, Druid acts like a normal column store and stores one row for each row of input data. It does, however, continue to index the data for fast filtering.
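Disabling rollup is just a flag inside the granularitySpec of the ingestion spec. A minimal fragment (the surrounding task spec is elided) would look something like:

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "NONE",
  "rollup": false
}
```

With rollup off, queryGranularity has no aggregating effect on storage, and every input tick is preserved as its own row.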