We have a system that sends us about 50 million unique events every 15 minutes (all with the same timestamp). That should be treated as the post-roll-up count, in Druid terms. We’re investigating the best way to get this data into Druid.
At first, we tried the “index” task, but it turned out to be way too slow for this (ingesting one hour’s worth of data took more than an hour).
Then we switched to submitting an “index_realtime” task and pushing data through HTTP. That was much better, as we got down to ~30 minutes per hour of data, but it’s still too slow. Surprisingly, pushing over more connections in parallel does not seem to offer any significant speedup.
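For context, the receiving end of our “index_realtime” spec looks roughly like the sketch below; the serviceName and bufferSize values here are illustrative, not our exact settings (the real configs are in the gist linked at the end):

    "ioConfig": {
      "type": "realtime",
      "firehose": {
        "type": "receiver",
        "serviceName": "our-event-receiver",
        "bufferSize": 100000
      }
    }

We then POST batches of JSON events to the task’s push endpoint on the peon, i.e. /druid/worker/v1/chat/our-event-receiver/push-events, if we understand the EventReceiverFirehose correctly.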
We’re running the tests on an Intel i7-6700 machine (4 cores / 8 threads) with 32 GB of RAM and two SSD drives in RAID0. During ingestion the load average is about 4 and I/O peaks at ~100 MB/s, so there still seems to be plenty of headroom. Is there any way to tune a single “index_realtime” task to use more resources (I will attach the Overlord configs, etc. below)? Or is spawning more tasks in parallel the only way? The latter is cumbersome, as a single running task currently uses >10 GB of RAM, so we’re pretty short on that.
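As far as we can tell, the per-task knobs live in the tuningConfig; this is a sketch of the kind of thing we mean (values are illustrative, not our actual settings):

    "tuningConfig": {
      "type": "realtime",
      "maxRowsInMemory": 500000,
      "intermediatePersistPeriod": "PT10M",
      "maxPendingPersists": 2,
      "windowPeriod": "PT15M"
    }

Raising maxRowsInMemory presumably trades heap for fewer intermediate persists, but we’re not sure which of these, if any, would let a single task make fuller use of the machine.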
One other problem we’ve been facing is occasional errors saying “Tried to write [<some_number>] bytes, which is too much”. As far as we understand, this happens because there are too many rows with the same timestamp. We think sharding could help here; is that correct?
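If sharding is indeed the answer, we assume it means running several “index_realtime” tasks for the same interval, each with its own partitionNum in the shardSpec, something along these lines (a sketch, not something we have tested):

    "tuningConfig": {
      "type": "realtime",
      "shardSpec": { "type": "linear", "partitionNum": 0 }
    }

with a second task using "partitionNum": 1, and so on, and the pushed events split between them.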
Any general advice on handling this type of data?
Configs: https://gist.github.com/KenjiTakahashi/906998772f65cd3e68501ad6255a3c2c. We’re running Druid 0.9.0.
Let me know if we can supply any more information.