[druid-user] Re: Best ways to batch push TB's data from hdfs


So I can only answer a few of these I’m afraid (!)

On appends, check this table - i.e., use `index_parallel`:
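As a rough sketch of what an append looks like with the parallel task: in a native batch `index_parallel` spec, setting `appendToExisting` to `true` in the `ioConfig` adds new segments to the existing datasource instead of overwriting the interval. The bucket name and input format below are placeholders:

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "s3", "prefixes": ["s3://my-bucket/new-data/"] },
      "inputFormat": { "type": "parquet" },
      "appendToExisting": true
    }
  }
}
```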

On multiple S3 buckets, see the “properties” section of:
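For context, the `properties` object of the `s3` inputSource lets you override the default S3 credentials per ingestion task, which is how you read from buckets in a different account than your deep storage. Something like this (bucket names and environment variable names are made up):

```json
{
  "type": "s3",
  "prefixes": ["s3://bucket-a/data/", "s3://bucket-b/data/"],
  "properties": {
    "accessKeyId": { "type": "environment", "variable": "OTHER_ACCOUNT_ACCESS_KEY" },
    "secretAccessKey": { "type": "environment", "variable": "OTHER_ACCOUNT_SECRET_KEY" }
  }
}
```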

And re the issue, while the indexer is available, it's not truly "GA" - although there are A LOT of people working to make it that way. IMHO stick with MM / Peons until it is squeaky clean. (Though other people (!!!) might be like - "Peter - seriously" - :smiley: :smiley: :smiley: )

I was trying to track down a KB I saw once on Hadoop ingestion tuning… but unfortunately couldn't find it… :frowning:

I’ll try to answer some of those questions:

  1. There’s no ability to ingest data directly from Spark to Druid yet, but there is a WIP PR (by the awesome Julian Jaffe), see here.
  As for Hadoop-based ingestion (which is very scalable) - by default it will simply overwrite existing segments (see here), but that's OK, because you can in fact achieve what is called "delta ingestion" (which seems to be what you're looking for) by using the "multi" inputSpec type to merge existing data in Druid with new data from S3 (or wherever your new data resides), see here and here (go to the "Dimension-based TTL" section).
  2. Rommel Garcia, one of my colleagues from Imply, wrote a great post about tuning Hadoop-based ingestion, see https://imply.io/post/hadoop-indexing-apache-druid-configuration-best-practices.
  3. Do you mean you are unable to use S3 as the ingestion input location? Essentially, the deep storage (endpoint, credentials, etc.) is configured in Druid's config files, whereas the ingestion input location (path, credentials, etc.) can be part of the ingestion spec (see Hadoop-based ingestion and native batch ingestion). Can you elaborate on the specific issue (e.g., what error do you see)?
  4. Gian wrote here that the issue you've mentioned (https://github.com/apache/druid/issues/9820) should be fixed by https://github.com/apache/druid/pull/10631.
    I suspect that the fix will only be available in 0.21.0 (not in 0.20.1), but I might be wrong.
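To make the delta ingestion idea in point 1 concrete: the Hadoop-based `ioConfig` takes a "multi" inputSpec whose children combine a "dataSource" child (the data already in Druid) with a "static" child (the new files). Datasource name, interval, and path below are placeholders:

```json
{
  "ioConfig": {
    "type": "hadoop",
    "inputSpec": {
      "type": "multi",
      "children": [
        {
          "type": "dataSource",
          "ingestionSpec": {
            "dataSource": "my_datasource",
            "intervals": ["2021-01-01/2021-01-02"]
          }
        },
        {
          "type": "static",
          "paths": "s3://my-bucket/new-data/2021-01-01/"
        }
      ]
    }
  }
}
```

The task reads the existing segments for that interval, merges them with the new input, and writes replacement segments - so nothing already ingested is lost.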
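And to illustrate the split described in point 3: deep storage lives in `common.runtime.properties` (bucket name and keys below are placeholders), entirely separate from whatever input location your ingestion spec points at:

```properties
# Deep storage: where Druid itself persists segments
druid.storage.type=s3
druid.storage.bucket=my-deep-storage-bucket
druid.storage.baseKey=druid/segments
druid.s3.accessKey=<access-key>
druid.s3.secretKey=<secret-key>
```

The ingestion input (an `inputSource` in native batch, or an `inputSpec` in Hadoop-based ingestion) can point at a completely different bucket, with its own credentials, without touching these properties.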


This new library might satisfy your requirements:

If you end up trying it, it would be great to hear about your experience - for example, how the performance compares to native batch ingestion in your case. Also, if you have any trouble using the lib, don't hesitate to ask for help.