Parallel Batch Ingestion with SQL Input Source

Hello,

Suppose we set maxNumConcurrentSubTasks > 1

The docs here: https://druid.apache.org/docs/latest/ingestion/native-batch.html#parallel-task talk about a splittable batch input source. I would like to know what "splittable" means. Does the user have to split the data in advance and provide multiple inputs in the task (for example, multiple SQL queries in ioConfig.inputSource.sqls) so that ingestion can run in parallel across multiple subtasks? Or can Druid perform the splitting for us and run multiple subtasks even if we provide only one query in ioConfig.inputSource.sqls, as long as we tweak tuningConfig.splitHintSpec?

Put another way, can we run multiple ingestion subtasks with only one query in ioConfig.inputSource.sqls if we tweak tuningConfig.splitHintSpec?

I’ve tested with multiple queries in ioConfig.inputSource.sqls, and multiple subtasks get created, but I don’t see what tuningConfig.splitHintSpec is for.

Thanks,

Hey! The work that’s issued by the Overlord to each task that gets created is split up according to the ingestion source.

For SQL, I believe the work is split up by the individual queries that you specify:

“each worker task will read from one SQL query from the list of queries”
https://druid.apache.org/docs/latest/ingestion/native-batch.html#sql-input-source

Under that bit of the docs there are also some useful hints for running SQL ingestion.
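
To make the "one query per worker task" point concrete, here is a rough sketch of the relevant parts of a parallel task spec, built as a Python dict and printed as JSON. The field names follow my reading of the native batch docs, so please check them against your Druid version. The table name, column name, JDBC URI, and credentials are all made-up placeholders.

```python
import json

# Hypothetical table ("events"), column ("ts"), and date ranges.
# Each entry in this list is expected to become one split, i.e. one subtask.
sqls = [
    "SELECT * FROM events WHERE ts >= '2021-01-01' AND ts < '2021-01-02'",
    "SELECT * FROM events WHERE ts >= '2021-01-02' AND ts < '2021-01-03'",
]

io_config = {
    "type": "index_parallel",
    "inputSource": {
        "type": "sql",
        "database": {
            "type": "mysql",
            "connectorConfig": {
                # Placeholder connection details, not real values.
                "connectURI": "jdbc:mysql://db.example.com:3306/mydb",
                "user": "druid",
                "password": "changeme",
            },
        },
        "sqls": sqls,
    },
}

tuning_config = {
    "type": "index_parallel",
    # Upper bound on subtasks running at once; two queries, so two is enough.
    "maxNumConcurrentSubTasks": 2,
}

print(json.dumps({"ioConfig": io_config, "tuningConfig": tuning_config}, indent=2))
```

With a spec shaped like this, the number of subtasks follows the number of entries in sqls, which matches what was observed in the test with multiple queries.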

I agree - https://druid.apache.org/docs/latest/ingestion/native-batch.html#size-based-split-hint-spec is a bit confusing because it only says that it doesn’t apply to HTTP. Let me know if the stuff above helps solve the issue first, and if not I can ask around to see whether splitHintSpec is even applicable to SQL ingestion…

Thanks for your response. As I said, I’ve tested with multiple SQL queries, and multiple worker tasks got created as expected.

Yes, you’re right. I would like to know if we can use splitHintSpec for SQL ingestion. I’ve tried to get a single SQL query split that way, but it didn’t seem to work.

I’ll ask around amongst some peeps I know and see what comes back. Maybe a docs update needed…

Hey so yeah it looks like that’s a docs issue.

“Not all batch InputSource implementations consider the SplitHintSpec (the SqlInputSource and HttpInputSource are examples that will ignore anything set there) and split solely based on sqls in the case of SqlInputSource and uris in the case of HttpInputSource.”

Those docs do need updating I think…
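
Since the split is based solely on the sqls list, one workaround is to do the splitting yourself before submitting the task. Below is a minimal sketch in Python that turns one logical query into one query per day. The helper name split_query_by_day, the table "events", and the column "ts" are all hypothetical, and it assumes your table has a time column you can filter on with half-open ranges.

```python
from datetime import date, timedelta


def split_query_by_day(table, time_column, start, end):
    """Return one SELECT per day covering the half-open range [start, end)."""
    queries = []
    day = start
    while day < end:
        next_day = day + timedelta(days=1)
        queries.append(
            f"SELECT * FROM {table} "
            f"WHERE {time_column} >= '{day.isoformat()}' "
            f"AND {time_column} < '{next_day.isoformat()}'"
        )
        day = next_day
    return queries


# Three days of data become three queries, one per expected subtask.
sqls = split_query_by_day("events", "ts", date(2021, 1, 1), date(2021, 1, 4))
for query in sqls:
    print(query)
```

The resulting list would go into ioConfig.inputSource.sqls. Half-open ranges are used so that no row is read by two subtasks.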

Thanks for your confirmation, it’s much clearer now.

Do you want me to rectify the documentation or will you take care of it?