Our organization has a huge number of tables that need to be created (which translates to a huge number of ingestion specs and many different schemas).
I couldn’t find any information on how Druid handles a large number of different schemas.
I wonder whether Druid’s performance will degrade at some point if it has to handle millions of different ingestion specs, and whether I can recover from such problems, if they arise, with some workarounds.
These are the pitfalls I can think of:
- Too many open connections to different sources could cause Druid to crash. Could I overcome this by making all ingestion a push stream? Would that solve it?
- Too many ingestion spec files could turn the metadata itself into a big data problem, causing storage or performance issues when loading the specs. I’m not sure whether this is even supposed to be an issue, but if it is, can I scale out the metadata handling somehow?
These are the issues I can think of for now. I would love to hear about any other issues you foresee that I might be missing. Furthermore, I would love to know where you think the limit on the number of specs lies.
The ingestion tasks connect to the sources, so connecting to a large number of sources should not be a problem. However, running millions of tasks will likely cause trouble for the Overlord, which manages ingestion. Depending on the actual number of tasks, the Overlord will need to be scaled (there is one leading Overlord per cluster, so the scaling has to be vertical). The metadata store will also need to be scaled if you are bringing all these different schemas into different datasources in Druid.

What kind of data is this? How is it coming into Druid? I am curious what these millions of ingestion specs are. If this is batch ingestion of many small files, you could run a few thousand tasks at a time and manage it that way.
We have many tables we want to insert into, with many different schemas.
I’ve seen that the push approach (i.e., batch ingestion) should not be used in a micro-batch fashion; instead, I should use streaming ingestion.
The problem is that I would have to create a new topic for each table I’m creating.
The data is streaming data in JSON format, and the scale can be up to hundreds of thousands of data sources (which correlates to hundreds of thousands of Kafka topics, I guess).
Multiple ingestion tasks can read from the same topic (you could filter on some column and thus segregate the data). However, creating hundreds of thousands of datasources in Druid is not recommended. The better approach is to create one datasource and use a filter in your queries; each row can have a completely different set of columns.
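To illustrate the first suggestion, here is a minimal sketch of a Kafka supervisor spec in which one ingestion task reads from a shared topic and keeps only the rows belonging to one logical table, using a `filter` in the `transformSpec`. The topic name `events`, the discriminator column `table_name`, and the broker address are all hypothetical placeholders:

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "table_a",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["table_name", "user_id"] },
      "transformSpec": {
        "filter": { "type": "selector", "dimension": "table_name", "value": "table_a" }
      },
      "granularitySpec": { "segmentGranularity": "HOUR", "queryGranularity": "NONE" }
    },
    "ioConfig": {
      "topic": "events",
      "inputFormat": { "type": "json" },
      "consumerProperties": { "bootstrap.servers": "kafka01:9092" }
    },
    "tuningConfig": { "type": "kafka" }
  }
}
```

Each additional logical table would get its own supervisor with a different `dataSource` and filter value, all consuming the same topic.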
Ok, the first approach sounds interesting.
Regarding the second approach, how can I create one datasource with a different schema for each row?
I’m worried that one spec specifying every schema I have would become a giant JSON file, and that we would need to update this spec every time I add a new data source with a new schema.
The new nested column support in Apache Druid 24.0 is designed to deal with variable schemas.
You can declare some common schema elements, perhaps time, source, and other fields that are common across sources, and then ingest the rest as JSON data. The ingestion process will scan all the JSON objects in a given dataset and extract all the distinct fields into columnar structures. This is an impressive new capability in Druid because it allows for mostly schemaless ingestion while producing fully indexed data on all the possible JSON scalar fields. It doesn’t yet do well with arrays within the JSON: it can ingest them, but each entry in the array creates a separate set of columns, and they need to be queried by array index. This will soon be improved with unnesting capabilities.
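As a sketch of what that spec fragment could look like, here is a `dimensionsSpec` with a couple of explicit common columns plus a single nested `json` column holding everything else. The column names `source` and `payload` are hypothetical:

```json
{
  "dimensionsSpec": {
    "dimensions": [
      "source",
      { "type": "json", "name": "payload" }
    ]
  }
}
```

New fields arriving inside `payload` do not require a spec change, and individual scalar fields can be pulled out at query time with `JSON_VALUE`, e.g. `SELECT JSON_VALUE(payload, '$.user.id') FROM my_datasource`.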
Give it a shot and let us know how it goes.