[druid-user] Nested datasources

Hi.
We are building a data repository where users will upload various CSV files. The inconvenience is that, over time, this will create a huge number of datasources that we will have to work with. Our idea is to give each user a single datasource to which new files are added, i.e. the datasource will be constantly replenished with new records and columns. We are also considering a variant with something like nested tables.
I would like to get advice from those who have implemented something like this.
Thanks in advance.

Hi,

This might be a good starting point:

https://druid.apache.org/docs/latest/operations/security-user-auth.html

By any chance, is this related to your solution to your datasource log question from earlier this year?

Best,

Mark

Hi. Thanks for the reply.

On Tuesday, September 27, 2022 at 01:03:55 UTC+5, mark.h...@imply.io wrote:

Hi,

This might be a good starting point:

https://druid.apache.org/docs/latest/operations/security-user-auth.html

Unfortunately that’s not exactly what I need.
I need a way to store the data from the files that users upload. So far my idea is to create one datasource per user, in which each column holds a JSON array of the data from one uploaded file. It looks something like this:
                file_name1  file_name2  file_name3
ingestion_date  json_array  json_array
ingestion_date  json_array  json_array
ingestion_date  json_array  json_array  json_array

But this is inconvenient in terms of data selection.

By any chance, is this related to your solution to your datasource log question from earlier this year?

No, this question is not related to my previous one.

I took the liberty of posting your question to Slack. One of the founders asked for some clarification and also suggested the following:

  • I’m not sure what “inconvenient in terms of data selection” means here — it would be helpful to know what kind of selection you are wanting to do?

  • the desire here sounds similar-ish to the desires people have when implementing multi-tenant workloads, so this doc may be useful: https://druid.apache.org/docs/latest/querying/multitenancy.html

    • it has some info about how to think about whether to use one giant datasource, or split it up
    • the doc highlights how to consider what kind of data management and retrieval operations may be necessary, and how that influences choice of design
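
To make the two layouts from that doc concrete, here is a minimal Python sketch of how a query could differ between a shared datasource and a datasource per tenant. The datasource name `uploads`, the `tenant_id` dimension, and the `uploads_<tenant>` naming scheme are hypothetical and do not come from this thread. The payload shape assumes the Druid SQL HTTP API, which accepts a JSON body with `query` and optional `parameters`; nothing is sent here.

```python
import json
import re

# Hypothetical names, chosen only for illustration.
SHARED_DATASOURCE = "uploads"


def shared_query(tenant_id: str) -> dict:
    """Shared layout: one datasource, rows separated by a tenant dimension."""
    return {
        "query": (
            f'SELECT COUNT(*) AS row_count FROM "{SHARED_DATASOURCE}" '
            'WHERE "tenant_id" = ?'
        ),
        # The tenant value travels as a SQL parameter, not as inlined text.
        "parameters": [{"type": "VARCHAR", "value": tenant_id}],
    }


def per_tenant_query(tenant_id: str) -> dict:
    """Per-tenant layout: the tenant is encoded in the datasource name."""
    # A table name cannot be a SQL parameter, so validate it before inlining.
    if not re.fullmatch(r"[A-Za-z0-9_]+", tenant_id):
        raise ValueError(f"unsafe tenant id: {tenant_id!r}")
    return {"query": f'SELECT COUNT(*) AS row_count FROM "uploads_{tenant_id}"'}


if __name__ == "__main__":
    print(json.dumps(shared_query("acme"), indent=2))
    print(json.dumps(per_tenant_query("acme"), indent=2))
```

In the shared layout every query needs the tenant filter, while in the per-tenant layout the number of datasources grows with the number of users, which is the trade-off raised at the start of this thread.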

Let us know if any of this helps with your use case.

Best,

Mark

Hi.

On Tuesday, October 4, 2022 at 03:02:39 UTC+5, mark.h...@imply.io wrote:

I took the liberty of posting your question to Slack. One of the founders asked for some clarification and also suggested the following:

  • I’m not sure what “inconvenient in terms of data selection” means here — it would be helpful to know what kind of selection you are wanting to do?

I meant that SQL queries against datasources become inconvenient to write when nested columns are used.
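
As an illustration of the kind of query involved, here is a minimal Python sketch that builds a Druid SQL statement extracting one field from a nested column. It assumes the per-file columns are ingested as Druid nested (`COMPLEX<json>`) columns and uses the `JSON_VALUE(expr, path RETURNING type)` function from Druid SQL. The datasource `user_42`, the column `file_name1`, and the path `$[0].price` are hypothetical.

```python
def nested_field_query(datasource: str, column: str, path: str,
                       sql_type: str = "VARCHAR", limit: int = 10) -> str:
    """Build a Druid SQL query that pulls one value out of a nested column.

    All identifiers passed in are assumed to be trusted; this sketch does
    no escaping or validation.
    """
    return (
        f'SELECT "__time", '
        f"JSON_VALUE(\"{column}\", '{path}' RETURNING {sql_type}) AS \"extracted\" "
        f'FROM "{datasource}" '
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # Hypothetical datasource, column, and JSON path.
    print(nested_field_query("user_42", "file_name1", "$[0].price", "DOUBLE"))
```

Every field of every uploaded file needs its own path expression like this, which is one way the layout with one JSON column per file makes selection verbose.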

Thank you. Yes, this article came in handy for describing to our customer the options we offer for creating datasources.
The customer needs to choose between shared datasources and datasource-per-tenant.

Hey, I will lie again.

I won't let you do that; I have evidence of your great sabotage of the data formation.
Tsarist Russian scribble,

I told you once and regretted it,