Multiple ingestions into the same segment

Hi,

I'm new to the Druid universe, and I finally succeeded in loading some data into a Druid cluster.

I submitted a task to the Overlord and the MiddleManager loaded the data.

I can query it and obtain an answer:

[ {
  "version" : "v1",
  "timestamp" : "2016-04-01T00:00:00.000Z",
  "event" : {
    "rows" : 747432
  }
} ]

Now I added a second file into the same interval, and it worked perfectly; in S3 I have several folders under "2016-04-01T00:00:00.000Z_2016-04-02T00:00:00.000Z", like:

- 2016-05-24T13:17:55.812Z
- 2016-05-24T13:19:09.938Z

But when I run the same query, I get:

[ {
  "version" : "v1",
  "timestamp" : "2016-04-01T00:00:00.000Z",
  "event" : {
    "rows" : 61763
  }
} ]

Only the last file I added is taken into account when I query the interval "2016-04-01T00:00:00.000Z_2016-04-02T00:00:00.000Z".

I'm a bit lost at this point: do I have to merge the files after ingestion, or can I append the second file's data to the first file's data during ingestion?

Thanks,

Ben

It seems like Druid replaces the old segment with the new one.
So if I want to load multiple files into one segment, I have to load them all at the same time, or merge the segments afterwards. Am I right?

And one more question: if I want to load all my files at once, can multiple peons work on it? Right now it creates one task and only one peon handles it.

Hi Benjamin,
Druid follows an MVCC architecture where a newly indexed segment replaces the older one for the same interval.

When re-indexing a specific interval, you will need to provide all the input files that constitute the data for that interval.
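For example, to re-index 2016-04-01 you would point the task at every file for that day. A minimal sketch, assuming local JSON files (the paths are made up, and the parser and metrics are omitted):

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "website",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "intervals": ["2016-04-01/2016-04-02"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/data/2016-04-01",
        "filter": "*.json"
      }
    }
  }
}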

Index tasks are not very scalable right now and are generally only used to get data into Druid for POCs.

Given your use case, the Hadoop Index Task seems like a perfect fit: it uses an MR job to do batch ingestion and is much more scalable.
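With the Hadoop Index Task the ioConfig just lists all the input files at once. A sketch, assuming the files live in S3 (bucket and paths are made up, dataSchema elided):

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": { ... },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "s3n://my-bucket/2016-04-01/site1.json,s3n://my-bucket/2016-04-01/site2.json"
      }
    }
  }
}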

Thanks for your answer, Nishant!
I think we will use the Hadoop Index Task to get better scalability.

But I have a few more questions:

1 - The MiddleManager seems useless now, or at least doesn't need a lot of resources. Am I right?

2 - If the Hadoop Index Task replaces older indexed segments with new ones, should I use an index task for the first ingestion and update the older segment with the multi inputSpec (http://druid.io/docs/latest/ingestion/update-existing-data.html)?

That solution doesn't seem very scalable, because if I want to build one day's segment from 10 files, I need to run one task to create the segment and then 9 update tasks one after the other (10 different websites' data for one day -> one segment).

Thanks,

Ben

See Inline -


> 1 - The MiddleManager seems useless now, or at least doesn't need a lot of resources. Am I right?

Yeah, you don't need many resources for the peons; all the indexing is done on your Hadoop cluster. The peon just submits the MR job, waits for its completion, and finally adds the segment metadata to the metadata store.
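If it helps, the relevant MiddleManager runtime.properties look something like this (the exact values here are guesses, size them for your cluster):

# Peons only submit and monitor the MR job, so they can stay small
druid.worker.capacity=3
druid.indexer.runner.javaOpts=-server -Xmx512m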

> 2 - If the Hadoop Index Task replaces older indexed segments with new ones, should I use an index task for the first ingestion and update the older segment with the multi inputSpec (http://druid.io/docs/latest/ingestion/update-existing-data.html)?

> That solution doesn't seem very scalable, because if I want to build one day's segment from 10 files, I need to run one task to create the segment and then 9 update tasks one after the other (10 different websites' data for one day -> one segment).

You don't need an index task for the initial ingestion; you can do that with a Hadoop Index Task as well.
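If you later do need to append a single new file to an already-indexed day, the page you linked uses a "multi" inputSpec that combines the existing segments with the new file. Roughly (dataSource name and path are made up):

"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "multi",
    "children": [
      {
        "type": "dataSource",
        "ingestionSpec": {
          "dataSource": "website",
          "intervals": ["2016-04-01/2016-04-02"]
        }
      },
      {
        "type": "static",
        "paths": "s3n://my-bucket/2016-04-01/site10.json"
      }
    ]
  }
}

That way the existing segments for the day are re-read and combined with the new file in a single Hadoop job, instead of 9 sequential update tasks.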