Re-indexing task issue

Hi all!

I am trying to re-index my old segments and change their query and segment granularity from HOUR to DAY.

It seems to work fine when I re-index unsharded segments (produced by my batch pipeline). However, some of my segments are sharded (produced by my real-time nodes), and when I try to re-index those it seems I am losing data.

I have attached an example of the sharded segments I am trying to re-index (this is what I have in my database). I have also attached the logs of the re-index task I am using.

I have 3 shards per hour, and it seems the re-index task only takes one shard into consideration, because the numbers I get when I query my new segments are roughly the old numbers divided by 3.

Do you know why this is happening? I am using Druid 0.7.0 right now, but the segments I am trying to re-index were created with Druid 0.6.x (I upgraded my cluster a few days ago).

Thank you for your answers!

Guillaume

auctions_sharded_segments_2015-02-25_2015-02-26.csv (17.5 KB)

re_index_segments_task_2015-02-25_2015-02-26.txt (244 KB)

Are you re-indexing off of the raw data, or are you using the segment firehose?

I ask because this tangentially related PR went in recently:

I am using the ingest segment firehose! But I cannot see any NPE in my task log.
What should I look for to see whether I’m hitting the same issue?
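
In case it helps, here is the shape of the task I am submitting, as a trimmed sketch (the dataSource name is a placeholder for mine, and I have left out the parser and metricsSpec from the dataSchema):

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "auctions",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "DAY",
        "intervals": ["2015-02-25/2015-02-26"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "ingestSegment",
        "dataSource": "auctions",
        "interval": "2015-02-25/2015-02-26"
      }
    }
  }
}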

You might not see an NPE, just dropped data as a side effect of the thing that was causing the NPE. As a fun side effect of the fix, partitioned data will either all succeed or all fail when using the segment firehose (all or nothing).

The fix is in 0.7.1. Are you able to try 0.7.1 instead of 0.7.0?

If I want to try to re-index my segments with 0.7.1, I will have to update both my overlord and middle manager nodes, right?

I would prefer to avoid upgrading my indexing service to an unreleased version, and if I’m right, 0.7.1 has not been released yet.

As a workaround, could I merge my sharded segments first and then use the ingest segment firehose to re-index the merged segments? Do you think that would work?

I’m now trying to merge my sharded segments before I re-index them.
However, I get this exception from Jetty when I submit the task to the overlord:

javax.servlet.ServletException: com.fasterxml.jackson.databind.JsonMappingException: Instantiation of [simple type, class io.druid.indexing.common.task.MergeTask] value failed: segments without NoneShardSpec

It seems to me I can only merge unsharded segments whose shardSpec type property is set to “none”. Am I correct? If so, could you explain why this is not possible?

I have attached the merge task I am submitting if you want to take a look.
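
In short, it is shaped like this (a minimal sketch of the attached task: the segment list is abridged to a single descriptor, and the version and aggregation values are illustrative):

{
  "type": "merge",
  "dataSource": "rtb_auctions",
  "segments": [
    {
      "dataSource": "rtb_auctions",
      "interval": "2015-02-25T00:00:00.000Z/2015-02-25T01:00:00.000Z",
      "version": "2015-02-25T00:00:00.000Z",
      "shardSpec": { "type": "linear", "partitionNum": 0 }
    }
  ],
  "aggregations": [
    { "type": "longSum", "name": "count", "fieldName": "count" }
  ]
}

Each descriptor carries a shardSpec of type “linear” rather than “none”, which seems to be exactly what the task rejects.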

Guillaume

merge_rtb_auctions_2015-02-25_2015-02-26.json (2.35 KB)

Hi Torche, you can only merge non-sharded segments. The idea is that if you need to shard data, it is because each segment is already so large that merging makes no sense.

In the future, it might make sense to support merging of sharded segments. I could see it being useful to re-shard segments if improvements in the storage format make re-sharding data more attractive than re-indexing everything.

The upcoming changes in dimension compression I’m working on, for instance, could provide significant storage benefits without having to spend a large amount of resources indexing from scratch.

It is also useful to shard segments when you want to scale out real-time ingestion, right?
In my case I am using 6 partitions for some of my real-time tasks. Even though the total size of the shards is only 60 MB, this allows me to divide the ingestion work across 6 peons instead of one.

I could probably use one partition for these real-time tasks if I decreased the number of peons and increased the number of processing threads on my middle managers. But some of my real-time tasks don’t need that many processing threads and can currently be handled by one peon. Decreasing the number of peons and increasing the number of processing threads would then waste resources on those tasks, because they would still be handled by a single peon with far more processing threads than needed…
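
For context, each of these real-time tasks carries a linear shardSpec in its tuningConfig, along these lines (a minimal sketch with the dataSchema and ioConfig elided; partitionNum runs from 0 to 5 across the 6 tasks):

{
  "type": "index_realtime",
  "spec": {
    "tuningConfig": {
      "type": "realtime",
      "shardSpec": {
        "type": "linear",
        "partitionNum": 0
      }
    }
  }
}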

I have another question about the re-indexing task. Has anyone tried to re-index both sharded and unsharded segments within the same re-index task? Should this work with version 0.7.1?

I am wondering because if I decide to upgrade to 0.7.1 I will definitely have this use case.

Hi Torche, please see inline.

It is also useful to shard segments when you want to scale out real-time ingestion, right?

Yes, scaling out ingestion is the primary use case we have for sharding. Sometimes we will aim for smaller segments than our normal 500MB to keep query latencies reasonable.

In my case I am using 6 partitions for some of my real-time tasks. Even though the total size of the shards is only 60 MB, this allows me to divide the ingestion work across 6 peons instead of one.

Usually we try to size segments at around 500MB.

I could probably use one partition for these real-time tasks if I decreased the number of peons and increased the number of processing threads on my middle managers. But some of my real-time tasks don’t need that many processing threads and can currently be handled by one peon. Decreasing the number of peons and increasing the number of processing threads would then waste resources on those tasks, because they would still be handled by a single peon with far more processing threads than needed…

You can always experiment with different combinations to see what the results will be. IMO, you can probably get away with fewer shards, as your segments are very small right now.

I have another question about the re-indexing task. Has anyone tried to re-index both sharded and unsharded segments within the same re-index task? Should this work with version 0.7.1?

I am wondering because if I decide to upgrade to 0.7.1 I will definitely have this use case.

We haven’t declared 0.7.1 stable quite yet. We’ve been running with rc1 in production for a few weeks but there were some small things added after rc1 that we want to test before declaring anything.

Hi Xavier,

I’m using Druid 0.10.0. When I submit a merge task, I’m getting the following error.

{
  "error": "Instantiation of [simple type, class io.druid.indexing.common.task.MergeTask] value failed: segments without NoneShardSpec"
}

You said that:

In the future, it might make sense to support merging of sharded segments.

I’m wondering if merging of sharded segments is supported nowadays.

Regards,

Jason

On Wednesday, April 1, 2015 at 1:43:02 PM UTC+9, Xavier Léauté wrote: