Question about segment merging tasks

Hi all,

I have a question about the difference between the append task and the merge task. I’m trying to merge two segments which cover the same time interval and have the same dimensions, but different data values. Is that possible using these merging tasks? I know that I could do this by re-indexing the raw data from HDFS, for example, but I don’t know whether it can be done using only the segments, without the raw data. Can someone give me some information about this?

Regards,

Andres

The append task is intended to combine adjacent segments that are for different intervals, whereas the merge task is intended to combine segments for overlapping or identical intervals. In either case you should provide the full list of segments for the overall interval you want to combine.
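To illustrate with the hourly segments from this thread:

append:  segment for 2015-05-04T18/19 + segment for 2015-05-04T19/20  ->  one segment covering 2015-05-04T18/20
merge:   two segments that both cover 2015-05-04T18/19                ->  one combined segment for 2015-05-04T18/19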

Also, IIRC both tasks have a limitation: they only work with non-sharded data.

Hi Gian and Nishant,

I have tried a test like this:

{
  "type": "merge",
  "id": "merge1111",
  "dataSource": "rb_flow",
  "segments": ["2015-05-04T18:00:00/2015-05-04T19:00:00", "2015-05-04T19:00:00/2015-05-04T20:00:00"]
}

but it doesn’t work:

Error 500

HTTP ERROR: 500
Problem accessing /druid/indexer/v1/task. Reason:

    javax.servlet.ServletException: com.fasterxml.jackson.databind.JsonMappingException: Can not instantiate value of type [simple type, class io.druid.timeline.DataSegment] from String value ('2015-05-04T18:00:00/2015-05-04T19:00:00'); no single-String constructor/factory method
     at [Source: HttpInputOverHTTP@446baaa; line: 1, column: 89] (through reference chain: java.util.ArrayList[0])

Powered by Jetty://

Can you give me an example of the correct segments format? And I have another question: Nishant, you told me these tasks only work with non-sharded data. In that case, I can’t use these tasks to merge segments which were created by a realtime index task with more than one partition, can I?

Regards,

Andres

Andres, FWIW, I would recommend just letting the coordinator automatically merge segments for you. You can enable automatic merging in the coordinator’s runtime.properties.
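As a rough sketch of what that configuration looked like in the 0.7.x/0.8.x-era releases (double-check the property and field names against the docs for your version), you enable it in the coordinator runtime.properties:

druid.coordinator.merge.on=true

and then tune the merge behavior through the coordinator dynamic config (posted to /druid/coordinator/v1/config), e.g. the mergeBytesLimit and mergeSegmentsLimit fields:

{
  "mergeBytesLimit": 524288000,
  "mergeSegmentsLimit": 100
}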

If you want to run this task by hand, you will need the full JSON of each segment in order to run the merge task. If you look inside your metadata store, for example, you can see what the JSON form of a segment looks like.
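For example, assuming the default segments table name druid_segments (it depends on your druid.metadata.storage.tables.base setting) and a MySQL-flavored metadata store, something like this pulls the segment JSON for the hours in question:

SELECT payload
FROM druid_segments
WHERE dataSource = 'rb_flow'
  AND used = 1
  AND (start LIKE '2015-05-04T18%' OR start LIKE '2015-05-04T19%');

The payload column holds the full segment JSON that you can paste into the task’s segments array.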

It should have the fields:

@JsonProperty("dataSource") String dataSource,
@JsonProperty("interval") Interval interval,
@JsonProperty("version") String version,
// use `Map` *NOT* `LoadSpec` because we want to do lazy materialization to prevent dependency pollution
@JsonProperty("loadSpec") Map<String, Object> loadSpec,
@JsonProperty("dimensions") @JsonDeserialize(using = CommaListJoinDeserializer.class) List<String> dimensions,
@JsonProperty("metrics") @JsonDeserialize(using = CommaListJoinDeserializer.class) List<String> metrics,
@JsonProperty("shardSpec") ShardSpec shardSpec,
@JsonProperty("binaryVersion") Integer binaryVersion,
@JsonProperty("size") long size

Thanks for clearing up the difference between the merge task and the append task. I had the same question.

> I would recommend just using the coordinator to automatically merge segments for you
Last time I checked this feature out, it was only working for unsharded segments or segments with only a single shard. Is this still the case?

I imagine that unsharded data is an edge case, and that in practice most people cannot do without sharding. Is this a misconception?

I had an issue with the segment merge task, as described in this topic:
https://groups.google.com/forum/#!topic/druid-user/xZQLZ8-npDg

But using the coordinator for this is a much better option, as recommended by Fangjin Yang.

Best Regards

Denis

It appeals to me too, but would coordinator-based merging work with sharded segments?