Some questions about the "shardSpec" of realtime nodes

Hi:
I have read this link: http://druid.io/docs/0.7.3/Realtime-ingestion.html#sharding.

Is the “shardSpec” used for scaling the data of a datasource across multiple realtime nodes?

For example, suppose there is a Kafka topic with 3 partitions.

If I want to improve the indexing performance of the realtime nodes, should I add 3 realtime nodes in the same consumer group for this topic and assign a different “partitionNum” to each of the 3 realtime nodes? Is that right?

Thanks,

Tao

It depends on the amount of data you are going to ingest. 3 realtime nodes for 3 partitions might be overkill if 1 realtime node can already keep up with the incoming data.

So I would say 3 realtime nodes for 3 partitions is the optimum.

Hi, Olaf:
Is my setting right?

The 3 realtime nodes are in the same Kafka consumer group and have different “partitionNum” values. Will the data from the topic be distributed across these 3 realtime nodes?

On Wednesday, June 24, 2015 at 7:28:45 AM UTC+8, Olaf Krische wrote:

Hey,

With 3 Kafka clients (via 3 realtime nodes) and 3 partitions on the Kafka topic, each Kafka client should automatically be assigned one partition to consume from. Thus each realtime node fills its own shard from its currently assigned partition. If you add another partition on the Kafka side, one realtime node might consume 2 partitions while the others consume one each, and so on. And when you shut down a realtime node, another one will take over its partition.
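To make that concrete, here is a minimal sketch of the per-node pieces of a 0.7.x realtime spec for this setup. It assumes the kafka-0.8 firehose; the ZooKeeper address, group name and topic here are placeholders, and the rest of the spec (dataSchema, plumber, and the other tuningConfig settings) is omitted. All three nodes share the same group.id, and each one declares its own partitionNum in a linear shardSpec:

```json
{
  "ioConfig": {
    "type": "realtime",
    "firehose": {
      "type": "kafka-0.8",
      "consumerProps": {
        "zookeeper.connect": "zk.example.com:2181",
        "group.id": "druid-realtime-example",
        "auto.offset.reset": "largest"
      },
      "feed": "my-topic"
    }
  },
  "tuningConfig": {
    "type": "realtime",
    "shardSpec": {
      "type": "linear",
      "partitionNum": 0
    }
  }
}
```

The second and third nodes would use the same consumerProps but partitionNum 1 and 2, so Kafka spreads the topic's partitions across the consumer group while the broker sees three distinct shards for each interval.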

A good partition key for the messages is also important. If you don't provide a good one (I made the mistake of not setting any key), then only one partition of the Kafka topic is filled with messages, and it looks like only one realtime node is ingesting data while all the others are inactive. You can monitor this; I used a Kafka offset monitor from the tools section of the Kafka website, which shows who consumes what.

Thanks very much, Olaf.
But I’m not clear about “shardSpec”.

The data from the topic will be distributed across the different realtime nodes that are in the same consumer group. If these realtime nodes have different “partitionNum” values in the “shardSpec” JSON body, does that inform Druid which data is managed by which node?

For example, there are 3 realtime nodes with different “partitionNum” values in the same consumer group, so the data from a topic with 3 or more partitions will be distributed across these 3 realtime nodes. When we query this data, will the broker ask all 3 realtime nodes for results and aggregate them? And if we don’t set the “shardSpec”, will the broker randomly select only one realtime node to get results from? Is that right?

On Wednesday, June 24, 2015 at 4:11:25 PM UTC+8, Olaf Krische wrote:

Heya,

Right.

a) If the realtime nodes have different partitionNums, then the broker will assume that they hold distinct data for the same “hour”, and it will merge the results from querying one realtime node for each partitionNum.

b) If the realtime nodes have the same partitionNum, then the broker will assume that they hold the same data for the same “hour”, and it will return the result from querying only one of them.

So basically, b) is even a special case of a) :slight_smile:
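As a rough illustration (only the shardSpec bodies are shown, one array entry per realtime node; the surrounding spec is omitted), case a) would have the three nodes declare distinct partitionNums:

```json
[
  {"shardSpec": {"type": "linear", "partitionNum": 0}},
  {"shardSpec": {"type": "linear", "partitionNum": 1}},
  {"shardSpec": {"type": "linear", "partitionNum": 2}}
]
```

The broker then merges results across partitions 0, 1 and 2. In case b), a replica simply reuses an existing partitionNum (a second node also declaring partitionNum 0), and the broker queries only one of the two for that shard.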

Hi, Olaf:

If I want to make the realtime nodes redundant, should I set up another group of realtime nodes to implement it?

For example, realtime nodes A1, B1, C1 are in consumer group 1 and are assigned different partitionNums (0, 1, 2).

Then I set up another set of realtime nodes A2, B2, C2 in consumer group 2 for redundancy, also assigned partitionNums (0, 1, 2).
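To illustrate, the layout I have in mind would be something like this (not a real Druid spec, just shorthand for which consumer group and shardSpec each node would use; the group names are placeholders):

```json
[
  {"node": "A1", "group.id": "group-1", "shardSpec": {"type": "linear", "partitionNum": 0}},
  {"node": "B1", "group.id": "group-1", "shardSpec": {"type": "linear", "partitionNum": 1}},
  {"node": "C1", "group.id": "group-1", "shardSpec": {"type": "linear", "partitionNum": 2}},
  {"node": "A2", "group.id": "group-2", "shardSpec": {"type": "linear", "partitionNum": 0}},
  {"node": "B2", "group.id": "group-2", "shardSpec": {"type": "linear", "partitionNum": 1}},
  {"node": "C2", "group.id": "group-2", "shardSpec": {"type": "linear", "partitionNum": 2}}
]
```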

If realtime node A1 crashes, the broker will get the result from A2.

But we can’t guarantee that the data on realtime node A2 is the same as on A1, because the assignment of Kafka partitions to consumers is arbitrary.

So how can I make the realtime nodes redundant?

On Wednesday, June 24, 2015 at 11:25:55 PM UTC+8, ol…@adsquare.com wrote:

Hi,

This post might be interesting for you concerning the current state of scaling real-time ingestion:

https://groups.google.com/forum/#!searchin/druid-development/fangjin$20yang$20"thoughts"/druid-development/aRMmNHQGdhI/muBGl0Xi_wgJ

If you are reading from multiple Kafka topics across multiple realtime nodes in the same consumer group, and also trying to replicate data, I believe you will hit the problems described in that thread. There is a new firehose coming soon, or you can use the indexing service as a workaround.

– FJ