droppedCount is equal to receivedCount--how to resolve this?

Hello, I am trying to use Druid to query Kafka streams.

The log shows the following:

2017-10-20 22:26:49,389 [KafkaConsumer-CommitThread] INFO c.m.tranquility.kafka.KafkaConsumer - Flushed {MyTopic_2={receivedCount=183874, sentCount=0, droppedCount=183874, unparseableCount=0}} pending messages in 1ms and committed offsets in 5ms.

That means nothing is pushed to Druid. If I increase the windowPeriod from PT1S to PT1M, I get other errors which crash the KafkaConsumer. Basically my data volume is quite large.

Any tips? Thanks!

Hey Joanne,

The droppedCount is probably high because none of the messages are inside the windowPeriod.

I would try raising the windowPeriod and tackling the other errors you run into at that time (maybe scale out?). Another option is using the Kafka indexing service, a built-in Kafka ingestion mechanism in Druid: http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html
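
For reference, a minimal supervisor spec for that page looks roughly like this (a sketch only: the dataSource, topic, timestamp column, and Kafka broker address below are placeholders you would replace with your own):

{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "my-datasource",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "timestamp", "format": "auto" },
        "dimensionsSpec": { "dimensions": [] }
      }
    },
    "metricsSpec": [ { "type": "count", "name": "count" } ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "NONE"
    }
  },
  "ioConfig": {
    "topic": "my-topic",
    "consumerProperties": { "bootstrap.servers": "localhost:9092" },
    "taskCount": 1,
    "replicas": 1,
    "taskDuration": "PT1H"
  },
  "tuningConfig": {
    "type": "kafka",
    "maxRowsPerSegment": 5000000
  }
}

You would POST that spec to the overlord at /druid/indexer/v1/supervisor, as described on that page.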

Just to add to Gian's response: how large is "quite large"?
I know of multiple clusters that are sustaining over a million
events/s ingested. Is your data significantly larger than that mark?
If not, you should feel comfortable knowing that whatever issues you
have run into should be solvable, so it would be good to know what they
are.

If you are dealing with significantly more data than that, it would
also be great to understand what specific problems you are running
into in case you are pushing a boundary that we haven't seen yet.

--Eric

Thank you, Gian. I tried the Kafka ingestion, but I don't know if I am using it correctly. What I did was run this:
curl -X POST -H 'Content-Type: application/json' -d @/path-to/supervisor-spec.json http://localhost:8090/druid/indexer/v1/supervisor

Then I do a query like this:

curl -L -H 'Content-Type: application/json' -XPOST --data-binary @/path-to/search.json http://localhost:8082/druid/v2/?pretty

The result is empty. And this time I don't even know where the logs are. Did I miss something?

I used mvn install to build the Tranquility code from GitHub (so I can have more window period choices such as PT5M; the one in the tutorial is too old) and will try that way again. That way at least showed something in the logs.

Hi Eric, thank you for your reply.
My data comes at millisecond intervals:
2017-10-20T22:52:58.230Z
2017-10-20T22:52:58.234Z

I can assume the timestamps are unique most of the time. Based on that assumption, the rate is about 1 event/ms = 1000 events/s. But I am using only one node for now.

I have a question on Druid Tranquility real-time ingestion: what time zone does it use? Currently I convert my timestamp to milliseconds:

String timeFormat = "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'";
Long mtime = (new SimpleDateFormat(timeFormat)).parse(sTime).getTime();

"timestampSpec" : {
  "column" : "mtime",
  "format" : "millis"
},

Not sure if Tranquility interprets the milliseconds the same way?

I built a newer version of Tranquility. However, it does not support more granularities.

/data/tranquility-distribution-0.9.0-SNAPSHOT/bin/tranquility kafka -configFile /path-to/kafka-clicks-local1M.json

java.lang.IllegalArgumentException: No enum constant com.metamx.common.Granularity.PT5M
java.lang.IllegalArgumentException: No enum constant com.metamx.common.Granularity.PT10M

segmentGranularity works with "minute" only.

I tried to comply with this constraint:

intermediatePersistPeriod (PT1S) ≤ windowPeriod (PT1M) < segmentGranularity (minute) and queryGranularity (none) ≤ segmentGranularity (minute)

Not sure whether PT1M < minute is OK.

This time everything still gets dropped:

2017-10-25 00:45:25,005 [KafkaConsumer-CommitThread] INFO c.m.tranquility.kafka.KafkaConsumer - Flushed {Topic_2={receivedCount=173772, sentCount=0, droppedCount=173772, unparseableCount=0}} pending messages in 2ms and committed offsets in 6ms.

I can increase the windowPeriod further, but which enum value is the next level I can use? It is hard to dig through the source code.

java.lang.IllegalArgumentException: No enum constant com.metamx.common.Granularity.10MINUTE

Any better tutorials? It is really hard to figure this out.

Thanks!

1000 events/s isn't too much data; one node should be more than enough
to handle that (depending on the shape of the data you *might* start
seeing issues around the 10k/s mark on a single node). If you really
want to use Tranquility, I'd recommend setting your intermediate
persist period to PT5M, your window period to PT15M, and your segment
granularity to hour.
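
For concreteness, in the Tranquility Kafka config file those settings would sit roughly here (a trimmed sketch; the dataSource name is a placeholder, and the parser, metricsSpec, and properties sections are omitted):

{
  "dataSources" : {
    "your-datasource" : {
      "spec" : {
        "dataSchema" : {
          "dataSource" : "your-datasource",
          "granularitySpec" : {
            "type" : "uniform",
            "segmentGranularity" : "hour",
            "queryGranularity" : "none"
          }
        },
        "tuningConfig" : {
          "type" : "realtime",
          "intermediatePersistPeriod" : "PT5M",
          "windowPeriod" : "PT15M"
        }
      }
    }
  }
}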

But the way Tranquility works is to create "tasks" via your overlord;
those tasks are what actually do the indexing. So the logs you are
seeing with Tranquility are actually the logs from pushing data into
those tasks; they won't tell you anything about the indexing itself.
In order to see the logs for indexing, you will need to look at your
overlord console, which should have the tasks listed. This is going
to be the exact same place to see all of the logs if you are using the
Kafka supervisor, so I'd recommend going back to that path (the Kafka
supervisor is the recommended method).

With the Kafka supervisor, when you load the overlord console you will
see the supervisor registered, and that will give you links to the
payload, logs, and so on (you will also see tasks on the console
and each task will have its own logs).
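
If you prefer the API over the console, the same information is
available directly from the overlord (assuming the default port 8090;
<supervisor-id> and <task-id> are placeholders):

curl http://localhost:8090/druid/indexer/v1/supervisor
curl http://localhost:8090/druid/indexer/v1/supervisor/<supervisor-id>/status
curl http://localhost:8090/druid/indexer/v1/runningTasks
curl http://localhost:8090/druid/indexer/v1/task/<task-id>/log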

The empty response could be caused by a number of different things;
without the actual query you are sending, it's hard to know what
could be mismatched. Can we go back to the Kafka supervisor
approach, and can you share the spec and query that you are sending?

--Eric

Thank you so much, Eric. I used your suggested granularities and it works like a charm. I no longer see any dropped messages:
2017-10-26 00:50:47,265 [KafkaConsumer-CommitThread] INFO c.m.tranquility.kafka.KafkaConsumer - Flushed {Topic={receivedCount=16675, sentCount=16675, droppedCount=0, unparseableCount=0}} pending messages in 0ms and committed offsets in 1ms.

I can run some queries like search.

But when I want to run a group by, this example is really hard to read:

http://druid.io/docs/latest/querying/groupbyquery.html

It does not look like a group by at all. It looks more like a select with a filter or condition.

Then I tried this:

http://druid.io/docs/latest/querying/sql.html

Unfortunately, it does not work at all. This command:

curl -XPOST -H 'Content-Type: application/json' http://localhost:8082/druid/v2/sql/ -d '{"query":"SELECT * FROM INFORMATION_SCHEMA.COLUMNS"}'

returns no response.

Any tips? Thank you very much!

Joanne


You have to turn on the SQL endpoint to get that to actually work.
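
If you haven't already, that typically means setting this in the
broker's runtime.properties and restarting the broker (property name
as of the 0.10/0.11-era SQL docs, so double-check against your
version's SQL page):

druid.sql.enable=true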

For the group by, the group by dimensions are part of the "dimensions"
field in the JSON. If you want to do the simplest query possible, it
might be best to just start with a timeseries query:

{ "queryType": "timeseries",
  "dataSource": put_your_data_source_name_here,
  "granularity": "hour",
  "interval": "2017-10-26/2017-10-27"
  "aggregations": [{"type": "count", "name", "rows"}]
}

That's essentially equivalent to

SELECT truncate_hour(timestamp), count(*) FROM
put_your_data_source_name_here GROUP BY truncate_hour(timestamp)

If you want to add other dimensions, you can switch it to a groupBy
query and just add the "dimensions" field, which will then add extra
dimensions to the group by query.
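
For example, a groupBy version of the timeseries query above might
look roughly like this ("some_dimension" is a placeholder for one of
your own dimension names), and you would POST it to the broker the
same way as your search query:

{
  "queryType": "groupBy",
  "dataSource": "put_your_data_source_name_here",
  "granularity": "hour",
  "dimensions": ["some_dimension"],
  "intervals": ["2017-10-26/2017-10-27"],
  "aggregations": [{"type": "count", "name": "rows"}]
}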

--Eric