{ "error" : "Maximum number of rows [500000] reached" }

I have a pretty small dataset loaded in Druid.
I am querying it and I constantly get this error.
Any idea how to limit the output rows, like LIMIT n? I tried threshold:5 but it does not help; it returns all the results it finds (if they are fewer than 500000).

query:
{
  "queryType": "groupBy",
  "dataSource": "avpingFull",
  "granularity": {"type": "duration", "duration": "86400000", "origin": "2015-01-01T23:55:00Z"},
  "dimensions": ["file_sha2"],
  "aggregations": [
    {"type": "count", "name": "rows"}
  ],
  "OrderByColumnSpec": {
    "dimension": "rows",
    "direction": "descending"
  },
  "intervals": ["2015-01-01T10:00:00/2016-01-01T11:01:00"]
}


The max time I have in my cluster is 2015-01-01T23:59:59.
Please advise.

Looks like it relates to maxRowsInMemory in the realtime config file, even though I tried giving a limitSpec with a value of 5. So now there is no way to query my whole dataset at once. I have only one day of data.
{
  "queryType": "groupBy",
  "dataSource": "avpingFull",
  "granularity": "all",
  "dimensions": ["file_sha2"],
  "aggregations": [
    {"type": "count", "name": "rows"}
  ],
  "limitSpec": {
    "type": "default",
    "limit": 5,
    "columns": [{
      "dimension": "rows",
      "direction": "descending"
    }]
  },
  "intervals": ["2015-01-01T00:10:00/2015-01-01T00:15:00"]
}


Another thing I noticed: using plyql I am able to query the full dataset with the same query.

plyql -h hostname:8082 -a select -q "SELECT file_sha2, count() FROM avpingFull WHERE __time > '2015-01-01T00:00:00' AND __time < '2015-01-01T23:59:59' GROUP BY file_sha2 LIMIT 5"
This gives me the result I need.

Hey Amey,

You’re running into a limit on the size of the pre-limit result set for groupBy queries (by default this is 500000). plyql is likely working because your query could actually be run as a topN (it’s a single dimension ordered by a metric), and so plyql is probably doing that. You can see the query it’s issuing in verbose mode (--verbose).

If topNs work for you, you should use them, as they are much more efficient than groupBys. They are approximate though and may have some ‘noise’ (especially near the bottom of the results) due to potentially not including 100% of the data points.
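For reference, a topN version of your query might look roughly like this (a sketch; the threshold, interval, and metric name are just taken from your examples above):

{
  "queryType": "topN",
  "dataSource": "avpingFull",
  "granularity": "all",
  "dimension": "file_sha2",
  "metric": "rows",
  "threshold": 5,
  "aggregations": [
    {"type": "count", "name": "rows"}
  ],
  "intervals": ["2015-01-01T00:00:00/2015-01-01T23:59:59"]
}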

Ok, I found the default properties in the documentation. Thanks for the help. Can I restart the realtime node without losing the data? The time range the realtime node is serving is stored in its memory (I guess), because there is no segment data in the basePersist directory; it is empty. I am not 100 percent sure what happens if I restart the realtime node with changed configurations.

Should be fine.

I did restart the realtime node and I lost the last two segments when I restarted. The basePersist folder is empty.
Now the max time I get is:
[
  {
    "max___time": {
      "type": "TIME",
      "value": "2015-01-01T23:49:59.000Z"
    }
  }
]
which should be 23:59:59. Why would this happen?
Another problem: I updated the groupBy settings to the following on the historical nodes, but I still get the same error of 500000 max rows.
druid.query.groupBy.maxIntermediateRows=60000000
druid.query.groupBy.maxResults=24000000

Amey, I’m guessing your setup is misconfigured. How often are you persisting?

Addressing "Another problem: I updated the groupBy settings to the following on the historical nodes, but I still get the same error of 500000 max rows":

Try setting these properties on the broker nodes as well.
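For example, the same two properties can go into the broker's runtime.properties (a sketch, reusing the values from your historical config):

druid.query.groupBy.maxIntermediateRows=60000000
druid.query.groupBy.maxResults=24000000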

Thanks, changing the broker node settings solved the problem.

The intermediatePersist period is 10 mins (the default). From what I understand, the realtime node keeps buffering data until the segment reaches its granularity and data with a later timestamp is received. It will still be in memory until the window period ends. After that the realtime node will write the data to local disk and commit to Kafka that the messages have been received. If something goes wrong with the realtime node before it persists to local disk, it will re-consume the uncommitted data from Kafka. In my case that didn't happen. I checked the Kafka topic and all the data is gone now: since the retention period in the Kafka cluster is over, the data has been deleted from Kafka as well. How should I handle this case?

Can we make the realtime node write out all the data it has? Once our data stream is finished, the realtime node will still hold the last chunk in its memory, which might be lost later. In my case, the last 10 minutes of data should be written to disk and loaded on the historical nodes rather than being served by the realtime node.
Here is my realtime spec file:
[
  {
    "dataSchema" : {
      "dataSource" : "ping",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "tsv",
          "columns" : ["some dimensions"],
          "delimiter" : "\t",
          "timestampSpec" : {
            "column" : "server_ts",
            "format" : "yyyy-MM-dd HH:mm:ss"
          },
          "dimensionsSpec" : {
            "dimensions" : ["some dimensions"],
            "dimensionExclusions" : [],
            "spatialDimensions" : []
          }
        }
      },
      "metricsSpec" : [{
        "type" : "count",
        "name" : "count"
      }],
      "aggregations" : [{
        "type" : "longSum", "name" : "numIngestedEvents", "fieldName" : "count"
      }],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "FIVE_MINUTE",
        "queryGranularity" : "NONE"
      }
    },
    "ioConfig" : {
      "type" : "realtime",
      "firehose" : {
        "type" : "kafka-0.8",
        "consumerProps" : {
          "zookeeper.connect" : "hostname:2181",
          "zookeeper.connection.timeout.ms" : "15000",
          "zookeeper.session.timeout.ms" : "15000",
          "zookeeper.sync.time.ms" : "5000",
          "group.id" : "avping-full",
          "fetch.message.max.bytes" : "1048586",
          "auto.offset.reset" : "largest",
          "auto.commit.enable" : "false"
        },
        "feed" : "pingFull"
      },
      "plumber" : {
        "type" : "realtime"
      }
    },
    "tuningConfig" : {
      "type" : "realtime",
      "maxRowsInMemory" : 250000000,
      "intermediatePersistPeriod" : "PT10m",
      "windowPeriod" : "PT5m",
      "basePersistDirectory" : "/mnt/druid/realtime/basePersist",
      "rejectionPolicy" : {
        "type" : "messageTime"
      }
    }
  }
]


I used the messageTime rejection policy to show the realtime ingestion functionality to my team; I am not sure whether that can cause problems. Other than that, everything is pretty standard. Please let me know, as this data loss will be a big question from my team.