I have ingested about 10 million rows of data into Druid, but when I run SELECT COUNT(*), it returns only 40K.
When I run:
SELECT * FROM sys.segments WHERE datasource = 'my_datasource'
it shows every segment with is_published = 1 and is_available = 1, and the num_rows value for each segment looks correct.
I have a single-node deployment with only one Historical and one MiddleManager, and I set the replication factor to 1.
What could be causing my count to be lower than the sum of num_rows across the segments?
Any help is appreciated.
Do you have auto-limit turned on?
And also, what happens when you break it down by time periods? You could add a GROUP BY and use the DATE_TRUNC function:
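For example, a per-day breakdown might look like the sketch below (assuming the datasource is named my_datasource and uses the standard __time column; adjust the names and the truncation unit to match your setup):

```sql
-- Row counts bucketed by day; Druid SQL supports DATE_TRUNC
-- and positional references in GROUP BY / ORDER BY.
SELECT
  DATE_TRUNC('day', __time) AS "day",
  COUNT(*) AS row_count
FROM my_datasource
GROUP BY 1
ORDER BY 1
```

If the per-bucket counts only cover a narrow time range, that would suggest the query is hitting a subset of the segments.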
Thanks for your quick reply Peter.
I do not have the auto-limit on.
Is SELECT COUNT(*) FROM my_table also affected by auto-limit?
The count with DATE_TRUNC in the GROUP BY still totals 40K.
Any more thoughts on what might be causing this?
Thank you again.
If you’re using the druid console, you could try turning off “use approximate topN” and “use approximate count(distinct)” (for good measure) in the query options. (In the three dots next to “Run”.)
Also, when you say you ingested 10M rows, was that input rows, and is there rollup? Is it possible you got some significant rollup?
We ingested the data with rollup = False in the granularitySpec.
When I execute SELECT * FROM sys.segments WHERE datasource = 'my_datasource' and sum the num_rows values, I see the correct total, but COUNT(*) on my_datasource returns the count from one segment only.
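For reference, the metadata-side total can be computed directly in one query (a sketch, using the same placeholder datasource name as above):

```sql
-- Sum of per-segment row counts reported by the metadata table,
-- restricted to segments that are actually loaded on Historicals.
SELECT SUM(num_rows) AS total_rows
FROM sys.segments
WHERE datasource = 'my_datasource'
  AND is_available = 1
```

Comparing this total against SELECT COUNT(*) FROM my_datasource makes the discrepancy between the segment metadata and the query result explicit.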
Any more thoughts?
I’m not sure. Would it be possible to attach the ingestion spec, ingestion task log, and the output of
SELECT segment_id, is_available, is_overshadowed FROM sys.segments WHERE datasource = 'my_datasource'
as well as the Historical and MiddleManager logs?