Segment and 'num rows' doesn't match actual rows? (from unified-console)

Running: druid 0.14.0-incubating

Hi all, I am trying to understand/test updating/re-indexing a segment, and trying to prove to myself how I can identify which segments need updating.

I have loaded 6 hours of data using 15-minute granularity (using a kafka-indexer with 1 partition)

so I have segments like…:

  1. updatetest_6hours_jan_2019-01-01T00:00:00.000Z_2019-01-01T00:15:00.000Z_2019-05-08T11:58:12.052Z

  2. updatetest_6hours_jan_2019-01-01T00:15:00.000Z_2019-01-01T00:30:00.000Z_2019-05-08T11:58:54.826Z

  1. updatetest_6hours_jan_2019-01-01T05:45:00.000Z_2019-01-01T06:00:00.000Z_2019-05-08T12:15:14.157Z

Using the UI, the first segment is shown with 846,603 rows.

I assume this will contain all data rows where:

2019-01-01T00:00:00.000Z <= __time < 2019-01-01T00:15:00.000Z
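To make that assumption concrete, here is a minimal sketch in plain Python (not Druid code) of how a timestamp maps to a 15-minute bucket when intervals are start-inclusive and end-exclusive. The function name `segment_interval` is hypothetical and only for illustration.

```python
from datetime import datetime, timedelta, timezone


def segment_interval(ts, minutes=15):
    """Return the half-open [start, end) bucket that contains ts."""
    bucket = timedelta(minutes=minutes)
    # Midnight of the same day, keeping the original timezone.
    day_start = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    # Number of whole buckets elapsed since midnight (floor division).
    index = (ts - day_start) // bucket
    start = day_start + index * bucket
    return start, start + bucket


utc = timezone.utc
# A row stamped exactly 00:15:00 belongs to the second bucket, not the first.
on_boundary = segment_interval(datetime(2019, 1, 1, 0, 15, 0, tzinfo=utc))
# A row stamped exactly 00:00:00 belongs to the first bucket.
at_start = segment_interval(datetime(2019, 1, 1, 0, 0, 0, tzinfo=utc))
```

Under this assumption, a row at exactly 00:00:00 is part of the first segment, while a row at exactly 00:15:00 is part of the next one.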


so I would assume running the query:

select count(*) from updatetest_6hours_jan where __time > '2019-01-01 00:00:00' and __time < '2019-01-01 00:15:00';

would yield the same number…but it doesn’t…I get



so…thinking maybe the query should be?

select count(*) from updatetest_6hours_jan where __time > '2019-01-01 00:00:00' and __time <= '2019-01-01 00:15:00';

I get:



I’m guessing …I’m missing something?

or…is the ‘num rows’ per segment displayed an estimate and not the actual row count?
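One thing that can be checked independently of Druid is how much the comparison operators alone change a count. The sketch below is plain Python with three made-up timestamps (they are not taken from the real datasource); it compares the half-open predicate that matches the segment interval with the two predicates used in the queries above. It only illustrates the boundary effect and may not account for the whole difference seen here.

```python
from datetime import datetime

start = datetime(2019, 1, 1, 0, 0, 0)
end = datetime(2019, 1, 1, 0, 15, 0)

# Hypothetical event times: one on each boundary and one in between.
rows = [
    datetime(2019, 1, 1, 0, 0, 0),   # exactly on the start boundary
    datetime(2019, 1, 1, 0, 7, 30),  # strictly inside the interval
    datetime(2019, 1, 1, 0, 15, 0),  # exactly on the end boundary
]

# start <= t < end: the half-open form that matches the segment interval.
half_open = sum(start <= t < end for t in rows)

# start < t < end: same shape as the first query, drops the row at 00:00:00.
both_strict = sum(start < t < end for t in rows)

# start < t <= end: same shape as the second query, drops the row at 00:00:00
# and picks up the row at 00:15:00, which belongs to the next segment.
strict_then_closed = sum(start < t <= end for t in rows)

print(half_open, both_strict, strict_then_closed)
```

In Druid SQL the half-open form would be written with `__time >= ... and __time < ...`, so neither of the two queries above selects exactly the rows covered by the first segment's interval.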



I believe the default behavior for count is to use an approximation.

Rommel Garcia

From the docs:
"Whether to use an approximate cardinality algorithm for COUNT(DISTINCT foo)." Default: true
For count distinct, the default is to use an approximate cardinality algorithm. I didn’t find a reference for plain count without distinct.