Segment and 'num rows' doesn't match actual rows? (from unified-console)

Running: druid 0.14.0-incubating

Hi all, I am trying to understand/test updating/re-indexing a segment, and trying to prove to myself how I can identify which segments need updating.

I have loaded 6 hours of data, using 15 minute granularity…(using a kafka-indexer with 1 partition)

so I have segments like…:

  1. updatetest_6hours_jan_2019-01-01T00:00:00.000Z_2019-01-01T00:15:00.000Z_2019-05-08T11:58:12.052Z

  2. updatetest_6hours_jan_2019-01-01T00:15:00.000Z_2019-01-01T00:30:00.000Z_2019-05-08T11:58:54.826Z

  1. updatetest_6hours_jan_2019-01-01T05:45:00.000Z_2019-01-01T06:00:00.000Z_2019-05-08T12:15:14.157Z

using the UI, it shows the first segment:

updatetest_6hours_jan_2019-01-01T00:00:00.000Z_2019-01-01T00:15:00.000Z_2019-05-08T11:58:12.052Z

with 846,603 Rows

I assume this will contain all data rows where:

2019-01-01T00:00:00.000Z <= __time < 2019-01-01T00:15:00.000Z

so I would assume running the query:

select count(*) from updatetest_6hours_jan where __time > '2019-01-01 00:00:00' and __time < '2019-01-01 00:15:00';

would yield the same number…but it doesn’t…I get

845,733

so…thinking maybe the query should be?

select count(*) from updatetest_6hours_jan where __time > '2019-01-01 00:00:00' and __time <= '2019-01-01 00:15:00';

I get:

846,600

I’m guessing …I’m missing something?

or…is the ‘num rows’ per segment displayed an estimate and not the actual row count?

Thanks

Dan
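
One thing worth checking in the two queries above: both use `__time > '2019-01-01 00:00:00'`, which drops any row stamped exactly at the segment start, while a Druid segment interval is start-inclusive and end-exclusive. The sketch below shows the effect on a toy table. It is only an illustration under stated assumptions: it uses SQLite rather than Druid, and the `events` table, its rows, and the `count` helper are made up.

```python
import sqlite3

# Toy stand-in for the datasource: SQLite, not Druid. Timestamps are ISO
# strings, so plain string comparison orders them chronologically.
rows = [
    ("2019-01-01 00:00:00",),  # exactly on the segment start
    ("2019-01-01 00:00:00",),  # exactly on the segment start
    ("2019-01-01 00:00:01",),
    ("2019-01-01 00:07:30",),
    ("2019-01-01 00:14:59",),
    ("2019-01-01 00:15:00",),  # exactly on the segment end (next segment's row)
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT)")
conn.executemany("INSERT INTO events VALUES (?)", rows)

START, END = "2019-01-01 00:00:00", "2019-01-01 00:15:00"


def count(lower_op, upper_op):
    """Count rows bounded by START and END using the given comparison operators."""
    sql = f"SELECT COUNT(*) FROM events WHERE ts {lower_op} ? AND ts {upper_op} ?"
    return conn.execute(sql, (START, END)).fetchone()[0]


open_open = count(">", "<")     # shape of the first query in the post
open_closed = count(">", "<=")  # shape of the second query in the post
half_open = count(">=", "<")    # matches a [start, end) segment interval

print(open_open, open_closed, half_open)
```

On the toy data the three counts differ only by the rows sitting exactly on a boundary, which is the same pattern as 845,733 vs 846,600 vs 846,603 in the post. In Druid SQL the half-open form would be `where __time >= TIMESTAMP '2019-01-01 00:00:00' and __time < TIMESTAMP '2019-01-01 00:15:00'`; whether that returns exactly 846,603 here has not been verified against the data in the post.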

I believe count uses an approximate algorithm by default.

Rommel Garcia

From the docs, http://druid.io/docs/latest/configuration/index.html#sql
druid.sql.planner.useApproximateCountDistinct
Whether to use an approximate cardinality algorithm for COUNT(DISTINCT foo).
Default: true

For count distinct, the default is to use an approximate cardinality algorithm. I didn’t find a reference for just count without distinct.
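
To rule approximation out when testing, the same setting can also be passed per query as `useApproximateCountDistinct` in the SQL query context. As far as the docs describe it, this setting only concerns COUNT(DISTINCT ...), so it should make no difference to a plain `count(*)`. Below is a minimal sketch of the request body for the SQL HTTP endpoint (`POST /druid/v2/sql`); the `build_sql_request` helper is hypothetical and the block only builds the JSON, it does not contact a broker.

```python
import json


def build_sql_request(query, approximate_count_distinct=False):
    """Build the JSON body for a Druid SQL request (hypothetical helper)."""
    return json.dumps({
        "query": query,
        # Per-query override of druid.sql.planner.useApproximateCountDistinct.
        "context": {"useApproximateCountDistinct": approximate_count_distinct},
    })


# Half-open bounds, matching the first segment's interval from the post.
query = (
    "SELECT COUNT(*) AS cnt FROM updatetest_6hours_jan "
    "WHERE __time >= TIMESTAMP '2019-01-01 00:00:00' "
    "AND __time < TIMESTAMP '2019-01-01 00:15:00'"
)
body = build_sql_request(query)
print(body)
```

If the count comes back the same with the setting on and off, approximation is not what explains the gap.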