Segments in HDFS, but not all read from metadata?

Hi,

I have 7 days of data imported into HDFS, with 10-20 GB of zipped segments per day.

Of those, only 3 days show up with the expected size (around 40 GB) in the coordinator UI.

Also, the coordinator logs "Polled and found 1,568 segments in the database", but in MySQL there are 3,000 rows in druid_segments.

I also noticed that reprocessing a day took its HDFS usage from the lowest to the highest.

What can the explanation be? The MapReduce jobs completed in both cases; I can only speculate that the metadata was not fully written, but I have not yet found an exception for that in the logs.

Please advise,

Nicu

Hey Nicu,

“Polled and found X segments in the database” indicates the number of segments that are able to be served - in other words, segments in data sources that are not disabled and have not been filtered out through load and drop rules. If you look at the druid_segments table I would expect 1,568 rows to have ‘used’ set to 1 and the rest to be 0.
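A quick way to confirm this split is to query the metadata store directly. This is a sketch, assuming the default table name druid_segments and that the coordinator points at the same MySQL database:

```sql
-- Count segments by their 'used' flag; if the coordinator's poll count
-- is accurate, expect 1,568 rows with used = 1 and the rest with used = 0.
SELECT used, COUNT(*) AS segment_count
FROM druid_segments
GROUP BY used;
```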

Do you have any rules configured, or are you using only the default rule (load forever)?

Hi,

Indeed, the 1,568 are marked used; the rest are not.

I only use the default rule (load forever); I did not modify it.

One perhaps related question: it seems that previous versions of data segments still linger in HDFS. Every import adds a new HDFS subdirectory, but unless the old ones still hold some useful info (or the new partitions or deltas reference them), it looks like the old data is not cleaned up:

hadoop fs -du /druid/impression/impression/20151104T000000.000Z_20151105T000000.000Z

6782372725 /druid/impression/impression/20151104T000000.000Z_20151105T000000.000Z/2015-11-26T10_13_44.289Z

32839003 /druid/impression/impression/20151104T000000.000Z_20151105T000000.000Z/2015-11-29T09_03_10.568Z

18355477676 /druid/impression/impression/20151104T000000.000Z_20151105T000000.000Z/2015-12-01T14_15_08.748Z

This remained the case even after restarting the nodes.

Hey Nicu,

Druid does not automatically remove segments from deep storage when they have been made obsolete by a newer version. If you want to clean up unused segments, you can issue a kill task: http://druid.io/docs/latest/misc/tasks.html
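For example, a kill task spec could look like the following sketch; the dataSource name and interval here are taken from the paths earlier in the thread and should be adjusted to match your setup:

```json
{
  "type": "kill",
  "dataSource": "impression",
  "interval": "2015-11-04/2015-11-05"
}
```

The task is submitted to the overlord, which deletes unused (used = 0) segments within that interval from both the metadata store and deep storage, so make sure the segments you want to keep are still marked used before running it.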