Indexing is successful, but segments don't show up

Hello,

I recently started exploring Druid for one of our requirements. I have set up a 3-node Druid cluster and tried loading a dataset with 35 dimensions and 10 metrics. I am using the indexing service with a local firehose, triggering one task to load the data for each month.

I could load data for a couple of months without any issues. However, a couple of the indexing tasks show status SUCCESS, yet the related segments don't show up in the indexer/coordinator console. I can see the segment data stored in deep storage, and I have checked the metadata DB and the segment entries are there as well. I dug into the source code to understand how the segment info is retrieved and displayed on the console; it looks like the code uses ZooKeeper to identify the segments, and I see only 2 entries out of 4 there. What can cause the segment info not to be added to ZooKeeper?
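(For reference, and assuming the default druid.zk.paths.base of /druid, the ZooKeeper paths I checked look like the following, where host:port stands for one of my data nodes:

  /druid/announcements/<host:port>    <- one entry per announced data node
  /druid/segments/<host:port>         <- batched segment announcements for that node

Comparing the children of /druid/segments/<host:port> against the segment table in the metadata DB is how I found that 2 of the 4 segments were missing.)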

I know Hadoop is recommended for batch indexing for better performance, but for various reasons I can't use Hadoop, and the current (non-Hadoop) indexing is performing very poorly: one task has been running for 4 days to index 250 million records and is still not done. How can I improve the performance? I have attached the Druid configs used, for your reference.

Hardware config:

Node 1 (x86_64 GNU/Linux):
CPU cores: 24
Memory: 125 GB

Node 2:
CPU cores: 12
Memory: 47 GB

Node 3:
CPU cores: 12
Memory: 47 GB

Please share your thoughts; I appreciate your help.

Thanks

Nalini

common.props (2.17 KB)

coordinator_runtime.properties (1.05 KB)

historical.props (1.51 KB)

middlemanager.props (1.21 KB)

overlord.props (1.64 KB)

Hey Nalini,

Do you have historical nodes running? If so, they should show up in the coordinator console (http://COORDINATOR_HOST:PORT/) and also in the coordinator logs. They are the ones that load and serve segments.

The inefficiency of the non-Hadoop index task is mostly due to the fact that it makes a lot of needless passes over the data when it writes out more than one segment. You can make this less painful by having fewer segments written out per task. So, if you currently have 250 million records per month with a targetPartitionSize of 5 million, that's 50 segments per month; you'd probably do better to partition your data by day and run one task for each day. The Hadoop index task does this partitioning automatically using Map/Reduce jobs.
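For illustration, a per-day index task spec would look roughly like this (a sketch only: the dataSource, baseDir, and filter are placeholders for your own setup, and I've omitted the parser and metricsSpec sections for brevity):

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2015-10-01/2015-10-02"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/data/2015-10-01",
        "filter": "*.csv"
      }
    },
    "tuningConfig": {
      "type": "index",
      "targetPartitionSize": 5000000
    }
  }
}

At roughly 8 million rows per day with a targetPartitionSize of 5 million, each task writes only one or two segments instead of fifty, which avoids most of the repeated passes.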

Hi Gian,

Thanks for the quick response. I have 3 historical nodes running on 3 different machines, and I have been using the coordinator console mentioned above. In fact, I posted this issue only after finding that some segments are not showing up there, though I can see others.

Please let me know if you need any logs. Meanwhile, as you suggested, I will partition the dataset by day and try loading again.

Nalini

Gian,

Some more findings: the segments are being created in deep storage (descriptor.json, index.zip), but according to the coordinator logs, the segments I claim to be missing are never announced and never assigned to any node. What do you think (maybe a setting?) could cause the segments not to be announced?

I see the two errors below (the only two in the entire log file). Do you think these errors could prevent the segments from being announced?

2015-10-29 23:10:56,115 ERROR c.m.c.l.Lifecycle$AnnotationBasedHandler [Thread-22] Exception when stopping method[public void io.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[io.druid.curator.discovery.ServerDiscoverySelector@1192b58e]

2015-11-02 08:22:36,757 ERROR c.m.c.l.Lifecycle$AnnotationBasedHandler [Thread-22] Exception when stopping method[public void io.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[io.druid.curator.discovery.ServerDiscoverySelector@1192b58e]

Hey Nalini,

Is it possible that you have some unintended load rules in your coordinator configuration preventing certain segments from being loaded? If the segment exists in deep storage and there is valid metadata referencing these segments, the coordinator should load them unless prevented from doing so by a load/drop rule. See:

http://druid.io/docs/latest/operations/rule-configuration.html
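For reference, the out-of-the-box rule set is effectively a single loadForever rule with two replicants in the default tier, which loads everything:

[
  {
    "type": "loadForever",
    "tieredReplicants": { "_default_tier": 2 }
  }
]

If the rules instead contain something like a loadByPeriod followed by a dropForever, segments outside that period will be silently kept off the cluster even though their metadata and deep-storage files are intact.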

Hi David,

I don't see anything fancy in the rules; I have been using the default rules for the default tier. The problem was resolved when I restarted ZooKeeper and then the Druid nodes. My belief is that the responsible daemon (the overlord, in this case) had not announced the segments by making entries in ZooKeeper, despite the segments being created successfully. After a clean restart, the segments became available, as you said they would, since the metadata in the DB was intact.

I would like to thank everyone for sharing this valuable information.

Hi,

I also have a similar issue, except I do not see data in deep storage: only empty files for the segments. However, besides reporting SUCCESS, the indexer reports how many rows it ingested and how many it ignored for each hour. Summing the counts for the 4 ingested hours, I get the 3.5M events I expected, yet no data was actually stored!

Please advise,
Nicu

Hi,

I have the same issue: I see the segments both in MySQL and in HDFS, and there is data in index.zip, exactly the 800 MB it should be.

The segments are not available: a timeseries query against the historical node gives an empty result, and the coordinator console shows 0 B consumed out of 10 GB.
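The query I'm running is a minimal timeseries along these lines (the dataSource and interval are placeholders for my real ones), and it comes back empty rather than erroring:

{
  "queryType": "timeseries",
  "dataSource": "my_datasource",
  "granularity": "all",
  "intervals": ["2015-10-01/2015-11-01"],
  "aggregations": [
    { "type": "count", "name": "rows" }
  ]
}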

Please advise! How did you solve this?

Thanks,

Nicu

Fixed: it turned out to be a classpath issue on the historical process.
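(In case it helps others: with HDFS deep storage, a likely form of such a classpath problem is the HDFS storage extension not being loaded by the historical. The exact property names vary by Druid version, but the newer style looks roughly like this sketch, where storageDirectory is a placeholder:

# hedged sketch; check the extension name/version against your Druid release
druid.extensions.loadList=["druid-hdfs-storage"]
druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/segments

Without the extension on the classpath, the historical cannot pull index.zip from deep storage, so segments never become queryable.)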

Hi Gian,
Aren't historical nodes needed only at query time, to load and serve segments?
Do historical nodes play any role during indexing?

Regards
Sidharth Singla

Hi Sidharth,

The role of historicals is to serve segments and execute queries against locally stored segments. They are not involved in indexing tasks (http://druid.io/docs/0.10.0/design/indexing-service.html). Once an indexing task completes and the resulting segments are stored in deep storage, historicals pull those segments into their local storage.
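The pull side is controlled by a couple of historical runtime properties, roughly like this sketch (the path and sizes are placeholders):

# where and how much the historical may cache locally
druid.segmentCache.locations=[{"path":"/var/druid/segment-cache","maxSize":300000000000}]
druid.server.maxSize=300000000000

The coordinator sees new used segments in the metadata store and instructs historicals, via their load queues, to download them from deep storage into this cache; only then do the segments become queryable.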

Thanks,

Jihoon

On Sun, Jul 9, 2017 at 10:43 PM, Sidharth Singla <sidpkl.singla@gmail.com> wrote:

Thanks, Jihoon, for clearing that up.

-Sidharth