Large segments can't load into HDFS

I have one datasource whose segments are about 2 GB per hour, around 200 million rows per hour. The problem is that this large datasource can't load into HDFS, and I can't find any information about the datasource in HDFS, on the coordinator, or on the historical nodes. Worse, queries against it always hit "Query timeout" in Pivot.

On the same realtime node, I have another datasource with a relatively small amount of data, about 200 MB per hour and 20 million rows per hour. It works well.

What can I configure?

Thanks.

I have one datasource whose segments are about 2 GB per hour, around 200 million rows per hour.

Is the segment size 2 GB, and does that single segment actually contain 200 million rows?

The problem is that this large datasource can't load into HDFS

Theoretically the limit is your HDFS file size limit, which is very large (see http://stackoverflow.com/questions/5493873/hadoop-hdfs-maximum-file-size).

, and I can't find any information about the datasource in HDFS, on the coordinator, or on the historical nodes. Worse, queries against it always hit "Query timeout" in Pivot.

I'm not sure, but if your Druid segment contains about 200 million rows, it doesn't surprise me that it times out. We recommend roughly 5 million rows per segment. You probably have to use sharding: http://druid.io/docs/latest/ingestion/overview.html#sharding
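As a rough sanity check on the numbers (my arithmetic, not a figure from the docs): at 200 million rows per hour and a target of about 5 million rows per segment, that works out to roughly 200M / 5M = 40 shards for each hour of data.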

On the same realtime node, I have another datasource with a relatively small amount of data, about 200 MB per hour and 20 million rows per hour. It works well.

Is it the same ingestion spec?

What can I configure?

Probably the first issue is how many rows you have in a single segment.

Hi Slim, thanks for helping.

  1. Right, the segment size is 2 GB and the segment actually contains 200 million rows.

  2. My point is that sharding the segments may solve the problem; 200 million rows should be sharded into about 40 parts. Can I do that on one realtime node, or on just a few nodes? I don't have many machines, and my data is consumed from Kafka.

  3. The documentation on configuring shardSpec is not clear to me. My attempt was not successful; is that because I have only one realtime node?

  4. Yes, these datasources are defined in the same ingestion spec.

Thanks.

On Tuesday, February 23, 2016 at 3:22:31 AM UTC+8, Slim Bouguerra wrote:

Hi,
Can you share your ingestion spec files? It will be easier to figure out what is wrong.

Hi Slim, here is my spec file.
Sorry to bother you.

On Tuesday, February 23, 2016 at 11:02:11 PM UTC+8, Slim Bouguerra wrote:

realtime.spec (3.4 KB)

First, I notice that you set the queryGranularity to none ("queryGranularity": "NONE"). Please note that this generates segments with no time-bucket aggregation at all. For example, if your events arrive with millisecond timestamps, they will be ingested at that granularity, which makes segments become beefy and huge. Depending on your use case, I would recommend changing it to minute level, or maybe even higher.
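For reference, the relevant part of the spec would look something like this (a minimal sketch; the "uniform" type and HOUR segmentGranularity here are assumptions, keep whatever your spec already has and only change the queryGranularity):

"granularitySpec": {
"type": "uniform",
"segmentGranularity": "HOUR",
"queryGranularity": "MINUTE"
}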

For the sharding spec, please use linear unless you really need all partitions of an interval to be present before the data becomes queryable.

For instance, you can replace it with this, changing partitionNum for every node:

"shardSpec": {
"type": "linear",
"partitionNum": 0
}
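By contrast, the stricter variant mentioned above would look something like this (a sketch for illustration; with numbered shard specs you also declare the total partition count, and an interval only becomes queryable once all its partitions exist):

"shardSpec": {
"type": "numbered",
"partitionNum": 0,
"partitions": 2
}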

Please let me know if you need more help!

Hi,
I followed your advice: I set queryGranularity to "minute" and the shardSpec type to linear.

Now I am trying it on two realtime nodes.

The first spec contains:

"shardSpec": {
"type": "linear",
"partitionNum": 0
}

and the second spec contains:

"shardSpec": {
"type": "linear",
"partitionNum": 1
}

The current situation is that the second realtime node can't ingest data, and I can't find any data files in its basePersistDirectory.

I also have the feeling that the realtime nodes consume data much more slowly now and end up with much less data.

I think something must be wrong in my configuration.

Here are the logs from the two realtime nodes. There is no exception on the first node, but there is an ERROR on the second one:

2016-02-24T23:32:43,085 ERROR [chief-api-0224[1]] io.druid.segment.realtime.RealtimeManager - Exception aborted realtime processing[api-0224]: {class=io.druid.segment.realtime.RealtimeManager, exceptionType=class java.lang.NoClassDefFoundError, exceptionMessage=scala/Function0}
java.lang.NoClassDefFoundError: scala/Function0
    at io.druid.firehose.kafka.KafkaEightFirehoseFactory.connect(KafkaEightFirehoseFactory.java:84) ~[?:?]
    at io.druid.firehose.kafka.KafkaEightFirehoseFactory.connect(KafkaEightFirehoseFactory.java:45) ~[?:?]
    at io.druid.segment.realtime.FireDepartment.connect(FireDepartment.java:97) ~[druid-server-0.8.1-iap2.jar:0.8.1-iap2]
    at io.druid.segment.realtime.RealtimeManager$FireChief.initFirehose(RealtimeManager.java:203) ~[druid-server-0.8.1-iap2.jar:0.8.1-iap2]
    at io.druid.segment.realtime.RealtimeManager$FireChief.run(RealtimeManager.java:247) [druid-server-0.8.1-iap2.jar:0.8.1-iap2]
Caused by: java.lang.ClassNotFoundException: scala.Function0
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366) ~[?:1.7.0_79]
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355) ~[?:1.7.0_79]
    at java.security.AccessController.doPrivileged(Native Method) ~[?:1.7.0_79]
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354) ~[?:1.7.0_79]
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425) ~[?:1.7.0_79]
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) ~[?:1.7.0_79]
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ~[?:1.7.0_79]
    ... 5 more

Exception in thread "chief-api-0224[1]" java.lang.NoClassDefFoundError: scala/Function0
    at io.druid.firehose.kafka.KafkaEightFirehoseFactory.connect(KafkaEightFirehoseFactory.java:84)
    at io.druid.firehose.kafka.KafkaEightFirehoseFactory.connect(KafkaEightFirehoseFactory.java:45)
    at io.druid.segment.realtime.FireDepartment.connect(FireDepartment.java:97)
    at io.druid.segment.realtime.RealtimeManager$FireChief.initFirehose(RealtimeManager.java:203)
    at io.druid.segment.realtime.RealtimeManager$FireChief.run(RealtimeManager.java:247)
Caused by: java.lang.ClassNotFoundException: scala.Function0
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 5 more

The full log files follow. It looks like an error related to Scala?

On Wednesday, February 24, 2016 at 9:39:27 PM UTC+8, Slim Bouguerra wrote:

node1.log (254 KB)

node2.log (46.1 KB)

I think your second node is missing some Kafka jars. The java.lang.NoClassDefFoundError: scala/Function0 means the scala-library jar, which the Kafka client (and therefore the druid-kafka-eight firehose) depends on, is not on that node's classpath.
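A minimal sketch of the fix, assuming a stock Druid 0.8.x setup that loads extensions through Maven coordinates (the version string below is an assumption; match it to your actual build): check the second node's common.runtime.properties and make sure the kafka-eight extension is listed, so its transitive dependencies, including scala-library, get pulled in:

druid.extensions.coordinates=["io.druid.extensions:druid-kafka-eight:0.8.1"]

If you instead run with a flat classpath, copying the same scala-library jar that the working first node has into the second node's lib directory should have the same effect.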