Historical server is very slow during segment loading

Hi guys!

My historical server is very slow when loading segments from HDFS. The HDFS NameNode and DataNode are installed on the same host as the historical server, so bandwidth or latency shouldn't be a problem in this case.

  • Host has 8 cores and 14GB RAM

  • The segment size is ~30MB

  • The number of segments is ~10,000, with a total size of ~60GB

Some interesting config parameters:

Historical conf:

druid.segmentCache.deleteOnRemove=true

druid.segmentCache.numBootstrapThreads=7

druid.server.http.numThreads=25

druid.processing.numThreads=5

druid.processing.numMergeBuffers=1

druid.historical.cache.useCache=true

druid.historical.cache.populateCache=true

druid.historical.cache.unCacheable=

druid.cache.type=local
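
(For completeness, the segment cache location and the loading-thread setting look roughly like this; the path, maxSize and thread count below are placeholders, not the exact values on my host:)

druid.segmentCache.locations=[{"path":"/opt/druid/segment-cache","maxSize":130000000000}]
druid.segmentCache.numLoadingThreads=4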

I/O performance during segment loading:

With the iotop and iostat commands, the disk writes are about 1.5 MB/s (for the druid user), while if I check writes with the dd command I get about 55 MB/s (dd if=/dev/zero of=/opt/zeroes bs=512 count=200), so disk performance might not be the bottleneck.

Do you know any way to improve this behaviour?
Thanks in advance!!

No suggestions?

1. Your segment size is very small; usually a good size is around 700MB. You can increase the segment granularity or the target partition size (see the sketch below).

2. This PR might help: Skip OS cache on Linux when pulling segments (#5421).
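
For item 1, these are the two knobs I mean, sketched inside a Hadoop ingestion spec; the DAY granularity and the target partition size below are only illustrative values, tune them to your data volume:

{
  "tuningConfig": {
    "type": "hadoop",
    "partitionsSpec": {
      "type": "hashed",
      "targetPartitionSize": 5000000
    }
  },
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "MINUTE"
  }
}

With fewer, larger segments the historical has far fewer files to pull from HDFS and fewer objects to manage.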

Thanks a lot! I am re-indexing the segments into a new dataSource, but I have a question…

Before, each hourly segment was about 200kB (only 1 shard); after re-indexing, the daily segment is about 60kB. The idea was that the segments would be bigger, but they aren't. Why is that?

I am using a Hadoop indexing task:

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "reicmp2",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "spatialDimensions": [],
            "dimensions": ["ping_lost", "endpoint", "min", "max", "company", "probe", "tool", "site", "ping_total", "mdev", "time", "probe_ip", "avg", "ipv4", "percent_response"]
          }
        }
      },
      "metricsSpec": [
        { "type": "count", "name": "event_count" },
        { "type": "longSum", "fieldName": "response_percent", "name": "response_percent_sum" },
        { "type": "longMin", "fieldName": "response_percent", "name": "response_percent_min" },
        { "type": "longMax", "fieldName": "response_percent", "name": "response_percent_max" },
        { "type": "doubleSum", "fieldName": "media_deviation", "name": "media_deviation_sum" },
        { "type": "doubleMin", "fieldName": "media_deviation", "name": "media_deviation_min" },
        { "type": "doubleMax", "fieldName": "media_deviation", "name": "media_deviation_max" },
        { "type": "doubleSum", "fieldName": "minimum_rtt", "name": "minimum_rtt_sum" },
        { "type": "doubleMin", "fieldName": "minimum_rtt", "name": "minimum_rtt_min" },
        { "type": "doubleMax", "fieldName": "minimum_rtt", "name": "minimum_rtt_max" },
        { "type": "doubleSum", "fieldName": "average_rtt", "name": "average_rtt_sum" },
        { "type": "doubleMin", "fieldName": "average_rtt", "name": "average_rtt_min" },
        { "type": "doubleMax", "fieldName": "average_rtt", "name": "average_rtt_max" },
        { "type": "doubleSum", "fieldName": "maximum_rtt", "name": "maximum_rtt_sum" },
        { "type": "doubleMin", "fieldName": "maximum_rtt", "name": "maximum_rtt_min" },
        { "type": "doubleMax", "fieldName": "maximum_rtt", "name": "maximum_rtt_max" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "MINUTE",
        "intervals": ["2018-01-07T00:00:00Z/P1W"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "dataSource",
        "ingestionSpec": {
          "dataSource": "spark-ef-icmp",
          "intervals": ["2018-01-07T00:00:00Z/P1W"]
        }
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "leaveIntermediate": true,
      "ignoreInvalidRows": false,
      "numBackgroundPersistThreads": 1
    }
  },
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.7.3"]
}

Thanks in advance!!

Before, each hourly segment was about 200kB (only 1 shard); after re-indexing, the daily segment is about 60kB. The idea was that the segments would be bigger, but they aren't. Why is that?

Does your data look correct after the re-ingestion? Maybe some of the original data wasn’t ingested.

Also, did the original datasource also have "queryGranularity": "MINUTE"?

Jonathan, I think the size has changed because the rollup during re-indexing is more "aggressive". Now the queryGranularity is MINUTE, whereas before it was NONE, and the columns have been compressed. When I query the data the results are OK, so all seems good.
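
Just to write down what I understand with a simplified, hypothetical example (the field values below are made up): with queryGranularity MINUTE, events that fall in the same minute and have identical dimension values are rolled up into one stored row, while with NONE every event is kept.

Two input events in the same minute:
{"timestamp": "2018-01-07T10:15:12Z", "endpoint": "a", "average_rtt": 1.2}
{"timestamp": "2018-01-07T10:15:48Z", "endpoint": "a", "average_rtt": 1.4}

Rolled-up row stored in the segment (timestamp truncated to the minute):
{"timestamp": "2018-01-07T10:15:00Z", "endpoint": "a", "event_count": 2, "average_rtt_sum": 2.6}

So fewer rows end up in the segment, which explains the smaller size.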

Thanks so much!!