Does the segment cache use an LRU algorithm or something else?

I created a hot tier of historical nodes that retains 14 days of data. Recently I found that these nodes tend to keep older data (e.g., from 7 days ago) in the memory cache and don’t keep all of the newest data (e.g., from 1 hour ago). What I expect is for the newest data to be kept first. So I wonder whether the segment cache has a mechanism that loads the newest data into the cache and drops some of the older data while the historical still serves those segments.

Welcome @Ze_yu_Huang! Do you have any retention rules in place? Regarding a segment cache mechanism, there are a number of properties which might be in play. Can you please tell us a bit about how you are currently storing segments?

Thanks. I do use a retention rule to keep 15 days of data on the hot historical nodes, so my segment cache directory holds 15 days of data. I used the lsof command to check the segment files and found that not all of them were kept in memory, especially the most recent data. Each historical node is set up with 16 GB of memory, and the 15 days of data on one node amount to several hundred GB, so it’s normal that not all segments fit in memory. But what I expect is that the most recent data is given higher priority for segment caching.

I don’t believe that Druid controls which of the segments are memory-mapped; this is controlled directly by the OS. Which segments are actually held in memory at any given time will depend on user behavior and on the segments needed to service the queries users have issued in the recent past.


I agree that it’s controlled by the OS, but it seems like the segment cache is not related to user behavior (the queries it has served). The query cache does have something like LRU, so I wonder if the segment cache has a similar algorithm to clean up older data in the cache.

This section of the tuning guide describes how the segments are mapped to free system memory.

In principle, the OS memory mapping will do this automatically. When a query requests a segment that is not in memory, it will try to map it into available free memory, or, if memory is already full, page out portions of other segments that were previously mapped. If your query workload changes which segments are needed frequently and free OS memory is limited, then it is likely that at any given time the set of memory-mapped segments will not match any particular expectation. If most queries hit a particular subset of segments, those will tend to stay in memory until they are bumped out by other segments from other queries. So I guess the next questions are: what does your query workload look like, and how much free memory do you have?
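To make the mechanism concrete, here is a minimal Java sketch (not Druid’s actual code, and the file path is hypothetical) of what memory-mapping a segment file looks like. The key point is that the mapping only reserves address space; the kernel pages data in lazily as it is read and reclaims pages under memory pressure, so residency is decided by the OS page cache rather than by any LRU inside Druid.

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SegmentMmapSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical segment file under the segment cache directory.
        Path segmentFile = Path.of("/data/druid/segment-cache/example/00000.smoosh");
        try (FileChannel channel = FileChannel.open(segmentFile, StandardOpenOption.READ)) {
            // Mapping reserves virtual address space only; no physical memory is used yet.
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            // Reading a byte faults the containing page into the OS page cache.
            byte first = buf.get(0);
            System.out.println("first byte: " + first);
            // Whether that page stays resident later is up to the kernel's
            // page-reclaim policy, not to Druid.
        }
    }
}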


Thanks for your reply. I want to show you my cluster’s config.
For historical,
{
  "druid.server.http.numThreads": "160",
  "druid_cache_sizeInBytes": "3g",
  "druid_emitter_http_minHttpTimeoutMillis": "50",
  "druid_historical_cache_populateCache": "true",
  "druid_historical_cache_useCache": "true",
  "druid_monitoring_monitors": "[\"org.apache.druid.java.util.metrics.JvmMonitor\",\"org.apache.druid.client.cache.CacheMonitor\",\"org.apache.druid.server.metrics.QueryCountStatsMonitor\",\"org.apache.druid.server.metrics.HistoricalMetricsMonitor\"]",
  "druid_plaintextPort": "18083",
  "druid_processing_buffer_sizeBytes": "500MiB",
  "druid_processing_numMergeBuffers": "4",
  "druid_processing_numThreads": "13",
  "druid_processing_tmpDir": "/data/druid/var/processing",
  "druid_query_groupBy_maxOnDiskStorage": "10737418240",
  "druid_segmentCache_info": "/data/druid/segment-info",
  "druid_segmentCache_locations": "[{\"path\":\"/data/druid/segment-cache\",\"maxSize\":\"500g\",\"freeSpacePercent\":5.0}]",
  "druid_server_maxSize": "500g",
  "druid_server_tier": "latest",
  "druid_service": "druid/historical"
}
For broker,
{
  "druid_broker_balancer_type": "connectionCount",
  "druid_broker_cache_populateCache": "false",
  "druid_broker_cache_useCache": "true",
  "druid_broker_http_maxQueuedBytes": "31457280",
  "druid_broker_http_numConnections": "50",
  "druid_monitoring_monitors": "[\"org.apache.druid.java.util.metrics.JvmMonitor\",\"org.apache.druid.client.cache.CacheMonitor\",\"org.apache.druid.server.metrics.QueryCountStatsMonitor\",\"org.apache.druid.server.metrics.HistoricalMetricsMonitor\"]",
  "druid_plaintextPort": "18082",
  "druid_processing_buffer_sizeBytes": "536870912",
  "druid_processing_numMergeBuffers": "6",
  "druid_processing_numThreads": "2",
  "druid_processing_tmpDir": "/data/druid/processing",
  "druid_query_scheduler_laning_maxLowPercent": "10",
  "druid_query_scheduler_laning_strategy": "hilo",
  "druid_query_scheduler_numThreads": "40",
  "druid_query_scheduler_prioritization_adjustment": "10",
  "druid_query_scheduler_prioritization_durationThreshold": "P1W",
  "druid_query_scheduler_prioritization_strategy": "threshold",
  "druid_request_logging_dir": "/usr/local/druid/logs",
  "druid_request_logging_type": "file",
  "druid_server_http_numThreads": "60",
  "druid_service": "druid/broker"
}

I deployed the brokers and historicals in k8s with these resource settings:
historical:
  limits:
    cpu: '14'
    memory: 120Gi
  requests:
    cpu: '14'
    memory: 120Gi
brokers:
  limits:
    cpu: '4'
    memory: 14Gi
  requests:
    cpu: '1'
    memory: 8Gi

I saw that all the data on the historicals was mapped into the memory cache, but there are still a lot of timed-out queries even though I set the timeout to 40 seconds. It’s really weird.
I also checked the emitted metrics, and everything looks normal. Can you give me some advice?


Some of the error messages from the historical logs are below.
msg=Timeout waiting for task., code=Query timeout, class=java.util.concurrent.TimeoutException, host=null})

I am assuming you are using groupBy queries? If this is true, you could try increasing the following to check if it helps:

"druid_processing_numMergeBuffers": "4" to, say, "8"
"druid_processing_buffer_sizeBytes": "500MiB" to, say, "1GiB"
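One caveat (my own back-of-the-envelope, based on the standard Druid direct-memory sizing formula): direct memory must be at least druid.processing.buffer.sizeBytes × (druid.processing.numThreads + druid.processing.numMergeBuffers + 1). With 13 processing threads, 8 merge buffers, and 1 GiB buffers that is (13 + 8 + 1) × 1 GiB = 22 GiB, so check that -XX:MaxDirectMemorySize on the historicals is raised accordingly before trying these values.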

  • Please ensure that you are adding a filter on ‘__time’ at the very least
  • Please ensure that the segments you are creating are around 500MB and have about 5M rows
  • I’d suggest using hashed or range partitioning if available
  • If you have metrics enabled, please check the query/segment/time and make sure that there are no queries with large query/segment/time reaching into minutes

Even though those segments are memory mapped, they still need to be processed unless the druid cache is being hit.

Actually, the timed-out queries cover different query types (topN, timeseries, timeBoundary).
For druid_processing_numMergeBuffers and druid_processing_buffer_sizeBytes, I followed the server tuning guidelines to set the values. We do use __time for filtering, and our segments have about 3M rows each.
Would hashed or range partitioning be better for query performance?
As I said, all the metrics seem normal; query/segment/time is below 300 ms.
I have no idea about the timeout problem now lol.

A few questions:

Is the problem exacerbated by load? Do the queries that are timing out also time out when run in isolation? How many data and query nodes do you have? What is the volume of data you are querying? Can you paste a typical query after sanitizing it?

Also, the server tuning guidelines are meant for an average typical load. If your workload is skewed, you may have to tweak these settings.

Range or hashed partitioning will give you better performance than dynamic partitioning.
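For illustration, a hedged sketch of what a hashed partitionsSpec might look like in a native batch ingestion tuningConfig (field names as I recall them from the Druid docs; the row target is only an example, in line with the ~5M rows per segment guidance above). Note that hashed or range partitioning applies to batch ingestion or compaction, not to streaming ingestion:

"tuningConfig": {
  "type": "index_parallel",
  "partitionsSpec": {
    "type": "hashed",
    "targetRowsPerSegment": 5000000
  }
}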

Hi Ze_yu_Huang, I was looking through your configs once again and see that you are using druid.query.groupBy.maxOnDiskStorage, which is an uncommon parameter. Unless you have a very good reason for using it, I’d suggest disabling it (set it to 0) and retrying.
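In the env-style config you posted, that change would look something like:

"druid_query_groupBy_maxOnDiskStorage": "0"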

I have 3 query nodes and 5 data nodes in the ‘latest’ tier. There are two big datasources that hold 30 GB of data per day.
Example query:
2022-08-19T04:47:51.877Z 10.73.4.167
{
  "queryType":"topN",
  "dataSource":{"type":"table","name":"adx-request-metrics"},
  "virtualColumns":[],
  "dimension":{"type":"default","dimension":"ab_test","outputName":"ab_test-d08","outputType":"STRING"},
  "metric":{"type":"LegacyTopNMetricSpec","metric":"sum_buyer_net_revenue"},
  "threshold":100,
  "intervals":{"type":"LegacySegmentSpec","intervals":["2022-08-19T04:00:00.000Z/2022-08-19T05:00:00.000Z"]},
  "filter":{"type":"and","fields":[{"type":"in","dimension":"ab_test","values":[null,"10.73.13.42","10.72.14.145"]},{"type":"selector","dimension":"server_region","value":"use","extractionFn":null}]},
  "granularity":{"type":"all"},
  "aggregations":[
    {"type":"doubleSum","name":"sum_request","fieldName":"request","expression":null},
    {"type":"doubleSum","name":"sum_response","fieldName":"response","expression":null},
    {"type":"doubleSum","name":"sum_impression","fieldName":"impression","expression":null},
    {"type":"doubleSum","name":"!T_0","fieldName":"buyer_net_revenue","expression":null},
    {"type":"doubleSum","name":"!T_1","fieldName":"seller_out_request","expression":null},
    {"type":"doubleSum","name":"!T_2","fieldName":"total_seller_payment_net_revenue","expression":null},
    {"type":"doubleSum","name":"!T_3","fieldName":"block","expression":null},
    {"type":"doubleSum","name":"!T_4","fieldName":"total_bid_cost_time","expression":null}
  ],
  "postAggregations":[
    {"type":"expression","name":"sum_buyer_net_revenue","expression":"(cast(\"!T_0\",'DOUBLE')/1000)","ordering":null},
    {"type":"expression","name":"sumresp-5c2","expression":"if(\"!T_1\"!=0,(cast(\"sum_response\",'DOUBLE')/\"!T_1\"),null)","ordering":null},
    {"type":"expression","name":"sumwins-07a","expression":"if(\"sum_response\"!=0,(cast(\"sum_impression\",'DOUBLE')/\"sum_response\"),null)","ordering":null},
    {"type":"expression","name":"sumcolu-2f4","expression":"if(\"sum_impression\"!=0,(cast(\"!T_0\",'DOUBLE')/\"sum_impression\"),null)","ordering":null},
    {"type":"expression","name":"sumbuye-423","expression":"if(\"!T_1\"!=0,(cast((\"!T_0\"*1000),'DOUBLE')/\"!T_1\"),null)","ordering":null},
    {"type":"expression","name":"sumbuye-3eb","expression":"if(\"!T_0\"!=0,(cast((\"!T_0\"-\"!T_2\"),'DOUBLE')/\"!T_0\"),null)","ordering":null},
    {"type":"expression","name":"sumbloc-d4e","expression":"if((\"!T_3\"+\"sum_request\")!=0,(cast(\"!T_3\",'DOUBLE')/(\"!T_3\"+\"sum_request\")),null)","ordering":null},
    {"type":"expression","name":"sumbloc-574","expression":"(\"!T_3\"+\"sum_request\")","ordering":null},
    {"type":"expression","name":"sumtota-0b0","expression":"(cast(if(\"sum_response\"!=0,(cast(\"!T_4\",'DOUBLE')/\"sum_response\"),null),'DOUBLE')/1000)","ordering":null}
  ],
  "context":{"implyDataCube":"adx-xyz-cpm-metric-t1","implyFeature":"visualization","implyUser":"yanjing@do-global.com","implyUserEmail":"yanjing@do-global.com","implyView":"f955","implyViewTitle":"AB_Ching,Yim","isBi":true,"minTopNThreshold":20000,"priority":1,"queryId":"e27c8ce7-f660-4c19-a83c-8af601140244","timeout":40000},
  "descending":false
}
{"query/time":568,"query/bytes":1067,"success":true,"identity":"allowAll"}

I haven’t run those timed-out queries in isolation. If these queries run normally in isolation, what could be the problem when they run together? Thanks!

Processing at the historical is constrained by "druid_processing_numThreads": "13", so each historical can only process 13 segments at a given time. As the number of queries increases, the prioritized processing queue in the historical gets larger (filling up with segments to be processed) while segments wait for a processing thread to become free. So increasing concurrency will cause queries to slow down after a point.
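To put rough numbers on it (my own back-of-the-envelope from the figures in this thread, with hypothetical concurrency numbers): 5 data nodes × 13 processing threads = 65 segments being scanned at any instant across the cluster. If, say, 20 concurrent queries each touch 30 segments, about 600 segment scans compete for those 65 slots, so much of the elapsed query time is spent waiting in the queue rather than scanning.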

I will look at your query in the morning to suggest optimizations if there are any (It’s midnight here)

Thanks!

I followed the tuning instructions and set "druid_processing_numThreads" to 13 because each historical node only has 14 cores. I did find that there are a lot of small segments containing recent data (maybe from several hours earlier). Small segments add overhead for the processing threads, right? If that’s true, do you have any advice about "druid_processing_numThreads", or should I compact segments frequently?
Have a good night!

Compaction of small segments into larger ones is a best practice, so yes, it should help. Also consider rollup if that is functional for the queries you need to service. Another great doc is this one:

Another consideration and common practice is to ingest data into both a granular datasource and one or more rolled-up ones for common aggregation queries. This helps reduce CPU utilization for some common workloads, as you are doing the aggregation ahead of time instead of in every query. The reduction in resource consumption goes a long way in dealing with concurrency. Also, I don’t know if this is your case, but consider that queries requiring aggregation on high-cardinality dimensions can cause bottlenecks in the broker.
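To make both suggestions concrete, here is a hedged sketch, with field names as I recall them from the Druid docs and example values only (the datasource name, offset, and row target are placeholders to adjust). An auto-compaction config like this can be submitted to the Coordinator (POST /druid/coordinator/v1/config/compaction) to keep merging the small recent segments into larger ones:

{
  "dataSource": "adx-request-metrics",
  "skipOffsetFromLatest": "PT2H",
  "tuningConfig": {
    "partitionsSpec": {
      "type": "dynamic",
      "maxRowsPerSegment": 5000000
    }
  }
}

And for a rolled-up companion datasource, the relevant part of the ingestion dataSchema would be along these lines:

"granularitySpec": {
  "segmentGranularity": "day",
  "queryGranularity": "hour",
  "rollup": true
}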

Setting numThreads to 13 is absolutely the right thing to do if your pod has a request of 14 cores and you have followed the best practice (n - 1), so no changes required. What I was getting at is that there is only a finite number of processing threads available in each historical. You can either scale the data nodes horizontally if it is an issue of too much concurrency, or try to squeeze more out of the system by tweaking parameters tuned to your use case.

If you do find some segments that are taking too long to process, then tuning of queries is needed. The query example you gave was successful and looks ok.

How do your segment wait times look? Are they high or seem within range? I am trying to understand if this is an issue of concurrency or heavy queries.

I scaled out my data nodes a few days ago, but the CPU and memory usage of the nodes stays at a low level both before and after the adjustment, especially the CPU usage. My segment wait times look fine. And I still see ‘msg=Timeout waiting for task., code=Query timeout, class=java.util.concurrent.TimeoutException, host=null})’.
It’s frustrating that the horizontal scaling doesn’t seem to decrease the number of timed-out queries.