Number of dimensions & granularity affecting query performance

Hi,

Does query performance get affected as the number of dimensions increases and the granularity becomes finer?

Thanks,

Prathamesh

Query performance is not directly tied to the number of fields in the data source. Query performance depends on the following:

  • # of fields in the query
  • Data selection criteria
  • # of expected records in the result set
  • Time range to be queried, e.g. last 1 day, last 1 week, last month, last quarter
  • Most importantly, understanding the cardinality of the field(s) in the WHERE clause. The higher the cardinality (more unique values; e.g. primary keys in an RDBMS have the highest possible cardinality, with each value appearing exactly once), the more selective the filter and the faster the query response time.

Other areas of investigation are cluster size, # of historicals, broker cache etc.
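
As a rough illustration of those points, here is a sketch of a query that keeps the time range narrow and filters on a high-cardinality dimension. The datasource and column names are placeholders, not taken from your setup:

SELECT browserName, COUNT(*) AS events
FROM "events"                                            -- placeholder datasource name
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY     -- narrow time range: last 1 day only
  AND userId = 'some-uuid-value'                         -- high-cardinality filter matches few rows
GROUP BY browserName;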

Hello Gaurav,

I am facing a performance issue in production with only a few MB of data in my Druid cluster. I did try tuning some of the parameters related to CPU usage and memory, but I also wanted to look at the problem from another perspective.

What I was wondering is whether having too many dimensions and a fine granularity (15 minutes in my case) could make the queries slower. That is, because each event/record has multiple dimensions and the data has to be stored in 15-minute buckets, does this lead to too many segments, which could take time to consolidate at the Broker/Historical nodes when a query is made for a week's or a month's data? (We use 15-minute granularity since we want to be able to query at 15 minutes and coarser, i.e. day, week, month, etc.)

Thanks,

Prathamesh

Prathamesh,

Don't break the data down into smaller segments if your segment size is already between 300-700 MB. Are you querying just the last 15 minutes of data, or multiples of the last 15 minutes? If not, leave the segment granularity at days. Check the # of segments on the Historicals.

The default segment granularity is 1 day. Your segment size should be between 300-700 MB. If a segment is too large, consider changing the granularity to something smaller than 1 day. Remember, Druid is designed to take advantage of memory and tries to avoid disk seeks, so Broker and Historical cache options are available. If your segments are too small, consider adding more records per segment; a good starting point is 5 million rows per segment. More details can be found here: https://druid.apache.org/docs/latest/design/segments.html
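
To check how many segments you have and how big they are, one option on recent Druid versions is the sys.segments system table. A sketch (the datasource name is a placeholder):

SELECT "datasource",
       COUNT(*) AS num_segments,
       AVG("size") / 1048576.0 AS avg_segment_mb,     -- "size" is reported in bytes
       AVG("num_rows") AS avg_rows_per_segment
FROM sys.segments
WHERE "datasource" = 'your_datasource'
GROUP BY "datasource";

Compare avg_segment_mb against the 300-700 MB guideline and avg_rows_per_segment against the ~5 million row starting point.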

Smaller segments are what I am worried about. If I check the Coordinator console, each segment shows up as just a few KB. I did read about the ideal size of 300-700 MB, but I am not sure how I'll achieve that with my use case.

I have around 15 dimensions in my datasource: some high-cardinality values (such as a couple of UUIDs) and some low-cardinality values such as browserName, deviceType, osName, etc. I have specified the segment granularity to be 15 minutes. My understanding (which could be incorrect) is that this combination of dimensions and granularity leads to the smaller segment size, since Druid needs to index the data to allow slicing and dicing on the various dimensions as well as on the time component, i.e. the query granularity, and if I specify the granularity to be a day I won't be able to query data for a 15-minute interval as and when required. I believe 15-minute granularity is not that uncommon, and in that case the segment size could be less than 300-500 MB. Are you saying that is not possible, or that segments of only a few KB would lead to serious performance issues?

Thanks,

Prathamesh

Hi Prathamesh,

Segment-level granularity does not restrict you from querying data below that granularity. For example, the wikipedia datasource may be stored at a segment granularity of P1D, but you can still query a range of a few hours, e.g.:

SELECT FLOOR(__time TO HOUR) AS HourTime, SUM(deleted) AS LinesDeleted
FROM wikipedia
WHERE "__time" BETWEEN TIMESTAMP '2016-06-26 00:00:00' AND TIMESTAMP '2016-06-27 12:00:00'
GROUP BY FLOOR(__time TO HOUR);
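
Along the same lines, even with 1-day segments you can still aggregate at 15-minute resolution at query time, e.g. with TIME_FLOOR. A sketch (the datasource name is a placeholder):

SELECT TIME_FLOOR(__time, 'PT15M') AS bucket_start, COUNT(*) AS events
FROM "your_datasource"
WHERE "__time" BETWEEN TIMESTAMP '2016-06-26 00:00:00' AND TIMESTAMP '2016-06-27 00:00:00'
GROUP BY TIME_FLOOR(__time, 'PT15M');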

You should try changing the segment granularity to 1 day and rebalancing your segments. The Historical process has to read more segments per query when there are more segments covering the query's interval. 1 day is a good start. Hope this helps.