I want to preface my question with this quote from the design/segments page:
In a basic setup, one segment file is created for each time interval, where the time interval is configurable in the segmentGranularity parameter of the granularitySpec, which is documented here. For druid to operate well under heavy query load, it is important for the segment file size to be within the recommended range of 300mb-700mb
So basically, I want the intervals (as I see them on my coordinator console) to be between 300 and 700 MB…
In my datasource, the intervals (segment files) are 1.5 GB and I want to re-balance my cluster with a lower segment granularity.
My question is: what's the most efficient way to do this without having to re-ingest the data (or resubmit a task)? Can I run a job to create a datasource from an already existing datasource?
As a follow-up to that, will it leverage all the worker nodes or just run the whole job on one thread?
Thanks… happy to provide more details to my kinda vague question.
I did try the spec below and it works, but only in a single thread… I can't use index_parallel to spread this task among my 20 worker peons.
"type" : "jsonLowercase",
"type" : "ingestSegment",
"dataSource" : "data_monthly",
"interval" : "2001-02-01/2019-03-31"
The "type": "index" setting basically forces it to run on a single worker.
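For reference, the fragments above look like pieces of a reindexing task that reads an existing datasource through the ingestSegment firehose. Here is a minimal sketch of how such a spec could be assembled, written in Python so the JSON can be printed and checked. Only the firehose fields and the "index" task type come from the post. The target name data_weekly, the WEEK granularity, and the overall layout are assumptions, and the parser and metrics sections are left out because they depend on the real schema.

```python
import json


def build_reindex_spec(source, target, interval, segment_granularity):
    """Build a native "index" task spec that re-reads an existing datasource."""
    return {
        "type": "index",  # single-task indexing, as described in the post
        "spec": {
            "dataSchema": {
                "dataSource": target,
                # parser and metricsSpec omitted: they depend on the real schema
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": segment_granularity,
                    "intervals": [interval],
                },
            },
            "ioConfig": {
                "type": "index",
                "firehose": {
                    "type": "ingestSegment",
                    "dataSource": source,
                    "interval": interval,
                },
            },
        },
    }


# "data_weekly" and "WEEK" are hypothetical values chosen for illustration.
spec = build_reindex_spec(
    "data_monthly", "data_weekly", "2001-02-01/2019-03-31", "WEEK"
)
print(json.dumps(spec, indent=2))
```

The printed JSON is only a skeleton. It would still need the schema sections filled in before it could be submitted as a task.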
Can you please elaborate on “In my datasource, the intervals (segment files) are 1.5 GB and I want to re-balance my cluster with a lower segment granularity.”?
Do you have 1.5 GB segments, or is the total size of an interval 1.5 GB (one interval can have multiple segments)?
It should also be noted that 1.5 GB is not that bad for a segment. The warning in the docs is really about having too many segments that are too small, which can happen with certain ingestion methods.
The size of an interval is 1.5 GB. Like I quoted from the documentation, there are two things they talk about.
Segment file —> The interval (as specified by the segment granularity parameter).
Segment —> Each partition in an interval.
And the above is coming from the documentation. Let me quote it again.
Druid stores its index in segment files, which are partitioned by time. In a basic setup, one segment file is created for each time interval, where the time interval is configurable in the segmentGranularity parameter of the granularitySpec, which is documented here. For druid to operate well under heavy query load, it is important for the segment file size to be within the recommended range of 300mb-700mb. If your segment files are larger than this range, then consider either changing the granularity of the time interval or partitioning your data
The above section talks about the “segment file”, which makes me believe the interval size is supposed to be in the range of 300-700 MB.
And I am aware that having 700 segments of 1 MB each is worse than having one segment of 700 MB. My problem statement is: if the segment file size (interval size) goes beyond the recommended value (the main reason being that the data I ingest explodes), I need to lower the segment granularity to keep the size of the intervals in the recommended range… My question was whether there is a way to do that.
The segment interval size (which is the sum of the sizes of all the segments for that interval) does not matter; only the size of each individual segment file needs to be kept down.
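To make the distinction concrete, here is a small sketch with made-up numbers: one interval totals 1.5 GB but is split into three 500 MB segments, while another interval holds a single 900 MB segment. The segment list and the 700 MB cutoff constant are illustrative assumptions, not data from any real cluster.

```python
from collections import defaultdict

LIMIT_BYTES = 700_000_000  # upper end of the recommended range

# Hypothetical (interval, partition number, size in bytes) rows
segments = [
    ("2019-01-01/2019-02-01", 0, 500_000_000),
    ("2019-01-01/2019-02-01", 1, 500_000_000),
    ("2019-01-01/2019-02-01", 2, 500_000_000),
    ("2019-02-01/2019-03-01", 0, 900_000_000),
]


def interval_totals(rows):
    """Sum segment sizes per interval (what the console shows per interval)."""
    totals = defaultdict(int)
    for interval, _partition, size in rows:
        totals[interval] += size
    return dict(totals)


def oversized(rows, limit=LIMIT_BYTES):
    """Return only the individual segments larger than the limit."""
    return [(i, p, s) for i, p, s in rows if s > limit]


print(interval_totals(segments))
print(oversized(segments))
```

In this example the 1.5 GB interval is fine because no single segment in it exceeds the limit. Only the 900 MB segment in the second interval is flagged.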
If you are running Druid 0.13.x or later (maybe 0.12.x will work also?), run this handy Druid SQL query:
where "size" > 700000000
It will look for all segments that are over 700 MB. Druid is pretty good about not creating segments that are too big. If it sees that there will be too much data for an interval, it will split the data into multiple segments.
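The first part of the query did not survive in the post, so the following is only a guess at its shape. It assumes the query reads from the sys.segments system table and is sent to the broker's SQL endpoint over HTTP. The selected columns, the helper names, and the broker URL argument are all assumptions added for illustration.

```python
import json
import urllib.request

SIZE_LIMIT_BYTES = 700_000_000


def build_sql_payload(limit_bytes=SIZE_LIMIT_BYTES):
    """Build the JSON body for a Druid SQL request (assumed query shape)."""
    query = (
        'SELECT "segment_id", "datasource", "size" '
        "FROM sys.segments "
        f'WHERE "size" > {int(limit_bytes)}'
    )
    return {"query": query}


def run_query(broker_url, payload):
    """POST the payload to the SQL endpoint; broker_url is a placeholder."""
    request = urllib.request.Request(
        broker_url.rstrip("/") + "/druid/v2/sql",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


payload = build_sql_payload()
print(payload["query"])
```

Calling run_query with the address of your own broker should return one row per oversized segment, and an empty result would mean no segment is over the limit.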
Have a look at: http://druid.io/docs/latest/ingestion/native_tasks.html#tuningconfig
targetPartitionSize, which defaults to 5 million rows, will make sure none of your segments has too much data in it.
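As a rough back-of-the-envelope check, the number of segments in an interval can be estimated from its row count and the target rows per segment. This sketch assumes rows are spread evenly across partitions, which real partitioning will not do exactly, and the 12 million row count is a hypothetical figure for a 1.5 GB interval.

```python
import math


def estimate_partitions(total_rows, total_bytes, target_partition_size=5_000_000):
    """Estimate the segment count and average segment size for one interval."""
    partitions = max(1, math.ceil(total_rows / target_partition_size))
    return partitions, total_bytes / partitions


# Hypothetical: a 1.5 GB interval holding 12 million rows
partitions, avg_bytes = estimate_partitions(12_000_000, 1_500_000_000)
print(partitions, avg_bytes)
```

Under these assumptions the interval would be split into three segments of about 500 MB each, which lands inside the recommended range without changing the segment granularity.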
I am not sure what “basic setup” the docs are referring to. Going to raise an issue to clarify.
Thanks Vadim. I will check those things out.