Data ingestion queries

Hi,

I am a newbie to Druid.

I have a few basic questions on batch ingestion.

1. Why do we have to specify an interval in the index task spec? Doesn't Druid detect the timestamps from the source itself?

In the http://druid.io/docs/0.8.2/tutorials/tutorial-loading-batch-data.html example:

"granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "NONE",
 **        "intervals" : [ "2013-08-31/2013-09-01" ]**      }

2. What happens if some data outside the interval is present in the source file? Does it get rejected?

3. I tried ingesting two data files with the same interval into the same datasource. It seems that the second load overwrites the interval rather than inserting/appending. Is this the expected behaviour?

4. I am looking at a use case of realtime ingestion where there may be new metrics created on the fly in upstream systems. I was looking at tranquility, at https://github.com/druid-io/tranquility/blob/master/core/src/test/scala/com/metamx/tranquility/example/ScalaExample.scala, and it looks like you need a new tranquility object for each datasource. Do you have any simple examples of tranquility applications that listen to realtime data for multiple datasources?

Thanks and Regards

Manohar

Hey Manohar,

1/2) The ‘intervals’ in the task spec are basically a filter for indexing. If Druid encounters rows outside that time range when scanning your source data, then those rows will not be indexed.
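
For example, with the "2013-08-31/2013-09-01" interval from your spec, the second of these two (hypothetical) input rows falls outside the interval and would simply not be indexed:

{"timestamp" : "2013-08-31T01:02:33Z", "page" : "Gypsy Danger", "added" : 57}
{"timestamp" : "2013-09-02T12:00:00Z", "page" : "Striker Eureka", "added" : 12}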

3) Yes, this is expected. Batch data loads in Druid are “replace-by-interval” by default, so when you load data for a particular interval, it replaces all existing data for that same interval. This is designed to make it easier to reload data after potentially altering it. If you want both files to contribute to the same interval, include them both in a single indexing run, as in the sketch below.
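
For instance, a minimal sketch using the "local" firehose from that tutorial (the file names here are hypothetical): a filter that matches both files loads them in one task, so the resulting segments contain the union of their rows instead of one file replacing the other:

"firehose" : {
        "type" : "local",
        "baseDir" : "examples/indexing/",
        "filter" : "wikipedia_data_*.json"
}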

4) You do need a new sender for each datasource. Tranquility itself includes two applications that can interact with multiple datasources: Tranquility Server and Tranquility Kafka. You can look at the source of those apps; they're in the “server” and “kafka” directories of the tranquility repo. You might even be able to just use them if they meet your needs as-is. Tranquility Server, for instance, is driven by a JSON config file that can declare several datasources at once; a rough sketch is below.
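
To give you an idea, here is a rough sketch of a Tranquility Server config with two datasources ("pageviews" and "clicks" are made-up names, and you should double-check the exact field layout against the server docs in the repo):

{
  "dataSources" : {
    "pageviews" : {
      "spec" : {
        "dataSchema" : {
          "dataSource" : "pageviews",
          "parser" : {
            "type" : "string",
            "parseSpec" : {
              "format" : "json",
              "timestampSpec" : { "column" : "timestamp", "format" : "auto" },
              "dimensionsSpec" : { "dimensions" : [ "page", "user" ] }
            }
          },
          "metricsSpec" : [ { "type" : "count", "name" : "count" } ],
          "granularitySpec" : { "type" : "uniform", "segmentGranularity" : "HOUR", "queryGranularity" : "NONE" }
        },
        "ioConfig" : { "type" : "realtime" },
        "tuningConfig" : { "type" : "realtime", "windowPeriod" : "PT10M" }
      },
      "properties" : { "task.partitions" : "1", "task.replicants" : "1" }
    },
    "clicks" : {
      "spec" : {
        "dataSchema" : {
          "dataSource" : "clicks",
          "parser" : {
            "type" : "string",
            "parseSpec" : {
              "format" : "json",
              "timestampSpec" : { "column" : "timestamp", "format" : "auto" },
              "dimensionsSpec" : { "dimensions" : [ "url", "user" ] }
            }
          },
          "metricsSpec" : [ { "type" : "count", "name" : "count" } ],
          "granularitySpec" : { "type" : "uniform", "segmentGranularity" : "HOUR", "queryGranularity" : "NONE" }
        },
        "ioConfig" : { "type" : "realtime" },
        "tuningConfig" : { "type" : "realtime", "windowPeriod" : "PT10M" }
      },
      "properties" : { "task.partitions" : "1", "task.replicants" : "1" }
    }
  },
  "properties" : {
    "zookeeper.connect" : "localhost:2181",
    "druid.discovery.curator.path" : "/druid/discovery",
    "druid.selectors.indexing.serviceName" : "druid/overlord",
    "http.port" : "8200",
    "http.threads" : "8"
  }
}

If I remember correctly, you then POST events over HTTP to a path that names the target datasource, so a single server process can feed any number of datasources.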

Hi Gian,

Thanks for the reply

1. I will take a look at these and see if I have more questions.

2. In our use case we have realtime data loaded into the system. As per some of the Druid docs, it seems an optimal segment size is 300 MB-700 MB.

How do we measure segment size? Is it the folder size of indexCache + localStorage? From a quick test, our daily segment (based on folder size) would be less than 200 MB.

However, we have frequent upstream pipeline issues, and being able to replay hourly batches is a strong requirement. Does this effectively mean that we need to store segments on an hourly basis? Will this be a “too many small segments and large metadata” situation with a bad performance impact? Are there any other options for handling replays?

Thanks and Regards

Manohar

Hey Manohar,

You have a couple options for replaying hours: either use DAY segments and replay the entire day of data at once or use HOUR segments and accept that they will be a bit small. In most cases the second option (HOUR segments) works best, as the pipeline is easier to set up and the effect of the smaller segments is usually not too bad. It could be helpful to test out both in your particular environment though.
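
In spec terms, the second option is just a change to your granularitySpec, e.g. (minimal sketch):

"granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "HOUR",
        "queryGranularity" : "NONE",
        "intervals" : [ "2013-08-31/2013-09-01" ]
}

Replaying a single hour is then just re-running the indexing task with a one-hour interval (e.g. "2013-08-31T05/2013-08-31T06"), which, per the replace-by-interval behavior above, replaces only that hour's segment.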