How to merge multiple sets of metrics+dimensions from HDFS to Druid

I have a processing platform that generates 'N' categories of metrics, each in its own location in HDFS. I would like to ingest each of these categories into Druid. I see one way of doing this:

Run an ingestion task for each category into its own datasource in Druid. So category1, available in XYZ/category1/YYYY/MM/DD, goes into the category1 datasource in Druid, and category2, available in XYZ/category2/YYYY/MM/DD, goes into the category2 datasource.
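A minimal sketch of what one such per-category task could look like, as a Hadoop-based batch ingestion spec. The dimension, metric, and timestamp column names, the date, and the JSON input format are all assumptions for illustration; the path follows the XYZ/category1/YYYY/MM/DD layout from the question:

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "category1",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "timestamp", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["dim1", "dim2"] }
        }
      },
      "metricsSpec": [
        { "type": "count", "name": "count" },
        { "type": "doubleSum", "name": "metric1_sum", "fieldName": "metric1" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2016-01-01/2016-01-02"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "XYZ/category1/2016/01/01"
      }
    },
    "tuningConfig": { "type": "hadoop" }
  }
}
```

Submitting one of these per category (with dataSource and paths swapped accordingly) gives each category its own independent datasource.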

Since each category resides in its own datasource, I need to run queries against each datasource separately.

If there is a schema change for any category, I just re-ingest with the modified schema and can then query with the changed schema.

If historical data changes in HDFS for some reason, I re-run ingestion for those time periods, and Druid will simply create new segments that overwrite the old ones for those periods.
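The scope of that overwrite is controlled by the intervals listed in the ingestion spec's granularitySpec; a hedged sketch, with an illustrative interval for one month to rebuild:

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "NONE",
  "intervals": ["2016-03-01/2016-04-01"]
}
```

Re-running the batch task with these intervals produces segments with a newer version for that period, which shadow the older segments; data outside the listed intervals is left untouched.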

Do you think there is a better way of doing this?

And are my assumptions correct?



I am installing Druid in a secured network that requires specific ports to be opened up to the outside world.

What port does Druid use when started for the first time, and on what port does it download its various dependencies?


Any suggestions here?

If you look at Druid 0.9.0, it should come packaged with all dependencies. Other than that, look at the "druid.port" config for the port Druid uses for HTTP communication.

You can configure yourself which ports Druid uses to serve requests.
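As a sketch, each node type has its own runtime.properties where druid.port can be pinned; the values below are what I believe the defaults to be in the 0.9.x line, so verify them against your own configs before opening firewall rules:

```properties
# coordinator/runtime.properties
druid.port=8081

# broker/runtime.properties
druid.port=8082

# historical/runtime.properties
druid.port=8083

# overlord/runtime.properties
druid.port=8090
```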

For dependency pulling, I guess it is HTTP, so port 80 I think.