Clarification on Data Sources

Hello All,

I’m new to Druid, and the stream push mode for creating aggregates seems very promising for us.

I was going over the documentation on druid.io, and it suggests using Tranquility for pushing data.

The Tranquility server example config came with a predefined data source, metrics. After running the sample event generation for metrics, I am able to query it via HTTP.

I wanted to try out my own data source, D1. So I defined it in the Tranquility server config as well as when creating the stream push client via Java:

DruidBeams.fromConfig(myDsConfig).buildTranquilizer(myDsConfig.tranquilizerBuilder());

I am able to push events to Druid and query them. A couple of issues:

  1. When I restart the Tranquility server, previously ingested data is lost, whereas the metrics data stays. What am I doing wrong?

  2. I am able to see the metrics data source in

http://127.0.0.1:8081/#/datasources

but not D1 in the data sources list.

I am able to see both metrics and D1 in the cluster view as cluster segments

http://127.0.0.1:8081/old-console/cluster.html

but D1 doesn't show up in

http://127.0.0.1:8081/#/

Could someone please explain the differences between declaring the data source in these two places:

  1. Adding it in tranquility server config

  2. Adding it in the config for the DruidBeams tranquilizer

Also, even though I have added dimensions in the dimension spec, the dimensions are not listed in

http://127.0.0.1:8081/old-console/cluster.html (segment dimensions)

Please let me know if I should share more screenshots, configs, etc.

PS: I am aware I'm pasting links from my localhost. I just wanted to share them so people can see where I am looking.

Correction:

  1. When I restart the Druid services, previously ingested data is lost, whereas the metrics data stays. What am I doing wrong?

Abhilash,
I suspect you restarted the middle manager before the realtime tasks could finish, which led to the loss of data.

Druid has two types of data nodes

  1. Realtime nodes/middle managers

  2. Historical nodes

As the names imply, realtime nodes are responsible for data belonging to the current time, and historical nodes are responsible for older data.

Realtime nodes hold the data in memory, and if they are killed before they can finish there is a possibility of data loss. Historical nodes serve data that is persisted on deep storage as well as on local disk, so that data survives service restarts.

A realtime task runs for the duration of (segment granularity + window period), and after that duration it hands the data over to the historical nodes.
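To make the (segment granularity + window period) arithmetic concrete, here is a minimal sketch that estimates the earliest point at which handoff can begin for a given event. The class and method names, the event timestamp and the PT10M window period are all illustrative assumptions, not values taken from your setup, and the actual handoff will finish somewhat later because the segment still has to be merged and pushed.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;

// Hypothetical helper, not part of Druid or Tranquility.
public class HandoffEstimate {

    // Returns the end of the segment bucket containing eventTime, plus the window period.
    // Before this instant the realtime task is still accepting data for the bucket.
    static Instant earliestHandoff(Instant eventTime,
                                   ChronoUnit segmentGranularity,
                                   Duration windowPeriod) {
        // Druid aligns segment buckets in UTC; truncate to the start of the bucket.
        ZonedDateTime bucketStart =
                eventTime.atZone(ZoneOffset.UTC).truncatedTo(segmentGranularity);
        ZonedDateTime bucketEnd = bucketStart.plus(1, segmentGranularity);
        return bucketEnd.plus(windowPeriod).toInstant();
    }

    public static void main(String[] args) {
        Instant event = Instant.parse("2017-03-01T09:30:00Z"); // made-up event time
        Duration window = Duration.parse("PT10M");             // assumed window period

        // DAY granularity: the task stays alive until shortly after midnight UTC.
        System.out.println(earliestHandoff(event, ChronoUnit.DAYS, window));
        // prints 2017-03-02T00:10:00Z

        // HOUR granularity: handoff can start within the hour.
        System.out.println(earliestHandoff(event, ChronoUnit.HOURS, window));
        // prints 2017-03-01T10:10:00Z
    }
}
```

With DAY granularity, anything that stops the task before that first timestamp can lose the whole day's realtime data, which is why a smaller granularity makes handoff easier to observe while testing.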

So please verify that your realtime tasks have completed successfully before restarting the service.

Druid doesn’t maintain any additional metadata about the data sources it is serving. The list is derived from the data that has been ingested, so if all the data for a particular data source is lost, that data source also disappears from Druid’s console.
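One way to check is to ask the overlord for the task status over HTTP. The sketch below assumes the overlord is reachable at http://127.0.0.1:8090 (the usual default port, which may differ in your setup) and that you pass the task ID, as shown in the overlord console, on the command line. The class and method names are hypothetical, and it needs Java 11 or later for java.net.http.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of Druid or Tranquility.
public class TaskStatusCheck {

    // Builds the overlord task status URL for the given task ID.
    static URI taskStatusUri(String overlordBase, String taskId) {
        return URI.create(overlordBase + "/druid/indexer/v1/task/" + taskId + "/status");
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: TaskStatusCheck <taskId>");
            return;
        }
        // Assumed overlord address; adjust to match your cluster.
        URI uri = taskStatusUri("http://127.0.0.1:8090", args[0]);

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // The body is JSON; look for a SUCCESS status before restarting services.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```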

Thanks

Rohit

Thanks a lot, Rohit, for the explanation.
I had used a segmentGranularity of one DAY.
I’ll try something smaller, see if the data is handed over to the historical nodes, and get back.

I would have expected the real time nodes to maybe flush to disk before shutting down. Not sure if it makes sense.

Also, while I was able to query D1 via HTTP, it was not showing up as a data source in Airbnb's Superset.

I will post the same on the Superset forum once the base is ready.