What's the best way to add a column when the database is already running in production mode?


I had to pause the Druid integration for a while, but I restarted my work and now I’m considering a potential problem and trying to find the solution before we have to deal with it.

In the title I’ve stated only one problem, but in fact we are thinking about two problems:

  1. Is it possible to add columns at “runtime”? If yes, is there a preferred/best way to do it?
  2. Is it possible to add schemas at “runtime”? I ask because when I worked with schemas I had to define all of them in a single file and pass it as an argument when starting the server. Is it possible to reload the schemas file at runtime? Or to split the schema specs into different files?

Thank you in advance!

Since you are defining all schemas together in a single file, I believe you are using realtime nodes instead of the Druid indexing service.

FWIW, Druid segments are immutable and self contained. They contain information about all the dimensions and metrics they have.

Different segments can have different schemas for the same datasource, so you should be able to change the schema from one segment to the next.
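To illustrate the point about segments carrying their own schemas, here is a minimal Python sketch. The `segments` list and both helper functions are hypothetical and only model the idea; they are not Druid's metadata format or API. The sketch assumes that a column added later is simply absent from older segments.

```python
# Hypothetical, simplified view of two segments of the same datasource.
# Real Druid segments store far more than this.
segments = [
    {"interval": "2015-06-01T00/2015-06-01T01",
     "dimensions": ["page", "user"], "metrics": ["count"]},
    {"interval": "2015-06-01T01/2015-06-01T02",
     "dimensions": ["page", "user", "country"], "metrics": ["count"]},
]


def datasource_columns(segments):
    """Return the union of column names across all segments, in first-seen order."""
    seen = []
    for seg in segments:
        for col in seg["dimensions"] + seg["metrics"]:
            if col not in seen:
                seen.append(col)
    return seen


def missing_columns(segment, all_columns):
    """Return the columns known to the datasource that this segment does not contain."""
    present = set(segment["dimensions"]) | set(segment["metrics"])
    return [col for col in all_columns if col not in present]


all_columns = datasource_columns(segments)
print(all_columns)                                # ['page', 'user', 'count', 'country']
print(missing_columns(segments[0], all_columns))  # ['country']
```

The older segment is never rewritten; it just lacks the new column, while segments created after the change include it.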

With realtime nodes, if you need to change the schema, you will need to update the spec file and restart the node (this cannot be done at runtime without a restart). Realtime nodes recover the data that was already persisted to disk and start again from there. Adding columns or new datasources should still work fine with realtime nodes. One caveat is that if you change any dimension to a metric or vice versa, you may get an error about merging segments with conflicting schemas.
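As a rough sketch of the spec edit described above, the Python below adds a dimension to a schema and refuses a name that is already a metric, which is the conflict mentioned in the caveat. The dictionary layout is a simplified assumption modelled on a Druid `dataSchema`, and `add_dimension` is a hypothetical helper, not part of Druid or Tranquility. Check the field names against the spec file you actually use.

```python
import copy


def add_dimension(data_schema, name):
    """Return a copy of data_schema with `name` appended as a dimension.

    Raises ValueError if `name` is already a metric, since turning a metric
    into a dimension (or vice versa) can lead to conflicting segment schemas.
    """
    metric_names = {m["name"] for m in data_schema["metricsSpec"]}
    if name in metric_names:
        raise ValueError(f"'{name}' is already defined as a metric")

    updated = copy.deepcopy(data_schema)  # leave the original spec untouched
    dims = updated["parser"]["parseSpec"]["dimensionsSpec"]["dimensions"]
    if name not in dims:
        dims.append(name)
    return updated


# Assumed, simplified spec layout used only for illustration.
old_schema = {
    "dataSource": "events",
    "parser": {"parseSpec": {"dimensionsSpec": {"dimensions": ["page", "user"]}}},
    "metricsSpec": [{"type": "count", "name": "count"}],
}

new_schema = add_dimension(old_schema, "country")
print(new_schema["parser"]["parseSpec"]["dimensionsSpec"]["dimensions"])
# ['page', 'user', 'country']
```

The updated spec would then be written back to the spec file before the restart; the restart itself is still required.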

For more seamless schema changes, it is recommended to set up the indexing service and create realtime index tasks with a limited lifetime (usually an hour). Any schema changes you make will then take effect from the next set of tasks. Also have a look at Tranquility (https://github.com/druid-io/tranquility), which handles all the schema changes and creation of index tasks for you.

Hello Nishant, thank you for your response.

I’m working with Tranquility, not with RT nodes (and I’m not planning to remove columns or change their type, only to add columns), but Tranquility also uses a single file to describe all the datasource specs.

Does it reload the specs file every time I alter it? Or does it not matter, because the file is not loaded by Tranquility at all, but by the temporarily spawned workers?

With Tranquility it should be even easier to change schemas, as Tranquility creates short-lived tasks. Each task keeps the same schema for its whole duration. You should be able to change your Tranquility schemas and restart Tranquility at any time.

The idea is that you restart your tranquility process when you want to change configs, and they will get picked up when a new set of tasks spawn (see https://github.com/druid-io/tranquility/blob/master/docs/overview.md#task-creation).
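To make the timing concrete, here is a small Python sketch that estimates when a changed config would first be used. It assumes hourly tasks aligned to hour boundaries and that tasks already running for the current hour keep their old schema; both are assumptions about a typical setup, and `next_hourly_boundary` is a hypothetical helper rather than anything Tranquility provides.

```python
from datetime import datetime, timedelta


def next_hourly_boundary(now):
    """Return the start of the next full hour after `now`."""
    start_of_hour = now.replace(minute=0, second=0, microsecond=0)
    return start_of_hour + timedelta(hours=1)


# Example: Tranquility is restarted with a new schema at 10:42:17.
restart_time = datetime(2015, 6, 1, 10, 42, 17)
print(next_hourly_boundary(restart_time))  # 2015-06-01 11:00:00
```

Under those assumptions, events for the 10:00 hour would still go to tasks with the old schema, and the new column would first appear in the segment for the 11:00 hour.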

Most people using Server or Kafka do this with a rolling restart of their deployed services. Most people using the stream processor integrations (Storm/Spark/etc.) do this by restarting their stream processing job.