How to add a column in a table with data?

I have a table that has records from several months. One more column was added and I need to create it in the data table

Can you reingest the data?

I would think so… But I would like to know how to add a column when I have a new requirement.

The process I do is the following:

  1. Extract extraction of information from a database in dynamodb. This is done through a python program. The end result is a CSV that is stored on S3. This process is daily. That is, I have a CSV for each day of the month.

  2. I have a transform process where I remove columns I don’t need and other stuff.

  3. Using a curl command I upload the file to druid. This is done every day through a cron.

  4. I’ve started uploading data since March 2022. Let’s say I’m at a point where I can do the ingest again. But if in a few months a new field comes out, how could I do it?

Thanks for your attention

Druid allows flexibility in the schema of a datasource.
In principle you should be able to change your ingestion spec to include the new column and that’s it.
When you query the data, the column will not have any values in the past, it will only have values from the time you first start using the new ingestion.

Here’s some info on this from the documentation:

1 Like

:+1: @Sergio_Ferragut – the “schema” is determined from each segment itself – so as each segment relates to a time period and / or a cluster of data (secondary partitioning), the schema itself can change over time as you need it to. When you want to add a column to new data, just add the column in the dimensionsSpec. When you want it to appear in historical data, you would re-ingest those time periods and it will create a new version of each segment with your column in it. There’s not an ALTER TABLE command, per se.

Hi @Sergio_Ferragut ,

Thanks for your response. I followed the recommendations of the documentation and your explanation and implemented it by doing something additional that I detail below

I really don’t know if that’s the “orthodox” solution, but I write the way it worked for me.

  1. Check the flat file (CSV) and I saw that although the field name (header) was in the data line it was blank.

Example:

nroorder,course, content_marketing
1,“test1”,
2,“test”,

  1. Replace blank value with null text

nroorder,course, content_marketing
1,“test1”,null
2,“test”, null

  1. Check that the field name was in the DimensionSpec

    “dimensionsSpec”: {
    “dimensions”: [
    “nroorder”,
    “course”,
    “content_marketing”
    ]
    },

  2. I did the data ingestion and it worked.

Thanks.

2 Likes