Updating data...over 1gb?

I would like to confirm whether this is still an issue. If yes, is there an alternative way to do updates other than via Hadoop?

The docs (http://druid.io/docs/latest/ingestion/update-existing-data.html) say:

Note that IndexTask is to be used for prototyping purposes only as it has to do all processing inside a single process and can’t scale. Please use Hadoop batch ingestion for production scenarios dealing with more than 1GB of data.

For my testing I have been using the HttpFirehose on an IndexTask.

something like:

"ioConfig" : {
  "type" : "index",
  "firehose" : {
    "type" : "http",
    "uris" : ["http://uswest2-dev-alrepo-001.aws-dev:8888/split-files.jsonaa"]
  },
  "appendToExisting" : false
},
"tuningConfig" : {
  "type" : "index",
  "maxRowsPerSegment" : 5000000,
  "maxRowsInMemory" : 25000,
  "forceExtendableShardSpecs" : true
}
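For reference, a rough sketch of where those two sections sit inside a full native index task spec; the dataSource name, timestamp column, and granularities below are placeholders, not from my actual spec:

```json
{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "my_datasource",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : { "column" : "timestamp", "format" : "auto" },
          "dimensionsSpec" : { "dimensions" : [] }
        }
      },
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "FIFTEEN_MINUTE",
        "queryGranularity" : "none"
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "http",
        "uris" : ["http://uswest2-dev-alrepo-001.aws-dev:8888/split-files.jsonaa"]
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index",
      "maxRowsPerSegment" : 5000000,
      "maxRowsInMemory" : 25000,
      "forceExtendableShardSpecs" : true
    }
  }
}
```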


What update are you referring to: a fact or a dimension? For dimension updates you can use Lookups. For fact updates, that would require restating your data.
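To illustrate the Lookups suggestion: a minimal sketch of a static map-based lookup definition, which remaps a stored dimension value to a new one at query time instead of rewriting segments. The old/new values here are made-up placeholders, and the lookup still has to be registered (e.g. through the Coordinator's lookup configuration) before queries can use it:

```json
{
  "type" : "map",
  "map" : {
    "old_dimension_value" : "new_dimension_value",
    "ACME Corp" : "ACME Corporation"
  }
}
```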

Rommel Garcia

hrm…not sure I follow…

In my case I am talking about dimension values only… when you say 'fact', are you referring to metrics/roll-up values?

For my specific use case in my testing, I ONLY have dimensions at the moment: no roll-up, no metrics.

(I am running my first roll-up test as we speak… but that is not this use case, even though it may apply in the near future.)


To be clear… I have 3+ billion rows of dimension data (~30 columns, ~a month of data)… but now that you mention lookup tables, perhaps there is a better data model that can be applied to many of them…

An update of a 15-minute window (the current segment granularity) will exceed 1GB… and eventually we will likely compact to 1-hour segment sizes.
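For the compaction step mentioned above, a minimal sketch of a compaction task that merges the small 15-minute segments within a time range into larger ones; the dataSource name and interval are placeholders:

```json
{
  "type" : "compact",
  "dataSource" : "my_datasource",
  "interval" : "2019-01-01T00:00:00.000Z/2019-01-02T00:00:00.000Z"
}
```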