[druid-user] How to roll up older data monthly

Can someone guide me on how to roll up, by month, data that's already ingested into the Druid cluster at the weekly level?

Chitra

You can reindex the data with a monthly query granularity, which basically means ingesting from the existing data and either overwriting it or sending it to a new datasource. E.g., https://druid.apache.org/docs/0.20.1/ingestion/data-management.html#reindexing-with-native-batch-ingestion. I'm not sure why the docs warn against doing this for data over 1 GB; I've seen it used on much larger datasets.
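
For the monthly case, the key change is the granularitySpec of the reindex task. A minimal sketch (the rest of the spec looks like any batch spec; using MONTH for segmentGranularity as well is just a convenient choice, not a requirement):

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "MONTH",
  "queryGranularity": "MONTH",
  "rollup": true
}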

I am unable to find the exact API payload. Can you provide an example that pulls data from the current datasource and builds the reindexed data?
Chitra

Here is one I had lying around. I'm not sure whether leaving dimensions and metrics out of the spec will grab everything or not; you can try, or list them all explicitly. In this one, I load from one datasource into another, but you can make them the same and keep appendToExisting false so that you overwrite the old with the new.

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "reLoad",
      "timestampSpec": {
        "format": "auto",
        "column": "ts"
      },
      "transformSpec": {
        "transforms": [
          {
            "type": "expression",
            "name": "legacy_order_downloaded",
            "expression": "bytes_sent >= byte_offset"
          }
        ],
        "filter": {
          "type": "not",
          "field": {
            "type": "selector",
            "dimension": "order_id",
            "value": ""
          }
        }
      },
      "dimensionsSpec": {
        "dimensions": [
          { "name": "order_id", "type": "string" },
          { "name": "legacy_order_downloaded", "type": "long" }
        ]
      },
      "metricsSpec": [
        { "name": "bytes_sent", "type": "longSum", "fieldName": "bytes_sent" },
        { "name": "byte_offset", "type": "longMax", "fieldName": "byte_offset" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE",
        "rollup": true
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "druid",
        "dataSource": "firstLoad",
        "interval": "2020-11-09/2020-11-11",
        "dimensions": ["order_id", "legacy_order_downloaded"],
        "metrics": ["byte_offset", "bytes_sent"]
      },
      "appendToExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": 500000,
      "maxNumConcurrentSubTasks": 5
    }
  }
}
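
To run it, POST the spec to the Overlord's task endpoint, e.g. curl -X POST -H 'Content-Type: application/json' -d @reindex-spec.json http://<overlord-host>:8090/druid/indexer/v1/task (the file name and host are placeholders; 8090 is the default Overlord port).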

Hey Chitra - just a note of caution - it sounds like you want to roll up from week to month? Have you checked whether metrics for weeks that span two months end up in an acceptable month? For example, the ISO week starting 2021-03-29 runs through 2021-04-04, but a monthly rollup of that weekly row would put all of its metrics into March.

Hi Peter/Ben,

Currently the indexes are rolled up hourly at ingestion time.

I applied the JSON below to run weekly rollups with monthly segments.

JSON payload:
{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "rollup-developmentReq",
      "dimensionsSpec" : {
        "dimensions" : [
          "repo",
          "username"
        ]
      },
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "metricsSpec" : [
        { "type" : "count", "name" : "count" },
        { "type" : "longSum", "name" : "duration", "fieldName" : "duration" },
        { "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "month",
        "queryGranularity" : "week",
        "intervals" : ["2021-01-22/2021-02-01"],
        "rollup" : true
      }
    },
    "ioConfig" : {
      "type" : "index",
      "inputSource" : {
        "type" : "druid",
        "index" : "development-requests"
      },
      "inputFormat" : {
        "type" : "json"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index",
      "maxRowsPerSegment" : 5000000,
      "maxRowsInMemory" : 25000
    }
  }
}

ref: https://druid.apache.org/docs/0.16.0-incubating/tutorials/tutorial-rollup.html


Error after submitting the task to the Overlord using the post-index JSON:

2021-02-22T21:08:29,020 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Encountered exception in BUILD_SEGMENTS.
java.lang.NullPointerException
	at org.apache.druid.indexing.common.task.FiniteFirehoseProcessor.process(FiniteFirehoseProcessor.java:96) ~[druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.common.task.IndexTask.generateAndPublishSegments(IndexTask.java:859) ~[druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.common.task.IndexTask.runTask(IndexTask.java:467) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:137) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:419) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:391) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_242]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_242]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_242]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]
2021-02-22T21:08:29,033 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.firehose.ServiceAnnouncingChatHandlerProvider - Unregistering chat handler[index_rollup-developmentReq_2021-02-22T21:08:24.482Z]
2021-02-22T21:08:29,033 INFO [task-runner-0-priority-0] org.apache.druid.indexing.overlord.TaskRunnerUtils - Task [index_rollup-developmentReq_2021-02-22T21:08:24.482Z] status changed to [FAILED].
2021-02-22T21:08:29,041 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_rollup-developmentReq_2021-02-22T21:08:24.482Z",
  "status" : "FAILED",
  "duration" : 71,
  "errorMsg" : "java.lang.NullPointerException\n\tat org.apache.druid.indexing.common.task.FiniteFirehoseProcessor.pro...",
  "location" : {
    "host" : null,
    "port" : -1,
    "tlsPort" : -1
  }
}

Did you get to the bottom of this?

I noticed that your ioConfig might need changing.

When you're using the index task type (see Native batch ingestion in the Druid docs), I think you need to specify an ingestSegment firehose (also covered under Native batch ingestion) rather than an inputSource?
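
Something like this for the ioConfig, as a rough sketch against 0.16.0 (the interval here is copied from your granularitySpec, and I'm assuming "development-requests" is the source datasource; the inputFormat block would be dropped):

"ioConfig" : {
  "type" : "index",
  "firehose" : {
    "type" : "ingestSegment",
    "dataSource" : "development-requests",
    "interval" : "2021-01-22/2021-02-01"
  },
  "appendToExisting" : false
}

The inputSource/inputFormat fields in your spec belong to the newer ingestion framework (0.17+), and even there the druid inputSource takes "dataSource" and a required "interval" rather than "index", so that mismatch may be what produced the NullPointerException.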