Merging new data into a historical segment

Hello,

I have a segment on a historical node and I need to merge additional metrics into it. For example, I ingest click events, and later receive conversion events that need to be merged with those clicks. I should mention that the cardinality of the segment is about 1 million records. Conversions arrive every second, and ideally I would add them to the clicks in near real time.

Which approach can be applied in this case? I could not find any answers to this, and the docs are also pretty sparse on the subject.

As I understand it, this can be done with a Hadoop indexing task using delta ingestion, but I can't figure it out because of the lack of documentation. Can you please provide a working example of the "multi" inputSpec section? And any advice on my problem in general: should I add deltas, or periodically reindex the whole segment? This is a matter of principle: if I can't find a way to update segments easily and quickly, I will need to look for another database.

I read that someone ran just 2 historical nodes and 2 realtime nodes, but about 20(!) middle managers solely for reindexing the latest historical segment, to work around the CAP theorem in a lambda architecture. Is that really the only way to update data with new metrics? What if I constantly receive new metrics at arbitrary intervals, i.e. I need to update arbitrary, already archived segments? Then, to give users real-time stats, I would have to regenerate multiple segments simultaneously, all the time. That seems irrational.

Constant reindexing of the raw data is definitely not an option. But I have an idea that needs testing. I have two types of events: clicks, a real-time flow that must be logged on time, and conversions, which are delayed and affect multiple historical segments containing clicks (maybe 100-200 every minute). What if I index click events with a granularity of **1 hour**, but index conversions with a granularity of **1 minute**? The 1-minute granularity is required for near-real-time stats display. I would then get pairs of segments that can (I hope) be easily merged together after a small delay. But this still seems too complicated.
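To make the idea concrete, the conversions datasource would carry a granularitySpec roughly like the sketch below (the interval and queryGranularity are placeholders for illustration, not a tested spec), while the clicks datasource would use "segmentGranularity": "HOUR" instead:

```json
"granularitySpec" : {
  "type" : "uniform",
  "segmentGranularity" : "MINUTE",
  "queryGranularity" : "NONE",
  "intervals" : ["2015-11-01T00:00:00Z/P1D"]
}
```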

I don't think the problem of updating data is fully covered anywhere. I would be glad to hear possible solutions from the Druid developers or experienced users.

http://druid.io/docs/0.9.0-rc2/ingestion/update-existing-data.html

Thank you for sharing. I see many changes in Druid 0.9 and have decided to start using Druid from this version too. But I still do not understand how to merge new data into a segment. This is what I'm trying to do:

```json
"ioConfig" : {
  "type" : "hadoop",
  "inputSpec" : {
    "type" : "multi",
    "children" : [
      {
        "type" : "dataSource",
        "ingestionSpec" : {
          "dataSource" : "events",
          "intervals" : ["2015-11-01T00:00:00Z/P2D"]
        }
      },
      {
        "type" : "static",
        "paths" : "hdfs://localhost:9000/druid_json/druid_2015-11-01_10k.json"
      }
    ]
  }
}
```

And I get this error:

```
java.lang.Exception: com.metamx.common.ISE: Unable to form simple file uri
	at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522) [hadoop-mapreduce-client-common-2.3.0.jar:?]
Caused by: com.metamx.common.ISE: Unable to form simple file uri
	at io.druid.indexer.JobHelper.getURIFromSegment(JobHelper.java:709) ~[druid-indexing-hadoop-0.9.0-rc2.jar:0.9.0-rc2]
```

If I just index the data from HDFS this way:

```json
"ioConfig" : {
  "type" : "hadoop",
  "inputSpec" : {
    "type" : "static",
    "paths" : "hdfs://localhost:9000/druid_json/druid_2015-11-01_10k.json"
  }
}
```

everything goes well.

If I instead do as described in the documentation, i.e. use "paths": "/druid_json/druid_2015-11-01_10k.json":

```json
"ioConfig" : {
  "type" : "hadoop",
  "inputSpec" : {
    "type" : "multi",
    "children" : [
      {
        "type" : "dataSource",
        "ingestionSpec" : {
          "dataSource" : "events",
          "intervals" : ["2015-11-01T00:00:00Z/P2D"]
        }
      },
      {
        "type" : "static",
        "paths" : "/druid_json/druid_2015-11-01_10k.json"
      }
    ]
  }
}
```

I got:

```
Caused by: java.lang.RuntimeException: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/druid_json/druid_2015-11-01_10k.json
	at io.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:211) ~[druid-indexing-hadoop-0.9.0-rc2.jar:0.9.0-rc2]
	at io.druid.indexer.JobHelper.runJobs(JobHelper.java:323) ~[druid-indexing-hadoop-0.9.0-rc2.jar:0.9.0-rc2]
```

The solution turned out to be trivial:

  • load the HDFS extension: druid.extensions.loadList=["druid-hdfs-storage"]

  • use the full HDFS path in the configuration, e.g. druid.indexer.logs.directory=hdfs://localhost:9000/druid/indexing-logs
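Putting both fixes together, the relevant lines in common.runtime.properties look roughly like this (the host, port, and directory are from my local setup; adjust them to your environment):

```properties
# Load the HDFS deep-storage extension so Druid can resolve hdfs:// URIs
druid.extensions.loadList=["druid-hdfs-storage"]

# Use fully qualified HDFS URIs rather than bare paths
druid.indexer.logs.directory=hdfs://localhost:9000/druid/indexing-logs
```

With the extension loaded and fully qualified hdfs:// URIs used throughout, the "multi" inputSpec from the first attempt no longer fails for me with "Unable to form simple file uri".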