Possible editing of event data

Hello everybody,

We are thinking of trying Druid for our new project, form analytics for jotform.com, where I am a developer.

So far we have implemented our analytics backend with custom scripts. All pageview data are stored in a Mongo collection, and we consume those pageviews periodically to produce the final stats for every form, storing the computed data daily. As with Druid, historical data never has to be recalculated, since no new pageview or form submission events can arrive for past dates.

Here is our problem. For a given pageview, we log the following event data:

{
  formID: someID,
  timestamp: …,
  useragent: …,
  submissionTimestamp: 0,
  eventID: some event ID
}

If a pageview later results in a form submission, we use the eventID key to find the corresponding pageview event and update it as follows:

{
  formID: someID,
  timestamp: …,
  useragent: …,
  submissionTimestamp: NEWLY_UPDATED_SUBMISSION_TIMESTAMP,
  eventID: some event ID
}

From this we can calculate the average time spent on forms.
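To make the calculation concrete, here is a rough sketch in plain Python over in-memory dicts. It is not our actual script: the function name is made up for illustration, and it assumes both timestamps are epoch seconds and that a submissionTimestamp of 0 means the pageview never led to a submission.

```python
def average_time_spent(events):
    """Average seconds between pageview and submission, per formID.

    Assumption: timestamps are epoch seconds, and submissionTimestamp == 0
    means the pageview never led to a submission.
    """
    totals = {}  # formID -> [sum of durations, number of submitted pageviews]
    for event in events:
        submitted_at = event["submissionTimestamp"]
        if submitted_at == 0:
            continue  # pageview without a submission, nothing to measure
        entry = totals.setdefault(event["formID"], [0, 0])
        entry[0] += submitted_at - event["timestamp"]
        entry[1] += 1
    return {form_id: total / count for form_id, (total, count) in totals.items()}


# Hypothetical sample events, shaped like the ones above.
events = [
    {"formID": "f1", "timestamp": 100, "submissionTimestamp": 160, "eventID": "e1"},
    {"formID": "f1", "timestamp": 200, "submissionTimestamp": 0, "eventID": "e2"},
    {"formID": "f1", "timestamp": 300, "submissionTimestamp": 340, "eventID": "e3"},
]
print(average_time_spent(events))  # {'f1': 50.0}
```

The second pageview is skipped because its submissionTimestamp is still 0, which is exactly why the later update matters for this metric.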

Can you see where I am going with this?

My question is: is there a way to modify event data that was previously sent to Druid?

Hi Kemal, Druid does not have a means of modifying a value directly per se, but segments are versioned, and only the latest version for a given time interval is used. So if you have a realtime stream that records a submissionTimestamp of 0, and a batch fixup over the same data that contains the “correct” submissionTimestamp, then once the batch fixup completes, queries will see the expected submissionTimestamp.
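A toy illustration of the "latest version wins" idea, in plain Python. This is not Druid code and uses no Druid API; the interval strings, version labels, and function name are all made up, and it simplifies by matching intervals exactly rather than handling overlaps. It only models how a later batch-built segment for the same interval shadows the earlier realtime one.

```python
def visible_segments(segments):
    """Keep, for each interval, only the segment with the highest version.

    Simplification: intervals are compared as exact strings; overlapping
    intervals are not handled in this sketch.
    """
    latest = {}
    for seg in segments:
        current = latest.get(seg["interval"])
        if current is None or seg["version"] > current["version"]:
            latest[seg["interval"]] = seg
    return latest


# Hypothetical segments: v1 from the realtime stream, v2 from the batch fixup.
segments = [
    {"interval": "2014-01-01/2014-01-02", "version": "v1",
     "rows": [{"eventID": "e1", "submissionTimestamp": 0}]},
    {"interval": "2014-01-01/2014-01-02", "version": "v2",
     "rows": [{"eventID": "e1", "submissionTimestamp": 160}]},
]

visible = visible_segments(segments)
print(visible["2014-01-01/2014-01-02"]["rows"])
# [{'eventID': 'e1', 'submissionTimestamp': 160}]
```

Nothing in the v1 segment is edited; it is simply no longer the one being served once v2 exists.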

Hi Kemal,

How long do you anticipate the time between a pageview and a form submission to be? I believe your use case is similar to one we see in ad tech: someone views an ad and may click it later. This requires a join between two events. Currently we do this join in our stream processing layer (Samza). If your requirements and timing window are similar, you could consider such a solution as well. We generally prefer joins at ETL time, as joins on the query side can be very resource intensive.
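To sketch what such a windowed join could look like, here is a minimal in-memory version in plain Python. It is not Samza code and not our actual pipeline; the class name, method names, and the 30-minute window are assumptions for illustration. Pageviews are buffered by eventID, a submission arriving inside the window emits a joined event, and pageviews that outlive the window are flushed with submissionTimestamp left at 0.

```python
class PageviewSubmissionJoiner:
    """Hypothetical windowed join of pageviews with later submissions."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.pending = {}  # eventID -> buffered pageview

    def on_pageview(self, pageview):
        # Hold the pageview back until it is joined or its window expires.
        self.pending[pageview["eventID"]] = pageview

    def on_submission(self, event_id, submission_timestamp):
        # Join against the buffered pageview; None if it is unknown
        # or was already flushed by expire().
        pageview = self.pending.pop(event_id, None)
        if pageview is None:
            return None
        return {**pageview, "submissionTimestamp": submission_timestamp}

    def expire(self, now):
        # Flush pageviews whose window has passed, unjoined.
        expired_ids = [
            eid for eid, pv in self.pending.items()
            if now - pv["timestamp"] >= self.window_seconds
        ]
        return [self.pending.pop(eid) for eid in expired_ids]


joiner = PageviewSubmissionJoiner(window_seconds=1800)  # assumed 30-minute window
joiner.on_pageview({"formID": "f1", "timestamp": 100, "submissionTimestamp": 0, "eventID": "e1"})
joiner.on_pageview({"formID": "f1", "timestamp": 120, "submissionTimestamp": 0, "eventID": "e2"})

joined = joiner.on_submission("e1", 160)  # e1 is emitted with its submission time
flushed = joiner.expire(now=2000)         # e2 is emitted without a submission
```

With this approach each event reaches Druid once, already in its final form, so nothing has to be modified afterwards. The trade-off is that data is delayed by up to the window length, and submissions arriving after the window are lost unless a batch fixup picks them up.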

Let me know if this makes sense.

– FJ