Looking to hire a Druid expert for basic setup + configuration; will pay a premium for help.

Hello,

So I want to add an analytics dashboard for our advertisers that tracks their campaigns (similar to Google Analytics, but specifically tailored for our site).

I looked at InfluxDB and many of the alternatives, and Druid seems head and shoulders above the rest… just very complex.

Data ingestion: We get ~300 million pageviews a month, and I want to track every pageview along with some basic metadata that comes along in the request headers (timestamp, device, URL of current page, referrer with subdomain, type of impression, etc). Pretty basic stuff.
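To make the schema concrete, here is a rough sketch of what the dimension/metric split for those pageview events might look like, written as a Python dict in the shape of a Druid dataSchema fragment. The dataSource name and all field names are placeholders, not your actual schema:

```python
# Sketch of a Druid dataSchema fragment for pageview events.
# All names (dataSource, dimension names) are placeholders.
pageview_schema = {
    "dataSource": "pageviews",  # hypothetical datasource name
    "timestampSpec": {"column": "timestamp", "format": "iso"},
    "dimensionsSpec": {
        "dimensions": ["device", "url", "referrer", "impression_type"]
    },
    "metricsSpec": [
        {"type": "count", "name": "views"}  # one count per ingested event
    ],
    "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",  # one segment file per day
        "queryGranularity": "HOUR",   # pre-aggregate rows to hourly buckets
    },
}
```

At ~300M events/month, the queryGranularity choice matters: hourly rollup collapses many raw pageviews into one stored row per dimension combination per hour, which is usually enough for day-scale dashboards.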

Then, in the realtime dashboard, the types of queries being done would be very simple. I would imagine 3-4 different ones total, only basic slicing of data along 1 dimension over time. For example, I might have total visits by device over x days, total visits by URL over x days, etc.

There will be a dropdown with fixed time periods (1d, 2d, 7d, 30d), as well as 1 very basic stacked line chart.
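A "total visits by device over x days" query like the ones described above maps naturally onto a Druid native topN query. A minimal sketch, assuming the placeholder datasource and metric names from a pageview schema (the interval is a dummy 7-day window):

```python
# Sketch of a Druid native topN query: total views by device,
# bucketed by day. dataSource/metric names are assumptions.
visits_by_device = {
    "queryType": "topN",
    "dataSource": "pageviews",
    "dimension": "device",
    "metric": "views",
    "threshold": 10,  # top 10 device values
    "granularity": "day",
    "aggregations": [
        {"type": "longSum", "name": "views", "fieldName": "views"}
    ],
    "intervals": ["2016-01-01/2016-01-08"],  # placeholder 7-day window
}
```

Each fixed dropdown period (1d, 2d, 7d, 30d) would just swap the "intervals" value; the rest of the query stays the same, which keeps the dashboard code trivial.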

Overall, I think it’s about as easy of an implementation as it gets. I really need this to be solid for our advertisers though, which is why I’m going the extra mile in choosing something flexible and complicated, rather than something like InfluxDB where I’m essentially locked in.


I’ve spent a good 6 hours so far today, and I barely got a hacked-together modified docker container working. I feel like I know about 2% of what’s going on, and time is becoming somewhat of an issue. Not good.

I’m looking for someone to help me do the following - and I’m ok with paying more for quality work.

  1. Look at my project in detail and recommend the best overall strategy for using Druid (I’ve already coded it so I can show you exactly how it will look.)
  2. Deploy the ~7-9 server EC2 production setup using docker, Ansible, or some method which would allow us to easily add/remove nodes. Any other method you prefer would also be reasonable, as long as we can increase or decrease different node types easily
  3. Do a full, proper, tailored configuration for our project, doing your best to make it production ready rather than leaving things at default or non-optimal values. I’d expect you’d ask us some questions to figure out our needs: how long to keep data, what level of approximation is acceptable on TopN queries, etc.
  4. Not required, but would be amazing: wait a few days while we push production data to it, and we can fine-tune the configuration if anything was missed.

I know this isn’t the best place to post this, but it seems Druid information is sadly a bit thin.

I would be forever grateful if anyone is interested in this offer… don’t hesitate to reply with a way for me to contact you.

Hey,

when you say that you want to track referrer URLs or the URL of the current page and query on them, you might run into memory constraints.

I think of Druid as a statistics bucket over a block of rows, not as a datastore itself. For each column you want to put into a segment (a “dimension”), Druid builds a dictionary of its distinct values, and then, as far as I understand, bitmap-indexes the dictionary entries against row positions.

When a column has effectively random values (very high cardinality), this dictionary can “blow up” and simply become too big.

It’s a different story if you only want to compute metrics over those columns without storing the raw values in the segments, such as with the HyperLogLog sketch.
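For what that HyperLogLog approach could look like: a minimal sketch of a metricsSpec using Druid’s hyperUnique aggregator, so the raw referrer strings never enter the segment as a dimension. The metric and field names here are illustrative, not from the poster’s schema:

```python
# Sketch: approximate distinct-count metric via Druid's hyperUnique
# aggregator, instead of storing "referrer" as a dimension.
metrics_spec = [
    {"type": "count", "name": "views"},
    {"type": "hyperUnique", "name": "unique_referrers",
     "fieldName": "referrer"},  # illustrative column name
]
```

The trade-off is that you get fast, memory-bounded approximate unique counts, but you can no longer group or filter by the raw referrer values.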

But maybe someone here has hands-on experience with this. I usually don’t treat Druid as a datastore.

Cheers.