Getting issue with benchmarking data

Hi,

I am having an issue reproducing the Druid benchmark described in

http://druid.io/blog/2014/03/17/benchmarking-druid.html

I ingested the 100 GB data set with Hadoop indexing. According to the documentation there should be 600,037,902 rows, but Druid reports only 29,999,795 rows.

Query:

{
  "queryType": "groupBy",
  "dataSource": "tpch_lineitem",
  "granularity": "all",
  "aggregations": [
    { "type": "count", "name": "rows" }
  ],
  "intervals": ["1990-01-03T00:00/2000-11-30T00"]
}

Thanks and Regards,

Ankit Gupta

Hi Ankit,

The count aggregator is a little confusing because it counts the rows Druid has stored, not the rows in the original data. That means that if Druid found a way to compact the data at ingestion time (for example, two input rows combined into one), the count aggregator will see 1 row, not 2.
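A minimal sketch of that effect, written as a plain Python simulation rather than anything Druid actually runs. The rows and the column names (`shipmode`, `quantity`) are made up for illustration, and the `rollup` helper is hypothetical:

```python
from collections import defaultdict

def rollup(rows, dimensions):
    """Combine input rows that share a timestamp and identical dimension values."""
    stored = defaultdict(lambda: {"count": 0, "quantity": 0})
    for row in rows:
        key = (row["timestamp"],) + tuple(row[d] for d in dimensions)
        stored[key]["count"] += 1                  # ingest-time count metric
        stored[key]["quantity"] += row["quantity"]
    return dict(stored)

# Hypothetical input rows; the first two differ only in their metric value.
raw_rows = [
    {"timestamp": "1995-01-01", "shipmode": "AIR", "quantity": 5},
    {"timestamp": "1995-01-01", "shipmode": "AIR", "quantity": 7},
    {"timestamp": "1995-01-01", "shipmode": "RAIL", "quantity": 3},
]

stored = rollup(raw_rows, ["shipmode"])

# What a query-time count sees: the number of stored rows.
count_aggregator = len(stored)
# What summing the ingest-time count metric gives: the number of input rows.
long_sum_of_count = sum(metrics["count"] for metrics in stored.values())

print(count_aggregator, long_sum_of_count)
```

Here three input rows collapse into two stored rows, so the two numbers differ.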

If you defined a count metric at ingestion time, you can run a longSum over that metric to get the “real” count of the ingested rows.
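As a rough sketch, the query body for that could be built like this. It assumes the ingest-time count metric is named `count`, which may not match the benchmark's actual schema, and it uses a timeseries query in place of the groupBy above because no dimensions are needed. The helper name `ingested_row_query` is made up for this example:

```python
import json

def ingested_row_query(data_source, interval, count_metric):
    """Build a query body that reports both stored rows and ingested rows."""
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": "all",
        "aggregations": [
            # Rows as stored in Druid after any rollup.
            {"type": "count", "name": "stored_rows"},
            # Sum of the ingest-time count metric (name is an assumption).
            {"type": "longSum", "name": "ingested_rows", "fieldName": count_metric},
        ],
        "intervals": [interval],
    }

query = ingested_row_query(
    "tpch_lineitem",
    "1990-01-03T00:00/2000-11-30T00",
    "count",  # assumed metric name; use whatever the ingestion spec defines
)
print(json.dumps(query, indent=2))
```

If the two numbers in the result differ, the gap is how much rollup reduced the stored row count.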

Regards,

Charles Allen

Hi,

One important detail: I loaded the 100 files of 1 GB each in 20 jobs, with 5 files per job.

I think the segments from each new job are overwriting the segments from the previous jobs.

Is there any way to prevent segments from being overwritten?
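A quick arithmetic check on the numbers in this thread is consistent with that suspicion, although it does not prove it. The sketch assumes the rows are spread roughly evenly across the 100 files and that only the most recent job's segments are visible to queries:

```python
def rows_if_last_job_wins(total_rows, jobs):
    """Rows left if only one of `jobs` equally sized jobs survives (assumption)."""
    return total_rows / jobs

expected_rows = 600_037_902   # row count from the benchmark documentation
observed_rows = 29_999_795    # row count Druid reported
jobs = 20                     # 100 files loaded as 20 jobs of 5 files

rows_per_job = rows_if_last_job_wins(expected_rows, jobs)
ratio = expected_rows / observed_rows

print(f"rows per job: {rows_per_job:,.0f}")
print(f"expected / observed: {ratio:.2f}")
```

The observed count is almost exactly one twentieth of the expected count, which is what a single job's worth of data would look like.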

Out of curiosity, why not load all 100 files of 1 GB each in one job?

Hi Ankit, or Druid Folks,

The benchmarking blog post is woefully incomplete for Druid newbies like me.

Using druid-0.82-rc1 and the configs provided, I barely managed to get the cluster up and running. I had to comment out the monitoring section (druid.monitoring.monitors) because I got a sigar library exception. I’m running one broker node, which runs ZooKeeper, a coordinator JVM, and a broker JVM. On the compute node, I’m running a “historical” JVM. Do I need to run an indexer node as well?

The next step is to load the 100GB data. I have downloaded the data and uploaded it to an S3 bucket.

Looking at the task description at https://github.com/druid-io/druid-benchmark/blob/master/lineitem.task.json, Amazon EMR does not support the 0.20.205-emr version. Would it work with the latest EMR version, 4.1.0? Finally, where do I provide the info for the EMR endpoints?

I apologize for these newbie/basic questions but I could not find the answers anywhere.

Thanks in advance,

-Shashi

Hi Shashi, the benchmarking blog post was written 2 years ago and hasn’t been updated since. If you are having a lot of trouble setting up stock Druid, as some folks do, you can alternatively try the instructions here: http://imply.io/docs/latest/quickstart.html

You’ll likely find the setup process much more straightforward.

We haven’t had a chance to port these setup steps and scripts back to stock Druid yet. I also want to rewrite the Druid getting started guide, which hasn’t happened yet either.