Parquet file format

We are evaluating Drill and Druid as a solution for querying gzipped JSON data files in S3. To make it more fun, each line in a file can be one of about 10 different JSON structures, and the structures can be very different. We want to be able to query this data in S3 so it doesn’t have to live on EBS volumes ($$). Parquet seems to be more compact than gzipped JSON, so at the very least it would save time transferring the files, and supposedly it can allow for faster querying. I read this: http://druid.io/docs/latest/comparisons/druid-vs-sql-on-hadoop.html and am a bit confused. It sounds like Parquet can make Druid queries faster. Can Druid ingest Parquet files, and is that efficient, or at least as efficient as JSON?

Scott, I think you may be misunderstanding how Druid works.

Druid is not a SQL-on-Hadoop solution. All Druid segments must be downloaded locally before they can be queried, unlike a system like Drill that can query Parquet files in S3 directly. To use Parquet with Druid, you would have to read the data from Parquet and convert it into Druid’s segment format. There is an existing extension that does this.
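Roughly, a Hadoop-based ingestion spec using the Parquet extension looks something like the sketch below. The dataSource, S3 path, and column names are just placeholders, and the exact parser/inputFormat names depend on your Druid version, so check the extension docs rather than copying this verbatim:

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "events",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": { "column": "timestamp", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["user_id", "event_type"] }
        }
      },
      "metricsSpec": [ { "type": "count", "name": "count" } ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR"
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "s3n://your-bucket/path/to/data.parquet"
      }
    },
    "tuningConfig": { "type": "hadoop" }
  }
}

The indexing task reads the Parquet files from S3 once, builds segments, and pushes them to deep storage; after that, queries hit the segments rather than the original files.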

The tradeoff, of course, is query latency. If you can live with queries that take minutes or hours, then you can pull data from S3 into Drill and have Drill do the computation. Depending on how frequently you query, that may end up being the more expensive option, because of network transfer costs, compared to having Druid download all segments locally and query that data instead.
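For comparison, the Drill path is just SQL against the files where they sit. Assuming you have an S3 storage plugin named s3 configured (the path and columns below are placeholders), a query submitted to Drill’s REST API would look something like:

POST http://<drillbit-host>:8047/query.json
{
  "queryType": "SQL",
  "query": "SELECT event_type, COUNT(*) AS cnt FROM s3.`path/to/data.parquet` GROUP BY event_type"
}

Convenient, but every such query re-reads the Parquet files out of S3, which is where the transfer cost mentioned above comes from.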

Ah, ok.
We are trying to move a bunch of data out of MySQL and into S3. If we want to query it with Druid, we will end up pulling it down from S3 onto the Druid EBS volumes, defeating the whole purpose.
I was starting to think Druid might work for us, but now I don’t think so.

Thanks Fangjin.

Scott, what are you trying to do?

In terms of general product requirements and data volumes?

Hey Scott,

Druid is really different from Drill: Drill is a query engine, so it queries the data where it lies. Druid, by contrast, is a data store. It indexes your data into its own segment format and then distributes that data across the Druid nodes before queries are issued, so the query path involves local reads of data already present on those nodes. This design generally offers better performance, but you do need to store the indexed dataset on the Druid nodes. That indexed dataset can be substantially smaller than the raw data thanks to compression and rollup; with hourly rollup, for example, all raw events in a given hour that share the same dimension values collapse into a single stored row.
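Once the data is indexed, queries go against the Druid datasource rather than the raw files. Just to make it concrete (the datasource name, metric, and interval here are made up), a native timeseries query looks roughly like:

{
  "queryType": "timeseries",
  "dataSource": "events",
  "granularity": "day",
  "aggregations": [ { "type": "longSum", "name": "count", "fieldName": "count" } ],
  "intervals": [ "2016-01-01/2016-02-01" ]
}

Each historical node answers that out of the segments it already has on local disk, which is why queries come back quickly.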

Hope this helps.

We are trying to move a big chunk of data out of MySQL and into S3. Keeping the data on EBS volumes is expensive, but we still want to query it.

Ok, I’ll do some tests to see if the compression and rollup will be enough.

Depending on how much data you have, and how much compute you want to dedicate to querying, you may be able to use the local disks that the instances already come with. r3 and i2 instances are pretty popular for Druid query nodes, and the i2s especially have a lot of local storage. Many users on AWS find that they don’t need EBS.

I was just thinking that we could use the instance store volumes, since the data will still persist in S3. Maybe, though there is a lot of data.

Hi Scott, just out of curiosity, what issues are you facing with MySQL?

This is more about cost savings. All this EBS is expensive.