Help Ingesting Data from S3

I am evaluating Druid as a possible data warehouse for large quantities of location data. The data primarily consists of Lat, Lng, Timestamp, and UserID fields, and currently resides in S3 as JSON files. I set up a simple Druid cluster as described on the clustering page, using the AWS instance sizes recommended there, and then created a 12-node EMR cluster of r3.8xlarge instances. I tried to ingest just one file (~3.5 GB uncompressed), but the job has been stuck at

2016-10-18 15:19:35,460 INFO [main] io.druid.segment.IndexMerger: Starting dimension[timezone] with cardinality[296]
2016-10-18 15:19:40,897 INFO [main] io.druid.segment.IndexMerger: Completed dimension[timezone] in 5,437 millis.
2016-10-18 15:20:05,753 INFO [main] io.druid.segment.IndexMerger: Starting dimension[customerregistrationid] with cardinality[4,758,142]
2016-10-18 15:20:21,671 INFO [main] io.druid.segment.IndexMerger: Completed dimension[customerregistrationid] in 15,918 millis.
2016-10-18 15:20:21,672 INFO [main] io.druid.segment.IndexMerger: Starting dimension[coordinates] with cardinality[3,611,658]

for over an hour.  Below is the ingest spec file I used:

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "test",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "latitude",
              "longitude",
              "timezone",
              "customerregistrationid"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": [
              {
                "dimName": "coordinates",
                "dims": [
                  "latitude",
                  "longitude"
                ]
              }
            ]
          }
        }
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "count"
        },
        {
          "type": "doubleSum",
          "name": "added",
          "fieldName": "added"
        },
        {
          "type": "doubleSum",
          "name": "deleted",
          "fieldName": "deleted"
        },
        {
          "type": "doubleSum",
          "name": "delta",
          "fieldName": "delta"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": [
          "2016-10-05/2016-10-25"
        ]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "s3n://pathtofile"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "jobProperties": {
        "fs.s3.awsAccessKeyId": "key",
        "fs.s3.awsSecretAccessKey": "secret",
        "fs.s3.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3n.awsAccessKeyId": "key",
        "fs.s3n.awsSecretAccessKey": "secret",
        "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
      }
    }
  }
}

Any advice on what settings need to be tweaked to ingest the data at a more reasonable rate would be greatly appreciated. Also, is there any way to tell whether my current ingest job will ever finish?

Thanks,

Nathan

I tried removing the spatial dimension and adding

"partitionsSpec": {
  "type": "hashed",
  "targetPartitionSize": 5000000
},

but the job was still pretty slow. Is my definition of the spatial dimension incorrect, or are spatial dimensions just very slow to create?
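For completeness, the tuningConfig with that partitionsSpec added looks roughly like this (keys and secrets elided as in the original spec):

```json
"tuningConfig": {
  "type": "hadoop",
  "partitionsSpec": {
    "type": "hashed",
    "targetPartitionSize": 5000000
  },
  "jobProperties": {
    "fs.s3n.awsAccessKeyId": "key",
    "fs.s3n.awsSecretAccessKey": "secret",
    "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem"
  }
}
```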

Spatial dimensions are an experimental feature, so I don't think they've been strenuously tested for performance. I can imagine the process of creating the geo-index being slower than building the bitmap indexes, for example.

Hi Nathan,

If you look at your Hadoop ResourceManager UI, how many map/reduce tasks were created, are they running in parallel, and which stage is taking the longest? It feels like your current configuration isn't utilizing your EMR resources effectively.
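If you'd rather check from a script than the UI, the ResourceManager also exposes a REST API (by default on port 8088) whose /ws/v1/cluster/apps endpoint reports per-application progress and container counts. A minimal sketch of summarizing such a response follows; the sample payload is illustrative, not from your cluster:

```python
import json

# Illustrative sample of the ResourceManager's /ws/v1/cluster/apps response.
# In practice you would fetch it with something like:
#   curl "http://<emr-master>:8088/ws/v1/cluster/apps?states=RUNNING"
sample_response = json.loads("""
{
  "apps": {
    "app": [
      {"id": "application_1476800000000_0001",
       "name": "druid-indexing",
       "state": "RUNNING",
       "progress": 42.5,
       "runningContainers": 3,
       "allocatedVCores": 6}
    ]
  }
}
""")

def summarize_apps(response):
    """Return one summary line per YARN application in the response."""
    apps = response.get("apps") or {}
    lines = []
    for app in apps.get("app", []):
        lines.append("%s  state=%s  progress=%.1f%%  containers=%s" % (
            app["id"],
            app["state"],
            app["progress"],
            app.get("runningContainers", "?"),
        ))
    return lines

for line in summarize_apps(sample_response):
    print(line)
```

On a 12-node r3.8xlarge cluster, a handful of running containers would be a strong hint that the job isn't parallelizing across the cluster.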