index_hadoop task stops logging

I've defined a job to index data using Hadoop. However, the log in /tmp/persistent/task/ just stops midway through:

```
  } ],
  "1999-11-22T00:00:00.000Z" : [ {
    "actualSpec" : {
      "type" : "none"
    },
    "shardNum" : 1055
  } ],
  "1999-11-23T00:00:00.000Z" : [ {
    "actualSpec" : {
      "type" : "none"
    },
    "shardNum" : 1056
  } ],
  "1999-11-24T00:00:00.000Z" : [ {
    "actualSpec" : {
      "type" : "none"
    },
    "shardNum" : 1057
  } ],
  "1999-11-25T00:00:00.000Z" : [ {
    "actualSpec" : {
      "type" : "none"
    },
    "shardNum" : 1058
```

The log stops midway through printing the shards (always reasonably near the end of the interval I specify).

The process is still running, but it has just stopped logging. No segments were added to the cluster within 30 minutes, so I assume something is wrong.

I'm trying to load a single .gz file, which is ~1.5 GB compressed (~35 GB uncompressed).

This is running as a task under a MiddleManager, with the following JVM memory settings:

```
JAVA_MAX_DIRECT_MEMORY: 3g
JAVA_MAX_HEAP: 3g
JAVA_PEON_MAX_HEAP: 3g
JAVA_PEON_MIN_HEAP: 1g
JAVA_PEON_MAX_DIRECT_MEMORY: 3g
```
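(For reference, I believe the JAVA_PEON_* variables translate into peon JVM options roughly like the sketch below; the exact mapping depends on how the deployment scripts pass them through, so the property name and flags here are an assumption rather than my actual command line.)

```
# Sketch only: assumed translation of the JAVA_PEON_* variables above into
# peon JVM options, e.g. via the MiddleManager's druid.indexer.runner.javaOpts.
-Ddruid.indexer.runner.javaOpts="-server -Xms1g -Xmx3g -XX:MaxDirectMemorySize=3g"
```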

My task spec is:

```
{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "tickdata.com-ES",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "csv",
          "timestampSpec" : {
            "column" : "timestamp",
            "format" : "auto"
          },
          "dimensionsSpec" : {
            "dimensions" : ["instrument", "timestamp", "price", "volume", "conditionCode"],
            "dimensionExclusions" : [],
            "spatialDimensions" : []
          },
          "columns" : ["instrument", "timestamp", "price", "volume", "conditionCode"]
        }
      },
      "metricsSpec" : [
        { "type" : "count", "name" : "count" },
        { "type" : "min", "name" : "low", "fieldName" : "price" },
        { "type" : "max", "name" : "high", "fieldName" : "price" },
        { "type" : "longSum", "name" : "totalVolume", "fieldName" : "volume" }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "NONE",
        "intervals" : [ "1997-01-01/2020-01-01" ]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "/druid/clean/ES.gz"
      }
    },
    "tuningConfig" : {
      "type" : "hadoop"
    }
  }
}
```

I'm really not sure what's going on. Any advice on where to look for errors, or any other pointers, would be appreciated.

Thanks,

Richard

Is it possible that the machine ran out of disk space, or that the process got stuck GCing, or that it was killed by the OS?

Thanks for the help, Gian.

I think this was a GC issue. I added GC logging and noticed a lot of garbage collection, and I then upped the max memory.

```
-Ddruid.indexer.runner.javaOpts="-server -XX:+PrintGCDetails -XX:+PrintGCTimeStamps …
```
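Spelled out in full, that setting would look something like the sketch below; the heap size, direct-memory size, and GC log path are illustrative placeholders rather than the exact values I used:

```
# Illustrative only: peon JVM options with GC logging enabled and a larger heap.
-Ddruid.indexer.runner.javaOpts="-server -Xmx6g -XX:MaxDirectMemorySize=3g -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/peon-gc.log"
```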

I have ~35 GB of uncompressed CSV files. I ended up breaking my data up into months, which made each task run in under 10 minutes and felt much more manageable.
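As an illustration, each month-sized task used the same spec as above with a narrower interval, roughly like this (the specific month is just an example):

```
"granularitySpec" : {
  "type" : "uniform",
  "segmentGranularity" : "DAY",
  "queryGranularity" : "NONE",
  "intervals" : [ "1999-11-01/1999-12-01" ]
}
```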

Thanks,

Richard