Ingesting SequenceFile data from HDFS

Hi Druid Gurus,

I am trying to ingest data into Druid that is stored in HDFS in SequenceFile format. Can we use the sequence format to ingest data? There are two issues I am facing:

  1. The field delimiter used is **"\20"**, which is not valid in JSON. When I put the snippet below into the index spec, it fails for obvious reasons. What is the right way to write "\20" in JSON?

         "format": "sequence",
         "delimiter": "\20",

  2. My file has a column (the 1st column in the file) that does not need to be ingested into Druid. This column also uses a different delimiter ("#") from the rest of the columns that I need. Will this be a problem? If not, how do I ignore that column?

Can you please help? Would creating a Hive table pointing at this file help us in any way? At least there I can ignore the 1st column.

The data I am trying to ingest is shown below (the field delimiters are non-printing characters, so the fields run together here):

    1152982098007131650#1900010100 2015-04-0111529820980071316507211Radio Shack Corporation6891Radio Shack CorporationUSUNITED STATESy24Electronics127.1300.0000
    1153010432441888789#1900010100 2015-04-0111530104324418887897211Radio Shack Corporation6891Radio Shack CorporationUSUNITED STATESy24Electronics.0000.0000
    1153038481117272004#1900010100 2015-04-0111530384811172720042043Collectors Universe, Inc.1953Collectors Universe, Inc.USUNITED STATESn46Memorabilia8897532.44002041480.0200
    1153091337863022384#1900010100 2015-04-0111530913378630223844576J. C. Penney Company, Inc.4373J. C. Penney Company, Inc.USUNITED STATESy28Fashion.0000.0000
    1153114799226469973#1900010100 2015-04-01115311479922646997364544HostGator Brasil62089HostGator BrasilBRBRAZILy82Services658434.7600156976.0800
    1153149933059816523#1900010100 2015-04-0111531499330598165232653Dollar General Corporation(Master)2539Dollar General Corporation(Master)USUNITED STATESy68Variety349.4200.0000
    1153159555813978458#1900010100 2015-04-01115315955581397845868148eDogPro Inc.65434eDogPro Inc.CACANADAn24Electronics58780.070010893.3000
    1153160422425990119#1900010100 2015-04-01115316042242599011966681Capital Excell LLC64038Capital Excell LLCUSUNITED STATESn8Auto-Parts78002.590033947.5500
    1153215088808594977#1900010100 2015-04-0111532150888085949772653Dollar General Corporation(Master)2539Dollar General Corporation(Master)USUNITED STATESy68Variety450.0100.0000
    1153271787079605722#1900010100 2015-04-0111532717870796057222653Dollar General Corporation(Master)2539Dollar General Corporation(Master)USUNITED STATESy68Variety107.1900.0000

Index Spec

    {
      "type": "index_hadoop",
      "spec": {
        "dataSchema": {
          "dataSource": "dtmrt_pci_rcvrs",
          "parser": {
            "type": "string",
            "parseSpec": {
              "format": "sequence",
              "delimiter": "\20",
              "timestampSpec": {
                "column": "segment_date",
                "format": "YYYY-MM-DD"
              },
              "dimensionsSpec": {
                "dimensions": [
                  "rcvr_id",
                  "chld_merch_key",
                  "chld_merch_name",
                  "prnt_merch_key",
                  "prnt_merch_name",
                  "rcvr_cntry_code",
                  "rcvr_cntry_name",
                  "le_flag_y_n",
                  "indy_key",
                  "indy_name"
                ],
                "dimensionExclusions": [],
                "spatialDimensions": []
              },
              "columns": [
                "sequence_key",
                "segment_date",
                "rcvr_id",
                "chld_merch_key",
                "chld_merch_name",
                "prnt_merch_key",
                "prnt_merch_name",
                "rcvr_cntry_code",
                "rcvr_cntry_name",
                "le_flag_y_n",
                "indy_key",
                "indy_name",
                "rcvr_12m_ntpv",
                "rcvr_3m_ntpv"
              ]
            }
          },
          "metricsSpec": [
            {
              "type": "doubleSum",
              "name": "rcvr_12m_ntpv",
              "fieldName": "rcvr_12m_ntpv"
            },
            {
              "type": "doubleSum",
              "name": "rcvr_3m_ntpv",
              "fieldName": "rcvr_3m_ntpv"
            }
          ],
          "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "DAY",
            "queryGranularity": "NONE",
            "intervals": [
              "2015-04-01/2015-04-02"
            ]
          }
        },
        "ioConfig": {
          "type": "hadoop",
          "inputSpec": {
            "type": "static",
            "paths": "hdfs://hdc.lvs.zzzzzz.com:8020/sys/pp_dt/SD/mozart/dtmrt_pci_rcvrs/sequence/incremental/1900/01/01/00/part-r-00000"
          }
        },
        "tuningConfig": {
          "type": "hadoop"
        }
      }
    }

Log File

Hi Karteek, see inline.

Hi Druid Gurus,

I am trying to ingest data into Druid that is stored in HDFS in SequenceFile format. Can we use the sequence format to ingest data? There are two issues I am facing:

  1. The field delimiter used is **"\20"**, which is not valid in JSON. When I put the snippet below into the index spec, it fails for obvious reasons. What is the right way to write "\20" in JSON?

         "format": "sequence",
         "delimiter": "\20",

Please see http://druid.io/docs/latest/ingestion/data-formats.html. You can specify custom delimiters.
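The relevant piece is a tsv parseSpec with an explicit "delimiter", written with a JSON \uXXXX escape. A minimal sketch of the failing fragment rewritten this way (which exact code point "\20" denotes depends on how the files were written, so \u0020 here is an assumption):

    "format": "tsv",
    "delimiter": "\u0020",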

  2. My file has a column (the 1st column in the file) that does not need to be ingested into Druid. This column also uses a different delimiter ("#") from the rest of the columns that I need. Will this be a problem? If not, how do I ignore that column?

Yes, this will be an issue, as Druid does not support multiple delimiters right now. Can you remove this column at ETL time, or change the delimiter? Once there is a single delimiter throughout, you can choose to ignore a column in Druid, as sketched below.
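To ignore a column once the file has a single delimiter: in a tsv parseSpec every field must be listed in "columns", but a field is only indexed if it also appears in "dimensions" (or is referenced by a metric). A minimal sketch using field names from the spec above:

    "columns": ["sequence_key", "segment_date", "rcvr_id"],
    "dimensionsSpec": {
      "dimensions": ["rcvr_id"]
    }

Here "sequence_key" is parsed but never indexed, because it appears in "columns" only.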

Fangjin,

is ingesting data from Hive a possibility?

Regards

Karteek

Hi,

I need to submit Hadoop index tasks on a periodic basis. Is there a Java/Scala API to do so? Something to help with spec creation and job submission, just like Tranquility.

Thanks,
-Vinay

Druid doesn’t have any special Hive integration. I think your best options are:

  1. If your SequenceFiles have Text values, you may be able to load them as tsv files with a custom delimiter. Druid can't handle the double-delimiter; it will consider the first field, the "#", and the next field as a single field. This may or may not be acceptable to you. The way to do this would be to add "inputFormat": "org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat" to your "inputSpec" and "format": "tsv", "delimiter": "\u0020" to your "parseSpec" (see the spec sketch after this list).

  2. Use another Hadoop job (Hive/Pig/something) to convert your SequenceFiles to TSV or JSON text files, then load them into Druid.

  3. Write a custom InputFormat that extends SequenceFileInputFormat, but transforms the text such that Druid can read it (by stripping the things you don't want to index); a sketch also follows this list.
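For option 1, a sketch of how the two fragments fit together (again, \u0020 stands in for whatever code point "\20" actually is):

    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat",
        "paths": "hdfs://hdc.lvs.zzzzzz.com:8020/.../part-r-00000"
      }
    }

and, inside the parser:

    "parseSpec": {
      "format": "tsv",
      "delimiter": "\u0020",
      ...
    }

For option 3, a minimal sketch of such an InputFormat (class names here are hypothetical; it assumes the SequenceFile keys and values are Text, and strips everything up to and including the first "#" so Druid only sees single-delimiter lines):

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader;

    // Hypothetical InputFormat: wraps the stock SequenceFile reader and
    // removes the "sequence_key#" prefix from every value before Druid
    // parses it.
    public class StrippingSequenceFileInputFormat
        extends SequenceFileInputFormat<Text, Text> {

      @Override
      public RecordReader<Text, Text> createRecordReader(
          InputSplit split, TaskAttemptContext context) {
        return new StrippingRecordReader();
      }

      private static class StrippingRecordReader extends RecordReader<Text, Text> {
        private final SequenceFileRecordReader<Text, Text> delegate =
            new SequenceFileRecordReader<>();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
          delegate.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
          if (!delegate.nextKeyValue()) {
            return false;
          }
          String line = delegate.getCurrentValue().toString();
          int hash = line.indexOf('#');
          // Drop the leading "sequence_key#"; keep the rest of the line intact.
          value.set(hash >= 0 ? line.substring(hash + 1) : line);
          return true;
        }

        @Override
        public Text getCurrentKey() throws IOException, InterruptedException {
          return delegate.getCurrentKey();
        }

        @Override
        public Text getCurrentValue() {
          return value;
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
          return delegate.getProgress();
        }

        @Override
        public void close() throws IOException {
          delegate.close();
        }
      }
    }

The class name would then go into the "inputFormat" field of the "inputSpec", in place of the stock SequenceFileInputFormat.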

Hey Vinay,

There’s no “official” library for doing this, but, you could use the IndexService client from tranquility (https://github.com/druid-io/tranquility/blob/master/src/main/scala/com/metamx/tranquility/druid/IndexService.scala) or you could build your own by combining your favorite HTTP client + Jackson + the *Task objects from the druid-indexing-service jar.
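For a rough idea of the second route, a bare-bones sketch that POSTs a task spec file to the overlord (the host, port, and file name are assumptions; /druid/indexer/v1/task is the indexing service's task-submission endpoint):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class SubmitHadoopIndexTask {
      public static void main(String[] args) throws Exception {
        // Read the index_hadoop spec from disk (file name is hypothetical).
        byte[] spec = Files.readAllBytes(Paths.get("hadoop_index_spec.json"));

        // POST it to the overlord's task endpoint (assumed host/port).
        URL url = new URL("http://overlord.example.com:8090/druid/indexer/v1/task");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
          out.write(spec);
        }

        // On success the overlord returns the task id; progress can then be
        // polled at /druid/indexer/v1/task/<taskId>/status.
        System.out.println("HTTP " + conn.getResponseCode());
      }
    }

Running something like this from cron (or any scheduler) would cover the periodic submission part.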

Hi,

We started ingesting data as you suggested: we converted the sequence-format data to TSV files and are ingesting them now. Thanks for all the help!

Karteek