Batch ingestion performance

Hi,

I have a use case where I get a dump of 150 million records every hour (split across 500 files) into S3. I have an indexer running on a fleet of r3.8xl hosts (3 of them). What is the recommended way of ingesting these records? I could not split the files across different ingestion jobs because they all fall in the same hour, so the latest job would overwrite the previous ones. Ingesting all 500 files through one job takes ~7 hours. Is this normal? Is there a way I can improve the performance?

Let me know if you need any further details.

Thanks!

Sumatheja, did you try splitting the “intervals” in the index tasks into 5-minute intervals, i.e. 12 intervals (12 index spec files) for 1 hour?
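
For reference, a rough sketch of how those 12 five-minute interval strings could be generated (the example hour and the idea that each interval goes into the granularitySpec of its own index spec file are assumptions on my side):

    from datetime import datetime, timedelta

    # Example hour -- replace with the hour of your data dump (assumption).
    hour_start = datetime(2016, 1, 1, 0, 0, 0)

    # Build 12 five-minute interval strings in the ISO-8601 "start/end" form
    # used by the "intervals" field of an index task's granularitySpec.
    intervals = []
    for i in range(12):
        start = hour_start + timedelta(minutes=5 * i)
        end = start + timedelta(minutes=5)
        intervals.append("{}/{}".format(start.isoformat(), end.isoformat()))

    # One interval per index spec file, so the 12 tasks can run in parallel.
    for interval in intervals:
        print(interval)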

Hi,

Thank you for the response. All the files have the same timestamp. Do you mean tweaking the timestamp to some value within that hour?

Do all 150 million records have the same timestamp?

Yes. They are all part of hourly data dumps, and the timestamp is that of the hour. I could attach a distinct second to each file and ingest at second granularity (so that parallel ingestion can happen), but I wanted to check what the recommended way to go about this is.
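
Something like this is what I had in mind for spreading the files across the hour (the file names, the example hour, and the one-second-per-file offset scheme are just placeholders):

    from datetime import datetime, timedelta

    # Example hour of the dump -- placeholder value.
    hour_start = datetime(2016, 1, 1, 0, 0, 0)

    # 500 files per hourly dump; give each file a distinct second offset so
    # the rows no longer collide on a single timestamp and separate index
    # tasks could ingest them in parallel at second granularity.
    file_names = ["dump_part_{:03d}.gz".format(i) for i in range(500)]

    file_to_timestamp = {
        name: (hour_start + timedelta(seconds=i)).isoformat()
        for i, name in enumerate(file_names)
    }

    print(file_to_timestamp["dump_part_000.gz"])  # 2016-01-01T00:00:00
    print(file_to_timestamp["dump_part_499.gz"])  # 2016-01-01T00:08:19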

Thanks!