I have a fairly large set of Parquet files stored as a dataset in S3 that I want to ingest into Druid. The files total about 1.5 billion rows.
My problem is that the ingestion process is taking forever; I'm already 3 hours in and it still isn't done.
What could I tweak in my cluster so that I can make the ingestion process faster?
My cluster setup has 2x c3.8xlarge instances (32 vCPUs / 60 GB RAM each) as the Historical/MiddleManager servers.
I've made some changes to the config file, but alas, it's still slower than I anticipated.
Here are my current configs for the indexer:
druid_indexer_runner_javaOptsArray=["-server", "-Xmx12g", "-Xms1g", "-XX:MaxDirectMemorySize=6g", "-Duser.timezone=UTC", "-Dfile.encoding=UTF-8", "-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager"]
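For completeness, I haven't touched the rest of the MiddleManager runtime.properties. These are the knobs I'm aware of on that side, sketched below with illustrative values rather than my actual config:

```
# MiddleManager runtime.properties knobs (illustrative values, not my actual config)
druid.worker.capacity=8                                         # max concurrent tasks per MiddleManager
druid.indexer.fork.property.druid.processing.numThreads=2       # processing threads per peon
druid.indexer.fork.property.druid.processing.numMergeBuffers=2
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=500000000
```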
One thing to note about my ingestion spec is that I increased *maxBytesInMemory* to around 11 GB. According to the logs, this made ingestion somewhat faster, since a higher number of rows (around 40k) is reported on each spill to disk.
Is it correct to assume that maxBytesInMemory draws from the indexer heap (-Xmx)? And that what it does is accumulate and aggregate up to 11 GB worth of input data in memory before spilling to disk?
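For reference, here is a trimmed sketch of what the tuningConfig section of my parallel-batch ingestion spec looks like with that change. Only maxBytesInMemory is a value I actually set; the other fields are shown at what I believe are their defaults:

```json
"tuningConfig": {
  "type": "index_parallel",
  "maxBytesInMemory": 11000000000,
  "maxRowsInMemory": 1000000,
  "maxNumConcurrentSubTasks": 1,
  "partitionsSpec": { "type": "dynamic", "maxRowsPerSegment": 5000000 }
}
```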
I'd appreciate it if someone could guide me on what I should do to optimize my ingestion process.