Is it possible for Druid to have 3k columns?

I tried to use Druid for a dataset with 3k dimensions and 10 measures. Most of the dimensions are spatial, with cardinality ~5.

The middleManager is configured as:

druid.indexer.runner.javaOpts=-server -Xmx6g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
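For comparison, since the failure below is a direct-memory OOM, the same javaOpts line with an explicit direct-memory cap might look like this (the 8g value is purely an illustrative assumption, not a tuned number; per the Druid docs, each peon needs roughly druid.processing.buffer.sizeBytes * (druid.processing.numThreads + druid.processing.numMergeBuffers + 1) of direct memory):

```
# Same runner javaOpts, plus an explicit -XX:MaxDirectMemorySize cap.
# 8g is an illustrative guess; size it from buffer.sizeBytes and thread counts.
druid.indexer.runner.javaOpts=-server -Xmx6g -XX:MaxDirectMemorySize=8g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
```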

A segment has ~1M events, and the task failed:

Exception in thread "plumber_merge_0" java.lang.OutOfMemoryError: Direct buffer memory
	at java.nio.Bits.reserveMemory(Bits.java:695)
	at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
	at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
	at io.druid.segment.CompressedPools$4.get(CompressedPools.java:105)
	at io.druid.segment.CompressedPools$4.get(CompressedPools.java:98)
	at io.druid.collections.StupidPool.makeObjectWithHandler(StupidPool.java:112)
	at io.druid.collections.StupidPool.take(StupidPool.java:103)
	at io.druid.segment.CompressedPools.getByteBuf(CompressedPools.java:113)
	at io.druid.segment.data.DecompressingByteBufferObjectStrategy.fromByteBuffer(DecompressingByteBufferObjectStrategy.java:49)
	at io.druid.segment.data.DecompressingByteBufferObjectStrategy.fromByteBuffer(DecompressingByteBufferObjectStrategy.java:28)
	at io.druid.segment.data.GenericIndexed$BufferIndexed.bufferedIndexedGet(GenericIndexed.java:427)
	at io.druid.segment.data.GenericIndexed$2.get(GenericIndexed.java:573)
	at io.druid.segment.data.CompressedVSizeColumnarIntsSupplier$CompressedVSizeColumnarInts.loadBuffer(CompressedVSizeColumnarIntsSupplier.java:367)
	at io.druid.segment.data.CompressedVSizeColumnarIntsSupplier$CompressedVSizeColumnarInts.get(CompressedVSizeColumnarIntsSupplier.java:340)
	at io.druid.segment.column.SimpleDictionaryEncodedColumn.getSingleValueRow(SimpleDictionaryEncodedColumn.java:79)
	at io.druid.segment.StringDimensionHandler.getEncodedKeyComponentFromColumn(StringDimensionHandler.java:186)
	at io.druid.segment.StringDimensionHandler.getEncodedKeyComponentFromColumn(StringDimensionHandler.java:35)
	at io.druid.segment.QueryableIndexIndexableAdapter$2$1.next(QueryableIndexIndexableAdapter.java:266)
	at io.druid.segment.QueryableIndexIndexableAdapter$2$1.next(QueryableIndexIndexableAdapter.java:187)
	at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
	at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
	at com.google.common.collect.Iterators$PeekingImpl.peek(Iterators.java:1162)
	at io.druid.java.util.common.guava.MergeIterator$1.compare(MergeIterator.java:48)
	at io.druid.java.util.common.guava.MergeIterator$1.compare(MergeIterator.java:44)
	at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:670)
	at java.util.PriorityQueue.siftUp(PriorityQueue.java:646)
	at java.util.PriorityQueue.offer(PriorityQueue.java:345)
	at java.util.PriorityQueue.add(PriorityQueue.java:322)
	at io.druid.java.util.common.guava.MergeIterator.<init>(MergeIterator.java:57)
	at io.druid.java.util.common.guava.MergeIterable.iterator(MergeIterable.java:52)
	at io.druid.collections.CombiningIterable.iterator(CombiningIterable.java:95)
	at io.druid.segment.IndexMergerV9.mergeIndexesAndWriteColumns(IndexMergerV9.java:456)
	at io.druid.segment.IndexMergerV9.makeIndexFiles(IndexMergerV9.java:209)
	at io.druid.segment.IndexMergerV9.merge(IndexMergerV9.java:837)
	at io.druid.segment.IndexMergerV9.mergeQueryableIndex(IndexMergerV9.java:710)
	at io.druid.segment.IndexMergerV9.mergeQueryableIndex(IndexMergerV9.java:688)
	at io.druid.segment.realtime.plumber.RealtimePlumber$2.doRun(RealtimePlumber.java:425)
	at io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:42)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Just want to know: has anyone ever succeeded with columns at this scale? If so, how should I change the configuration?

Does every row in your data need to have all 3000 columns, or would it be possible to split the input data into several datasources with fewer columns each?

If such a split is possible, I would recommend trying it; I don't think Druid as-is can handle 3000 columns without an enormous amount of RAM available.
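A rough sketch of what such a split could look like, assuming the io.druid-era ingestion spec format that matches the stack trace above; all datasource, dimension, and measure names here are made up for illustration. Each datasource would keep the shared timestamp and measures but only a subset of the 3000 dimensions:

```
{
  "dataSchema": {
    "dataSource": "events_group_a",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "timestamp", "format": "auto" },
        "dimensionsSpec": {
          "dimensions": ["dim_0001", "dim_0002", "dim_0003"]
        }
      }
    },
    "metricsSpec": [
      { "type": "count", "name": "count" },
      { "type": "doubleSum", "name": "measure_1_sum", "fieldName": "measure_1" }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "DAY",
      "queryGranularity": "NONE"
    }
  }
}
```

A second datasource ("events_group_b", etc.) would list the next subset of dimensions, so each segment merge only touches a few hundred columns at a time.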

Thanks,

Jon