Data model: dimensions are all strings?

Hi! I have a kind of basic Druid model question that I’m not quite finding answers to in the docs. Are all dimensions strings?

I see you can use outputType in queries to ask for them as LONG, etc., and that the parseSpec can define how they are parsed from JSON (or other) input. But is there a concept of types for the dimensions themselves as they are stored, or only for how you parse them on the way in and out?

–dave

Druid supports String, Long, Float, and Double column types for dimensions; they are stored as those types, and you can specify the type in the ingestion spec: http://druid.io/docs/latest/ingestion/index.html#dimension-schema
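For example, a typed dimensionsSpec inside the parseSpec might look like this (a sketch; the dimension names here are made up):

```json
"dimensionsSpec": {
  "dimensions": [
    "page",
    { "type": "long", "name": "userId" },
    { "type": "double", "name": "score" }
  ]
}
```

A bare string entry like "page" is shorthand for a String-typed dimension; the object form lets you pick long, float, or double instead.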

Thanks,

Jon

Thanks! Here’s what I’m a bit confused about with respect to the dimensionsSpec.

dimensionsSpec is a subfield of parseSpec. We actually wrote our own custom InputRowParser subclasses, one per type, which know how to parse our data out of protobufs into InputRows. Because our Java code already understands the schema of each of our data sources, our InputRowParsers entirely ignore the ParseSpec object handed to them via withParseSpec. My guess based on this was that the parseSpec is primarily used by the InputRowParsers, and since our IRPs don’t use the ParseSpec, it’s been unclear to us how much the parseSpec we write in the ingestionSpec actually matters.

We do still fill out the parseSpec in the ingestionSpecs (we generate our ingestionSpecs from the same code that configures our InputRowParser classes), but we’ve realized that some of our dimensions are specified there as strings when they should probably be longs.

So I guess I have two questions:

(a) Is the parseSpec used by any other part of the system than the InputRowParser?

(b) If we have existing data where a dimension is a “string” and we want to change it to “long”, can we do this and how? What happens if we just change the type in the spec and re-submit it to the supervisor? I see http://druid.io/docs/latest/ingestion/schema-changes.html but it doesn’t really answer my question.

This code in the IncrementalIndex constructor determines what type a column will be stored as:

for (DimensionSchema dimSchema : dimensionsSpec.getDimensions()) {
  ValueType type = TYPE_MAP.get(dimSchema.getValueType());
  String dimName = dimSchema.getName();
  ColumnCapabilitiesImpl capabilities = makeCapabilitiesFromValueType(type);
  capabilities.setHasBitmapIndexes(dimSchema.hasBitmapIndex());

  if (dimSchema.getTypeName().equals(DimensionSchema.SPATIAL_TYPE_NAME)) {
    capabilities.setHasSpatialIndexes(true);
  } else {
    DimensionHandler handler = DimensionHandlerUtils.getHandlerFromCapabilities(
        dimName,
        capabilities,
        dimSchema.getMultiValueHandling()
    );
    addNewDimension(dimName, capabilities, handler);
  }
  columnCapabilities.put(dimName, capabilities);
}

The dimensionsSpec ultimately comes from the InputRowParser, from the following code in IncrementalIndexSchema.Builder:


public Builder withDimensionsSpec(InputRowParser parser)
{
  if (parser != null
      && parser.getParseSpec() != null
      && parser.getParseSpec().getDimensionsSpec() != null) {
    this.dimensionsSpec = parser.getParseSpec().getDimensionsSpec();
  } else {
    this.dimensionsSpec = new DimensionsSpec(null, null, null);
  }

  return this;
}


So if you want to ingest a column as something other than String, the custom InputRowParser will need to return a ParseSpec -> DimensionsSpec that specifies the non-String types.

If no type is provided, the default type is String. If a dimension is not specified in the dimensionsSpec, and it appears in the input data, Druid will “discover” the column as a String type.
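To illustrate the discovery behavior (a sketch; the column names are made up): with an empty dimensions list, every input field not otherwise excluded is ingested as a String dimension:

```json
"dimensionsSpec": {
  "dimensions": [],
  "dimensionExclusions": ["timestamp", "value"]
}
```

Here a column such as userId appearing in the input would be discovered and stored as a String; to get a long column, it would have to be declared explicitly in the dimensions list, e.g. { "type": "long", "name": "userId" }.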

Thanks Jon, this is really helpful! A couple of follow-up questions inline.

OK, so what happens along the way as I do this? If I take a “string” dimension field (which happens to currently only contain stringified integer values), edit the field’s type to be “long”, and submit it to the Kafka Indexing Service, it’ll start indexing new segments as long. Until I reingest, what happens? Do queries stop returning values from the old segments with the outdated string dimension type? Do values just get coerced based on outputType? Is there a huge downside to just leaving them as stringified integers and using outputType to convert (I guess storage is less efficient)?

Values will get coerced based on the outputType. The only real downside to keeping them as strings is performance and storage space: usually things are faster/smaller when stored as the most ‘natural’ type.