How to store JSON event data when we don't know the schema upfront? [stuck, please help]

Hi there,
I’m a n00b to Druid. It’s an amazing product! Kudos to whoever came up with it. I have watched several videos about Druid to understand it better, and I think this is exactly what I need. I understand Druid is a schema-light DB. I’m thinking of building a product analytics tool, something like Amplitude (https://amplitude.com), but a much smaller version. My use case is that my customers will send whatever events they want, with whatever properties and data payload they choose, in the form of JSON. So I have no way to create a schema up front the way all the examples show.

The only things every event is guaranteed to have are the uid, timestamp, event_type, customer_id, {and the rest is the event payload}. The keys of the event payload will make up the rest of the columns. How do I store such schema-less data? Any help is appreciated.

For example, below is an event as it is sent to Amplitude. Notice that the request contains an “events” array so that multiple events can be sent in a batch. Each event has a “user_properties” field that contains user-related data as JSON, while “event_properties” contains the actual event payload. The rest of the fields are all metadata about where the event itself is coming from, such as the device and OS it was sent from. In the example below the user is on an Android device, but if the user were on a laptop, the keys would all change.

{
  "api_key": "my_amplitude_api_key",
  "events": [
    {
      "user_id": "datamonster@gmail.com",
      "device_id": "C8F9E604-F01A-4BD9-95C6-8E5357DF265D",
      "event_type": "watch_tutorial",
      "time": 1396381378123,
      "event_properties": {
        "load_time": 0.8371,
        "source": "notification",
        "dates": [
          "monday",
          "tuesday"
        ]
      },
      "user_properties": {
        "age": 25,
        "gender": "female",
        "interests": [
          "chess",
          "football",
          "music"
        ]
      },
      "groups": {
        "company_id": "1",
        "company_name": [
          "Amplitude",
          "DataMonster"
        ]
      },
      "app_version": "2.1.3",
      "platform": "iOS",
      "os_name": "Android",
      "os_version": "4.2.2",
      "device_brand": "Verizon",
      "device_manufacturer": "Apple",
      "device_model": "iPhone 9,1",
      "carrier": "Verizon",
      "country": "United States",
      "region": "California",
      "city": "San Francisco",
      "dma": "San Francisco-Oakland-San Jose, CA",
      "language": "English",
      "price": 4.99,
      "quantity": 3,
      "revenue": -1.99,
      "productId": "Google Pay Store Product Id",
      "revenueType": "Refund",
      "location_lat": 37.77,
      "location_lng": -122.39,
      "ip": "127.0.0.1",
      "idfa": "AEBE52E7-03EE-455A-B3C4-E57283966239",
      "idfv": "BCCE52E7-03EE-321A-B3D4-E57123966239",
      "adid": "AEBE52E7-03EE-455A-B3C4-E57283966239",
      "android_id": "BCCE52E7-03EE-321A-B3D4-E57123966239",
      "event_id": 23,
      "session_id": 1396381378123,
      "insert_id": "5f0adeff-6668-4427-8d02-57d803a2b841"
    }
  ]
}

In your ingestion spec, can you try setting your dimension list to empty? As long as the JSON is not nested, Druid should create a table schema based on the key-value pairs you have.
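For anyone following along, the relevant pieces of a native batch ingestion spec would look roughly like the sketch below. This is a minimal, untested sketch: the datasource name product_events, the local inputSource path, and the flattened field names are assumptions based on the Amplitude sample above, and it assumes each event has already been unwrapped from the “events” array into one JSON object per line. The empty dimensions list turns on schemaless dimension discovery, and the flattenSpec lifts a few of the nested event_properties / user_properties keys up to top-level columns, since discovery on its own only sees top-level keys.

{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "local", "baseDir": "/tmp/events", "filter": "*.json" },
      "inputFormat": {
        "type": "json",
        "flattenSpec": {
          "useFieldDiscovery": true,
          "fields": [
            { "type": "path", "name": "load_time", "expr": "$.event_properties.load_time" },
            { "type": "path", "name": "source", "expr": "$.event_properties.source" },
            { "type": "path", "name": "age", "expr": "$.user_properties.age" }
          ]
        }
      }
    },
    "dataSchema": {
      "dataSource": "product_events",
      "timestampSpec": { "column": "time", "format": "millis" },
      "dimensionsSpec": { "dimensions": [] },
      "granularitySpec": { "segmentGranularity": "day", "queryGranularity": "none", "rollup": false }
    },
    "tuningConfig": { "type": "index_parallel" }
  }
}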

Thank you. But how do I tell Druid the datatype of each of the JSON properties?

They will be imported as strings if you use this type of open data schema.

Eric Graham

Solutions Engineer - Imply

cell: 303-589-4581

email: eric.graham@imply.io

www.imply.io

Thanks, Eric.

I also learned that we can set up something upfront to detect the datatype of each of the properties of the JSON data and create a schema on the fly before we store such events. Questions:

  1. Is there any out-of-the-box solution that can do this that we can simply set up?
  2. An event’s type doesn’t change all the time, so we don’t need to keep checking every event that comes in; however, when a new JSON property does occasionally appear, we need to detect it and update the schema.
    Any help is appreciated.

Hi Raja,

I am not aware of any tools that will recognize the JSON datatype and add this information for Druid on the fly. I suppose a tool could make some assumptions about the datatype based on whether the value is numeric or not, but even then there are times when you might want to store a number as a string. In the Druid data loader, for instance, we do try to auto-detect what each column should be represented as. In the Druid specification, you can always define the datatype for each key-value pair.
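For example, the dimensionsSpec is where those datatypes get pinned down. A small sketch, with the field names simply taken from the Amplitude sample above: plain strings stay string dimensions, while the object entries declare numeric columns (note that once you list dimensions explicitly, only the listed ones are ingested).

"dimensionsSpec": {
  "dimensions": [
    "event_type",
    "source",
    { "type": "long", "name": "age" },
    { "type": "long", "name": "quantity" },
    { "type": "double", "name": "price" },
    { "type": "double", "name": "load_time" }
  ]
}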

Eric

You said that:
I also learned that we can set up something upfront to detect the datatype of each of the properties of the JSON data and create a schema on the fly before we store such events

Can you tell me how you actually did that? I am currently facing the same issue.

You can leave the dimensionsSpec undefined and Druid will create all dimensions as strings. Some formats do require some type of column definition if you don’t already have one defined in the datasource.
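To make the “some formats” point concrete: JSON carries its own field names, but a delimited format like CSV does not, so its inputFormat has to spell the columns out. A quick sketch with assumed column names:

"inputFormat": {
  "type": "csv",
  "findColumnsFromHeader": false,
  "columns": ["time", "uid", "event_type", "customer_id", "payload"]
}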

Eric Graham

Solutions Engineer - Imply

cell: 303-589-4581

email: eric.graham@imply.io

www.imply.io

So I have a couple of questions:

  1. If all dimensions are strings, will it affect query or indexing performance?
  2. What will happen when I add a new column? Will it try to reindex old segments, or will they not be affected?
    Also, you mentioned that some formats require some type of column definition; can you point me to one of them? And will it support adding new columns dynamically without changing the schema?
    And what will happen if you change a column’s type? Let’s say I have a column “X” of type “String” and then change it to “Number”. Is that possible?

Let me tell you a little bit about my use case so that you understand what I am looking for. I will publish events to Kafka and then consume them in Druid so that I can query them. I don’t know the schema of these events, and I want a Druid table to hold multiple events with different schemas; the event schemas can change at any time. I also want to be able to query these events using math functions like max and avg, but those functions do not work with strings out of the box. I would need to keep using PARSE_LONG and also include a regex filter in my queries to ignore rows that don’t have the column at all. So basically, I want schemaless ingestion but without using the string type for every column. Is that possible?
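For reference, here is one hedged sketch of the kind of Kafka supervisor spec that fits this pattern: leave dimensions empty so every unknown key is discovered as a string dimension, and for the handful of numeric fields that are known up front (price, quantity, revenue and load_time here are assumptions taken from the Amplitude example, as are the topic and datasource names), define metrics with rollup so MAX/SUM-style questions can be answered without PARSE_LONG. Everything else about the event stays schemaless.

{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "topic": "product-events",
      "consumerProperties": { "bootstrap.servers": "localhost:9092" },
      "inputFormat": { "type": "json" }
    },
    "dataSchema": {
      "dataSource": "product_events",
      "timestampSpec": { "column": "time", "format": "millis" },
      "dimensionsSpec": {
        "dimensions": [],
        "dimensionExclusions": ["price", "quantity", "revenue", "load_time"]
      },
      "metricsSpec": [
        { "type": "count", "name": "event_count" },
        { "type": "doubleSum", "name": "price_sum", "fieldName": "price" },
        { "type": "doubleMax", "name": "price_max", "fieldName": "price" },
        { "type": "longSum", "name": "quantity_sum", "fieldName": "quantity" },
        { "type": "doubleSum", "name": "revenue_sum", "fieldName": "revenue" }
      ],
      "granularitySpec": { "segmentGranularity": "hour", "queryGranularity": "minute", "rollup": true }
    },
    "tuningConfig": { "type": "kafka" }
  }
}

The dimensionExclusions list keeps the raw numeric fields from also being discovered as string dimensions, and an average can be computed at query time as price_sum / event_count.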