Advice on configuring data sources

I’m evaluating Druid for storing aggregations of user behavioral data. Here are two examples of the types of events I might get:

{"eventType": "pageview", "userId": "1232542", "sessionId": "99999999", "productId": "1234", "country": "US", "region": "NY", "city": "New York", "zip": "23424", "pagename": "blah"}

{"eventType": "purchase", "userId": "1232542", "sessionId": "99999999", "productId": "1234", "country": "US", "region": "NY", "city": "New York", "zip": "23424", "price": 12.99, "discounts": 5.99, "paymentmethod": "amex"}

That is, all events will share a common set of dimensions (eventType, geo, user info, etc.), and each event type will also have some event-specific dimensions.

Should I ingest all of these into Druid as a single dataSource, or should every event type be a separate dataSource? The first option seems to have some obvious advantages, but I wasn’t sure whether I would run into problems because I’m asking Druid to aggregate on dimensions or fields that aren’t always there. For example, in the two events above, only the purchase event has a payment method dimension, and only the purchase event has price and discount fields to aggregate.

You can think of Druid datasources as tables in other databases. Take a look at Druid’s schemaless dimension capabilities: Druid can determine your list of dimensions as events are streamed in, so the dimensions do not have to be identical across events. However, all of your events must have the same set of metrics. If your event types have different sets of metrics, you can either create a superset of all possible metrics or use different datasources for different groupings of events.
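
To make the "superset of metrics" option concrete, here is a minimal sketch in plain Python. It does not use any Druid API; it only simulates the idea with the two events from the question. The names METRIC_FIELDS, split_event and rollup are made up for illustration, and it assumes that a missing metric can be treated as 0 and a missing dimension as null (None here).

```python
# Illustration only: plain Python, not Druid's API.
# Assumed superset of metric fields across all event types.
METRIC_FIELDS = ("price", "discounts")


def split_event(event):
    """Split a raw event into (dimensions, metrics).

    Every field that is not a metric is treated as a dimension, which
    mimics discovering dimensions as events arrive. Metrics that are
    absent from the event are filled with 0.0.
    """
    dims = {k: v for k, v in event.items() if k not in METRIC_FIELDS}
    metrics = {m: float(event.get(m, 0.0)) for m in METRIC_FIELDS}
    metrics["count"] = 1  # one row per raw event
    return dims, metrics


def rollup(events, group_by):
    """Sum the metric superset per combination of the group_by dimensions."""
    totals = {}
    for event in events:
        dims, metrics = split_event(event)
        # A dimension missing from this event becomes None in the key.
        key = tuple(dims.get(d) for d in group_by)
        row = totals.setdefault(key, {name: 0 for name in metrics})
        for name, value in metrics.items():
            row[name] += value
    return totals


events = [
    {"eventType": "pageview", "userId": "1232542", "country": "US",
     "pagename": "blah"},
    {"eventType": "purchase", "userId": "1232542", "country": "US",
     "price": 12.99, "discounts": 5.99, "paymentmethod": "amex"},
]

by_type = rollup(events, ["eventType"])
by_payment = rollup(events, ["paymentmethod"])
```

In this sketch the pageview row simply contributes 0 to price and discounts, and grouping by paymentmethod puts the pageview under a None key. Whether that trade-off is acceptable, or whether separate datasources are cleaner, depends on how different the metric sets of your event types really are.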