I need to re-ingest existing data. Furthermore, I need to change it by adding a new column (dimension) that will be computed from an existing one. Specifically, I need to take the value of one column and replace it with another value taken from a lookup table.
If you have the raw data, you can use the HadoopIndexTask to re-index your data with the changed dimension.
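To make that concrete, here is a minimal sketch of a HadoopIndexTask spec that re-indexes raw data with the extra dimension included. The datasource name, paths, intervals, and dimension names are placeholders, not values from this thread, and the new column is assumed to already be present in the raw input:

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {"column": "timestamp", "format": "auto"},
          "dimensionsSpec": {
            "dimensions": ["existing_dim", "new_derived_dim"]
          }
        }
      },
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2016-01-01/2016-02-01"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {"type": "static", "paths": "/path/to/raw/data"}
    }
  }
}
```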
Druid also supports query-time lookups, where you can replace a dimension value with another one from a lookup table at query time.
More details on query lookups can be found here -
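As an illustration of the lookup approach, here is a sketch of an extraction dimensionSpec that maps a dimension's values through an inline map lookup at query time. The dimension names and map values are made up for the example:

```json
{
  "type": "extraction",
  "dimension": "country_code",
  "outputName": "country_name",
  "extractionFn": {
    "type": "lookup",
    "lookup": {
      "type": "map",
      "map": {"US": "United States", "DE": "Germany"}
    },
    "retainMissingValue": true
  }
}
```

With `retainMissingValue` set to true, values not found in the map pass through unchanged rather than becoming null.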
We also recently added support for “delta” ingestion, where you can easily append a new column or new events to a segment. You can find more information here: http://druid.io/docs/latest/ingestion/batch-ingestion.html (search for “multi”)
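A rough sketch of what the "multi" inputSpec for delta ingestion looks like, combining an existing datasource's segments with new raw files in the ioConfig of a Hadoop batch task (datasource name, interval, and path are placeholders):

```json
"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "multi",
    "children": [
      {
        "type": "dataSource",
        "ingestionSpec": {
          "dataSource": "my_datasource",
          "intervals": ["2016-01-01/2016-01-02"]
        }
      },
      {"type": "static", "paths": "/path/to/new/data"}
    ]
  }
}
```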
Does Druid store raw data, or should I store the raw data in parallel? And for this case, which is better (and easier): a standalone Hadoop cluster, or just multiple MiddleManagers/peons? We have about 0.5 billion rows.
Druid stores rolled-up data, not the raw rows. Please see http://druid.io/docs/latest/design/index.html
You should save a copy of your raw data somewhere.