We use MiddleManager with hadoop index taks to ingest csv data. What is the typical speed of ingestion? Is it faster to use standalone Hadoop cluster instead of embedded?May be i can ingest historical data with realtime ingestion. will it be faster?.
The Hadoop indexing code is designed to be sent off to a remote cluster. The local cluster is just an example and shouldnt be used for production. See: http://imply.io/docs/latest/ingestion.html about using other Hadoop clusters such as EMR.
Why do we have that Middlemanager/peons mechanism, and built-in hadoop if we still need to use standalone cluster?
The whole indexing service setup is mainly for realtime ingestion, where the framework can autoscale based on streaming load. For Hadoop based indexing, you can just use the standalone command line hadoop indexer and not run an indexing service at all.
If you don’t do realtime ingestion, then overlord alone is enough to run “hadoop index tasks” and you don’t need to configure any middle manager nodes.