As we launch Karmasphere’s new web site, I thought I’d take a few minutes to explain one of the key concepts that informs what we do at Karmasphere: Full-Fidelity Analytics for Hadoop.
Full-Fidelity Analytics means never compromising the data available to us in Hadoop in order to analyze it. There are three principles of Full-Fidelity Analytics:
1. Use the original data. Don’t pre-process or abstract it in ways that strip away the richness that drew you to Hadoop
2. Keep the data open. Don’t lock it into proprietary formats, which undermines the benefits of Hadoop’s open standards
3. Process data on-cluster without replication. Replication and off-cluster processing increase complexity and the costs of hardware and of managing the environment
By adhering to these principles during analytics, the data remains rich and standards-based, empowering companies to gain deep insights faster in the era of Big Data.
Full-Fidelity Big Data Analytics has powerful benefits for users, including:
- Maintains maximum richness of insights
- Lowers the total cost of ownership
- Reduces complexity
- Ensures portability and reuse of analytics assets
- Avoids vendor lock-in
Big Data Analytics solutions that adhere to Full-Fidelity principles will have the following characteristics:
Avoids data replication and sampling
The original data is a single source of truth. A sampling-based approach to big data analytics is at odds with the promise of Big Data: full-fidelity access to the complete data maximizes the insights that can be gained from it. Additionally, data replication should, wherever possible, be avoided to minimize costs and shorten time to insight in analytic workflows.
Uses the standard Hadoop metastore
Analytic assets are stored using the standard Hadoop metastore for data description. This enables re-use of the metadata across use cases and avoids duplication and lock-in to a proprietary metastore.
Moves analytics to the data and harnesses native Hadoop engines
In the world of big data, processing should be performed on-cluster, not in large, expensive off-cluster parallel systems that must be scaled out separately. The Apache Hadoop ecosystem includes many projects dedicated to analytic processing, and Hadoop innovation moves fast – faster than any one company can, thanks to community-driven, open source contributions. Using those native engines delivers maximum analytic power.
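To make “moving analytics to the data” concrete, here is a minimal sketch using Hadoop Streaming, one of the native Hadoop mechanisms: the mapper and reducer are ordinary scripts that Hadoop ships to the nodes where the data already lives. The word-count task, file names, and paths are illustrative assumptions, not part of any particular product.

```python
#!/usr/bin/env python
# Illustrative word count for Hadoop Streaming: the job ships these
# functions to the cluster nodes, so processing happens on-cluster,
# next to the data, rather than in an off-cluster system.
import sys
from itertools import groupby

def map_lines(lines):
    # Mapper phase: emit a (word, 1) pair for every word in the input split.
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_pairs(pairs):
    # Reducer phase: Hadoop groups pairs by key; sum the counts per word.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    # A real Streaming job runs the mapper and reducer as separate
    # processes over stdin/stdout, submitted roughly like:
    #   hadoop jar hadoop-streaming.jar -input /logs -output /counts \
    #       -mapper mapper.py -reducer reducer.py
    for word, count in reduce_pairs(map_lines(sys.stdin)):
        print("%s\t%d" % (word, count))
```

Because the scripts read and write plain tab-separated text, the same logic runs unchanged on any Hadoop cluster – no replication of the data, no proprietary runtime.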
Provides extensibility through Hadoop standards
Because Hadoop is an open source system, extensibility should, wherever possible, be based on standard Hadoop mechanisms. Companies often compete by adding their own innovations to out-of-the-box analytics systems, and they should be able to do that with proven, portable and re-usable components.
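As one sketch of what such a portable component can look like, a plain script can plug into a Hive query through the standard TRANSFORM mechanism and be reused unchanged across queries and clusters. The script name, columns, and table below are assumptions for illustration only.

```python
#!/usr/bin/env python
# normalize.py - a portable row-transform component (illustrative name
# and schema). Hive can invoke it through its standard TRANSFORM
# mechanism, roughly:
#   ADD FILE normalize.py;
#   SELECT TRANSFORM (ip, url) USING 'python normalize.py'
#       AS (ip, domain) FROM logs;
import sys
from urllib.parse import urlsplit

def normalize_row(line):
    # Rows arrive tab-separated, one per line: <ip>\t<url>.
    ip, url = line.rstrip("\n").split("\t")
    # Keep only the lower-cased host so downstream queries can group by domain.
    return "%s\t%s" % (ip, urlsplit(url).netloc.lower())

if __name__ == "__main__":
    for line in sys.stdin:
        print(normalize_row(line))
```

Because the component is just a program speaking tab-separated text over stdin/stdout, it is proven, portable, and re-usable across any Hadoop deployment rather than tied to one vendor’s extension API.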