What is big data?
Big data is defined by the 3Vs: the extreme volume of data, the wide variety of data types, and the velocity at which the data must be processed. Big data is a term used for collections of data sets so large and complex that they are difficult to process with traditional applications and tools.
Why are there so many open-source big data tools on the market?
Most active groups and organizations release their tools as open source to increase the chances of industry adoption. Besides, big data is a profitable market for the industry.
- Apache Hadoop
Apache Hadoop is a Java-based free software framework that can effectively store large amounts of data in a cluster using simple programming models. Hadoop consists of four parts:
- Hadoop Distributed File System (HDFS), a distributed file system designed for high-throughput access to large data sets
- YARN, a platform for managing and scheduling cluster resources
- MapReduce, a programming model for large-scale data processing
- Common libraries that support the other modules
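The MapReduce model above splits a job into a map step that emits key/value pairs, a shuffle that groups them by key, and a reduce step that aggregates each group. A minimal single-machine sketch of the idea in plain Python (this illustrates the programming model only, not the Hadoop API):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate (here, sum) the values for each key."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data is valuable"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'valuable': 1}
```

In real Hadoop, the map and reduce tasks run in parallel on many nodes and the shuffle happens over the network; the shape of the computation is the same.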
Advantages of Hadoop:
- Scalable, because it can store and distribute very large data sets across hundreds of inexpensive servers operating in parallel.
- Cost-effective
- Flexible, because it enables businesses to easily tap into new data sources and different types of data (both structured and unstructured) to generate value from that data.
- Apache Spark
Spark is a big data tool that does in-memory data processing. Spark provides simplicity because it is accessible through a set of rich APIs designed specifically for interacting with data quickly and easily at scale. Spark is also designed for speed, operating both in memory and on disk.
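The core of Spark's API is a chain of transformations over an in-memory data set that is only evaluated when a result is requested. A toy analogue in plain Python generators (this is an illustration of lazy evaluation, not the actual Spark RDD API; `MiniRDD` is a made-up name):

```python
class MiniRDD:
    """A toy, in-memory stand-in for a Spark-style data set."""
    def __init__(self, data):
        self._data = data  # an iterable; nothing is computed yet

    def map(self, fn):
        # Lazily apply fn to each element.
        return MiniRDD(fn(x) for x in self._data)

    def filter(self, pred):
        # Lazily keep only elements matching the predicate.
        return MiniRDD(x for x in self._data if pred(x))

    def collect(self):
        # Only here is the chained pipeline actually evaluated.
        return list(self._data)

rdd = MiniRDD(range(10))
result = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x).collect()
print(result)  # [0, 4, 16, 36, 64]
```

In Spark proper, each transformation stage runs partitioned across the cluster, but the chained, lazily evaluated style is the same.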
- Apache Storm
Apache Storm is a distributed real-time framework for reliably processing unbounded data streams. The framework supports any programming language. The distinctive features of Apache Storm are:
- Massive scalability
- A “fail fast, auto restart” approach
- Guaranteed processing of every tuple
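"Guaranteed processing of every tuple" means a tuple that fails is replayed until it is acknowledged. A simplified at-least-once sketch of that ack/retry loop in plain Python (this is the general pattern only; Storm's actual spout/bolt API is different):

```python
import collections

def process_stream(tuples, handler, max_retries=3):
    """At-least-once processing: a tuple is re-queued until acked."""
    queue = collections.deque((t, 0) for t in tuples)
    acked, failed = [], []
    while queue:
        item, attempts = queue.popleft()
        try:
            handler(item)
            acked.append(item)  # ack: success, tuple leaves the queue
        except Exception:
            if attempts + 1 < max_retries:
                queue.append((item, attempts + 1))  # fail fast, replay
            else:
                failed.append(item)  # give up after max_retries
    return acked, failed

seen = set()
def flaky(item):
    # Fails the first time it sees an item, succeeds on the retry.
    if item not in seen:
        seen.add(item)
        raise RuntimeError("transient failure")

acked, failed = process_stream([1, 2, 3], flaky)
print(acked, failed)  # [1, 2, 3] []
```

Note the trade-off: at-least-once delivery means a tuple may run through the handler more than once, so handlers should be idempotent.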
- Apache Cassandra
Apache Cassandra is a distributed database designed to manage large data sets across many servers. Cassandra offers a combination of capabilities that relational databases and most other NoSQL databases cannot match:
- Continuous availability as a data source
- Linear scalable performance
- Simple operations
- Easy distribution of data across data centers and cloud availability zones
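Cassandra's easy distribution and availability come from hashing each row's partition key onto a ring of nodes and replicating it to the next nodes on the ring. A crude sketch of that idea (hypothetical node names; Cassandra actually uses a Murmur3 token ring, not this scheme):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]
REPLICATION_FACTOR = 2

def replicas_for(key, nodes=NODES, rf=REPLICATION_FACTOR):
    """Hash the partition key to pick a primary node, then place
    rf-1 replicas on the following nodes around the ring."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    start = digest % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]

replicas = replicas_for("user:42")
print(replicas)  # two distinct nodes from the ring
```

Because every key deterministically maps to a set of replicas and no node is special, reads and writes keep working when a node is down, which is what "continuous availability" and "linear scalable performance" refer to.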
- RapidMiner
RapidMiner uses flow-based programming that allows visualization of pipelines; no coding is required and it is easy to set up.
- MongoDB
MongoDB is an open-source, cross-platform NoSQL database with many built-in features. It is ideal for businesses that need fast, real-time data for instant decisions, and for users who want data-driven experiences. MongoDB offers:
- An intuitive way to work with data
- The ability to put data wherever it is needed
- The freedom to run anywhere
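What makes MongoDB "an intuitive way to work with data" is that it stores schema-flexible, JSON-like documents and queries them by field. A tiny in-memory illustration of that document model (plain Python, not the pymongo API; the sample documents are made up):

```python
# A MongoDB collection holds JSON-like documents; fields may vary per document.
collection = [
    {"_id": 1, "name": "Alice", "city": "Oslo", "age": 31},
    {"_id": 2, "name": "Bob",   "city": "Lima"},            # no "age" field
    {"_id": 3, "name": "Carol", "city": "Oslo", "age": 27},
]

def find(docs, query):
    """Return documents whose fields match every key/value in the query,
    in the spirit of MongoDB's find({...})."""
    return [d for d in docs if all(d.get(k) == v for k, v in query.items())]

oslo_users = find(collection, {"city": "Oslo"})
print([d["name"] for d in oslo_users])  # ['Alice', 'Carol']
```

Unlike a relational table, no upfront schema is required: Bob simply lacks an `age` field, and queries against it just don't match him.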
- R Programming Tool
The R programming tool is one of the most widely used big data tools. It provides some 900 modules and algorithms for statistical analysis of data. Using R, one can work on discrete data and try out new analytical algorithms. It is also a portable language.
- Neo4j
Neo4j is one of the most widely used graph databases in the big data industry. It is scalable, reliable, and highly available.
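A graph database like Neo4j models entities as nodes and relationships as edges, which makes traversal queries (friends-of-friends, shortest paths) natural. A minimal breadth-first traversal over an in-memory graph conveys the idea (plain Python with made-up node names; Neo4j itself is queried with the Cypher language):

```python
from collections import deque

# Nodes and their outgoing relationships, as a graph database models them.
graph = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave"],
    "dave":  [],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: find a shortest chain of relationships."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no path exists

print(shortest_path(graph, "alice", "dave"))  # ['alice', 'bob', 'dave']
```

In a relational database the same query needs recursive self-joins; a graph store keeps the edges materialized, so each hop is a constant-time lookup.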
- Apache SAMOA
Apache SAMOA is a well-known big data tool that provides distributed streaming algorithms for big data mining. Some of SAMOA's advantages are:
- Run and program anywhere
- No system downtime
- Existing infrastructure is reusable
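The streaming algorithms SAMOA distributes must update their state one item at a time, without ever storing the whole stream. The classic single-pass example is the incremental running mean (plain Python, unrelated to SAMOA's actual API):

```python
def running_mean(stream):
    """Update the mean incrementally: each item is seen once, then discarded."""
    count, mean = 0, 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental (Welford-style) update
    return mean

print(running_mean(iter([2, 4, 6, 8])))  # 5.0
```

The same one-pass, constant-memory discipline is what distinguishes stream mining algorithms from batch ones: they can run forever on unbounded input.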
- HPCC (High-Performance Computing Cluster)
HPCC is another of the best big data tools. Some of its core features are:
- Helps in parallel data processing
- Open Source distributed data computing platform
- Follows a shared-nothing architecture
- Runs on commodity hardware
- Comes with binary packages supported for Linux distributions
- Supports end-to-end big data workflow management
- The platform includes:
source page :