Big Data Ecosystem Dataset
Data
Projects
- Frameworks
- Distributed Programming
- Distributed Filesystem
- Key-Map Data Model
- Document Data Model
- Key-value Data Model
- Graph Data Model
- NewSQL Databases
- Columnar Databases
- Time-Series Databases
- SQL-like processing
- Integrated Development Environments
- Data Ingestion
- Message-oriented middleware
- Service Programming
- Scheduling
- Machine Learning
- Benchmarking
- Security
- System Deployment
- Container Manager
- Applications
- Search engine and framework
- MySQL forks and evolutions
- PostgreSQL forks and evolutions
- Memcached forks and evolutions
- Embedded Databases
- Business Intelligence
- Data Analysis
- Data Warehouse
- Data Visualization
- Internet of Things
Frameworks
- Apache Hadoop - framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).
Distributed Programming
- AddThis Hydra - distributed data processing and storage system originally developed at AddThis.
- Akela - Mozilla's utility library for Hadoop, HBase, Pig, etc..
- Amazon Lambda - a compute service that runs your code in response to events and automatically manages the compute resources for you.
- AMPLab SIMR - run Spark on Hadoop MapReduce v1.
- AMPLab Succinct - Enabling Queries on Compressed Data.
- Apache Crunch - a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.
- Apache DataFu - collection of user-defined functions for Hadoop and Pig developed by LinkedIn.
- Apache Flink - high-performance runtime, and automatic program optimization.
- Apache Gora - framework for in-memory data model and persistence.
- Apache Hama - BSP (Bulk Synchronous Parallel) computing framework.
- Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
- Apache Pig - high level language to express data analysis programs for Hadoop.
- Apache S4 - framework for stream processing, implementation of S4.
- Apache Spark - framework for in-memory cluster computing.
- Apache Spark Streaming - framework for stream processing, part of Spark.
- Apache Storm - framework for stream processing by Twitter also on YARN.
- Apache Tez - application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
- Apache Twill - abstraction over YARN that reduces the complexity of developing distributed applications.
- Cascalog - data processing and querying library.
- Cheetah - High Performance, Custom Data Warehouse on Top of MapReduce.
- Concurrent Cascading - framework for data management/analytics on Hadoop.
- Damballa Parkour - MapReduce library for Clojure.
- Datasalt Pangool - alternative MapReduce paradigm.
- DataTorrent StrAM - real-time engine is designed to enable distributed, asynchronous, real time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
- DistributedR - scalable high-performance platform for the R language.
- Drools - a Business Rules Management System (BRMS) solution.
- eBay Oink - REST based interface for PIG execution.
- Facebook Corona - Hadoop enhancement which removes single point of failure.
- Facebook Peregrine - Map Reduce framework.
- Facebook Scuba - distributed in-memory datastore.
- Geotrellis - geographic data processing engine for high performance applications.
- GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework.
- Google Dataflow - create data pipelines to help themæingest, transform and analyze data.
- Google MapReduce - map reduce framework.
- Google MillWheel - fault tolerant stream processing framework.
- Hazelcast - In-Memory Data Grid.
- HParser - data parsing transformation environment optimized for Hadoop.
- IBM Streams - advanced analytic platform that allows user-developed applications to quickly ingest, analyze and correlate information as it arrives from thousands of real-time sources.
- JAQL - declarative programming language for working with structured, semi-structured and unstructured data.
- Kite - is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
- Kryo - Java serialization and cloning: fast, efficient, automatic.
- LinkedIn Cubert - a fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop.
- Lipstick - Pig workflow visualization tool.
- Metamarkers Druid - framework for real-time analysis of large datasets.
- Netflix Aegisthus - Bulk Data Pipeline out of Cassandra. implements a reader for the SSTable format and provides a map/reduce program to create a compacted snapshot of the data contained in a column family.
- Netflix Lipstick - Pig Visualization framework.
- Netflix Mantis - Event Stream Processing System.
- Netflix PigPen - map-reduce for Clojure whiche compiles to Apache Pig.
- Netflix STAASH - language-agnostic as well as storage-agnostic web interface for storing data into persistent storage systems.
- Netflix Zeno - Netflix's In-Memory Data Propagation Framework.
- Nextflow - Dataflow oriented toolkit for parallel and distributed computational pipelines.
- Nokia Disco - MapReduce framework developed by Nokia.
- PigPen - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.
- Pinterest Pinlater - asynchronous job execution system.
- Pubnub - Data stream network.
- Pydoop - Python MapReduce and HDFS API for Hadoop.
- ScaleOut hServer - fast, scalable in-memory data grid for Hadoop.
- SeqPig - Simple and scalable scripting for large sequencing data set(ex: bioinfomation) in Hadoop .
- SigmoidAnalytics Spork - Pig on Apache Spark.
- SpatialHadoop - SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data. .
- Spring for Apache Hadoop - unified configuration model and easy to use APIs for using HDFS, MapReduce, Pig, and Hive.
- SQLStream Blaze - stream processing platform.
- Stratio Streaming - the union of a real-time messaging bus with a complex event processing engine using Spark Streaming.
- Stratosphere - general purpose cluster computing framework.
- Streamdrill - usefull for counting activities of event streams over different time windows and finding the most active one.
- Sumo Logic - cloud based analyzer for machine-generated data..
- Teradata QueryGrid - data-access layer that can orchestrate multiple modes of analysis across multiple databases plus Hadoop.
- TIBCO ActiveSpaces - in-memory data grid.
- Tigon - a distributed framework built on Apache HadoopTM and Apache HBaseTM for real-time, high-throughput, low-latency data processing and analytics applications.
- Torch - Scientific computing for LuaJIT.
- Twitter Scalding - Scala library for Map Reduce jobs, built on Cascading.
- Twitter Summingbird - Streaming MapReduce with Scalding and Storm, by Twitter.
- Twitter TSAR - TimeSeries AggregatoR by Twitter.
Distributed Filesystem
- Apache HDFS - a way to store large files across multiple machines.
- BeeGFS - formerly FhGFS, parallel distributed file system.
- Ceph Filesystem - software storage platform designed.
- Disco DDFS - distributed filesystem.
- Facebook Haystack - object storage system.
- Google Colossus - distributed filesystem (GFS2).
- Google GFS - distributed filesystem.
- Google Megastore - scalable, highly available storage.
- GridGain - GGFS, Hadoop compliant in-memory file system.
- HDSF-DU - HDFS-DU is an interactive visualization of the Hadoop distributed file system. .
- Lustre file system - high-performance distributed filesystem.
- Netflix S3mper - library that provides an additional layer of consistency checking on top of Amazon's S3 index through use of a consistent, secondary index.
- Quantcast File System QFS - open-source distributed file system.
- Red Hat GlusterFS - scale-out network-attached storage file system.
- Tachyon - reliable file sharing at memory speed across cluster frameworks.
Key-Map Data Model
- Actian Vector - column-oriented analytic database.
- Apache Accumulo - distribuited key/value store, built on Hadoop.
- Apache Cassandra - column-oriented distribuited datastore, inspired by BigTable.
- Apache HBase - column-oriented distribuited datastore, inspired by BigTable.
- Facebook HydraBase - evolution of HBase made by Facebook.
- Google BigTable - column-oriented distributed datastore.
- Google Cloud Datastore - is a fully managed, schemaless database for storing non-relational data over BigTable.
- Hypertable - column-oriented distribuited datastore, inspired by BigTable.
- InfiniDB - is accessed through a MySQL interface and use massive parallel processing to parallelize queries.
- MapR-DB - fast, scalable, and enterprise-ready in-Hadoop database architected to manage big data.
- Netflix Priam - Co-Process for backup/recovery, Token Management, and Centralized Configuration management for Cassandra.
- OhmData C5 - improved version of HBase.
- Sqrrl - NoSQL databases on top of Apache Accumulo.
- Tephra - Transactions for HBase.
- Twitter Manhattan - real-time, multi-tenant distributed database for Twitter scale.
Document Data Model
- Actian Versant - commercial object-oriented database management systems .
- Amazon SimpleDB - a highly available and flexible non-relational data store that offloads the work of database administration.
- Clusterpoint - a database software for high-speed storage and large-scale processing of XML and JSON data on clusters of commodity hardware.
- Crate Data - is an open source massively scalable data store. It requires zero administration.
- Facebook Apollo - Facebook’s Paxos-like NoSQL database.
- jumboDB - document oriented datastore over Hadoop.
- LinkedIn Espresso - horizontally scalable document-oriented NoSQL data store.
- MarkLogic - Schema-agnostic Enterprise NoSQL database technology.
- Microsoft DocumentDB - fully-managed, highly-scalable, NoSQL document database service.
- MongoDB - Document-oriented database system.
- RavenDB - A transactional, open-source Document Database.
- RethinkDB - document database that supports queries like table joins and group by.
- Terrastore - a modern document store which provides advanced scalability and elasticity features without sacrificing consistency.
- TokuMX - High-Performance MongoDB Distribution.
Key-value Data Model
- Aerospike - NoSQL flash-optimized, in-memory. Open source and "Server code in 'C' (not Java or Erlang) precisely tuned to avoid context switching and memory copies..
- Amazon DynamoDB - distributed key/value store, implementation of Dynamo paper.
- Couchbase ForestDB - Fast Key-Value Storage Engine Based on Hierarchical B+-Tree Trie.
- Edis - is a protocol-compatible Server replacement for Redis.
- ElephantDB - Distributed database specialized in exporting data from Hadoop.
- EventStore - distributed time series database.
- HyperDex - next generation key-value store.
- KAI - a distributed key-value datastore.
- LinkedIn Krati - is a simple persistent data store with very low latency and high throughput.
- Linkedin Voldemort - distributed key/value storage system.
- MemcacheDB - a distributed key-value storage system designed for persistent.
- Netflix Dynomite - thin Dynamo-based replication for cached data.
- Oracle NoSQL Database - distributed key-value database by Oracle Corporation.
- RAMCloud - storage system that provides large-scale low-latency storage by keeping all data in DRAM all the time and aggregating the main memories of thousands of servers.
- Redis - in memory key value datastore.
- Redis Cluster - distributed implementation of Redis.
- Redis Sentinel - system designed to help managing Redis instances.
- Riak - a decentralized datastore.
- Scalaris - a distributed transactional key-value store.
- Storehaus - library to work with asynchronous key value stores, by Twitter.
- Tarantool - an efficient NoSQL database and a Lua application server.
- TreodeDB - key-value store that's replicated and sharded and provides atomic multirow writes.
- Yahoo Sherpa - hosted, distributed and geographically replicated key-valueÊcloud storage platform.
Graph Data Model
- Apache Giraph - implementation of Pregel, based on Hadoop.
- Apache Spark Bagel - implementation of Pregel, part of Spark.
- ArangoDB - multi model distribuited database.
- Facebook TAO - TAO is the distributed data store that is widely used at facebook to store and serve the social graph.
- Faunus - Hadoop-based graph analytics engine for analyzing graphs represented across a multi-machine compute cluster.
- Google Cayley - open-source graph database.
- Google Pregel - graph processing framework.
- GraphLab PowerGraph - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
- GraphX - resilient Distributed Graph System on Spark.
- Gremlin - graph traversal Language.
- HyperGraphDB - general purpose, open-source data storage mechanism based on a powerful knowledge management formalism known as directed hypergraphs.
- InfiniteGraph - distributed graph database.
- Infovore - RDF-centric Map/Reduce framework.
- Intel GraphBuilder - tools to construct large-scale graphs on top of Hadoop.
- MapGraph - Massively Parallel Graph processing on GPUs.
- Neo4j - graph database writting entirely in Java.
- OrientDB - document and graph database.
- Phoebus - framework for large scale graph processing.
- Sparksee - scalable high-performance graph database.
- Stardog - graph database: search, query, reasoning, and constraints in a lightweight, pure Java system.
- Titan - distributed graph database, built over Cassandra.
- Twitter FlockDB - distribuited graph database.
NewSQL Databases
- Actian Ingres - commercially supported, open-source SQL relational database management system.
- BayesDB - statistic oriented SQL database.
- Cockroach - Scalable, Geo-Replicated, Transactional Datastore.
- Datomic - distributed database designed to enable scalable, flexible and intelligent applications.
- FoundationDB - distributed database, inspired by F1.
- Google F1 - distributed SQL database built on Spanner.
- Google Spanner - globally distributed semi-relational database.
- H-Store - is an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications.
- HandlerSocket - NoSQL plugin for MySQL/MariaDB.
- IBM DB2 - object-relational database management system.
- InfiniSQL - infinity scalable RDBMS.
- MemSQL - in memory SQL database witho optimized columnar storage on flash.
- NuoDB - SQL/ACID compliant distributed database.
- Oracle Database - object-relational database management system.
- Oracle TimesTen in-Memory Database - in-memory, relational database management system with persistence and recoverability.
- Pivotal GemFire XD - Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.
- SAP HANA - is an in-memory, column-oriented, relational database management system.
- SenseiDB - distributed, realtime, semi-structured database.
- Sky - database used for flexible, high performance analysis of behavioral data.
- SymmetricDS - open source software for both file and database synchronization.
- Teradata Database - complete relational database management system.
- VoltDB - in-memory NewSQL database.
Columnar Databases
- Amazon RedShift - data warehouse service, based on PostgreSQL.
- C-Store - column oriented DBMS.
- Google BigQuery - framework for interactive analysis, implementation of Dremel.
- Google Dremel - framework for interactive analysis, implementation of Dremel.
- MonetDB - column store database.
- Parquet - columnar storage format for Hadoop.
- Pivotal Greenplum - purpose-built, dedicated analytic data warehouse.
- Vertica - is designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.
Time-Series Databases
- Cube - uses MongoDB to store time series data.
- Etsy StatsD - simple daemon for easy stats aggregation.
- InfluxDB - distributed time series database.
- Kairosdb - similar to OpenTSDB but allows for Cassandra.
- OpenTSDB - distributed time series database on top of HBase.
- Square Cube - system for collecting timestamped events and deriving metrics.
- TempoIQ - Cloud-based sensor analytics.
SQL-like processing
- Actian SQL for Hadoop - high performance interactive SQL access to all Hadoop data.
- AMPLAB Shark - data warehouse system for Spark.
- Apache Drill - framework for interactive analysis, inspired by Dremel.
- Apache HCatalog - table and storage management layer for Hadoop.
- Apache Hive - SQL-like data warehouse system for Hadoop.
- Apache Optiq - framework that allows efficient translation of queries involving heterogeneous and federated data.
- Apache Phoenix - SQL skin over HBase.
- BlinkDB - massively parallel, approximate query engine.
- Cloudera Impala - framework for interactive analysis, Inspired by Dremel.
- Concurrent Lingual - SQL-like query language for Cascading.
- Datasalt Splout SQL - full SQL query engine for big datasets.
- eBay Kylin - Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets.
- Facebook PrestoDB - distributed SQL query engine.
- Hadapt - a native implementation of SQL for the Apache Hadoop open-source project.
- JethroData - index-based SQL engine for Hadoop.
- Metanautix Quest - data compute engine.
- Pivotal HAWQ - SQL-like data warehouse system for Hadoop.
- RainstorDB - database for storing petabyte-scale volumes of structured and semi-structured data.
- Spark Catalyst - is a Query Optimization Framework for Spark and Shark.
- SparkSQL - Manipulating Structured Data Using Spark.
- Splice Machine - a full-featured SQL-on-Hadoop RDBMS with ACID transactions.
- Stinger - interactive query for Hive.
- Tajo - distributed data warehouse system on Hadoop.
- Trafodion - enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads.
Integrated Development Environments
- R-Studio - IDE for R.
Data Ingestion
- Amazon Kinesis - real-time processing of streaming data at massive scale.
- Apache BookKeeper - a distributed logging service called BookKeeper and a distributed publish/subscribe system built on top of BookKeeper called Hedwig.
- Apache Chukwa - data collection system.
- Apache Flume - service to manage large amount of log data.
- Apache Samza - stream processing framework, based on Kafla and YARN.
- Apache Sqoop - tool to transfer data between Hadoop and a structured datastore.
- Apache UIMA - Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user.
- Cloudera Morphlines - framework that help ETL to Solr, HBase and HDFS.
- Facebook Scribe - streamed log data aggregator.
- Fluentd - tool to collect events and logs.
- Google Photon - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
- Heka - open source stream processing software system.
- HIHO - framework for connecting disparate data sources with Hadoop.
- LinkedIn Camus - Kafka to HDFS pipeline. It is a mapreduce job that does distributed data loads out of Kafka.
- LinkedIn Databus - stream of change capture events for a database.
- LinkedIn Gobblin - a framework for Solving Big Data Ingestion Problem.
- LinkedIn Kamikaze - utility package for compressing sorted integer arrays.
- Linkedin Lumos - bridge from OLTP to OLAP for use it on Hadoop.
- LinkedIn White Elephant - log aggregator and dashboard.
- Logstash - a tool for managing events and logs.
- Netflix Suro - data pipeline service for collecting, aggregating, and dispatching large volume of application events including log data based on Chukwa.
- Pinterest Secor - is a service implementing Kafka log persistance.
- Record Breaker - Automatic structure for your text-formatted data.
- TIBCO Enterprise Message Service - standards-based messaging middleware.
- Twitter Zipkin - distributed tracing system that helps us gather timing data for all the disparate services at Twitter.
- Vibe Data Stream - streaming data collection for real-time Big Data analytics.
Message-oriented middleware
- ActiveMQ - open source messaging and Integration Patterns server.
- Amazon Simple Queue Service - fast, reliable, scalable, fully managed queue service.
- Apache Kafka - distributed publish-subscribe messaging system.
- Apache Qpid - messaging tools that speak AMQP and support many languages and platforms.
- Apollo - ActiveMQ's next generation of messaging.
- Beanstalkd - simple, fast work queue.
- Bit.ly NSQ - realtime distributed message processing at scale.
- Celery - Distributed Task Queue.
- Crossroads I/O - library for building scalable and high performance distributed applications.
- Darner - simple, lightweight message queue.
- Facebook Iris - a totally ordered queue of messaging updates with separate pointers into the queue indicating the last update sent to your Messenger app and the traditional storage tier.
- Gearman - Job Server.
- HornetQ - open source project to build a multi-protocol, embeddable, very high performance, clustered, asynchronous messaging system.
- IronMQ - easy-to-use highly available message queuing service.
- Kestrel - distributed message queue system.
- Marconi - queuing and notification service made by and for OpenStack, but not only for it.
- RabbitMQ - Robust messaging for applications.
- RestMQ - message queue which uses HTTP as transport, JSON to format a minimalist protocol and is organized as REST resources.
- RQ - simple Python library for queueing jobs and processing them in the background with workers.
- Sidekiq - Simple, efficient background processing for Ruby.
- ZeroMQ - The Intelligent Transport Layer.
Service Programming
- Akka Toolkit - runtime for distributed, and fault tolerant event-driven applications on the JVM.
- Apache Avro - data serialization system.
- Apache Curator - Java libaries for Apache ZooKeeper.
- Apache Karaf - OSGi runtime that runs on top of any OSGi framework.
- Apache Thrift - framework to build binary protocols.
- Apache Zookeeper - centralized service for process management.
- Google Chubby - a lock service for loosely-coupled distributed systems.
- Linkedin Norbert - cluster manager.
- MPICH - high performance and widely portable implementation of the Message Passing Interface (MPI) standard.
- OpenMPI - message passing framework.
- Serf - decentralized solution for service discovery and orchestration.
- Spotify Luigi - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
- Spring XD - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
- Twitter Elephant Bird - libraries for working with LZOP-compressed data.
- Twitter Finagle - asynchronous network stack for the JVM.
Scheduling
- Apache Aurora - is a service scheduler that runs on top of Apache Mesos.
- Apache Falcon - data management framework.
- Apache Oozie - workflow job scheduler.
- Chronos - distributed and fault-tolerant scheduler.
- Linkedin Azkaban - batch workflow job scheduler.
- Pinterest Pinball - customizable platform for creating workflow managers.
- Sparrow - scheduling platform.
Machine Learning
- Apache Mahout - machine learning library for Hadoop.
- Ayasdi Core - tool for topological data analysis.
- brain - Neural networks in JavaScript.
- Cloudera Oryx - real-time large-scale machine learning.
- Concurrent Pattern - machine learning library for Cascading.
- convnetjs - Deep Learning in Javascript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
- cuDNN - GPU-accelerated library of primitives for deep neural networks.
- Decider - Flexible and Extensible Machine Learning in Ruby.
- etcML - text classification with machine learning.
- Etsy Conjecture - scalable Machine Learning in Scalding.
- Google Sibyl - System for Large Scale Machine Learning at Google.
- H2O - statistical, machine learning and math runtime for Hadoop.
- IBM Watson - cognitive computing system.
- LinkedIn ml-ease - ADMM based large scale logistic regression.
- MLbase - distributed machine learning libraries for the BDAS stack.
- MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X.
- nupic - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
- PredictionIO - machine learning server buit on Hadoop, Mahout and Cascading.
- scikit-learn - scikit-learn: machine learning in Python.
- Spark MLlib - a Spark implementation of some common machine learning (ML) functionality.
- Sparkling Water - combine H2OÕs Machine Learning capabilities with the power of the Spark platform.
- Theano - Python package for deep learning that can utilize NVIDIA's CUDA toolkit to run on the GPU.
- Thunder - Large-scale analysis of neural data.
- Vahara - Machine learning and natural language processing with Apache Pig.
- Viv - global platform that enables developers to plug into and create an intelligent, conversational interface to anything.
- Vowpal Wabbit - learning system sponsored by Microsoft and Yahoo!.
- WEKA - suite of machine learning software.
- Wit - Natural Language for the Internet of Things.
- Wolfram Alpha - computational knowledge engine.
- YHat ScienceOps - platform for deploying, managing, and scaling predictive models in production applications.
Benchmarking
- Apache Hadoop Benchmarking - micro-benchmarks for testing Hadoop performances.
- Berkeley SWIM Benchmark - real-world big data workload benchmark.
- Big-Bench - Big Bench Workload Development.
- Hive-benchmarks - some benchmarking queries for Apache Hive.
- Hive-testbench - Testbench for experimenting with Apache Hive at any data scale..
- Intel HiBench - a Hadoop benchmark suite.
- Netflix Inviso - performance focused Big Data tool.
- PUMA Benchmarking - benchmark suite for MapReduce applications.
- Yahoo Gridmix3 - Hadoop cluster benchmarking from Yahoo engineer team.
Security
- Apache Argus - framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
- Apache Knox Gateway - single point of secure access for Hadoop clusters.
- Apache Sentry - security module for data stored in Hadoop.
- PacketPig - Open Source Big Data Security Analytics.
- Voltage SecureData - data protection framework.
System Deployment
- Ankush - A big data cluster management tool that creates and manages clusters of different technologies..
- Apache Ambari - operational framework for Hadoop mangement.
- Apache Bigtop - system deployment framework for the Hadoop ecosystem.
- Apache Helix - cluster management framework.
- Apache Mesos - cluster manager.
- Apache Slider - is a YARN application to deploy existing distributed applications on YARN.
- Apache Whirr - set of libraries for running cloud services.
- Apache YARN - Cluster manager.
- Brooklyn - library that simplifies application deployment and management.
- Buildoop - Similar to Apache BigTop based on Groovy language.
- Cloudera Director - a comprehensive data management platform with the flexibility and power to evolve with your business.
- Cloudera HUE - web application for interacting with Hadoop.
- Deimos - Mesos containerizer hooks for Docker.
- Develoop - tool for provisioning, managing and monitoring Apache Hadoop.
- Facebook Autoscale - the load balancer will concentrate workload to a server until it has at least a medium-level workload.
- Facebook Prism - multi datacenters replication system.
- Ganglia Monitoring System - scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.
- Genie - Genie provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them..
- Google Borg - job scheduling and monitoring system.
- Google Omega - job scheduling and monitoring system.
- Hannibal - Hannibal is tool to help monitor and maintain HBase-Clusters that are configured for manual splitting..
- Hortonworks HOYA - application that can deploy HBase cluster on YARN.
- Jumbune - Jumbune is an open-source product built for analyzing Hadoop cluster and MapReduce jobs..
- Marathon - Mesos framework for long-running services.
- Myriad - a mesos framework designed for scaling YARN clusters on Mesos. Myriad can expand or shrink one or more YARN clusters in response to events as per configured rules and policies..
- Neflix SimianArmy - a suite of tools for keeping your cloud operating in top form.
Container Manager
- Amazon EC2 Container Service - a highly scalable, high performance container management service that supports Docker containers.
- Docker - an open platform for developers and sysadmins to build, ship, and run distributed applications.
- Fig - fast, isolated development environments using Docker.
- Google Container Engine - Run Docker containers on Google Cloud Platform, powered by Kubernetes.
- Kubernetes - open source implementation of container cluster management.
- Rocket - an alternative to the Docker runtime, designed for server environments with the most rigorous security and production requirements.
Applications
- Adobe Spindle - Next-generation web analytics processing with Scala, Spark, and Parquet.
- Apache Kiji - framework to collect and analyze data in real-time, based on HBase.
- Apache Nutch - open source web crawler.
- Apache OODT - capturing, processing and sharing of data for NASA's scientific archives.
- Apache Tika - content analysis toolkit.
- Domino - Run, scale, share, and deploy models Ñ without any infrastructure..
- Eclipse BIRT - Eclipse-based reporting system.
- Eventhub - open source event analytics platform.
- HIPI Library - API for performing image processing tasks on Hadoop's MapReduce.
- Hunk - Splunk analytics for Hadoop.
- MADlib - data-processing library of an RDBMS to analyze data.
- PivotalR - R on Pivotal HD / HAWQ and PostgreSQL.
- Qubole - auto-scaling Hadoop cluster, built-in data connectors.
- Sense - Cloud Platform for Data Science and Big Data Analytics.
- Snowplow - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
- SparkR - R frontend for Spark.
- Splunk - analyzer for machine-generated date.
- Talend - unified open source environment for YARN, Hadoop, HBASE, Hive, HCatalog & Pig.
Search engine and framework
- Apache Blur - a search engine capable of querying massive amounts of structured data at incredible speeds.
- Apache Lucene - Search engine library.
- Apache Solr - Search platform for Apache Lucene.
- ElasticSearch - Search and analytics engine based on Apache Lucene.
- Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig..
- Enigma.io - Freemium robust web application for exploring, filtering, analyzing, searching and exporting massive datasets scraped from across the Web.
- Facebook Unicorn - social graph search platform.
- Google Caffeine - continuous indexing system.
- Google Percolator - continuous indexing system.
- TeraGoogle - large search index.
- Haeinsa - linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
- HBase Coprocessor - implementation of Percolator, part of HBase.
- hIndex - Secondary Index for HBase.
- SF1R Search Engine - distributed search engine written in c++.
- Lily HBase Indexer - quickly and easily search for any content stored in HBase.
- LinkedIn Bobo - is a Faceted Search implementation written purely in Java, an extension to Apache Lucene.
- LinkedIn Cleo - is a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.
- LinkedIn Galene - search architecture at LinkedIn.
- LinkedIn Zoie - is a realtime search/indexing system written in Java.
- Sphnix Search Server - fulltext search engine.
MySQL forks and evolutions
- Amazon Aurora - a MySQL-compatible, relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases.
- Amazon RDS - MySQL databases in Amazon's cloud.
**粗体** _斜体_ [链接](http://example.com) `代码` - 列表 > 引用
。你还可以使用@
来通知其他用户。