Big Data Ecosystem Dataset

Data

Projects

Frameworks

  • Apache Hadoop - framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).

Distributed Programming

  • AddThis Hydra - distributed data processing and storage system originally developed at AddThis.
  • Akela - Mozilla's utility library for Hadoop, HBase, Pig, etc.
  • Amazon Lambda - a compute service that runs your code in response to events and automatically manages the compute resources for you.
  • AMPLab SIMR - run Spark on Hadoop MapReduce v1.
  • AMPLab Succinct - Enabling Queries on Compressed Data.
  • Apache Crunch - a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.
  • Apache DataFu - collection of user-defined functions for Hadoop and Pig developed by LinkedIn.
  • Apache Flink - high-performance runtime, and automatic program optimization.
  • Apache Gora - framework for in-memory data model and persistence.
  • Apache Hama - BSP (Bulk Synchronous Parallel) computing framework.
  • Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
  • Apache Pig - high level language to express data analysis programs for Hadoop.
  • Apache S4 - framework for stream processing, implementation of S4.
  • Apache Spark - framework for in-memory cluster computing.
  • Apache Spark Streaming - framework for stream processing, part of Spark.
  • Apache Storm - framework for stream processing by Twitter, also on YARN.
  • Apache Tez - application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
  • Apache Twill - abstraction over YARN that reduces the complexity of developing distributed applications.
  • Cascalog - data processing and querying library.
  • Cheetah - High Performance, Custom Data Warehouse on Top of MapReduce.
  • Concurrent Cascading - framework for data management/analytics on Hadoop.
  • Damballa Parkour - MapReduce library for Clojure.
  • Datasalt Pangool - alternative MapReduce paradigm.
  • DataTorrent StrAM - real-time engine designed to enable distributed, asynchronous, real time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
  • DistributedR - scalable high-performance platform for the R language.
  • Drools - a Business Rules Management System (BRMS) solution.
  • eBay Oink - REST based interface for PIG execution.
  • Facebook Corona - Hadoop enhancement which removes single point of failure.
  • Facebook Peregrine - Map Reduce framework.
  • Facebook Scuba - distributed in-memory datastore.
  • Geotrellis - geographic data processing engine for high performance applications.
  • GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework.
  • Google Dataflow - create data pipelines that help ingest, transform and analyze data.
  • Google MapReduce - map reduce framework.
  • Google MillWheel - fault tolerant stream processing framework.
  • Hazelcast - In-Memory Data Grid.
  • HParser - data parsing transformation environment optimized for Hadoop.
  • IBM Streams - advanced analytic platform that allows user-developed applications to quickly ingest, analyze and correlate information as it arrives from thousands of real-time sources.
  • JAQL - declarative programming language for working with structured, semi-structured and unstructured data.
  • Kite - a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
  • Kryo - Java serialization and cloning: fast, efficient, automatic.
  • LinkedIn Cubert - a fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop.
  • Lipstick - Pig workflow visualization tool.
  • Metamarkets Druid - framework for real-time analysis of large datasets.
  • Netflix Aegisthus - Bulk Data Pipeline out of Cassandra. Implements a reader for the SSTable format and provides a map/reduce program to create a compacted snapshot of the data contained in a column family.
  • Netflix Lipstick - Pig Visualization framework.
  • Netflix Mantis - Event Stream Processing System.
  • Netflix PigPen - map-reduce for Clojure which compiles to Apache Pig.
  • Netflix STAASH - language-agnostic as well as storage-agnostic web interface for storing data into persistent storage systems.
  • Netflix Zeno - Netflix's In-Memory Data Propagation Framework.
  • Nextflow - Dataflow oriented toolkit for parallel and distributed computational pipelines.
  • Nokia Disco - MapReduce framework developed by Nokia.
  • PigPen - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.
  • Pinterest Pinlater - asynchronous job execution system.
  • Pubnub - Data stream network.
  • Pydoop - Python MapReduce and HDFS API for Hadoop.
  • ScaleOut hServer - fast, scalable in-memory data grid for Hadoop.
  • SeqPig - simple and scalable scripting for large sequencing data sets (e.g. bioinformatics) in Hadoop.
  • SigmoidAnalytics Spork - Pig on Apache Spark.
  • SpatialHadoop - SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.
  • Spring for Apache Hadoop - unified configuration model and easy to use APIs for using HDFS, MapReduce, Pig, and Hive.
  • SQLStream Blaze - stream processing platform.
  • Stratio Streaming - the union of a real-time messaging bus with a complex event processing engine using Spark Streaming.
  • Stratosphere - general purpose cluster computing framework.
  • Streamdrill - useful for counting activities of event streams over different time windows and finding the most active one.
  • Sumo Logic - cloud based analyzer for machine-generated data.
  • Teradata QueryGrid - data-access layer that can orchestrate multiple modes of analysis across multiple databases plus Hadoop.
  • TIBCO ActiveSpaces - in-memory data grid.
  • Tigon - a distributed framework built on Apache Hadoop and Apache HBase for real-time, high-throughput, low-latency data processing and analytics applications.
  • Torch - Scientific computing for LuaJIT.
  • Twitter Scalding - Scala library for Map Reduce jobs, built on Cascading.
  • Twitter Summingbird - Streaming MapReduce with Scalding and Storm, by Twitter.
  • Twitter TSAR - TimeSeries AggregatoR by Twitter.

Distributed Filesystem

Key-Map Data Model

  • Actian Vector - column-oriented analytic database.
  • Apache Accumulo - distributed key/value store, built on Hadoop.
  • Apache Cassandra - column-oriented distributed datastore, inspired by BigTable.
  • Apache HBase - column-oriented distributed datastore, inspired by BigTable.
  • Facebook HydraBase - evolution of HBase made by Facebook.
  • Google BigTable - column-oriented distributed datastore.
  • Google Cloud Datastore - a fully managed, schemaless database for storing non-relational data over BigTable.
  • Hypertable - column-oriented distributed datastore, inspired by BigTable.
  • InfiniDB - accessed through a MySQL interface; uses massively parallel processing to parallelize queries.
  • MapR-DB - fast, scalable, and enterprise-ready in-Hadoop database architected to manage big data.
  • Netflix Priam - Co-Process for backup/recovery, Token Management, and Centralized Configuration management for Cassandra.
  • OhmData C5 - improved version of HBase.
  • Sqrrl - NoSQL databases on top of Apache Accumulo.
  • Tephra - Transactions for HBase.
  • Twitter Manhattan - real-time, multi-tenant distributed database for Twitter scale.

Document Data Model

  • Actian Versant - commercial object-oriented database management system.
  • Amazon SimpleDB - a highly available and flexible non-relational data store that offloads the work of database administration.
  • Clusterpoint - database software for high-speed storage and large-scale processing of XML and JSON data on clusters of commodity hardware.
  • Crate Data - an open source, massively scalable data store that requires zero administration.
  • Facebook Apollo - Facebook’s Paxos-like NoSQL database.
  • jumboDB - document oriented datastore over Hadoop.
  • LinkedIn Espresso - horizontally scalable document-oriented NoSQL data store.
  • MarkLogic - Schema-agnostic Enterprise NoSQL database technology.
  • Microsoft DocumentDB - fully-managed, highly-scalable, NoSQL document database service.
  • MongoDB - Document-oriented database system.
  • RavenDB - A transactional, open-source Document Database.
  • RethinkDB - document database that supports queries like table joins and group by.
  • Terrastore - a modern document store which provides advanced scalability and elasticity features without sacrificing consistency.
  • TokuMX - High-Performance MongoDB Distribution.

Key-value Data Model

  • Aerospike - NoSQL flash-optimized, in-memory. Open source, with "server code in 'C' (not Java or Erlang) precisely tuned to avoid context switching and memory copies".
  • Amazon DynamoDB - distributed key/value store, implementation of Dynamo paper.
  • Couchbase ForestDB - Fast Key-Value Storage Engine Based on Hierarchical B+-Tree Trie.
  • Edis - a protocol-compatible server replacement for Redis.
  • ElephantDB - Distributed database specialized in exporting data from Hadoop.
  • EventStore - distributed time series database.
  • HyperDex - next generation key-value store.
  • KAI - a distributed key-value datastore.
  • LinkedIn Krati - a simple persistent data store with very low latency and high throughput.
  • LinkedIn Voldemort - distributed key/value storage system.
  • MemcacheDB - a distributed key-value storage system designed for persistence.
  • Netflix Dynomite - thin Dynamo-based replication for cached data.
  • Oracle NoSQL Database - distributed key-value database by Oracle Corporation.
  • RAMCloud - storage system that provides large-scale low-latency storage by keeping all data in DRAM all the time and aggregating the main memories of thousands of servers.
  • Redis - in-memory key-value datastore.
  • Redis Cluster - distributed implementation of Redis.
  • Redis Sentinel - system designed to help manage Redis instances.
  • Riak - a decentralized datastore.
  • Scalaris - a distributed transactional key-value store.
  • Storehaus - library to work with asynchronous key value stores, by Twitter.
  • Tarantool - an efficient NoSQL database and a Lua application server.
  • TreodeDB - key-value store that's replicated and sharded and provides atomic multirow writes.
  • Yahoo Sherpa - hosted, distributed and geographically replicated key-value cloud storage platform.

Graph Data Model

  • Apache Giraph - implementation of Pregel, based on Hadoop.
  • Apache Spark Bagel - implementation of Pregel, part of Spark.
  • ArangoDB - multi-model distributed database.
  • Facebook TAO - TAO is the distributed data store that is widely used at Facebook to store and serve the social graph.
  • Faunus - Hadoop-based graph analytics engine for analyzing graphs represented across a multi-machine compute cluster.
  • Google Cayley - open-source graph database.
  • Google Pregel - graph processing framework.
  • GraphLab PowerGraph - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
  • GraphX - resilient Distributed Graph System on Spark.
  • Gremlin - graph traversal Language.
  • HyperGraphDB - general purpose, open-source data storage mechanism based on a powerful knowledge management formalism known as directed hypergraphs.
  • InfiniteGraph - distributed graph database.
  • Infovore - RDF-centric Map/Reduce framework.
  • Intel GraphBuilder - tools to construct large-scale graphs on top of Hadoop.
  • MapGraph - Massively Parallel Graph processing on GPUs.
  • Neo4j - graph database written entirely in Java.
  • OrientDB - document and graph database.
  • Phoebus - framework for large scale graph processing.
  • Sparksee - scalable high-performance graph database.
  • Stardog - graph database: search, query, reasoning, and constraints in a lightweight, pure Java system.
  • Titan - distributed graph database, built over Cassandra.
  • Twitter FlockDB - distributed graph database.

NewSQL Databases

  • Actian Ingres - commercially supported, open-source SQL relational database management system.
  • BayesDB - statistics-oriented SQL database.
  • Cockroach - Scalable, Geo-Replicated, Transactional Datastore.
  • Datomic - distributed database designed to enable scalable, flexible and intelligent applications.
  • FoundationDB - distributed database, inspired by F1.
  • Google F1 - distributed SQL database built on Spanner.
  • Google Spanner - globally distributed semi-relational database.
  • H-Store - an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications.
  • HandlerSocket - NoSQL plugin for MySQL/MariaDB.
  • IBM DB2 - object-relational database management system.
  • InfiniSQL - infinitely scalable RDBMS.
  • MemSQL - in-memory SQL database with optimized columnar storage on flash.
  • NuoDB - SQL/ACID compliant distributed database.
  • Oracle Database - object-relational database management system.
  • Oracle TimesTen in-Memory Database - in-memory, relational database management system with persistence and recoverability.
  • Pivotal GemFire XD - Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.
  • SAP HANA - an in-memory, column-oriented, relational database management system.
  • SenseiDB - distributed, realtime, semi-structured database.
  • Sky - database used for flexible, high performance analysis of behavioral data.
  • SymmetricDS - open source software for both file and database synchronization.
  • Teradata Database - complete relational database management system.
  • VoltDB - in-memory NewSQL database.

Columnar Databases

  • Amazon RedShift - data warehouse service, based on PostgreSQL.
  • C-Store - column oriented DBMS.
  • Google BigQuery - framework for interactive analysis, implementation of Dremel.
  • Google Dremel - framework for interactive analysis.
  • MonetDB - column store database.
  • Parquet - columnar storage format for Hadoop.
  • Pivotal Greenplum - purpose-built, dedicated analytic data warehouse.
  • Vertica - designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.

Time-Series Databases

  • Cube - uses MongoDB to store time series data.
  • Etsy StatsD - simple daemon for easy stats aggregation.
  • InfluxDB - distributed time series database.
  • KairosDB - similar to OpenTSDB but allows for Cassandra.
  • OpenTSDB - distributed time series database on top of HBase.
  • Square Cube - system for collecting timestamped events and deriving metrics.
  • TempoIQ - Cloud-based sensor analytics.

SQL-like processing

  • Actian SQL for Hadoop - high performance interactive SQL access to all Hadoop data.
  • AMPLab Shark - data warehouse system for Spark.
  • Apache Drill - framework for interactive analysis, inspired by Dremel.
  • Apache HCatalog - table and storage management layer for Hadoop.
  • Apache Hive - SQL-like data warehouse system for Hadoop.
  • Apache Optiq - framework that allows efficient translation of queries involving heterogeneous and federated data.
  • Apache Phoenix - SQL skin over HBase.
  • BlinkDB - massively parallel, approximate query engine.
  • Cloudera Impala - framework for interactive analysis, inspired by Dremel.
  • Concurrent Lingual - SQL-like query language for Cascading.
  • Datasalt Splout SQL - full SQL query engine for big datasets.
  • eBay Kylin - Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets.
  • Facebook PrestoDB - distributed SQL query engine.
  • Hadapt - a native implementation of SQL for the Apache Hadoop open-source project.
  • JethroData - index-based SQL engine for Hadoop.
  • Metanautix Quest - data compute engine.
  • Pivotal HAWQ - SQL-like data warehouse system for Hadoop.
  • RainstorDB - database for storing petabyte-scale volumes of structured and semi-structured data.
  • Spark Catalyst - a query optimization framework for Spark and Shark.
  • SparkSQL - Manipulating Structured Data Using Spark.
  • Splice Machine - a full-featured SQL-on-Hadoop RDBMS with ACID transactions.
  • Stinger - interactive query for Hive.
  • Tajo - distributed data warehouse system on Hadoop.
  • Trafodion - enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads.

Integrated Development Environments

Data Ingestion

  • Amazon Kinesis - real-time processing of streaming data at massive scale.
  • Apache BookKeeper - a distributed logging service called BookKeeper and a distributed publish/subscribe system built on top of BookKeeper called Hedwig.
  • Apache Chukwa - data collection system.
  • Apache Flume - service to manage large amount of log data.
  • Apache Samza - stream processing framework, based on Kafka and YARN.
  • Apache Sqoop - tool to transfer data between Hadoop and a structured datastore.
  • Apache UIMA - Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user.
  • Cloudera Morphlines - framework that helps ETL to Solr, HBase and HDFS.
  • Facebook Scribe - streamed log data aggregator.
  • Fluentd - tool to collect events and logs.
  • Google Photon - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
  • Heka - open source stream processing software system.
  • HIHO - framework for connecting disparate data sources with Hadoop.
  • LinkedIn Camus - Kafka to HDFS pipeline. It is a mapreduce job that does distributed data loads out of Kafka.
  • LinkedIn Databus - stream of change capture events for a database.
  • LinkedIn Gobblin - a framework for solving the big data ingestion problem.
  • LinkedIn Kamikaze - utility package for compressing sorted integer arrays.
  • LinkedIn Lumos - bridge from OLTP to OLAP for use on Hadoop.
  • LinkedIn White Elephant - log aggregator and dashboard.
  • Logstash - a tool for managing events and logs.
  • Netflix Suro - data pipeline service for collecting, aggregating, and dispatching large volume of application events including log data based on Chukwa.
  • Pinterest Secor - a service implementing Kafka log persistence.
  • Record Breaker - Automatic structure for your text-formatted data.
  • TIBCO Enterprise Message Service - standards-based messaging middleware.
  • Twitter Zipkin - distributed tracing system that helps us gather timing data for all the disparate services at Twitter.
  • Vibe Data Stream - streaming data collection for real-time Big Data analytics.

Message-oriented middleware

  • ActiveMQ - open source messaging and Integration Patterns server.
  • Amazon Simple Queue Service - fast, reliable, scalable, fully managed queue service.
  • Apache Kafka - distributed publish-subscribe messaging system.
  • Apache Qpid - messaging tools that speak AMQP and support many languages and platforms.
  • Apollo - ActiveMQ's next generation of messaging.
  • Beanstalkd - simple, fast work queue.
  • Bit.ly NSQ - realtime distributed message processing at scale.
  • Celery - Distributed Task Queue.
  • Crossroads I/O - library for building scalable and high performance distributed applications.
  • Darner - simple, lightweight message queue.
  • Facebook Iris - a totally ordered queue of messaging updates with separate pointers into the queue indicating the last update sent to your Messenger app and the traditional storage tier.
  • Gearman - Job Server.
  • HornetQ - open source project to build a multi-protocol, embeddable, very high performance, clustered, asynchronous messaging system.
  • IronMQ - easy-to-use highly available message queuing service.
  • Kestrel - distributed message queue system.
  • Marconi - queuing and notification service made by and for OpenStack, but not only for it.
  • RabbitMQ - Robust messaging for applications.
  • RestMQ - message queue which uses HTTP as transport, JSON to format a minimalist protocol and is organized as REST resources.
  • RQ - simple Python library for queueing jobs and processing them in the background with workers.
  • Sidekiq - Simple, efficient background processing for Ruby.
  • ZeroMQ - The Intelligent Transport Layer.

Service Programming

  • Akka Toolkit - runtime for distributed and fault-tolerant event-driven applications on the JVM.
  • Apache Avro - data serialization system.
  • Apache Curator - Java libraries for Apache ZooKeeper.
  • Apache Karaf - OSGi runtime that runs on top of any OSGi framework.
  • Apache Thrift - framework to build binary protocols.
  • Apache ZooKeeper - centralized service for process management.
  • Google Chubby - a lock service for loosely-coupled distributed systems.
  • LinkedIn Norbert - cluster manager.
  • MPICH - high performance and widely portable implementation of the Message Passing Interface (MPI) standard.
  • OpenMPI - message passing framework.
  • Serf - decentralized solution for service discovery and orchestration.
  • Spotify Luigi - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
  • Spring XD - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
  • Twitter Elephant Bird - libraries for working with LZOP-compressed data.
  • Twitter Finagle - asynchronous network stack for the JVM.

Scheduling

Machine Learning

  • Apache Mahout - machine learning library for Hadoop.
  • Ayasdi Core - tool for topological data analysis.
  • brain - Neural networks in JavaScript.
  • Cloudera Oryx - real-time large-scale machine learning.
  • Concurrent Pattern - machine learning library for Cascading.
  • convnetjs - Deep Learning in Javascript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
  • cuDNN - GPU-accelerated library of primitives for deep neural networks.
  • Decider - Flexible and Extensible Machine Learning in Ruby.
  • etcML - text classification with machine learning.
  • Etsy Conjecture - scalable Machine Learning in Scalding.
  • Google Sibyl - System for Large Scale Machine Learning at Google.
  • H2O - statistical, machine learning and math runtime for Hadoop.
  • IBM Watson - cognitive computing system.
  • LinkedIn ml-ease - ADMM based large scale logistic regression.
  • MLbase - distributed machine learning libraries for the BDAS stack.
  • MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X.
  • nupic - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
  • PredictionIO - machine learning server built on Hadoop, Mahout and Cascading.
  • scikit-learn - machine learning in Python.
  • Spark MLlib - a Spark implementation of some common machine learning (ML) functionality.
  • Sparkling Water - combine H2O's Machine Learning capabilities with the power of the Spark platform.
  • Theano - Python package for deep learning that can utilize NVIDIA's CUDA toolkit to run on the GPU.
  • Thunder - Large-scale analysis of neural data.
  • Vahara - Machine learning and natural language processing with Apache Pig.
  • Viv - global platform that enables developers to plug into and create an intelligent, conversational interface to anything.
  • Vowpal Wabbit - learning system sponsored by Microsoft and Yahoo!.
  • WEKA - suite of machine learning software.
  • Wit - Natural Language for the Internet of Things.
  • Wolfram Alpha - computational knowledge engine.
  • YHat ScienceOps - platform for deploying, managing, and scaling predictive models in production applications.

Benchmarking

Security

System Deployment

  • Ankush - A big data cluster management tool that creates and manages clusters of different technologies.
  • Apache Ambari - operational framework for Hadoop management.
  • Apache Bigtop - system deployment framework for the Hadoop ecosystem.
  • Apache Helix - cluster management framework.
  • Apache Mesos - cluster manager.
  • Apache Slider - a YARN application to deploy existing distributed applications on YARN.
  • Apache Whirr - set of libraries for running cloud services.
  • Apache YARN - Cluster manager.
  • Brooklyn - library that simplifies application deployment and management.
  • Buildoop - similar to Apache Bigtop, based on the Groovy language.
  • Cloudera Director - a comprehensive data management platform with the flexibility and power to evolve with your business.
  • Cloudera HUE - web application for interacting with Hadoop.
  • Deimos - Mesos containerizer hooks for Docker.
  • Develoop - tool for provisioning, managing and monitoring Apache Hadoop.
  • Facebook Autoscale - the load balancer will concentrate workload to a server until it has at least a medium-level workload.
  • Facebook Prism - multi-datacenter replication system.
  • Ganglia Monitoring System - scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.
  • Genie - Genie provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
  • Google Borg - job scheduling and monitoring system.
  • Google Omega - job scheduling and monitoring system.
  • Hannibal - Hannibal is a tool to help monitor and maintain HBase-Clusters that are configured for manual splitting.
  • Hortonworks HOYA - application that can deploy HBase cluster on YARN.
  • Jumbune - Jumbune is an open-source product built for analyzing Hadoop clusters and MapReduce jobs.
  • Marathon - Mesos framework for long-running services.
  • Myriad - a Mesos framework designed for scaling YARN clusters on Mesos. Myriad can expand or shrink one or more YARN clusters in response to events as per configured rules and policies.
  • Netflix SimianArmy - a suite of tools for keeping your cloud operating in top form.

Container Manager

  • Amazon EC2 Container Service - a highly scalable, high performance container management service that supports Docker containers.
  • Docker - an open platform for developers and sysadmins to build, ship, and run distributed applications.
  • Fig - fast, isolated development environments using Docker.
  • Google Container Engine - Run Docker containers on Google Cloud Platform, powered by Kubernetes.
  • Kubernetes - open source implementation of container cluster management.
  • Rocket - an alternative to the Docker runtime, designed for server environments with the most rigorous security and production requirements.

Applications

  • Adobe Spindle - Next-generation web analytics processing with Scala, Spark, and Parquet.
  • Apache Kiji - framework to collect and analyze data in real-time, based on HBase.
  • Apache Nutch - open source web crawler.
  • Apache OODT - capturing, processing and sharing of data for NASA's scientific archives.
  • Apache Tika - content analysis toolkit.
  • Domino - Run, scale, share, and deploy models without any infrastructure.
  • Eclipse BIRT - Eclipse-based reporting system.
  • Eventhub - open source event analytics platform.
  • HIPI Library - API for performing image processing tasks on Hadoop's MapReduce.
  • Hunk - Splunk analytics for Hadoop.
  • MADlib - data-processing library of an RDBMS to analyze data.
  • PivotalR - R on Pivotal HD / HAWQ and PostgreSQL.
  • Qubole - auto-scaling Hadoop cluster, built-in data connectors.
  • Sense - Cloud Platform for Data Science and Big Data Analytics.
  • Snowplow - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
  • SparkR - R frontend for Spark.
  • Splunk - analyzer for machine-generated data.
  • Talend - unified open source environment for YARN, Hadoop, HBase, Hive, HCatalog & Pig.

Search engine and framework

  • Apache Blur - a search engine capable of querying massive amounts of structured data at incredible speeds.
  • Apache Lucene - Search engine library.
  • Apache Solr - Search platform for Apache Lucene.
  • ElasticSearch - Search and analytics engine based on Apache Lucene.
  • Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
  • Enigma.io - Freemium robust web application for exploring, filtering, analyzing, searching and exporting massive datasets scraped from across the Web.
  • Facebook Unicorn - social graph search platform.
  • Google Caffeine - continuous indexing system.
  • Google Percolator - continuous indexing system.
  • TeraGoogle - large search index.
  • Haeinsa - linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
  • HBase Coprocessor - implementation of Percolator, part of HBase.
  • hIndex - Secondary Index for HBase.
  • SF1R Search Engine - distributed search engine written in C++.
  • Lily HBase Indexer - quickly and easily search for any content stored in HBase.
  • LinkedIn Bobo - a faceted search implementation written purely in Java, an extension to Apache Lucene.
  • LinkedIn Cleo - a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.
  • LinkedIn Galene - search architecture at LinkedIn.
  • LinkedIn Zoie - a realtime search/indexing system written in Java.
  • Sphinx Search Server - fulltext search engine.

MySQL forks and evolutions

  • Amazon Aurora - a MySQL-compatible, relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases.
  • Amazon RDS - MySQL databases in Amazon's cloud.

timger