Apache Cassandra provides a highly scalable and reliable data storage engine for your application. When you need tools to help you analyse or move data, there is a range of powerful and innovative options to choose from.
One of the things I’ve been doing recently is getting my head around some of the technologies that complement Cassandra. There is quite an array of options out there, with some overlap of functionality depending on your exact use case. This post provides a summary of the technology I’m aware of. Let me know if I’ve missed anything!
Firstly, there are the core DataStax support technologies. These are available as part of the DataStax Enterprise edition and, for a subscription fee, are maintained and supported by DataStax. They are generally focused on providing analytics capability on data stored in Cassandra. DataStax has contributed most of the integration components for these technologies back to the open source community. These are really the “first class citizens” of the Cassandra technology ecosystem. The integration components are robust, and a good volume of Cassandra-specific examples and documentation is freely available.
Technology | Description |
---|---|
Spark | Highly flexible and scalable analytics with in-memory analysis capabilities. Provides full SQL query support, machine learning and stream processing. Includes map-reduce capability at 10-100x the performance of Hadoop. The emerging leader for scalable analysis engines. |
Hadoop | The canonical Hadoop map-reduce engine and interface on a Cassandra data store. |
Hive | Querying and metadata management aimed at data warehouse use cases, running on top of Hadoop. |
Pig | A high-level language for developing map-reduce programs to run on Hadoop and perform data analysis. |
Sqoop | Tool for loading data from relational databases to Hadoop. |
Solr | Scalable and fault-tolerant indexing and search system providing full-text search, hit highlighting and faceted search. |
Mahout | Provides machine learning on top of Hadoop. |
Note: All of these technologies are Apache Software Foundation projects, so when searching for them, search for “Apache Pig”, for example, to avoid wading through Wikipedia articles on animals in the genus Sus.
In addition to the DataStax-supported technologies, there is a range of other technologies available with Cassandra integration. The following table sets out a few notable examples.
Technology | Description |
---|---|
Apache Flume | A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. The Cassandra connector is more an example than a production-ready connector, but it provides a good starting point for a production implementation. See https://github.com/btoddb/flume-ng-cassandra-sink. |
R | The de facto standard for end-user advanced analytics, including visualisation, in the open source world. Cassandra connectors for R are available, as well as the option of connecting via Spark or Hive. See https://rforge.net/doc/packages/RCassandra/00Index.html. |
Apache Drill | Apache Drill provides standard SQL querying over NoSQL data stores. It allows you to use standard BI tools such as Tableau and Excel over Cassandra. Cassandra support is not yet part of the main distribution but is under active development. See https://github.com/yssharma/drill/tree/cassandra-storage. |
Apache Storm | Storm provides real-time stream processing (as opposed to Spark Streaming's micro-batches). A Cassandra bolt (Storm plug-in) is available here: https://github.com/hmsonline/storm-cassandra. |
Pentaho | Pentaho is an open-source BI, ETL and analytics suite. It has native support for Cassandra. See https://wiki.pentaho.com/display/BAD/Cassandra. |
And, of course, there are many more. Let me know if I’ve missed anything important.
Most of these technologies can currently be run against a Cassandra cluster managed by Instaclustr. However, in many cases the best architecture is to run the software on the same servers as your Cassandra nodes. While our automated provisioning and management systems do not currently support this scenario, we are looking to make some of these technologies available in the Instaclustr environment in future releases. If you need to run Cassandra with one of these technologies, get in touch – we do have options to make these available for you now and would love to work with pilot customers to shape our offerings to meet your needs.