Apache Storm 1.1.0 released

The Apache Storm community is pleased to announce that version 1.1.0 has been released and is available from the downloads page.

This release represents a major milestone in the evolution of Apache Storm, and includes a large number of new features, usability and performance improvements, some of which are highlighted below.

Streaming SQL

Apache Storm 1.1.0 supports native Streaming SQL, powered by Apache Calcite, that allows users to run SQL queries over streaming data as well update external systems and data stores such as Apache Hive. To deploy an SQL-based topology users define the SQL query in a text file and use the storm sql command to submit the resulting topology to an Apache Storm cluster. Behind the scenes Apache Storm will compile the SQL into a Trident topology and run it on the cluster.

Apache Storm's SQL support includes the following features:

  • Streaming from/to external sources including Apache Kafka, HDFS, MongoDB, and Redis
  • Tuple filtering
  • Projections
  • User-defined paralallism of generated components
  • User Defined Functions (UDFs)
  • CSV, TSV, and Avro input/output formats
  • Extensibility to additional data sources via the ISqlTridentDataSource interface

For more information about Apache Storm's SQL support including examples, refer to the following resources:

Apache Kafka Integration Improvements

In addition to the traditional support for Kafka version 0.8/0.9 based on the Kafka simple consumer, Apache Storm includes support for Kafka 0.10 and later based on the new Kafka consumer API. Apache Storm's integration with Kafka 0.10 and later version is highly flexible and extensible, some of the features include:

  • Enhanced configuration API
  • Fine-grained offset control (at start and after failure)
  • Consumer group support
  • Plugable record translators
  • Wildcard topics
  • Multiple stream support
  • Manual partition control
  • Kafka security support

For more information on Apache Storm's Kafka integration please refer to the following documentation:

PMML (Predictive Model Markup Language) Support

In order to better support machine learning use cases, Apache Storm now includes support for executing PMML models in topoliges via a generic PMML bolt. The PMMLPredictorBolt allows users to specify a model, the raw input, and the resulting streams and output fields. At runtime the bolt will process incoming raw data, execute the model with the given input, and output tuples with scores for predicted fields and output fields.

More information on Apache Storm's PMML support can be found here.

Druid Integration

Druid is a scalable, high-performance, column oriented, distributed data store popular for real time analytics use cases. Apache Storm 1.1.0 introduces a Apache Storm bolt and Trident state implementations for streaming data into a Druid data store.

Documentation for Apache Storm's Druid integration can be found here.

OpenTSDB Integration

OpenTSDB is a highly scalable time series database based on Apache HBase. Apache Storm 1.1.0 adds an Apache Storm bolt and Trident state implementations for writing data to OpenTSDB. Apache Storm's OpenTSDB integration gives users fine-grained control over how Apache Storm tuples map to OpenTSDB data structure by providing a simple interface (ITupleOpenTsdbDatapointMapper) that performs the translation.

Move information about Apache Storm's OpenTSDB integration can be found here.

AWS Kinesis Support

For users looking to integrate with Amazon's Kinesis service, Apache Storm 1.1.0 now includes a spout for consuming message streams from Kinesis. Like most of Apache Storm's external system integration components, the Kinesis spout provides a simple interface (RecordToTupleMapper)for controlling how Kinesis messages are translated to Apache Storm tuples. The Kinesis spout provides an additional interface (FailedMessageRetryHandler) that allows users to customize the Spout's failure handling logic.

Documentation for the Kinesis spout can be found here.

HDFS Spout

Apache Storm's HDFS integration now includes a spout that continuously streams data from the HDFS filesystem. Apache Storm's HDFS spout monitors a configurable directory for new files and feeds that data into a topology. When the spout has completed processing a file, it will be moved to the configured archive directory. In the event a file is corrupt or is otherwise not processable, the corresponding file will be moved to a specific directory. Parallelism of the spout is made possible through a locking mechanism that ensures each file is "owned" by a single spout instance. The HDFS spout supports connecting to HDFS instances that are secured with Kerberos authentication.

More information on using the HDFS spout can be found in the Apache Storm HDFS Documentation

Flux Improvements

Flux is a framework and set of utilities that allow users to declaratively define Apache Storm topologies and avoid hard-coding configuration values in topology components. Apache Storm 1.1.0 introduces the following enhancements to the Flux framework:

  • Flux topology visualization in Storm UI
  • Support for Stateful bolts and windowing
  • Support for named streams
  • Support for lists of references

More information about Flux can be found in the Flux documentation.

Topology Deployment Enhancements

In previous versions of Apache Storm it was typically necessary to include all topology dependences in a shaded "uber jar," or by making them available on Apache Storm's classpath. In Apache Storm 1.1.0 the storm jar command now includes options to upload additional dependency jars during topology submission. The --jars command line option allows users to specify local jar files to upload. Alternatively, the storm jar command offers the --artifacts option for specifying additional jar file dependencies by their Maven coordinates. Finally, for Maven artifacts outside the Maven Central, the --artifactRepository option allows users to specify additional repositories for dependency resolution.

More informaton about available options of the storm jar command can be found by runnng the storm help jar command.

Resource Aware Scheduler (RAS) Improvements

The Resource Aware Scheduler introduced in Apache Storm 1.0 added a scheduler implementation that takes into account both the memory (on-heap and off-heap) and CPU resources available in a cluster. In Apache Storm 1.1.0 the RAS algorithm has been overhauled to dramatically improve cluster resource utilization and introduces rack awareness into the scheduling strategy.

More information on the new RAS capabilities can be found in the RAS documentation and the JIRA ticket introducing the new rack awareness algorithm.

Important Changes in the Binary Distribution

In order to minimize the file size of the binary distribution, external component (i.e. "connector") binaries and compiled examples are no longer included. Examples are included in source form only, but can easily compiled with the Maven mvn install command.

External Components Moved to Maven Central

Most external components are now hosted exclusively in Maven Central. External component directories will contain a README.md file, but no jar files. We encourage users to leverage a build system with Maven style dependency resolution (Maven, Gradle, etc.) to build topology jars and avoid building topology jars manually.

Maven coordinates for these components can be determined as follows:

Group ID: org.apache.storm Artifact ID: component directory name Version: 1.1.0

For users who cannot use Maven for building, external component jar files can be downloaded from Maven Central with the following URL pattern:

https://repo1.maven.org/maven2/org/apache/storm/${artifactID}/${version}/${artifactId}-${version}.jar

For example, to download the storm-kafka-client jar file the URL would be:

https://repo1.maven.org/maven2/org/apache/storm/storm-kafka-client/1.1.0/storm-kafka-client-1.1.0.jar

Thanks

Special thanks are due to all those who have contributed to Apache Storm -- whether through direct code contributions, documentation, bug reports, or helping other users on the mailing lists. Your efforts are much appreciated.

The full changelog for this release is listed below.

Full Changelog

  • STORM-2425: Storm Hive Bolt not closing open transactions
  • STORM-2409: Storm-Kafka-Client KafkaSpout Support for Failed and NullTuples
  • STORM-2423: Join Bolt should use explicit instead of default window anchoring for emitted tuples
  • STORM-2416: Improve Release Packaging to Reduce File Size
  • STORM-2414: Skip checking meta's ACL when subject has write privileges for any blobs
  • STORM-2038: Disable symlinks with a config option
  • STORM-2240: STORM PMML Bolt - Add Support to Load Models from Blob Store
  • STORM-2412: Nimbus isLeader check while waiting for max replication
  • STORM-2408: build failed if storm.kafka.client.version = 0.10.2.0
  • STORM-2403: Fix KafkaBolt test failure: tick tuple should not be acked
  • STORM-2361: Kafka spout - after leader change, it stops committing offsets to ZK
  • STORM-2353: Replace kafka-unit by kafka_2.11 and kafka-clients to test kafka-clients:0.10.1.1
  • STORM-2387: Handle tick tuples properly for Bolts in external modules
  • STORM-2345: Type mismatch in ReadClusterState's ProfileAction processing Map
  • STORM-2400: Upgraded Curator to 2.12.0 and made respective API changes
  • STORM-2396: setting interrupted status back before throwing a RuntimeException
  • STORM-1772: Adding Perf module with topologies for measuring performance
  • STORM-2395: storm.cmd supervisor calls the wrong class name
  • STORM-2391: Move HdfsSpoutTopology from storm-starter to storm-hdfs-examples
  • STORM-2389: Avoid instantiating Event Logger when topology.eventlogger.executors=0
  • STORM-2386: Fail-back Blob deletion also fails in BlobSynchronizer.syncBlobs.
  • STORM-2388: JoinBolt breaks compilation against JDK 7
  • STORM-2374: Storm Kafka Client Test Topologies Must be Serializable
  • STORM-2372: Pacemaker client doesn't clean up heartbeats properly
  • STORM-2326: Upgrade log4j and slf4j
  • STORM-2334: Join Bolt implementation
  • STORM-1363: TridentKafkaState should handle null values from TridentTupleToKafkaMapper.getMessageFromTuple()
  • STORM-2365: Support for specifying output stream in event hubs spout
  • STORM-2250: Kafka spout refactoring to increase modularity and testability
  • STORM-2340: fix AutoCommitMode issue in KafkaSpout
  • STORM-2344: Flux YAML File Viewer for Nimbus UI
  • STORM-2350: Storm-HDFS's listFilesByModificationTime is broken
  • STORM-2270 Kafka spout should consume from latest when ZK partition commit offset bigger than the latest offset
  • STORM-1464: storm-hdfs support for multiple output files and partitioning
  • STORM-2320: DRPC client printer class reusable for local and remote DRPC
  • STORM-2281: Running Multiple Kafka Spouts (Trident) Throws Illegal State Exception
  • STORM-2296: Kafka spout no dup on leader changes
  • STORM-2244: Some shaded jars doesn't exclude dependency signature files
  • STORM-2014: New Kafka spout duplicates checking if failed messages have reached max retries
  • STORM-1443: [Storm SQL] Support customizing parallelism in StormSQL
  • STORM-2148: [Storm SQL] Trident mode: back to code generate and compile Trident topology
  • STORM-2331: Emitting from JavaScript should work when not anchoring.
  • STORM-2225: change spout config to be simpler.
  • STORM-2323: Precondition for Leader Nimbus should check all topology blobs and also corresponding dependencies
  • STORM-2330: Fix storm sql code generation for UDAF with non standard sql types
  • STORM-2298: Don't kill Nimbus when ClusterMetricsConsumer is failed to initialize
  • STORM-2301: [storm-cassandra] upgrade cassandra driver to 3.1.2
  • STORM-1446: Compile the Calcite logical plan to Storm Trident logical plan
  • STORM-2303: [storm-opentsdb] Fix list invariant issue for JDK 7
  • STORM-2236: storm kafka client should support manual partition management
  • STORM-2295: KafkaSpoutStreamsNamedTopics should return output fields with predictable ordering
  • STORM-2300: [Flux] support list of references
  • STORM-2297: [storm-opentsdb] Support Flux for OpenTSDBBolt
  • STORM-2294: Send activate and deactivate command to ShellSpout
  • STORM-2280: Upgrade Calcite version to 1.11.0
  • STORM-2278: Allow max number of disruptor queue flusher threads to be configurable
  • STORM-2277: Add shaded jar for Druid connector
  • STORM-2274: Support named output streams in Hdfs Spout
  • STORM-2204: Adding caching capabilities in HBaseLookupBolt
  • STORM-2267: Use user's local maven repo. directory to local repo.
  • STORM-2254: Provide Socket time out for nimbus thrift client
  • STORM-2200: [Storm SQL] Drop Aggregate & Join support on Trident mode
  • STORM-2266: Close NimbusClient instances appropriately
  • STORM-2203: Add a getAll method to KeyValueState interface
  • STORM-1886: Extend KeyValueState iface with delete
  • STORM-2022: update Fields test to match new behavior
  • STORM-2020: Stop using sun internal classes
  • STORM-1228: port fields_test to java
  • STORM-2104: New Kafka spout crashes if partitions are reassigned while tuples are in-flight
  • STORM-2257: Add built in support for sum function with different types.
  • STORM-2082: add sql external module storm-sql-hdfs
  • STORM-2256: storm-pmml breaks on java 1.7
  • STORM-2223: PMML Bolt.
  • STORM-2222: Repeated NPEs thrown in nimbus if rebalance fails
  • STORM-2190: reduce contention between submission and scheduling
  • STORM-2239: Handle InterruptException in new Kafka spout
  • STORM-2087: Storm-kafka-client: Failed tuples are not always replayed
  • STORM-2238: Add Timestamp extractor for windowed bolt
  • STORM-2235: Introduce new option: 'add remote repositories' for dependency resolver
  • STORM-2215: validate blobs are present before submitting
  • STORM-2170: [Storm SQL] Add built-in socket datasource to runtime
  • STORM-2226: Fix kafka spout offset lag ui for kerberized kafka
  • STORM-2224: Exposed a method to override in computing the field from given tuple in FieldSelector
  • STORM-2220: Added config support for each bolt in Cassandra bolts, fixed the bolts to be used also as sinks.
  • STORM-2205: Racecondition in getting nimbus summaries while ZK connectionions are reconnected
  • STORM-2182: Refactor Storm Kafka Examples Into Own Modules.
  • STORM-1694: Kafka Spout Trident Implementation Using New Kafka Consumer API
  • STORM-2173: [SQL] Support CSV as input / output format
  • STORM-2177: [SQL] Support TSV as input / output format
  • STORM-2172: [SQL] Support Avro as input / output format
  • STORM-2185: Storm Supervisor doesn't delete directories properly sometimes
  • STORM-2103: [SQL] Introduce new sql external module: storm-sql-mongodb
  • STORM-2175: fix double close of workers
  • STORM-2109: Under supervisor V2 SUPERVISOR_MEMORY_CAPACITY_MB and SUPERVISOR_CPU_CAPACITY must be Doubles
  • STORM-2110: in supervisor v2 filter out empty command line args
  • STORM-2117: Supervisor V2 with local mode extracts resources directory to topology root directory instead of temporary directory
  • STORM-2131: Add blob command to worker-launcher, make stormdist directory not writeable by topo owner
  • STORM-2018: Supervisor V2
  • STORM-2139: Let ShellBolts and ShellSpouts run with scripts from blobs
  • STORM-2072: Add map, flatMap with different outputs (T->V) in Trident
  • STORM-2134: improving the current scheduling strategy for RAS
  • STORM-2125: Use Calcite's implementation of Rex Compiler
  • STORM-1546: Adding Read and Write Aggregations for Pacemaker to make it HA compatible
  • STORM-1444: Support EXPLAIN statement in StormSQL
  • STORM-2099: Introduce new sql external module: storm-sql-redis
  • STORM-2097: Improve logging in trident core and examples
  • STORM-2144: Fix Storm-sql group-by behavior in standalone mode
  • STORM-2066: make error message in IsolatedPool.java more descriptive
  • STORM-1870: Allow FluxShellBolt/Spout set custom "componentConfig" via yaml
  • STORM-2126: fix NPE due to race condition in compute-new-sched-assign…
  • STORM-2124: show requested cpu mem for each component
  • STORM-2089: Replace Consumer of ISqlTridentDataSource with SqlTridentConsumer
  • STORM-2118: A few fixes for storm-sql standalone mode
  • STORM-2105: Cluster/Supervisor total and available resources displayed in the UI
  • STORM-2078: enable paging in worker datatable
  • STORM-1664: Allow Java users to start a local cluster with a Nimbus Thrift server.
  • STORM-1872: Release Jedis connection when topology shutdown
  • STORM-2100: Fix Trident SQL join tests to not rely on ordering
  • STORM-1837: Fix complete-topology and prevent message loss
  • STORM-2098: DruidBeamBolt: Pass DruidConfig.Builder as constructor argument
  • STORM-2092: optimize TridentKafkaState batch sending
  • STORM-1979: Storm Druid Connector implementation.
  • STORM-2057: Support JOIN statement in Storm SQL
  • STORM-1970: external project examples refator
  • STORM-2074: fix storm-kafka-monitor NPE bug
  • STORM-1459: Allow not specifying producer properties in read-only Kafka table in StormSQL
  • STORM-2052: Kafka Spout New Client API - Log Improvements and Parameter Tuning for Better Performance.
  • STORM-2050: [storm-sql] Support User Defined Aggregate Function for Trident mode
  • STORM-1434: Support the GROUP BY clause in StormSQL
  • STORM-2016: Topology submission improvement: support adding local jars and maven artifacts on submission
  • STORM-1994: Add table with per-topology & worker resource usage and components in (new) supervisor and topology pages
  • STORM-2042: Nimbus client connections not closed properly causing connection leaks
  • STORM-1766: A better algorithm server rack selection for RAS
  • STORM-1913: Additions and Improvements for Trident RAS API
  • STORM-2037: debug operation should be whitelisted in SimpleAclAuthorizer.
  • STORM-2023: Add calcite-core to dependency of storm-sql-runtime
  • STORM-2036: Fix minor bug in RAS Tests
  • STORM-1979: Storm Druid Connector implementation.
  • STORM-1839: Storm spout implementation for Amazon Kinesis Streams.
  • STORM-1876: Option to build storm-kafka and storm-kafka-client with different kafka client version
  • STORM-2000: Package storm-opentsdb as part of external dir in installation
  • STORM-1989: X-Frame-Options support for Storm UI
  • STORM-1962: support python 3 and 2 in multilang
  • STORM-1964: Unexpected behavior when using count window together with timestamp extraction
  • STORM-1890: ensure we refetch static resources after package build
  • STORM-1988: Kafka Offset not showing due to bad classpath.
  • STORM-1966: Expand metric having Map type as value into multiple metrics based on entries
  • STORM-1737: storm-kafka-client has compilation errors with Apache Kafka 0.10
  • STORM-1968: Storm logviewer does not work for nimbus.log in secure cluster
  • STORM-1910: One topology cannot use hdfs spout to read from two locations
  • STORM-1960: Add CORS support to STORM UI Rest api
  • STORM-1959: Add missing license header to KafkaPartitionOffsetLag
  • STORM-1950: Change response json of "Topology Lag" REST API to keyed by spoutId, topic, partition.
  • STORM-1833: Simple equi-join in storm-sql standalone mode
  • STORM-1866: Update Resource Aware Scheduler Documentation
  • STORM-1930: Kafka New Client API - Support for Topic Wildcards
  • STORM-1924: Adding conf options for Persistent Word Count Topology
  • STORM-1956: Disabling Backpressure by default
  • STORM-1934: Fix race condition between sync-supervisor and sync-processes
  • STORM-1919: Introduce FilterBolt on storm-redis
  • STORM-1945: Fix NPE bugs on topology spout lag for storm-kafka-monitor
  • STORM-1888: add description for shell command
  • STORM-1902: add a simple & flexible FileNameFormat for storm-hdfs
  • STORM-1914: Storm Kafka Field Topic Selector
  • STORM-1907: PartitionedTridentSpoutExecutor has incompatible types that cause ClassCastException
  • STORM-1925: Remove Nimbus thrift call from Nimbus itself
  • STORM-1909: Update HDFS spout documentation
  • STORM-1136: Command line module to return kafka spout offsets lag and display in storm ui
  • STORM-1911: IClusterMetricsConsumer should use seconds to timestamp unit
  • STORM-1893: Support OpenTSDB for storing timeseries data.
  • STORM-1723: Introduce ClusterMetricsConsumer
  • STORM-1700: Introduce 'whitelist' / 'blacklist' option to MetricsConsumer
  • STORM-1698: Asynchronous MetricsConsumerBolt
  • STORM-1705: Cap number of retries for a failed message
  • STORM-1884: Prioritize pendingPrepare over pendingCommit
  • STORM-1575: fix TwitterSampleSpout NPE on close
  • STORM-1874: Update logger private permissions
  • STORM-1865: update command line client document
  • STORM-1771: HiveState should flushAndClose before closing old or idle Hive connections
  • STORM-1882: Expose TextFileReader public
  • STORM-1873: Implement alternative behaviour for late tuples
  • STORM-1719: Introduce REST API: Topology metric stats for stream
  • STORM-1887: Fixed help message for set_log_level command
  • STORM-1878: Flux can now handle IStatefulBolts
  • STORM-1864: StormSubmitter should throw respective exceptions and log respective errors forregistered submitter hook invocation
  • STORM-1868: Modify TridentKafkaWordCount to run in distributed mode
  • STORM-1859: Ack late tuples in windowed mode
  • STORM-1851: Fix default nimbus impersonation authorizer config
  • STORM-1848: Make KafkaMessageId and Partition serializable to support
  • STORM-1862: Flux ShellSpout and ShellBolt can't emit to named streams
  • Storm-1728: TransactionalTridentKafkaSpout error
  • STORM-1850: State Checkpointing Documentation update
  • STORM-1674: Idle KafkaSpout consumes more bandwidth than needed
  • STORM-1842: Forward references in storm.thrift cause tooling issues
  • STORM-1730: LocalCluster#shutdown() does not terminate all storm threads/thread pools.
  • STORM-1709: Added group by support in storm sql standalone mode
  • STORM-1720: Support GEO in storm-redis