Recently, I worked with a large Fortune 500 customer on their migration from Apache Storm to Apache NiFi. If you’re asking yourself, “Isn’t Storm for complex event processing and NiFi for simple event processing?”, you’re correct. Several customers chose a complex event engine like Apache Storm for their simple event processing, even though Apache NiFi is the more practical choice and drastically cuts down on SDLC (software development lifecycle) time. Fast-forwarding to today, Storm has been deprecated in the Cloudera Data Platform in favor of Apache Flink. Customers therefore must migrate from Storm to another solution, ideally a simpler, no-code one that meets the same requirements faster, and in this case their use cases were a fantastic fit for NiFi. Since all of the flows were simple event processing, the NiFi flows were built out in a matter of hours (drag-and-drop) instead of months (coding in Java).
NiFi Flows
My customer’s biggest concern was performance, which they wanted like-for-like, using the same hardware profile (four nodes) for Storm and NiFi. They asked, “Can NiFi keep up with the same throughput as Storm?” As you’ll see in this blog, NiFi not only keeps up with Storm; it beats Storm’s throughput by 4x.

To set the context, why would a customer want to use Apache NiFi, Apache Kafka, and Apache HBase? Because they’ll be able to store massive amounts of data, process this data in real time or in batch, and serve the data to other applications. My customer had reconciliation data that needed to be processed throughout the day. In their case, another external application published events to Kafka. Once the reconciliation data was put into different Kafka topics, NiFi would consume from them, performing simple transformations before writing to HBase. Many applications would use the data in HBase to create reports given certain event criteria. The reconciliation data needed to be stored in its final state in HBase for at least two years.
Before we can discuss the optimizations, it’s important to understand a customer’s flows. What are the source, target, and transformations in between for each flow? If your source is Kafka, where messages are polled in batches of 1,000+, Record-oriented processors will excel at throughput. If the target can also handle batches, as HBase can, you’re in an even better spot. Then there are the transformations, such as JoltTransformRecord, ConvertRecord, QueryRecord, PartitionRecord, and ScriptedTransformRecord. A common theme in all of these processors is the word “Record,” meaning that instead of handling a single message per FlowFile in each processor, we’re able to batch many records together in a single FlowFile in NiFi. The Record-oriented processors have been around for several years (since 2017), but some customers have yet to adopt them.
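To make the batching idea concrete, here is a minimal Python sketch. It is not NiFi code: it models a FlowFile as a plain list of records and compares one-message-per-FlowFile handling with Record-style batching. The names `batch_records` and `transform`, the sample fields, and the batch size of 1,000 are illustrative assumptions, not anything taken from the customer's flow.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def batch_records(messages: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Group messages into lists of up to batch_size, one list per simulated FlowFile."""
    it = iter(messages)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def transform(record: Dict) -> Dict:
    """Hypothetical per-record transformation, standing in for a Record processor."""
    return {**record, "amount_cents": round(record["amount"] * 100)}


# 10,000 made-up messages, as if polled from a Kafka topic.
messages = [{"id": i, "amount": i / 4} for i in range(10_000)]

# Per-message handling: every message becomes its own FlowFile.
single_flowfiles = [[m] for m in messages]

# Record-oriented handling: up to 1,000 messages share one FlowFile,
# and the transformation runs over every record inside that FlowFile.
batched_flowfiles = [
    [transform(r) for r in batch] for batch in batch_records(messages, 1_000)
]

print(len(single_flowfiles))   # FlowFiles to track without batching
print(len(batched_flowfiles))  # FlowFiles to track with batching
```

The same records are processed either way; what changes is how many separate objects the framework has to schedule, route, and track.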
Record-oriented processors go far beyond just reading and writing from one format to another. If you can process more work in a single task (aka batching), you’re creating efficiencies within NiFi in areas like provenance. If provenance has fewer FlowFiles to track, you’re saving disk I/O. To provide an example, here is a flow before using Record-oriented processors:
(Source) ConsumeKafka -> (Transform) ConvertAvroToJSON -> EvaluateJsonPath -> RouteOnAttribute -> SplitJSON -> TransformJSON -> MergeContent -> (Target) PutHBase
This flow likely feels familiar if you haven’t used Record-oriented processors (eight processors). With this flow, my customer was achieving slightly over 30 GB in five minutes, matching the throughput of Storm. Once we adopted Record-oriented processors, the flow changed to (five processors):
(Source) ConsumeKafkaRecord -> (Transform) PartitionRecord -> RouteOnAttribute -> TransformRecord -> (Target) PutHBaseRecord
Going from eight processors to five may not seem like much of a change, but it simplifies the flow and makes better use of the tasks available within NiFi. The NiFi tasks/threads that are freed up can be spent on other processors. By using Record-oriented processors, my customer was able to achieve more than 160 GB in five minutes, shattering their throughput concerns. The bulk of the throughput gains came from batching multiple Kafka messages (1,000+) into one FlowFile. A common misconception is that you need to use Schema Registry for the Record-oriented processors. My customer has yet to use Schema Registry, but created an internal schema Controller Service within NiFi. By using this internal Controller Service, the Record readers and writers could retrieve their schemas from one centralized location.
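As a rough illustration of the centralized-schema idea: one Controller Service that ships with NiFi for this purpose is AvroSchemaRegistry, which stores named Avro schemas as properties so that Record readers and writers can look them up by name. Whether the customer used that particular service is an assumption. The Python sketch below simply builds an Avro schema and prints it as JSON text of the kind that could be pasted in as a schema definition. The schema name `reconciliation_event`, the namespace, and every field are hypothetical; the customer's actual schema is not shown in this post.

```python
import json

# Hypothetical Avro schema for a reconciliation event (all names are made up).
schema = {
    "type": "record",
    "name": "reconciliation_event",
    "namespace": "example.recon",
    "fields": [
        {"name": "event_id", "type": "string"},
        {"name": "account", "type": "string"},
        {"name": "amount", "type": "double"},
        {
            "name": "event_time",
            "type": {"type": "long", "logicalType": "timestamp-millis"},
        },
        # Nullable field: a union with "null" first and a null default.
        {"name": "status", "type": ["null", "string"], "default": None},
    ],
}

# Serialize to the JSON text form that Avro schemas are written in.
schema_text = json.dumps(schema, indent=2)
print(schema_text)
```

Keeping one named schema in one place means a change to a field is made once, instead of in every reader and writer that touches the data.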
I hope you’ve gained new insight into the large performance improvements available through Record-oriented processors. The latest version of NiFi is part of Cloudera Flow Management (CFM) 2.2.1.
