Record Processors in Apache NiFi

Why Apache NiFi processors need record readers and writers, and how they work, using QueryRecord, PartitionRecord, and RouteText as examples. We look at the similarities and differences between these processors, as well as the subtleties of using them in data engineering tasks.

QueryRecord processor in Apache NiFi

Recall that in Apache NiFi, a streaming ETL tool, processors are used to listen for incoming data, pull it from external sources, and publish it to destinations. Processors also route, transform, or extract information from FlowFiles.

Apache NiFi provides more than 400 ready-made processors. A data engineer can also write a custom processor, which we covered here. One of the ready-made processors for working with records is QueryRecord, which evaluates one or more SQL queries against the contents of a FlowFile. The result of each SQL query becomes the content of an outgoing FlowFile. This processor can be used, for example, to filter by fields or rows and to transform data: columns can be renamed, and simple calculations and aggregations can be performed. The processor is configured with a Record Reader and a Record Writer controller service, which keeps the formats of incoming and outgoing data flexible.

The QueryRecord processor must be configured with at least one user-defined property. The property name is the relationship to which the data is routed, and its value is an ANSI SQL SELECT statement, evaluated with Apache Calcite, that specifies how the input data should be transformed or filtered. If the transformation fails, the original FlowFile is routed to the "failure" relationship. On success, the selected data is routed to the relationship named by the property. If the Record Writer is set to inherit the schema from the record, the inherited schema comes from the result set, not from the input record. This allows one instance of QueryRecord to hold several queries, each of which returns a different set of columns and aggregations. As a consequence, however, the derived schema has no schema name, so it is important that the configured Record Writer does not try to write the schema name as an attribute when it inherits the schema from the record.

Another interesting use of the QueryRecord processor is branching one data flow into several. For example, you may need to analyze system logs, select entries by severity (such as ERROR), and send them to different notification channels: Slack, e-mail, SMS, and so on. You can register all of these queries at once in a single QueryRecord processor and then handle each resulting flow as needed.
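The idea of several named queries over one FlowFile can be sketched outside NiFi. The following is a minimal Python illustration, not NiFi code: it uses SQLite rather than Apache Calcite, and the sample log records, the relationship names errors_to_slack and counts_by_level, and the helper run_queries are all hypothetical. The only detail taken from QueryRecord itself is that the FlowFile content is queried as a table named FLOWFILE.

```python
import sqlite3

# Hypothetical log records standing in for the content of one FlowFile.
records = [
    {"level": "ERROR", "service": "billing", "message": "payment gateway timeout"},
    {"level": "INFO", "service": "billing", "message": "invoice created"},
    {"level": "ERROR", "service": "auth", "message": "token signature invalid"},
    {"level": "WARN", "service": "auth", "message": "password retry"},
]

# Property name = relationship, property value = SQL SELECT (names are made up).
queries = {
    "errors_to_slack": (
        "SELECT service, message FROM FLOWFILE "
        "WHERE level = 'ERROR' ORDER BY service"
    ),
    "counts_by_level": (
        "SELECT level, COUNT(*) AS total FROM FLOWFILE "
        "GROUP BY level ORDER BY level"
    ),
}


def run_queries(records, queries):
    """Load the records into an in-memory table and run every query against it."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute("CREATE TABLE FLOWFILE (level TEXT, service TEXT, message TEXT)")
    conn.executemany(
        "INSERT INTO FLOWFILE VALUES (:level, :service, :message)", records
    )
    routed = {}
    for relationship, sql in queries.items():
        # Each query result becomes the content routed to its own relationship.
        routed[relationship] = [dict(row) for row in conn.execute(sql)]
    conn.close()
    return routed


routed = run_queries(records, queries)
for relationship, rows in routed.items():
    print(relationship, rows)
```

Each query returns its own set of columns, which mirrors why the writer's inherited schema comes from the result set rather than from the input record.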

Thus, the QueryRecord processor lets you treat each FlowFile as a database table and run a SQL query against it, delivering the results as the content of a new FlowFile. Because the processor uses a record reader and a record writer, you can also use it to convert data from one format to another. For example, a JSON reader combined with an Avro writer reads incoming JSON and writes the results in Avro format.
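The reader/writer split behind this kind of format conversion can be sketched in a few lines of Python. This is only an illustration of the pattern: CSV is used as the output format instead of Avro so the example stays within the standard library, and the functions read_json_records and write_csv_records are hypothetical stand-ins, not NiFi APIs.

```python
import csv
import io
import json


def read_json_records(text):
    """Reader stand-in: parse a JSON array into a list of dict records."""
    return json.loads(text)


def write_csv_records(records, fieldnames):
    """Writer stand-in: serialize dict records as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


incoming = '[{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Lima"}]'
records = read_json_records(incoming)
outgoing = write_csv_records(records, ["id", "city"])
print(outgoing)
```

The processing logic in between only ever sees records, so swapping the reader or the writer changes the format without touching that logic.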

Because this processor has a reader and a writer, the data engineer does not need to worry about converting data into the desired format: data in any format can be used as long as a reader exists for it. Apache NiFi provides many record readers: CSVReader, JsonTreeReader, AvroReader, and others. There are also readers for syslog (SyslogReader), Parquet files, XML, and many other formats.

It is worth noting that there is no SyslogWriter, because in most cases data queried from system logs should end up in a more structured form. It is therefore often written as JSON, for which Apache NiFi has the JsonRecordSetWriter. However, if you need to write the output in syslog format, you can use FreeFormTextRecordSetWriter as the record writer. Set the Raw Message property of the syslog reader to true so that each record contains a field named _raw holding the unprocessed syslog message, and then set the writer's Text property to just ${_raw}.
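What a free-form text template such as ${_raw} does for each record can be sketched roughly as follows. This is a simplified Python illustration that only substitutes plain ${field} references; it does not implement the NiFi Expression Language. The function render_free_form, the sample record, and the choice to render a missing field as an empty string are assumptions made for the example.

```python
import re


def render_free_form(template, record):
    """Replace each ${name} in the template with the record's field value.

    Unknown fields render as an empty string (an assumption of this sketch).
    """
    return re.sub(
        r"\$\{(\w+)\}",
        lambda match: str(record.get(match.group(1), "")),
        template,
    )


# Hypothetical parsed syslog record that also carries the raw message.
record = {
    "hostname": "web01",
    "body": "disk usage at 91%",
    "_raw": "<34>Oct 11 22:14:15 web01 app: disk usage at 91%",
}

line = render_free_form("${_raw}", record)
summary = render_free_form("${hostname}: ${body}", record)
print(line)
print(summary)
```

With the template set to only ${_raw}, every record is written back exactly as the original syslog line.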

PartitionRecord processor

While QueryRecord is a general-purpose processor, PartitionRecord is less powerful, although it can also be used in Apache NiFi to create several flows from one incoming flow. What it lacks in power it makes up for in performance and simplicity. PartitionRecord groups data using RecordPath, a simple syntax modeled on JSONPath and XPath. The processor has one input and many outputs. But unlike QueryRecord, which can send one record to many different outgoing FlowFiles, PartitionRecord sends each record of the incoming FlowFile to exactly one outgoing FlowFile.

PartitionRecord lets you route data according to a value in the record, as well as group records for storage. It receives record-oriented data (that is, data that can be read by the configured Record Reader) and evaluates one or more RecordPaths against each record in the incoming FlowFile. Each record is then grouped with other similar records, and one FlowFile is created for each such group. What makes records similar is determined by user-defined properties whose values are RecordPaths. Two records are considered alike if they have the same value for every configured RecordPath. Since all records in a given outgoing FlowFile share the same values for the fields named by the RecordPaths, an attribute is added to the FlowFile for each of those fields.

Like QueryRecord, PartitionRecord is a record-oriented processor. The data engineer therefore has to configure both a record reader and a record writer, that is, set the Record Reader and Record Writer properties. You also have to tell the processor how to partition the data using RecordPath. To do this, add one or more user-defined properties. The property name becomes the name of a FlowFile attribute that is added to each outgoing FlowFile. The property value is a RecordPath expression that NiFi evaluates against each record. The result determines which group, or partition, the record belongs to.
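The grouping rule described above (same value for every configured path means same outgoing FlowFile) can be sketched in Python. This is an illustration only: the path evaluation handles just simple slash-separated field names, not real RecordPath syntax, and the function partition_records, the sample orders, and the attribute names country and status are hypothetical.

```python
from collections import defaultdict


def partition_records(records, paths):
    """Group records by the tuple of values found at the configured paths.

    `paths` maps an attribute name to a simplified path like "/customer/country".
    Returns one dict per group, standing in for one outgoing FlowFile.
    """

    def evaluate(record, path):
        value = record
        for key in path.strip("/").split("/"):
            value = value.get(key) if isinstance(value, dict) else None
        return value

    groups = defaultdict(list)
    for record in records:
        key = tuple(evaluate(record, path) for path in paths.values())
        groups[key].append(record)  # each record lands in exactly one group

    return [
        {"attributes": dict(zip(paths.keys(), key)), "records": group}
        for key, group in groups.items()
    ]


records = [
    {"id": 1, "status": "paid", "customer": {"country": "DE"}},
    {"id": 2, "status": "paid", "customer": {"country": "FR"}},
    {"id": 3, "status": "paid", "customer": {"country": "DE"}},
    {"id": 4, "status": "open", "customer": {"country": "DE"}},
]
paths = {"country": "/customer/country", "status": "/status"}

flowfiles = partition_records(records, paths)
for flowfile in flowfiles:
    print(flowfile["attributes"], [r["id"] for r in flowfile["records"]])
```

Because every record in a group shares the same path values, those values can be attached once to the group as attributes and used for routing further downstream.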

RouteText processor

The QueryRecord and PartitionRecord processors discussed above offer great flexibility and high performance. But sometimes you have to deal with data in odd formats that are not record-oriented. Of course, you can write your own reader in Java or use a scripted record reader written in Groovy or Python. But you can also work with such data as raw text using the RouteText processor. It routes text data based on a set of user-defined rules. Each line of the incoming FlowFile is compared against the values given in the user-defined properties. How the text is compared with those properties is set by the matching strategy. The data is then routed according to these rules, with each line of text routed individually.

Thus, the RouteText processor lets you send lines of text to a particular relationship without first splitting the data into a separate FlowFile per line, which keeps performance high. It also lets you group lines of text together and keep them apart from lines that are unlike them. In effect, this grouping offers the same capabilities as PartitionRecord, but for raw text data.
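Line-by-line routing with optional grouping can be sketched in Python as follows. This is a simplified illustration, not the processor's actual logic: each line goes to the first rule whose regular expression it contains, lines matching no rule go to "unmatched", and the function route_text, the rule names, and the sample log lines are hypothetical.

```python
import re
from collections import defaultdict


def route_text(text, rules, group_regex=None):
    """Route each line to the first rule whose regex is found in it.

    If `group_regex` is given, its first capture group further splits the
    lines of each relationship into groups. Returns a dict keyed by
    (relationship, group) whose values are the joined lines.
    """
    compiled = {name: re.compile(pattern) for name, pattern in rules.items()}
    grouper = re.compile(group_regex) if group_regex else None
    routed = defaultdict(list)
    for line in text.splitlines():
        relationship = next(
            (name for name, rx in compiled.items() if rx.search(line)),
            "unmatched",
        )
        match = grouper.search(line) if grouper else None
        group = match.group(1) if match else None
        routed[(relationship, group)].append(line)
    # Lines stay together in one output per key instead of one output per line.
    return {key: "\n".join(lines) for key, lines in routed.items()}


text = (
    "2024-01-05 ERROR disk full\n"
    "2024-01-05 INFO started\n"
    "2024-01-05 ERROR retry failed\n"
    "2024-01-06 DEBUG tick"
)
rules = {"errors": r"\bERROR\b", "info": r"\bINFO\b"}

routed = route_text(text, rules, group_regex=r"^(\d{4}-\d{2}-\d{2})")
for key, lines in routed.items():
    print(key, repr(lines))
```

Keeping all lines for one key in a single output is what avoids producing a tiny FlowFile for every line.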

The processor supports regular expressions and the NiFi Expression Language for evaluating each line, which gives access to text encoding and decoding, escaping, date parsing, and other powerful functions.

Like QueryRecord, the RouteText processor lets you branch one incoming text flow into many streams, split the text, and filter or group lines. RouteText is also similar to PartitionRecord in that it can partition data into groups, although it works on lines of text rather than on records.

In conclusion, the three Apache NiFi processors discussed here are far from the only data processing tools in this framework. All of them are flexible, powerful, and fast. The performance comes from keeping many tiny records together in larger FlowFiles of up to 3 MB, because NiFi has certain FlowFile size restrictions, which we wrote about here and here. If the data does not conform to a standard format specification, record-based processors cannot be used. Even in that case, it is important to make sure the data is not split into many tiny FlowFiles, as this greatly reduces performance.

By Navid Anjum

Full-stack web developer and founder of Laravelaura. He makes his tutorials as simple as humanly possible and focuses on getting the students to the point where they can build projects independently.
