Stream Processing using Kafka Streams

Mark Lu
6 min readFeb 24, 2021

I am a software engineer working at an early stage startup. Half a year ago, we needed to choose a stream processing framework to build out some of our features, and we ended up picking Kafka Streams. In this article, I will go over the decision process and the overall experience of using Kafka Streams as a reflection on this journey; I will not cover the technical details of how exactly Kafka Streams works internally.

What is Kafka Streams?

Kafka Streams is a library for building streaming applications that transform data from input Kafka topics and write the output into various sinks such as Kafka topics, databases, or external services. The term "stream processing framework" is used in contrast with batch processing frameworks such as Hadoop, or micro-batching frameworks such as Apache Spark. The Kafka Streams library is touted as addressing the following problems in stream processing applications:

  • Event-at-a-time processing (not microbatch) with millisecond latency
  • Stateful processing including distributed joins and aggregations
  • A convenient DSL
  • Windowing with out-of-order data using a DataFlow-like model
  • Distributed processing and fault-tolerance with fast failover
  • Reprocessing capabilities so you can recalculate output when your code changes
  • No-downtime rolling deployments

Why Kafka Streams?

In the world of stream processing frameworks, there are many open source options:

  • Kafka Streams
  • Apache Flink
  • Apache Spark
  • Apache Storm
  • Apache Samza
  • Apache Apex
  • Apache Flume

As an early stage startup, we don’t have the luxury of doing an exhaustive evaluation and comparison of all these frameworks, so we made a list of selection criteria specific to our situation:

  1. Can be easily deployed into our Kubernetes cluster
  2. Open source licensing
  3. Java friendly
  4. Easy to operate
  5. Stream-first approach

The first criterion is there because we have standardized our service infrastructure on Kubernetes; operational overhead is reduced if every piece of software we adopt can be deployed into Kubernetes. Most of our backend software developers are proficient in Java, so a Java-friendly stream processing framework makes us more productive, and for an early stage startup engineering productivity is crucial. After applying the above criteria, we narrowed the list down to just the following two:

  • Kafka Streams
  • Apache Flink

After some initial prototyping and evaluation, we finally went ahead with Kafka Streams. Here are a few observations on Apache Flink:

  1. There are existing Helm charts for deploying a Flink cluster into Kubernetes.
  2. A separate Flink cluster needs to be set up and maintained.
  3. Extra work is needed to integrate the Flink cluster with monitoring tools such as DataDog in order to watch the health of the cluster.
  4. We would need to build domain knowledge to respond to the problems a Flink cluster could encounter.
  5. Flink job binaries need to be separately built and deployed into the Flink cluster.
  6. Local testing is more involved, as a Flink job jar needs to be built, deployed, and run in a local Flink instance.

The reasons for us to adopt Kafka Streams are:

  1. No need to set up and operate a separate cluster, therefore less operational burden.
  2. Kafka Streams can be embedded into our existing Java-based applications or micro-services, so it’s easier to debug issues.
  3. Our existing standard CI/CD pipelines for deploying micro-services can be reused for deploying Kafka Streams based stream processing services.
  4. The Kafka Streams DSL is perfectly capable of handling our stream processing use cases, which can be quickly prototyped locally.

We have chosen Kafka Streams for now primarily from the engineering productivity perspective, since we are an early stage startup with limited resources. This doesn’t mean we won’t revisit the decision in the future and adopt a different framework for new streaming use cases; in particular, Apache Flink looks like a promising framework with a good reputation for high-performance, high-throughput stream processing.

How is Kafka Streams being used?

Use Cases

There are a few use cases in our company where we can leverage Kafka Streams’ stream processing capabilities:

  1. Join separate trace events with the same correlation ID that are ingested at different times from one event stream, and publish the joined trace event into another topic.
  2. Learn from a stream of API trace event payloads over a period of time and summarize the API’s schema.
  3. Aggregate a stream of API call events over a time window to derive call latency and error rate baselines as the basis for anomaly detection.

Taking the first use case as an example, the Kafka Streams DSL for joining the relevant trace events looks like the following:

// ... Initialize rawTraceEventStream Kafka stream from the source topic
rawTraceEventStream
    .groupByKey()                     // group trace events by their key (the correlation ID)
    .windowedBy(
        SessionWindows.with(Duration.ofSeconds(windowSize))
            .grace(Duration.ofSeconds(windowSize)))
    .aggregate(
        JoinedTrace::new,             // initializer for a new session aggregate
        this::addTraces,              // add an incoming event into the session aggregate
        this::mergeTraces,            // merge overlapping session aggregates
        Named.as("AggregatedTrace"),
        Materialized.<String, JoinedTrace, SessionStore<Bytes, byte[]>>as(SESSION_STATE_STORE_NAME)
            .withValueSerde(new JoinedTraceSerde())
            .withRetention(Duration.ofMinutes(10)));

The windowedBy method defines a session window with a time window and a grace period of the same length for late arrivals. The aggregate method defines how to add events into, and merge events within, the same session window, and how long the state is kept in the state store before it is discarded. A sketch of the two aggregator callbacks is shown below.
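
For context, here is a hedged sketch of what the addTraces and mergeTraces callbacks might look like. The enclosing class name, RawTraceEvent, and the internals of JoinedTrace are illustrative placeholders rather than our actual domain model; the idea is simply that a session aggregate collects every raw event sharing a correlation ID.

public class TraceJoiner {

    // Aggregator<String, RawTraceEvent, JoinedTrace>: fold one more raw event into the
    // session aggregate keyed by the correlation ID.
    private JoinedTrace addTraces(String correlationId, RawTraceEvent event, JoinedTrace aggregate) {
        aggregate.addEvent(event);
        return aggregate;
    }

    // Merger<String, JoinedTrace>: combine two aggregates when overlapping session
    // windows are merged into one.
    private JoinedTrace mergeTraces(String correlationId, JoinedTrace left, JoinedTrace right) {
        right.getEvents().forEach(left::addEvent);
        return left;
    }
}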

As we can see, it’s relatively straightforward to define stream processing logic using the Kafka Streams DSL. In the following sections, we will discuss how to test, package, and deploy the stream processing service.

Unit Testing

Once you have developed the stream processing logic, how do you make sure it functions as you intend? You definitely don’t want to find out in the dev/staging environment after the code has been merged. Fortunately, there is a Kafka Streams test utility library for exercising the stream processing logic through standard unit tests, just like testing any other Java class. If you happen to use the Maven build system, you will need to include the following dependency:

<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-streams-test-utils</artifactId>
  <version>2.0.0</version>
  <scope>test</scope>
</dependency>

The test-utils package provides a TopologyTestDriver that can be used to pipe data through a Topology that is either assembled manually using the Processor API or via the DSL using StreamsBuilder. The test driver simulates the library runtime, which continuously fetches records from input topics and processes them by traversing the topology.
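
Below is a minimal, self-contained sketch of such a unit test. It uses a toy topology rather than our real trace-joining topology, and it assumes kafka-streams-test-utils 2.4 or later, which provides the TestInputTopic and TestOutputTopic helpers (the 2.0.0 version shown above uses ConsumerRecordFactory and readOutput instead).

// A toy topology under test: read events, upper-case the payload, write to an output topic.
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.TestInputTopic;
import org.apache.kafka.streams.TestOutputTopic;
import org.apache.kafka.streams.TopologyTestDriver;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.junit.jupiter.api.Test;

class UppercaseTopologyTest {

    @Test
    void transformsRecordsWithoutARunningBroker() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-events", Consumed.with(Serdes.String(), Serdes.String()))
            .mapValues(value -> value.toUpperCase())
            .to("output-events", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topology-test");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234"); // never contacted by the test driver

        try (TopologyTestDriver driver = new TopologyTestDriver(builder.build(), props)) {
            TestInputTopic<String, String> input = driver.createInputTopic(
                "input-events", new StringSerializer(), new StringSerializer());
            TestOutputTopic<String, String> output = driver.createOutputTopic(
                "output-events", new StringDeserializer(), new StringDeserializer());

            // Pipe a record through the topology and assert on the transformed output.
            input.pipeInput("corr-1", "trace event payload");
            assertEquals("TRACE EVENT PAYLOAD", output.readValue());
        }
    }
}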

For better unit testability, it’s recommended to use event time instead of wall-clock time as the stream processing timing strategy.
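
As an illustration, a custom TimestampExtractor can pull the event time out of the payload itself; in the sketch below, RawTraceEvent and its getCapturedAtMillis() accessor are hypothetical placeholders for a payload-embedded capture timestamp. The extractor would be registered via the default.timestamp.extractor property (StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG), after which unit tests can control time simply by setting timestamps on the records they pipe in.

// A sketch of an event-time TimestampExtractor; RawTraceEvent and getCapturedAtMillis()
// are hypothetical placeholders, not part of the Kafka Streams API.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class TraceEventTimestampExtractor implements TimestampExtractor {

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        if (record.value() instanceof RawTraceEvent) {
            // Use the time the trace was captured, not the time it happens to be processed.
            return ((RawTraceEvent) record.value()).getCapturedAtMillis();
        }
        // Fall back to the record's own (producer-assigned) timestamp for unexpected payloads.
        return record.timestamp();
    }
}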

Deploy Stream Processing Application into Kubernetes

As shown above, the Kafka Streams topology can be initialized and embedded inside a plain Java application, which can then be deployed as a micro-service; a minimal sketch of the embedding follows.
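
In this sketch, the application ID, bootstrap servers, and serde defaults are illustrative placeholders rather than our production configuration.

// Embed the Kafka Streams runtime inside an ordinary Java (micro-)service process.
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;

public class StreamingService {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "trace-joining-service");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // ... define the processing topology here, e.g. the session-window join shown earlier ...
        Topology topology = builder.build();

        KafkaStreams streams = new KafkaStreams(topology, props);
        // Close the streams runtime cleanly when the JVM (or the Kubernetes pod) shuts down.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        streams.start();
    }
}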

To run the micro-service in a Kubernetes cluster, it’s best to deploy it as a StatefulSet with stable storage to keep its state. Any serious application with a reasonably complex topology and processing pipeline will generate a lot of state. In that case, regular operations such as scaling out, or anomalies such as crashes, will trigger a restore/refresh of state from the Kafka back-up topics, which can be costly in terms of time, network bandwidth, etc. Using a StatefulSet, we can ensure that each Pod always has a stable storage medium attached to it, and that this storage does not change over the lifetime of the StatefulSet. This means that after restarts, upgrades, etc., most of the state is already present locally on disk, and the app only needs to fetch the “delta” state from the Kafka topics (if needed). This, in turn, implies that state recovery will be much faster, or may not even be required in some cases.

For better performance when reading and writing the streaming state of the StatefulSet, it’s recommended to attach an SSD class of persistent volume to the streaming service. Below is an example of deploying a streaming service as a Kubernetes StatefulSet with a 50GB SSD persistent volume attached per instance:

####################################################################
# Define 'ssd' storage class
####################################################################
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
####################################################################
# Streaming Headless Service
####################################################################
---
apiVersion: v1
kind: Service
metadata:
  name: streaming-headless
  labels:
    app: streaming
spec:
  clusterIP: None
  selector:
    app: streaming
####################################################################
# Streaming StatefulSet
####################################################################
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: streaming
  labels:
    app: streaming
spec:
  serviceName: streaming-headless
  podManagementPolicy: Parallel
  #.....
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes:
      - "ReadWriteOnce"
      storageClassName: ssd
      resources:
        requests:
          storage: "50Gi"
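
One related detail: for the persistent volume above to actually hold the streaming state, the Kafka Streams state directory needs to point at the path where the volume claim is mounted. The sketch below assumes the data volume is mounted at /data inside the container; the mount path and application ID are illustrative assumptions.

// Point the local state store directory at the mounted persistent volume so that
// RocksDB state survives pod restarts; /data is an assumed mount path for the
// 'data' volumeClaimTemplate above.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "trace-joining-service");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
props.put(StreamsConfig.STATE_DIR_CONFIG, "/data/kafka-streams");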

Track Stream Status using CMAK

We were able to use the open-source CMAK (Cluster Manager for Apache Kafka) UI to track the status of the Kafka Streams consumers. You can click on an individual consumer link to check the status of each streaming task.

Conclusion

Having chosen Kafka Streams to build our stream processing applications, engineering productivity has been quite high, as it’s not much different from developing any other Java application or service. We took care of the first use case mentioned above fairly quickly. In general, we are quite happy with the choice of Kafka Streams.

There are some challenges when performing a rolling upgrade of the streaming service: it results in an event processing lag due to the rebalancing of streaming tasks caused by taking down and bringing back the instances being upgraded, and it takes some time for event processing to catch up again. We will leave how to reduce this processing lag for another post.

Mark Lu

Founding Engineer @ Trace Data. Experienced software engineer, tech lead in building scalable distributed systems/data platforms.