In Part 1 of this blog series, we focused on MirrorMaker 2 theory (Kafka replication, architecture, components, and terminology) and came up with some MirrorMaker 2 (MM2) rules. In this part, we will be more practical: we'll try out Instaclustr's managed MirrorMaker 2 service and put the rules to the test with some experiments.

1. Configuring Instaclustr Managed MirrorMaker 2 Mirrors

MirrorMaker 2
(Source: Shutterstock)

All of the software you need to run MirrorMaker 2 is Apache-licensed (Apache Kafka, Apache Kafka Connect, Apache MirrorMaker 2), so you can run it yourself, or use a managed service such as Instaclustr's managed MM2. This is what I used to try some experiments, and here are the steps you need to follow.

  • Step 1: Log in to the Instaclustr management console
  • Step 2: Create a Kafka cluster, e.g. kafka_1 (this will be the Kafka target cluster)
  • Step 3: Create a Kafka Connect cluster, with kafka_1 as the target cluster (under Kafka Connect Options on the Kafka cluster creation page)
  • Step 4: Create another Kafka cluster, e.g. kafka_2 (this will be the source cluster)
  • Step 5: Create a MirrorMaker 2 mirror (Kafka Connect -> Mirroring -> Create New Mirror)
  • Step 5 (a): Configure the mirror
    On Instaclustr managed MM2, the available mirror configuration options are as follows (a sketch of the roughly equivalent raw connector configuration follows this list):
    • Kafka target: kafka_1 (fixed)
    • Rename mirrored topics: True/false (default true)
    • Kafka source: Instaclustr managed/other
    • Source cluster name: kafka_2 (the cluster you want to replicate topics from)
    • Connect via private IPs: true/false (default false)
    • Source cluster alias: kafka_2 (the default is the source cluster name)
    • Topics to mirror: “.*” (all by default)
    • Maximum number of tasks: 3 (the default; the maximum useful value is the number of partitions in the source topic)
  • Step 5 (b): Create mirror
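
Under the hood, these console options map onto the configuration of the underlying MirrorMaker 2 source connector running on the Kafka Connect cluster. As a rough illustration only (the broker addresses, connector name, and Connect REST endpoint below are hypothetical, and the managed service may set additional properties), an equivalent connector could be created directly against the Kafka Connect REST API like this:

```python
import json
import requests  # assumption: the requests library is installed

CONNECT_URL = "https://connect.example.com:8083"  # hypothetical Connect REST endpoint

# Roughly equivalent MirrorSourceConnector configuration for a kafka_2 -> kafka_1 mirror
config = {
    "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
    "source.cluster.alias": "kafka_2",            # source cluster alias
    "target.cluster.alias": "kafka_1",
    "source.cluster.bootstrap.servers": "kafka-2-broker:9092",  # hypothetical addresses
    "target.cluster.bootstrap.servers": "kafka-1-broker:9092",
    "topics": ".*",                               # topics to mirror (all by default)
    "tasks.max": "3",                             # maximum number of tasks
    "replication.factor": "3",                    # RF for auto-created remote topics
    # Renaming on/off is controlled by the replication policy; newer Kafka versions
    # ship an IdentityReplicationPolicy that disables the source-alias prefix.
}

resp = requests.put(
    f"{CONNECT_URL}/connectors/mirror-kafka_2-to-kafka_1/config",  # hypothetical name
    headers={"Content-Type": "application/json"},
    data=json.dumps(config),
)
resp.raise_for_status()
print(resp.json())
```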

Here’s a diagram showing what we end up with:

MirrorMaker 2.0

Note that we assume the source topic already exists or is created automatically when an event is first written to it.

Under Kafka Connect->Mirroring you can see a useful summary of the existing mirrors including: data flow, topics, sync state, and actions (details and delete).

Under Mirror details you can see much more, including the mirror connector status, how many tasks are running (mirror and checkpoint), a list of all the replicated topics (renamed if applicable, e.g. A.topic1) and their latencies (useful if you've used a regular expression and aren't 100% sure which topics match it), and the full mirror configuration.

There are also useful mirroring metrics available in the console. For a given data flow and mirrored topic, metrics available are record count, record rate, byte count, byte rate, record age (avg, min, max), and replication latency (avg, min, max). These are actually just the generic source connector metrics.

Note that once you create a Kafka Connect cluster, the target Kafka cluster cannot be changed. In the mirror configuration, the target cluster alias also can't be changed, and for the time being we will leave the source cluster alias unchanged. The maximum number of tasks determines the scalability of each mirror and must be less than or equal to the number of partitions in the topic. The MM2 connector uses a multi-topic consumer, so it works fine if the topics have different numbers of partitions, as long as the number of tasks is configured based on the topic with the most partitions.
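
Since the sensible upper bound for the maximum number of tasks is the largest partition count among the mirrored topics, it's worth checking those counts before configuring the mirror. Here's a minimal sketch using the kafka-python client (the broker address and topic names are hypothetical):

```python
from kafka import KafkaConsumer  # assumption: kafka-python is installed

# Hypothetical source cluster address and topics to be mirrored
consumer = KafkaConsumer(bootstrap_servers="kafka-2-broker:9092")
topics = ["topic1", "topic2"]

# The maximum useful number of tasks is the largest partition count across the topics
max_partitions = max(len(consumer.partitions_for_topic(t) or set()) for t in topics)
print(f"Configure the mirror with at most {max_partitions} tasks")
consumer.close()
```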

2. Experiments With Instaclustr Managed MirrorMaker 2

Experiments with reflection
(Source: Shutterstock)

I tried several experiments with different combinations of: the number of Kafka clusters (1-3), the number of Kafka Connect clusters (1 or 2), topic renaming (on/off), and mirror flows (unidirectional and bidirectional). Some of the experiments were a bit random, and perhaps not useful in practice, but did result in some surprising discoveries. I'll add to the RULES when I find that the existing ones are incomplete.

For each experiment, I created a new topic with three partitions on the source cluster(s) and produced and read events to/from it and other remote topics using the Kafka producer and consumer console tools. This approach was sufficient to check that the remote topics had been created as expected and that events were being replicated correctly between the source and target topics.
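
If you'd rather script this check than use the console tools, here's a minimal sketch of the same idea with the kafka-python client; the broker addresses and topic names are hypothetical, and you may need to wait briefly for replication lag:

```python
from kafka import KafkaProducer, KafkaConsumer  # assumption: kafka-python is installed

SOURCE = "kafka-2-broker:9092"  # hypothetical source cluster address
TARGET = "kafka-1-broker:9092"  # hypothetical target cluster address

# Produce a few test events to the source topic
producer = KafkaProducer(bootstrap_servers=SOURCE)
for i in range(10):
    producer.send("topic1", value=f"event-{i}".encode())
producer.flush()
producer.close()

# Read back from the mirrored topic on the target cluster to confirm replication
consumer = KafkaConsumer(
    "kafka_2.topic1",            # or "topic1" if topic renaming is off
    bootstrap_servers=TARGET,
    auto_offset_reset="earliest",
    consumer_timeout_ms=30000,   # give up after 30s of no new records
)
for record in consumer:
    print(record.topic, record.partition, record.offset, record.value)
consumer.close()
```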

Experiment 1 (1 Cluster)

One Kafka cluster (kafka_1), one Kafka connect cluster with target cluster = kafka_1, unidirectional mirror flow kafka_1->kafka_1. Result? FAILED

MM2

I thought that perhaps replication from/to one Kafka cluster would be a good starting point, and perhaps useful for testing. However, with topic renaming on or off the mirror fails to be created. There is a warning on the Instaclustr console, “Warning: source cluster and target cluster are the same”, which I chose to ignore for the sake of science. However, it turns out that this is more than a suggestion, and you just can’t replicate from/to the same cluster. This gives us another RULE:

RULE 8: The source and target Kafka clusters cannot be the same cluster.

This implies a minimum of two Kafka clusters for the remainder of the experiments.

Experiment 2 (2 Clusters, Unidirectional)

Two Kafka clusters (kafka_1, kafka_2), one Kafka connect cluster with target cluster = kafka_1, unidirectional mirror flow kafka_2->kafka_1. Result? SUCCESS

Apache Kafka MM2

This was my first experiment with two Kafka clusters, and I started by replicating from kafka_2 (the remote cluster) to kafka_1 (the local/target cluster). With topic renaming on or off this worked as expected with the following replications observed:

Renaming on: topic (kafka_2) -> kafka_2.topic (kafka_1)
Renaming off: topic (kafka_2) -> topic (kafka_1)

A new topic, kafka_2.topic, was automatically created on the kafka_1 cluster, and events from the topic on kafka_2 were replicated into it. The properties of the newly created topic were mostly as expected, i.e. the number of partitions was identical; however, the replication factor was 2, whereas the source topic had a replication factor of 3. What happened? It looks like MM2 has a default replication factor of 2 for new topics, rather than copying the source RF value over. Workarounds include manually creating target topics before replication, or changing the RF value after auto-creation.
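
For example, here's a minimal sketch of the first workaround, pre-creating the remote topic with the replication factor you actually want before the mirror starts replicating (kafka-python admin client, hypothetical broker address and topic name):

```python
from kafka.admin import KafkaAdminClient, NewTopic  # assumption: kafka-python is installed

# Pre-create the remote topic on the target cluster with RF 3, so MM2 doesn't
# auto-create it with its default replication factor of 2
admin = KafkaAdminClient(bootstrap_servers="kafka-1-broker:9092")  # hypothetical address
admin.create_topics([
    NewTopic(name="kafka_2.topic1", num_partitions=3, replication_factor=3)
])
admin.close()
```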

From this experiment, we can also conclude that the source cluster does not need to have a Kafka connect cluster associated with it for mirroring to work. 

Experiment 3 (2 Clusters, Unidirectional)

Two Kafka clusters (kafka_1, kafka_2), one Kafka Connect cluster with target cluster = kafka_1, unidirectional mirror flow kafka_1->kafka_2. Result? FAILED

Kafka MirrorMaker 2

Next, I tried to reverse the flow direction to see if I could replicate from kafka_1 (the local/target cluster) to kafka_2 (the remote cluster). Using the Instaclustr managed MM2, it isn't possible to configure this option; it only supports a unidirectional replication flow towards the target cluster. So this gives us a new RULE.

RULE 9: One Kafka connect cluster per target cluster

Each Kafka Connect cluster only supports unidirectional mirror flows to the associated Kafka target cluster from a source cluster (which, as we noted above, doesn't need a Kafka Connect cluster). Flows from the target to other clusters are not supported.

This also follows from the fact that the MirrorMaker 2 connector is really a Kafka source connector, so it is only designed to write to a local Kafka cluster from an external source, not the other way around (which would be a sink connector).

More importantly, it's also best practice for geo-replication ("Best Practice: Consume from remote, produce to local"), as MirrorMaker 2 is commonly used to replicate data between Kafka clusters running in different cloud regions, with potentially high latencies. Kafka producers are more sensitive to high latency than Kafka consumers, so to minimize latency on the producer side, MirrorMaker 2 should be run close to the target cluster.

So, just to check, I repeated the previous experiment, but with a new Kafka connect cluster with the kafka_2 cluster as the target cluster.

Experiment 4 (2 Clusters, Unidirectional)

Two Kafka clusters (kafka_1, kafka_2), one Kafka Connect cluster with target cluster = kafka_2, unidirectional mirror flow kafka_1->kafka_2. Result? SUCCESS

Experiment 4 (2 clusters, unidirectional)

As expected, this configuration allows unidirectional flows from the remote cluster (kafka_1) to the local/target cluster (kafka_2), and it works with both renaming on and off.

There are several consequences of Kafka Connect clusters supporting only unidirectional flows towards the target cluster, as follows:

1. Kafka source clusters do not need Kafka Connect clusters

Proved by experiments 2 and 4. Q.E.D. Let’s make this a new RULE:

RULE 10: No Kafka Connect clusters are needed for Kafka source clusters

From RULES 9 and 10, you can predict how many Kafka Connect clusters are needed for any given pattern. Simple!

2. Efficient fan-in replication (only one Kafka connect cluster required)

It's possible to support fan-in replication efficiently, with multiple unidirectional MirrorMaker 2 flows from many source clusters (A, B, C) to one target cluster (X), using a single Kafka Connect cluster (Connect) running with the target cluster, e.g.

A->(Connect) X
B->(Connect) X
C->(Connect) X

fan-in replication

i.e. you only need one Kafka Connect cluster on the target end of fan-in flows, not one on each source end.
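
As a rough sketch of what this looks like in practice (hypothetical broker addresses, connector names, and Connect REST endpoint), the fan-in pattern is just one mirror connector per source cluster, all created on the single Kafka Connect cluster associated with the target cluster X:

```python
import json
import requests  # assumption: the requests library is installed

CONNECT_URL = "https://connect-x.example.com:8083"  # hypothetical Connect cluster for target X

# One MirrorSourceConnector per source cluster, all hosted on the same Connect cluster
sources = {
    "A": "kafka-a-broker:9092",
    "B": "kafka-b-broker:9092",
    "C": "kafka-c-broker:9092",
}
for alias, bootstrap in sources.items():
    config = {
        "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
        "source.cluster.alias": alias,
        "target.cluster.alias": "X",
        "source.cluster.bootstrap.servers": bootstrap,
        "target.cluster.bootstrap.servers": "kafka-x-broker:9092",
        "topics": ".*",
        "tasks.max": "3",
    }
    requests.put(
        f"{CONNECT_URL}/connectors/mirror-{alias}-to-X/config",
        headers={"Content-Type": "application/json"},
        data=json.dumps(config),
    ).raise_for_status()
```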

3. Inefficient fan-out replication (multiple Kafka connect clusters required)

The downside is that for fan-out patterns you need more Kafka Connect clusters, one per Kafka target cluster. For example, for a fan-out from one source cluster (X) to three target clusters (A, B, C), you need three Kafka Connect clusters, one for each target cluster (ConnectA, ConnectB, ConnectC).

X->(ConnectA) A
X->(ConnectB) B
X->(ConnectC) C

Inefficient fan-out replication