Kafka Failed to Deliver ### messages

Alright, we are using a Python script to spin up a number of command-line instances of Zeek to read a fairly large amount of data. Unfortunately, I'm seeing a number of the errors below.

I've adjusted the partition count, linger.ms, and batch.size.
Does anyone have any thoughts? I can't find much about this specific problem with Kafka…

Happy to provide additional details if anyone has the time to help me sort this out.

1343593277.256289 error in <params>, line 1: conn/Log::WRITER_KAFKAWRITER: Unable to deliver 5050 message(s)
1343593287.211130 error in <params>, line 1: conn/Log::WRITER_KAFKAWRITER: Unable to deliver 5210 message(s)
1343593690.032751 error in <params>, line 1: conn/Log::WRITER_KAFKAWRITER: Unable to deliver 2213 message(s)
1343594555.945758 error in <params>, line 1: conn/Log::WRITER_KAFKAWRITER: Unable to deliver 1294 message(s)
1343594533.285664 error in <params>, line 1: conn/Log::WRITER_KAFKAWRITER: Unable to deliver 3858 message(s)
1343594499.586091 error in <params>, line 1: conn/Log::WRITER_KAFKAWRITER: Unable to deliver 6502 message(s)
1343594920.674159 error in <params>, line 1: conn/Log::WRITER_KAFKAWRITER: Unable to deliver 3576 message(s)
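
For reference, the Zeek side is the zeek-kafka / metron-bro-plugin-kafka writer (that's where WRITER_KAFKAWRITER comes from), and producer settings like linger.ms and batch.size get handed to librdkafka through its Kafka::kafka_conf table. A rough sketch of that part of the config, with illustrative values rather than our exact ones:

```zeek
# Kafka writer plugin (zeek-kafka) installed via zkg
@load packages

redef Kafka::topic_name = "zeek";            # example topic name
redef Kafka::logs_to_send = set(Conn::LOG);  # conn.log is what's erroring for us

# librdkafka producer properties; broker address and numbers are examples only.
# Older librdkafka builds name these queue.buffering.max.ms / batch.num.messages.
redef Kafka::kafka_conf = table(
    ["metadata.broker.list"] = "localhost:9092",
    ["linger.ms"] = "50",
    ["batch.size"] = "1000000",
    # extra headroom for bursts; librdkafka's default is 100000 messages
    ["queue.buffering.max.messages"] = "500000"
);

# how long (ms) each Zeek instance waits on shutdown to flush queued messages
redef Kafka::max_wait_on_shutdown = 10000;
```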

What does your Kafka architecture look like? Is it a cluster or a single node? Container, VM, or bare metal, and with what resources allocated? SSD or HDD? Is the storage RAID-ed? Is anything else competing for the disks Kafka is using?

Edit: Also, is Kafka local or remote? If it is remote, what are the line rates between the Zeek node and the Kafka nodes?

Single node, VM, HDDs. I don't think the system is set up with RAID (it's in a lab and I don't build the VMs, I just submit resource requests, although I can ask for the drives to be RAID-ed if needed).

The system has 32 cores and 64 GB of RAM.

I can send you the Python script we are using offline if you'd like to see how it's leveraging resources.
We have Kafka set up with 36 partitions, and there are 3 separate Logstash systems with 12 workers each.
Nothing should be competing for disk on that system (and I don't see a resource constraint on IOPS).
Zeek is running with ASCII log writing completely disabled, and nothing else is running on the system.

Kafka is local.

Based on the errors, Zeek writing to Kafka is the issue. My guess is that Kafka is not provisioned with enough resources to keep up with however many Zeek instances your Python script is launching. If you can't scale Kafka up or out, I would limit the number of Zeek workers. From your other issues on Slack it looks like you estimate you're pushing around 10 Gbps, at which point I would recommend running Kafka as a cluster. 10 Gbps is not as simple as it sounds, and I'm positive the current amount of resources is not enough to keep up with it, especially since it's spread across all the different components.

Logstash and Zeek are likely hogging all of your CPU time on that system, and we haven't even started to dive into Elastic or Kafka. So if scaling the hardware is an issue, I highly recommend slowing down the Zeek feed to free up resources and let the rest of the pipeline keep up.

Yeah, that's what we were trying: we set a sleep and reduced the number of workers. Looking at htop, there do seem to be resources available (not a lot, though), and Logstash is actually on separate systems.
I will definitely look into clustering Kafka.

If you have some headroom, you could try increasing num.io.threads in the Kafka config, which defaults to 8. Otherwise I would recommend moving to a clustered architecture so you can scale and distribute the workload across your systems.
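
Something along these lines in server.properties, sized to whatever headroom the box actually has (the numbers here are only an example, not a recommendation for your hardware):

```properties
# Kafka broker thread pools (defaults: num.io.threads=8, num.network.threads=3)
num.io.threads=16
num.network.threads=6
```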