When I say I troubleshot this for days, I mean it was genuinely hard for me to pinpoint, though I realize "days" could also be read as, "oh, wow, it took him way too long to find this."

Let me start by saying that this Zabbix service is something I maintain in my free time, and free time comes at a premium these days.

OK, the error: ...item [item] on host [host] failed: first network error, wait for 15 seconds.

When I built the server that hosts most of my home services, I was in a hurry. As I said above, free time is limited, so I needed the system up and running as fast as possible. The downside is that I had no time to tune anything: not ZFS, not MySQL, nothing.

Everything ran fine for months, and then the Zabbix errors started, flooding my inbox with warnings. In a rush, I power cycled the system. I don't know why; maybe my repressed Windows sysadmin days were itching to resurface. That seemed to resolve the issue for a few days, but then it was back. The next time, I troubleshot like a proper Unix admin:

  • Firewall - no blocks on MySQL or port 10050
  • tcpdump - TCP packets on port 10050 are present
  • ZFS
    • IOPS not saturated (zpool iostat)
    • Tuned the database dataset:
      • zfs set recordsize=16k [db dataset]
      • zfs set primarycache=metadata [db dataset]
      • zfs set logbias=throughput [db dataset]
  • MySQL - tuned to leverage ZFS
    • Added innodb_doublewrite = 0 to my.cnf
    • Moved the MySQL data to the tuned dataset
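For reference, the ZFS and MySQL tuning above boils down to a few commands. This is only a sketch: tank/db is a placeholder dataset name, and the values reflect my setup, not universal best practice.

```shell
# Tune a ZFS dataset for InnoDB (tank/db is a placeholder; substitute your own).
zfs set recordsize=16k tank/db         # match InnoDB's 16 KB page size
zfs set primarycache=metadata tank/db  # let InnoDB's buffer pool cache the data
zfs set logbias=throughput tank/db     # favor throughput over latency for syncs

# And in my.cnf, disable the InnoDB doublewrite buffer -- ZFS's
# copy-on-write semantics already protect against torn pages:
#   [mysqld]
#   innodb_doublewrite = 0
```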

Everything looked fine. However, as I refreshed the Zabbix web UI, I occasionally received a PHP error: php_network_getaddresses: getaddrinfo failed. To reach the Zabbix UI, traffic passes through an Nginx TLS terminator and reverse proxy. All of these sit on the same server, just in separate jails, with a separate FIB for each.

That got me digging into the firewall. Since each jail communicates with neighboring jails through the firewall (router on a stick), I started looking at the states present on it.

When I first deployed Zabbix, I was seeing nearly 20K states. During my troubleshooting, the firewall was only reporting 12K. The hardware should be able to handle 1.5 million states, so I figured it couldn't be the firewall.
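If you want to check this on your own pf firewall, the state table can be inspected directly with pfctl (these need root on the pf host, so treat this as a sketch rather than something to paste blindly):

```shell
# Count the live entries in pf's state table:
pfctl -ss | wc -l

# Or read the counters pf keeps itself (current entries, searches, inserts):
pfctl -si
```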

Turns out, pf on FreeBSD has a default state limit...

(pts/0)[user@server2:~]# pfctl -sm
states        hard limit    10000
src-nodes     hard limit    10000
frags         hard limit     5000
table-entries hard limit   200000

And there it is! My 10K state bottleneck.
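If you want to pull that limit out programmatically (say, for a monitoring check), the pfctl -sm output is easy to parse. A sketch, reusing the output captured above so it runs anywhere:

```shell
# Sample `pfctl -sm` output, as captured above. On a live system you would
# use: pfctl_memory=$(pfctl -sm)
pfctl_memory='states        hard limit    10000
src-nodes     hard limit    10000
frags         hard limit     5000
table-entries hard limit   200000'

# Extract the hard limit for the state table (field 4 of the "states" line):
state_limit=$(printf '%s\n' "$pfctl_memory" | awk '$1 == "states" { print $4 }')
echo "$state_limit"    # prints 10000
```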

After adding set limit { states 40000, frags 20000, src-nodes 30000 } to /etc/pf.conf and issuing a service pf reload, I had plenty of room!
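For completeness, here is the fix as a pf.conf fragment plus the reload and a verification step. The limit values are sized for my workload; pick numbers that fit yours (each state entry costs a small amount of kernel memory).

```shell
# /etc/pf.conf -- option lines like this belong near the top of the ruleset:
#   set limit { states 40000, frags 20000, src-nodes 30000 }

# Reload the ruleset, then confirm the new hard limits took effect:
service pf reload
pfctl -sm
```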

Zabbix health graphs quickly returned to normal, the server went back to processing 130 values per second, the firewall showed 20K+ states again, and the PHP errors were gone!