Communication Breakdown, It’s always the same,
I’m having a nervous breakdown, Drive me insane!
tl;dr — Use this plugin to monitor 389 Directory Server replication
We’ve been bitten in the past when multi-master replication between our authentication servers stopped functioning properly and we didn’t find out about it immediately. The failure usually manifests as users complaining that they’re intermittently unable to authenticate against certain services, prompting a round of troubleshooting that eventually reveals the real problem: the user doesn’t exist on all of the IPA servers.
We use freeIPA internally as our centralized user management system. freeIPA combines several standard open source components to provide an “integrated security information management solution”. These components include 389 Directory Server, MIT Kerberos, NTP, DNS, Dogtag certificate system, SSSD as well as several others. In the absence of custom configuration, freeIPA utilizes two instances of 389 Directory Server – one for traditional directory information on the standard port 389, and one for PKI/CA on port 7389. 389 Directory Server’s multi-master replication (MMR) support ensures that directory and certificate data is available from any node in the cluster.
To prevent this unfortunate scenario in the future, we developed a simple nagios/icinga plugin to assess replication health within the 389 Directory Server cluster. Fortunately, information about the structure of the cluster and the status of replication is stored within the LDAP schema itself. In developing the plugin, we hoped to avoid storing any authentication details in the plugin or in the nagios/icinga configuration, which required enabling anonymous read-only querying of the replication agreement data. Daniel James Scott’s blog post provided very clear instructions for enabling anonymous read/search/compare access to the replication agreements. Our plugin uses the Net::LDAP Ruby gem to interact with a 389 Directory Server instance and discover all of the downstream replicas and their respective status. We query the LDAP server with a base of cn=config and a filter of (objectclass=nsds5replicationagreement). The equivalent command line query is:
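A sketch of that query using ldapsearch, assuming anonymous simple bind is permitted as described above; the hostname is a placeholder:

```shell
ldapsearch -x -H ldap://ipa-1.example.com -b cn=config \
  '(objectclass=nsds5replicationagreement)'
```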
This yields data similar to:
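An illustrative excerpt of one replication agreement entry — the attribute names are the standard 389 Directory Server ones, but the values here are invented for the example:

```
dn: cn=meToipa-2.example.com,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
objectClass: nsds5replicationagreement
cn: meToipa-2.example.com
nsDS5ReplicaHost: ipa-2.example.com
nsDS5ReplicaPort: 389
nsds5replicaUpdateInProgress: TRUE
nsds5replicaLastUpdateStart: 20140102030405Z
nsds5replicaLastUpdateEnd: 0
nsds5replicaLastUpdateStatus: 0 Replica acquired successfully: Incremental update started
```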
We’re primarily concerned with how far in the past each replica last successfully completed an update. As you can see from the output above, the replication agreement with ipa-2.example.com is in the middle of an incremental update and shows a last update end of 0. This does not necessarily mean that replication is broken; for better or worse, when the server begins an update, it clears the last end time. To avoid alerting constantly when we’re unable to retrieve meaningful replication data, the plugin maintains a state file that tracks the last valid update completion time and how many consecutive checks have returned a last update end of 0. Both the number of successive zero responses and the acceptable number of minutes since the last successful update completion are configurable, with distinct warning and critical thresholds.
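The threshold logic described above can be sketched roughly as follows. This is not the actual plugin source; the method name, state-hash keys, and default threshold values are assumptions, and only the behavior described in the post is taken as given:

```ruby
# last_update_end: the replica's nsds5replicaLastUpdateEnd value as a string,
#   either a generalized time like "20140102030405Z" or "0" when the server
#   has cleared it at the start of an update.
# state: a hash persisted between checks (the plugin uses a state file).
def replication_status(last_update_end, state,
                       warn_minutes: 30, crit_minutes: 60,
                       warn_zeros: 3, crit_zeros: 6)
  if last_update_end == '0'
    # An in-progress update clears the end time; count successive zeros
    # instead of alerting immediately.
    state[:zero_count] = state.fetch(:zero_count, 0) + 1
  else
    state[:zero_count] = 0
    # Parse the generalized-time string (YYYYMMDDhhmmssZ) as UTC.
    s = last_update_end
    state[:last_good] = Time.utc(s[0, 4].to_i, s[4, 2].to_i, s[6, 2].to_i,
                                 s[8, 2].to_i, s[10, 2].to_i, s[12, 2].to_i)
  end

  minutes_since_good =
    state[:last_good] ? (Time.now.utc - state[:last_good]) / 60.0 : Float::INFINITY

  if state[:zero_count] >= crit_zeros || minutes_since_good >= crit_minutes
    :critical
  elsif state[:zero_count] >= warn_zeros || minutes_since_good >= warn_minutes
    :warning
  else
    :ok
  end
end
```

A fresh, successfully completed update resets the zero counter and the last-good timestamp, so a replica only trips the thresholds after sustained zero responses or a genuinely stale completion time.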
Since putting this monitoring in place, we’ve moved to newer freeIPA servers using replication to seamlessly migrate data from the old servers to the new. This plugin has already served to identify a breakdown in our replication that was easily remedied because the nodes had not yet significantly diverged. Other aspects of the health and performance of the IPA cluster are available via SNMP.