Alarm Catcher - Automated diagnostic from log triggers
Introduces the use of the alarmcatcher tool produced by Crossbeam Support EMEA.
Alarm Catcher allows a chassis to be instrumented so that when a specific log message, or set of log messages, is generated, very detailed diagnostic commands can be run, both system-wide and slot-specific.
This tool is useful in a scenario where a customer with multiple, different systems is experiencing a highly intermittent fault: once every few weeks, for a few minutes, a transient CPU spike occurs. Without this tool, the multiple systems, the number of hours between events, and the nature of the fault make it nearly impossible to collect the key statistics Crossbeam (and Check Point) require to diagnose the cause.
This tool can be installed on a system and configured so that when defined log messages are seen, very detailed system and slot diags are produced for later inspection. The diags produced include CB CLI commands such as "show flow active", Linux CLI commands on the slot (a snapshot of the 'top' output), and complex Check Point CLI commands.
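The trigger-to-diagnostics idea described above can be sketched as follows. This is a minimal illustration in Python, not the actual alarmCatcher.pl logic or the alarmCatcherConfig.pl format (see the README for those). The rule name, the log pattern, and the function name are hypothetical; only "show flow active" and the 'top' snapshot come from the text above.

```python
import re

# Hypothetical trigger table: each rule pairs a log pattern with the
# diagnostics to collect. The rule name and pattern are illustrative only
# and are NOT taken from the real alarmCatcher configuration.
TRIGGERS = [
    {
        "name": "high_cpu",
        "pattern": re.compile(r"cpu utilization .* exceeded", re.IGNORECASE),
        "system_diags": ["show flow active"],  # CB CLI command named in the text
        "slot_diags": ["top -b -n 1"],         # one-shot, batch-mode snapshot of top
    },
]


def diags_for_line(line):
    """Return (rule name, system diags, slot diags) for every rule the line matches."""
    hits = []
    for rule in TRIGGERS:
        if rule["pattern"].search(line):
            hits.append(
                (rule["name"], list(rule["system_diags"]), list(rule["slot_diags"]))
            )
    return hits
```

The function only decides which diagnostics a log line would request; it deliberately does not run anything, so the decision to execute can be left to a separate safety check.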
Overriding all of this is a very rigorous mechanism for ensuring "safety". If many trigger events appear, it is essential to avoid kicking off too many diagnostic snapshots. If, say, running a "show flow active" takes 10 minutes, inadvertently initiating a few hundred of them in parallel is not a good idea on a production system!
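One way to picture such a safety mechanism is a guard that refuses to start a new snapshot while another is running or while a cool-down period is still in effect. The sketch below is an assumption-laden illustration in Python, not the mechanism alarmCatcher.pl actually implements; the class name, the limits, and the 600-second cool-down are all hypothetical.

```python
import time


class SnapshotGuard:
    """Hypothetical guard: caps concurrent snapshots and enforces a cool-down."""

    def __init__(self, max_running=1, cooldown_s=600.0, clock=time.monotonic):
        self.max_running = max_running  # snapshots allowed to run at once
        self.cooldown_s = cooldown_s    # minimum seconds between snapshot starts
        self.clock = clock              # injectable clock, handy for testing
        self.running = 0
        self.last_start = None

    def try_start(self):
        """Return True and reserve a slot only if it is safe to start a snapshot."""
        now = self.clock()
        if self.running >= self.max_running:
            return False  # a snapshot is already in progress
        if self.last_start is not None and now - self.last_start < self.cooldown_s:
            return False  # still inside the cool-down window
        self.running += 1
        self.last_start = now
        return True

    def finish(self):
        """Release the slot once a snapshot has completed."""
        if self.running > 0:
            self.running -= 1
```

With a guard like this, a burst of several hundred trigger events results in at most one snapshot per cool-down window, and the remaining triggers are simply dropped.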
To keep some control, for now we must follow a policy that this tool is not, under any circumstances, deployed on a customer system without internal Crossbeam review first. In our labs (where it could also be hugely useful), play is encouraged! Hence the rule: no external deployment without management OK.
On a positive note: this tool is currently deployed with a large EMEA mobile operator, tracking the high CPU condition mentioned earlier. Check Point is also aware of this tool and has provided a set of diags of interest to them for it to generate.
Attached to this solution are 4 files:
- README: detailed instructions on deploying, configuring and general use of the tool.
- alarmCatcher.pl: the main script file (see the README).
- alarmCatcherConfig.pl: the config file (see the README).
- alarmCatcher: the startup file (see the README).
At the time of writing (and likely for a while after...) this tool is deployed on, and usually running on, EMEA lab system firstname.lastname@example.org.
For a quick look at the sort of output it produces, go there and check /crossbeam/logs, which is full of diags caused by high CPU conditions (albeit from a system carrying little traffic at the time!).