
Reasons to buy

Surprisingly capable of identifying issues and clustering anomalous events; very competitive pricing – cost is per device where many packages charge per managed port; vendor claims that faults are identified considerably more quickly following the introduction of Moog.

Reasons to avoid

Non-deterministic, and hence will not be 100 per cent accurate when identifying problems.

By David Cartwright

published 3 January 2014

in Reviews

Traditional network monitoring tools depend on three things. First, that the things you want to monitor are accessible from the monitoring station via SNMP, WMI or some other useful protocol. Second, that you have the time and the inclination to constantly refine your thresholds, ensure that you add the ports you want to monitor, and define outage windows so that you're not alerted when you take a system down for maintenance. And third, that when something turns red, you understand what that means in terms of the applications that rely on the devices and interfaces that just turned red.
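To make the second and third points concrete, here is a minimal sketch of the kind of deterministic check a traditional tool performs: a reading either breaches a threshold or it doesn't, and an outage window suppresses the alert. The function name, the 80 per cent threshold and the window times are all hypothetical, chosen purely for illustration rather than taken from any particular product.

```python
from datetime import datetime

def should_alert(utilisation, threshold, when, outage_windows):
    """Return True if the reading breaches the threshold outside every outage window."""
    # An outage window is a (start, end) pair; alerts inside it are suppressed.
    in_outage = any(start <= when < end for start, end in outage_windows)
    return utilisation > threshold and not in_outage

# Hypothetical maintenance window: 22:00 to 23:00 on 3 January 2014.
windows = [(datetime(2014, 1, 3, 22, 0), datetime(2014, 1, 3, 23, 0))]

afternoon = should_alert(0.85, 0.80, datetime(2014, 1, 3, 14, 0), windows)
maintenance = should_alert(0.85, 0.80, datetime(2014, 1, 3, 22, 30), windows)
print(afternoon, maintenance)
```

The check itself is trivial; the ongoing work is in keeping the thresholds and windows up to date, and in knowing what a breach means for the applications on top.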

The arrival of cloud and the appearance of software-defined networks mean that our needs for network monitoring software have changed. We need more information and need it faster.

What techniques should we be using? Continuous polling is all very well, then, but there's another source of useful information: the event, alert and system log entries that our infrastructure systems generate by the shedload. By definition this incessant stream tells us a load of stuff we're interested in both for administrative purposes and, more importantly, when diagnosing problems.

Compromise

The problem with event and alert logs is twofold, though.

First, you always end up trying to find a compromise between brevity and informativeness. Turn the logging volume up and you get so many messages that you can't see the wood for the trees. Make it more selective and you reduce the volume but inevitably miss messages you care about (and, in reality, you still end up with way more volume than you can sensibly handle anyway).

Second, although some of the aspects of an event log message are standard (eg the timestamp) the actual body of the message follows no standard, computer-readable format. Since it's not realistic to produce a piece of software that proactively reads and understands this mass of free text and identifies relationships between the contents of streams from different devices, you're stuck with using it reactively as human-readable reference material when you're diagnosing a problem.
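A short sketch shows where the structure runs out. The log layout here (ISO-style timestamp, hostname, then free text) and the sample lines are assumptions made up for illustration; real devices vary widely.

```python
import re
from datetime import datetime

# Assumed header layout: "<timestamp> <host> <free-text body>".
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}) (?P<host>\S+) (?P<body>.*)$"
)

def parse(line):
    """Split a log line into its structured header and its unstructured body."""
    m = LINE.match(line)
    if m is None:
        return None
    return {
        "timestamp": datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S"),
        "host": m.group("host"),
        "body": m.group("body"),  # still just a string: no standard schema here
    }

# Hypothetical sample messages from two different systems.
samples = [
    "2014-01-03T09:15:02 db01 transaction could not be committed",
    "2014-01-03T09:15:01 fs01 /dev/sda1: disk full",
]
events = [parse(s) for s in samples]
```

The timestamp and host parse cleanly into typed fields, but the body is whatever the developer of that particular system chose to write, which is why relating one stream to another has traditionally been left to a human.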

One thing, though. When I said it's not realistic to produce a piece of software that can work proactively with event messages, or to relate event streams to each other, I should have said: “Unless you're Moogsoft”. Because that's exactly what they've done.

No, I didn't believe it either. The thing is, though, the producers of Incident.Moog, aside from having the ability to invent the most bonkers product name ever, do have a bit of a history in network management software – in that they produced NetCool (probably the best known, and definitely one of the best, network monitoring packages you can get), and followed it up with the equally funky RiverSoft. Moog is the latest venture, which involved an extensive academic research process to devise algorithms for interpreting and collating event messages and then using the analysis to report problems to the IT team.

The idea behind Moog is to identify incidents as they start to take place, by identifying the events that matter and disregarding the background “noise”. Step one, then, is to configure your various systems to deliver their event logs to the Moog server so it has the universe of messages to operate on. I was initially cynical about this approach – after all, with big volumes of messages this could conceivably mean high levels of traffic, and we've all seen distributed monitoring agents deployed for the sole purpose of keeping traffic levels down. Having thought about it, though, event logs may be high-volume entities but their messages are highly similar, which means that even if you're running over a WAN, today's WAN optimisers will be able to minimise the bandwidth required for shoving this traffic around the network.
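The point about similarity is easy to check for yourself. This sketch uses Python's standard zlib module as a stand-in for whatever compression and de-duplication a WAN optimiser applies (an assumption; real appliances use their own techniques), and the log lines are synthetic ones invented for the purpose.

```python
import zlib

# Synthetic, made-up log lines that differ only in the time and the port number.
lines = [
    f"2014-01-03T09:{i // 60:02d}:{i % 60:02d} sw01 link state changed on port {i % 48}"
    for i in range(1000)
]
raw = "\n".join(lines).encode("utf-8")

# General-purpose compression exploits the repetition between messages.
packed = zlib.compress(raw)
ratio = len(packed) / len(raw)
print(f"{len(raw)} bytes raw, {len(packed)} bytes compressed ({ratio:.1%})")
```

Real logs are messier than this, so the saving will be smaller, but the principle holds: the more alike the messages are, the less of each one actually has to cross the wire.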

From the barrage of events it sees, Moog identifies “clusters”. These are collections of events that it thinks are both: (a) related to each other; and (b) anomalies that aren't just part of the normal background noise. When it identifies an anomaly, it reports it on the incident list for you to investigate.
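Moogsoft hasn't published its algorithms, so the following is emphatically not how Moog works; it's a deliberately naive sketch of the two ideas, under my own assumptions. "Anomalous" here means a message shape never seen in a baseline of normal traffic, and "related" means nothing more than arriving within a few seconds of each other. All names, the 30-second gap and the sample events are hypothetical.

```python
import re
from collections import Counter

def template(body):
    # Collapse numbers so messages that differ only in values share one shape.
    return re.sub(r"\d+", "#", body.lower())

def find_clusters(baseline, events, max_baseline_count=0, gap=30):
    """baseline: message bodies seen during normal running.
    events: (seconds, host, body) tuples, in any order.
    Returns a list of clusters, each a time-ordered list of events."""
    seen = Counter(template(b) for b in baseline)
    # (b) keep only events whose shape is not part of the background noise
    unusual = sorted(e for e in events if seen[template(e[2])] <= max_baseline_count)
    clusters = []
    for e in unusual:
        # (a) treat events as related if they follow the previous one closely
        if clusters and e[0] - clusters[-1][-1][0] <= gap:
            clusters[-1].append(e)
        else:
            clusters.append([e])
    return clusters

baseline = ["user 12 logged in", "user 98 logged in", "backup job 4 finished"]
events = [
    (100, "web01", "user 33 logged in"),
    (105, "fs01", "disk full on /data"),
    (107, "db01", "transaction could not be committed"),
    (110, "db01", "transaction could not be committed"),
    (900, "web01", "user 41 logged in"),
]
clusters = find_clusters(baseline, events)
```

With this made-up data the routine logins are discarded and the disk and database messages end up in a single cluster. A real system has to cope with noise, clock skew and events that are related but far apart in time, which is where the hard research lies.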

One of the interesting (and entirely counter-intuitive) aspects of Moog is that you don't actually configure it with any knowledge of your systems before letting it go off and run. It therefore bases everything it does entirely on the event logs it sees. You can, however, configure it to interrogate whatever data sources it has available (your Configuration Management Database being the obvious one) in order that it can embellish what it tells you with some helpful context. The basic means of alerting is to add an incident to Moog's own GUI-based list, but as you'd expect it can also throw alerts to other monitoring systems or even to less obvious mechanisms such as Twitter.

When you want to look into an incident, you do so in the Situation Room. This is, in a sentence, a collaborative, interactive interface that lets you drill into the detail behind the cluster of events that the system has decided make up the incident – which of course could involve hundreds or even thousands of alerts from a number of different systems. One of the examples I looked at for this review was a cluster of events that came from several different systems; some were generated by a DBMS and were alerting that a transaction couldn't be committed, and when you scrolled down through the alerts you came across a bunch of “disk full” alerts from the underlying system. Moog had identified these alerts as all being related and clustered them all together; it was a no-brainer to see that the application was failing because the database was blowing up as a result of the underlying disk being full.

I mentioned that the Situation Room is collaborative; it includes interactive messaging facilities so you can invite colleagues to work with you and look collectively at the problem. Because of this, the system actually becomes its own knowledge base – if a future incident resembles a past incident the system figures this out and makes it easy for you to wind back through the events and any collaborative discussions from the last time you had the problem. Moog are very proud of the novel way they've approached the Situation Room, to the extent that they've asked me not to print a screenshot of it just yet as the patents for some of the aspects are still pending.

The unique thing about Incident.Moog, and the aspect that will most bemuse potential customers, is that the vendor openly accepts that it'll never be 100 per cent accurate at showing up issues. That is, you'll miss the occasional alert and you'll also get the occasional false positive. This is an entirely alien concept to anyone who's used to traditional monitoring software, where SNMP alerts are entirely deterministic – that is, if it says that (say) a particular LAN port is more than 80 per cent utilised, you can be sure it is. With Moog you get something less deterministic, but which is highly likely (and in real-world trials with some huge customers this has been shown to be the case) to alert you to problems considerably more quickly and considerably more informatively than ever before: no more waiting for loads to go over thresholds or for the phone to ring, since as soon as an anomaly begins you know about it, very possibly much more quickly than you're used to.

The average potential customer will be entirely sceptical about Moog, and won't believe that it can possibly do what it says it will. Then they'll see it in action and will stare in disbelief that it actually does.

I know I did.

Verdict

This is a product that has you rethinking the way that network monitoring is done. It's an ideal product for an SDN/cloud world with high volumes of traffic and a greater degree of automation, but it does require a bit of counter-intuitive thinking: this is not the way that these products are meant to be.