WatchDog

In keeping with EVI's sometimes paranoid attention to error detection, the PICS WatchDog application was created solely to monitor the PICS Task Monitor (TaskMon) on all managed subsystems. The WatchDog is normally configured as the first application started on managed nodes so that it begins monitoring TaskMon as soon as possible.

Once started, WatchDog sets a timer and checks the following each period:

  1. Is the Task Monitor application still present in the system?
  2. Is the Task Monitor window still present in the system?
  3. Has the Task Monitor requested my state of health within the last 30 seconds?

If any of these conditions is FALSE, the WatchDog initiates a system reboot. This keeps as-yet-unknown bugs in the Task Monitor from leaving a node hung in a non-functional state. Ideally, a reboot clears whatever condition triggered the initial problem.
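
The check cycle described above can be sketched roughly as follows. This is only an illustration of the logic, not the actual WatchDog source: the real application is a Windows program, and the names used here (Probes, taskmon_healthy, run_watchdog, the reboot callback) and the 5-second timer period are assumptions made for the example. Only the three checks and the 30-second limit come from the text.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

# From the text: TaskMon must have asked for WatchDog's health within 30 seconds.
HEALTH_TIMEOUT_S = 30.0


@dataclass
class Probes:
    """Hypothetical hooks standing in for the Windows calls WatchDog really uses."""
    app_present: Callable[[], bool]            # check 1: TaskMon application exists
    window_present: Callable[[], bool]         # check 2: TaskMon window exists
    last_health_request: Callable[[], float]   # time of TaskMon's last health query


def taskmon_healthy(probes: Probes, now: float) -> bool:
    """Return True only if all three conditions hold at time `now`."""
    if not probes.app_present():
        return False
    if not probes.window_present():
        return False
    return (now - probes.last_health_request()) <= HEALTH_TIMEOUT_S


def run_watchdog(
    probes: Probes,
    reboot: Callable[[], None],
    period_s: float = 5.0,                     # assumed period; the text gives none
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    max_cycles: Optional[int] = None,          # None means run until a failure
) -> bool:
    """Check TaskMon once per period; request a reboot on the first failed check.

    Returns True if a reboot was requested, False if max_cycles elapsed cleanly.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        sleep(period_s)
        cycles += 1
        if not taskmon_healthy(probes, clock()):
            reboot()
            return True
    return False
```

Passing the probes, clock, and reboot action in as callables keeps the decision logic separate from the operating-system calls, which is one way to keep such a monitor small and easy to test.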

At times, customers have reported that WatchDog was unable to reboot a node. Unfortunately, we must rely on Windows to actually do what we ask once a function call returns success. Over the years, and across different Windows releases, we have found many times that it does not always do so. As such, in the absence of a hardware watchdog device, EVI cannot guarantee that a machine will always reboot after WatchDog detects a problem: Windows may elect to ignore the request (or perhaps the root cause of the detected problem was Windows itself).

Despite these concerns, WatchDog's simplicity works for us: it uses almost no PICS functions and very few Windows functions to perform its task, which minimizes the number of things outside of our control.