Tuesday 9 August 2011

Handling timeouts in perl scripts

Reference link: http://larsmichelsen.com/perl/handing-timeouts-in-perl-scripts/



These timeouts can occur for several reasons. Mostly some network connections via SSH, SNMP or HTTP-Gets which run longer than the default check timeout in Nagios. I usually use a global check timeout of 15 seconds in Nagios. This means a check script which runs more than 15 seconds will be killed. In this case Nagios returns a CRITICAL state with a “(Service check timed out)” message. This is bad cause I don’t want those timeouts to result in CRITICAL state and some CRITICAL notifications.

Now read how to catch such timeouts…

I use the alarm(13) function in my Perl script. This functions sets a timer to the given number of seconds – in this case 13 seconds. When the timer reached zero it will throw a SIGALRM. Thankfully you can customize the event on SIGALRM. This gives my script the possibility to print some output and exit the script regular with a UNKNOWN exit code.



Here is my snippet to catch timeouts in perl scripts:



# Handle SIGALRM (timeout triggered by alarm() call)

$SIG{ALRM} = sub {

print "UNKNOWN: Script timed out\n";

exit -1;

};



# Start the timer to script timeout

alarm(13);



Other good source: http://www.trevorbowen.com/tag/perlstuff/

The Perl system() function is very powerful and elegant. It forks a child process, suspends the parent process, executes the provided command (potentially calling your shell to help parse the arguments), while seamlessly redirecting all IO between the parent and child process! The usage is simple, but what happens behind the scenes is amazing!

Unfortunately, there is no way to interrupt the system() function in Perl. Sure, you can kill the main Perl program, but that’s not what I want. I want to call system() with 2 additional arguments: timeout and maxattempts. This subroutine would operate just like the traditional system() function, unless operation time exceeded the timeout value, in which case, the command would be killed and restarted, until the maximum number of attempts was exceeded.

You can find many resources that detail how to timeout a long Perl operation, like so:

eval {

local $SIG{ALRM} = sub { die "alarm clock restart" };

alarm 10;

flock(FH, 2); # blocking write lock

alarm 0;

};

if ($@ and $@ !~ /alarm clock restart/) { die }

Unfortunately, there is a little footnote that says you should not try this with system calls; otherwise, you get zombies. Sure enough, if you substitute a system() function for the above flock, the parent Perl script is alarmed by the timeout and exits the eval. Normally, this would kill the flock or any other function. But the system function persits. The parent may even complete the remainder of its program and exit, but the child will keep on ticking – not what I wanted. The second problem is that there is no way to get, or access the process id of the command executed by the system() function; therefore, there is no way to kill a system function call by the parent Perl process – at least, no way that I have found.

The above link suggests using fork and exec to create your own function, which is ultimately what I did. So, let’s jump straight to the chase scene, shall we? Here’s my final solution.

Preferred Solution

#!/usr/bin/perl -w

use strict 'refs';

use POSIX "sys_wait_h";



sub timedSystemCall {



my ($cmd, $timeout, $maxattempts, $attempt, $origmax) = @_;



# degenerate into system() call - infinite timeout, if timeout is undefined or negative

$timeout = 0 unless defined($timeout) && ($timeout > 0);

# degenerate into system() call - 1 attempt, if max attempts is undefined or negative

$maxattempts = 1 unless defined($maxattempts) && ($maxattempts > 0);

$attempt = 1 unless defined($attempt) && ($attempt > 0);

$origmax = $maxattempts unless defined $origmax;



local ($rc, $pid);



eval {

local $SIG{ALRM} = sub { die "TIMEOUT" };



# Fork child, system process

FORK: {

if ($pid = fork) {

# parent picks up here, with $pid = process id of child; however...

# NO-OP - Parent does nothing in this case, except avoid branches below...

} elsif (defined $pid) { # $pid is zero if here defined

# child process picks up here - parent process available with getppid()

# execute provided command, or die if fails

exec($cmd) || die("(E) timedSystemCall: Couldn't run $cmd: $!\n");

# child never progresses past here, because of (exec-or-die) combo

} elsif ($! =~ /No more processes/) {

# Still in parent: EAGAIN, supposedly recoverable fork error

print STDERR "(W) timedSystemCall: fork failed. Retrying in 5-seconds. ($!)\n";

sleep 5;

redo FORK;

} else {

# unknown fork error

die "(E) timedSystemCall: Cannot fork: $!\n";

}

}



# set alarm to go off in "timeout" seconds, and call $SIG{ALRM} at that time

alarm($timeout);

# hang (block) until program is finished

waitpid($pid, 0);



# program is finished - disable alarm

alarm(0);

# grab output of waitpid

$rc = $?;

}; # end of eval



# Did eval exit from an alarm timeout?

if (($@ =~ "^TIMEOUT") || !defined($rc)) {

# Yes - kill process

kill(KILL => $pid) || die "Unable to kill $pid - $!";

# Collect child's remains

($ret = waitpid($pid,0)) || die "Unable to reap child $pid (ret=$ret) - $!";

# grab exit output of child process

if ($rc = $?) {

# exit code is lower byte: shift out exit code, leave top byte in $rc

my $exit_value = $rc >> 8;

# killing signal is lower 7-bits of top byte, which was shifted to lower byte

my $signal_num = $rc & 127;

# core-dump flag is top bit

my $dumped_core = $rc & 128;

# Notify grandparent of obituary

print STDERR "(I) timedSystemCall: Child $pid obituary: exit=$exit_value, kill_signal=$signal_num, dumped_core=$dumped_core\n";

}

# Can we try again?

if ($maxattempts > 1) {

# Yes! Increment counter, for print messages

$attempt++;

print STDERR "(W) timedSystemCall: Command timed-out after $timeout seconds. Restarting ($attempt of $origmax)...\n";

# Recurse into self, while decrementing number of attempts. Return value from deepest recursion

return timedSystemCall($cmd, $timeout, $maxattempts-1, $attempt, $origmax);

} else {

# No! Out of attempts...

print STDERR "(E) timedSystemCall: Exhausted maximum attempts ($origmax) for command: $cmd\nExiting!\n";

# Return error code of killed process - will require interpretation by parent

return $rc;

}

} else {

# No - process completed successfully! Hooray!!! Return success code (should be zero).

return $rc;

}

}



exit timedSystemCall("inf.pl", 5, 3);


No comments:

Post a Comment

Tweets by @sriramperumalla