I run icinga2 with a cluster of checker servers, all of which use task-spooler to keep the load down during reloads and restarts of icinga2 (there is an open bug that makes the load skyrocket). Most of the time this runs without problems, but every now and then task-spooler starts logging errors to /tmp/socket-ts.108.error. They look like this:
-------------------Warning
Msg: JobID 206018 quit while running.
errno 104, "Connection reset by peer"
date Wed May 4 20:14:15 2016
pid 633
type SERVER
New_jobs
new_job
jobid 205947
command "/usr/bin/snmpget -v 2c -r 1 -t 5 -c <password> -Oe -OU <hostname> ciscoEnvMonSupplyState.1"
state running
result.errorlevel 0
output_filename "NULL"
store_output 0
pid 16005
should_keep_finished 0
new_job
jobid 205976
command ....
What follows is a huge list (800+) of new jobs. The first 8 (the size of my queue) have PIDs, while the rest have their own JOBIDs but no PIDs. After this long list of new jobs, this appears:
New_notifies
New_conns
new_conn
socket 234
hasjob "1"
jobid 205947
new_conn
socket 665
hasjob "1"
jobid 205976
new_conn
socket 7
hasjob "1"
jobid 206018
new_conn
socket 277
hasjob "1"
jobid 206019
new_conn
socket 278
hasjob "1"
jobid 206021
This is also a long list. Then the whole dump repeats. The last time this happened, it repeated 8183 times in about 20 minutes, and the log file grew to 2.3 GB. I only noticed when free disk space started running low on one of the checkers.
# grep -c "May 4 20:" /tmp/socket-ts.108.error
8183
# ls -la /tmp/socket-ts.108.error
-rw------- 1 nagios nagios 2346969108 May 4 20:23 socket-ts.108.error
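To get more out of the log than a single grep count, here is a minimal Python sketch that summarizes a dump: how many warning blocks it contains, how many new_job records, and how many of those records carry a PID. It assumes the layout shown in the excerpts above (a "-------------------Warning" header, a "New_jobs" section made of "new_job" records, then "New_notifies"). The function name summarize_dump is my own and is not part of task-spooler.

```python
# Marker line that opens each warning block in the excerpts above.
WARNING_MARKER = "-------------------Warning"


def summarize_dump(lines):
    """Count warning blocks and job records in a task-spooler error dump.

    Assumes the layout seen in the excerpts: each block starts with the
    warning marker, lists jobs under "New_jobs" as "new_job" records, and
    ends the job list at "New_notifies" or "New_conns".
    """
    summary = {"warnings": 0, "jobs": 0, "jobs_with_pid": 0}
    in_jobs = False  # True between "New_jobs" and the next section
    in_job = False   # True once a "new_job" record has started
    for raw in lines:
        line = raw.strip()
        if line.startswith(WARNING_MARKER):
            summary["warnings"] += 1
            in_jobs = in_job = False
        elif line == "New_jobs":
            in_jobs = True
        elif line.startswith("New_notifies") or line.startswith("New_conns"):
            in_jobs = in_job = False
        elif in_jobs and line == "new_job":
            summary["jobs"] += 1
            in_job = True
        elif in_job and line.startswith("pid "):
            # The server's own "pid" line sits in the block header,
            # before "New_jobs", so it is not counted here.
            summary["jobs_with_pid"] += 1
            in_job = False  # count each job at most once
    return summary
```

Passing an open file handle for /tmp/socket-ts.108.error to summarize_dump reads it line by line, so even a multi-gigabyte log does not have to fit in memory. If jobs_with_pid divided by warnings comes out well above the configured limit of 8, that would match the runaway behaviour described here.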
When this happens, task-spooler also fails to limit the number of jobs it runs simultaneously. I have a limit of 8 jobs, but I can see hundreds of jobs running and hundreds more in the queue. I can reproduce the error by restarting icinga2, which generates a huge number of jobs for TS to handle.
Can these errors be prevented, or is it possible to disable error-logging?