errno 104, "Connection reset by peer" #4

@kunsjef

I run icinga2 with checker servers in a cluster, all of which run task-spooler to keep the load down during reloads and restarts of icinga2 (there is an open bug that makes the load skyrocket). Most of the time this runs without problems, but every now and then task-spooler starts logging errors to /tmp/socket-ts.108.error. They look like this:

-------------------Warning
 Msg: JobID 206018 quit while running.
 errno 104, "Connection reset by peer"
date Wed May  4 20:14:15 2016
pid 633
type SERVER
New_jobs
  new_job
    jobid 205947
    command "/usr/bin/snmpget -v 2c -r 1 -t 5 -c <password> -Oe -OU <hostname> ciscoEnvMonSupplyState.1"
    state running
    result.errorlevel 0
    output_filename "NULL"
    store_output 0
    pid 16005
    should_keep_finished 0
  new_job
    jobid 205976
    command ....

What follows is a huge list (800+) of new jobs. The first 8 (the size of my queue) have PIDs, while the rest have their own job IDs but no PIDs. After this long list of new jobs, this appears:

New_notifies
New_conns  new_conn
    socket 234
    hasjob "1"
    jobid 205947
  new_conn
    socket 665
    hasjob "1"
    jobid 205976
  new_conn
    socket 7
    hasjob "1"
    jobid 206018
  new_conn
    socket 277
    hasjob "1"
    jobid 206019
  new_conn
    socket 278
    hasjob "1"
    jobid 206021

This is also a long list. Then the whole block repeats. The last time this happened, it repeated 8183 times in about 20 minutes, and the log file grew to 2.3 GB. I only noticed when free disk space started running low on one of the checkers.

# grep -c "May  4 20:" /tmp/socket-ts.108.error
8183
# ls -la /tmp/socket-ts.108.error
-rw-------  1 nagios nagios 2346969108 May  4 20:23 socket-ts.108.error
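
The two commands above can be folded into one small helper. This is only a sketch: `ts_log_stats` is a name made up for illustration, and it assumes each warning block contains exactly one line matching the pattern (here, the `date` line). It prints the block count, the file size, and the average number of bytes per block.

```shell
#!/bin/sh
# Hypothetical helper (not part of task-spooler): summarize an error log.
# Usage: ts_log_stats FILE PATTERN
# Assumes PATTERN matches exactly one line per warning block.
ts_log_stats() {
    file=$1
    pattern=$2
    # grep -c exits non-zero on zero matches but still prints 0.
    blocks=$(grep -c -- "$pattern" "$file" || true)
    # wc -c may pad its output with spaces on some systems.
    bytes=$(wc -c < "$file" | tr -d ' ')
    echo "blocks=$blocks bytes=$bytes"
    if [ "$blocks" -gt 0 ]; then
        echo "avg_bytes_per_block=$((bytes / blocks))"
    fi
}

# Example call for the log in this report:
# ts_log_stats /tmp/socket-ts.108.error "May  4 20:"
```

With the figures above (2346969108 bytes, 8183 blocks), that works out to roughly 287 kB written per repetition.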

Also, when this happens, task-spooler no longer limits the number of jobs it runs simultaneously. I have a limit of 8 jobs, but when this happens I can see hundreds of jobs running and hundreds more in the queue. I can reproduce the error by restarting icinga2, which generates a huge number of jobs for task-spooler to handle.
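
One way to compare the number of running jobs with the configured limit is to count states in the job listing. The sketch below makes a few assumptions: `count_state` is a hypothetical helper name, the listing is in the default layout with the state in the second column after a one-line header, and the client binary is called `ts` (some distributions install it as `tsp`).

```shell
#!/bin/sh
# Hypothetical helper: count jobs in a given state from a task-spooler
# listing read on stdin. Assumes a one-line header and the state in column 2.
count_state() {
    awk -v state="$1" 'NR > 1 && $2 == state { n++ } END { print n + 0 }'
}

# Example usage, run as the user that owns the queue (nagios here):
# ts -S                        # print the configured slot limit
# ts -l | count_state running  # jobs currently running
# ts -l | count_state queued   # jobs still waiting
```

If the running count is far above the value printed by `ts -S`, the server is in the state described above.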

Can these errors be prevented, or is it possible to disable error logging?
