Date: Thu, 3 Dec 2015 15:12:20 -0500 (EST)
From: Ulrich Obergfell <uobergfe@redhat.com>
To: Tejun Heo <tj@kernel.org>
Cc: Don Zickus <dzickus@redhat.com>, Ingo Molnar <mingo@redhat.com>,
        Peter Zijlstra <peterz@infradead.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        linux-kernel@vger.kernel.org, kernel-team@fb.com
Message-ID: <1971916814.34665208.1449173540866.JavaMail.zimbra@redhat.com>
In-Reply-To: <20151203194358.GK27463@mtj.duckdns.org>
References: <20151203002810.GJ19878@mtj.duckdns.org> <20151203002839.GK19878@mtj.duckdns.org> <20151203175024.GE27730@redhat.com> <20151203194358.GK27463@mtj.duckdns.org>
Subject: Re: [PATCH 2/2] workqueue: implement lockup detector


Tejun,

I share Don's concern about connecting the soft lockup detector and the
workqueue watchdog to the same kernel parameter in /proc. I would feel
more comfortable if the workqueue watchdog had its dedicated parameter.


I also see a scenario that the proposed patch does not handle well: the
watchdog_thresh parameter can be changed 'on the fly', i.e. without
disabling and re-enabling the watchdog. The flow of execution looks
like this:

  proc_watchdog_thresh
    proc_watchdog_update
      if (watchdog_enabled && watchdog_thresh)
          watchdog_enable_all_cpus
            if (!watchdog_running) {
                ...
            } else {
                //
                // update 'on the fly'
                //
                update_watchdog_all_cpus()
            }

The patched watchdog_enable_all_cpus() function disables the workqueue watchdog
unconditionally at [1]. However, the workqueue watchdog remains disabled if the
code path [2] is executed (and wq_watchdog_thresh is not updated either).

static int watchdog_enable_all_cpus(void)
{
        int err = 0;

[1] --> disable_workqueue_watchdog();

        if (!watchdog_running) {
                ...
        } else {
     .-         /*
     |           * Enable/disable the lockup detectors or
     |           * change the sample period 'on the fly'.
     |           */
[2] <            err = update_watchdog_all_cpus();
     |
     |          if (err) {
     |                  watchdog_disable_all_cpus();
     |                  pr_err("Failed to update lockup detectors, disabled\n");
     '-         }
        }

        if (err)
                watchdog_enabled = 0;

        return err;
}
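
To illustrate the scenario, here is a minimal user-space model (not kernel
code). The struct wd_model and all model_* functions are hypothetical names
introduced only for this sketch. It assumes that the elided '!watchdog_running'
branch is where the patch re-enables the workqueue watchdog, and the second
variant is just one possible way to cover both paths, not a proposed fix.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the state involved; not kernel code. */
struct wd_model {
	bool watchdog_running;     /* models watchdog_running */
	bool wq_watchdog_enabled;  /* models whether the wq watchdog is armed */
	int  watchdog_thresh;      /* value written via /proc */
	int  wq_watchdog_thresh;   /* threshold the wq watchdog actually uses */
};

void model_disable_wq_watchdog(struct wd_model *m)
{
	m->wq_watchdog_enabled = false;
}

void model_enable_wq_watchdog(struct wd_model *m)
{
	/* Pick up the current threshold and arm the detector. */
	m->wq_watchdog_thresh = m->watchdog_thresh;
	m->wq_watchdog_enabled = true;
}

/* Models the patched flow as described above. */
int model_enable_all_cpus_patched(struct wd_model *m)
{
	model_disable_wq_watchdog(m);          /* [1] unconditional */

	if (!m->watchdog_running) {
		m->watchdog_running = true;
		model_enable_wq_watchdog(m);   /* assumption: re-enabled here */
	} else {
		/* [2] 'on the fly' update: nothing re-enables the wq watchdog */
	}
	return 0;
}

/* One possible alternative (assumption): re-enable on both paths. */
int model_enable_all_cpus_both_paths(struct wd_model *m)
{
	model_disable_wq_watchdog(m);
	if (!m->watchdog_running)
		m->watchdog_running = true;
	model_enable_wq_watchdog(m);
	return 0;
}
```

Starting from a running watchdog and changing watchdog_thresh from 10 to 20,
the first function leaves wq_watchdog_enabled false and wq_watchdog_thresh at
10, while the second leaves the detector enabled with a threshold of 20.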


Another question that comes to mind: would the workqueue watchdog
participate in the lockup detector suspend/resume mechanism, and if so, how
would it be integrated into it?


Regards,

Uli


----- Original Message -----
From: "Tejun Heo" <tj@kernel.org>
To: "Don Zickus" <dzickus@redhat.com>
Cc: "Ulrich Obergfell" <uobergfe@redhat.com>, "Ingo Molnar" <mingo@redhat.com>, "Peter Zijlstra" <peterz@infradead.org>, "Andrew Morton" <akpm@linux-foundation.org>, linux-kernel@vger.kernel.org, kernel-team@fb.com
Sent: Thursday, December 3, 2015 8:43:58 PM
Subject: Re: [PATCH 2/2] workqueue: implement lockup detector

Hello, Don.

On Thu, Dec 03, 2015 at 12:50:24PM -0500, Don Zickus wrote:
> This sort of looks like the hung task detector..
> 
> I am a little concerned because we just made a big effort to properly
> separate the hardlockup and softlockup paths and yet retain the flexibility
> to enable/disable them separately.  Now it seems the workqueue detector is
> permanently entwined with the softlockup detector.  I am not entirely sure
> that is the correct thing to do.

The only area where they get entwined is how they're controlled from userland.
While it isn't quite the same as softlockup detection, I think what it
monitors is close enough that it makes sense to put them under the
same interface.

> It also seems awkward for the lockup code to have to jump to the workqueue
> code to function properly. :-/  Though we have made exceptions for the virt
> stuff and the workqueue code is simple..

Softlockup code doesn't depend on workqueue in any way.  Workqueue
piggybacks on touch_softlockup to detect cases which shouldn't trigger a
warning, and whether it is enabled is controlled together with softlockup.
That's it.

> Actually, I am curious, it seems if you just added a
> /proc/sys/kernel/wq_watchdog entry, you could eliminate the entire need for
> modifying the watchdog code to begin with.  As you really aren't using any
> of it other than piggybacking on the touch_softlockup_watchdog stuff, which
> could probably be easily added without all the extra enable/disable changes
> in watchdog.c.

Yeah, except for the touch signal, it's purely an interface thing.  I
don't feel too strongly about this but it seems a bit silly to introduce
a whole different interface for this.  e.g. if the user wanted to
disable softlockup detection, it'd be weird to leave wq lockup
detection running.  The same goes for threshold.

> Again, this looks like what the hung task detector is doing, which I
> struggled with years ago to integrate with the lockup code because in the
> end I had trouble re-using much of it.

So, it's a stall detector and there are inherent similarities but the
conditions tested are pretty different and it's a lot lighter.  I'm
not really sure what you mean to say.

Thanks.

-- 
tejun