Date: Tue, 25 Jan 2022 14:51:08 -0800
From: "Paul E. McKenney"
To: Corey Minyard
Cc: Juergen Gross, Ingo Molnar, LKML
Subject: Re: Possible reproduction of CSD locking issue
Message-ID: <20220125225108.GV4285@paulmck-ThinkPad-P17-Gen-1>
Reply-To: paulmck@kernel.org
References: <20220125182737.GO34919@minyard.net> <20220125185338.GP4285@paulmck-ThinkPad-P17-Gen-1> <20220125215238.GP34919@minyard.net>
In-Reply-To: <20220125215238.GP34919@minyard.net>

On Tue, Jan 25, 2022 at 03:52:38PM -0600, Corey Minyard wrote:
> On Tue, Jan 25, 2022 at 10:53:38AM -0800, Paul E. McKenney wrote:
> > On Tue, Jan 25, 2022 at 12:27:37PM -0600, Corey Minyard wrote:
> > > We have a customer that had been seeing CSD lock issues on a CentOS 7
> > > kernel (unfortunately).  I couldn't find anything or any kernel changes
> > > that might fix it, so I was considering it was the CSD locking issue you
> > > have been chasing for a while.
> > >
> > > So I backported the debug patches.  And of course, they stopped seeing
> > > the issue, at least as much, and they had trouble with the extra CPU
> > > time the debug code took.  But they just reproduced it.
> > > Here are the logs:
> > >
> > > Jan 23 23:39:43 worker0 kernel: [285737.522743] csd: Detected non-responsive CSD lock (#1) on CPU#3, waiting 5000000042 ns for CPU#55 flush_tlb_func+0x0/0xb0(0xffff8e0b3e2afbe8).
> > > Jan 23 23:39:43 worker0 kernel: [285737.522744] csd: CSD lock (#1) unresponsive.
> > > Jan 23 23:39:43 worker0 kernel: [285737.522747] csd: cnt(0000000): 0000->0000 queue
> > > Jan 23 23:39:43 worker0 kernel: [285737.522748] csd: cnt(0000001): ffff->0037 idle
> > > Jan 23 23:39:43 worker0 kernel: [285737.522749] csd: cnt(63d8dd8): 0003->0037 ipi
> > > Jan 23 23:39:43 worker0 kernel: [285737.522750] csd: cnt(63d8dd9): 0003->0037 ping
> > > Jan 23 23:39:43 worker0 kernel: [285737.522750] csd: cnt(63d8dda): 0003->ffff pinged
> > > Jan 23 23:39:43 worker0 kernel: [285737.522751] csd: cnt(63d8dea): 0035->0037 pinged
> > > Jan 23 23:39:43 worker0 kernel: [285737.522752] csd: cnt(63d8deb): ffff->0037 gotipi
> > > Jan 23 23:39:43 worker0 kernel: [285737.522752] csd: cnt(63d8dec): ffff->0037 handle
> > > Jan 23 23:39:43 worker0 kernel: [285737.522753] csd: cnt(63d8ded): ffff->0037 dequeue (src CPU 0 == empty)
> > > Jan 23 23:39:43 worker0 kernel: [285737.522754] csd: cnt(63d8dee): ffff->0037 hdlend (src CPU 0 == early)
> > > Jan 23 23:39:43 worker0 kernel: [285737.522754] csd: cnt(63d8e1f): 0003->0037 queue
> > > Jan 23 23:39:43 worker0 kernel: [285737.522755] csd: cnt(63d8e20): 0003->0037 ipi
> > > Jan 23 23:39:43 worker0 kernel: [285737.522756] csd: cnt(63d8e21): 0003->0037 ping
> > > Jan 23 23:39:43 worker0 kernel: [285737.522756] csd: cnt(63d8e22): 0003->0037 queue
> > > Jan 23 23:39:43 worker0 kernel: [285737.522757] csd: cnt(63d8e23): 0003->0037 noipi
> > > Jan 23 23:39:43 worker0 kernel: [285737.522757] csd: cnt now: 63fe4cd
> > > Jan 23 23:39:43 worker0 kernel: [285737.522758] Task dump for CPU 55:
> > > Jan 23 23:39:43 worker0 kernel: [285737.522761] kubelet  R  running task  0  277695  1  0x00080000
> > > Jan 23 23:39:43 worker0 kernel: [285737.522761] Call Trace:
> > > Jan 23 23:39:43 worker0 kernel: [285737.522769]  [] ? __schedule+0x46a/0x990
> > > Jan 23 23:39:43 worker0 kernel: [285737.522774]  [] ? context_tracking_user_enter+0x13/0x20
> > > Jan 23 23:39:43 worker0 kernel: [285737.522776]  [] ? schedule_user+0x45/0x50
> > > Jan 23 23:39:43 worker0 kernel: [285737.522779]  [] ? retint_careful+0x16/0x34
> >
> > Long-running interrupt handler, maybe?  Or am I misinterpreting this
> > stack trace?
>
> Well, maybe.  But why would the condition clear up immediately when
> another IPI is sent?  If I understand correctly this box is doing mostly
> polled I/O, there shouldn't be many interrupts.  But possibly a driver
> bug, I hadn't thought of that.

All fair points!  Me, I suspected lost interrupts for the longest time,
but never found any hard evidence of such a thing.  But maybe that is
what you are seeing.

> I could have the system panic when this happens and maybe I can find
> some breadcrumbs lying around in the core dump.  The trouble is it's
> been two months now to reproduce this.

And being as it is a customer machine, you probably do not want to take
steps to make it happen more often...

But would it be possible to trace interrupt entry and exit, and then
dump the offending CPU's trace buffer when things recover?
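
A sketch of that idea (not something from the thread or from the actual CSD
debug patches): assuming the irq trace events were enabled ahead of time, for
example by writing "irq:*" to the tracefs set_event file, a hypothetical hook
in the CSD-lock debug code could freeze and dump the ring buffers the moment
the stuck lock is seen to release.  tracing_off(), ftrace_dump() and
trace_printk() are existing kernel interfaces; the function below and its
call site are made up for illustration.

#include <linux/ftrace.h>
#include <linux/kernel.h>

/*
 * Hypothetical hook: call this from the CSD-lock debug code at the point
 * where the previously stuck lock is observed to be released.  Freezing
 * the ring buffers first keeps the interrupt entry/exit events from the
 * stall window from being overwritten before they can be read.
 */
static void csd_lock_dump_trace(unsigned int waiter_cpu, unsigned int target_cpu)
{
	trace_printk("csd: lock released, waiter CPU%u, target CPU%u\n",
		     waiter_cpu, target_cpu);
	tracing_off();          /* stop all ring buffers */
	ftrace_dump(DUMP_ALL);  /* DUMP_ORIG would dump only the waiter's own CPU */
}

On a 100+ CPU box DUMP_ALL produces a lot of console output, so shrinking the
per-CPU buffers (tracefs buffer_size_kb) beforehand is probably a good idea.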
> > > Jan 23 23:39:43 worker0 kernel: [285737.522780] csd: Re-sending CSD lock (#1) IPI from CPU#03 to CPU#55
> > > Jan 23 23:39:43 worker0 kernel: [285737.522788] CPU: 3 PID: 54671 Comm: runc:[2:INIT] Kdump: loaded Tainted: G OE ------------ T 3.10.0-1062.12.1.rt56.1042.mvista.test.14.el7.x86_64 #1
> > > Jan 23 23:39:43 worker0 kernel: [285737.522789] Hardware name: Dell Inc. PowerEdge R740/0YWR7D, BIOS 2.9.4 11/06/2020
> > > Jan 23 23:39:43 worker0 kernel: [285737.522789] Call Trace:
> > > Jan 23 23:39:43 worker0 kernel: [285737.522793]  [] dump_stack+0x19/0x1b
> > > Jan 23 23:39:43 worker0 kernel: [285737.522798]  [] __csd_lock_wait+0x1a8/0x2a0
> > > Jan 23 23:39:43 worker0 kernel: [285737.522800]  [] ? leave_mm+0x120/0x120
> > > Jan 23 23:39:43 worker0 kernel: [285737.522802]  [] smp_call_function_single+0xc4/0x1b0
> > > Jan 23 23:39:43 worker0 kernel: [285737.522804]  [] ? leave_mm+0x120/0x120
> > > Jan 23 23:39:43 worker0 kernel: [285737.522809]  [] ? page_counter_uncharge+0x3b/0x70
> > > Jan 23 23:39:43 worker0 kernel: [285737.522811]  [] smp_call_function_many+0x344/0x380
> > > Jan 23 23:39:43 worker0 kernel: [285737.522813]  [] ? leave_mm+0x120/0x120
> > > Jan 23 23:39:43 worker0 kernel: [285737.522816]  [] native_flush_tlb_others+0xb8/0xc0
> > > Jan 23 23:39:43 worker0 kernel: [285737.522818]  [] flush_tlb_page+0x65/0xf0
> > > Jan 23 23:39:43 worker0 kernel: [285737.522821]  [] ptep_clear_flush+0x68/0xa0
> > > Jan 23 23:39:43 worker0 kernel: [285737.522825]  [] wp_page_copy.isra.83+0x3d6/0x650
> > > Jan 23 23:39:43 worker0 kernel: [285737.522828]  [] do_wp_page+0xb4/0x710
> > > Jan 23 23:39:43 worker0 kernel: [285737.522832]  [] handle_mm_fault+0x884/0x1340
> > > Jan 23 23:39:43 worker0 kernel: [285737.522835]  [] ? update_cfs_shares+0xa9/0xf0
> > > Jan 23 23:39:43 worker0 kernel: [285737.522839]  [] __do_page_fault+0x213/0x5a0
> > > Jan 23 23:39:43 worker0 kernel: [285737.522841]  [] do_page_fault+0x35/0x90
> > > Jan 23 23:39:43 worker0 kernel: [285737.522842]  [] page_fault+0x28/0x30
> > > Jan 23 23:39:43 worker0 kernel: [285737.522845] csd: CSD lock (#1) got unstuck on CPU#03, CPU#55 released the lock.
> > >
> > > Hopefully this is the issue you are chasing and not something else.
> > > I've been studying them to see what they mean, but I thought you might
> > > be interested to get them asap.
> >
> > Well, there have been several bugs causing these CSD lock issues, so what
> > is one more?  ;-)
> >
> > More seriously, have you tried Frederic's patch?  This fixes the issue
> > described here: https://paulmck.livejournal.com/62071.html
> > If your stack above was due to an interrupt storm rather than a
> > long-running interrupt, this might well be the fix.
>
> The code is so different that it will take me some time.  I'll have to
> understand the issue more deeply to trace that down.  But thanks for the
> reference.

Understood!

> > Oh, and Jürgen Groß reportedly found an issue about a year ago that
> > could potentially be related, but I see that he is already on CC.
> >
> > And, unfortunately, even more seriously, this CSD-lock diagnostic code
> > will very likely continue to find problems, just as the infamous RCU
> > CPU stall warnings and hard/soft lockup warnings do.
>
> Yeah, I've already fixed a couple of bugs related to CSD lockups.  So I
> suppose I've handled all the low-hanging fruit.  With a 104 core box it
> makes uncovering these sorts of issues more likely, I suppose.

Especially if that system has more than one socket.
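
On the storm-versus-long-running-handler question above, a crude way to tell
the two apart from userspace (my own sketch, not something discussed in the
thread) is to diff the per-CPU columns of /proc/interrupts across the stall
window: a CPU whose counters jump by hundreds of thousands between samples is
being stormed, while one whose counters barely move points more toward a
single long-running handler or a lost IPI.  The parsing below is deliberately
rough; the single-column ERR/MIS lines get lumped into CPU0 and the per-IRQ
breakdown is ignored.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_CPUS 256

/* Sum every per-CPU column of /proc/interrupts; returns the CPU count. */
static int sample(unsigned long long *sum)
{
	FILE *f = fopen("/proc/interrupts", "r");
	char line[16384];
	int ncpu = 0;

	if (!f)
		return -1;
	memset(sum, 0, MAX_CPUS * sizeof(*sum));
	if (fgets(line, sizeof(line), f)) {	/* header: one CPUn token per column */
		char *p = line;

		while ((p = strstr(p, "CPU")) != NULL) {
			ncpu++;
			p += 3;
		}
	}
	while (fgets(line, sizeof(line), f)) {
		char *p = strchr(line, ':');
		int cpu;

		if (!p)
			continue;
		p++;
		for (cpu = 0; cpu < ncpu && cpu < MAX_CPUS; cpu++) {
			char *end;
			unsigned long long v = strtoull(p, &end, 10);

			if (end == p)
				break;		/* reached the textual description */
			sum[cpu] += v;
			p = end;
		}
	}
	fclose(f);
	return ncpu;
}

int main(void)
{
	unsigned long long before[MAX_CPUS], after[MAX_CPUS];
	int ncpu = sample(before), cpu;

	if (ncpu <= 0)
		return 1;
	sleep(1);
	if (sample(after) <= 0)
		return 1;
	for (cpu = 0; cpu < ncpu && cpu < MAX_CPUS; cpu++)
		printf("CPU%-3d +%llu interrupts/s\n", cpu, after[cpu] - before[cpu]);
	return 0;
}

Watching a few of these one-second samples while the csd warning is firing
should make an interrupt storm on CPU 55 hard to miss.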
						Thanx, Paul

> -corey
>
> > 						Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > commit 53e87e3cdc155f20c3417b689df8d2ac88d79576
> > Author: Frederic Weisbecker
> > Date:   Tue Oct 26 16:10:54 2021 +0200
> >
> >     timers/nohz: Last resort update jiffies on nohz_full IRQ entry
> >
> >     When at least one CPU runs in nohz_full mode, a dedicated timekeeper CPU
> >     is guaranteed to stay online and to never stop its tick.
> >
> >     Meanwhile, in some rare cases, the dedicated timekeeper may be running
> >     with interrupts disabled for a while, such as in stop_machine.
> >
> >     If jiffies stop being updated, a nohz_full CPU may end up endlessly
> >     programming the next tick in the past, taking the last jiffies update
> >     monotonic timestamp as a stale base, resulting in a tick storm.
> >
> >     Here is a scenario where it matters:
> >
> >     0) CPU 0 is the timekeeper and CPU 1 a nohz_full CPU.
> >
> >     1) A stop machine callback is queued to execute somewhere.
> >
> >     2) CPU 0 reaches MULTI_STOP_DISABLE_IRQ while CPU 1 is still in
> >        MULTI_STOP_PREPARE. Hence CPU 0 can't do its timekeeping duty. CPU 1
> >        can still take IRQs.
> >
> >     3) CPU 1 receives an IRQ which queues a timer callback one jiffy forward.
> >
> >     4) On IRQ exit, CPU 1 schedules the tick one jiffy forward, taking
> >        last_jiffies_update as a base. But last_jiffies_update hasn't been
> >        updated for 2 jiffies since the timekeeper has interrupts disabled.
> >
> >     5) clockevents_program_event(), which relies on ktime_get(), observes
> >        that the expiration is in the past and therefore programs the min
> >        delta event on the clock.
> >
> >     6) The tick fires immediately, goto 3)
> >
> >     7) Tick storm: the nohz_full CPU is drowned and takes ages to reach
> >        MULTI_STOP_DISABLE_IRQ, which is the only way out of this situation.
> >
> >     Solve this by unconditionally updating jiffies if the value is stale
> >     on nohz_full IRQ entry. IRQs and other disturbances are expected to be
> >     rare enough on nohz_full for the unconditional call to ktime_get() not
> >     to actually matter.
> >
> >     Reported-by: Paul E. McKenney
> >     Signed-off-by: Frederic Weisbecker
> >     Signed-off-by: Thomas Gleixner
> >     Tested-by: Paul E. McKenney
> >     Link: https://lore.kernel.org/r/20211026141055.57358-2-frederic@kernel.org
> >
> > diff --git a/kernel/softirq.c b/kernel/softirq.c
> > index 322b65d456767..41f470929e991 100644
> > --- a/kernel/softirq.c
> > +++ b/kernel/softirq.c
> > @@ -595,7 +595,8 @@ void irq_enter_rcu(void)
> >  {
> >  	__irq_enter_raw();
> >  
> > -	if (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET))
> > +	if (tick_nohz_full_cpu(smp_processor_id()) ||
> > +	    (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))
> >  		tick_irq_enter();
> >  
> >  	account_hardirq_enter(current);
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 6bffe5af8cb11..17a283ce2b20f 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -1375,6 +1375,13 @@ static inline void tick_nohz_irq_enter(void)
> >  	now = ktime_get();
> >  	if (ts->idle_active)
> >  		tick_nohz_stop_idle(ts, now);
> > +	/*
> > +	 * If all CPUs are idle. We may need to update a stale jiffies value.
> > +	 * Note nohz_full is a special case: a timekeeper is guaranteed to stay
> > +	 * alive but it might be busy looping with interrupts disabled in some
> > +	 * rare case (typically stop machine). So we must make sure we have a
> > +	 * last resort.
> > +	 */
> >  	if (ts->tick_stopped)
> >  		tick_nohz_update_jiffies(now);
> >  }
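
One last aside, not from the thread: Frederic's fix only comes into play if
the affected CPUs really are running nohz_full, so that is worth confirming
before backporting anything.  A trivial check (hypothetical sketch; it would
miss kernels built with CONFIG_NO_HZ_FULL_ALL, where every non-boot CPU is
adaptive-tick without any command-line option) is to look for nohz_full= on
the booted command line:

#include <stdio.h>
#include <string.h>

/* Report whether the running kernel was booted with a nohz_full= CPU list. */
int main(void)
{
	char cmdline[4096];
	FILE *f = fopen("/proc/cmdline", "r");

	if (!f) {
		perror("/proc/cmdline");
		return 1;
	}
	if (!fgets(cmdline, sizeof(cmdline), f))
		cmdline[0] = '\0';
	fclose(f);
	printf("nohz_full= %s present on the kernel command line\n",
	       strstr(cmdline, "nohz_full=") ? "is" : "is not");
	return 0;
}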