Date: Wed, 3 Oct 2018 11:14:00 +0200
From: Petr Mladek
To: Steven Rostedt
Cc: Daniel Wang, stable@vger.kernel.org, Alexander.Levin@microsoft.com,
    akpm@linux-foundation.org, byungchul.park@lge.com, dave.hansen@intel.com,
    hannes@cmpxchg.org, jack@suse.cz, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, Mathieu Desnoyers, Mel Gorman, mhocko@kernel.org,
    pavel@ucw.cz, penguin-kernel@i-love.sakura.ne.jp, peterz@infradead.org,
    tj@kernel.org, torvalds@linux-foundation.org, vbabka@suse.cz, Cong Wang,
    Peter Feiner
Subject: Re: 4.14 backport request for dbdda842fe96f: "printk: Add console
    owner and waiter logic to load balance console writes"
Message-ID: <20181003091400.rgdjpjeaoinnrysx@pathway.suse.cz>
References: <20180927194601.207765-1-wonderfly@google.com>
    <20181001152324.72a20bea@gandalf.local.home>
    <20181002084225.6z2b74qem3mywukx@pathway.suse.cz>
    <20181002212327.7aab0b79@vmware.local.home>
In-Reply-To: <20181002212327.7aab0b79@vmware.local.home>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue 2018-10-02 21:23:27, Steven Rostedt wrote:
> On Tue, 2 Oct 2018 17:15:17 -0700
> Daniel Wang wrote:
>
> > On Tue, Oct 2, 2018 at 1:42 AM Petr Mladek wrote:
> > >
> > > Well, I still wonder why it helped and why you do not see it with 4.4.
> > > I have a feeling that the console owner switch helped only by chance.
> > > In fact, you might be affected by a race in
> > > printk_safe_flush_on_panic() that was fixed by the commit:
> > >
> > > 554755be08fba31c7 printk: drop in_nmi check from printk_safe_flush_on_panic()
> > >
> > > The above one commit might be enough. Well, there was one more
> > > NMI-related race that was fixed by:
> > >
> > > ba552399954dde1b printk: Split the code for storing a message into the log buffer
> > > a338f84dc196f44b printk: Create helper function to queue deferred console handling
> > > 03fc7f9c99c1e7ae printk/nmi: Prevent deadlock when accessing the main log buffer in NMI
> >
> > All of these commits already exist in 4.14 stable, since 4.14.68. The deadlock
> > still exists even when built from 4.14.73 (latest tag) though. And cherrypicking
> > dbdda842fe96 fixes it.
> >
>
> I don't see the big deal of backporting this. The biggest complaints
> about backports are from fixes that were added to late -rc releases
> where the fixes didn't get much testing. This commit was added in 4.16,
> and hasn't had any issues due to the design. Although a fix has been
> added:
>
> c14376de3a1 ("printk: Wake klogd when passing console_lock owner")

As I said, I am fine with backporting the console_lock owner stuff
into the stable release.

I just wonder (like Sergey) what the real problem is. The console_lock
owner handshake is not fully reliable.
It might be good enough to prevent a softlockup. But we should not
rely on it to prevent a deadlock.

My new theory ;-)

printk_safe_flush() is called in nmi_trigger_cpumask_backtrace().
=> watchdog_timer_fn() is blocked until all backtraces are printed.

Now, the original report complained that the system rebooted before
all backtraces were printed. It means that panic() was called
on another CPU. My guess is that it came from the hardlockup detector.
And the panic() was not able to flush the console because it was
not able to take console_lock.

IMHO, there was no real deadlock. The console_lock owner handshake
just helped to get console_lock in panic() and flush all messages
before reboot => it is a reasonable and acceptable fix.

Just to be sure: Daniel, could you please send a log with
the console_lock owner stuff backported? There we would see
who called panic() and why it rebooted early.

Best Regards,
Petr