Date: Sat, 25 Jul 2020 21:22:29 -0700 (PDT)
From: Hugh Dickins
To: Hugh Dickins
cc: Linus Torvalds, Oleg Nesterov, Michal Hocko, Linux-MM, LKML, Andrew Morton, Tim Chen, Michal Hocko
Subject: Re: [RFC PATCH] mm: silence soft lockups from unlock_page
References: <20200723124749.GA7428@redhat.com> <20200724152424.GC17209@redhat.com> <20200725101445.GB3870@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, 25 Jul 2020, Hugh Dickins wrote:
> On Sat, 25 Jul 2020, Linus Torvalds wrote:
> > On Sat, Jul 25, 2020 at 3:14 AM Oleg Nesterov wrote:
> > >
> > > Heh. I too thought about this. And just in case, your patch looks correct
> > > to me. But I can't really comment this behavioural change. Perhaps it
> > > should come in a separate patch?
> >
> > We could do that. At the same time, I think both parts change how the
> > waitqueue works that it might as well just be one "fix page_bit_wait
> > waitqueue usage".
> >
> > But let's wait to see what Hugh's numbers say.
>
> Oh no, no no: sorry for getting your hopes up there, I won't come up
> with any numbers more significant than "0 out of 10" machines crashed.
> I know it would be *really* useful if I could come up with performance
> comparisons, or steer someone else to do so: but I'm sorry, cannot.
>
> Currently it's actually 1 out of 10 machines crashed, for the same
> driverland issue seen last time, maybe it's a bad machine; and another
> 1 out of the 10 machines went AWOL for unknown reasons, but probably
> something outside the kernel got confused by the stress. No reason
> to suspect your changes at all (but some unanalyzed "failure"s, of
> dubious significance, accumulating like last time).
>
> I'm optimistic: nothing has happened to warn us off your changes.

Less optimistic now, I'm afraid.

The machine I said had (twice) crashed coincidentally in driverland
(some USB completion thing): that machine I set running a comparison
kernel without your changes this morning, while the others still
running with your changes; and it has now passed the point where it
twice crashed before (the most troublesome test), without crashing.

Surprising: maybe still just coincidence, but I must look closer at
the crashes.

The others have now completed, and one other crashed in that
troublesome test, but sadly without yielding any crash info.

I've just set comparison runs going on them all, to judge whether
to take the "failure"s seriously; and I'll look more closely at them.

But hungry and tired now: unlikely to have more to say tonight.

>
> And on Fri, 24 Jul 2020, Linus Torvalds had written:
> > So the loads you are running are known to have sensitivity to this
> > particular area, and are why you've done your patches to the page wait
> > bit code?
>
> Yes. It's a series of nineteen ~hour-long tests, of which about five
> exhibited wake_up_page_bit problems in the past, and one has remained
> intermittently troublesome that way. Intermittently: usually it does
> get through, so getting through yesterday and today won't even tell
> us that your changes fixed it - that we shall learn over time later.
>
> Hugh