From: Ami Fischman
Date: Wed, 18 Mar 2020 08:20:53 -0700
Subject: Re: [patch] mm, oom: make a last minute check to prevent unnecessary memcg oom kills
To: Michal Hocko
Cc: Robert Kolchmeyer, David Rientjes, Andrew Morton, Vlastimil Babka, linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <20200310221938.GF8447@dhcp22.suse.cz> <20200318095710.GG21362@dhcp22.suse.cz>
In-Reply-To: <20200318095710.GG21362@dhcp22.suse.cz>

On Wed, Mar 18, 2020 at 2:57 AM Michal Hocko wrote:
>
> On Tue 17-03-20 12:00:45, Ami Fischman wrote:
> > On Tue, Mar 17, 2020 at 11:26 AM Robert Kolchmeyer
> > wrote:
> > >
> > > On Tue, Mar 10, 2020 at 3:54 PM David Rientjes wrote:
> > > >
> > > > Robert, could you elaborate on the user-visible effects of this issue that
> > > > caused it to initially get reported?
> > >
> > > Ami (now cc'ed) knows more, but here is my understanding.
> >
> > Robert's description of the mechanics we observed is accurate.
> >
> > We discovered this regression in the oom-killer's behavior when
> > attempting to upgrade our system. The fraction of the system that
> > went unhealthy due to this issue was approximately equal to the
> > _sum_ of all other causes of unhealth, which are many and varied,
> > but each of which contribute only a small amount of
> > unhealth. This issue forced a rollback to the previous kernel
> > where we ~never see this behavior, returning our unhealth levels
> > to the previous background levels.
>
> Could you be more specific on the good vs. bad kernel versions? Because
> I do not remember any oom changes that would affect the
> time-to-check-time-to-kill race. The timing might be slightly different
> in each kernel version of course.

The original upgrade attempt included a large window of kernel
versions: 4.14.137 to 4.19.91. In attempting to narrow down the failure
we found that in tests of 10 runs we went from 0/10 failures to 1/10
failure with the update from
https://chromium.googlesource.com/chromiumos/third_party/kernel/+/74fab24be8994bb5bb8d1aa8828f50e16bb38346
(based on 4.19.60) to
https://chromium.googlesource.com/chromiumos/third_party/kernel/+/6e0fef1b46bb91c196be56365d9af72e52bb4675
(also based on 4.19.60) and then we went from 1/10 failures to 9/10
failures with the upgrade to
https://chromium.googlesource.com/chromiumos/third_party/kernel/+/a33dffa8e5c47b877e4daece938a81e3cc810b90
(which I believe is based on 4.19.72).

(this was all before we had the minimal repro yielding Robert's
61/100->0/100 stat in his previous email)