Date: Thu, 28 Jun 2018 16:19:07 -0700
From: Greg Thelen
To: Michal Hocko, Andrew Morton
Cc: Johannes Weiner, Shakeel Butt, linux-mm@kvack.org, LKML, Michal Hocko
Subject: Re: [PATCH] memcg, oom: move out_of_memory back to the charge path
In-Reply-To: <20180628151101.25307-1-mhocko@kernel.org>
References: <20180628151101.25307-1-mhocko@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Michal Hocko wrote:
> From: Michal Hocko
>
> 3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM")
> has changed the ENOMEM semantic of memcg charges. Rather than invoking
> the oom killer from the charging context it delays the oom killer to
> the page fault path (pagefault_out_of_memory). This in turn means that
> many users (e.g. slab or g-u-p) will get ENOMEM when the corresponding
> memcg hits the hard limit and the memcg is OOM.
> This behavior is
> inconsistent with the !memcg case where the oom killer is invoked from
> the allocation context and the allocator keeps retrying until it
> succeeds.
>
> The difference in the behavior is user visible. mmap(MAP_POPULATE)
> might result in not fully populated ranges while the mmap return code
> doesn't tell that to the userspace. Random syscalls might fail with
> ENOMEM etc.
>
> The primary motivation of the different memcg oom semantic was the
> deadlock avoidance. Things have changed since then, though. We have an
> async oom teardown by the oom reaper now and so we do not have to rely
> on the victim to tear down its memory anymore. Therefore we can return
> to the original semantic as long as the memcg oom killer is not handed
> over to user space.
>
> There is still one thing to be careful about here though. If the oom
> killer is not able to make any forward progress - e.g. because there
> is no eligible task to kill - then we have to bail out of the charge
> path to prevent the same class of deadlocks. We have basically two
> options here. Either we fail the charge with ENOMEM or force the
> charge and allow overcharge. The first option has been considered more
> harmful than useful because rare inconsistencies in the ENOMEM
> behavior are hard to test for and error prone. Basically the same
> reason why the page allocator doesn't fail allocations under such
> conditions. The latter might allow runaways but those should be really
> unlikely unless somebody misconfigures the system, e.g. by allowing
> tasks to migrate away from the memcg to a different unlimited memcg
> with move_charge_at_immigrate disabled.
>
> Changes since rfc v1
> - s@memcg_may_oom@in_user_fault@ suggested by Johannes. It is much
>   more clear what the purpose of the flag is now
> - s@mem_cgroup_oom_enable@mem_cgroup_enter_user_fault@g
>   s@mem_cgroup_oom_disable@mem_cgroup_exit_user_fault@g as per Johannes
> - make oom_kill_disable an exceptional case because it should be rare
>   and the normal oom handling the core of the function - per Johannes
>
> Signed-off-by: Michal Hocko

Acked-by: Greg Thelen

Thanks! One comment below.

> ---
>
> Hi,
> I've posted this as an RFC previously [1]. There was no fundamental
> disagreement so I've integrated all the suggested changes and tested
> it. mmap(MAP_POPULATE) hits the oom killer again rather than silently
> failing to populate the mapping on hard limit excess. On the other
> hand, g-u-p and other charge paths keep the ENOMEM semantic when the
> memcg oom killer is disabled. All the forward progress guarantee
> relies on the oom reaper.
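[Editorial aside: the heart of the patch is the new decision logic in
mem_cgroup_oom(). As a reading aid, here is a self-contained C model of
that state machine. The enum values and the branch structure mirror the
diff below; model_mem_cgroup_oom, struct memcg_model, and its fields are
illustrative stand-ins, not kernel APIs, and PAGE_ALLOC_COSTLY_ORDER is
passed in as a plain parameter.]

#include <stdbool.h>
#include <stdio.h>

/* Mirrors the patch's enum oom_status. */
enum oom_status { OOM_SUCCESS, OOM_FAILED, OOM_ASYNC, OOM_SKIPPED };

/* Stand-ins for the memcg/task state consulted by the real function. */
struct memcg_model {
	bool oom_kill_disable;	/* cgroup1 oom_control set by userspace */
	bool in_user_fault;	/* task is inside handle_mm_fault() */
	bool victim_found;	/* mem_cgroup_out_of_memory() found a victim */
};

static enum oom_status model_mem_cgroup_oom(const struct memcg_model *m,
					    int order, int costly_order)
{
	if (order > costly_order)
		return OOM_SKIPPED;		/* costly charges just fail */

	if (m->oom_kill_disable) {
		if (!m->in_user_fault)
			return OOM_SKIPPED;	/* kernel context: ENOMEM */
		return OOM_ASYNC;		/* defer to end of the fault */
	}

	return m->victim_found ? OOM_SUCCESS	/* killed something: retry */
			       : OOM_FAILED;	/* no victim: force charge */
}

int main(void)
{
	struct memcg_model m = { false, true, true };

	printf("oom status: %d\n", model_mem_cgroup_oom(&m, 0, 3));
	return 0;
}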
>
> Unless there are objections, I think this is ready to go to mmotm and
> ready for the next merge window.
>
> [1] http://lkml.kernel.org/r/20180620103736.13880-1-mhocko@kernel.org
>
>  include/linux/memcontrol.h | 16 ++++----
>  include/linux/sched.h      |  2 +-
>  mm/memcontrol.c            | 75 ++++++++++++++++++++++++++++++--------
>  mm/memory.c                |  4 +-
>  4 files changed, 71 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6c6fb116e925..5a69bb4026f6 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -494,16 +494,16 @@ unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg);
>  void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  				struct task_struct *p);
>  
> -static inline void mem_cgroup_oom_enable(void)
> +static inline void mem_cgroup_enter_user_fault(void)
>  {
> -	WARN_ON(current->memcg_may_oom);
> -	current->memcg_may_oom = 1;
> +	WARN_ON(current->in_user_fault);
> +	current->in_user_fault = 1;
>  }
>  
> -static inline void mem_cgroup_oom_disable(void)
> +static inline void mem_cgroup_exit_user_fault(void)
>  {
> -	WARN_ON(!current->memcg_may_oom);
> -	current->memcg_may_oom = 0;
> +	WARN_ON(!current->in_user_fault);
> +	current->in_user_fault = 0;
>  }
>  
>  static inline bool task_in_memcg_oom(struct task_struct *p)
> @@ -924,11 +924,11 @@ static inline void mem_cgroup_handle_over_high(void)
>  {
>  }
>  
> -static inline void mem_cgroup_oom_enable(void)
> +static inline void mem_cgroup_enter_user_fault(void)
>  {
>  }
>  
> -static inline void mem_cgroup_oom_disable(void)
> +static inline void mem_cgroup_exit_user_fault(void)
>  {
>  }
>  
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 87bf02d93a27..34cc95b751cd 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -722,7 +722,7 @@ struct task_struct {
>  	unsigned			restore_sigmask:1;
>  #endif
>  #ifdef CONFIG_MEMCG
> -	unsigned			memcg_may_oom:1;
> +	unsigned			in_user_fault:1;
>  #ifndef CONFIG_SLOB
>  	unsigned			memcg_kmem_skip_account:1;
>  #endif
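[Editorial aside: the renamed helpers are pure bracketing around the
fault handler; the charge path reads the bit to decide whether deferred,
end-of-fault OOM handling is possible (OOM_ASYNC above). A standalone
sketch of the pattern follows; enter_user_fault, exit_user_fault,
handle_fault and fault_body are illustrative names, not the kernel
functions. The real call sites are in the mm/memory.c hunk at the end
of the patch.]

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Models task_struct's in_user_fault bit for a single thread. */
static _Thread_local bool in_user_fault;

static void enter_user_fault(void)
{
	assert(!in_user_fault);		/* models the WARN_ON */
	in_user_fault = true;
}

static void exit_user_fault(void)
{
	assert(in_user_fault);
	in_user_fault = false;
}

/* Faults coming from user space are bracketed; kernel faults are not,
 * so charges on their behalf can never pick OOM_ASYNC. */
static int handle_fault(bool user_fault, int (*body)(void))
{
	int ret;

	if (user_fault)
		enter_user_fault();
	ret = body();
	if (user_fault)
		exit_user_fault();
	return ret;
}

static int fault_body(void)
{
	return in_user_fault ? 0 : -1;
}

int main(void)
{
	printf("user fault -> %d\n", handle_fault(true, fault_body));
	return 0;
}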
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e6f0d5ef320a..cff6c75137c1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1483,28 +1483,53 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
>  	__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
>  }
>  
> -static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> +enum oom_status {
> +	OOM_SUCCESS,
> +	OOM_FAILED,
> +	OOM_ASYNC,
> +	OOM_SKIPPED
> +};
> +
> +static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  {
> -	if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER)
> -		return;
> +	if (order > PAGE_ALLOC_COSTLY_ORDER)
> +		return OOM_SKIPPED;
> +
>  	/*
>  	 * We are in the middle of the charge context here, so we
>  	 * don't want to block when potentially sitting on a callstack
>  	 * that holds all kinds of filesystem and mm locks.
>  	 *
> -	 * Also, the caller may handle a failed allocation gracefully
> -	 * (like optional page cache readahead) and so an OOM killer
> -	 * invocation might not even be necessary.
> +	 * cgroup1 allows disabling the OOM killer and waiting for outside
> +	 * handling until the charge can succeed; remember the context and put
> +	 * the task to sleep at the end of the page fault when all locks are
> +	 * released.
> +	 *
> +	 * On the other hand, in-kernel OOM killer allows for an async victim
> +	 * memory reclaim (oom_reaper) and that means that we are not solely
> +	 * relying on the oom victim to make a forward progress and we can
> +	 * invoke the oom killer here.
>  	 *
> -	 * That's why we don't do anything here except remember the
> -	 * OOM context and then deal with it at the end of the page
> -	 * fault when the stack is unwound, the locks are released,
> -	 * and when we know whether the fault was overall successful.
> +	 * Please note that mem_cgroup_out_of_memory might fail to find a
> +	 * victim and then we have to bail out from the charge path.
>  	 */
> -	css_get(&memcg->css);
> -	current->memcg_in_oom = memcg;
> -	current->memcg_oom_gfp_mask = mask;
> -	current->memcg_oom_order = order;
> +	if (memcg->oom_kill_disable) {
> +		if (!current->in_user_fault)
> +			return OOM_SKIPPED;
> +		css_get(&memcg->css);
> +		current->memcg_in_oom = memcg;
> +		current->memcg_oom_gfp_mask = mask;
> +		current->memcg_oom_order = order;
> +
> +		return OOM_ASYNC;
> +	}
> +
> +	if (mem_cgroup_out_of_memory(memcg, mask, order))
> +		return OOM_SUCCESS;
> +
> +	WARN(1, "Memory cgroup charge failed because of no reclaimable memory! "
> +		"This looks like a misconfiguration or a kernel bug.");

I'm not sure the warning should be here, or be so strongly worded. It
seems like the current task could have been oom reaped with
MMF_OOM_SKIP, in which case mem_cgroup_out_of_memory() will return
false. So there's nothing alarming in that case.

> +	return OOM_FAILED;
>  }
>  
>  /**
> @@ -1899,6 +1924,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	unsigned long nr_reclaimed;
>  	bool may_swap = true;
>  	bool drained = false;
> +	bool oomed = false;
> +	enum oom_status oom_status;
>  
>  	if (mem_cgroup_is_root(memcg))
>  		return 0;
> @@ -1986,6 +2013,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (nr_retries--)
>  		goto retry;
>  
> +	if (gfp_mask & __GFP_RETRY_MAYFAIL && oomed)
> +		goto nomem;
> +
>  	if (gfp_mask & __GFP_NOFAIL)
>  		goto force;
>  
> @@ -1994,8 +2024,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  
>  	memcg_memory_event(mem_over_limit, MEMCG_OOM);
>  
> -	mem_cgroup_oom(mem_over_limit, gfp_mask,
> +	/*
> +	 * keep retrying as long as the memcg oom killer is able to make
> +	 * a forward progress or bypass the charge if the oom killer
> +	 * couldn't make any progress.
> +	 */
> +	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
>  		       get_order(nr_pages * PAGE_SIZE));
> +	switch (oom_status) {
> +	case OOM_SUCCESS:
> +		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +		oomed = true;
> +		goto retry;
> +	case OOM_FAILED:
> +		goto force;
> +	default:
> +		goto nomem;
> +	}
>  nomem:
>  	if (!(gfp_mask & __GFP_NOFAIL))
>  		return -ENOMEM;
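[Editorial aside: a condensed, standalone C model of the retry policy
try_charge() ends up with after this hunk. charge_succeeds, invoke_oom
and model_try_charge are illustrative stubs; RECLAIM_RETRIES stands in
for MEM_CGROUP_RECLAIM_RETRIES (5 in kernels of this era, to the best
of my knowledge), and the __GFP_NOFAIL and fatal-signal paths are
omitted.]

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#define RECLAIM_RETRIES 5	/* stands in for MEM_CGROUP_RECLAIM_RETRIES */

enum oom_status { OOM_SUCCESS, OOM_FAILED, OOM_ASYNC, OOM_SKIPPED };

/* Stubs: pretend the charge keeps failing for a while, and that the
 * oom killer always finds a victim. */
static int attempts_left = 12;
static bool charge_succeeds(void) { return --attempts_left <= 0; }
static enum oom_status invoke_oom(void) { return OOM_SUCCESS; }

static int model_try_charge(bool retry_mayfail)
{
	int nr_retries = RECLAIM_RETRIES;
	bool oomed = false;

	for (;;) {
		if (charge_succeeds())
			return 0;
		if (nr_retries-- > 0)
			continue;		/* reclaim and retry */
		if (retry_mayfail && oomed)
			return -ENOMEM;		/* __GFP_RETRY_MAYFAIL after an oom */
		switch (invoke_oom()) {
		case OOM_SUCCESS:		/* a victim was killed: start over */
			nr_retries = RECLAIM_RETRIES;
			oomed = true;
			continue;
		case OOM_FAILED:		/* no victim: force the charge */
			return 0;
		default:			/* OOM_ASYNC or OOM_SKIPPED */
			return -ENOMEM;
		}
	}
}

int main(void)
{
	printf("charge -> %d\n", model_try_charge(false));
	return 0;
}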
> diff --git a/mm/memory.c b/mm/memory.c
> index 7206a634270b..a4b1f8c24884 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4125,7 +4125,7 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  	 * space. Kernel faults are handled more gracefully.
>  	 */
>  	if (flags & FAULT_FLAG_USER)
> -		mem_cgroup_oom_enable();
> +		mem_cgroup_enter_user_fault();
>  
>  	if (unlikely(is_vm_hugetlb_page(vma)))
>  		ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
> @@ -4133,7 +4133,7 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  		ret = __handle_mm_fault(vma, address, flags);
>  
>  	if (flags & FAULT_FLAG_USER) {
> -		mem_cgroup_oom_disable();
> +		mem_cgroup_exit_user_fault();
>  		/*
>  		 * The task may have entered a memcg OOM situation but
>  		 * if the allocation error was handled gracefully (no
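[Editorial aside: the user-visible difference described in the
changelog can be exercised with a sketch like the one below, run inside
a memcg whose hard limit is well under the mapping size. The 256M size
is an assumption, and the cgroup has to be set up beforehand, e.g. via
cgroup v1's memory.limit_in_bytes. Before the patch, the populate could
silently stop short while mmap() still returned success; with the
patch, the memcg oom killer is invoked during the fault-in.]

#include <stdio.h>
#include <sys/mman.h>

#define LEN (256UL << 20)	/* 256M; assumed to exceed the memcg limit */

int main(void)
{
	void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	puts("mmap(MAP_POPULATE) returned success");
	return 0;
}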