Date: Wed, 20 Jun 2018 11:18:12 -0400
From: Johannes Weiner
To: Michal Hocko
Cc: linux-mm@kvack.org, Greg Thelen, Shakeel Butt, Andrew Morton, LKML,
	Michal Hocko
Subject: Re: [RFC PATCH] memcg, oom: move out_of_memory back to the charge path
Message-ID: <20180620151812.GA2441@cmpxchg.org>
References: <20180620103736.13880-1-mhocko@kernel.org>
In-Reply-To: <20180620103736.13880-1-mhocko@kernel.org>

On Wed, Jun 20, 2018 at 12:37:36PM +0200, Michal Hocko wrote:
> From: Michal Hocko
> 
> 3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM")
> has changed the ENOMEM semantics of memcg charges. Rather than invoking
> the oom killer from the charging context, it delays the oom killer to the
> page fault path (pagefault_out_of_memory). This in turn means that many
> users (e.g. slab or g-u-p) will get ENOMEM when the corresponding memcg
> hits the hard limit and the memcg is OOM. This behavior is inconsistent
> with the !memcg case, where the oom killer is invoked from the
> allocation context and the allocator keeps retrying until it succeeds.
> 
> The difference in behavior is user visible. mmap(MAP_POPULATE) might
> result in not fully populated ranges while the mmap return code doesn't
> tell that to the userspace. Random syscalls might fail with ENOMEM etc.
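
(Just to make the user-visible part concrete -- my illustration, not
part of the patch. The 64MB is an arbitrary stand-in for "bigger than
the memcg hard limit".)

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 64UL << 20; /* assumed to exceed the memcg limit */
            char *p;

            /*
             * With the delayed-oom semantics, a charge failure during
             * the populate phase is swallowed: mmap() still returns a
             * valid mapping, just with parts of the range never
             * faulted in.
             */
            p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            /* Touching the range only now triggers the memcg OOM path. */
            memset(p, 0, len);
            return 0;
    }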
> 
> The primary motivation for the different memcg oom semantics was
> deadlock avoidance. Things have changed since then, though. We have an
> async oom teardown by the oom reaper now and so we do not have to rely
> on the victim to tear down its memory anymore. Therefore we can return
> to the original semantics as long as the memcg oom killer is not handed
> over to userspace.
> 
> There is still one thing to be careful about here, though. If the oom
> killer is not able to make any forward progress - e.g. because there is
> no eligible task to kill - then we have to bail out of the charge path
> to prevent the same class of deadlocks. We have basically two options
> here. Either we fail the charge with ENOMEM or force the charge and
> allow overcharge. The first option has been considered more harmful
> than useful because rare inconsistencies in the ENOMEM behavior are
> hard to test for and error prone. This is basically the same reason why
> the page allocator doesn't fail allocations under such conditions. The
> latter might allow runaways, but those should be really unlikely unless
> somebody misconfigures the system, e.g. by allowing tasks to migrate
> away from the memcg to a different unlimited memcg with
> move_charge_at_immigrate disabled.

This is more straightforward than I thought it would be. I have no
objections to this going forward, just a couple of minor notes.

> @@ -1483,28 +1483,54 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
> 	__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
> }
> 
> -static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> +enum oom_status {
> +	OOM_SUCCESS,
> +	OOM_FAILED,
> +	OOM_ASYNC,
> +	OOM_SKIPPED

Either SUCCESS & FAILURE, or SUCCEEDED & FAILED ;)

We're not distinguishing ASYNC and SKIPPED anywhere below, but I cannot
think of a good name to communicate them both without this function
making assumptions about the charge function's behavior. So it's a bit
weird, but probably the best way to go.

> +static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> {
> -	if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER)
> -		return;
> +	if (order > PAGE_ALLOC_COSTLY_ORDER)
> +		return OOM_SKIPPED;
> 	/*
> 	 * We are in the middle of the charge context here, so we
> 	 * don't want to block when potentially sitting on a callstack
> 	 * that holds all kinds of filesystem and mm locks.
> 	 *
> -	 * Also, the caller may handle a failed allocation gracefully
> -	 * (like optional page cache readahead) and so an OOM killer
> -	 * invocation might not even be necessary.
> +	 * cgroup1 allows disabling the OOM killer and waiting for outside
> +	 * handling until the charge can succeed; remember the context and
> +	 * put the task to sleep at the end of the page fault when all
> +	 * locks are released.
> +	 *
> +	 * On the other hand, the in-kernel OOM killer allows for an async
> +	 * victim memory reclaim (oom_reaper), which means that we are not
> +	 * solely relying on the oom victim to make forward progress and
> +	 * we can invoke the oom killer here.
> 	 *
> -	 * That's why we don't do anything here except remember the
> -	 * OOM context and then deal with it at the end of the page
> -	 * fault when the stack is unwound, the locks are released,
> -	 * and when we know whether the fault was overall successful.
> +	 * Please note that mem_cgroup_out_of_memory might fail to find a
> +	 * victim and then we have to rely on mem_cgroup_oom_synchronize,
> +	 * otherwise we would fall back to the global oom killer in
> +	 * pagefault_out_of_memory.
> 	 */
> +	if (!memcg->oom_kill_disable) {
> +		if (mem_cgroup_out_of_memory(memcg, mask, order))
> +			return OOM_SUCCESS;
> +
> +		WARN(!current->memcg_may_oom,
> +		     "Memory cgroup charge failed because of no reclaimable memory! "
> +		     "This looks like a misconfiguration or a kernel bug.");
> +		return OOM_FAILED;
> +	}
> +
> +	if (!current->memcg_may_oom)
> +		return OOM_SKIPPED;

memcg_may_oom was introduced to distinguish between userspace faults
that can OOM and contexts that return -ENOMEM. Now we're using it
slightly differently and it's confusing.

1) Why warn for kernel allocations, but not userspace ones? This
   should have a comment at least.

2) We invoke the OOM killer when !memcg_may_oom. We want to OOM kill
   in either case, but only set up the mem_cgroup_oom_synchronize()
   for userspace faults. So the code makes sense, but a better name
   would be in order -- current->in_user_fault?

> 	css_get(&memcg->css);
> 	current->memcg_in_oom = memcg;
> 	current->memcg_oom_gfp_mask = mask;
> 	current->memcg_oom_order = order;
> +
> +	return OOM_ASYNC;

In terms of code flow, it would be much clearer to handle the
memcg->oom_kill_disable case first, as a special case with early
return, and make the OOM invocation the main code of this function,
given its name.
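
Roughly something like this (just a sketch of the structure I mean,
reusing the proposed in_user_fault name from above; semantics are meant
to be unchanged from your version):

    static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg,
                                          gfp_t mask, int order)
    {
            if (order > PAGE_ALLOC_COSTLY_ORDER)
                    return OOM_SKIPPED;

            /*
             * Special case first: cgroup1 with the OOM killer disabled.
             * Userspace handles the OOM situation, so park userspace
             * faults in mem_cgroup_oom_synchronize() at the end of the
             * page fault, when all locks are released.
             */
            if (memcg->oom_kill_disable) {
                    if (!current->in_user_fault)
                            return OOM_SKIPPED;
                    css_get(&memcg->css);
                    current->memcg_in_oom = memcg;
                    current->memcg_oom_gfp_mask = mask;
                    current->memcg_oom_order = order;
                    return OOM_ASYNC;
            }

            /*
             * Main path: invoke the OOM killer right here in the charge
             * context; the oom reaper guarantees forward progress even
             * if the victim cannot exit right away.
             */
            if (mem_cgroup_out_of_memory(memcg, mask, order))
                    return OOM_SUCCESS;

            WARN(!current->in_user_fault,
                 "Memory cgroup charge failed because of no reclaimable memory! "
                 "This looks like a misconfiguration or a kernel bug.");
            return OOM_FAILED;
    }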
" > + "This looks like a misconfiguration or a kernel bug."); > + return OOM_FAILED; > + } > + > + if (!current->memcg_may_oom) > + return OOM_SKIPPED; memcg_may_oom was introduced to distinguish between userspace faults that can OOM and contexts that return -ENOMEM. Now we're using it slightly differently and it's confusing. 1) Why warn for kernel allocations, but not userspace ones? This should have a comment at least. 2) We invoke the OOM killer when !memcg_may_oom. We want to OOM kill in either case, but only set up the mem_cgroup_oom_synchronize() for userspace faults. So the code makes sense, but a better name would be in order -- current->in_user_fault? > css_get(&memcg->css); > current->memcg_in_oom = memcg; > current->memcg_oom_gfp_mask = mask; > current->memcg_oom_order = order; > + > + return OOM_ASYNC; In terms of code flow, it would be much clearer to handle the memcg->oom_kill_disable case first, as a special case with early return, and make the OOM invocation the main code of this function, given its name. > @@ -1994,8 +2024,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, > > memcg_memory_event(mem_over_limit, MEMCG_OOM); > > - mem_cgroup_oom(mem_over_limit, gfp_mask, > + /* > + * keep retrying as long as the memcg oom killer is able to make > + * a forward progress or bypass the charge if the oom killer > + * couldn't make any progress. > + */ > + oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask, > get_order(nr_pages * PAGE_SIZE)); > + switch (oom_status) { > + case OOM_SUCCESS: > + nr_retries = MEM_CGROUP_RECLAIM_RETRIES; > + oomed = true; > + goto retry; > + case OOM_FAILED: > + goto force; > + default: > + goto nomem; > + } > nomem: > if (!(gfp_mask & __GFP_NOFAIL)) > return -ENOMEM; > -- > 2.17.1 >