Date: Tue, 26 Jun 2018 10:03:45 -0700
From: "Paul E. McKenney"
Reply-To: paulmck@linux.vnet.ibm.com
To: Tetsuo Handa
Cc: Michal Hocko, David Rientjes, linux-mm@kvack.org, Andrew Morton,
 linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm,oom: Bring OOM notifier callbacks to outside of OOM killer.
Message-Id: <20180626170345.GA3593@linux.vnet.ibm.com>
In-Reply-To: <2d8c3056-1bc2-9a32-d745-ab328fd587a1@i-love.sakura.ne.jp>
References: <1529493638-6389-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
 <20180621073142.GA10465@dhcp22.suse.cz>
 <2d8c3056-1bc2-9a32-d745-ab328fd587a1@i-love.sakura.ne.jp>
User-Agent: Mutt/1.5.21 (2010-09-15)

On Thu, Jun 21, 2018 at 08:27:41PM +0900, Tetsuo Handa wrote:
> On 2018/06/21 7:36, David Rientjes wrote:
> >> @@ -1010,6 +1010,33 @@ int unregister_oom_notifier(struct notifier_block *nb)
> >>  EXPORT_SYMBOL_GPL(unregister_oom_notifier);
> >>  
> >>  /**
> >> + * try_oom_notifier - Try to reclaim memory from OOM notifier list.
> >> + *
> >> + * Returns non-zero if notifier callbacks released something, zero otherwise.
> >> + */
> >> +unsigned long try_oom_notifier(void)
> > 
> > It certainly is tried, but based on its usage it would probably be better
> > to describe what is being returned (it's going to set *did_some_progress,
> > which is a page count).
> 
> Well, it depends on what the callbacks are doing. Currently, we have 5 users.
> 
> arch/powerpc/platforms/pseries/cmm.c
> arch/s390/mm/cmm.c
> drivers/gpu/drm/i915/i915_gem_shrinker.c
> drivers/virtio/virtio_balloon.c
> kernel/rcu/tree_plugin.h
> 
> Speaking of rcu_oom_notify() in kernel/rcu/tree_plugin.h, we can't tell whether
> the callback helped release memory, for it does not update the "freed" argument.

It does not immediately release memory, but instead takes steps to
encourage memory to be released more quickly.

> >> +{
> >> +	static DEFINE_MUTEX(oom_notifier_lock);
> >> +	unsigned long freed = 0;
> >> +
> >> +	/*
> >> +	 * Since OOM notifier callbacks must not depend on __GFP_DIRECT_RECLAIM
> >> +	 * && !__GFP_NORETRY memory allocation, waiting for mutex here is safe.
> >> +	 * If lockdep reports possible deadlock dependency, it will be a bug in
> >> +	 * OOM notifier callbacks.
> >> +	 *
> >> +	 * If SIGKILL is pending, it is likely that current thread was selected
> >> +	 * as an OOM victim. In that case, current thread should return as soon
> >> +	 * as possible using memory reserves.
> >> +	 */
> >> +	if (mutex_lock_killable(&oom_notifier_lock))
> >> +		return 0;
> >> +	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> >> +	mutex_unlock(&oom_notifier_lock);
> >> +	return freed;
> >> +}
> > 
> > If __blocking_notifier_call_chain() used down_read_killable(), could we
> > eliminate oom_notifier_lock?
> 
> I don't think we can eliminate it now, for it is a serialization lock
> (while still trying to respond to SIGKILL as soon as possible), a role which
> is currently played by mutex_trylock(&oom_lock).
> 
> (1) rcu_oom_notify() in kernel/rcu/tree_plugin.h is not prepared for being
>     called concurrently.
> 
> ----------
> static int rcu_oom_notify(struct notifier_block *self,
> 			  unsigned long notused, void *nfreed)
> {
> 	int cpu;
> 
> 	/* Wait for callbacks from earlier instance to complete. */
> 	wait_event(oom_callback_wq, atomic_read(&oom_callback_count) == 0); // <= Multiple threads can pass this line at the same time.
> 	smp_mb(); /* Ensure callback reuse happens after callback invocation. */
> 
> 	/*
> 	 * Prevent premature wakeup: ensure that all increments happen
> 	 * before there is a chance of the counter reaching zero.
> 	 */
> 	atomic_set(&oom_callback_count, 1); // <= Multiple threads can execute this line at the same time.
> 
> 	for_each_online_cpu(cpu) {
> 		smp_call_function_single(cpu, rcu_oom_notify_cpu, NULL, 1);
> 		cond_resched_tasks_rcu_qs();
> 	}
> 
> 	/* Unconditionally decrement: no need to wake ourselves up. */
> 	atomic_dec(&oom_callback_count); // <= Multiple threads can execute this line at the same time, making oom_callback_count < 0 ?
> 
> 	return NOTIFY_OK;
> }
> ----------
> 
> The counter inconsistency problem could be fixed by
> 
> -	atomic_set(&oom_callback_count, 1);
> +	atomic_inc(&oom_callback_count);
> 
> but who would benefit if rcu_oom_notify() became ready to be called
> concurrently? We want to wait for the callback to complete before
> proceeding to the OOM killer. I think that we should save CPU resources
> by serializing concurrent callers.

There are a lot of ways it could be made concurrency safe.  If you need
me to do this, please do let me know.

That said, the way it is now written, if you invoke rcu_oom_notify()
twice in a row, the second invocation will wait until the memory from
the first invocation is freed.

What do you want me to do if you invoke me concurrently?

1.	One invocation "wins", waits for the earlier callbacks to
	complete, then encourages any subsequent callbacks to be
	processed more quickly.  The other invocations return
	immediately without doing anything.

2.	The invocations serialize, with each invocation waiting for
	the callbacks from the previous invocation (in mutex_lock()
	order or some such), and then starting a new round.

3.	Something else?
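For illustration only, option 1 might look something like the sketch
below. It is untested, and the oom_notify_busy flag is invented for this
sketch (it does not exist in the current code); the rest of the body is
the existing rcu_oom_notify() quoted above.

static int rcu_oom_notify(struct notifier_block *self,
			  unsigned long notused, void *nfreed)
{
	static atomic_t oom_notify_busy;	/* invented for this sketch */
	int cpu;

	/* Option 1: only the first concurrent caller does the work. */
	if (atomic_cmpxchg(&oom_notify_busy, 0, 1))
		return NOTIFY_OK;	/* Someone else is already at it. */

	/* Wait for callbacks from earlier instance to complete. */
	wait_event(oom_callback_wq, atomic_read(&oom_callback_count) == 0);
	smp_mb(); /* Ensure callback reuse happens after callback invocation. */

	/*
	 * Prevent premature wakeup: ensure that all increments happen
	 * before there is a chance of the counter reaching zero.
	 */
	atomic_set(&oom_callback_count, 1);

	for_each_online_cpu(cpu) {
		smp_call_function_single(cpu, rcu_oom_notify_cpu, NULL, 1);
		cond_resched_tasks_rcu_qs();
	}

	/* Unconditionally decrement: no need to wake ourselves up. */
	atomic_dec(&oom_callback_count);

	/* Let the next OOM event start a fresh round. */
	atomic_set(&oom_notify_busy, 0);
	return NOTIFY_OK;
}

Option 2 would instead replace the cmpxchg with a mutex, much like the
oom_notifier_lock in your try_oom_notifier() above.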
							Thanx, Paul

> (2) i915_gem_shrinker_oom() in drivers/gpu/drm/i915/i915_gem_shrinker.c depends
>     on mutex_trylock() from shrinker_lock() from i915_gem_shrink() from
>     i915_gem_shrink_all() to return 1 (i.e. succeed) before need_resched()
>     becomes true in order to avoid returning without reclaiming memory.
> 
> > This patch is certainly an improvement because it does the last
> > get_page_from_freelist() call after invoking the oom notifiers that can
> > free memory and we've otherwise pointlessly redirected it elsewhere.
> 
> Thanks, but this patch might break the subtle balance which is currently
> achieved by mutex_trylock(&oom_lock) serialization/exclusion.
> 
> (3) virtballoon_oom_notify() in drivers/virtio/virtio_balloon.c by default
>     tries to release 256 pages. Since this value is configurable, one might
>     set 1048576 pages. If virtballoon_oom_notify() is concurrently called by
>     many threads, it might needlessly deflate the memory balloon.
> 
> We might want to remember and reuse the last result among serialized callers
> (a feedback mechanism) like:
> 
> {
> 	static DEFINE_MUTEX(oom_notifier_lock);
> 	static unsigned long last_freed;
> 	unsigned long freed = 0;
> 	if (mutex_trylock(&oom_notifier_lock)) {
> 		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> 		last_freed = freed;
> 	} else {
> 		mutex_lock(&oom_notifier_lock);
> 		freed = last_freed;
> 	}
> 	mutex_unlock(&oom_notifier_lock);
> 	return freed;
> }
> 
> or
> 
> {
> 	static DEFINE_MUTEX(oom_notifier_lock);
> 	static unsigned long last_freed;
> 	unsigned long freed = 0;
> 	if (mutex_lock_killable(&oom_notifier_lock)) {
> 		freed = last_freed;
> 		last_freed >>= 1;
> 		return freed;
> 	} else if (last_freed) {
> 		freed = last_freed;
> 		last_freed >>= 1;
> 	} else {
> 		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> 		last_freed = freed;
> 	}
> 	mutex_unlock(&oom_notifier_lock);
> 	return freed;
> }
> 
> Without a feedback mechanism, mutex_lock_killable(&oom_notifier_lock) serialization
> could still needlessly deflate the memory balloon compared to mutex_trylock(&oom_lock)
> serialization/exclusion. Maybe virtballoon_oom_notify() (and the two CMM users) would
> implement a feedback mechanism themselves, by examining the watermark from their OOM
> notifier hooks.
> 
> 
> On 2018/06/21 16:31, Michal Hocko wrote:
> > On Wed 20-06-18 15:36:45, David Rientjes wrote:
> > [...]
> >> That makes me think that "oom_notify_list" isn't very intuitive: it can
> >> free memory as a last step prior to oom kill.  OOM notify, to me, sounds
> >> like it's only notifying some callbacks about the condition.  Maybe
> >> oom_reclaim_list and then rename this to oom_reclaim_pages()?
> > 
> > Yes agreed and that is the reason I keep saying we want to get rid of
> > this yet-another-reclaim mechanism. We already have shrinkers which are
> > the main source of non-lru pages reclaim. Why do we even need
> > oom_reclaim_pages? What is fundamentally different here? Sure those
> > pages should be reclaimed as the last resort but we already do have
> > priority for slab shrinking so we know that the system is struggling
> > when reaching the lowest priority. Isn't that enough to express the need
> > for current oom notifier implementations?
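For concreteness, the shrinker interface Michal is referring to is the
struct shrinker / register_shrinker() pair. Hooking a driver into it
might look roughly like the sketch below; the my_*() helpers and their
bookkeeping are invented purely for illustration and are not taken from
any of the drivers discussed here.

#include <linux/shrinker.h>

/* Hypothetical helpers standing in for a driver's own bookkeeping. */
static unsigned long my_reclaimable_pages(void);
static unsigned long my_release_pages(unsigned long nr);

static unsigned long my_count_objects(struct shrinker *s,
				      struct shrink_control *sc)
{
	/* Report how much we could give back if asked. */
	return my_reclaimable_pages();
}

static unsigned long my_scan_objects(struct shrinker *s,
				     struct shrink_control *sc)
{
	/* Release up to sc->nr_to_scan pages and report how many were freed. */
	return my_release_pages(sc->nr_to_scan);
}

static struct shrinker my_shrinker = {
	.count_objects	= my_count_objects,
	.scan_objects	= my_scan_objects,
	.seeks		= DEFAULT_SEEKS,
};

/* Somewhere in driver init:  register_shrinker(&my_shrinker);  */

Michal's point, as I read it, is that the work requested through
sc->nr_to_scan grows as reclaim priority drops, so a shrinker can already
tell that the system is struggling without a separate last-resort OOM
callback.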
> 
> Even if we update the OOM notifier users to use shrinker hooks, they will need a
> subtle balance which is currently achieved by mutex_trylock(&oom_lock).
> 
> Removing the OOM notifier is not doable right now. It is not suitable as a regression
> fix for commit 27ae357fa82be5ab ("mm, oom: fix concurrent munlock and oom reaper
> unmap, v3"). What we could afford for this regression is
> https://patchwork.kernel.org/patch/9842889/ which is exactly what you suggested
> in a thread at https://www.spinics.net/lists/linux-mm/msg117896.html .
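For completeness, the interface being debated above is just a blocking
notifier chain: a user hooks in with register_oom_notifier() and reports
the number of pages it freed through the pointer argument. A minimal
sketch follows, with the callback body and the my_release_pages() helper
invented for illustration.

#include <linux/oom.h>
#include <linux/notifier.h>

static unsigned long my_release_pages(unsigned long nr);	/* hypothetical */

static int my_oom_notify(struct notifier_block *nb,
			 unsigned long unused, void *parm)
{
	unsigned long *freed = parm;

	/* Give back whatever can be spared and report the page count. */
	*freed += my_release_pages(32);
	return NOTIFY_OK;
}

static struct notifier_block my_oom_nb = {
	.notifier_call = my_oom_notify,
};

/* Somewhere in driver init:  register_oom_notifier(&my_oom_nb);  */

This is the basic pattern that the five users listed at the top of this
mail follow, which is why the serialization questions above apply to each
of them.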