Received: by 2002:a05:6a10:6d10:0:0:0:0 with SMTP id gq16csp27221pxb; Tue, 12 Apr 2022 15:50:55 -0700 (PDT) X-Google-Smtp-Source: ABdhPJx6wMBhewNQfzeJRN+OgUoxkIgFdARp/2iM8RCFyXFkwgTZ7htofF44fr0mgyW8X/22vboy X-Received: by 2002:a17:90a:4217:b0:1c7:c203:b4f3 with SMTP id o23-20020a17090a421700b001c7c203b4f3mr7441964pjg.177.1649803855370; Tue, 12 Apr 2022 15:50:55 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1649803855; cv=none; d=google.com; s=arc-20160816; b=WfG0AKvhEQM5WQFO+HYcgxDzs8GM5cypyOdnhbc4qDSrKCKQasydvf/Ir0DDtktsiP Q/RIjPo0C/53WJYNti3P3nVGgHvxNx9F2U26Iy9cBF+leheWor3EgFBDg6pK1N0pHPRz 2HP80XU20AKcMxnOo3gCD93FtHXub1AJXyi25g1K7rdPKuwTDefhFJCobQJEPEXFT7el pjfe+saEv1fCZkAxcRXcP8dZ7In4hhH2kpQAx62FEcKR5ZfDG5Lb0HbK9JiDvpxx0oqH /2crhxhUEXgYJjGg5i24PLdWobElclPS2ftN0c5zC3StpmQEyRaQtsZpRGe17NUggmGu ZPmg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=8bIp6N4V6OcovxnD4FwgEWzudRUr3RpK89Dxlfw08ow=; b=RjfDzBKj+C0pKv9lYEG03k1gkZzkbwJ2kcv/RKx0D9YxVIUFiKD+HRvhvHDRv412JG Cy+hM8luz/kXikr4pqqZwE6meJmf3cSNlv4WNdkc7gGRN9UHT/nx78GriSDnWTH4Z0FD ofCD2uqPBbyIrMEytVQRPAzRFLSnPyq5nZNYsaen9Pvc6//0RpcjIlFBW0z/tDxq3qgI cMcbNkOGK0wrQmQyHtliI2gBlPj1n+FVTUlxXuuGEiLi+anEsA/hqMmqfzoHTIOOrzzt WwjjjCjlihybV2mi5HehjwzuIYTqinLDFZ5sGKPdIFh9IvVpStbnquuD0bcHheo8Zdy3 pVnw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@suse.com header.s=susede1 header.b=ibGIM2pH; spf=softfail (google.com: domain of transitioning linux-kernel-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=suse.com Return-Path: Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net. [23.128.96.19]) by mx.google.com with ESMTPS id t20-20020a6564d4000000b0039809d148f1si3460536pgv.675.2022.04.12.15.50.55 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 12 Apr 2022 15:50:55 -0700 (PDT) Received-SPF: softfail (google.com: domain of transitioning linux-kernel-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) client-ip=23.128.96.19; Authentication-Results: mx.google.com; dkim=pass header.i=@suse.com header.s=susede1 header.b=ibGIM2pH; spf=softfail (google.com: domain of transitioning linux-kernel-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=suse.com Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 3E83E11177F; Tue, 12 Apr 2022 14:32:43 -0700 (PDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1344136AbiDKJKS (ORCPT + 99 others); Mon, 11 Apr 2022 05:10:18 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39098 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1344135AbiDKJKR (ORCPT ); Mon, 11 Apr 2022 05:10:17 -0400 Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6A05A3EF35 for ; Mon, 11 Apr 2022 02:08:03 -0700 (PDT) Received: from relay2.suse.de (relay2.suse.de [149.44.160.134]) by smtp-out1.suse.de (Postfix) with ESMTP id 190CF210FD; Mon, 11 Apr 2022 09:08:02 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1649668082; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=8bIp6N4V6OcovxnD4FwgEWzudRUr3RpK89Dxlfw08ow=; b=ibGIM2pHOpoLFbDKOB3IpAK7zLFOEvUTDG7NcUL7DLEMV+ClrJyB6LdsFb6wvSRdESueBl 5FhNEWLJRKFbD5UhVcUw+GnqOwUJv6BBcC9xiRDlAr5+wF/KLuEPZy+YZWHruIvqY0KnmZ iLdhUHNP9v1laY29WBBRdcvGW1jN0LE= Received: from suse.cz (unknown [10.100.201.86]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by relay2.suse.de (Postfix) with ESMTPS id B153EA3B88; Mon, 11 Apr 2022 09:08:01 +0000 (UTC) Date: Mon, 11 Apr 2022 11:08:01 +0200 From: Michal Hocko To: Thomas Gleixner Cc: Joel Savitz , Nico Pache , Peter Zijlstra , linux-mm@kvack.org, linux-kernel , Rafael Aquini , Waiman Long , Baoquan He , Christoph von Recklinghausen , Don Dutile , "Herton R . Krzesinski" , David Rientjes , Andrea Arcangeli , Andrew Morton , Davidlohr Bueso , Ingo Molnar , Darren Hart , stable@kernel.org Subject: Re: [PATCH v8] oom_kill.c: futex: Don't OOM reap the VMA containing the robust_list_head Message-ID: References: <20220408032809.3696798-1-npache@redhat.com> <20220408081549.GM2731@worktop.programming.kicks-ass.net> <87k0bzk7e5.ffs@tglx> <87sfqni77s.ffs@tglx> <87wnfwf4e5.ffs@tglx> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <87wnfwf4e5.ffs@tglx> X-Spam-Status: No, score=-2.0 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,RDNS_NONE,SPF_HELO_NONE,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 11-04-22 09:47:14, Thomas Gleixner wrote: > Michal, > > On Mon, Apr 11 2022 at 08:48, Michal Hocko wrote: > > On Fri 08-04-22 23:41:11, Thomas Gleixner wrote: > >> So why would a process private robust mutex be any different from a > >> process shared one? > > > > Purely from the OOM POV they are slightly different because the OOM > > killer always kills all threads which share the mm with the selected > > victim (with an exception of the global init - see __oom_kill_process). > > Note that this is including those threads which are not sharing signals > > handling. > > So clobbering private locks shouldn't be observable to an alive thread > > unless I am missing something. > > Yes, it kills everything, but the reaper also reaps non-shared VMAs. So > if the process private futex sits in a reaped VMA the shared one becomes > unreachable. > > > On the other hand I do agree that delayed oom_reaper execution is a > > reasonable workaround and the most simplistic one. > > I think it's more than a workaround. It's a reasonable expectation that > the kernel side of the user space threads can mop up the mess the user > space part created. So even if one of of N threads is stuck in a place > where it can't, then N-1 can still reach do_exit() and mop their mess > up. > > The oom reaper is the last resort to resolve the situation in case of a > stuck task. No? Yes, I keep saying workaround because it really doesn't address the underlying issue which is that the oom_reaper clobbers something it shouldn't be. A full fix from my POV would be making oom_reaper code more aware of the futex usage. But this is something nore really viable. Btw. this is what I've in my local tree. It hasn't seen any testing but it might be a good start to make it a full patch. I have decided to use a timer rather than juggling tasks on the oom_reaper list because initial implementation looked uglier. I will try to find some time to finish that but if Nico or others beat me to it I won't complain. Also I absolutely do not insist on the timer approach. --- diff --git a/include/linux/sched.h b/include/linux/sched.h index ff6901dcb06d..528806ad6e6a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1446,6 +1446,7 @@ struct task_struct { int pagefault_disabled; #ifdef CONFIG_MMU struct task_struct *oom_reaper_list; + struct timer_list oom_reaper_timer; #endif #ifdef CONFIG_VMAP_STACK struct vm_struct *stack_vm_area; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 7ec38194f8e1..be6d65ead7ec 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -632,7 +632,7 @@ static void oom_reap_task(struct task_struct *tsk) */ set_bit(MMF_OOM_SKIP, &mm->flags); - /* Drop a reference taken by wake_oom_reaper */ + /* Drop a reference taken by queue_oom_repaer */ put_task_struct(tsk); } @@ -644,12 +644,12 @@ static int oom_reaper(void *unused) struct task_struct *tsk = NULL; wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL); - spin_lock(&oom_reaper_lock); + spin_lock_irq(&oom_reaper_lock); if (oom_reaper_list != NULL) { tsk = oom_reaper_list; oom_reaper_list = tsk->oom_reaper_list; } - spin_unlock(&oom_reaper_lock); + spin_unlock_irq(&oom_reaper_lock); if (tsk) oom_reap_task(tsk); @@ -658,22 +658,50 @@ static int oom_reaper(void *unused) return 0; } -static void wake_oom_reaper(struct task_struct *tsk) +static void wake_oom_reaper_fn(struct timer_list *timer) { - /* mm is already queued? */ - if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags)) - return; + struct task_struct *tsk = container_of(timer, struct task_struct, oom_reaper_timer); + struct mm_struct *mm = tsk->signal->oom_mm; + unsigned long flags; - get_task_struct(tsk); + /* The victim managed to terminate on its own - see exit_mmap */ + if (test_bit(MMF_OOM_SKIP, &mm->flags)) { + put_task_struct(tsk); + return; + } - spin_lock(&oom_reaper_lock); + spin_lock_irqsave(&oom_reaper_lock, flags); tsk->oom_reaper_list = oom_reaper_list; oom_reaper_list = tsk; - spin_unlock(&oom_reaper_lock); + spin_unlock_irqrestore(&oom_reaper_lock, flags); trace_wake_reaper(tsk->pid); wake_up(&oom_reaper_wait); } +/* + * Give OOM victims some head room to exit themselves. If they do not exit + * on their own the oom reaper is invoked. + * The timeout is basically arbitrary and there is no best value to use. + * The longer it will be the longer the worst case scenario OOM can + * take. The smaller the timeout the more likely the oom_reaper can get + * into the way and release resources which could be needed during the + * exit path - e.g. futex robust lists can sit in the anonymous memory + * which could be reaped and the exit path won't be able to let waiters + * know the holding task has terminated. + */ +#define OOM_REAPER_DELAY (2*HZ) +static void queue_oom_repaer(struct task_struct *tsk) +{ + /* mm is already queued? */ + if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags)) + return; + + get_task_struct(tsk); + timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper_fn, 0); + tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY; + add_timer(&tsk->oom_reaper_timer); +} + static int __init oom_init(void) { oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper"); @@ -681,7 +709,7 @@ static int __init oom_init(void) } subsys_initcall(oom_init) #else -static inline void wake_oom_reaper(struct task_struct *tsk) +static inline void queue_oom_repaer(struct task_struct *tsk) { } #endif /* CONFIG_MMU */ @@ -932,7 +960,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message) rcu_read_unlock(); if (can_oom_reap) - wake_oom_reaper(victim); + queue_oom_repaer(victim); mmdrop(mm); put_task_struct(victim); @@ -968,7 +996,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message) task_lock(victim); if (task_will_free_mem(victim)) { mark_oom_victim(victim); - wake_oom_reaper(victim); + queue_oom_repaer(victim); task_unlock(victim); put_task_struct(victim); return; @@ -1067,7 +1095,7 @@ bool out_of_memory(struct oom_control *oc) */ if (task_will_free_mem(current)) { mark_oom_victim(current); - wake_oom_reaper(current); + queue_oom_repaer(current); return true; } -- Michal Hocko SUSE Labs