Date: Wed, 30 Mar 2022 11:18:17 +0200
From: Michal Hocko
To: Nico Pache
Cc: Davidlohr Bueso, Thomas Gleixner, linux-mm@kvack.org, Andrea Arcangeli,
    Joel Savitz, Andrew Morton, linux-kernel@vger.kernel.org, Rafael Aquini,
    Waiman Long, Baoquan He, Christoph von Recklinghausen, Don Dutile,
    "Herton R . Krzesinski", Ingo Molnar, Peter Zijlstra, Darren Hart,
    Andre Almeida, David Rientjes
Subject: Re: [PATCH v5] mm/oom_kill.c: futex: Close a race between do_exit and the oom_reaper
References: <20220318033621.626006-1-npache@redhat.com>
    <20220322004231.rwmnbjpq4ms6fnbi@offworld>
    <20220322025724.j3japdo5qocwgchz@offworld>
    <87bkxyaufi.ffs@tglx> <87zglha9rt.ffs@tglx>
X-Mailing-List: linux-kernel@vger.kernel.org

Nico,

On Wed 23-03-22 10:17:29, Michal Hocko wrote:
> Let me skip over futex part which I need to digest and only focus on the
> oom side of the things for clarification.
>
> On Tue 22-03-22 23:43:18, Thomas Gleixner wrote:
[...]
> > You can easily validate that by doing:
> >
> > wake_oom_reaper(task)
> >     task->reap_time = jiffies + HZ;
> >     queue_task(task);
> >     wakeup(reaper);
> >
> > and then:
> >
> > oom_reap_task(task)
> >     now = READ_ONCE(jiffies);
> >     if (time_before(now, task->reap_time)
> >         schedule_timeout_idle(task->reap_time - now);
> >
> > before trying to actually reap the mm.
> >
> > That will prevent the enforced race in most cases and allow the exiting
> > and/or killed processes to cleanup themself. Not pretty, but it should
> > reduce the chance of the reaper to win the race with the exiting and/or
> > killed process significantly.
> >
> > It's not going to work when the problem is combined with a heavy VM
> > overload situation which keeps a guest (or one/some it's vCPUs) away
> > from being scheduled. See below for a discussion of guarantees.
> >
> > If it failed to do so when the sleep returns, then you still can reap
> > it.
>
> Yes, this is certainly an option. Please note that the oom_reaper is not
> the only way to trigger this. process_mrelease syscall performs the same
> operation from the userspace. Arguably process_mrelease could be used
> sanely/correctly because the userspace oom killer can do pro-cleanup
> steps before going to final SIGKILL & process_mrelease. One way would be
> to send SIGTERM in the first step and allow the victim to perform its
> cleanup.

are you working on another version of the fix/workaround based on the
discussion so far?

I guess the most reasonable way forward is to rework oom_reaper
processing to be delayed. This can be either done by a delayed wake up
or as Thomas suggests above by postponing the processing. I think the
delayed wakeup would be _slightly_ easier to handle because the queue
can contain many tasks to be reaped. More specifically something like
delayed work but we cannot rely on the WQ here. I guess we do not have
any delayed wait queue interface but the same trick with the timer
should be applicable here as well. exit_mmap would then cancel the timer
after __oom_reap_task_mm is done. Actually the timer could be canceled
after mmu_notifier_release already but this shouldn't make much of a
difference.
-- 
Michal Hocko
SUSE Labs
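
For illustration only, the delayed processing discussed in this message
can be modelled in plain userspace C. This is a rough sketch under
stated assumptions, not kernel code and not a proposed patch: every name
in it (struct victim, struct reap_queue, queue_victim, victim_exited,
reaper_pass, REAP_DELAY_TICKS) is hypothetical, the tick counter only
stands in for jiffies, and a per-victim flag stands in for cancelling a
timer from the exit path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are kernel interfaces. */
#define REAP_DELAY_TICKS 100UL  /* plays the role of HZ in the quoted sketch */
#define MAX_QUEUED 8

struct victim {
	unsigned long reap_time; /* earliest tick at which reaping may start */
	bool exited;             /* victim finished its own cleanup */
	bool reaped;             /* the reaper tore the address space down */
};

struct reap_queue {
	struct victim *slots[MAX_QUEUED];
	size_t count;
};

/* Wrap-safe "a is before b", the same idea as time_before(). */
static bool tick_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Queue a victim and arm its deadline instead of reaping right away. */
static void queue_victim(struct reap_queue *q, struct victim *v,
			 unsigned long now)
{
	assert(q->count < MAX_QUEUED);
	v->reap_time = now + REAP_DELAY_TICKS;
	v->exited = false;
	v->reaped = false;
	q->slots[q->count++] = v;
}

/* The victim got through its own exit path: models cancelling the timer. */
static void victim_exited(struct victim *v)
{
	v->exited = true;
}

/* One reaper pass at tick `now`; returns how many victims were reaped. */
static size_t reaper_pass(struct reap_queue *q, unsigned long now)
{
	size_t reaped = 0, kept = 0;

	for (size_t i = 0; i < q->count; i++) {
		struct victim *v = q->slots[i];

		if (v->exited)
			continue;             /* cancelled: drop it */
		if (tick_before(now, v->reap_time)) {
			q->slots[kept++] = v; /* not due yet: keep waiting */
			continue;
		}
		v->reaped = true;             /* deadline passed: reap */
		reaped++;
	}
	q->count = kept;
	return reaped;
}
```

Each victim carries its own deadline here, so the queue can hold many
tasks without the reaper sleeping on any single one of them. A victim
that exits before its deadline is dropped unreaped, and one that is
still around once the deadline passes is reaped as before.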