Date: Wed, 23 Mar 2022 12:11:57 +0100
From: Peter Zijlstra
To: Michal Hocko
Cc: Thomas Gleixner, Davidlohr Bueso, Nico Pache, linux-mm@kvack.org,
    Andrea Arcangeli, Joel Savitz, Andrew Morton, linux-kernel@vger.kernel.org,
    Rafael Aquini, Waiman Long, Baoquan He, Christoph von Recklinghausen,
    Don Dutile, "Herton R. Krzesinski", Ingo Molnar, Darren Hart,
    Andre Almeida, David Rientjes
Krzesinski" , Ingo Molnar , Darren Hart , Andre Almeida , David Rientjes Subject: Re: [PATCH v5] mm/oom_kill.c: futex: Close a race between do_exit and the oom_reaper Message-ID: References: <20220318033621.626006-1-npache@redhat.com> <20220322004231.rwmnbjpq4ms6fnbi@offworld> <20220322025724.j3japdo5qocwgchz@offworld> <87bkxyaufi.ffs@tglx> <87zglha9rt.ffs@tglx> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Spam-Status: No, score=-2.0 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,RDNS_NONE,SPF_HELO_NONE,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Mar 23, 2022 at 10:17:28AM +0100, Michal Hocko wrote: > > Neither is it "normal" that a VM is scheduled out long enough to miss a > > 1 second deadline. That might be considered normal by cloud folks, but > > that's absolute not normal from an OS POV. Again, that's not a OS > > problem, that's an operator/admin problem. > > Thanks for this clarification. I would tend to agree. Following a > previous example that oom victims can leave inconsistent state behind > which can influence other processes. I am wondering what kind of > expectations about the lock protected state can we make when the holder > of the lock has been interrupted at any random place in the critical > section. Right, this is why the new owner gets the OWNER_DIED bit so it can see something really dodgy happened. Getting that means it needs to validate state consistency or just print a nice error and fully terminate things. So robust futexes: - rely on userspace to maintain a linked list of held locks, - rely on lock acquire to check OWNER_DIED and handle state inconsistency. If userspace manages to screw up either one of those, it's game over. Nothing we can do about it. Software really has to be built do deal with this, it doesn't magically work (IOW, in 99% of the case it just doesn't work right). > [...] > > > And just to be clear, this is clearly a bug in the oom_reaper per se. > > > Originally I thought that relaxing the locking (using trylock and > > > retry/bail out on failure) would help but as I've learned earlier this > > > day this is not really possible because of #PF at least. The most self > > > contained solution would be to skip over vmas which are backing the > > > robust list which would allow the regular exit path to do the proper > > > cleanup. > > > > That's not sufficient because you have to guarantee that the relevant > > shared futex is accessible. See the lock chain example above. > > Yeah, my previous understanding was that the whole linked list lives in > the single mapping and we can just look at their addresses. Nope; shared futexes live in shared memory and as such the robust_list entry must live there too. That is, the robust_list entry is embedded in the lock itself along the lines of: struct robust_mutex { u32 futex; struct robust_list list; }; and then you register the robust_list_head with: .futex_offset = offsetof(struct robust_mutex, futex) - offsetof(struct robust_mutex, list); or somesuch (glibc does all this). And the locks themselves are spread all over the place.