Received: by 2002:ac8:1418:0:b0:3ab:920c:4c8b with SMTP id k24csp3132670qtj; Wed, 18 Jan 2023 13:40:55 -0800 (PST) X-Google-Smtp-Source: AMrXdXtdP7wX0znNyz6umg0BEFqiq74rjU8fKj7CpNJb8LNzy+h5VguPMhXy2YzL98sTfJjJLo/8 X-Received: by 2002:a17:906:3a97:b0:86e:bbdf:95c5 with SMTP id y23-20020a1709063a9700b0086ebbdf95c5mr8811080ejd.26.1674078054810; Wed, 18 Jan 2023 13:40:54 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1674078054; cv=none; d=google.com; s=arc-20160816; b=v1bs8G/hnZIbHLrPfmDjavS4E3dmn4CGgAY/iF6G2NuRHQQD3j5d0P5HwUpDElqeY1 963hfzuCVL6TUuBhzgg9J+4b/TwKd4PgQilgZZNtIyyLAZd8S3Yeb71Is/+jsjvIpoSC rEiKjg0dnzanPBPCWj2esFOHr3xQ4OZq4/L5yI1CaF/93zX3CDk/yYnnARIbc6a1Teir 8M6ug4htUYzVxQIsXEFKK7kX1tVvTVtcR47RwmjMSHr6gd8/YQLAnSYwfUAr77rlXLgi amn+p8ZlCq8sOmAn7MQfTdSJDtdZ3klZOboONfBstgrTmB+koNFzd1p70SCF9vZ7AxdA lwbg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=SCl7LjVH2TZ3UTWu0doBlI+1Q+CKxwaOwngsxyrwCmg=; b=RudFc5e9O6JcEZT+XCHnBV1b6i2sXdU05X7eMtN1eBn+JfF65RSSwBfvOxR9EAs1Bz dWDMgXYLWEuAgVgQUtUlTZU6ihZ9bZTvJRYTSFVLgBW+jGdhIxf7IJf757/tcHuYLJ3a kCppy5BfhQgwnTSgof/6vUyK/PdVQ0/Dhc5bBlC3Gpg3o2iDZV0NDF4grsDTfBpsZnwa +TccjFVTTS5SBt0/822h64ZBjsqjih0rMBfQEk7sBLZE1mQATNUoUbRhowKki4KntPXi 0S+bfQHo78yg/XYHG/iSJXzy3Vm0kLT/tlwHfY5/Z/9H3cmR1b1gt3ZSLWG0VoeougNQ /92w== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@suse.com header.s=susede1 header.b=Xdd6unAy; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=suse.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id ji2-20020a170907980200b00871db1b78e5si7453856ejc.834.2023.01.18.13.40.43; Wed, 18 Jan 2023 13:40:54 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@suse.com header.s=susede1 header.b=Xdd6unAy; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=suse.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229655AbjARV2Z (ORCPT + 45 others); Wed, 18 Jan 2023 16:28:25 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56410 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229648AbjARV2Y (ORCPT ); Wed, 18 Jan 2023 16:28:24 -0500 Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.220.29]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0FCEB18ABC for ; Wed, 18 Jan 2023 13:28:23 -0800 (PST) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out2.suse.de (Postfix) with ESMTPS id B7D965C362; Wed, 18 Jan 2023 21:28:21 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1674077301; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=SCl7LjVH2TZ3UTWu0doBlI+1Q+CKxwaOwngsxyrwCmg=; b=Xdd6unAyJsHjLOy7+DmPRhcVdPMQb2LzuHEe4SAD6unAzkIb7zwG/mfbKlTagLOkn/XMH5 CjMQWydbF5lSHXu+I10M/SseaMjytZ6NOYdOEs9QWHQDhGLwaMJ+RpS+4str3zcAc/Prjh 0nJ+g+di7iE4y2oMzyVtuUfo9Dxo2x0= Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 87D73139D2; Wed, 18 Jan 2023 21:28:21 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id Z0atIHVkyGMaOAAAMHmgww (envelope-from ); Wed, 18 Jan 2023 21:28:21 +0000 Date: Wed, 18 Jan 2023 22:28:20 +0100 From: Michal Hocko To: Suren Baghdasaryan Cc: Jann Horn , peterz@infradead.org, Ingo Molnar , Will Deacon , akpm@linux-foundation.org, michel@lespinasse.org, jglisse@google.com, vbabka@suse.cz, hannes@cmpxchg.org, mgorman@techsingularity.net, dave@stgolabs.net, willy@infradead.org, liam.howlett@oracle.com, ldufour@linux.ibm.com, laurent.dufour@fr.ibm.com, paulmck@kernel.org, luto@kernel.org, songliubraving@fb.com, peterx@redhat.com, david@redhat.com, dhowells@redhat.com, hughd@google.com, bigeasy@linutronix.de, kent.overstreet@linux.dev, punit.agrawal@bytedance.com, lstoakes@gmail.com, peterjung1337@gmail.com, rientjes@google.com, axelrasmussen@google.com, joelaf@google.com, minchan@google.com, shakeelb@google.com, tatashin@google.com, edumazet@google.com, gthelen@google.com, gurua@google.com, arjunroy@google.com, soheil@google.com, hughlynch@google.com, leewalsh@google.com, posk@google.com, linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org, linuxppc-dev@lists.ozlabs.org, x86@kernel.org, linux-kernel@vger.kernel.org, kernel-team@android.com Subject: Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it Message-ID: References: <20230109205336.3665937-1-surenb@google.com> <20230109205336.3665937-13-surenb@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED,SPF_HELO_NONE, SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 18-01-23 09:36:44, Suren Baghdasaryan wrote: > On Wed, Jan 18, 2023 at 7:11 AM 'Michal Hocko' via kernel-team > wrote: > > > > On Wed 18-01-23 14:23:32, Jann Horn wrote: > > > On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko wrote: > > > > On Tue 17-01-23 19:02:55, Jann Horn wrote: > > > > > +locking maintainers > > > > > > > > > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan wrote: > > > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling > > > > > > instead of mmap_lock. Because there are cases when multiple VMAs need > > > > > > to be exclusively locked during VMA tree modifications, instead of the > > > > > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock > > > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When > > > > > > mmap_write_lock holder is done with all modifications and drops mmap_lock, > > > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as > > > > > > locked. > > > > > [...] > > > > > > +static inline void vma_read_unlock(struct vm_area_struct *vma) > > > > > > +{ > > > > > > + up_read(&vma->lock); > > > > > > +} > > > > > > > > > > One thing that might be gnarly here is that I think you might not be > > > > > allowed to use up_read() to fully release ownership of an object - > > > > > from what I remember, I think that up_read() (unlike something like > > > > > spin_unlock()) can access the lock object after it's already been > > > > > acquired by someone else. > > > > > > > > Yes, I think you are right. From a look into the code it seems that > > > > the UAF is quite unlikely as there is a ton of work to be done between > > > > vma_write_lock used to prepare vma for removal and actual removal. > > > > That doesn't make it less of a problem though. > > > > > > > > > So if you want to protect against concurrent > > > > > deletion, this might have to be something like: > > > > > > > > > > rcu_read_lock(); /* keeps vma alive */ > > > > > up_read(&vma->lock); > > > > > rcu_read_unlock(); > > > > > > > > > > But I'm not entirely sure about that, the locking folks might know better. > > > > > > > > I am not a locking expert but to me it looks like this should work > > > > because the final cleanup would have to happen rcu_read_unlock. > > > > > > > > Thanks, I have completely missed this aspect of the locking when looking > > > > into the code. > > > > > > > > Btw. looking at this again I have fully realized how hard it is actually > > > > to see that vm_area_free is guaranteed to sync up with ongoing readers. > > > > vma manipulation functions like __adjust_vma make my head spin. Would it > > > > make more sense to have a rcu style synchronization point in > > > > vm_area_free directly before call_rcu? This would add an overhead of > > > > uncontended down_write of course. > > > > > > Something along those lines might be a good idea, but I think that > > > rather than synchronizing the removal, it should maybe be something > > > that splats (and bails out?) if it detects pending readers. If we get > > > to vm_area_free() on a VMA that has pending readers, we might already > > > be in a lot of trouble because the concurrent readers might have been > > > traversing page tables while we were tearing them down or fun stuff > > > like that. > > > > > > I think maybe Suren was already talking about something like that in > > > another part of this patch series but I don't remember... > > > > This http://lkml.kernel.org/r/20230109205336.3665937-27-surenb@google.com? > > Yes, I spent a lot of time ensuring that __adjust_vma locks the right > VMAs and that VMAs are freed or isolated under VMA write lock > protection to exclude any readers. If the VM_BUG_ON_VMA in the patch > Michal mentioned gets hit then it's a bug in my design and I'll have > to fix it. But please, let's not add synchronize_rcu() in the > vm_area_free(). Just to clarify. I didn't suggest to add synchronize_rcu into vm_area_free. What I really meant was synchronize_rcu like primitive to effectivelly synchronize with any potential pending read locker (so something like vma_write_lock (or whatever it is called). The point is that vma freeing is an event all readers should be notified about. This can be done explicitly for each and every vma before vm_area_free is called but this is just hard to review and easy to break over time. See my point? -- Michal Hocko SUSE Labs