From: Suren Baghdasaryan
Date: Wed, 18 Jan 2023 13:45:05 -0800
Subject: Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
To: Michal Hocko
References: <20230109205336.3665937-1-surenb@google.com>
 <20230109205336.3665937-13-surenb@google.com>
Cc: Jann Horn, peterz@infradead.org, Ingo Molnar, Will Deacon,
 akpm@linux-foundation.org, michel@lespinasse.org, jglisse@google.com,
 vbabka@suse.cz, hannes@cmpxchg.org, mgorman@techsingularity.net,
 dave@stgolabs.net, willy@infradead.org, liam.howlett@oracle.com,
 ldufour@linux.ibm.com, laurent.dufour@fr.ibm.com, paulmck@kernel.org,
 luto@kernel.org, songliubraving@fb.com, peterx@redhat.com,
 david@redhat.com, dhowells@redhat.com, hughd@google.com,
 bigeasy@linutronix.de, kent.overstreet@linux.dev,
 punit.agrawal@bytedance.com, lstoakes@gmail.com, peterjung1337@gmail.com,
 rientjes@google.com, axelrasmussen@google.com, joelaf@google.com,
 minchan@google.com, shakeelb@google.com, tatashin@google.com,
 edumazet@google.com, gthelen@google.com, gurua@google.com,
 arjunroy@google.com, soheil@google.com,
 hughlynch@google.com, leewalsh@google.com, posk@google.com,
 linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org,
 linuxppc-dev@lists.ozlabs.org, x86@kernel.org,
 linux-kernel@vger.kernel.org, kernel-team@android.com

On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko wrote:
>
> On Wed 18-01-23 09:36:44, Suren Baghdasaryan wrote:
> > On Wed, Jan 18, 2023 at 7:11 AM 'Michal Hocko' via kernel-team
> > wrote:
> > >
> > > On Wed 18-01-23 14:23:32, Jann Horn wrote:
> > > > On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko wrote:
> > > > > On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > > > > > +locking maintainers
> > > > > >
> > > > > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan wrote:
> > > > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > > > > locked.
> > > > > > [...]
> > > > > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > > > +{
> > > > > > > +       up_read(&vma->lock);
> > > > > > > +}
> > > > > >
> > > > > > One thing that might be gnarly here is that I think you might not be
> > > > > > allowed to use up_read() to fully release ownership of an object -
> > > > > > from what I remember, I think that up_read() (unlike something like
> > > > > > spin_unlock()) can access the lock object after it's already been
> > > > > > acquired by someone else.
> > > > >
> > > > > Yes, I think you are right. From a look into the code it seems that
> > > > > the UAF is quite unlikely as there is a ton of work to be done between
> > > > > vma_write_lock used to prepare vma for removal and actual removal.
> > > > > That doesn't make it less of a problem though.
> > > > >
> > > > > > So if you want to protect against concurrent
> > > > > > deletion, this might have to be something like:
> > > > > >
> > > > > > rcu_read_lock(); /* keeps vma alive */
> > > > > > up_read(&vma->lock);
> > > > > > rcu_read_unlock();
> > > > > >
> > > > > > But I'm not entirely sure about that, the locking folks might know better.
> > > > >
> > > > > I am not a locking expert but to me it looks like this should work
> > > > > because the final cleanup would have to happen after rcu_read_unlock.
> > > > >
> > > > > Thanks, I have completely missed this aspect of the locking when looking
> > > > > into the code.
> > > > >
> > > > > Btw. looking at this again I have fully realized how hard it is actually
> > > > > to see that vm_area_free is guaranteed to sync up with ongoing readers.
> > > > > vma manipulation functions like __adjust_vma make my head spin. Would it
> > > > > make more sense to have a rcu style synchronization point in
> > > > > vm_area_free directly before call_rcu? This would add an overhead of
> > > > > uncontended down_write of course.
> > > >
> > > > Something along those lines might be a good idea, but I think that
> > > > rather than synchronizing the removal, it should maybe be something
> > > > that splats (and bails out?) if it detects pending readers. If we get
> > > > to vm_area_free() on a VMA that has pending readers, we might already
> > > > be in a lot of trouble because the concurrent readers might have been
> > > > traversing page tables while we were tearing them down or fun stuff
> > > > like that.
> > > >
> > > > I think maybe Suren was already talking about something like that in
> > > > another part of this patch series but I don't remember...
> > >
> > > This http://lkml.kernel.org/r/20230109205336.3665937-27-surenb@google.com?
> >
> > Yes, I spent a lot of time ensuring that __adjust_vma locks the right
> > VMAs and that VMAs are freed or isolated under VMA write lock
> > protection to exclude any readers. If the VM_BUG_ON_VMA in the patch
> > Michal mentioned gets hit then it's a bug in my design and I'll have
> > to fix it. But please, let's not add synchronize_rcu() in
> > vm_area_free().
>
> Just to clarify. I didn't suggest adding synchronize_rcu into
> vm_area_free. What I really meant was a synchronize_rcu-like primitive
> to effectively synchronize with any potential pending read locker (so
> something like vma_write_lock, or whatever it is called). The point is
> that vma freeing is an event all readers should be notified about.

I don't think readers need to be notified if we are ensuring that the
VMA is not used by anyone else and is not reachable by the readers.
This is currently done by write-locking the VMA either before removing
it from the tree or before freeing it.

> This can be done explicitly for each and every vma before vm_area_free
> is called but this is just hard to review and easy to break over time.
> See my point?
I understand your point now and if we really need that, one way would
be to have a VMA refcount (like Laurent had in his version of the SPF
implementation). I don't think the current implementation needs that
level of VMA lifetime control unless I missed some location that
should take the lock and does not.

>
> --
> Michal Hocko
> SUSE Labs