Date: Fri, 15 Jan 2021 15:30:58 +0100
From: Jan Kara
To: Linus Torvalds
Cc: Matthew Wilcox, Jason Gunthorpe, Andrea Arcangeli, Linux-MM,
	Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu,
	Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim,
	Will Deacon, Peter Zijlstra, Hugh Dickins, "Kirill A. Shutemov",
	Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky,
	Jan Kara, Kirill Tkhai
Subject: Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
Message-ID: <20210115143058.GG27380@quack2.suse.cz>
References: <20210107200402.31095-1-aarcange@redhat.com>
	<20210107202525.GD504133@ziepe.ca>
	<20210109193224.GB35215@casper.infradead.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat 09-01-21 11:46:46, Linus Torvalds wrote:
> On Sat, Jan 9, 2021 at 11:33 AM Matthew Wilcox wrote:
> >
> > On Thu, Jan 07, 2021 at 01:05:19PM -0800, Linus Torvalds wrote:
> > > Side note, and not really related to UFFD, but the mmap_sem in
> > > general: I was at one point actually hoping that we could make the
> > > mmap_sem a spinlock, or at least make the rule be that we never do any
> > > IO under it. At which point a write lock hopefully really shouldn't be
> > > such a huge deal.
> >
> > There's a (small) group of us working towards that. It has some
> > prerequisites, but where we're hoping to go currently:
> >
> > - Replace the vma rbtree with a b-tree protected with a spinlock
> > - Page faults walk the b-tree under RCU, like peterz/laurent's SPF patchset
> > - If we need to do I/O, take a refcount on the VMA
> >
> > After that, we can gradually move things out from mmap_sem protection
> > to just the vma tree spinlock, or whatever makes sense for them. In a
> > very real way the mmap_sem is the MM layer's BKL.
>
> Well, we could do the "no IO" part first, and keep the semaphore part.
>
> Some people actually prefer a semaphore to a spinlock, because it
> doesn't end up causing preemption issues.
>
> As long as you don't do IO (or memory allocations) under a semaphore
> (ok, in this case it's a rwsem, same difference), it might even be
> preferable to keep it as a semaphore rather than as a spinlock.
>
> So it doesn't necessarily have to go all the way - we _could_ just try
> something like "when taking the mmap_sem, set a thread flag" and then
> have a "warn if doing allocations or IO under that flag".
>
> And since this is about performance, not some hard requirement, it
> might not even matter if we catch all cases. If we fix it so that any
> regular load on most normal filesystems never see the warning, we'd
> already be golden.

Honestly, I'd *love* it if a filesystem could be guaranteed that ->fault
and ->mkwrite callbacks do not happen under mmap_sem (or if the fs were at
least free to drop mmap_sem when it finds the page is not already cached /
prepared for writing). For filesystems, the locking in the page fault path
is really painful because the lock ordering wrt mmap_sem is exactly the
opposite of the read / write path: the read & write paths must be designed
so that mmap_sem can be taken inside them to copy user data, while the
fault path may run entirely under mmap_sem. As a result, this has been a
long-term source of deadlocks, stale data exposure issues, and filesystem
corruption issues due to insufficient locking in multiple filesystems.

But when I looked at what it would take to achieve this several years ago,
fixing all GUP users to deal with mmap_sem being dropped during a fault was
a gigantic task, because there were users of GUP relying on mmap_sem being
held for large code sections around the GUP call...

								Honza
-- 
Jan Kara
SUSE Labs, CR