Date: Wed, 20 Mar 2019 00:33:20 -0400
From: Jerome Glisse
To: John Hubbard
Cc: Dave Chinner, "Kirill A.
 Shutemov", john.hubbard@gmail.com, Andrew Morton, linux-mm@kvack.org,
 Al Viro, Christian Benvenuti, Christoph Hellwig, Christopher Lameter,
 Dan Williams, Dennis Dalessandro, Doug Ledford, Ira Weiny, Jan Kara,
 Jason Gunthorpe, Matthew Wilcox, Michal Hocko, Mike Rapoport,
 Mike Marciniszyn, Ralph Campbell, Tom Talpey, LKML,
 linux-fsdevel@vger.kernel.org, Andrea Arcangeli
Subject: Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
Message-ID: <20190320043319.GA7431@redhat.com>
References: <20190308213633.28978-1-jhubbard@nvidia.com>
 <20190308213633.28978-2-jhubbard@nvidia.com>
 <20190319120417.yzormwjhaeuu7jpp@kshutemo-mobl1>
 <20190319134724.GB3437@redhat.com>
 <20190319141416.GA3879@redhat.com>
 <20190319212346.GA26298@dastard>
 <20190319220654.GC3096@redhat.com>
 <20190319235752.GB26298@dastard>
 <20190320000838.GA6364@redhat.com>

On Tue, Mar 19, 2019 at 06:43:45PM -0700, John Hubbard wrote:
> On 3/19/19 5:08 PM, Jerome Glisse wrote:
> > On Wed, Mar 20, 2019 at 10:57:52AM +1100, Dave Chinner wrote:
> >> On Tue, Mar 19, 2019 at 06:06:55PM -0400, Jerome Glisse wrote:
> >>> On Wed, Mar 20, 2019 at 08:23:46AM +1100, Dave Chinner wrote:
> >>>> On Tue, Mar 19, 2019 at 10:14:16AM -0400, Jerome Glisse wrote:
> >>>>> On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:
> >>>>>> On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
> >>>>>>> On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
> >>>>>>>> From: John Hubbard
> >>>>>> [...]
> >>>>> Forgot to mention one thing, we had a discussion with Andrea and Jan
> >>>>> about set_page_dirty() and Andrea had the good idea of maybe doing
> >>>>> the set_page_dirty() at GUP time (when GUP with write) not when the
> >>>>> GUP user calls put_page(). We can do that by setting the dirty bit
> >>>>> in the pte for instance. They are few bonus of doing things that way:
> >>>>>   - amortize the cost of calling set_page_dirty() (ie one call for
> >>>>>     GUP and page_mkclean()
> >>>>>   - it is always safe to do so at GUP time (ie the pte has write
> >>>>>     permission and thus the page is in correct state)
> >>>>>   - safe from truncate race
> >>>>>   - no need to ever lock the page
> >>>>
> >>>> I seem to have missed this conversation, so please excuse me for
> >>>
> >>> The set_page_dirty() at GUP was in a private discussion (it started
> >>> on another topic and drifted away to set_page_dirty()).
> >>>
> >>>> asking a stupid question: if it's a file backed page, what prevents
> >>>> background writeback from cleaning the dirty page ~30s into a long
> >>>> term pin? i.e. I don't see anything in this proposal that prevents
> >>>> the page from being cleaned by writeback and putting us straight
> >>>> back into the situation where a long term RDMA is writing to a clean
> >>>> page....
> >>>
> >>> So this patchset does not solve this issue.
> >>
> >> OK, so it just kicks the can further down the road.
> >>
> >>> [3..N] decide what to do for GUPed page, so far the plans seems
> >>>        to be to keep the page always dirty and never allow page
> >>>        write back to restore the page in a clean state. This does
> >>>        disable thing like COW and other fs feature but at least
> >>>        it seems to be the best thing we can do.
> >>
> >> So the plan for GUP vs writeback so far is "break fsync()"? :)
> >>
> >> We might need to work on that a bit more...
> >
> > Sorry forgot to say that we still do write back using a bounce page
> > so that at least we write something to disk that is just a snapshot
> > of the GUPed page everytime writeback kicks in (so either through
> > radix tree dirty page write back or fsync or any other sync events).
> > So many little details that i forgot the big chunk :)
> >
> > Cheers,
> > Jérôme
> >
>
> Dave, Jan, Jerome,
>
> Bounce pages for periodic data integrity still seem viable. But for the
> question of things like fsync or truncate, I think we were zeroing in
> on file leases as a nice building block.
>
> Can we revive the file lease discussion? By going all the way out to user
> space and requiring file leases to be coordinated at a high level in the
> software call chain, it seems like we could routinely avoid some of the
> worst conflicts that the kernel code has to resolve.
>
> For example:
>
> Process A
> =========
> gets a lease on file_a that allows gup
>     usage on a range within file_a
>
> sets up writable DMA:
>     get_user_pages() on the file_a range
>     start DMA (independent hardware ops)
>         hw is reading and writing to range
>
>                                         Process B
>                                         =========
>                                         truncate(file_a)
>                                         ...
>                                         __break_lease()
>
> handle SIGIO from __break_lease
>     if unhandled, process gets killed
>     and put_user_pages should get called
>     at some point here
>
> ...and so this way, user space gets to decide the proper behavior,
> instead of leaving the kernel in the dark with an impossible decision
> (kill process A? Block process B? User space knows the preference,
> per app, but kernel does not.)

There is no need to kill anything here ... if truncate happens then the
GUP user is just GUPing pages that do not correspond to anything anymore.
This is the current behavior and it is what GUP has always been. By the
time you get the pages from GUP there is no guarantee that they
correspond to anything.

If a device really wants to mirror a process address space faithfully
then the hardware needs to make a little effort: either have something
like ATS/PASID or be able to abide by mmu notifiers.

If we start blocking existing syscalls just because someone is doing a
GUP we are opening a Pandora's box. It is not just truncate, it is a
whole range of syscalls that deal with either files or virtual addresses.

The semantics of GUP are really the semantics of direct I/O and of the
virtual addresses you are doing direct I/O to/from, and the rule there
is: do not do anything stupid to those virtual addresses while you are
doing direct I/O with them (no munmap, mremap, madvise, truncate, ...).

The same logic applies to files: when two processes do things to the
same file, the kernel never gets in the way of one process doing
something the other process did not expect. For instance, if one process
mmaps the file and another process truncates it, and the first process
then tries to access the file through the mmap after the truncation, it
will get a SIGBUS.

So I believe the best we could do is send a SIGBUS to the process that
has GUPed a range of a file that is being truncated; this would match
what we do for CPU access. There is no reason access through GUP should
be handled any differently.

Cheers,
Jérôme
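
For reference, the CPU-access behavior described above (access through an
mmap after the file has been truncated ends in SIGBUS) can be observed
from user space. The following is only a minimal sketch under stated
assumptions: it assumes Linux, the function name signal_after_truncate()
and the /tmp path template are made up for illustration, and it exercises
only the plain mmap path, not GUP itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (name is an assumption, not from the thread).
 * Returns the signal that terminated the child, 0 if the child exited
 * normally, or -1 if the setup failed.
 */
int signal_after_truncate(void)
{
    char path[] = "/tmp/mmap-truncate-XXXXXX";   /* illustrative path */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                       /* file lives until fd is closed */

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || ftruncate(fd, page) != 0) { /* file is one page long */
        close(fd);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Child: map the page, shrink the file, then touch the mapping. */
        void *map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
            _exit(2);
        volatile char *p = map;
        p[0] = 'x';                     /* fine: page is backed by the file */
        if (ftruncate(fd, 0) != 0)
            _exit(3);
        p[0] = 'y';                     /* mapping is now past EOF */
        _exit(0);                       /* reached only if no signal came */
    }

    /* Parent: report how the child ended. */
    int status = 0;
    close(fd);
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling signal_after_truncate() from a small main() and comparing the
result against SIGBUS is enough to check the behavior on a given system.
The truncate and the access happen in the same child process here only to
keep the sketch short; the two-process case from the text behaves the
same way for a shared file mapping.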