Date: Tue, 19 Jun 2018 10:29:49 +0200
From: Jan Kara
To: John Hubbard
Cc: Dan Williams, Christoph Hellwig, Jason Gunthorpe, John Hubbard,
	Matthew Wilcox, Michal Hocko, Christopher Lameter, Jan Kara,
	Linux MM, LKML, linux-rdma
Subject: Re: [PATCH 2/2] mm: set PG_dma_pinned on get_user_pages*()
Message-ID: <20180619082949.wzoe42wpxsahuitu@quack2.suse.cz>
References: <20180617200432.krw36wrcwidb25cj@ziepe.ca>
	<311eba48-60f1-b6cc-d001-5cc3ed4d76a9@nvidia.com>
	<20180618081258.GB16991@lst.de>
	<3898ef6b-2fa0-e852-a9ac-d904b47320d5@nvidia.com>
	<0e6053b3-b78c-c8be-4fab-e8555810c732@nvidia.com>
In-Reply-To: <0e6053b3-b78c-c8be-4fab-e8555810c732@nvidia.com>

On Mon 18-06-18 14:36:44, John Hubbard wrote:
> On 06/18/2018 12:21 PM, Dan Williams wrote:
> > On Mon, Jun 18, 2018 at 11:14 AM, John Hubbard wrote:
> >> On 06/18/2018 10:56 AM, Dan Williams wrote:
> >>> On Mon, Jun 18, 2018 at 10:50 AM, John Hubbard wrote:
> >>>> On 06/18/2018 01:12 AM, Christoph Hellwig wrote:
> >>>>> On Sun, Jun 17, 2018 at 01:28:18PM -0700, John Hubbard wrote:
> >>>>>> Yes. However, my thinking was: get_user_pages() can become a way to
> >>>>>> indicate that these pages are going to be treated specially. In
> >>>>>> particular, the caller does not really want or need to support
> >>>>>> certain file operations while the page is flagged this way.
> >>>>>>
> >>>>>> If necessary, we could add a new API call.
> >>>>>
> >>>>> That API call is called get_user_pages_longterm.
> >>>>
> >>>> OK... I had the impression that this was just a semi-temporary API for
> >>>> DAX, but given that it's an exported symbol, I guess it really is here
> >>>> to stay.
> >>>
> >>> The plan is to go back and provide API changes that bypass
> >>> get_user_pages_longterm() for RDMA. However, for VFIO and others, it's
> >>> not clear what we could do. In the VFIO case the guest would need to
> >>> be prepared to handle the revocation.
> >>
> >> OK, let's see if I understand that plan correctly:
> >>
> >> 1. Change RDMA users (this could be done entirely in the various device
> >> drivers' code, unless I'm overlooking something) to use MMU notifiers,
> >> and to do their DMA to/from non-pinned pages.
> >
> > The problem with this approach is surprising the RDMA drivers with
> > notifications of teardowns. It's the RDMA userspace applications that
> > need the notification, and it likely needs to be explicit opt-in, at
> > least for the non-ODP drivers.
> >
> >> 2. Return early from get_user_pages_longterm() if the memory is...
> >> marked for RDMA? (How? The same sort of page flag that I'm floating
> >> here, or something else?) That would avoid the problem of pinned pages
> >> getting their buffer heads removed -- by disallowing the pinning.
> >> Makes sense.
> >
> > Well, right now the RDMA workaround is DAX-specific and it seems we
> > need to generalize it for the page-cache case. One thought is to have
> > try_to_unmap() take its own reference and wait for the page reference
> > count to drop to one, so that the truncate path knows the page is
> > DMA-idle and disconnected from the page cache, but I have not looked
> > at the details.
> >
> >> Also, is there anything I can help with here, so that things can
> >> happen sooner?
> >
> > I do think we should explore a page flag for pages that are "long
> > term" pinned. Michal asked for something along these lines at LSF/MM
> > so that the core mm can give up on pages whose lifetime the kernel has
> > lost control of. Michal, did I capture your ask correctly?
>
> OK, that "refcount == 1" approach sounds promising:
>
> -- still use a page flag, but narrow the scope to
>    get_user_pages_longterm() pages
> -- just wait in try_to_unmap(), instead of giving up

But this would fix only the RDMA use case, wouldn't it? Direct IO (and
other get_user_pages_fast() users) would still be problematic.
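For illustration, a minimal sketch of the "wait instead of giving up" idea
being discussed; PageDmaPinned() and the helper below are hypothetical,
standing in for the PG_dma_pinned test proposed in this series, not
existing kernel interfaces:

#include <linux/mm.h>
#include <linux/sched.h>

/*
 * Idea under discussion: the truncate/unmap path takes its own page
 * reference and then waits until that is the only reference left,
 * meaning the long-term GUP user has dropped its pin and DMA is done.
 */
static void wait_for_longterm_pin(struct page *page)
{
	/* PageDmaPinned() is the hypothetical PG_dma_pinned test */
	while (PageDmaPinned(page) && page_ref_count(page) > 1)
		cond_resched();	/* a real version would sleep on a waitqueue */
}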
And for the record, the problem with page cache pages is not only that
try_to_unmap() may unmap them. It is also that page_mkclean() can
write-protect them. And once the PTEs are write-protected, filesystems may
end up doing bad things if DMA then modifies the page contents (DIF/DIX
failures, data corruption, oopses). As such, I don't think solutions based
on the page reference count alone have much chance of dealing with the
problem.

And your page flag approach would also need to take page_mkclean() into
account. There the issue is that until the flag is cleared (i.e., until we
are sure there are no writers using references from GUP), you cannot write
back the page safely, which does not work well with your idea of clearing
the flag only once the page is evicted from the page cache (hint: a page
cache page cannot get evicted until it has been written back). So, as sad
as it is, I don't see an easy solution here.
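For illustration only, a rough sketch of that circular dependency;
PageDmaPinned() and both helpers below are hypothetical, not existing
kernel interfaces:

#include <linux/mm.h>
#include <linux/page-flags.h>

/*
 * Writeback would have to skip pages a device may still be writing to,
 * because writing them out mid-DMA risks DIF/DIX failures or on-disk
 * corruption (hypothetical check, PageDmaPinned() is not in mainline):
 */
static bool page_safe_to_writeback(struct page *page)
{
	return !PageDmaPinned(page);
}

/*
 * ...but reclaim can only evict a page cache page once it is clean and
 * no longer under writeback:
 */
static bool page_safe_to_evict(struct page *page)
{
	return !PageDirty(page) && !PageWriteback(page);
}

/*
 * So if PG_dma_pinned were only cleared at eviction time, a dirty pinned
 * page could never be written back, would never become clean, and the
 * flag would never get cleared.
 */

								Honza
-- 
Jan Kara
SUSE Labs, CR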