Date: Mon, 2 Jul 2018 08:58:07 +0200
From: Jan Kara
To: Leon Romanovsky
Cc: Jan Kara, John Hubbard, Jason Gunthorpe, Michal Hocko, Dan Williams,
 Christoph Hellwig, John Hubbard, Matthew Wilcox, Christopher Lameter,
 Linux MM, LKML, linux-rdma
Subject: Re: [PATCH 2/2] mm: set PG_dma_pinned on get_user_pages*()
Message-ID: <20180702065807.zwnsdwww442gqcf6@quack2.suse.cz>
References: <20180626164825.fz4m2lv6hydbdrds@quack2.suse.cz>
 <20180627113221.GO32348@dhcp22.suse.cz>
 <20180627115349.cu2k3ainqqdrrepz@quack2.suse.cz>
 <20180627115927.GQ32348@dhcp22.suse.cz>
 <20180627124255.np2a6rxy6rb6v7mm@quack2.suse.cz>
 <20180627145718.GB20171@ziepe.ca>
 <20180627170246.qfvucs72seqabaef@quack2.suse.cz>
 <1f6e79c5-5801-16d2-18a6-66bd0712b5b8@nvidia.com>
 <20180628091743.khhta7nafuwstd3m@quack2.suse.cz>
 <20180702055251.GV3014@mtr-leonro.mtl.com>
In-Reply-To: <20180702055251.GV3014@mtr-leonro.mtl.com>
List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On Mon 02-07-18 08:52:51, Leon Romanovsky wrote:
> On Thu, Jun 28, 2018 at 11:17:43AM +0200, Jan Kara wrote:
> > On Wed 27-06-18 19:42:01, John Hubbard wrote:
> > > On 06/27/2018 10:02 AM, Jan Kara wrote:
> > > > On Wed 27-06-18 08:57:18, Jason Gunthorpe wrote:
> > > >> On Wed, Jun 27, 2018 at 02:42:55PM +0200, Jan Kara wrote:
> > > >>> On Wed 27-06-18 13:59:27, Michal Hocko wrote:
> > > >>>> On Wed 27-06-18 13:53:49, Jan Kara wrote:
> > > >>>>> On Wed 27-06-18 13:32:21, Michal Hocko wrote:
> > > >>>> [...]
> > > >>>>>> Apart from that, do we really care about 32b here? Big DIO, IB
> > > >>>>>> users seem to be 64b only AFAIU.
> > > >>>>>
> > > >>>>> IMO it is a bad habit to leave an unprivileged-user-triggerable
> > > >>>>> oops in the kernel even for uncommon platforms...
> > > >>>>
> > > >>>> Absolutely agreed! I didn't mean to keep the blow up for 32b. I
> > > >>>> just wanted to say that we can stay with a simple solution for
> > > >>>> 32b. I thought the g-u-p-longterm has plugged the most obvious
> > > >>>> breakage already. But maybe I just misunderstood.
> > > >>>
> > > >>> Most yes, but if you try hard enough, you can still trigger the
> > > >>> oops e.g. with appropriately set up direct IO when racing with
> > > >>> writeback / reclaim.
> > > >>
> > > >> gup longterm is only different from normal gup if you have DAX and
> > > >> few people do, which really means it doesn't help at all.. AFAIK??
> > > >
> > > > Right, what I wrote works only for DAX. For the non-DAX situation
> > > > g-u-p longterm does not currently help at all. Sorry for the
> > > > confusion.
> > >
> > > OK, I've got an early version of this up and running, reusing the
> > > page->lru fields. I'll clean it up and do some heavier testing, and
> > > post as a PATCH v2.
> >
> > Cool.
> >
> > > One question though: I'm still vague on the best actions to take in
> > > the following functions:
> > >
> > >     page_mkclean_one
> > >     try_to_unmap_one
> > >
> > > At the moment, they are both just doing an evil little early-out:
> > >
> > >     if (PageDmaPinned(page))
> > >         return false;
> > >
> > > ...but we talked about maybe waiting for the condition to clear,
> > > instead? Thoughts?
> >
> > What needs to happen in page_mkclean() depends on the caller. Most of
> > the callers really need to be sure the page is write-protected once
> > page_mkclean() returns. Those are:
> >
> >   pagecache_isize_extended()
> >   fb_deferred_io_work()
> >   clear_page_dirty_for_io() if called for data-integrity writeback -
> >     which is currently known only in its caller (e.g.
> >     write_cache_pages()) where it can be determined as
> >     wbc->sync_mode == WB_SYNC_ALL. Getting this information into
> >     page_mkclean() will require some plumbing, and
> >     clear_page_dirty_for_io() has some 50 callers, but it's doable.
> >
> > clear_page_dirty_for_io() for cleaning writeback (wbc->sync_mode !=
> > WB_SYNC_ALL) can just skip pinned pages, and we probably need to do
> > that, as otherwise memory cleaning would get stuck on pinned pages
> > until RDMA drivers release their pins.
>
> Sorry for the naive question, but won't it create too many dirty pages,
> so writeback will be called "non-stop" to rebalance watermarks without
> the ability to progress?

If the amount of pinned pages is more than the allowed dirty limit then
yes. However, the dirty limit is there exactly to prevent too many
difficult-to-get-rid-of pages in the page cache.
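The split described above for clear_page_dirty_for_io() (skip pinned pages during cleaning writeback, wait for the pin to drop during data-integrity writeback) could be modelled in a small userspace sketch. All names below are simplified stand-ins for the kernel structures under discussion, not the real kernel API:

```c
/* Userspace model of the proposed behaviour; PG_dma_pinned is the flag
 * from the patch under discussion, everything else is invented. */
#define PG_dirty       (1u << 0)
#define PG_dma_pinned  (1u << 1)

struct page { unsigned int flags; };

enum writeback_sync_modes { WB_SYNC_NONE, WB_SYNC_ALL };

enum mkclean_result { CLEANED, SKIPPED, MUST_WAIT };

/* Sketch of what clear_page_dirty_for_io() could do once the
 * wbc->sync_mode information is plumbed down to it. */
static enum mkclean_result
clear_page_dirty_sketch(struct page *page, enum writeback_sync_modes mode)
{
	if (page->flags & PG_dma_pinned) {
		if (mode == WB_SYNC_NONE)
			return SKIPPED;   /* cleaning writeback: leave dirty */
		return MUST_WAIT;         /* integrity writeback: wait for unpin */
	}
	page->flags &= ~PG_dirty;         /* ordinary page: write-protect + clean */
	return CLEANED;
}
```

The point of the three-way result is that only data-integrity callers (wbc->sync_mode == WB_SYNC_ALL in the real code) would ever block on a pinned page; background cleaning just moves on.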
So if your amount of pinned pages crosses the dirty limit, you have just
violated this mm constraint, and you either need to modify your workload
not to pin so many pages, or you need to verify that so many dirty pages
are indeed safe and increase the dirty limit.

Realistically, I think we need to come up with a hard limit (similar to
mlock, or account them as mlock) on these pinned pages, because they are
even worse than plain dirty pages. However, I'd strongly prefer to keep
that discussion separate from this discussion about how to avoid memory
corruption / oopses, because that is another can of worms with big
bikeshedding potential.

								Honza
-- 
Jan Kara
SUSE Labs, CR
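The hard limit floated above (accounting pinned pages the way mlock accounts locked pages) might look, in a deliberately simplified userspace sketch with invented names, something like:

```c
#include <stdbool.h>

/* Toy model of per-context pin accounting; none of these names exist in
 * the kernel, and the real RLIMIT_MEMLOCK-style plumbing is omitted. */
struct pin_accounting {
	unsigned long pinned;     /* pages currently pinned */
	unsigned long pin_limit;  /* hard cap, analogous to a memlock limit */
};

/* Charge npages against the limit; refuse the pin rather than let
 * unreclaimable pages swamp the dirty limit. */
static bool try_account_pin(struct pin_accounting *acct, unsigned long npages)
{
	if (acct->pinned + npages > acct->pin_limit)
		return false;
	acct->pinned += npages;
	return true;
}

/* Undo the charge when the pages are unpinned. */
static void unaccount_pin(struct pin_accounting *acct, unsigned long npages)
{
	acct->pinned -= npages;
}
```

A get_user_pages*() caller would charge before pinning and uncharge on release, so the total of pinned (hence unwritable-back) pages can never exceed the configured cap.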