Date: Tue, 10 Jul 2018 10:21:00 +0200
From: Jan Kara
To: Matthew Wilcox
Cc: Jan Kara, Nicholas Piggin, john.hubbard@gmail.com, Michal Hocko,
    Christopher Lameter, Jason Gunthorpe, Dan Williams, Al Viro,
    linux-mm@kvack.org, LKML, linux-rdma, linux-fsdevel@vger.kernel.org,
    John Hubbard
Subject: Re: [PATCH 0/2] mm/fs: put_user_page() proposal
Message-ID: <20180710082100.mkdwngdv5kkrcz6n@quack2.suse.cz>
References: <20180709080554.21931-1-jhubbard@nvidia.com>
    <20180709184937.7a70c3aa@roar.ozlabs.ibm.com>
    <20180709160806.xjt2l2pbmyiutbyi@quack2.suse.cz>
    <20180709171651.GE2662@bombadil.infradead.org>
    <20180709194740.rymbt2fzohbdmpye@quack2.suse.cz>
    <20180709200049.GA5335@bombadil.infradead.org>
In-Reply-To: <20180709200049.GA5335@bombadil.infradead.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon 09-07-18 13:00:49, Matthew Wilcox wrote:
> On Mon, Jul 09, 2018 at 09:47:40PM +0200, Jan Kara wrote:
> > On Mon 09-07-18 10:16:51, Matthew Wilcox wrote:
> > > > 2) What to do when some page is pinned but we need to do e.g.
> > > > clear_page_dirty_for_io(). After some more thinking I agree with you that
> > > > just blocking waiting for page to unpin will create deadlocks like:
> > >
> > > Why are we trying to writeback a page that is pinned? It's presumed to
> > > be continuously redirtied by its pinner. We can't evict it.
> >
> > So what should be a result of fsync(file), where some 'file' pages are
> > pinned e.g. by running direct IO? If we just skip those pages, we'll lie to
> > userspace that data was committed while it was not (and it's not only about
> > data that has landed in those pages via DMA, you can have first 1k of a page
> > modified by normal IO in parallel to DMA modifying second 1k chunk). If
> > fsync(2) returns error, it would be really unexpected by userspace and most
> > apps will just not handle that correctly. So what else can you do than
> > block?
>
> I was thinking about writeback, and neglected the fsync case.

For memory-cleaning writeback, skipping is certainly the right thing to do,
and that's what we plan to do.

> For fsync, we could copy the "current" contents of the page to a
> freshly-allocated page and write _that_ to disc? As long as we redirty
> the real page after the pin is dropped, I think we're fine.

For the record, this technique is called "bouncing" in block layer
terminology, and we do have support for it there (see block/bounce.c).
It would need some tweaking (e.g.
a bio flag to indicate that some page in a bio needs bouncing if the
underlying storage requires stable pages) but that is easy to do - we even
had support for something similar some years back, as ext3 needed it to
guarantee that a metadata buffer cannot be modified while IO is running on
it. I actually considered using this some time ago but then discarded the
idea, as it seemed it wouldn't buy us much compared to blocking / skipping.
But now, seeing the troubles with blocking, using page bouncing for
situations where we cannot just skip page writeout indeed looks appealing.
Thanks for suggesting that!

As a side note, I'm not 100% decided whether it is better to keep the
original page dirty all the time while it is pinned or not. I'm more
inclined to keep it dirty all the time, as that gives mm more accurate
information about the amount of really dirty pages, prevents reclaim of the
filesystem's dirtiness / allocation tracking information (buffers or
whatever it has attached to the page), and generally avoids a "surprising"
set_page_dirty() once the page is unpinned (one less dirtying path for
filesystems to care about). OTOH it would make flusher threads always try
to write back these pages only to skip them, fsync(2) would always write
them, etc...

								Honza
-- 
Jan Kara
SUSE Labs, CR