Date: Fri, 11 Jan 2019 08:44:27 +1100
From: Dave Chinner
To: Matthew Wilcox
Cc: Andy Lutomirski, Linus Torvalds, Jiri Kosina, Jann Horn,
 Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM,
 kernel list, Linux API
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged
Message-ID: <20190110214427.GK27534@dastard>
References: <20190108044336.GB27534@dastard> <20190109022430.GE27534@dastard>
 <20190109043906.GF27534@dastard> <20190110004424.GH27534@dastard>
 <20190110144711.GV6310@bombadil.infradead.org>
In-Reply-To: <20190110144711.GV6310@bombadil.infradead.org>
X-Mailing-List:
linux-kernel@vger.kernel.org

On Thu, Jan 10, 2019 at 06:47:11AM -0800, Matthew Wilcox wrote:
> On Wed, Jan 09, 2019 at 09:26:41PM -0800, Andy Lutomirski wrote:
> > Since direct IO has been brought up, I have a question. I've wondered
> > for years why direct IO works the way it does. If I were implementing
> > it from scratch, my first inclination would be to use the page cache
> > instead of fighting it. To do a single-page direct read, I would look
> > that page up in the page cache (i.e. i_pages these days). If the page
> > is there, I would do a normal buffered read. If the page is not
> > there, I would insert a record into i_pages indicating that direct IO
> > is in progress and then I would do the IO into the destination page.
> > If any other read, direct or otherwise, sees a record saying "under
> > direct IO", it would wait.
>
> OK, you're in the same ballpark I am ;-) Kent Overstreet pointed out
> that what you want to do here is great for the mixed case, but it's
> pretty inefficient for IOs to files which are wholly uncached.
>
> So what I'm currently thinking about is an rwsem which works like this:
>
> O_DIRECT task:
> if i_pages is empty, take rwsem for read, recheck i_pages is empty, do IO,
> drop rwsem.

GUP does a page fault on a user buffer which is an mmapped region of
the same file. The page fault sets up for buffered IO, tries to take
the rwsem for write, and deadlocks.

Most of the schemes we come up with fall down at this point - you
can't hold a lock over gup that is also used in the buffered IO
path. That's why XFS (and now ext4) have the IOLOCK and MMAPLOCK for
truncation serialisation - we can't lock out both read()/write() and
mmap IO paths with the same lock...

> if i_pages is not empty, insert XA_LOCK_ENTRY, when IO complete, wake waitqueue for that (mapping, index).

I assume you really mean add a tag to the entry?

But this means there is no record of the direct IO being in flight
except for the rwsem being held across the IO.
Even if we did insert a flag to say "DIO in progress" and not rely
on the lock....

> buffered IO:
> if i_pages is empty, take rwsem for write, allocate page, insert page, drop rwsem.
> if i_pages is not empty, look up index, if entry is XA_LOCK_ENTRY sleep on
> waitqueue. otherwise proceed as now.

... we'll sleep on that flag in the page fault and deadlock anyway.

I'm pretty sure we explored this "record DIO state in the radix
tree" 2 or 3 years ago and came to the conclusion that it didn't
work for reasons like the above. i.e. it doesn't solve the problems
we currently have with locking and serialisation between DIO and
mmap...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
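
For readers following the thread, the rwsem deadlock described in this message can be modelled in a few lines of userspace Python. This is only a rough sketch, not kernel code: ToyRwsem, direct_io() and buffered_fault() are hypothetical names made up for illustration, and the lock uses non-blocking "try" operations so that the model reports "would block" where a real blocking rwsem would hang. Since the only holder the faulting path would be waiting on is the task itself, "would block" here stands for a self-deadlock.

```python
class ToyRwsem:
    """Toy, non-blocking reader/writer lock (hypothetical; not the kernel rwsem)."""

    def __init__(self):
        self.readers = 0
        self.writer = False

    def try_read(self):
        # Readers may share the lock unless a writer holds it.
        if self.writer:
            return False
        self.readers += 1
        return True

    def try_write(self):
        # A writer needs the lock exclusively: no readers, no writer.
        if self.writer or self.readers:
            return False
        self.writer = True
        return True

    def release_read(self):
        self.readers -= 1

    def release_write(self):
        self.writer = False


def buffered_fault(sem):
    """Model of the page-fault path: buffered IO wants the lock for write."""
    if not sem.try_write():
        # A real blocking lock would sleep here.
        return "would block"
    sem.release_write()
    return "ok"


def direct_io(sem, buffer_is_mmap_of_same_file):
    """Model of the O_DIRECT path: hold the lock for read across the IO."""
    if not sem.try_read():
        return "would block"
    try:
        if buffer_is_mmap_of_same_file:
            # GUP faults on the user buffer and re-enters the buffered
            # path while this task still holds the lock for read.
            return buffered_fault(sem)
        return "ok"
    finally:
        sem.release_read()


if __name__ == "__main__":
    print(direct_io(ToyRwsem(), buffer_is_mmap_of_same_file=True))
    print(direct_io(ToyRwsem(), buffer_is_mmap_of_same_file=False))
```

The model only covers the "i_pages is empty" branch of the proposed scheme, where the O_DIRECT task takes the lock for read and the buffered path takes it for write.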