From: Dan Williams
Date: Wed, 26 Feb 2020 09:54:12 -0800
Subject: Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
In-Reply-To: <20200226172034.GV10728@quack2.suse.cz>
To: Jan Kara
Cc: Jonathan Halliday, Jeff Moyer, Christoph Hellwig, Dave Chinner,
    "Weiny, Ira", Linux Kernel Mailing List, Alexander Viro,
    "Darrick J. Wong", "Theodore Y. Ts'o", linux-ext4, linux-xfs,
    linux-fsdevel

Ts'o" , linux-ext4 , linux-xfs , linux-fsdevel Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Feb 26, 2020 at 9:20 AM Jan Kara wrote: > > On Wed 26-02-20 08:46:42, Dan Williams wrote: > > On Wed, Feb 26, 2020 at 1:29 AM Jonathan Halliday > > wrote: > > > > > > > > > Hi All > > > > > > I'm a middleware developer, focused on how Java (JVM) workloads can > > > benefit from app-direct mode pmem. Initially the target is apps that > > > need a fast binary log for fault tolerance: the classic database WAL use > > > case; transaction coordination systems; enterprise message bus > > > persistence and suchlike. Critically, there are cases where we use log > > > based storage, i.e. it's not the strict 'read rarely, only on recovery' > > > model that a classic db may have, but more of a 'append only, read many > > > times' event stream model. > > > > > > Think of the log oriented data storage as having logical segments (let's > > > implement them as files), of which the most recent is being appended to > > > (read_write) and the remaining N-1 older segments are full and sealed, > > > so effectively immutable (read_only) until discarded. The tail segment > > > needs to be in DAX mode for optimal write performance, as the size of > > > the append may be sub-block and we don't want the overhead of the kernel > > > call anyhow. So that's clearly a good fit for putting on a DAX fs mount > > > and using mmap with MAP_SYNC. > > > > > > However, we want fast read access into the segments, to retrieve stored > > > records. The small access index can be built in volatile RAM (assuming > > > we're willing to take the startup overhead of a full file scan at > > > recovery time) but the data itself is big and we don't want to move it > > > all off pmem. Which means the requirements are now different: we want > > > the O/S cache to pull hot data into fast volatile RAM for us, which DAX > > > explicitly won't do. Effectively a poor man's 'memory mode' pmem, rather > > > than app-direct mode, except here we're using the O/S rather than the > > > hardware memory controller to do the cache management for us. > > > > > > Currently this requires closing the full (read_write) file, then copying > > > it to a non-DAX device and reopening it (read_only) there. Clearly > > > that's expensive and rather tedious. Instead, I'd like to close the > > > MAP_SYNC mmap, then, leaving the file where it is, reopen it in a mode > > > that will instead go via the O/S cache in the traditional manner. Bonus > > > points if I can do it over non-overlapping ranges in a file without > > > closing the DAX mode mmap, since then the segments are entirely logical > > > instead of needing separate physical files. > > > > Hi John, > > > > IIRC we chatted about this at PIRL, right? > > > > At the time it sounded more like mixed mode dax, i.e. dax writes, but > > cached reads. To me that's an optimization to optionally use dax for > > direct-I/O writes, with its existing set of page-cache coherence > > warts, and not a capability to dynamically switch the dax-mode. > > mmap+MAP_SYNC seems the wrong interface for this. This writeup > > mentions bypassing kernel call overhead, but I don't see how a > > dax-write syscall is cheaper than an mmap syscall plus fault. If > > direct-I/O to a dax capable file bypasses the block layer, isn't that > > about the maximum of kernel overhead that can be cut out of this use > > case? 
> >
> > Hi John,
> >
> > IIRC we chatted about this at PIRL, right?
> >
> > At the time it sounded more like mixed-mode dax, i.e. dax writes but
> > cached reads. To me that's an optimization to optionally use dax for
> > direct-I/O writes, with its existing set of page-cache coherence
> > warts, and not a capability to dynamically switch the dax mode.
> > mmap+MAP_SYNC seems the wrong interface for this. This writeup
> > mentions bypassing kernel call overhead, but I don't see how a
> > dax-write syscall is cheaper than an mmap syscall plus fault. If
> > direct-I/O to a dax-capable file bypasses the block layer, isn't
> > that about the maximum of kernel overhead that can be cut out of
> > this use case? Otherwise MAP_SYNC is a facility to achieve efficient
> > sub-block update-in-place writes, not append writes.
>
> Well, even for appends you'll pay the cost only once per page (or
> maybe even once per huge page) when using MAP_SYNC. With a syscall
> you'll pay once per write. So although it would be good to check real
> numbers, the design isn't nonsensical to me.

True, Jonathan, how many writes per page are we talking about in this
case?
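
To put rough numbers on that trade-off, a back-of-envelope sketch; the
record size, page size, and append count below are made-up placeholders,
not measurements from the thread:

	#include <stdio.h>

	int main(void)
	{
		const double rec = 128;		/* bytes per append; assumed */
		const double page = 4096;	/* 4 KiB; 2 MiB with huge pages */
		const double n = 1e6;		/* number of appends */

		/* mmap+MAP_SYNC faults only when a new page is touched */
		printf("page faults: %.0f (one per %.0f appends)\n",
		       n * rec / page, page / rec);
		/* write(2) pays a syscall on every append */
		printf("syscalls:    %.0f (one per append)\n", n);
		return 0;
	}

With these assumed numbers the fault path amortizes to one fault per 32
appends versus one syscall per append, and 2 MiB huge pages would improve
the amortization by another factor of 512, which is why the actual
writes-per-page ratio decides the comparison.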