From: Dan Williams
Date: Wed, 26 Feb 2020 08:46:42 -0800
Subject: Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
To: Jonathan Halliday
Cc: Jeff Moyer, Christoph Hellwig, Dave Chinner, "Weiny, Ira",
 Linux Kernel Mailing List, Alexander Viro, "Darrick J. Wong",
 "Theodore Y. Ts'o", Jan Kara, linux-ext4, linux-xfs, linux-fsdevel
References: <20200221004134.30599-1-ira.weiny@intel.com>
 <20200221004134.30599-8-ira.weiny@intel.com>
 <20200221174449.GB11378@lst.de>
 <20200221224419.GW10776@dread.disaster.area>
 <20200224175603.GE7771@lst.de>
 <20200225000937.GA10776@dread.disaster.area>
 <20200225173633.GA30843@lst.de>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Feb 26, 2020 at 1:29 AM Jonathan Halliday wrote:
>
> Hi All
>
> I'm a middleware developer, focused on how Java (JVM) workloads can
> benefit from app-direct mode pmem. Initially the target is apps that
> need a fast binary log for fault tolerance: the classic database WAL
> use case, transaction coordination systems, enterprise message bus
> persistence, and suchlike. Critically, there are cases where we use
> log-based storage, i.e.
> it's not the strict 'read rarely, only on recovery' model that a
> classic db may have, but more of an 'append only, read many times'
> event stream model.
>
> Think of the log-oriented data storage as having logical segments
> (let's implement them as files), of which the most recent is being
> appended to (read_write) and the remaining N-1 older segments are full
> and sealed, so effectively immutable (read_only) until discarded. The
> tail segment needs to be in DAX mode for optimal write performance, as
> the size of the append may be sub-block and we don't want the overhead
> of the kernel call anyhow. So that's clearly a good fit for putting it
> on a DAX fs mount and using mmap with MAP_SYNC.
>
> However, we want fast read access into the segments, to retrieve
> stored records. The small access index can be built in volatile RAM
> (assuming we're willing to take the startup overhead of a full file
> scan at recovery time), but the data itself is big and we don't want
> to move it all off pmem. Which means the requirements are now
> different: we want the OS cache to pull hot data into fast volatile
> RAM for us, which DAX explicitly won't do. Effectively a poor man's
> 'memory mode' pmem, rather than app-direct mode, except here we're
> using the OS rather than the hardware memory controller to do the
> cache management for us.
>
> Currently this requires closing the full (read_write) file, copying
> it to a non-DAX device, and reopening it (read_only) there. Clearly
> that's expensive and rather tedious. Instead, I'd like to close the
> MAP_SYNC mmap and then, leaving the file where it is, reopen it in a
> mode that will instead go via the OS cache in the traditional manner.
> Bonus points if I can do it over non-overlapping ranges in a file
> without closing the DAX-mode mmap, since then the segments are
> entirely logical instead of needing separate physical files.

Hi John,

IIRC we chatted about this at PIRL, right?
At the time it sounded more like mixed-mode DAX, i.e. DAX writes but
cached reads. To me that's an optimization to optionally use DAX for
direct-I/O writes, with its existing set of page-cache coherence warts,
and not a capability to dynamically switch the DAX mode.

mmap+MAP_SYNC seems the wrong interface for this. This writeup mentions
bypassing kernel call overhead, but I don't see how a DAX write syscall
is cheaper than an mmap syscall plus fault. If direct I/O to a
DAX-capable file bypasses the block layer, isn't that about the maximum
of kernel overhead that can be cut out of this use case? Otherwise,
MAP_SYNC is a facility to achieve efficient sub-block update-in-place
writes, not append writes.
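For anyone following the thread, the tail-segment append path Jonathan describes can be sketched roughly as below. This is an illustrative sketch, not code from the patch series: `log_append` is a hypothetical helper, the fallback `#define` values are the Linux UAPI flag values, and `msync()` stands in for the clwb+sfence flush (libpmem's `pmem_persist()`) that a real pmem log would use on the MAP_SYNC path.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Older libcs may not expose these flags even on kernels that do. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/*
 * Append `len` bytes at offset `off` in the tail segment.  On a DAX
 * mount, MAP_SHARED_VALIDATE | MAP_SYNC guarantees the page tables are
 * only populated once the block allocation is durable, so flushing the
 * CPU caches is enough to persist the store.  On a non-DAX filesystem
 * the kernel rejects MAP_SYNC with EOPNOTSUPP, and we fall back to a
 * plain shared mapping where msync() is the persistence point.
 */
static int log_append(int fd, off_t off, const void *rec, size_t len)
{
    long pg = sysconf(_SC_PAGESIZE);
    off_t map_off = off & ~((off_t)pg - 1);  /* mmap offset must be page-aligned */
    size_t delta = (size_t)(off - map_off);
    size_t map_len = delta + len;

    char *base = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                      MAP_SHARED_VALIDATE | MAP_SYNC, fd, map_off);
    if (base == MAP_FAILED)                  /* e.g. EOPNOTSUPP: not a DAX mount */
        base = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, map_off);
    if (base == MAP_FAILED)
        return -1;

    memcpy(base + delta, rec, len);

    /* Portable persistence point: on a MAP_SYNC mapping this reduces to
     * cache flushing; on the fallback it forces page-cache writeback. */
    if (msync(base, map_len, MS_SYNC) != 0) {
        munmap(base, map_len);
        return -1;
    }
    return munmap(base, map_len);
}
```

On a non-DAX filesystem the fallback branch runs, which is also what makes the sketch testable on an ordinary machine; the MAP_SYNC branch only engages on an ext4/xfs DAX mount.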