Subject: Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
From: Jonathan Halliday
To: Jeff Moyer, Christoph Hellwig
Cc: Dave Chinner, ira.weiny@intel.com, linux-kernel@vger.kernel.org, Alexander Viro, "Darrick J. Wong", Dan Williams, "Theodore Y. Ts'o", Jan Kara, linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Date: Wed, 26 Feb 2020 09:28:57 +0000
References: <20200221004134.30599-1-ira.weiny@intel.com> <20200221004134.30599-8-ira.weiny@intel.com> <20200221174449.GB11378@lst.de> <20200221224419.GW10776@dread.disaster.area> <20200224175603.GE7771@lst.de> <20200225000937.GA10776@dread.disaster.area> <20200225173633.GA30843@lst.de>

Hi All

I'm a middleware developer, focused on how Java (JVM) workloads can benefit from app-direct mode pmem.
Initially the target is apps that need a fast binary log for fault tolerance: the classic database WAL use case; transaction coordination systems; enterprise message bus persistence and suchlike. Critically, there are cases where we use log-based storage, i.e. it's not the strict 'read rarely, only on recovery' model that a classic db may have, but more of an 'append only, read many times' event stream model.

Think of the log-oriented data storage as having logical segments (let's implement them as files), of which the most recent is being appended to (read_write) and the remaining N-1 older segments are full and sealed, so effectively immutable (read_only) until discarded.

The tail segment needs to be in DAX mode for optimal write performance, as the size of the append may be sub-block and we don't want the overhead of the kernel call anyhow. So that's clearly a good fit for putting on a DAX fs mount and using mmap with MAP_SYNC (see the open/map sketch below).

However, we want fast read access into the segments, to retrieve stored records. The small access index can be built in volatile RAM (assuming we're willing to take the startup overhead of a full file scan at recovery time), but the data itself is big and we don't want to move it all off pmem. Which means the requirements are now different: we want the O/S cache to pull hot data into fast volatile RAM for us, which DAX explicitly won't do. Effectively a poor man's 'memory mode' pmem, rather than app-direct mode, except here we're using the O/S rather than the hardware memory controller to do the cache management for us.

Currently this requires closing the full (read_write) file, copying it to a non-DAX device and reopening it (read_only) there. Clearly that's expensive and rather tedious. Instead, I'd like to close the MAP_SYNC mmap and then, leaving the file where it is, reopen it in a mode that will go via the O/S cache in the traditional manner. Bonus points if I can do it over non-overlapping ranges in a file without closing the DAX mode mmap, since then the segments are entirely logical instead of needing separate physical files.

I note a comment below regarding a per-directory setting, but don't have the background to fully understand what's being suggested. However, I'll note here that I can live with a per-directory granularity, as relinking a file into a new dir is a constant time operation, whilst the move described above isn't. So if a per-directory granularity is easier than a per-file one, that's fine, though as a person with only passing knowledge of filesystem design I don't see how having multiple links to a file can work cleanly in that case.

Hope that helps.

Jonathan

P.S. I'll cheekily take the opportunity of having your attention to tack on one minor gripe about the current system: the only way to know if an mmap with MAP_SYNC will work is to try it and catch the error. Which would be reasonable if it were free of side effects. However, the process requires first expanding the file to at least the size of the desired map, which is done non-atomically, i.e. is user visible. There are thus nasty race conditions in the cleanup, where after a failed mmap attempt (e.g. the device doesn't support DAX) we try to shrink the file back to its original size, but something else has already opened it at its new, larger size (see the probe sketch below). This is not theoretical: I got caught by it whilst adapting some of our middleware to use pmem. Therefore, some way to probe the file path for its capability would be nice, much the same as I can e.g. inspect file permissions to (more or less) evaluate whether I can write a file without actually mutating it.

Thanks!
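For concreteness, here is a minimal userspace sketch of the tail-segment setup described above. The file name and segment size are illustrative, it assumes the file lives on a DAX-capable mount, and depending on glibc version MAP_SHARED_VALIDATE / MAP_SYNC may need <linux/mman.h> rather than just _GNU_SOURCE:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEGMENT_SIZE (64UL << 20)	/* illustrative 64 MiB tail segment */

int main(void)
{
	int fd = open("segment.0001", O_RDWR | O_CREAT, 0600);
	if (fd < 0) { perror("open"); return 1; }

	/* the mapping needs backing file space before it can be used */
	if (ftruncate(fd, SEGMENT_SIZE) < 0) { perror("ftruncate"); return 1; }

	/* MAP_SYNC is only accepted together with MAP_SHARED_VALIDATE; the
	 * call fails with EOPNOTSUPP if the file isn't on a DAX-capable
	 * device, which is the probe-by-error-path discussed in the P.S. */
	void *log = mmap(NULL, SEGMENT_SIZE, PROT_READ | PROT_WRITE,
			 MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (log == MAP_FAILED) { perror("mmap(MAP_SYNC)"); return 1; }

	/* appends are now plain stores plus CPU cache flushes; no write()
	 * or fsync() is needed for the appended data to be durable */

	munmap(log, SEGMENT_SIZE);
	close(fd);
	return 0;
}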
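And here, roughly, is the grow/probe/shrink sequence the P.S. complains about, with the racy window marked. supports_map_sync() is just an illustrative helper name, not an existing API, and error handling is trimmed:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* returns 1 if MAP_SYNC works on fd, 0 if not, -1 on error */
static int supports_map_sync(int fd, off_t probe_len)
{
	struct stat st;
	void *p;
	int ok;

	if (fstat(fd, &st) < 0)
		return -1;

	/* grow the file so the probe mapping has backing store ... */
	if (st.st_size < probe_len && ftruncate(fd, probe_len) < 0)
		return -1;

	p = mmap(NULL, probe_len, PROT_READ | PROT_WRITE,
		 MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	ok = (p != MAP_FAILED);
	if (ok)
		munmap(p, probe_len);

	/* ... and shrink it back afterwards.  Between the two ftruncate()
	 * calls the enlarged size is visible to every other opener, so the
	 * "undo" below can clobber data appended by a concurrent writer;
	 * that is the race described above. */
	if (st.st_size < probe_len)
		ftruncate(fd, st.st_size);

	return ok;
}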
On 25/02/2020 19:37, Jeff Moyer wrote:
> Christoph Hellwig writes:
>
>> And my point is that if we ensure S_DAX can only be checked if there
>> are no blocks on the file, is is fairly easy to provide the same
>> guarantee.  And I've not heard any argument that we really need more
>> flexibility than that.  In fact I think just being able to change it
>> on the parent directory and inheriting the flag might be more than
>> plenty, which would lead to a very simple implementation without any
>> of the crazy overhead in this series.
>
> I know of one user who had at least mentioned it to me, so I cc'd him.
> Jonathan, can you describe your use case for being able to change a
> file between dax and non-dax modes?  Or, if I'm misremembering, just
> correct me?
>
> Thanks!
> Jeff