Date: Fri, 31 Jul 2020 10:41:35 +0100
From: "hch@infradead.org"
To: Damien Le Moal
Cc: "hch@infradead.org", Kanchan Joshi, Jens Axboe, Pavel Begunkov,
	Kanchan Joshi, "viro@zeniv.linux.org.uk", "bcrl@kvack.org",
	Matthew Wilcox, "linux-fsdevel@vger.kernel.org",
	"linux-kernel@vger.kernel.org", "linux-aio@kvack.org",
	"io-uring@vger.kernel.org", "linux-block@vger.kernel.org",
	"linux-api@vger.kernel.org", SelvaKumar S, Nitesh Shetty,
	Javier Gonzalez, Johannes Thumshirn
Subject: Re: [PATCH v4 6/6] io_uring: add support for zone-append
Message-ID: <20200731094135.GA4104@infradead.org>
References: <20200731064526.GA25674@infradead.org>
	<20200731091416.GA29634@infradead.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jul 31, 2020 at 09:34:50AM +0000, Damien Le Moal wrote:
> Sync writes are done under the inode lock, so there cannot be other writers at
> the same time. And for the sync case, since the actual written offset is
> necessarily equal to the file size before the write, there is no need to report
> it (there is no system call that can report that anyway). For this sync case,
> the only change that the use of zone append introduces compared to regular
> writes is the potential for more short writes.
>
> Adding a flag for "report the actual offset for appending writes" is fine with
> me, but do you also mean to use this flag for driving zone append write vs
> regular writes in zonefs ?

Let's keep semantics and implementation separate. For the case where we
report the actual offset we need a size limitation and no short writes.
Anything with those semantics can be implemented using Zone Append
trivially in zonefs, and we don't even need the exclusive lock in that
case.
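The semantics being discussed here - an appending write that is capped in size, is never short, and reports the offset at which the data actually landed - can be sketched as a toy user-space model. All names below (zone_t, zone_append, MAX_APPEND_BYTES) are hypothetical illustrations, not kernel APIs:

```c
/* Toy model of the proposed semantics: an appending write that returns
 * the offset actually written.  Hypothetical names, not kernel code. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_APPEND_BYTES 4096  /* stand-in for zone_append_max_bytes */

typedef struct {
	uint64_t start;  /* zone start offset */
	uint64_t wp;     /* write pointer: next append lands here */
	uint64_t end;    /* zone end offset */
} zone_t;

/* Append len bytes; on success return the offset where the data landed.
 * The write is all-or-nothing (no short writes) and capped at
 * MAX_APPEND_BYTES, so a single returned offset fully describes it. */
static int64_t zone_append(zone_t *z, size_t len)
{
	if (len == 0 || len > MAX_APPEND_BYTES)
		return -1;            /* caller must split the I/O */
	if (z->wp + len > z->end)
		return -1;            /* zone full */
	uint64_t off = z->wp;
	z->wp += len;                 /* device advances the pointer */
	return (int64_t)off;
}
```

The size cap plus the no-short-write rule is what makes the single returned offset sufficient: the caller knows exactly which byte range the write covered without holding any lock across the I/O.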
But even without that flag, anything that holds an exclusive lock can at
least in theory be implemented using Zone Append; it just needs support
for submitting another request from the I/O completion handler of the
first. I just don't think it is worth it - with the exclusive lock we do
have access to the zone serialized, so a normal write works just fine,
both for the sync and the async case.

> The fcntl or ioctl for getting the max atomic write size would be fine too.
> Given that zonefs is very close to the underlying zoned drive, I was assuming
> that the application can simply consult the device sysfs zone_append_max_bytes
> queue attribute.

For zonefs we can, yes. But in many ways that is a lot more cumbersome
than having an API that works on the fd you want to write on.

> For regular file systems, this value would be used internally
> only. I do not really see how it can be useful to applications. Furthermore, the
> file system may have a hard time giving that information to the application
> depending on its underlying storage configuration (e.g. erasure
> coding/declustered RAID).

File systems might have all kinds of limits of their own (e.g. extent
sizes). And a good API that just works everywhere and is properly
documented is much better than heaps of cargo culted crap all over
applications.