Date: Fri, 14 Aug 2020 09:14:11 +0100
From: "hch@infradead.org"
To: Damien Le Moal
Cc: "hch@infradead.org", Kanchan Joshi, Jens Axboe, Pavel Begunkov,
    Kanchan Joshi, "viro@zeniv.linux.org.uk", "bcrl@kvack.org",
    Matthew Wilcox, "linux-fsdevel@vger.kernel.org",
    "linux-kernel@vger.kernel.org", "linux-aio@kvack.org",
    "io-uring@vger.kernel.org", "linux-block@vger.kernel.org",
    "linux-api@vger.kernel.org", SelvaKumar S, Nitesh Shetty,
    Javier Gonzalez, Johannes Thumshirn, Naohiro Aota
Subject: Re: [PATCH v4 6/6] io_uring: add support for zone-append
Message-ID: <20200814081411.GA16943@infradead.org>
References: <20200731064526.GA25674@infradead.org>
 <20200731091416.GA29634@infradead.org>
 <20200731094135.GA4104@infradead.org>
 <20200731125110.GA11500@infradead.org>

On Wed, Aug 05, 2020 at 07:35:28AM +0000, Damien Le Moal wrote:
> > the write pointer. The only interesting addition is that we also want
> > to report where we wrote. So I'd rather have RWF_REPORT_OFFSET or so.
>
> That works for me. But that rules out having the same interface for raw block
> devices since O_APPEND has no meaning in that case. So for raw block devices, it
> will have to be through zonefs. That works for me, and I think it was your idea
> all along. Can you confirm, please?

Yes. I don't think raw syscall-level access to the zone append
primitive makes sense. Either use zonefs for a file-like API, or use
the NVMe pass-through interface for 100% raw access.

> > - take the exclusive per-inode (zone) lock and just issue either normal
> > writes or zone append at your choice, relying on the lock to
> > serialize other writers. For the async case this means we need a
> > lock that can be released in a different context than it was acquired,
> > which is a little ugly but can be done.
>
> Yes, that would be possible. But likely, this will also need calls to
> inode_dio_wait() to avoid ending up with a mix of regular write and zone append
> writes in flight (which likely would result in the regular write failing, as the
> zone append writes would go straight to the device without waiting for the zone
> write lock like regular writes do).

inode_dio_wait() is a really bad implementation of almost a lock. I've
started some work, which I still need to finish, to replace it with
proper non-owner rwsems (or even the range locks Dave has been looking
into).

>
> This all sounds sensible to me. One last point though, specific to zonefs: if the
> user opens a zone file with O_APPEND, I do want to have that necessarily mean
> "use zone append". And the same for RWF_REPORT_OFFSET. The point here is that
> both O_APPEND and RWF_REPORT_OFFSET can be used with both regular writes and
> zone append writes, but neither of them actually clearly specifies whether the
> application/user tolerates writing data to disk in a different order than the
> issuing order... So another flag to indicate "atomic out-of-order writes" (==
> zone append)?

O_APPEND pretty much implies out of order, as there is no way for an
application to know which thread wins the race to write the next chunk.
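
For the "100% raw access" path mentioned above, a rough sketch of what a
raw zone append could look like through the NVMe pass-through ioctl is
below. Treat it as an illustration only: the opcode and CDW layout follow
my reading of the NVMe ZNS spec and linux/nvme_ioctl.h, the 4k logical
block size is an assumption, and zone_append() is a made-up helper, not
anything that exists in the kernel or nvme-cli.

    /* Hedged sketch: raw Zone Append via the NVMe pass-through ioctl. */
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/nvme_ioctl.h>

    #define NVME_ZNS_ZONE_APPEND 0x7d   /* ZNS I/O command set opcode */

    /* Issue a zone append to the zone starting at @zslba; on success the
     * LBA the controller actually wrote to is returned in *written_lba. */
    int zone_append(int fd, uint32_t nsid, uint64_t zslba,
                    void *buf, uint32_t nblocks, uint64_t *written_lba)
    {
        struct nvme_passthru_cmd64 cmd;

        memset(&cmd, 0, sizeof(cmd));
        cmd.opcode   = NVME_ZNS_ZONE_APPEND;
        cmd.nsid     = nsid;
        cmd.addr     = (uint64_t)(uintptr_t)buf;
        cmd.data_len = nblocks * 4096;        /* assumes 4k logical blocks */
        cmd.cdw10    = zslba & 0xffffffff;    /* zone start LBA, low  */
        cmd.cdw11    = zslba >> 32;           /* zone start LBA, high */
        cmd.cdw12    = nblocks - 1;           /* NLB field is 0-based */

        if (ioctl(fd, NVME_IOCTL_IO64_CMD, &cmd) < 0)
            return -1;

        /* Completion DW0/DW1 report where the data actually landed. */
        *written_lba = cmd.result;
        return 0;
    }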
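
As an aside on the "lock that can be released in a different context"
point: a plain kernel semaphore has no owner, so it can be taken at
submission time and released from the bio completion handler. The sketch
below is purely illustrative; struct my_zone, zone_submit_write() and
zone_write_endio() are invented names and not the zonefs or block-layer
code under discussion.

    /* Hedged sketch: serialize zone writers with an ownerless semaphore
     * taken at submit time and released at I/O completion. */
    #include <linux/semaphore.h>
    #include <linux/bio.h>
    #include <linux/errno.h>

    struct my_zone {
        struct semaphore wp_sem;    /* sema_init(&wp_sem, 1) at setup */
    };

    static void zone_write_endio(struct bio *bio)
    {
        struct my_zone *zone = bio->bi_private;

        /* Completion runs in interrupt context, not the submitter's;
         * up() is fine here because semaphores track no owner. */
        up(&zone->wp_sem);
        bio_put(bio);
    }

    static int zone_submit_write(struct my_zone *zone, struct bio *bio)
    {
        if (down_interruptible(&zone->wp_sem))
            return -EINTR;

        bio->bi_private = zone;
        bio->bi_end_io = zone_write_endio;
        submit_bio(bio);
        return 0;
    }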
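
And to make the last point concrete: with O_APPEND every write() is
appended atomically at the then-current end of file, but the caller is
never told which offset it got, so two racing threads cannot tell whose
chunk landed first. A minimal userspace demonstration (file name and
messages are arbitrary):

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int fd;

    static void *appender(void *arg)
    {
        const char *msg = arg;

        /* The kernel picks the offset; write() only reports the count. */
        if (write(fd, msg, strlen(msg)) < 0)
            perror("write");
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        fd = open("append-test.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0)
            return 1;

        pthread_create(&t1, NULL, appender, "chunk from thread 1\n");
        pthread_create(&t2, NULL, appender, "chunk from thread 2\n");
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* Neither thread knows its offset without re-reading the file,
         * which is what an RWF_REPORT_OFFSET-style flag would address. */
        close(fd);
        return 0;
    }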