Date: Fri, 29 Mar 2024 18:03:21 -0700
From: Vadim Fedorenko
Subject: Re: [PATCH 19/26] netfs: New writeback implementation
To: David Howells, Christian Brauner, Jeff Layton, Gao Xiang,
    Dominique Martinet
Cc: Matthew Wilcox, Steve French, Marc Dionne, Paulo Alcantara,
    Shyam Prasad N, Tom Talpey, Eric Van Hensbergen, Ilya Dryomov,
    netfs@lists.linux.dev, linux-cachefs@redhat.com,
    linux-afs@lists.infradead.org, linux-cifs@vger.kernel.org,
    linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org,
    v9fs@lists.linux.dev, linux-erofs@lists.ozlabs.org,
    linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
    netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
    Latchesar Ionkov, Christian Schoenebeck
References: <20240328163424.2781320-1-dhowells@redhat.com>
    <20240328163424.2781320-20-dhowells@redhat.com>
In-Reply-To: <20240328163424.2781320-20-dhowells@redhat.com>

On 28/03/2024 16:34, David Howells wrote:
> The current netfslib writeback implementation creates writeback requests of
> contiguous folio data and then separately tiles subrequests over the space
> twice, once for the server and once for the cache. This creates a few
> issues:
>
> (1) Every time there's a discontiguity or a change between writing to only
>     one destination or writing to both, it must create a new request.
>     This makes it harder to do vectored writes.
>
> (2) The folios don't have the writeback mark removed until the end of the
>     request - and a request could be hundreds of megabytes.
>
> (3) In future, I want to support a larger cache granularity, which will
>     require aggregation of some folios that contain unmodified data (which
>     only need to go to the cache) and some which contain modifications
>     (which need to be uploaded and stored to the cache) - but, currently,
>     these are treated as discontiguous.
>
> There's also a move to get everyone to use writeback_iter() to extract
> writable folios from the pagecache. That said, currently writeback_iter()
> has some issues that make it less than ideal:
>
> (1) there's no way to cancel the iteration, even if you find a "temporary"
>     error that means the current folio and all subsequent folios are going
>     to fail;
>
> (2) there's no way to filter the folios being written back - something
>     that will impact Ceph with its ordered snap system;
>
> (3) and if you get a folio you can't immediately deal with (say you need
>     to flush the preceding writes), you are left with a folio hanging in
>     the locked state for the duration, when really we should unlock it and
>     relock it later.
>
> In this new implementation, I use writeback_iter() to pump folios,
> progressively creating two parallel, but separate streams and cleaning up
> the finished folios as the subrequests complete. Either or both streams
> can contain gaps, and the subrequests in each stream can be of variable
> size, don't need to align with each other and don't need to align with the
> folios.
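(Side note for anyone who hasn't looked at the new iterator yet: the
expected writeback_iter() calling pattern is roughly the loop below.
This is only a sketch; write_one_folio() is a made-up stand-in for
whatever per-folio writeback helper a filesystem provides.)

static int example_writepages(struct address_space *mapping,
			      struct writeback_control *wbc)
{
	struct folio *folio = NULL;
	int error = 0;

	/*
	 * writeback_iter() hands back one dirty folio at a time, locked
	 * and ready to write.  Passing the previous folio (and the error
	 * from writing it) back in lets the iterator finish that folio
	 * and advance; a NULL return ends the walk with the first error
	 * preserved in @error.
	 */
	while ((folio = writeback_iter(mapping, wbc, folio, &error)))
		error = write_one_folio(folio, wbc);

	return error;
}

The cancellation and filtering problems listed above follow directly
from this shape: the loop gives the caller no clean way to stop early,
and no way to skip folios before they come back locked.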
>
> Indeed, subrequests can cross folio boundaries, may cover several folios or
> a folio may be spanned by multiple subrequests, e.g.:
>
>          +---+---+-----+-----+---+----------+
> Folios:  |   |   |     |     |   |          |
>          +---+---+-----+-----+---+----------+
>
>            +------+------+     +----+----+
> Upload:    |      |      |.....|    |    |
>            +------+------+     +----+----+
>
>          +------+------+------+------+------+
> Cache:   |      |      |      |      |      |
>          +------+------+------+------+------+
>
> The progressive subrequest construction permits the algorithm to be
> preparing both the next upload to the server and the next write to the
> cache whilst the previous ones are already in progress. Throttling can be
> applied to control the rate of production of subrequests - and, in any
> case, we probably want to write them to the server in ascending order,
> particularly if the file will be extended.
>
> Content crypto can also be prepared at the same time as the subrequests and
> run asynchronously, with the prepped requests being stalled until the
> crypto catches up with them. This might also be useful for transport
> crypto, but that happens at a lower layer, so probably would be harder to
> pull off.
>
> The algorithm is split into three parts:
>
> (1) The issuer. This walks through the data, packaging it up, encrypting
>     it and creating subrequests. The part of this that generates
>     subrequests only deals with file positions and spans and so is usable
>     for DIO/unbuffered writes as well as buffered writes.
>
> (2) The collector. This asynchronously collects completed subrequests,
>     unlocks folios, frees crypto buffers and performs any retries. This
>     runs in a work queue so that the issuer can return to the caller for
>     writeback (so that the VM can have its kswapd thread back) or async
>     writes.
>
> (3) The retryer. This pauses the issuer, waits for all outstanding
>     subrequests to complete and then goes through the failed subrequests
>     to reissue them. This may involve reprepping them (with cifs, the
>     credits must be renegotiated, and a subrequest may need splitting),
>     and doing RMW for content crypto if there's a conflicting change on
>     the server.
>
> [!] Note that some of the functions are prefixed with "new_" to avoid
> clashes with existing functions. These will be renamed in a later patch
> that cuts over to the new algorithm.
>
> Signed-off-by: David Howells
> cc: Jeff Layton
> cc: Eric Van Hensbergen
> cc: Latchesar Ionkov
> cc: Dominique Martinet
> cc: Christian Schoenebeck
> cc: Marc Dionne
> cc: v9fs@lists.linux.dev
> cc: linux-afs@lists.infradead.org
> cc: netfs@lists.linux.dev
> cc: linux-fsdevel@vger.kernel.org

[..snip..]

> +/*
> + * Begin a write operation for writing through the pagecache.
> + */
> +struct netfs_io_request *new_netfs_begin_writethrough(struct kiocb *iocb, size_t len)
> +{
> +	struct netfs_io_request *wreq = NULL;
> +	struct netfs_inode *ictx = netfs_inode(file_inode(iocb->ki_filp));
> +
> +	mutex_lock(&ictx->wb_lock);
> +
> +	wreq = netfs_create_write_req(iocb->ki_filp->f_mapping, iocb->ki_filp,
> +				      iocb->ki_pos, NETFS_WRITETHROUGH);
> +	if (IS_ERR(wreq))
> +		mutex_unlock(&ictx->wb_lock);
> +
> +	wreq->io_streams[0].avail = true;

If IS_ERR(wreq) is true, the branch above unlocks the mutex but execution
falls through without returning, so this dereference of an error pointer
is invalid.

> +	trace_netfs_write(wreq, netfs_write_trace_writethrough);

And I'm not sure we still need the trace call in the error case either.

> +	return wreq;
> +}
> +

[..snip..]
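Perhaps an early return in the error branch would cover both points.
Something along these lines (untested sketch; only the error path
changes):

	wreq = netfs_create_write_req(iocb->ki_filp->f_mapping, iocb->ki_filp,
				      iocb->ki_pos, NETFS_WRITETHROUGH);
	if (IS_ERR(wreq)) {
		/* Drop wb_lock and hand the ERR_PTR straight back. */
		mutex_unlock(&ictx->wb_lock);
		return wreq;
	}

	wreq->io_streams[0].avail = true;
	trace_netfs_write(wreq, netfs_write_trace_writethrough);
	return wreq;

That keeps the wb_lock pairing obvious and guarantees the tracepoint
only ever sees a valid request.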