Date: Wed, 19 Apr 2023 08:13:00 +1000
From: Dave Chinner
To: Bernd Schubert
Cc: Miklos Szeredi, Jens Axboe, "Darrick J.
 Wong", Christoph Hellwig, "io-uring@vger.kernel.org",
 "linux-ext4@vger.kernel.org", "linux-fsdevel@vger.kernel.org",
 "linux-xfs@vger.kernel.org", Dharmendra Singh
Subject: Re: [PATCH 1/2] fs: add FMODE_DIO_PARALLEL_WRITE flag
Message-ID: <20230418221300.GT3223426@dread.disaster.area>
References: <20230307172015.54911-2-axboe@kernel.dk>
 <20230412134057.381941-1-bschubert@ddn.com>
 <20230414153612.GB360881@frogsfrogsfrogs>

On Tue, Apr 18, 2023 at 12:55:40PM +0000, Bernd Schubert wrote:
> On 4/18/23 14:42, Miklos Szeredi wrote:
> > On Sat, 15 Apr 2023 at 15:15, Jens Axboe wrote:
> >
> >> Yep, that is pretty much it. If all writes to that inode are serialized
> >> by a lock on the fs side, then we'll get a lot of contention on that
> >> mutex. And since, originally, nothing supported async writes, everything
> >> would get punted to the io-wq workers. io_uring added per-inode hashing
> >> for this, so that any punt to io-wq of a write would get serialized.
> >>
> >> IOW, it's an efficiency thing, not a correctness thing.
> >
> > We could still get a performance regression if the majority of writes
> > still trigger the exclusive locking. The questions are:
> >
> > - how often does that happen in real life?
>
> Application depending? My personal opinion is that
> applications/developers knowing about uring would also know that they
> should set the right file size first.
> Like MPIIO is extending files
> persistently and it is hard to fix with all these different MPI stacks
> (I can try to notify mpich and mvapich developers). So best would be to
> document it somewhere in the uring man page that parallel extending
> files might have negative side effects?

There are relatively few applications running concurrent async
RWF_APPEND DIO writes. IIRC ScyllaDB was the first we came across a
few years ago. Apps that use RWF_APPEND for individual DIOs expect
that it doesn't cause performance anomalies.

These days XFS will run concurrent append DIO writes and it doesn't
serialise RWF_APPEND IO against other RWF_APPEND IOs. Avoiding data
corruption due to racing append IOs doing file extension has been
delegated to the userspace application similar to how we delegate
the responsibility for avoiding data corruption due to overlapping
concurrent DIO to userspace.

> > - how bad the performance regression would be?
>
> I can give it a try with fio and fallocate=none over fuse during the
> next days.

It depends on where the lock that triggers serialisation is, and how
bad the contention on it is. rwsems suck for write contention
because of the "spin on owner" "optimisations" for write locking and
long write holds that occur in the IO path. In general, it will be
no worse than using userspace threads to issue the exact same IO
pattern using concurrent sync IO.

> > Without first attempting to answer those questions, I'd be reluctant
> > to add FMODE_DIO_PARALLEL_WRITE to fuse.

I'd tag it with this anyway - for the majority of apps that are
doing concurrent DIO within EOF, shared locking is a big win. If
there's a corner case that apps trigger that is slow, deal with them
when they are reported....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
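
Illustrative sketch (not part of the original thread): the "set the
right file size first, then write concurrently within EOF" pattern
discussed above, issued from userspace threads using synchronous
positional writes. The function name write_blocks_within_eof and its
parameters are hypothetical. O_DIRECT is deliberately left out, since
it needs aligned buffers and filesystem support, so this only shows
the IO pattern and says nothing about locking behaviour or
performance on any particular filesystem.

```python
import os
import tempfile
import threading


def write_blocks_within_eof(path, n_workers=4, block_size=4096):
    """Size the file up front, then write one block per thread inside EOF.

    Hypothetical helper for illustration only; returns the final file size.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Set the final file size first so that no write extends the file.
        os.ftruncate(fd, n_workers * block_size)

        def worker(index):
            # Each thread owns a distinct, non-overlapping byte range;
            # avoiding overlap is the application's responsibility.
            data = bytes([ord("A") + index]) * block_size
            offset = index * block_size
            done = 0
            while done < block_size:
                # pwrite() may write fewer bytes than requested, so loop.
                done += os.pwrite(fd, data[done:], offset + done)

        threads = [
            threading.Thread(target=worker, args=(i,))
            for i in range(n_workers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    finally:
        os.close(fd)
    return n_workers * block_size


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "demo.bin")
        size = write_blocks_within_eof(target)
        print(size, os.path.getsize(target))
```

Because every write lands inside the preallocated size and the ranges
never overlap, no writer depends on another having extended the file
first. An appending variant would have to coordinate file extension
between writers itself.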