Date: Thu, 25 May 2023 20:39:32 -0400
From: Kent Overstreet
To: Andreas Grünbacher
Cc: Christoph Hellwig, Jan Kara, cluster-devel@redhat.com,
    "Darrick J. Wong", linux-kernel@vger.kernel.org, dhowells@redhat.com,
    linux-bcachefs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    Kent Overstreet
Subject: Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
References: <20230509165657.1735798-1-kent.overstreet@linux.dev>
            <20230509165657.1735798-7-kent.overstreet@linux.dev>
            <20230510010737.heniyuxazlprrbd6@quack3>
            <20230523133431.wwrkjtptu6vqqh5e@quack3>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, May 26, 2023 at 02:05:26AM +0200, Andreas Grünbacher wrote:
> Oh, it's just that gfs2 uses one dlm lock per inode to control access
> to that inode. In the code, this is called the "inode glock" ---
> glocks being an abstraction above dlm locks --- but it boils down to
> dlm locks in the end. An additional layer of locking will only work
> correctly if all cluster nodes use the new locks consistently, so old
> cluster nodes will become incompatible. Those kinds of changes are
> hard.
>
> But the additional lock taking would also hurt performance, forever,
> and I'd really like to avoid taking that hit.
>
> It may not be obvious to everyone, but allowing page faults during
> reads and writes (i.e., while holding dlm locks) can lead to
> distributed deadlocks. We cannot just pretend to be a local
> filesystem.

Ah, gotcha.
Same basic issue then: dio -> page fault means you're taking two DLM
locks simultaneously, so without lock ordering you get deadlock.

So promoting this to the VFS won't be useful to you (because the VFS
lock will be a local lock, not your DLM lock). Do you have trylock()
and a defined lock ordering you can check for your glocks (inode
number)?

Here's the new and expanded commit message, in case my approach is of
interest to you:

commit 32858d0fb90b503c8c39e899ad5155e44639d3b5
Author: Kent Overstreet
Date:   Wed Oct 16 15:03:50 2019 -0400

    sched: Add task_struct->faults_disabled_mapping

There has been a long standing page cache coherence bug with direct IO.
This provides part of a mechanism to fix it, currently just used by
bcachefs but potentially worth promoting to the VFS.

Direct IO evicts the range of the pagecache being read or written to:
for reads, we need dirty pages to be written to disk, so that the read
doesn't return stale data; for writes, we need to evict that range of
the pagecache so that it's not stale after the write completes.

However, without a locking mechanism to prevent those pages from being
re-added to the pagecache - by a buffered read or page fault - page
cache inconsistency is still possible.

This isn't necessarily just an issue for userspace when they're playing
games; filesystems may hang arbitrary state off the pagecache, and so
page cache inconsistency may cause real filesystem bugs, depending on
the filesystem. This is less of an issue for iomap based filesystems,
but e.g. buffer heads cache disk block mappings (!) and attach them to
the pagecache, and bcachefs attaches disk reservations to pagecache
pages.
This issue has been hard to fix, because:

- we need to add a lock (henceforth called pagecache_add_lock), which
  would be held for the duration of the direct IO
- page faults add pages to the page cache, and thus need to take the
  same lock
- dio -> gup -> page fault can thus deadlock

And we cannot enforce a lock ordering with this lock, since userspace
will be controlling the lock ordering (via the fd and buffer arguments
to direct IOs), so we need a different method of deadlock avoidance.

We need to tell the page fault handler that we're already holding a
pagecache_add_lock, and since plumbing it through the entire gup() path
would be highly impractical this adds a field to task_struct.

Then the full method is:

- in the dio path, when we take the first pagecache_add_lock, note the
  mapping in task_struct
- in the page fault handler, if faults_disabled_mapping is set, we
  check if it's the same mapping as the one we're taking a page fault
  for, and if so return an error.

  Then we check lock ordering: if there's a lock ordering violation and
  trylock fails, we'll have to cycle the locks and return an error that
  tells the DIO path to retry: faults_disabled_mapping is also used for
  signalling "locks were dropped, please retry".

Also relevant to this patch: mapping->invalidate_lock.
mapping->invalidate_lock provides most of the required semantics - it's
used by truncate/fallocate to block pages being added to the pagecache.
However, since it's a rwsem, direct IOs would need to take the write
side in order to block page cache adds, and would then be exclusive
with each other - we'll need a new type of lock to pair with this
approach.

Signed-off-by: Kent Overstreet
Cc: Jan Kara
Cc: Darrick J. Wong
Cc: linux-fsdevel@vger.kernel.org
Cc: Andreas Grünbacher