Received: by 2002:a05:6a10:5bc5:0:0:0:0 with SMTP id os5csp478316pxb; Tue, 19 Oct 2021 06:51:30 -0700 (PDT) X-Google-Smtp-Source: ABdhPJw/B5apcIWACaaKJqcw6ADz0bZxOKY3At5kOzKn5hskJPnWr0jyLJ7EawoWKTl1ivoMRbW/ X-Received: by 2002:a17:906:3510:: with SMTP id r16mr39286018eja.209.1634651489945; Tue, 19 Oct 2021 06:51:29 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1634651489; cv=none; d=google.com; s=arc-20160816; b=mecz7ON8TjE69S0ps66iGWtRkoDdkcWL5qGyTMHk6PDtxFfPE3A/RO9Hw/GTN0lJpV xaM3yCez9kbMEbqTNbyJicWjQFZDP008mBL3+OwvMKUifL+Z67TIAT30LR7x22/Wk3cA SvxjrHQsJ7dZhv2HSWm4yBqtSnL8Zxu561Ji2WMlBy1idxsCSPAMcm5xMGSI/hCqTetq Si84T7mYMy18mGGsFn3AV9ow3Mu6A1aN/bZ9refRaW9Ggw5Z8C1zVK3DJ3+rLtJQ4Cy0 nUpjICASKWXf7++Dc7uK1HVQ59zkpbFYOJ5X94OjcZfgaY84oxEZumCUkozv4tIrtRtg atQg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=MGsX/Ml84CMC+OjHlH5RDMfLJcW/FIl8Qk/LyCJCDBE=; b=0lD6IDqCEz/MprNyPgnPc89cxxtiA4KHxiYEJVCA52BEnlOEUGNOAkkjwjjPM3cXBG FGV193DviN/lOqamNFbZdlx1b52bgrqbH13oXQAOtDE7IfkJ5KA84CoidG8YY4D36mfL 7zKZezbcDWbmkBIovWmNBcfUvYNEEn02NzSiQXvh9RwINOaspNuvodpecIfDIt+/dnyF m1hTjBmzC6V8aJ9H36vjL9RojUZuO8gxKyAAj+6nDZweaSUbgVHYIx8NJmImOsKf06oD /XKXjtAqSdvS40btYWtCgwHNkYGkD88ICesn0S6YgzBRfFU7IBRwYTeHDiANPlAoMyug voGw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=ZSclhDlI; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id k9si22749993edh.518.2021.10.19.06.51.04; Tue, 19 Oct 2021 06:51:29 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=ZSclhDlI; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235788AbhJSNuW (ORCPT + 99 others); Tue, 19 Oct 2021 09:50:22 -0400 Received: from us-smtp-delivery-124.mimecast.com ([170.10.129.124]:58272 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235741AbhJSNuV (ORCPT ); Tue, 19 Oct 2021 09:50:21 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1634651288; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=MGsX/Ml84CMC+OjHlH5RDMfLJcW/FIl8Qk/LyCJCDBE=; b=ZSclhDlI9f7DSnX5T+RtxbAZsiyWUPYaX6Ng9BCfTgxUNykaueuQpCVEckHMjmUYoflALw A7pP8KaYVIfM3x9VU4Da+PClamRu/Lg5sc2RkHLQRtqJ1a105HfKBORIxHTsnfxmMchDJO wMhZBvu6M3TxTPgNiJg7PRKfNWyB+3g= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-534-86WeWPCuMNSZLCuVHcUO8Q-1; Tue, 19 Oct 2021 09:46:11 -0400 X-MC-Unique: 86WeWPCuMNSZLCuVHcUO8Q-1 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id BF6C91006AA5; Tue, 19 Oct 2021 13:46:09 +0000 (UTC) Received: from max.com (unknown [10.40.193.143]) by smtp.corp.redhat.com (Postfix) with ESMTP id 3A35A102AE42; Tue, 19 Oct 2021 13:45:59 +0000 (UTC) From: Andreas Gruenbacher To: Linus Torvalds , Catalin Marinas Cc: Alexander Viro , Christoph Hellwig , "Darrick J. Wong" , Paul Mackerras , Jan Kara , Matthew Wilcox , cluster-devel@redhat.com, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, ocfs2-devel@oss.oracle.com, kvm-ppc@vger.kernel.org, linux-btrfs@vger.kernel.org, Andreas Gruenbacher Subject: [PATCH v8 17/17] gfs2: Fix mmap + page fault deadlocks for direct I/O Date: Tue, 19 Oct 2021 15:42:04 +0200 Message-Id: <20211019134204.3382645-18-agruenba@redhat.com> In-Reply-To: <20211019134204.3382645-1-agruenba@redhat.com> References: <20211019134204.3382645-1-agruenba@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 2.84 on 10.5.11.22 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Also disable page faults during direct I/O requests and implement a similar kind of retry logic as in the buffered I/O case. The retry logic in the direct I/O case differs from the buffered I/O case in the following way: direct I/O doesn't provide the kinds of consistency guarantees between concurrent reads and writes that buffered I/O provides, so once we lose the inode glock while faulting in user pages, we always resume the operation. We never need to return a partial read or write. This locking problem was originally reported by Jan Kara. Linus came up with the idea of disabling page faults. Many thanks to Al Viro and Matthew Wilcox for their feedback. Signed-off-by: Andreas Gruenbacher --- fs/gfs2/file.c | 101 +++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 89 insertions(+), 12 deletions(-) diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c index ae06defcf369..a8e440b4d21c 100644 --- a/fs/gfs2/file.c +++ b/fs/gfs2/file.c @@ -811,22 +811,65 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to, { struct file *file = iocb->ki_filp; struct gfs2_inode *ip = GFS2_I(file->f_mapping->host); - size_t count = iov_iter_count(to); + size_t prev_count = 0, window_size = 0; + size_t written = 0; ssize_t ret; - if (!count) + /* + * In this function, we disable page faults when we're holding the + * inode glock while doing I/O. If a page fault occurs, we indicate + * that the inode glock may be dropped, fault in the pages manually, + * and retry. + * + * Unlike generic_file_read_iter, for reads, iomap_dio_rw can trigger + * physical as well as manual page faults, and we need to disable both + * kinds. + * + * For direct I/O, gfs2 takes the inode glock in deferred mode. This + * locking mode is compatible with other deferred holders, so multiple + * processes and nodes can do direct I/O to a file at the same time. + * There's no guarantee that reads or writes will be atomic. Any + * coordination among readers and writers needs to happen externally. + */ + + if (!iov_iter_count(to)) return 0; /* skip atime */ gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh); +retry: ret = gfs2_glock_nq(gh); if (ret) goto out_uninit; +retry_under_glock: + pagefault_disable(); + to->nofault = true; + ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, + IOMAP_DIO_PARTIAL, written); + to->nofault = false; + pagefault_enable(); + if (ret > 0) + written = ret; - ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0, 0); - gfs2_glock_dq(gh); + if (unlikely(iov_iter_count(to) && (ret > 0 || ret == -EFAULT)) && + should_fault_in_pages(to, &prev_count, &window_size)) { + size_t leftover; + + gfs2_holder_allow_demote(gh); + leftover = fault_in_iov_iter_writeable(to, window_size); + gfs2_holder_disallow_demote(gh); + if (leftover != window_size) { + if (!gfs2_holder_queued(gh)) + goto retry; + goto retry_under_glock; + } + } + if (gfs2_holder_queued(gh)) + gfs2_glock_dq(gh); out_uninit: gfs2_holder_uninit(gh); - return ret; + if (ret < 0) + return ret; + return written; } static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from, @@ -835,10 +878,20 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from, struct file *file = iocb->ki_filp; struct inode *inode = file->f_mapping->host; struct gfs2_inode *ip = GFS2_I(inode); - size_t len = iov_iter_count(from); - loff_t offset = iocb->ki_pos; + size_t prev_count = 0, window_size = 0; + size_t read = 0; ssize_t ret; + /* + * In this function, we disable page faults when we're holding the + * inode glock while doing I/O. If a page fault occurs, we indicate + * that the inode glock may be dropped, fault in the pages manually, + * and retry. + * + * For writes, iomap_dio_rw only triggers manual page faults, so we + * don't need to disable physical ones. + */ + /* * Deferred lock, even if its a write, since we do no allocation on * this path. All we need to change is the atime, and this lock mode @@ -848,22 +901,46 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from, * VFS does. */ gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh); +retry: ret = gfs2_glock_nq(gh); if (ret) goto out_uninit; - +retry_under_glock: /* Silently fall back to buffered I/O when writing beyond EOF */ - if (offset + len > i_size_read(&ip->i_inode)) + if (iocb->ki_pos + iov_iter_count(from) > i_size_read(&ip->i_inode)) goto out; - ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0, 0); + from->nofault = true; + ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, + IOMAP_DIO_PARTIAL, read); + from->nofault = false; + if (ret == -ENOTBLK) ret = 0; + if (ret > 0) + read = ret; + + if (unlikely(iov_iter_count(from) && (ret > 0 || ret == -EFAULT)) && + should_fault_in_pages(from, &prev_count, &window_size)) { + size_t leftover; + + gfs2_holder_allow_demote(gh); + leftover = fault_in_iov_iter_readable(from, window_size); + gfs2_holder_disallow_demote(gh); + if (leftover != window_size) { + if (!gfs2_holder_queued(gh)) + goto retry; + goto retry_under_glock; + } + } out: - gfs2_glock_dq(gh); + if (gfs2_holder_queued(gh)) + gfs2_glock_dq(gh); out_uninit: gfs2_holder_uninit(gh); - return ret; + if (ret < 0) + return ret; + return read; } static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to) -- 2.26.3