From: Andreas Gruenbacher
To: Linus Torvalds, Alexander Viro, Christoph Hellwig, "Darrick J. Wong"
Cc: Jan Kara, Matthew Wilcox, cluster-devel@redhat.com, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, ocfs2-devel@oss.oracle.com, Andreas Gruenbacher
Subject: [PATCH v6 18/19] gfs2: Fix mmap + page fault deadlocks for direct I/O
Date: Thu, 19 Aug 2021 21:41:01 +0200
Message-Id: <20210819194102.1491495-19-agruenba@redhat.com>
In-Reply-To: <20210819194102.1491495-1-agruenba@redhat.com>
References: <20210819194102.1491495-1-agruenba@redhat.com>

Also disable page faults during direct I/O requests and implement the
same kind of retry logic as in the buffered I/O case.

Direct I/O requests differ from buffered I/O requests in that they use
bio_iov_iter_get_pages for grabbing page references and faulting in
pages instead of triggering physical page faults.  Those manual page
faults can be disabled with the new iov_iter nofault flag.

This kind of locking problem in gfs2 was originally reported by Jan
Kara.  Linus came up with the proposal to disable page faults.  Many
thanks to Al Viro and Matthew Wilcox for their feedback.
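To make the retry logic easier to follow, here is a rough sketch of the
pattern both functions in the diff implement (illustration only, not
part of the patch).  The lock()/unlock() helpers and the locked flag
are placeholders for the glock holder calls; the real code uses goto
labels instead of a loop and, rather than dropping the glock around the
fault-in, only allows it to be demoted (gfs2_holder_allow_demote()) and
re-acquires it if it was actually taken away (gfs2_holder_queued()):

static ssize_t dio_read_sketch(struct kiocb *iocb, struct iov_iter *to)
{
        size_t prev_count = 0, window_size = 0;
        size_t written = 0;     /* cumulative progress across retries */
        bool locked = true;
        ssize_t ret;

        lock();                 /* placeholder for gfs2_glock_nq() */
        for (;;) {
                /* Do the direct I/O with all page faults disabled. */
                pagefault_disable();
                to->nofault = true;
                ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL,
                                   IOMAP_DIO_PARTIAL, written);
                to->nofault = false;
                pagefault_enable();
                if (ret > 0)
                        written = ret;

                /* Finished, or failed for a reason other than a page fault? */
                if (!iov_iter_count(to) || (ret <= 0 && ret != -EFAULT) ||
                    !should_fault_in_pages(to, &prev_count, &window_size))
                        break;

                /*
                 * Fault a window of the user buffer in while not holding
                 * the lock, then resume.  If nothing at all could be
                 * faulted in, give up instead of looping forever.
                 */
                unlock();       /* placeholder for gfs2_glock_dq() */
                locked = false;
                if (fault_in_iov_iter_writeable(to, window_size) == window_size)
                        break;
                lock();
                locked = true;
        }
        if (locked)
                unlock();
        return ret < 0 ? ret : written;
}

The key point is that the progress made so far (written) is passed back
to iomap_dio_rw() via the done_before argument added earlier in this
series, so a request interrupted by a page fault can resume without
redoing the part that already completed.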
Signed-off-by: Andreas Gruenbacher
---
 fs/gfs2/file.c | 99 ++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 87 insertions(+), 12 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 8ca6bdba907c..2cf3466b9dde 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -811,22 +811,64 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
 {
         struct file *file = iocb->ki_filp;
         struct gfs2_inode *ip = GFS2_I(file->f_mapping->host);
-        size_t count = iov_iter_count(to);
+        size_t prev_count = 0, window_size = 0;
+        size_t written = 0;
         ssize_t ret;

-        if (!count)
+        /*
+         * In this function, we disable page faults when we're holding the
+         * inode glock while doing I/O.  If a page fault occurs, we drop the
+         * inode glock, fault in the pages manually, and retry.
+         *
+         * Unlike generic_file_read_iter, for reads, iomap_dio_rw can trigger
+         * physical as well as manual page faults, and we need to disable
+         * both kinds.
+         *
+         * For direct I/O, gfs2 takes the inode glock in deferred mode.  This
+         * locking mode is compatible with other deferred holders, so multiple
+         * processes and nodes can do direct I/O to a file at the same time.
+         * There's no guarantee that reads or writes will be atomic.  Any
+         * coordination among readers and writers needs to happen externally.
+         */
+
+        if (!iov_iter_count(to))
                 return 0; /* skip atime */

         gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+retry:
         ret = gfs2_glock_nq(gh);
         if (ret)
                 goto out_uninit;
+retry_under_glock:
+        pagefault_disable();
+        to->nofault = true;
+        ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL,
+                           IOMAP_DIO_PARTIAL, written);
+        to->nofault = false;
+        pagefault_enable();
+        if (ret > 0)
+                written = ret;
+
+        if (unlikely(iov_iter_count(to) && (ret > 0 || ret == -EFAULT)) &&
+            should_fault_in_pages(to, &prev_count, &window_size)) {
+                size_t leftover;

-        ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0, 0);
-        gfs2_glock_dq(gh);
+                gfs2_holder_allow_demote(gh);
+                leftover = fault_in_iov_iter_writeable(to, window_size);
+                gfs2_holder_disallow_demote(gh);
+                if (leftover != window_size) {
+                        if (!gfs2_holder_queued(gh))
+                                goto retry;
+                        goto retry_under_glock;
+                }
+        }
+        if (gfs2_holder_queued(gh))
+                gfs2_glock_dq(gh);
 out_uninit:
         gfs2_holder_uninit(gh);
-        return ret;
+        if (ret < 0)
+                return ret;
+        return written;
 }

 static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
@@ -835,10 +877,19 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
         struct file *file = iocb->ki_filp;
         struct inode *inode = file->f_mapping->host;
         struct gfs2_inode *ip = GFS2_I(inode);
-        size_t len = iov_iter_count(from);
-        loff_t offset = iocb->ki_pos;
+        size_t prev_count = 0, window_size = 0;
+        size_t read = 0;
         ssize_t ret;

+        /*
+         * In this function, we disable page faults when we're holding the
+         * inode glock while doing I/O.  If a page fault occurs, we drop the
+         * inode glock, fault in the pages manually, and retry.
+         *
+         * For writes, iomap_dio_rw only triggers manual page faults, so we
+         * don't need to disable physical ones.
+         */
+
         /*
          * Deferred lock, even if its a write, since we do no allocation on
          * this path. All we need to change is the atime, and this lock mode
@@ -848,22 +899,46 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
          * VFS does.
          */
         gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+retry:
         ret = gfs2_glock_nq(gh);
         if (ret)
                 goto out_uninit;
-
+retry_under_glock:
         /* Silently fall back to buffered I/O when writing beyond EOF */
-        if (offset + len > i_size_read(&ip->i_inode))
+        if (iocb->ki_pos + iov_iter_count(from) > i_size_read(&ip->i_inode))
                 goto out;

-        ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0, 0);
+        from->nofault = true;
+        ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL,
+                           IOMAP_DIO_PARTIAL, read);
+        from->nofault = false;
+
         if (ret == -ENOTBLK)
                 ret = 0;
+        if (ret > 0)
+                read = ret;
+
+        if (unlikely(iov_iter_count(from) && (ret > 0 || ret == -EFAULT)) &&
+            should_fault_in_pages(from, &prev_count, &window_size)) {
+                size_t leftover;
+
+                gfs2_holder_allow_demote(gh);
+                leftover = fault_in_iov_iter_readable(from, window_size);
+                gfs2_holder_disallow_demote(gh);
+                if (leftover != window_size) {
+                        if (!gfs2_holder_queued(gh))
+                                goto retry;
+                        goto retry_under_glock;
+                }
+        }
 out:
-        gfs2_glock_dq(gh);
+        if (gfs2_holder_queued(gh))
+                gfs2_glock_dq(gh);
 out_uninit:
         gfs2_holder_uninit(gh);
-        return ret;
+        if (ret < 0)
+                return ret;
+        return read;
 }

 static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
-- 
2.26.3