From: Vivek Goyal <vgoyal@redhat.com>
To: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: vgoyal@redhat.com, miklos@szeredi.hu, stefanha@redhat.com, dgilbert@redhat.com, sweil@redhat.com, swhiteho@redhat.com
Subject: [PATCH 42/52] fuse: Wait for memory ranges to become free
Date: Mon, 10 Dec 2018 12:13:08 -0500
Message-Id: <20181210171318.16998-43-vgoyal@redhat.com>
In-Reply-To: <20181210171318.16998-1-vgoyal@redhat.com>
References: <20181210171318.16998-1-vgoyal@redhat.com>

Sometimes we run out of memory ranges. In that case, wait for a memory
range to become free instead of returning -EBUSY.

The DAX fault path holds fuse_inode->i_mmap_sem, and while it is held,
memory reclaim cannot proceed. It is not safe to wait while holding
fuse_inode->i_mmap_sem, for two reasons:

- The worker thread that frees memory might itself block on
  fuse_inode->i_mmap_sem.
- This inode may be holding all the memory, so no more memory can be
  freed.

In both cases, deadlock will ensue. So in the fault path, return -ENOSPC
from iomap_begin() if memory can't be allocated, drop
fuse_inode->i_mmap_sem, wait for a free range to become available, and
retry.

The read/write path is a different story. We hold the inode lock, and
lock ordering allows us to grab fuse_inode->i_mmap_sem if needed. That
means we can do direct reclaim in that path. But if there is no memory
allocated to this inode, direct reclaim will not work, and we need to
wait for a memory range to become free. So try the following order:

A. Try to get a free range.
B. If that fails, try direct reclaim.
C. If that fails, wait for a memory range to become free.

Sleeping with locks held in step C should be fine, because in step B we
made sure this inode is not holding any ranges. That means other inodes
are holding the ranges, and somebody should be able to free memory.
Also, the worker thread does a trylock() on the inode lock, so it will
not wait on this inode and will move on to the next memory range. Hence
the sequence above should be deadlock-free.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/file.c   | 60 +++++++++++++++++++++++++++++++++++++++++++-------------
 fs/fuse/fuse_i.h |  3 +++
 fs/fuse/inode.c  |  1 +
 3 files changed, 50 insertions(+), 14 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 709747458335..d0942ce0a6c3 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -220,6 +220,8 @@ static void __free_dax_mapping(struct fuse_conn *fc,
 {
 	list_add_tail(&dmap->list, &fc->free_ranges);
 	fc->nr_free_ranges++;
+	/* TODO: Wake up only when needed */
+	wake_up(&fc->dax_range_waitq);
 }
 
 static void free_dax_mapping(struct fuse_conn *fc,
@@ -1770,12 +1772,18 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
 			goto iomap_hole;
 
 		/* Can't do reclaim in fault path yet due to lock ordering */
-		if (flags & IOMAP_FAULT)
+		if (flags & IOMAP_FAULT) {
 			alloc_dmap = alloc_dax_mapping(fc);
-		else
+			if (!alloc_dmap)
+				return -ENOSPC;
+		} else {
 			alloc_dmap = alloc_dax_mapping_reclaim(fc, inode);
+			if (IS_ERR(alloc_dmap))
+				return PTR_ERR(alloc_dmap);
+		}
 
-		if (!alloc_dmap)
+		/* If we are here, we should have memory allocated */
+		if (WARN_ON(!alloc_dmap))
 			return -EBUSY;
 
 		/*
@@ -2596,14 +2604,24 @@ static ssize_t fuse_file_splice_read(struct file *in, loff_t *ppos,
 static int __fuse_dax_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 			    bool write)
 {
-	int ret;
+	int ret, error = 0;
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	struct super_block *sb = inode->i_sb;
 	pfn_t pfn;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	bool retry = false;
 
 	if (write)
 		sb_start_pagefault(sb);
 
+retry:
+	if (retry && !(fc->nr_free_ranges > 0)) {
+		ret = -EINTR;
+		if (wait_event_killable_exclusive(fc->dax_range_waitq,
+						  (fc->nr_free_ranges > 0)))
+			goto out;
+	}
+
 	/*
 	 * We need to serialize against not only truncate but also against
 	 * fuse dax memory range reclaim. While a range is being reclaimed,
@@ -2611,13 +2629,20 @@ static int __fuse_dax_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 	 * to populate page cache or access memory we are trying to free.
 	 */
 	down_read(&get_fuse_inode(inode)->i_mmap_sem);
-	ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
+	ret = dax_iomap_fault(vmf, pe_size, &pfn, &error, &fuse_iomap_ops);
+	if ((ret & VM_FAULT_ERROR) && error == -ENOSPC) {
+		error = 0;
+		retry = true;
+		up_read(&get_fuse_inode(inode)->i_mmap_sem);
+		goto retry;
+	}
 
 	if (ret & VM_FAULT_NEEDDSYNC)
 		ret = dax_finish_sync_fault(vmf, pe_size, pfn);
 
 	up_read(&get_fuse_inode(inode)->i_mmap_sem);
 
+out:
 	if (write)
 		sb_end_pagefault(sb);
 
@@ -3828,16 +3853,23 @@ static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
 	struct fuse_dax_mapping *dmap;
 	struct fuse_inode *fi = get_fuse_inode(inode);
 
-	dmap = alloc_dax_mapping(fc);
-	if (dmap)
-		return dmap;
-
-	/* There are no mappings which can be reclaimed */
-	if (!fi->nr_dmaps)
-		return NULL;
+	while (1) {
+		dmap = alloc_dax_mapping(fc);
+		if (dmap)
+			return dmap;
 
-	/* Try reclaim a fuse dax memory range */
-	return fuse_dax_reclaim_first_mapping(fc, inode);
+		if (fi->nr_dmaps)
+			return fuse_dax_reclaim_first_mapping(fc, inode);
+		/*
+		 * There are no mappings which can be reclaimed.
+		 * Wait for one.
+		 */
+		if (!(fc->nr_free_ranges > 0)) {
+			if (wait_event_killable_exclusive(fc->dax_range_waitq,
+					(fc->nr_free_ranges > 0)))
+				return ERR_PTR(-EINTR);
+		}
+	}
 }
 
 int fuse_dax_free_one_mapping_locked(struct fuse_conn *fc, struct inode *inode,
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index bbefa7c11078..7b2db87c6ead 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -886,6 +886,9 @@ struct fuse_conn {
 	/* Worker to free up memory ranges */
 	struct delayed_work dax_free_work;
 
+	/* Wait queue for a dax range to become free */
+	wait_queue_head_t dax_range_waitq;
+
 	/*
 	 * DAX Window Free Ranges. TODO: This might not be best place to store
 	 * this free list
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index d31acb97eede..178ac3171564 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -695,6 +695,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
 	atomic_set(&fc->dev_count, 1);
 	init_waitqueue_head(&fc->blocked_waitq);
 	init_waitqueue_head(&fc->reserved_req_waitq);
+	init_waitqueue_head(&fc->dax_range_waitq);
 	fuse_iqueue_init(&fc->iq, fiq_ops, fiq_priv);
 	INIT_LIST_HEAD(&fc->bg_queue);
 	INIT_LIST_HEAD(&fc->entry);
-- 
2.13.6