From: Luis Henriques
To: "Yan, Zheng"
Cc: "Yan, Zheng", Sage Weil, Ilya Dryomov, ceph-devel,
	Linux Kernel Mailing List, Hendrik Peyerl
Subject: Re: [PATCH 2/2] ceph: quota: fix quota subdir mounts
References: <20190308162757.9260-1-lhenriques@suse.com>
	<20190308162757.9260-3-lhenriques@suse.com>
Date: Mon, 11 Mar 2019 11:43:07 +0000
In-Reply-To: (Zheng Yan's message of "Mon, 11 Mar 2019 11:12:19 +0800")
Message-ID: <87h8c9fxh0.fsf@suse.com>

"Yan, Zheng" writes:

> On Sat, Mar 9, 2019 at 12:30 AM Luis Henriques wrote:
>>
>> The CephFS kernel client does not enforce quotas set in a directory
>> that isn't visible from the mount point.  For example, given the path
>> '/dir1/dir2', if quotas are set in 'dir1' and the filesystem is mounted with
>>
>>   mount -t ceph <server>:<port>:/dir1/ /mnt
>>
>> then the client won't be able to access the 'dir1' inode, even if 'dir2'
>> belongs to a quota realm that points to it.
>>
>> This patch fixes this issue by simply doing an MDS LOOKUPINO operation for
>> unknown inodes.  Any inode reference obtained this way will be added to a
>> list in ceph_mds_client, and will only be released when the filesystem is
>> umounted.
>>
>> Link: https://tracker.ceph.com/issues/38482
>> Reported-by: Hendrik Peyerl
>> Signed-off-by: Luis Henriques
>> ---
>>  fs/ceph/mds_client.c | 14 +++++++
>>  fs/ceph/mds_client.h |  2 +
>>  fs/ceph/quota.c      | 91 +++++++++++++++++++++++++++++++++++++++-----
>>  fs/ceph/super.h      |  2 +
>>  4 files changed, 99 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>> index 163fc74bf221..72c5ce5e4209 100644
>> --- a/fs/ceph/mds_client.c
>> +++ b/fs/ceph/mds_client.c
>> @@ -3656,6 +3656,8 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
>>         mdsc->max_sessions = 0;
>>         mdsc->stopping = 0;
>>         atomic64_set(&mdsc->quotarealms_count, 0);
>> +       INIT_LIST_HEAD(&mdsc->quotarealms_inodes_list);
>> +       spin_lock_init(&mdsc->quotarealms_inodes_lock);
>>         mdsc->last_snap_seq = 0;
>>         init_rwsem(&mdsc->snap_rwsem);
>>         mdsc->snap_realms = RB_ROOT;
>> @@ -3726,9 +3728,21 @@ static void wait_requests(struct ceph_mds_client *mdsc)
>>   */
>>  void ceph_mdsc_pre_umount(struct ceph_mds_client *mdsc)
>>  {
>> +       struct ceph_inode_info *ci;
>> +
>>         dout("pre_umount\n");
>>         mdsc->stopping = 1;
>>
>> +       spin_lock(&mdsc->quotarealms_inodes_lock);
>> +       while(!list_empty(&mdsc->quotarealms_inodes_list)) {
>> +               ci = list_first_entry(&mdsc->quotarealms_inodes_list,
>> +                                     struct ceph_inode_info,
>> +                                     i_quotarealms_inode_item);
>> +               list_del(&ci->i_quotarealms_inode_item);
>> +               iput(&ci->vfs_inode);
>
> iput while holding spinlock is not good

Yep, that's definitely not a good idea.  And maybe this loop could even
race with a lookup currently in progress, so a new inode could be added
to the list after the cleanup.

I didn't verify whether this race can really occur, because an easy fix
is to simply move this loop into ceph_mdsc_destroy(), where we don't
really need the spinlock anymore (rough sketch at the end of this email).

>> +restart:
>>         down_read(&mdsc->snap_rwsem);
>> -       old_realm = get_quota_realm(mdsc, old);
>> -       new_realm = get_quota_realm(mdsc, new);
>> +       old_realm = get_quota_realm(mdsc, old, true);
>> +       /* This needs to be atomic, so we need to hold snap_rwsem */
>
> I don't understand this comment. get_quota_realm() unlocks snap_rwsem
> no matter whether the 'retry' parameter is true or not.

Right, maybe the parameter name and the comment are a bit misleading.

get_quota_realm() may need to do an inode lookup and then retry to get
the quota realm.  If this lookup is required, it has to drop the rwsem.
However, in ceph_quota_is_same_realm() we need to look up two quota
realms "atomically", i.e. with the rwsem held.  If get_quota_realm()
needs to drop it, it will do the MDS inode lookup anyway, but instead of
retrying to get the quota realm it will return -EAGAIN (because 'retry'
will be 'false').  This allows ceph_quota_is_same_realm() to restart the
operation itself, and retry to get both quota realms without
get_quota_realm() dropping the rwsem.

Does it make sense?  I agree the design isn't great :-/  I tried to
describe this behavior in the get_quota_realm() comment, but it's
probably not good enough.
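
To make that a bit more concrete, this is roughly the flow I have in
mind inside get_quota_realm() when the realm inode isn't in cache
(simplified and untested sketch, *not* the actual patch code;
lookup_quotarealm_inode() is the new helper added by this patch, and
'sb', 'realm' and 'retry' stand for the superblock, the realm being
walked and the new bool parameter):

        if (!in) {
                /* realm inode not in cache: an MDS LOOKUPINO is needed, so
                 * the rwsem has to be dropped around it; the reference we
                 * get back stays on mdsc->quotarealms_inodes_list until
                 * umount */
                up_read(&mdsc->snap_rwsem);
                in = lookup_quotarealm_inode(mdsc, sb, realm);
                down_read(&mdsc->snap_rwsem);
                if (IS_ERR(in))
                        return NULL;
                if (!retry)
                        /* ceph_quota_is_same_realm() wants both lookups
                         * done under a single rwsem hold, so let it
                         * restart instead of retrying here */
                        return ERR_PTR(-EAGAIN);
                goto restart;   /* retry the walk, inode is now cached */
        }

I.e. the only thing 'retry' changes is who restarts the realm walk after
the MDS lookup: get_quota_realm() itself, or its caller.
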
>> +       new_realm = get_quota_realm(mdsc, new, false);
>> +       if (PTR_ERR(new_realm) == -EAGAIN) {
>> +               up_read(&mdsc->snap_rwsem);
>> +               if (old_realm)
>> +                       ceph_put_snap_realm(mdsc, old_realm);
>> +               goto restart;
>> +       }
>>         is_same = (old_realm == new_realm);
>>         up_read(&mdsc->snap_rwsem);
>>
>> @@ -166,6 +230,7 @@ static bool check_quota_exceeded(struct inode *inode, enum quota_check_op op,
>>                 return false;
>>
>>         down_read(&mdsc->snap_rwsem);
>> +restart:
>>         realm = ceph_inode(inode)->i_snap_realm;
>>         if (realm)
>>                 ceph_get_snap_realm(mdsc, realm);
>> @@ -176,9 +241,15 @@ static bool check_quota_exceeded(struct inode *inode, enum quota_check_op op,
>>                 spin_lock(&realm->inodes_with_caps_lock);
>>                 in = realm->inode ? igrab(realm->inode) : NULL;
>>                 spin_unlock(&realm->inodes_with_caps_lock);
>> -               if (!in)
>> -                       break;
>> -
>> +               if (!in) {
>
> maybe we should distinguish 'realm->inode is null' from 'igrab fails'

Sure, that makes sense.

Cheers,
--
Luis

>
>> +                       up_read(&mdsc->snap_rwsem);
>> +                       in = lookup_quotarealm_inode(mdsc, inode->i_sb, realm);
>> +                       down_read(&mdsc->snap_rwsem);
>> +                       if (IS_ERR(in))
>> +                               break;
>> +                       ceph_put_snap_realm(mdsc, realm);
>> +                       goto restart;
>> +               }
>>                 ci = ceph_inode(in);
>>                 spin_lock(&ci->i_ceph_lock);
>>                 if (op == QUOTA_CHECK_MAX_FILES_OP) {
>> @@ -314,7 +385,7 @@ bool ceph_quota_update_statfs(struct ceph_fs_client *fsc, struct kstatfs *buf)
>>         bool is_updated = false;
>>
>>         down_read(&mdsc->snap_rwsem);
>> -       realm = get_quota_realm(mdsc, d_inode(fsc->sb->s_root));
>> +       realm = get_quota_realm(mdsc, d_inode(fsc->sb->s_root), true);
>>         up_read(&mdsc->snap_rwsem);
>>         if (!realm)
>>                 return false;
>> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
>> index ce51e98b08ec..cc7766aeb73b 100644
>> --- a/fs/ceph/super.h
>> +++ b/fs/ceph/super.h
>> @@ -375,6 +375,8 @@ struct ceph_inode_info {
>>         struct list_head i_snap_realm_item;
>>         struct list_head i_snap_flush_item;
>>
>> +       struct list_head i_quotarealms_inode_item;
>> +
>>         struct work_struct i_wb_work; /* writeback work */
>>         struct work_struct i_pg_inv_work; /* page invalidation work */
>>
>
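
P.S.: and regarding the iput issue above, this is roughly what I meant
by moving the cleanup loop into ceph_mdsc_destroy() (untested sketch,
simply reusing the loop from this patch and assuming the current
ceph_mdsc_destroy() signature; by that point no new lookups can add
inodes to the list, so no spinlock is needed around the iput calls):

void ceph_mdsc_destroy(struct ceph_fs_client *fsc)
{
        struct ceph_mds_client *mdsc = fsc->mdsc;
        struct ceph_inode_info *ci;

        /* ... existing teardown ... */

        /* no MDS requests in flight anymore, so no locking needed here */
        while (!list_empty(&mdsc->quotarealms_inodes_list)) {
                ci = list_first_entry(&mdsc->quotarealms_inodes_list,
                                      struct ceph_inode_info,
                                      i_quotarealms_inode_item);
                list_del(&ci->i_quotarealms_inode_item);
                iput(&ci->vfs_inode);
        }

        /* ... rest of the existing teardown ... */
}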