From: Xiubo Li
To: jlayton@kernel.org, idryomov@gmail.com, viro@zeniv.linux.org.uk
Cc: willy@infradead.org, vshankar@redhat.com, ceph-devel@vger.kernel.org,
    arnd@arndb.de, mcgrof@kernel.org, akpm@linux-foundation.org,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
    Xiubo Li, kernel test robot
Subject: [PATCH v4 2/2] ceph: wait the first reply of inflight async unlink
Date: Wed, 18 May 2022 22:45:45 +0800
Message-Id: <20220518144545.246604-3-xiubli@redhat.com>
In-Reply-To: <20220518144545.246604-1-xiubli@redhat.com>
References: <20220518144545.246604-1-xiubli@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

In the async unlink case the kclient does not wait for the first reply
from the MDS: it just drops all the links, unhashes the dentry, and
returns success immediately.

Any new create/link/rename etc. request that follows and uses the same
file name must wait for the first reply to the inflight unlink request.
Otherwise the MDS may fail the new request with -EEXIST if the inflight
async unlink request was delayed for some reason.

The worst case is a non-async openc request: it will successfully open
the file if the CDentry hasn't been unlinked yet, but the delayed async
unlink request will later remove that CDentry. That means the
just-created file may be deleted later by accident.

So wait for the inflight async unlink requests to finish when creating
new files/directories that use the same file names.

URL: https://tracker.ceph.com/issues/55332
Reported-by: kernel test robot
Signed-off-by: Xiubo Li
---
 fs/ceph/dir.c        | 70 +++++++++++++++++++++++++++++++++++++++---
 fs/ceph/file.c       |  4 +++
 fs/ceph/mds_client.c | 73 ++++++++++++++++++++++++++++++++++++++++++++
 fs/ceph/mds_client.h |  1 +
 fs/ceph/super.c      |  3 ++
 fs/ceph/super.h      | 19 +++++++++---
 6 files changed, 160 insertions(+), 10 deletions(-)

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index eae417d71136..01e7facef9b2 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -856,6 +856,10 @@ static int ceph_mknod(struct user_namespace *mnt_userns, struct inode *dir,
 	if (ceph_snap(dir) != CEPH_NOSNAP)
 		return -EROFS;
 
+	err = ceph_wait_on_conflict_unlink(dentry);
+	if (err)
+		return err;
+
 	if (ceph_quota_is_max_files_exceeded(dir)) {
 		err = -EDQUOT;
 		goto out;
@@ -918,6 +922,10 @@ static int ceph_symlink(struct user_namespace *mnt_userns, struct inode *dir,
 	if (ceph_snap(dir) != CEPH_NOSNAP)
 		return -EROFS;
 
+	err = ceph_wait_on_conflict_unlink(dentry);
+	if (err)
+		return err;
+
 	if (ceph_quota_is_max_files_exceeded(dir)) {
 		err = -EDQUOT;
 		goto out;
@@ -968,9 +976,13 @@ static int ceph_mkdir(struct user_namespace *mnt_userns, struct inode *dir,
 	struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(dir->i_sb);
 	struct ceph_mds_request *req;
 	struct ceph_acl_sec_ctx as_ctx = {};
-	int err = -EROFS;
+	int err;
 	int op;
 
+	err = ceph_wait_on_conflict_unlink(dentry);
+	if (err)
+		return err;
+
 	if (ceph_snap(dir) == CEPH_SNAPDIR) {
 		/* mkdir .snap/foo is a MKSNAP */
 		op = CEPH_MDS_OP_MKSNAP;
@@ -980,6 +992,7 @@ static int ceph_mkdir(struct user_namespace *mnt_userns, struct inode *dir,
 		dout("mkdir dir %p dn %p mode 0%ho\n", dir, dentry, mode);
 		op = CEPH_MDS_OP_MKDIR;
 	} else {
+		err = -EROFS;
 		goto out;
 	}
 
@@ -1037,6 +1050,10 @@ static int ceph_link(struct dentry *old_dentry, struct inode *dir,
 	struct ceph_mds_request *req;
 	int err;
 
+	err = ceph_wait_on_conflict_unlink(dentry);
+	if (err)
+		return err;
+
 	if (ceph_snap(dir) != CEPH_NOSNAP)
 		return -EROFS;
 
@@ -1071,9 +1088,27 @@ static int ceph_link(struct dentry *old_dentry, struct inode *dir,
 static void ceph_async_unlink_cb(struct ceph_mds_client *mdsc,
 				 struct ceph_mds_request *req)
 {
+	struct dentry *dentry = req->r_dentry;
+	struct ceph_fs_client *fsc = ceph_sb_to_client(dentry->d_sb);
+	struct ceph_dentry_info *di = ceph_dentry(dentry);
 	int result = req->r_err ? req->r_err :
 			le32_to_cpu(req->r_reply_info.head->result);
 
+	if (!test_bit(CEPH_DENTRY_ASYNC_UNLINK_BIT, &di->flags))
+		pr_warn("%s dentry %p:%pd async unlink bit is not set\n",
+			__func__, dentry, dentry);
+
+	spin_lock(&fsc->async_unlink_conflict_lock);
+	hash_del_rcu(&di->hnode);
+	spin_unlock(&fsc->async_unlink_conflict_lock);
+
+	spin_lock(&dentry->d_lock);
+	di->flags &= ~CEPH_DENTRY_ASYNC_UNLINK;
+	wake_up_bit(&di->flags, CEPH_DENTRY_ASYNC_UNLINK_BIT);
+	spin_unlock(&dentry->d_lock);
+
+	synchronize_rcu();
+
 	if (result == -EJUKEBOX)
 		goto out;
 
@@ -1081,7 +1116,7 @@ static void ceph_async_unlink_cb(struct ceph_mds_client *mdsc,
 	if (result) {
 		int pathlen = 0;
 		u64 base = 0;
-		char *path = ceph_mdsc_build_path(req->r_dentry, &pathlen,
+		char *path = ceph_mdsc_build_path(dentry, &pathlen,
 						  &base, 0);
 
 		/* mark error on parent + clear complete */
@@ -1089,13 +1124,13 @@ static void ceph_async_unlink_cb(struct ceph_mds_client *mdsc,
 		ceph_dir_clear_complete(req->r_parent);
 
 		/* drop the dentry -- we don't know its status */
-		if (!d_unhashed(req->r_dentry))
-			d_drop(req->r_dentry);
+		if (!d_unhashed(dentry))
+			d_drop(dentry);
 
 		/* mark inode itself for an error (since metadata is bogus) */
 		mapping_set_error(req->r_old_inode->i_mapping, result);
 
-		pr_warn("ceph: async unlink failure path=(%llx)%s result=%d!\n",
+		pr_warn("async unlink failure path=(%llx)%s result=%d!\n",
 			base, IS_ERR(path) ? "<>" : path, result);
 		ceph_mdsc_free_path(path, pathlen);
 	}
@@ -1180,6 +1215,8 @@ static int ceph_unlink(struct inode *dir, struct dentry *dentry)
 
 	if (try_async && op == CEPH_MDS_OP_UNLINK &&
 	    (req->r_dir_caps = get_caps_for_async_unlink(dir, dentry))) {
+		struct ceph_dentry_info *di = ceph_dentry(dentry);
+
 		dout("async unlink on %llu/%.*s caps=%s", ceph_ino(dir),
 		     dentry->d_name.len, dentry->d_name.name,
 		     ceph_cap_string(req->r_dir_caps));
@@ -1187,6 +1224,16 @@ static int ceph_unlink(struct inode *dir, struct dentry *dentry)
 		req->r_callback = ceph_async_unlink_cb;
 		req->r_old_inode = d_inode(dentry);
 		ihold(req->r_old_inode);
+
+		spin_lock(&dentry->d_lock);
+		di->flags |= CEPH_DENTRY_ASYNC_UNLINK;
+		spin_unlock(&dentry->d_lock);
+
+		spin_lock(&fsc->async_unlink_conflict_lock);
+		hash_add_rcu(fsc->async_unlink_conflict, &di->hnode,
+			     dentry->d_name.hash);
+		spin_unlock(&fsc->async_unlink_conflict_lock);
+
 		err = ceph_mdsc_submit_request(mdsc, dir, req);
 		if (!err) {
 			/*
@@ -1198,6 +1245,15 @@ static int ceph_unlink(struct inode *dir, struct dentry *dentry)
 		} else if (err == -EJUKEBOX) {
 			try_async = false;
 			ceph_mdsc_put_request(req);
+
+			spin_lock(&dentry->d_lock);
+			di->flags &= ~CEPH_DENTRY_ASYNC_UNLINK;
+			spin_unlock(&dentry->d_lock);
+
+			spin_lock(&fsc->async_unlink_conflict_lock);
+			hash_del_rcu(&di->hnode);
+			spin_unlock(&fsc->async_unlink_conflict_lock);
+
 			goto retry;
 		}
 	} else {
@@ -1237,6 +1293,10 @@ static int ceph_rename(struct user_namespace *mnt_userns, struct inode *old_dir,
 	    (!ceph_quota_is_same_realm(old_dir, new_dir)))
 		return -EXDEV;
 
+	err = ceph_wait_on_conflict_unlink(new_dentry);
+	if (err)
+		return err;
+
 	dout("rename dir %p dentry %p to dir %p dentry %p\n",
 	     old_dir, old_dentry, new_dir, new_dentry);
 	req = ceph_mdsc_create_request(mdsc, op, USE_AUTH_MDS);
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 8c8226c0feac..f039e799f5f4 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -740,6 +740,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 	if (dentry->d_name.len > NAME_MAX)
 		return -ENAMETOOLONG;
 
+	err = ceph_wait_on_conflict_unlink(dentry);
+	if (err)
+		return err;
+
 	if (flags & O_CREAT) {
 		if (ceph_quota_is_max_files_exceeded(dir))
 			return -EDQUOT;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index e8c87dea0551..ffe52b7c6cbc 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -655,6 +655,79 @@ static void destroy_reply_info(struct ceph_mds_reply_info_parsed *info)
 	free_pages((unsigned long)info->dir_entries, get_order(info->dir_buf_size));
 }
 
+/*
+ * In async unlink case the kclient won't wait for the first reply
+ * from MDS and just drop all the links and unhash the dentry and then
+ * succeeds immediately.
+ *
+ * For any new create/link/rename,etc requests followed by using the
+ * same file names we must wait for the first reply of the inflight
+ * unlink request, or the MDS possibly will fail these following
+ * requests with -EEXIST if the inflight async unlink request was
+ * delayed for some reasons.
+ *
+ * And the worst case is that for the non-async openc request it will
+ * successfully open the file if the CDentry hasn't been unlinked yet,
+ * but later the previous delayed async unlink request will remove the
+ * CDentry. That means the just created file is possibly deleted later
+ * by accident.
+ *
+ * We need to wait for the inflight async unlink requests to finish
+ * when creating new files/directories by using the same file names.
+ */
+int ceph_wait_on_conflict_unlink(struct dentry *dentry)
+{
+	struct ceph_fs_client *fsc = ceph_sb_to_client(dentry->d_sb);
+	struct dentry *pdentry = dentry->d_parent;
+	struct dentry *udentry, *found = NULL;
+	struct ceph_dentry_info *di;
+	struct qstr dname;
+	u32 hash = dentry->d_name.hash;
+	int err;
+
+	dname.name = dentry->d_name.name;
+	dname.len = dentry->d_name.len;
+
+	rcu_read_lock();
+	hash_for_each_possible_rcu(fsc->async_unlink_conflict, di,
+				   hnode, hash) {
+		udentry = di->dentry;
+
+		spin_lock(&udentry->d_lock);
+		if (udentry->d_name.hash != hash)
+			goto next;
+		if (unlikely(udentry->d_parent != pdentry))
+			goto next;
+		if (!hash_hashed(&di->hnode))
+			goto next;
+
+		if (!test_bit(CEPH_DENTRY_ASYNC_UNLINK_BIT, &di->flags))
+			pr_warn("%s dentry %p:%pd async unlink bit is not set\n",
+				__func__, dentry, dentry);
+
+		if (d_compare(pdentry, udentry, &dname))
+			goto next;
+
+		spin_unlock(&udentry->d_lock);
+		found = dget(udentry);
+		break;
+next:
+		spin_unlock(&udentry->d_lock);
+	}
+	rcu_read_unlock();
+
+	if (likely(!found))
+		return 0;
+
+	dout("%s dentry %p:%pd conflict with old %p:%pd\n", __func__,
+	     dentry, dentry, found, found);
+
+	err = wait_on_bit(&di->flags, CEPH_DENTRY_ASYNC_UNLINK_BIT,
+			  TASK_KILLABLE);
+	dput(found);
+	return err;
+}
+
 
 /*
  * sessions
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 33497846e47e..d1ae679c52c3 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -582,6 +582,7 @@ static inline int ceph_wait_on_async_create(struct inode *inode)
 			   TASK_INTERRUPTIBLE);
 }
 
+extern int ceph_wait_on_conflict_unlink(struct dentry *dentry);
 extern u64 ceph_get_deleg_ino(struct ceph_mds_session *session);
 extern int ceph_restore_deleg_ino(struct ceph_mds_session *session, u64 ino);
 #endif
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index b73b4f75462c..6542b71f8627 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -816,6 +816,9 @@ static struct ceph_fs_client *create_fs_client(struct ceph_mount_options *fsopt,
 	if (!fsc->cap_wq)
 		goto fail_inode_wq;
 
+	hash_init(fsc->async_unlink_conflict);
+	spin_lock_init(&fsc->async_unlink_conflict_lock);
+
 	spin_lock(&ceph_fsc_lock);
 	list_add_tail(&fsc->metric_wakeup, &ceph_fsc_list);
 	spin_unlock(&ceph_fsc_lock);
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 506d52633627..251e726ec628 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -19,6 +19,7 @@
 #include
 #include
 #include
+#include
 
 #include
 
@@ -99,6 +100,8 @@ struct ceph_mount_options {
 	char *mon_addr;
 };
 
+#define CEPH_ASYNC_CREATE_CONFLICT_BITS 8
+
 struct ceph_fs_client {
 	struct super_block *sb;
 
@@ -124,6 +127,9 @@ struct ceph_fs_client {
 	struct workqueue_struct *inode_wq;
 	struct workqueue_struct *cap_wq;
 
+	DECLARE_HASHTABLE(async_unlink_conflict, CEPH_ASYNC_CREATE_CONFLICT_BITS);
+	spinlock_t async_unlink_conflict_lock;
+
 #ifdef CONFIG_DEBUG_FS
 	struct dentry *debugfs_dentry_lru, *debugfs_caps;
 	struct dentry *debugfs_congestion_kb;
@@ -281,7 +287,8 @@ struct ceph_dentry_info {
 	struct dentry *dentry;
 	struct ceph_mds_session *lease_session;
 	struct list_head lease_list;
-	unsigned flags;
+	struct hlist_node hnode;
+	unsigned long flags;
 	int lease_shared_gen;
 	u32 lease_gen;
 	u32 lease_seq;
@@ -290,10 +297,12 @@ struct ceph_dentry_info {
 	u64 offset;
 };
 
-#define CEPH_DENTRY_REFERENCED		1
-#define CEPH_DENTRY_LEASE_LIST		2
-#define CEPH_DENTRY_SHRINK_LIST	4
-#define CEPH_DENTRY_PRIMARY_LINK	8
+#define CEPH_DENTRY_REFERENCED		(1 << 0)
+#define CEPH_DENTRY_LEASE_LIST		(1 << 1)
+#define CEPH_DENTRY_SHRINK_LIST		(1 << 2)
+#define CEPH_DENTRY_PRIMARY_LINK	(1 << 3)
+#define CEPH_DENTRY_ASYNC_UNLINK_BIT	(4)
+#define CEPH_DENTRY_ASYNC_UNLINK	(1 << CEPH_DENTRY_ASYNC_UNLINK_BIT)
 
 struct ceph_inode_xattrs_info {
 	/*
-- 
2.36.0.rc1
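
For readers who want to try the idea outside the kernel, here is a rough user-space sketch of the scheme the commit message describes: record an unlink as in flight before it is sent, and make any creator of the same (parent, name) pair wait until the reply handler removes the record. It is an illustration under assumptions, not the code in this patch. Every name in it (conflict_table, unlink_begin, unlink_complete, unlink_in_flight, wait_on_conflict_unlink) is hypothetical. A pthread mutex and condition variable stand in for the spinlock-protected RCU hashtable and for wait_on_bit()/wake_up_bit(), and an integer stands in for the parent dentry.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define CONFLICT_BITS    8			/* 256 buckets, mirroring the patch */
#define CONFLICT_BUCKETS (1u << CONFLICT_BITS)

/* One in-flight async unlink, keyed by (parent directory, name). */
struct inflight_unlink {
	struct inflight_unlink *next;
	unsigned long parent;		/* hypothetical stand-in for d_parent */
	char *name;
};

struct conflict_table {
	pthread_mutex_t lock;
	pthread_cond_t done;		/* broadcast when any unlink completes */
	struct inflight_unlink *buckets[CONFLICT_BUCKETS];
};

void conflict_table_init(struct conflict_table *t)
{
	memset(t->buckets, 0, sizeof(t->buckets));
	pthread_mutex_init(&t->lock, NULL);
	pthread_cond_init(&t->done, NULL);
}

unsigned int bucket_of(unsigned long parent, const char *name)
{
	unsigned int h = 5381u + (unsigned int)parent;

	while (*name)
		h = h * 33u + (unsigned char)*name++;
	return h & (CONFLICT_BUCKETS - 1);
}

/* Caller holds t->lock. Returns the link that points at the match, or at NULL. */
struct inflight_unlink **find_link(struct conflict_table *t,
				   unsigned long parent, const char *name)
{
	struct inflight_unlink **pp = &t->buckets[bucket_of(parent, name)];

	while (*pp && ((*pp)->parent != parent || strcmp((*pp)->name, name)))
		pp = &(*pp)->next;
	return pp;
}

/* Record the unlink before the request is sent. Returns 0, or -1 on OOM. */
int unlink_begin(struct conflict_table *t, unsigned long parent, const char *name)
{
	struct inflight_unlink *u;
	size_t len;

	assert(name != NULL);
	len = strlen(name) + 1;
	u = malloc(sizeof(*u));
	if (!u)
		return -1;
	u->name = malloc(len);
	if (!u->name) {
		free(u);
		return -1;
	}
	memcpy(u->name, name, len);
	u->parent = parent;

	pthread_mutex_lock(&t->lock);
	u->next = t->buckets[bucket_of(parent, name)];
	t->buckets[bucket_of(parent, name)] = u;
	pthread_mutex_unlock(&t->lock);
	return 0;
}

/* Reply handler: drop the record, then wake everyone waiting on it. */
void unlink_complete(struct conflict_table *t, unsigned long parent, const char *name)
{
	struct inflight_unlink **pp, *u;

	pthread_mutex_lock(&t->lock);
	pp = find_link(t, parent, name);
	u = *pp;
	if (u)
		*pp = u->next;
	pthread_cond_broadcast(&t->done);
	pthread_mutex_unlock(&t->lock);
	if (u) {
		free(u->name);
		free(u);
	}
}

/* Non-blocking check: is an unlink of (parent, name) still in flight? */
int unlink_in_flight(struct conflict_table *t, unsigned long parent, const char *name)
{
	int found;

	pthread_mutex_lock(&t->lock);
	found = *find_link(t, parent, name) != NULL;
	pthread_mutex_unlock(&t->lock);
	return found;
}

/* Create/link/rename path: block until no unlink of (parent, name) is pending. */
void wait_on_conflict_unlink(struct conflict_table *t, unsigned long parent, const char *name)
{
	pthread_mutex_lock(&t->lock);
	while (*find_link(t, parent, name))
		pthread_cond_wait(&t->done, &t->lock);
	pthread_mutex_unlock(&t->lock);
}
```

In this sketch a thread that issues an unlink would call unlink_begin() before sending the request, and the reply handler would call unlink_complete(). A thread creating a file would call wait_on_conflict_unlink() first, and it returns at once when nothing with that parent and name is pending. Unlike the kernel code, the wait here is not killable and a single condition variable wakes all waiters, which keeps the example short at the cost of spurious wakeups.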