Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934039AbcJMRS2 (ORCPT ); Thu, 13 Oct 2016 13:18:28 -0400 Received: from out03.mta.xmission.com ([166.70.13.233]:59063 "EHLO out03.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933766AbcJMRSU (ORCPT ); Thu, 13 Oct 2016 13:18:20 -0400 From: ebiederm@xmission.com (Eric W. Biederman) To: Andrei Vagin Cc: Alexander Viro , containers@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org References: <1476141965-21429-1-git-send-email-avagin@openvz.org> Date: Thu, 13 Oct 2016 12:14:55 -0500 In-Reply-To: <1476141965-21429-1-git-send-email-avagin@openvz.org> (Andrei Vagin's message of "Mon, 10 Oct 2016 16:26:05 -0700") Message-ID: <877f9c6ui8.fsf@x220.int.ebiederm.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-XM-SPF: eid=1bujcw-0000oT-ND;;;mid=<877f9c6ui8.fsf@x220.int.ebiederm.org>;;;hst=in02.mta.xmission.com;;;ip=75.170.125.99;;;frm=ebiederm@xmission.com;;;spf=neutral X-XM-AID: U2FsdGVkX19pPgPU8xyruCYsD9C52zNsRCm9nJYQ7w0= X-SA-Exim-Connect-IP: 75.170.125.99 X-SA-Exim-Mail-From: ebiederm@xmission.com X-Spam-Report: * -1.0 ALL_TRUSTED Passed through trusted hosts only via SMTP * 0.7 XMSubLong Long Subject * 0.0 TVD_RCVD_IP Message was received from an IP address * 0.0 T_TM2_M_HEADER_IN_MSG BODY: No description available. * 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% * [score: 0.5000] * -0.0 DCC_CHECK_NEGATIVE Not listed in DCC * [sa08 1397; Body=1 Fuz1=1 Fuz2=1] * 0.2 T_XMDrugObfuBody_14 obfuscated drug references * 0.0 T_TooManySym_02 5+ unique symbols in subject * 0.0 T_TooManySym_01 4+ unique symbols in subject X-Spam-DCC: XMission; sa08 1397; Body=1 Fuz1=1 Fuz2=1 X-Spam-Combo: ;Andrei Vagin X-Spam-Relay-Country: X-Spam-Timing: total 400 ms - load_scoreonly_sql: 0.03 (0.0%), signal_user_changed: 3.0 (0.8%), b_tie_ro: 2.1 (0.5%), parse: 0.71 (0.2%), extract_message_metadata: 15 (3.6%), get_uri_detail_list: 2.8 (0.7%), tests_pri_-1000: 6 (1.6%), tests_pri_-950: 1.65 (0.4%), tests_pri_-900: 1.36 (0.3%), tests_pri_-400: 30 (7.5%), check_bayes: 29 (7.2%), b_tokenize: 9 (2.1%), b_tok_get_all: 10 (2.4%), b_comp_prob: 3.6 (0.9%), b_tok_touch_all: 3.7 (0.9%), b_finish: 0.79 (0.2%), tests_pri_0: 333 (83.2%), check_dkim_signature: 0.85 (0.2%), check_dkim_adsp: 3.4 (0.9%), tests_pri_500: 6 (1.4%), poll_dns_idle: 0.48 (0.1%), rewrite_mail: 0.00 (0.0%) Subject: Re: [PATCH] [v3] mount: dont execute propagate_umount() many times for same mounts X-Spam-Flag: No X-SA-Exim-Version: 4.2.1 (built Thu, 05 May 2016 13:38:54 -0600) X-SA-Exim-Scanned: Yes (on in02.mta.xmission.com) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4328 Lines: 124 Andrei Vagin writes: > The reason of this optimization is that umount() can hold namespace_sem > for a long time, this semaphore is global, so it affects all users. > Recently Eric W. Biederman added a per mount namespace limit on the > number of mounts. The default number of mounts allowed per mount > namespace at 100,000. Currently this value is allowed to construct a tree > which requires hours to be umounted. > > In a worse case the current complexity of umount_tree() is O(n^3). > * Enumirate all mounts in a target tree (propagate_umount) > * Enumirate mounts to find where these changes have to > be propagated (mark_umount_candidates) > * Enumirate mounts to find a requered mount by parent and dentry > (__lookup_mnt_lat) > > The worse case is when all mounts from the tree live in the same shared > group. In this case we have to enumirate all mounts on each step. > > Here we can optimize the second step. We don't need to make it for > mounts which we already met when we did this step for previous mounts. > It reduces the complexity of umount_tree() to O(n^2). To O(n) not O(n^2). A hash table lookup (aka __lookup_mnt() and friends) is O(1) or the hash table is malfunctioning. Please don't call Arguably we are getting into sizes where the mount hash table fills up and is on the edge of malfunctioning, but that is not particularly relevant to this case. What your patch is aiming to do is to take a O(n^2) algorithm and make it O(n). That is very much worth doing. However your patch confuses two separate issues. Marking mounts that may be unmounted. Marking pieces of the propagation tree that have already been traversed. I do not see anything requiring propagation trees to intersect at the set of mounts that are unmounted in umount_tree before propagate_umount is called. Which means there are topologies where we can and should do better than your patch. I am also bothered that your patch changes how we look up the mount mounted on a mount point (aka playing with __lookup_mnt_last). There is no reason to do that to solve the problem, and I think it obscures what is actually going on. I am going to see if I can rework your basic concept with explicit marking of the propagation tree. In the meantime for people who are want to see what your patch is doing the version below essentially does the same thing, without the extra essentially meaningless loop. Eric diff --git a/fs/namespace.c b/fs/namespace.c index 8183fba9ab4d..33a76ee1b76b 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -650,13 +650,11 @@ struct mount *__lookup_mnt_last(struct vfsmount *mnt, struct dentry *dentry) p = __lookup_mnt(mnt, dentry); if (!p) goto out; - if (!(p->mnt.mnt_flags & MNT_UMOUNT)) - res = p; + res = p; hlist_for_each_entry_continue(p, mnt_hash) { if (&p->mnt_parent->mnt != mnt || p->mnt_mountpoint != dentry) break; - if (!(p->mnt.mnt_flags & MNT_UMOUNT)) - res = p; + res = p; } out: return res; diff --git a/fs/pnode.c b/fs/pnode.c index 234a9ac49958..ade5e7d8308c 100644 --- a/fs/pnode.c +++ b/fs/pnode.c @@ -399,11 +399,18 @@ static void mark_umount_candidates(struct mount *mnt) BUG_ON(parent == mnt); + if (IS_MNT_MARKED(mnt)) + return; + + SET_MNT_MARK(mnt); + for (m = propagation_next(parent, parent); m; m = propagation_next(m, parent)) { struct mount *child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint); - if (child && (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m))) { + if (child && ((child->mnt.mnt_flags & MNT_UMOUNT) || + !IS_MNT_LOCKED(child) || + IS_MNT_MARKED(m))) { SET_MNT_MARK(child); } } @@ -420,6 +427,11 @@ static void __propagate_umount(struct mount *mnt) BUG_ON(parent == mnt); + if (!IS_MNT_MARKED(mnt)) + return; + + CLEAR_MNT_MARK(mnt); + for (m = propagation_next(parent, parent); m; m = propagation_next(m, parent)) { @@ -432,7 +444,8 @@ static void __propagate_umount(struct mount *mnt) if (!child || !IS_MNT_MARKED(child)) continue; CLEAR_MNT_MARK(child); - if (list_empty(&child->mnt_mounts)) { + if (!(child->mnt.mnt_flags & MNT_UMOUNT) && + list_empty(&child->mnt_mounts)) { list_del_init(&child->mnt_child); child->mnt.mnt_flags |= MNT_UMOUNT; list_move_tail(&child->mnt_list, &mnt->mnt_list);