Date: Tue, 25 Oct 2016 16:41:26 -0700
From: Andrei Vagin <avagin@virtuozzo.com>
To: "Eric W. Biederman"
CC: Andrey Vagin, Alexander Viro, Linux Containers, linux-fsdevel, LKML
Subject: Re: [RFC][PATCH v2] mount: In propagate_umount handle overlapping mount propagation trees
Message-ID: <20161025234125.GA20335@outlook.office365.com>
References: <87wphb4pjn.fsf@x220.int.ebiederm.org>
 <8737jy3htt.fsf_-_@x220.int.ebiederm.org>
 <20161018024000.GA4901@outlook.office365.com>
 <87r37e9mnj.fsf@xmission.com> <877f95ngpr.fsf_-_@xmission.com>
 <20161020213052.GA25226@outlook.office365.com>
 <87pomtec6c.fsf@xmission.com> <877f90b27o.fsf_-_@xmission.com>
 <20161025205846.GA25080@outlook.office365.com>
 <87mvhs14s7.fsf@xmission.com>
In-Reply-To: <87mvhs14s7.fsf@xmission.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="koi8-r"
Content-Disposition: inline
User-Agent: Mutt/1.7.0 (2016-08-17)
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Oct 25, 2016 at 04:45:44PM -0500, Eric W. Biederman wrote:
> Andrei Vagin writes:
> 
> > On Sat, Oct 22, 2016 at 02:42:03PM -0500, Eric W. Biederman wrote:
> >> 
> >> Andrei,
> >> 
> >> This fixes the issue you have reported and, through a refactoring,
> >> makes the code simpler and easier to verify. That said, I find your
> >> last test case very interesting. While looking at it in detail
> >> I have realized I don't fully understand why we have both lookup_mnt
> >> and lookup_mnt_last, so I can't say that this change is fully correct.
> >> 
> >> Outside of propagate_umount I don't have concerns, but I am not 100%
> >> convinced that my change to lookup_mnt_last does the right thing
> >> in the case of propagate_umount.
> >> 
> >> I do see why your last test case scales badly: long chains of shared
> >> mounts that we can't skip. At the same time I don't really understand
> >> that case. Part of it has to do with multiple child mounts of the same
> >> mount on the same mountpoint.
> >> 
> >> So I am working through my concerns. In the meantime I figured it
> >> would be useful to post this version, as it is clearly better
> >> than the versions of this change that have come before it.
> > 
> > Hi Eric,
> > 
> > I have tested this version and it works fine.
> > 
> > As for the last test case, could you look at the attached patch?
> > The idea is that we can skip all mounts from a shared group if one
> > of them is already marked.
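To spell the idea out in code: when propagation_visit_child() meets a
child that is already marked, its whole peer group has already been
revised, so instead of iterating over the remaining peers the walk can
either stop (if we are back in the origin's group) or continue from the
group's master. The interesting hunk of the attached patch (quoted in
full further down) ends up looking like this:

        child = __lookup_mnt_last(&m->mnt, mountpoint);
        if (child && !IS_MNT_MARKED(child))
                return child;

        if (!child) {
                m = propagation_next(m, origin);
        } else {
                if (IS_MNT_MARKED(child)) {
                        /* This peer group was already revised. */
                        if (m->mnt_group_id == origin->mnt_group_id)
                                return NULL;
                        m = m->mnt_master;
                }
                m = propagation_next_sib(m, origin);
        }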
> > 
> >> 
> >> Eric
> >> 
> >> From: "Eric W. Biederman"
> >> Date: Thu, 13 Oct 2016 13:27:19 -0500
> >> 
> >> Andrei Vagin pointed out that the time to execute propagate_umount can
> >> go non-linear (and take a ludicrous amount of time) when the mount
> >> propagation trees of the mounts to be unmounted by a lazy unmount
> >> overlap.
> >> 
> >> While investigating the horrible performance I realized that, in the
> >> case of overlapping mount trees, the code has been failing to unmount
> >> all of the mounts it should have been unmounting ever since the
> >> addition of locked mount support.
> >> 
> >> Make the walk of the mount propagation trees nearly linear by using
> >> MNT_MARK to mark pieces of the mount propagation trees that have
> >> already been visited, allowing subsequent walks to skip over
> >> subtrees.
> >> 
> >> Make the processing of mounts order independent by adding a list of
> >> mount entries that need to be unmounted, and simply adding a mount to
> >> that list when it becomes apparent the mount can safely be unmounted.
> >> For mounts that are locked on other mounts but otherwise could be
> >> unmounted, move them from their parent's mnt_mounts to mnt_umounts so
> >> that if and when their parent becomes unmounted these mounts can be
> >> added to the list of mounts to unmount.
> >> 
> >> Add a final pass to clear MNT_MARK and to restore mnt_mounts
> >> from mnt_umounts for anything that did not get unmounted.
> >> 
> >> Add the functions propagation_visit_child and propagation_revisit_child
> >> to coordinate walking of the mount tree and setting and clearing the
> >> mount mark.
> >> 
> >> The skipping of already unmounted mounts has been moved from
> >> __lookup_mnt_last to mark_umount_candidates, so that the new
> >> propagation functions can notice when the propagation tree passes
> >> through the initial set of unmounted mounts. Except in umount_tree,
> >> as part of the unmounting process, the only place where unmounted
> >> mounts should be found is in unmounted subtrees. All of the other
> >> callers of __lookup_mnt_last are from mounted subtrees, so not
> >> checking for unmounted mounts should not affect them.
> >> 
> >> A script to generate overlapping mount propagation trees:
> >> 
> >> $ cat run.sh
> >> mount -t tmpfs test-mount /mnt
> >> mount --make-shared /mnt
> >> for i in `seq $1`; do
> >>         mkdir /mnt/test.$i
> >>         mount --bind /mnt /mnt/test.$i
> >> done
> >> cat /proc/mounts | grep test-mount | wc -l
> >> time umount -l /mnt
> >> 
> >> $ for i in `seq 10 16`; do echo $i; unshare -Urm bash ./run.sh $i; done
> >> 
> >> Here are the performance numbers with and without the patch:
> >> 
> >>  mhash  |  8192   |  8192  | 8192        | 131072 | 131072      | 104857 | 104857
> >>  mounts | before  | after  | after (sys) | after  | after (sys) | after  | after (sys)
> >> -------------------------------------------------------------------------------------
> >>    1024 |  0.071s | 0.020s | 0.000s      | 0.022s | 0.004s      | 0.020s | 0.004s
> >>    2048 |  0.184s | 0.022s | 0.004s      | 0.023s | 0.004s      | 0.022s | 0.008s
> >>    4096 |  0.604s | 0.025s | 0.020s      | 0.029s | 0.008s      | 0.026s | 0.004s
> >>    8912 |  4.471s | 0.053s | 0.020s      | 0.051s | 0.024s      | 0.047s | 0.016s
> >>   16384 | 34.826s | 0.088s | 0.060s      | 0.081s | 0.048s      | 0.082s | 0.052s
> >>   32768 |         | 0.216s | 0.172s      | 0.160s | 0.124s      | 0.160s | 0.096s
> >>   65536 |         | 0.819s | 0.726s      | 0.330s | 0.260s      | 0.338s | 0.256s
> >>  131072 |         | 4.502s | 4.168s      | 0.707s | 0.580s      | 0.709s | 0.592s
> >> 
> >> Andrei Vagin reports that fixing this performance problem is part of
> >> the work to fix CVE-2016-6213.
> >> 
> >> A script for a pathological set of mounts:
> >> 
> >> $ cat pathological.sh
> >> 
> >> mount -t tmpfs base /mnt
> >> mount --make-shared /mnt
> >> mkdir -p /mnt/b
> >> 
> >> mount -t tmpfs test1 /mnt/b
> >> mount --make-shared /mnt/b
> >> mkdir -p /mnt/b/10
> >> 
> >> mount -t tmpfs test2 /mnt/b/10
> >> mount --make-shared /mnt/b/10
> >> mkdir -p /mnt/b/10/20
> >> 
> >> mount --rbind /mnt/b /mnt/b/10/20
> >> 
> >> unshare -Urm sleep 2 &
> >> umount -l /mnt/b
> >> wait %%
> >> 
> >> $ unshare -Urm pathological.sh
> >> 
> >> Cc: stable@vger.kernel.org
> >> Fixes: a05964f3917c ("[PATCH] shared mounts handling: umount")
> >> Fixes: 0c56fe31420c ("mnt: Don't propagate unmounts to locked mounts")
> >> Reported-by: Andrei Vagin
> >> Signed-off-by: "Eric W. Biederman"
> >> ---
> >>  fs/mount.h     |   1 +
> >>  fs/namespace.c |   7 +--
> >>  fs/pnode.c     | 179 +++++++++++++++++++++++++++++++++++++++++----------------
> >>  fs/pnode.h     |   2 +-
> >>  4 files changed, 133 insertions(+), 56 deletions(-)
> >> 
> >> diff --git a/fs/mount.h b/fs/mount.h
> >> index d2e25d7b64b3..00fe0d1d6ba7 100644
> >> --- a/fs/mount.h
> >> +++ b/fs/mount.h
> >> @@ -58,6 +58,7 @@ struct mount {
> >>  	struct mnt_namespace *mnt_ns;	/* containing namespace */
> >>  	struct mountpoint *mnt_mp;	/* where is it mounted */
> >>  	struct hlist_node mnt_mp_list;	/* list mounts with the same mountpoint */
> >> +	struct list_head mnt_umounts;	/* list of children that are being unmounted */
> >>  #ifdef CONFIG_FSNOTIFY
> >>  	struct hlist_head mnt_fsnotify_marks;
> >>  	__u32 mnt_fsnotify_mask;
> >> diff --git a/fs/namespace.c b/fs/namespace.c
> >> index e6c234b1a645..73801391bb00 100644
> >> --- a/fs/namespace.c
> >> +++ b/fs/namespace.c
> >> @@ -237,6 +237,7 @@ static struct mount *alloc_vfsmnt(const char *name)
> >>  		INIT_LIST_HEAD(&mnt->mnt_slave_list);
> >>  		INIT_LIST_HEAD(&mnt->mnt_slave);
> >>  		INIT_HLIST_NODE(&mnt->mnt_mp_list);
> >> +		INIT_LIST_HEAD(&mnt->mnt_umounts);
> >>  #ifdef CONFIG_FSNOTIFY
> >>  		INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
> >>  #endif
> >> @@ -650,13 +651,11 @@ struct mount *__lookup_mnt_last(struct vfsmount *mnt, struct dentry *dentry)
> >>  	p = __lookup_mnt(mnt, dentry);
> >>  	if (!p)
> >>  		goto out;
> >> -	if (!(p->mnt.mnt_flags & MNT_UMOUNT))
> >> -		res = p;
> >> +	res = p;
> >>  	hlist_for_each_entry_continue(p, mnt_hash) {
> >>  		if (&p->mnt_parent->mnt != mnt || p->mnt_mountpoint != dentry)
> >>  			break;
> >> -		if (!(p->mnt.mnt_flags & MNT_UMOUNT))
> >> -			res = p;
> >> +		res = p;
> >>  	}
> >>  out:
> >>  	return res;
> >> diff --git a/fs/pnode.c b/fs/pnode.c
> >> index 234a9ac49958..8fd1a3fb420c 100644
> >> --- a/fs/pnode.c
> >> +++ b/fs/pnode.c
> >> @@ -134,7 +134,8 @@ void change_mnt_propagation(struct mount *mnt, int type)
> >>  }
> >>  
> >>  /*
> >> - * get the next mount in the propagation tree.
> >> + * get the next mount that is not a slave of the current mount in the
> >> + * propagation tree.
> >>   * @m: the mount seen last
> >>   * @origin: the original mount from where the tree walk initiated
> >>   *
> >> @@ -143,13 +144,9 @@ void change_mnt_propagation(struct mount *mnt, int type)
> >>   * vfsmount found while iterating with propagation_next() is
> >>   * a peer of one we'd found earlier.
> >>   */
> >> -static struct mount *propagation_next(struct mount *m,
> >> -				      struct mount *origin)
> >> +static struct mount *propagation_next_sib(struct mount *m,
> >> +					  struct mount *origin)
> >>  {
> >> -	/* are there any slaves of this mount? */
> >> -	if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list))
> >> -		return first_slave(m);
> >> -
> >>  	while (1) {
> >>  		struct mount *master = m->mnt_master;
> >>  
> >> @@ -164,6 +161,26 @@ static struct mount *propagation_next(struct mount *m,
> >>  	}
> >>  }
> >>  
> >> +/*
> >> + * get the next mount in the propagation tree.
> >> + * @m: the mount seen last
> >> + * @origin: the original mount from where the tree walk initiated
> >> + *
> >> + * Note that peer groups form contiguous segments of slave lists.
> >> + * We rely on that in get_source() to be able to find out if
> >> + * vfsmount found while iterating with propagation_next() is
> >> + * a peer of one we'd found earlier.
> >> + */
> >> +static struct mount *propagation_next(struct mount *m,
> >> +				      struct mount *origin)
> >> +{
> >> +	/* are there any slaves of this mount? */
> >> +	if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list))
> >> +		return first_slave(m);
> >> +
> >> +	return propagation_next_sib(m, origin);
> >> +}
> >> +
> >>  static struct mount *next_group(struct mount *m, struct mount *origin)
> >>  {
> >>  	while (1) {
> >> @@ -389,57 +406,92 @@ void propagate_mount_unlock(struct mount *mnt)
> >>  	}
> >>  }
> >>  
> >> -/*
> >> - * Mark all mounts that the MNT_LOCKED logic will allow to be unmounted.
> >> - */
> >> -static void mark_umount_candidates(struct mount *mnt)
> >> +static struct mount *propagation_visit_child(struct mount *last_child,
> >> +					     struct mount *origin_child)
> >>  {
> >> -	struct mount *parent = mnt->mnt_parent;
> >> -	struct mount *m;
> >> +	struct mount *m = last_child->mnt_parent;
> >> +	struct mount *origin = origin_child->mnt_parent;
> >> +	struct dentry *mountpoint = origin_child->mnt_mountpoint;
> >> +	struct mount *child;
> >>  
> >> -	BUG_ON(parent == mnt);
> >> +	/* Has this part of the propgation tree already been visited? */
> >> +	if (IS_MNT_MARKED(last_child))
> >> +		return NULL;
> >>  
> >> -	for (m = propagation_next(parent, parent); m;
> >> -	     m = propagation_next(m, parent)) {
> >> -		struct mount *child = __lookup_mnt_last(&m->mnt,
> >> -						mnt->mnt_mountpoint);
> >> -		if (child && (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m))) {
> >> -			SET_MNT_MARK(child);
> >> -		}
> >> +	SET_MNT_MARK(last_child);
> >> +
> >> +	m = propagation_next(m, origin);
> >> +	while (m) {
> >> +		child = __lookup_mnt_last(&m->mnt, mountpoint);
> >> +		if (child && !IS_MNT_MARKED(child))
> >> +			return child;
> >> +
> >> +		if (!child)
> >> +			m = propagation_next(m, origin);
> >> +		else
> >> +			m = propagation_next_sib(m, origin);
> >>  	}
> >> +	return NULL;
> >>  }
> >>  
> >> -/*
> >> - * NOTE: unmounting 'mnt' naturally propagates to all other mounts its
> >> - * parent propagates to.
> >> - */
> >> -static void __propagate_umount(struct mount *mnt)
> >> +static struct mount *propagation_revisit_child(struct mount *last_child,
> >> +					       struct mount *origin_child)
> >>  {
> >> -	struct mount *parent = mnt->mnt_parent;
> >> -	struct mount *m;
> >> +	struct mount *m = last_child->mnt_parent;
> >> +	struct mount *origin = origin_child->mnt_parent;
> >> +	struct dentry *mountpoint = origin_child->mnt_mountpoint;
> >> +	struct mount *child;
> >>  
> >> -	BUG_ON(parent == mnt);
> >> +	/* Has this part of the propgation tree already been revisited? */
> >> +	if (!IS_MNT_MARKED(last_child))
> >> +		return NULL;
> >>  
> >> -	for (m = propagation_next(parent, parent); m;
> >> -	     m = propagation_next(m, parent)) {
> >> +	CLEAR_MNT_MARK(last_child);
> >>  
> >> -		struct mount *child = __lookup_mnt_last(&m->mnt,
> >> -						mnt->mnt_mountpoint);
> >> -		/*
> >> -		 * umount the child only if the child has no children
> >> -		 * and the child is marked safe to unmount.
> >> -		 */
> >> -		if (!child || !IS_MNT_MARKED(child))
> >> -			continue;
> >> -		CLEAR_MNT_MARK(child);
> >> -		if (list_empty(&child->mnt_mounts)) {
> >> -			list_del_init(&child->mnt_child);
> >> -			child->mnt.mnt_flags |= MNT_UMOUNT;
> >> -			list_move_tail(&child->mnt_list, &mnt->mnt_list);
> >> -		}
> >> +	m = propagation_next(m, origin);
> >> +	while (m) {
> >> +		child = __lookup_mnt_last(&m->mnt, mountpoint);
> >> +		if (child && IS_MNT_MARKED(child))
> >> +			return child;
> >> +
> >> +		if (!child)
> >> +			m = propagation_next(m, origin);
> >> +		else
> >> +			m = propagation_next_sib(m, origin);
> >>  	}
> >> +	return NULL;
> >>  }
> >>  
> >> +static void start_umount_propagation(struct mount *child,
> >> +				     struct list_head *to_umount)
> >> +{
> >> +	do {
> >> +		struct mount *parent = child->mnt_parent;
> >> +
> >> +		if ((child->mnt.mnt_flags & MNT_UMOUNT) ||
> >> +		    !list_empty(&child->mnt_mounts))
> >> +			return;
> >> +
> >> +		if (!IS_MNT_LOCKED(child))
> >> +			list_move_tail(&child->mnt_child, to_umount);
> >> +		else
> >> +			list_move_tail(&child->mnt_child, &parent->mnt_umounts);
> >> +
> >> +		child = NULL;
> >> +		if (IS_MNT_MARKED(parent))
> >> +			child = parent;
> >> +	} while (child);
> >> +}
> >> +
> >> +static void end_umount_propagation(struct mount *child)
> >> +{
> >> +	struct mount *parent = child->mnt_parent;
> >> +
> >> +	if (!list_empty(&parent->mnt_umounts))
> >> +		list_splice_tail_init(&parent->mnt_umounts, &parent->mnt_mounts);
> >> +}
> >> +
> >> +
> >>  /*
> >>   * collect all mounts that receive propagation from the mount in @list,
> >>   * and return these additional mounts in the same list.
> >> @@ -447,14 +499,39 @@ static void __propagate_umount(struct mount *mnt)
> >>   *
> >>   * vfsmount lock must be held for write
> >>   */
> >> -int propagate_umount(struct list_head *list)
> >> +void propagate_umount(struct list_head *list)
> >>  {
> >>  	struct mount *mnt;
> >> +	LIST_HEAD(to_umount);
> >> +	LIST_HEAD(tmp_list);
> >> +
> >> +	/* Find candidates for unmounting */
> >> +	list_for_each_entry(mnt, list, mnt_list) {
> >> +		struct mount *child;
> >> +		for (child = propagation_visit_child(mnt, mnt); child;
> >> +		     child = propagation_visit_child(child, mnt))
> >> +			start_umount_propagation(child, &to_umount);
> >> +	}
> >>  
> >> -	list_for_each_entry_reverse(mnt, list, mnt_list)
> >> -		mark_umount_candidates(mnt);
> >> +	/* Begin unmounting */
> >> +	while (!list_empty(&to_umount)) {
> >> +		mnt = list_first_entry(&to_umount, struct mount, mnt_child);
> >>  
> >> -	list_for_each_entry(mnt, list, mnt_list)
> >> -		__propagate_umount(mnt);
> >> -	return 0;
> >> +		list_del_init(&mnt->mnt_child);
> >> +		mnt->mnt.mnt_flags |= MNT_UMOUNT;
> >> +		list_move_tail(&mnt->mnt_list, &tmp_list);
> >> +
> >> +		if (!list_empty(&mnt->mnt_umounts))
> >> +			list_splice_tail_init(&mnt->mnt_umounts, &to_umount);
> >> +	}
> >> +
> >> +	/* Cleanup the mount propagation tree */
> >> +	list_for_each_entry(mnt, list, mnt_list) {
> >> +		struct mount *child;
> >> +		for (child = propagation_revisit_child(mnt, mnt); child;
> >> +		     child = propagation_revisit_child(child, mnt))
> >> +			end_umount_propagation(child);
> >> +	}
> >> +
> >> +	list_splice_tail(&tmp_list, list);
> >>  }
> >> diff --git a/fs/pnode.h b/fs/pnode.h
> >> index 550f5a8b4fcf..38c6cdb96b34 100644
> >> --- a/fs/pnode.h
> >> +++ b/fs/pnode.h
> >> @@ -41,7 +41,7 @@ static inline void set_mnt_shared(struct mount *mnt)
> >>  void change_mnt_propagation(struct mount *, int);
> >>  int propagate_mnt(struct mount *, struct mountpoint *, struct mount *,
> >>  		struct hlist_head *);
> >> -int propagate_umount(struct list_head *);
> >> +void propagate_umount(struct list_head *);
> >>  int propagate_mount_busy(struct mount *, int);
> >>  void propagate_mount_unlock(struct mount *);
> >>  void mnt_release_group_id(struct mount *);
> >> -- 
> >> 2.10.1
> >> 
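Just to check that I read the new control flow right: propagate_umount()
is now three linear passes over the propagation trees. Condensed from the
hunk above (declarations trimmed):

        /* Pass 1: walk the propagation trees once, marking visited
         * subtrees and collecting every mount that may go away. */
        list_for_each_entry(mnt, list, mnt_list) {
                struct mount *child;
                for (child = propagation_visit_child(mnt, mnt); child;
                     child = propagation_visit_child(child, mnt))
                        start_umount_propagation(child, &to_umount);
        }

        /* Pass 2: unmount the candidates; children that were only held
         * back by a locked parent follow once that parent goes away. */
        while (!list_empty(&to_umount)) {
                mnt = list_first_entry(&to_umount, struct mount, mnt_child);

                list_del_init(&mnt->mnt_child);
                mnt->mnt.mnt_flags |= MNT_UMOUNT;
                list_move_tail(&mnt->mnt_list, &tmp_list);

                if (!list_empty(&mnt->mnt_umounts))
                        list_splice_tail_init(&mnt->mnt_umounts, &to_umount);
        }

        /* Pass 3: walk the trees again, clearing the marks and giving
         * the children that did not get unmounted back to their parents. */
        list_for_each_entry(mnt, list, mnt_list) {
                struct mount *child;
                for (child = propagation_revisit_child(mnt, mnt); child;
                     child = propagation_revisit_child(child, mnt))
                        end_umount_propagation(child);
        }

The mark set in pass 1 and cleared in pass 3 is what lets a later walk
skip subtrees that an earlier walk has already covered, which is where
the near-linear behaviour comes from.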
> > 
> > From 8e0f45c0272aa1f789d1657a0acc98c58919dcc3 Mon Sep 17 00:00:00 2001
> > From: Andrei Vagin
> > Date: Tue, 25 Oct 2016 13:57:31 -0700
> > Subject: [PATCH] mount: skip all mounts from a shared group if one is marked
> > 
> > If we meet a marked mount, it means that all mounts from
> > its group have already been revised.
> > 
> > Signed-off-by: Andrei Vagin
> > ---
> >  fs/pnode.c | 18 +++++++++++++++---
> >  1 file changed, 15 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/pnode.c b/fs/pnode.c
> > index 8fd1a3f..ebb7134 100644
> > --- a/fs/pnode.c
> > +++ b/fs/pnode.c
> > @@ -426,10 +426,16 @@ static struct mount *propagation_visit_child(struct mount *last_child,
> >  		if (child && !IS_MNT_MARKED(child))
> >  			return child;
> >  
> > -		if (!child)
> > +		if (!child) {
> >  			m = propagation_next(m, origin);
> > -		else
> > +		} else {
> > +			if (IS_MNT_MARKED(child)) {
> > +				if (m->mnt_group_id == origin->mnt_group_id)
> > +					return NULL;
> > +				m = m->mnt_master;
> > +			}
> >  			m = propagation_next_sib(m, origin);
> > +		}
> >  	}
> >  	return NULL;
> >  }
> > @@ -456,8 +462,14 @@ static struct mount *propagation_revisit_child(struct mount *last_child,
> >  
> >  		if (!child)
> >  			m = propagation_next(m, origin);
> > -		else
> > +		else {
> > +			if (!IS_MNT_MARKED(child)) {
> > +				if (m->mnt_group_id == origin->mnt_group_id)
> > +					return NULL;
> > +				m = m->mnt_master;
> > +			}
> >  			m = propagation_next_sib(m, origin);
> > +		}
> >  	}
> >  	return NULL;
> >  }
> 
> That is certainly interesting. The problem is that the reason we were
> going slow is that there were, in fact, mounts in the shared group that
> had not been traversed yet.

You are right.

> And in fact the entire idea of visiting a (vfsmount, mountpoint) pair
> exactly once is wrong in the face of shadow mounts. For a (vfsmount,
> mountpoint) pair that has shadow mounts, the number of shadow mounts
> needs to be decreased by one each time the propagation tree is
> traversed during unmount. Which means that, as far as I can see, we
> have to kill shadow mounts to correctly optimize this code. Once shadow
> mounts are gone I don't know of a case where we need your optimization.

Without shadow mounts, it will be hard to preserve predictable behaviour
in cases like this:

$ unshare --propagation private -m sh test.sh
+ mount -t tmpfs --make-shared zzzz A
+ mkdir A/a
+ mount -t tmpfs zzzz A/a
+ mount --bind A B
+ mount -t tmpfs zzzz B/a
+ grep zzzz
+ cat /proc/self/mountinfo
155 123 0:44 / /root/tmp/A rw,relatime shared:70 - tmpfs zzzz rw
156 155 0:45 / /root/tmp/A/a rw,relatime shared:71 - tmpfs zzzz rw
157 123 0:44 / /root/tmp/B rw,relatime shared:70 - tmpfs zzzz rw
158 157 0:46 / /root/tmp/B/a rw,relatime shared:72 - tmpfs zzzz rw
159 155 0:46 / /root/tmp/A/a rw,relatime shared:72 - tmpfs zzzz rw
+ umount B/a
+ grep zzzz
+ cat /proc/self/mountinfo
155 123 0:44 / /root/tmp/A rw,relatime shared:70 - tmpfs zzzz rw
156 155 0:45 / /root/tmp/A/a rw,relatime shared:71 - tmpfs zzzz rw
157 123 0:44 / /root/tmp/B rw,relatime shared:70 - tmpfs zzzz rw

X + a - a = X: mounting "a" and then unmounting it returns the tree to
its original state X. Note that mount 159, which was propagated together
with 158, is gone as well, while 156 survives.

Maybe we need to add another ID for propagated mounts: when we do
umount, we would then detach only the mounts with the same propagation
id.
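Nothing like this exists today, so this is only a sketch of what I mean;
the field and the helper below are made up:

        /* Hypothetical: stamp every mount created by propagation with an
         * id of the propagation event that created it. */
        struct mount {
                /* ... existing fields ... */
                u64 mnt_propagation_id;         /* hypothetical new field */
        };

        /* During umount propagation, tear down only the mounts that were
         * created by the propagation event that is being undone. */
        static bool same_propagation_event(struct mount *origin,
                                           struct mount *child)
        {
                return child->mnt_propagation_id == origin->mnt_propagation_id;
        }

In the example above, unmounting 158 would then take 159 with it (they
were created by the same mount event) but would leave 156 alone.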
I support the idea of killing shadow mounts. I guess it will help us to
simplify the algorithm for dumping and restoring a mount tree in CRIU.
Currently it is a big pain for us.

> I am busily verifying my patch to kill shadow mounts, but the following
> patch is the minimal version. As far as I can see, propagate_one
> is the only place where we create shadow mounts, and holding the
> namespace_lock over attach_recursive_mnt, propagate_mnt, and
> propagate_one is sufficient for that __lookup_mnt to be completely
> safe.
> 
> diff --git a/fs/pnode.c b/fs/pnode.c
> index 234a9ac49958..b14119b370d4 100644
> --- a/fs/pnode.c
> +++ b/fs/pnode.c
> @@ -217,6 +217,9 @@ static int propagate_one(struct mount *m)
>  	/* skip if mountpoint isn't covered by it */
>  	if (!is_subdir(mp->m_dentry, m->mnt.mnt_root))
>  		return 0;
> +	/* skip if mountpoint already has a mount on it */
> +	if (__lookup_mnt(&m->mnt, mp->m_dentry))
> +		return 0;
>  	if (peers(m, last_dest)) {
>  		type = CL_MAKE_SHARED;
>  	} else {
> 
> If you run with that patch you will see that there are go-faster
> stripes.
> 
> Eric
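One more thought, coming back to your first question about the two
lookup helpers: if shadow mounts can no longer be created, a (parent,
mountpoint) pair identifies at most one child, so as far as I can tell
the difference between __lookup_mnt() and __lookup_mnt_last() disappears
and callers could use the plain lookup everywhere, roughly:

        /* Without shadow mounts each (parent, mountpoint) pair has at
         * most one child, so the first match is the only match. */
        child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);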