MIME-Version: 1.0
In-Reply-To: <20161013214650.GB19836@outlook.office365.com>
References: <1476141965-21429-1-git-send-email-avagin@openvz.org>
 <877f9c6ui8.fsf@x220.int.ebiederm.org> <87pon458l1.fsf_-_@x220.int.ebiederm.org>
 <20161013214650.GB19836@outlook.office365.com>
From: Andrey Vagin <avagin@openvz.org>
Date: Thu, 13 Oct 2016 19:31:22 -0700
Message-ID: <CANaxB-xPkgdyeg0z6TvExMfyy4uOC+Nu4Q99WpCscNKMWz8VPg@mail.gmail.com>
Subject: Re: [RFC][PATCH] mount: In mark_umount_candidates and
 __propogate_umount visit each mount once
To: Andrei Vagin <avagin@virtuozzo.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
        Alexander Viro <viro@zeniv.linux.org.uk>,
        Linux Containers <containers@lists.linux-foundation.org>,
        linux-fsdevel <linux-fsdevel@vger.kernel.org>,
        LKML <linux-kernel@vger.kernel.org>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3528
Lines: 106

On Thu, Oct 13, 2016 at 2:46 PM, Andrei Vagin <avagin@virtuozzo.com> wrote:
> On Thu, Oct 13, 2016 at 02:53:46PM -0500, Eric W. Biederman wrote:
>>
>> Adrei Vagin pointed out that time to executue propagate_umount can go
>> non-linear (and take a ludicrious amount of time) when the mount
>> propogation trees of the mounts to be unmunted by a lazy unmount
>> overlap.
>>
>> Solve this in the most straight forward way possible, by adding a new
>> mount flag to mark parts of the mount propagation tree that have been
>> visited, and use that mark to skip parts of the mount propagation tree
>> that have already been visited during an unmount.  This guarantees
>> that each mountpoint in the possibly overlapping mount propagation
>> trees will be visited exactly once.
>>
>> Add the functions propagation_visit_next and propagation_revisit_next
>> to coordinate setting and clearling the visited mount mark.
>>
>> Here is a script to generate such mount tree:
>> $ cat run.sh
>> mount -t tmpfs test-mount /mnt
>> mount --make-shared /mnt
>> for i in `seq $1`; do
>>         mkdir /mnt/test.$i
>>         mount --bind /mnt /mnt/test.$i
>> done
>> cat /proc/mounts | grep test-mount | wc -l
>> time umount -l /mnt
>> $ for i in `seq 10 16`; do echo $i; unshare -Urm bash ./run.sh $i; done
>>
>> Here are the performance numbers with and without the patch:
>>
>> mounts | before | after (real sec)
>> -----------------------------
>>   1024 |  0.071 | 0.024
>>   2048 |  0.184 | 0.030
>>   4096 |  0.604 | 0.040
>>   8912 |  4.471 | 0.043
>>  16384 | 34.826 | 0.082
>>  32768 |        | 0.151
>>  65536 |        | 0.289
>> 131072 |        | 0.659
>>
>> Andrei Vagin fixing this performance problem is part of the
>> work to fix CVE-2016-6213.
>>
>> Cc: stable@vger.kernel.org
>> Reported-by: Andrei Vagin <avagin@openvz.org>
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>
>> Andrei can you take a look at this patch and see if you can see any
>> problems.  My limited testing suggests this approach does a much better
>> job of solving the problem you were seeing.  With the time looking
>> almost linear in the number of mounts now.
>
> I read this patch and I like the idea.
>
> Then I run my tests and one of them doesn't work with this patch.
> I haven't found a reason yet.

>> +     for (m = propagation_visit_next(parent, parent); m;
>> +                     m = propagation_visit_next(m, parent)) {
>>               struct mount *child = __lookup_mnt_last(&m->mnt,
>>                                               mnt->mnt_mountpoint);

The reason is that this loop is called for different "mnt", but
it is executed only once with this optimization.

So I think the idea to mark parent will not work, because one parent
can have a few children which have to be umounted.

>
> Here is the test:
>
> [root@fc24 mounts]# cat run.sh
> set -e
> mount -t tmpfs zdtm /mnt
> mkdir -p /mnt/1 /mnt/2
> mount -t tmpfs zdtm /mnt/1
> mount --make-shared /mnt/1
> for i in `seq $1`; do
>         mount --bind /mnt/1 `mktemp -d /mnt/1/test.XXXXXX`
> done
> mount --rbind /mnt/1 /mnt/2
> cat /proc/self/mountinfo | grep zdtm | wc -l
> time umount -l /mnt/1
> cat /proc/self/mountinfo | grep zdtm | wc -l
> umount /mnt/2
>
>
> [root@fc24 mounts]# unshare -Urm ./run.sh  5
> 65
>
> real    0m0.014s
> user    0m0.000s
> sys     0m0.004s
> 33
> umount: /mnt/2: target is busy
>         (In some cases useful info about processes that
>          use the device is found by lsof(8) or fuser(1).)
>
>>

Thanks,
Andrei