2006-02-18 01:35:49

by Herbert Poetzl

[permalink] [raw]
Subject: kjournald keeps reference to namespace


Hi Folks!

when creating a private namespace (CLONE_NS) and
then mounting an ext3 filesystem, a new kernel
thread (kjournald) is created, which keeps a
reference to the namespace, which after the the
process exits, remains and blocks access to the
block device, as it is still bd_claim-ed.

this leaves a private namespace behind and a
block device which cannot be opened exclusively.
unmount is not an option, as the namespace is
not longer reachable.

this behaviour seems to be there since ever,
well since namespaces and kjournald exists :)

the following 'cruel' hack 'solves' this issue

best,
Herbert


--- fs/jbd/journal.c.orig 2006-01-03 17:29:56 +0100
+++ fs/jbd/journal.c 2006-02-18 02:23:21 +0100
@@ -33,6 +33,7 @@
#include <linux/mm.h>
#include <linux/suspend.h>
#include <linux/pagemap.h>
+#include <linux/namespace.h>
#include <asm/uaccess.h>
#include <asm/page.h>
#include <linux/proc_fs.h>
@@ -116,6 +117,13 @@ static int kjournald(void *arg)
struct timer_list timer;

daemonize("kjournald");
+ {
+ struct namespace *ns = current->namespace;
+
+ current->namespace = NULL;
+ put_namespace(ns);
+ }
+

/* Set up an interval timer which can be used to trigger a
commit wakeup after the commit interval expires */


2006-02-18 01:55:55

by Andrew Morton

[permalink] [raw]
Subject: Re: kjournald keeps reference to namespace

Herbert Poetzl <[email protected]> wrote:
>
>
> Hi Folks!
>
> when creating a private namespace (CLONE_NS) and
> then mounting an ext3 filesystem, a new kernel
> thread (kjournald) is created, which keeps a
> reference to the namespace, which after the the
> process exits, remains and blocks access to the
> block device, as it is still bd_claim-ed.

There are numerous ways in which user processes can parent kernel threads.

bix:/usr/src/linux-2.6.16-rc4> grep -rl kernel_thread drivers net fs | wc
64 64 1657

> this leaves a private namespace behind and a
> block device which cannot be opened exclusively.
> unmount is not an option, as the namespace is
> not longer reachable.
>
> this behaviour seems to be there since ever,
> well since namespaces and kjournald exists :)
>
> the following 'cruel' hack 'solves' this issue
>
> best,
> Herbert
>
>
> --- fs/jbd/journal.c.orig 2006-01-03 17:29:56 +0100
> +++ fs/jbd/journal.c 2006-02-18 02:23:21 +0100
> @@ -33,6 +33,7 @@
> #include <linux/mm.h>
> #include <linux/suspend.h>
> #include <linux/pagemap.h>
> +#include <linux/namespace.h>
> #include <asm/uaccess.h>
> #include <asm/page.h>
> #include <linux/proc_fs.h>
> @@ -116,6 +117,13 @@ static int kjournald(void *arg)
> struct timer_list timer;
>
> daemonize("kjournald");
> + {
> + struct namespace *ns = current->namespace;
> +
> + current->namespace = NULL;
> + put_namespace(ns);
> + }
> +
>

I think it'd be better to convert ext3 to use the kthread API which appears
to accidentally not have this problem, because such threads are parented by
keventd, which were parented by init.

That being said, perhaps we should do a put_namespace() in kernel_thread(),
too.

I'm kinda surprised that your patch didn't oops over a NULL ->namespace
when the kernel internally mounted the root filesystem.

2006-02-18 03:30:34

by Herbert Poetzl

[permalink] [raw]
Subject: Re: kjournald keeps reference to namespace

On Fri, Feb 17, 2006 at 05:54:28PM -0800, Andrew Morton wrote:
> Herbert Poetzl <[email protected]> wrote:
> >
> >
> > Hi Folks!
> >
> > when creating a private namespace (CLONE_NS) and
> > then mounting an ext3 filesystem, a new kernel
> > thread (kjournald) is created, which keeps a
> > reference to the namespace, which after the the
> > process exits, remains and blocks access to the
> > block device, as it is still bd_claim-ed.
>
> There are numerous ways in which user processes can parent kernel threads.
>
> bix:/usr/src/linux-2.6.16-rc4> grep -rl kernel_thread drivers net fs | wc
> 64 64 1657
>
> > this leaves a private namespace behind and a
> > block device which cannot be opened exclusively.
> > unmount is not an option, as the namespace is
> > not longer reachable.
> >
> > this behaviour seems to be there since ever,
> > well since namespaces and kjournald exists :)
> >
> > the following 'cruel' hack 'solves' this issue
> >
> > best,
> > Herbert
> >
> >
> > --- fs/jbd/journal.c.orig 2006-01-03 17:29:56 +0100
> > +++ fs/jbd/journal.c 2006-02-18 02:23:21 +0100
> > @@ -33,6 +33,7 @@
> > #include <linux/mm.h>
> > #include <linux/suspend.h>
> > #include <linux/pagemap.h>
> > +#include <linux/namespace.h>
> > #include <asm/uaccess.h>
> > #include <asm/page.h>
> > #include <linux/proc_fs.h>
> > @@ -116,6 +117,13 @@ static int kjournald(void *arg)
> > struct timer_list timer;
> >
> > daemonize("kjournald");
> > + {
> > + struct namespace *ns = current->namespace;
> > +
> > + current->namespace = NULL;
> > + put_namespace(ns);
> > + }
> > +
> >
>
> I think it'd be better to convert ext3 to use the kthread API which
> appears to accidentally not have this problem, because such threads
> are parented by keventd, which were parented by init.

sounds like a plan!

> That being said, perhaps we should do a put_namespace() in
> kernel_thread(), too.

hmm, keep the reference but put_namespace()?

> I'm kinda surprised that your patch didn't oops over a NULL
> ->namespace when the kernel internally mounted the root filesystem.

nope, booted just fine, but I was worried too, the
interesting detail is that the kjournald thread
vanishes with this 'hack' when the namespace is
disposed, which doesn't happen without it ...

anyway, will look into it, of course, input is
welcome ...

best,
Herbert


2006-02-18 13:36:51

by Björn Steinbrink

[permalink] [raw]
Subject: Re: kjournald keeps reference to namespace

On 2006.02.18 02:35:47 +0100, Herbert Poetzl wrote:
>
> Hi Folks!
>
> when creating a private namespace (CLONE_NS) and
> then mounting an ext3 filesystem, a new kernel
> thread (kjournald) is created, which keeps a
> reference to the namespace, which after the the
> process exits, remains and blocks access to the
> block device, as it is still bd_claim-ed.
>
> this leaves a private namespace behind and a
> block device which cannot be opened exclusively.
> unmount is not an option, as the namespace is
> not longer reachable.
>
> this behaviour seems to be there since ever,
> well since namespaces and kjournald exists :)
>
> the following 'cruel' hack 'solves' this issue

In daemonize() a new thread gets cleaned up and 'merged' with init_task.
The current fs_struct is handled there, but not the current namespace.
The following patch adds the namespace part.

Signed-off-by: Bj?rn Steinbrink <[email protected]>
---


diff -NurpP --minimal linux-2.6.16-rc4/kernel/exit.c linux-2.6.16-rc4-ns/kernel/exit.c
--- linux-2.6.16-rc4/kernel/exit.c 2006-02-18 13:59:59.000000000 +0100
+++ linux-2.6.16-rc4-ns/kernel/exit.c 2006-02-18 14:04:26.000000000 +0100
@@ -360,6 +360,8 @@ void daemonize(const char *name, ...)
fs = init_task.fs;
current->fs = fs;
atomic_inc(&fs->count);
+ exit_namespace(current);
+ current->namespace = init_task.namespace;
exit_files(current);
current->files = init_task.files;
atomic_inc(&current->files->count);

2006-02-18 16:32:31

by Björn Steinbrink

[permalink] [raw]
Subject: Re: kjournald keeps reference to namespace

> In daemonize() a new thread gets cleaned up and 'merged' with init_task.
> The current fs_struct is handled there, but not the current namespace.
> The following patch adds the namespace part.
>
> Signed-off-by: Bj?rn Steinbrink <[email protected]>
> ---

Oops, forgot the increment the namespace usage count...
---


diff -NurpP --minimal linux-2.6.16-rc4/kernel/exit.c linux-2.6.16-rc4-ns/kernel/exit.c
--- linux-2.6.16-rc4/kernel/exit.c 2006-02-18 13:59:59.000000000 +0100
+++ linux-2.6.16-rc4-ns/kernel/exit.c 2006-02-18 17:27:48.000000000 +0100
@@ -360,6 +360,9 @@ void daemonize(const char *name, ...)
fs = init_task.fs;
current->fs = fs;
atomic_inc(&fs->count);
+ exit_namespace(current);
+ current->namespace = init_task.namespace;
+ atomic_inc(&current->namespace->count);
exit_files(current);
current->files = init_task.files;
atomic_inc(&current->files->count);

2006-02-18 17:12:48

by Björn Steinbrink

[permalink] [raw]
Subject: Re: kjournald keeps reference to namespace

On 2006.02.18 17:32:27 +0100, Bj?rn Steinbrink wrote:
> > In daemonize() a new thread gets cleaned up and 'merged' with init_task.
> > The current fs_struct is handled there, but not the current namespace.
> > The following patch adds the namespace part.
> >
> > Signed-off-by: Bj?rn Steinbrink <[email protected]>
> > ---
>
> Oops, forgot the increment the namespace usage count...

Ok, this time with the get_namespace wrapper, thanks to Eric Biederman
for pointing that out to me.
---


diff -NurpP --minimal linux-2.6.16-rc4/kernel/exit.c linux-2.6.16-rc4-ns/kernel/exit.c
--- linux-2.6.16-rc4/kernel/exit.c 2006-02-18 13:59:59.000000000 +0100
+++ linux-2.6.16-rc4-ns/kernel/exit.c 2006-02-18 17:57:21.000000000 +0100
@@ -360,6 +360,9 @@ void daemonize(const char *name, ...)
fs = init_task.fs;
current->fs = fs;
atomic_inc(&fs->count);
+ exit_namespace(current);
+ current->namespace = init_task.namespace;
+ get_namespace(current->namespace);
exit_files(current);
current->files = init_task.files;
atomic_inc(&current->files->count);

2006-02-19 02:34:15

by Eric W. Biederman

[permalink] [raw]
Subject: Re: kjournald keeps reference to namespace

Bj?rn Steinbrink <[email protected]> writes:

> On 2006.02.18 17:32:27 +0100, Bj?rn Steinbrink wrote:
>> > In daemonize() a new thread gets cleaned up and 'merged' with init_task.
>> > The current fs_struct is handled there, but not the current namespace.
>> > The following patch adds the namespace part.
>> >
>> > Signed-off-by: Bj?rn Steinbrink <[email protected]>
>> > ---
>>
>> Oops, forgot the increment the namespace usage count...
>
> Ok, this time with the get_namespace wrapper, thanks to Eric Biederman
> for pointing that out to me.

Acked-by: Eric Biederman <[email protected]>

Note we can't ever count on using our parents namespace because
we already have called exit_fs(), which is the only way to the
namespace from a process.

> ---
>
>
> diff -NurpP --minimal linux-2.6.16-rc4/kernel/exit.c
> linux-2.6.16-rc4-ns/kernel/exit.c
> --- linux-2.6.16-rc4/kernel/exit.c 2006-02-18 13:59:59.000000000 +0100
> +++ linux-2.6.16-rc4-ns/kernel/exit.c 2006-02-18 17:57:21.000000000 +0100
> @@ -360,6 +360,9 @@ void daemonize(const char *name, ...)
> fs = init_task.fs;
> current->fs = fs;
> atomic_inc(&fs->count);
> + exit_namespace(current);
> + current->namespace = init_task.namespace;
> + get_namespace(current->namespace);
> exit_files(current);
> current->files = init_task.files;
> atomic_inc(&current->files->count);

2006-02-24 21:28:54

by Paul Collins

[permalink] [raw]
Subject: Re: kjournald keeps reference to namespace

Herbert Poetzl <[email protected]> writes:

> On Fri, Feb 17, 2006 at 05:54:28PM -0800, Andrew Morton wrote:
>> I think it'd be better to convert ext3 to use the kthread API which
>> appears to accidentally not have this problem, because such threads
>> are parented by keventd, which were parented by init.
>
> sounds like a plan!

Here's my attempt at such a conversion. Since jbd doesn't seem to
want to collect an exit status, I didn't bother with kthread_stop().

I got overexcited and also embedded the journal device in the process
name, but that's probably useless churn. Looks nice in pstree though:

|-kthread-+-kblockd/0
| |-khubd
| |-2*[pdflush]
| |-aio/0
| |-v9fs/0
| |-cqueue/0
| |-kfand
| |-kcryptd/0
| |-kjournald/3:3
| |-kjournald/3:8
| |-kjournald/3:4
| |-kjournald/3:5
| `-kjournald/254:1


Signed-off-by: Paul Collins <[email protected]>

diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index e4b516a..e33d993 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -33,6 +33,8 @@
#include <linux/mm.h>
#include <linux/suspend.h>
#include <linux/pagemap.h>
+#include <linux/kthread.h>
+#include <linux/kdev_t.h>
#include <asm/uaccess.h>
#include <asm/page.h>
#include <linux/proc_fs.h>
@@ -115,8 +117,6 @@ static int kjournald(void *arg)
transaction_t *transaction;
struct timer_list timer;

- daemonize("kjournald");
-
/* Set up an interval timer which can be used to trigger a
commit wakeup after the commit interval expires */
init_timer(&timer);
@@ -207,12 +207,14 @@ end_loop:
journal->j_task = NULL;
wake_up(&journal->j_wait_done_commit);
jbd_debug(1, "Journal thread exiting.\n");
- return 0;
+ do_exit(0);
}

static void journal_start_thread(journal_t *journal)
{
- kernel_thread(kjournald, journal, CLONE_VM|CLONE_FS|CLONE_FILES);
+ dev_t jdev = journal->j_dev->bd_dev;
+ kthread_run(kjournald, journal, "kjournald/%d:%d",
+ MAJOR(jdev), MINOR(jdev), NULL);
wait_event(journal->j_wait_done_commit, journal->j_task != 0);
}



--
Dag vijandelijk luchtschip de huismeester is dood

2006-02-24 21:36:49

by Andrew Morton

[permalink] [raw]
Subject: Re: kjournald keeps reference to namespace

Paul Collins <[email protected]> wrote:
>
> Herbert Poetzl <[email protected]> writes:
>
> > On Fri, Feb 17, 2006 at 05:54:28PM -0800, Andrew Morton wrote:
> >> I think it'd be better to convert ext3 to use the kthread API which
> >> appears to accidentally not have this problem, because such threads
> >> are parented by keventd, which were parented by init.
> >
> > sounds like a plan!
>
> Here's my attempt at such a conversion. Since jbd doesn't seem to
> want to collect an exit status, I didn't bother with kthread_stop().

Ah. I already did something similar.
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc4/2.6.16-rc4-mm2/broken-out/jbd-convert-kjournald-to-kthread-api.patch

> I got overexcited and also embedded the journal device in the process
> name, but that's probably useless churn. Looks nice in pstree though:
>
> |-kthread-+-kblockd/0
> | |-khubd
> | |-2*[pdflush]
> | |-aio/0
> | |-v9fs/0
> | |-cqueue/0
> | |-kfand
> | |-kcryptd/0
> | |-kjournald/3:3
> | |-kjournald/3:8
> | |-kjournald/3:4
> | |-kjournald/3:5
> | `-kjournald/254:1

We only have 15 chars for that string - the final one you have there is on
the raggedy edge.

2006-02-24 22:02:22

by Paul Collins

[permalink] [raw]
Subject: Re: kjournald keeps reference to namespace

Andrew Morton <[email protected]> writes:

> Paul Collins <[email protected]> wrote:
>>
>> Herbert Poetzl <[email protected]> writes:
>>
>> > On Fri, Feb 17, 2006 at 05:54:28PM -0800, Andrew Morton wrote:
>> >> I think it'd be better to convert ext3 to use the kthread API which
>> >> appears to accidentally not have this problem, because such threads
>> >> are parented by keventd, which were parented by init.
>> >
>> > sounds like a plan!
>>
>> Here's my attempt at such a conversion. Since jbd doesn't seem to
>> want to collect an exit status, I didn't bother with kthread_stop().
>
> Ah. I already did something similar.
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc4/2.6.16-rc4-mm2/broken-out/jbd-convert-kjournald-to-kthread-api.patch

Sweet, I managed to get it right at least.

>> I got overexcited and also embedded the journal device in the process
>> name, but that's probably useless churn. Looks nice in pstree though:
>>
>> | `-kjournald/254:1
>
> We only have 15 chars for that string - the final one you have there is on
> the raggedy edge.

Oh well, it's not that useful anyway. I can count the number of times
I've needed to kill a particular kjournald on the fingers of one head.

--
Dag vijandelijk luchtschip de huismeester is dood