--- ./kernel/timer.c.ppiddbg 2006-07-14 19:11:06.000000000 +0400
+++ ./kernel/timer.c 2006-08-08 14:19:24.000000000 +0400
@@ -1342,6 +1342,7 @@ asmlinkage long sys_getpid(void)
asmlinkage long sys_getppid(void)
{
int pid;
+#ifndef CONFIG_DEBUG_SLAB
struct task_struct *me = current;
struct task_struct *parent;
@@ -1364,6 +1365,16 @@ asmlinkage long sys_getppid(void)
#endif
break;
}
+#else
+ /*
+ * ->real_parent could be released before dereference and
+ * we accessed freed kernel memory, which faults with debugging on.
+ * Keep it simple and stupid.
+ */
+ read_lock(&tasklist_lock);
+ pid = current->group_leader->real_parent->tgid;
+ read_unlock(&tasklist_lock);
+#endif
return pid;
}
On Tue, 2006-08-08 at 14:22 +0400, Kirill Korotaev wrote:
> sys_getppid() optimization can access a freed memory.
> On kernels with DEBUG_SLAB turned ON, this results in
> Oops.
...
> +#else
> + /*
> + * ->real_parent could be released before dereference and
> + * we accessed freed kernel memory, which faults with debugging on.
> + * Keep it simple and stupid.
> + */
> + read_lock(&tasklist_lock);
> + pid = current->group_leader->real_parent->tgid;
> + read_unlock(&tasklist_lock);
> +#endif
> return pid;
> }
Accessing freed memory is a bug, always, not just *only* when slab
debugging is on, right? Doesn't this mean we could get junk, or that
the reader could potentially run off a bad pointer?
It seems that this patch only papers over the problem in the case when
it is observed, but doesn't really even fix the normal case.
Could we use a seqlock to determine when real_parent is in flux, and
re-read real_parent until we get a consistent one? We could use in in
lieu of the existing for() loop.
-- Dave
On 2006.08.08 08:26:57 -0700, Dave Hansen wrote:
> On Tue, 2006-08-08 at 14:22 +0400, Kirill Korotaev wrote:
> > sys_getppid() optimization can access a freed memory.
> > On kernels with DEBUG_SLAB turned ON, this results in
> > Oops.
> ...
> > +#else
> > + /*
> > + * ->real_parent could be released before dereference and
> > + * we accessed freed kernel memory, which faults with debugging on.
> > + * Keep it simple and stupid.
> > + */
> > + read_lock(&tasklist_lock);
> > + pid = current->group_leader->real_parent->tgid;
> > + read_unlock(&tasklist_lock);
> > +#endif
> > return pid;
> > }
>
> Accessing freed memory is a bug, always, not just *only* when slab
> debugging is on, right? Doesn't this mean we could get junk, or that
> the reader could potentially run off a bad pointer?
>
> It seems that this patch only papers over the problem in the case when
> it is observed, but doesn't really even fix the normal case.
>
> Could we use a seqlock to determine when real_parent is in flux, and
> re-read real_parent until we get a consistent one? We could use in in
> lieu of the existing for() loop.
See this thread: http://lkml.org/lkml/2006/8/8/215
Bj?rn
On Tue, 2006-08-08 at 17:34 +0200, Bj?rn Steinbrink wrote:
> See this thread: http://lkml.org/lkml/2006/8/8/215
Bah. I see it now.
I'll put this on the list of things that still have to be fixed before
we do any kind of memory removal. This will definitely be a bug there.
-- Dave
> Accessing freed memory is a bug, always, not just *only* when slab
> debugging is on, right? Doesn't this mean we could get junk, or that
> the reader could potentially run off a bad pointer?
no, read the comment in sys_getppid.
It is a valid optimization. _safe_ and alowing to bypass taking the lock.
BUT! This optimization relies on the fact that kernel memory (DMA + normal zone)
is always mapped into virtual address space.
Which is invalid for debug kernels only.
> It seems that this patch only papers over the problem in the case when
> it is observed, but doesn't really even fix the normal case.
>
> Could we use a seqlock to determine when real_parent is in flux, and
> re-read real_parent until we get a consistent one? We could use in in
> lieu of the existing for() loop.
Kirill
On Tue, 2006-08-08 at 19:43 +0400, Kirill Korotaev wrote:
> > Accessing freed memory is a bug, always, not just *only* when slab
> > debugging is on, right? Doesn't this mean we could get junk, or that
> > the reader could potentially run off a bad pointer?
> no, read the comment in sys_getppid.
> It is a valid optimization. _safe_ and alowing to bypass taking the lock.
> BUT! This optimization relies on the fact that kernel memory (DMA + normal zone)
> is always mapped into virtual address space.
> Which is invalid for debug kernels only.
Actually, it might also be invalid in hypervisor environments. s390 and
Xen use ballooning drivers to tell the hypervisor which pages are not
currently in use by the OS so that they may be used in virtual machines
elsewhere.
I'm cc'ing the s390 guys. Will the s390 kernel oops if it accesses a
page which was ballooned back to the hypervisor?
-- Dave
>>>Accessing freed memory is a bug, always, not just *only* when slab
>>>debugging is on, right? Doesn't this mean we could get junk, or that
>>>the reader could potentially run off a bad pointer?
>>
>>no, read the comment in sys_getppid.
>>It is a valid optimization. _safe_ and alowing to bypass taking the lock.
>>BUT! This optimization relies on the fact that kernel memory (DMA + normal zone)
>>is always mapped into virtual address space.
>>Which is invalid for debug kernels only.
>
>
> Actually, it might also be invalid in hypervisor environments. s390 and
> Xen use ballooning drivers to tell the hypervisor which pages are not
> currently in use by the OS so that they may be used in virtual machines
> elsewhere.
>
> I'm cc'ing the s390 guys. Will the s390 kernel oops if it accesses a
> page which was ballooned back to the hypervisor?
Yeah, a minute after my reply I got your idea and sent a new patch
which always takes the lock :)))
Thanks,
Kirill
On Tue, 2006-08-08 at 08:49 -0700, Dave Hansen wrote:
> On Tue, 2006-08-08 at 19:43 +0400, Kirill Korotaev wrote:
> > > Accessing freed memory is a bug, always, not just *only* when slab
> > > debugging is on, right? Doesn't this mean we could get junk, or that
> > > the reader could potentially run off a bad pointer?
> > no, read the comment in sys_getppid.
> > It is a valid optimization. _safe_ and alowing to bypass taking the lock.
> > BUT! This optimization relies on the fact that kernel memory (DMA + normal zone)
> > is always mapped into virtual address space.
> > Which is invalid for debug kernels only.
>
> Actually, it might also be invalid in hypervisor environments. s390 and
> Xen use ballooning drivers to tell the hypervisor which pages are not
> currently in use by the OS so that they may be used in virtual machines
> elsewhere.
>
> I'm cc'ing the s390 guys. Will the s390 kernel oops if it accesses a
> page which was ballooned back to the hypervisor?
Not with the ballooner, that just tells the hypervisor that it can
forget the current content. On the next access the hypervisor hands out
a zeroed page so the access will succeed. But with my guest page hinting
code the kernel will oops if a free page is accessed.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
Kirill Korotaev <[email protected]> writes:
[adding linux-arch]
> > Accessing freed memory is a bug, always, not just *only* when slab
> > debugging is on, right? Doesn't this mean we could get junk, or that
> > the reader could potentially run off a bad pointer?
> no, read the comment in sys_getppid.
> It is a valid optimization. _safe_ and alowing to bypass taking the lock.
> BUT! This optimization relies on the fact that kernel memory (DMA + normal zone)
> is always mapped into virtual address space.
> Which is invalid for debug kernels only.
In x86 arch code we would use __get_user for this (and we do in a couple
of places). But it wouldn't be portable because sometimes _user is
in a different address space.
Maybe it would be time to make a similar facility (read/write_kernel_safe() or similar)
with error return available to generic code?
It should be easy to implement - iirc near all architectures already
use the exception handling frame work and it is a simple extension
of that. x86 could just define it to __put/get_user
-Andi
On 09 Aug 2006 05:09:11 +0200
Andi Kleen <[email protected]> wrote:
> Kirill Korotaev <[email protected]> writes:
>
> [adding linux-arch]
>
> > > Accessing freed memory is a bug, always, not just *only* when slab
> > > debugging is on, right? Doesn't this mean we could get junk, or that
> > > the reader could potentially run off a bad pointer?
> > no, read the comment in sys_getppid.
> > It is a valid optimization. _safe_ and alowing to bypass taking the lock.
> > BUT! This optimization relies on the fact that kernel memory (DMA + normal zone)
> > is always mapped into virtual address space.
> > Which is invalid for debug kernels only.
>
> In x86 arch code we would use __get_user for this (and we do in a couple
> of places). But it wouldn't be portable because sometimes _user is
> in a different address space.
>
> Maybe it would be time to make a similar facility (read/write_kernel_safe() or similar)
> with error return available to generic code?
>
> It should be easy to implement - iirc near all architectures already
> use the exception handling frame work and it is a simple extension
> of that. x86 could just define it to __put/get_user
>
I just did something like that:
Similar to ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/broken-out/add-probe_kernel_address.patch
Although I'm not sure it's needed for this problem. A getppid() which does
asmlinkage long sys_getppid(void)
{
int pid;
read_lock(&tasklist_lock);
pid = current->group_leader->real_parent->tgid;
read_unlock(&tasklist_lock);
return pid;
}
seems like a fine implementation to me ;)