I recently saw cgroup_attach_task drop this bomb:
[ 46.045806] Unable to handle kernel paging request at virtual address 00200200
Which is clearly linked-list poison.
Dereferencing 00100104 has also been seen nearby according to a quick
web-search that I did.
Apparently, whether nodes are on a list is being checked with list_empty(),
and if they're on a list, they're list_del()ed. According to a subsequent
list_empty() check, they're still on a list, as list_del() doesn't turn
the nodes into singleton lists, it simply poisons both its pointers, and
merry poison dereferencing may ensue. Oops.
There are at least 2 to address this matter, I've gone for the latter:
1) Do not use list_empty() to check if a node is on a list or not. Have
an additional new function that checks to see whether the node is either
a singleton or is poisoned. Something like list_node_{on,off}_list()?
2) Ensure that you never leave poison anywhere where you might want
to use list_empty().
It might be that these oopses are seen only because there's a marginal
race in the cgroups code, as they seem to be very rare. In that case
this patchset might not fix the core problem, but might simply hide it.
Someone with more cgroups expertise might want to investigate that
possibility.
Patch 1 is the "hindsight is 20/20" patch which would have made
identifying the issue trivial.
Cheers,
Phil
list_del() leaves poison in the prev and next pointers. The next
list_empty() will compare those poisons, and say the list isn't empty.
Any list operations that assume the node is on a list because of such a
check will be fooled into dereferencing poison. One needs to INIT the
node after the del, and fortunately there's already a wrapper for that -
list_del_init().
Some of the dels are followed by deallocations, so can be ignored,
and one can be merged with an add to make a move. Apart from that, I
erred on the side of caution in making nodes list_empty()-queriable.
Signed-off-by: Phil Carmody <[email protected]>
---
kernel/cgroup.c | 14 ++++++--------
1 files changed, 6 insertions(+), 8 deletions(-)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b24d702..bcc7336 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1813,10 +1813,8 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
/* Update the css_set linked lists if we're using them */
write_lock(&css_set_lock);
- if (!list_empty(&tsk->cg_list)) {
- list_del(&tsk->cg_list);
- list_add(&tsk->cg_list, &newcg->tasks);
- }
+ if (!list_empty(&tsk->cg_list))
+ list_move(&tsk->cg_list, &newcg->tasks);
write_unlock(&css_set_lock);
for_each_subsys(root, ss) {
@@ -3655,12 +3653,12 @@ again:
spin_lock(&release_list_lock);
set_bit(CGRP_REMOVED, &cgrp->flags);
if (!list_empty(&cgrp->release_list))
- list_del(&cgrp->release_list);
+ list_del_init(&cgrp->release_list);
spin_unlock(&release_list_lock);
cgroup_lock_hierarchy(cgrp->root);
/* delete this cgroup from parent->children */
- list_del(&cgrp->sibling);
+ list_del_init(&cgrp->sibling);
cgroup_unlock_hierarchy(cgrp->root);
d = dget(cgrp->dentry);
@@ -3879,7 +3877,7 @@ void cgroup_unload_subsys(struct cgroup_subsys *ss)
subsys[ss->subsys_id] = NULL;
/* remove subsystem from rootnode's list of subsystems */
- list_del(&ss->sibling);
+ list_del_init(&ss->sibling);
/*
* disentangle the css from all css_sets attached to the dummytop. as
@@ -4253,7 +4251,7 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
if (!list_empty(&tsk->cg_list)) {
write_lock(&css_set_lock);
if (!list_empty(&tsk->cg_list))
- list_del(&tsk->cg_list);
+ list_del_init(&tsk->cg_list);
write_unlock(&css_set_lock);
}
--
1.7.2.rc1.37.gf8c40
Heed the notice in list_del: "Note: list_empty() on entry does not
return true after this, the entry is in an undefined state.", and
check for precisely that condition.
There are currently a few instances in the code of this sequence:
if(!list_empty(pnode))
list_del(pnode);
which seems to be useless or dangerous if intended to protect from
repeated del's. And given that I've seen an oops pointing to a
dereference of poison in such a list_empty, I'm veering towards
dangerous. This patch would make such errors obvious.
Nothing is changed in the non-DEBUG_LIST build.
Signed-off-by: Phil Carmody <[email protected]>
---
include/linux/list.h | 4 ++++
lib/list_debug.c | 18 ++++++++++++++++++
2 files changed, 22 insertions(+), 0 deletions(-)
diff --git a/include/linux/list.h b/include/linux/list.h
index 3a54266..59bebd1 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -183,10 +183,14 @@ static inline int list_is_last(const struct list_head *list,
* list_empty - tests whether a list is empty
* @head: the list to test.
*/
+#ifndef CONFIG_DEBUG_LIST
static inline int list_empty(const struct list_head *head)
{
return head->next == head;
}
+#else
+extern int list_empty(const struct list_head *head);
+#endif
/**
* list_empty_careful - tests whether a list is empty and not being modified
diff --git a/lib/list_debug.c b/lib/list_debug.c
index b8029a5..aaba728 100644
--- a/lib/list_debug.c
+++ b/lib/list_debug.c
@@ -73,3 +73,21 @@ void list_del(struct list_head *entry)
entry->prev = LIST_POISON2;
}
EXPORT_SYMBOL(list_del);
+
+/**
+ * list_empty - tests whether a list is empty
+ * @head: the list to test.
+ */
+int list_empty(const struct list_head *head)
+{
+ if ((head->prev == LIST_POISON2) || (head->prev == LIST_POISON1))
+ WARN(1, "list_empty performed on a node "
+ "at %p removed from a list.\n", head);
+ else
+ WARN((head->prev == head) != (head->next == head),
+ "list_empty corruption. %p<-%p->%p is half-empty.\n",
+ head->prev, head, head->next);
+
+ return head->next == head;
+}
+EXPORT_SYMBOL(list_empty);
--
1.7.2.rc1.37.gf8c40
On Tue, Mar 15, 2011 at 03:08:41PM +0200, Phil Carmody wrote:
> Apparently, whether nodes are on a list is being checked with list_empty(),
> and if they're on a list, they're list_del()ed. According to a subsequent
> list_empty() check, they're still on a list, as list_del() doesn't turn
> the nodes into singleton lists, it simply poisons both its pointers, and
> merry poison dereferencing may ensue. Oops.
>
> There are at least 2 to address this matter, I've gone for the latter:
>
> 1) Do not use list_empty() to check if a node is on a list or not. Have
> an additional new function that checks to see whether the node is either
> a singleton or is poisoned. Something like list_node_{on,off}_list()?
>
> 2) Ensure that you never leave poison anywhere where you might want
> to use list_empty().
The correct way is to use list_del_init() if you want to do list_empty
checks later on.
On 15/03/11 09:51 -0400, ext Christoph Hellwig wrote:
> On Tue, Mar 15, 2011 at 03:08:41PM +0200, Phil Carmody wrote:
> > Apparently, whether nodes are on a list is being checked with list_empty(),
> > and if they're on a list, they're list_del()ed. According to a subsequent
> > list_empty() check, they're still on a list, as list_del() doesn't turn
> > the nodes into singleton lists, it simply poisons both its pointers, and
> > merry poison dereferencing may ensue. Oops.
> >
> > There are at least 2 to address this matter, I've gone for the latter:
> >
> > 1) Do not use list_empty() to check if a node is on a list or not. Have
> > an additional new function that checks to see whether the node is either
> > a singleton or is poisoned. Something like list_node_{on,off}_list()?
> >
> > 2) Ensure that you never leave poison anywhere where you might want
> > to use list_empty().
>
> The correct way is to use list_del_init() if you want to do list_empty
> checks later on.
I.e. (2). Glad I chose that one.
Phil
On Tue, Mar 15, 2011 at 04:01:53PM +0200, Phil Carmody wrote:
> > > 2) Ensure that you never leave poison anywhere where you might want
> > > to use list_empty().
> >
> > The correct way is to use list_del_init() if you want to do list_empty
> > checks later on.
>
> I.e. (2). Glad I chose that one.
Yes - sorry didn't see the actual patches before.
On Tue, Mar 15, 2011 at 6:08 AM, Phil Carmody
<[email protected]> wrote:
> list_del() leaves poison in the prev and next pointers. The next
> list_empty() will compare those poisons, and say the list isn't empty.
> Any list operations that assume the node is on a list because of such a
> check will be fooled into dereferencing poison. One needs to INIT the
> node after the del, and fortunately there's already a wrapper for that -
> list_del_init().
>
> Some of the dels are followed by deallocations, so can be ignored,
> and one can be merged with an add to make a move. Apart from that, I
> erred on the side of caution in making nodes list_empty()-queriable.
>
> Signed-off-by: Phil Carmody <[email protected]>
Reviewed-by: Paul Menage <[email protected]>
Thanks,
Paul
> ---
> ?kernel/cgroup.c | ? 14 ++++++--------
> ?1 files changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index b24d702..bcc7336 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1813,10 +1813,8 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>
> ? ? ? ?/* Update the css_set linked lists if we're using them */
> ? ? ? ?write_lock(&css_set_lock);
> - ? ? ? if (!list_empty(&tsk->cg_list)) {
> - ? ? ? ? ? ? ? list_del(&tsk->cg_list);
> - ? ? ? ? ? ? ? list_add(&tsk->cg_list, &newcg->tasks);
> - ? ? ? }
> + ? ? ? if (!list_empty(&tsk->cg_list))
> + ? ? ? ? ? ? ? list_move(&tsk->cg_list, &newcg->tasks);
> ? ? ? ?write_unlock(&css_set_lock);
>
> ? ? ? ?for_each_subsys(root, ss) {
> @@ -3655,12 +3653,12 @@ again:
> ? ? ? ?spin_lock(&release_list_lock);
> ? ? ? ?set_bit(CGRP_REMOVED, &cgrp->flags);
> ? ? ? ?if (!list_empty(&cgrp->release_list))
> - ? ? ? ? ? ? ? list_del(&cgrp->release_list);
> + ? ? ? ? ? ? ? list_del_init(&cgrp->release_list);
> ? ? ? ?spin_unlock(&release_list_lock);
>
> ? ? ? ?cgroup_lock_hierarchy(cgrp->root);
> ? ? ? ?/* delete this cgroup from parent->children */
> - ? ? ? list_del(&cgrp->sibling);
> + ? ? ? list_del_init(&cgrp->sibling);
> ? ? ? ?cgroup_unlock_hierarchy(cgrp->root);
>
> ? ? ? ?d = dget(cgrp->dentry);
> @@ -3879,7 +3877,7 @@ void cgroup_unload_subsys(struct cgroup_subsys *ss)
> ? ? ? ?subsys[ss->subsys_id] = NULL;
>
> ? ? ? ?/* remove subsystem from rootnode's list of subsystems */
> - ? ? ? list_del(&ss->sibling);
> + ? ? ? list_del_init(&ss->sibling);
>
> ? ? ? ?/*
> ? ? ? ? * disentangle the css from all css_sets attached to the dummytop. as
> @@ -4253,7 +4251,7 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
> ? ? ? ?if (!list_empty(&tsk->cg_list)) {
> ? ? ? ? ? ? ? ?write_lock(&css_set_lock);
> ? ? ? ? ? ? ? ?if (!list_empty(&tsk->cg_list))
> - ? ? ? ? ? ? ? ? ? ? ? list_del(&tsk->cg_list);
> + ? ? ? ? ? ? ? ? ? ? ? list_del_init(&tsk->cg_list);
> ? ? ? ? ? ? ? ?write_unlock(&css_set_lock);
> ? ? ? ?}
>
> --
> 1.7.2.rc1.37.gf8c40
>
>
On Tue, Mar 15, 2011 at 03:08:43PM +0200, Phil Carmody wrote:
> list_del() leaves poison in the prev and next pointers. The next
> list_empty() will compare those poisons, and say the list isn't empty.
> Any list operations that assume the node is on a list because of such a
> check will be fooled into dereferencing poison. One needs to INIT the
> node after the del, and fortunately there's already a wrapper for that -
> list_del_init().
>
> Some of the dels are followed by deallocations, so can be ignored,
> and one can be merged with an add to make a move. Apart from that, I
> erred on the side of caution in making nodes list_empty()-queriable.
>
> Signed-off-by: Phil Carmody <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On Tue, 15 Mar 2011 15:08:42 +0200
Phil Carmody <[email protected]> wrote:
> Heed the notice in list_del: "Note: list_empty() on entry does not
> return true after this, the entry is in an undefined state.", and
> check for precisely that condition.
>
> There are currently a few instances in the code of this sequence:
> if(!list_empty(pnode))
> list_del(pnode);
> which seems to be useless or dangerous if intended to protect from
> repeated del's. And given that I've seen an oops pointing to a
> dereference of poison in such a list_empty, I'm veering towards
> dangerous. This patch would make such errors obvious.
>
> Nothing is changed in the non-DEBUG_LIST build.
>
> ...
>
> +
> +/**
> + * list_empty - tests whether a list is empty
> + * @head: the list to test.
> + */
> +int list_empty(const struct list_head *head)
> +{
> + if ((head->prev == LIST_POISON2) || (head->prev == LIST_POISON1))
> + WARN(1, "list_empty performed on a node "
> + "at %p removed from a list.\n", head);
> + else
> + WARN((head->prev == head) != (head->next == head),
> + "list_empty corruption. %p<-%p->%p is half-empty.\n",
> + head->prev, head, head->next);
> +
> + return head->next == head;
> +}
> +EXPORT_SYMBOL(list_empty);
The second warning here is triggering maybe a hundred times from all
over the place just when booting the kernel.
Here's the first two:
[ 64.295941] WARNING: at lib/list_debug.c:89 list_empty+0x79/0x85()
[ 64.296129] list_empty corruption. ffff880255bcb788<-ffff880255bcb788->ffff88024c3a3c20 is half-empty.
[ 64.296443] Modules linked in: autofs4 sunrpc ipv6 dm_mirror dm_region_hash dm_log dm_multipath dm_mod video sbs sbshc battery ac lp parport sg option usb_wwan ide_cd_mod cdrom usbserial serio_raw floppy snd_hda_intel snd_hda_codec snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device button snd_pcm_oss snd_mixer_oss snd_pcm snd_timer i2c_i801 i2c_core snd soundcore snd_page_alloc shpchp pcspkr ehci_hcd ohci_hcd uhci_hcd
[ 64.299193] Pid: 3637, comm: cp Tainted: G W 2.6.38 #1
[ 64.299363] Call Trace:
[ 64.299531] [<ffffffff81037ba4>] warn_slowpath_common+0x80/0x98
[ 64.299700] [<ffffffff81037c50>] warn_slowpath_fmt+0x41/0x43
[ 64.299887] [<ffffffff811af661>] list_empty+0x79/0x85
[ 64.300074] [<ffffffff81383581>] unix_write_space+0xa5/0x10e
[ 64.300246] [<ffffffff813834dc>] ? unix_write_space+0x0/0x10e
[ 64.300418] [<ffffffff812fc62a>] sock_wfree+0x31/0x51
[ 64.300586] [<ffffffff81381bef>] unix_destruct_scm+0xc0/0xcd
[ 64.300755] [<ffffffff812feef6>] skb_release_head_state+0x7f/0xb0
[ 64.300928] [<ffffffff8130035a>] __kfree_skb+0x11/0x7c
[ 64.301096] [<ffffffff813003ed>] consume_skb+0x28/0x2a
[ 64.301264] [<ffffffff813826eb>] unix_stream_recvmsg+0x5ad/0x778
[ 64.301450] [<ffffffff810508a2>] ? autoremove_wake_function+0x0/0x38
[ 64.301623] [<ffffffff812f76c2>] sock_aio_read+0x148/0x160
[ 64.301793] [<ffffffff81172bb9>] ? file_has_perm+0x90/0x9e
[ 64.301961] [<ffffffff812f757a>] ? sock_aio_read+0x0/0x160
[ 64.302130] [<ffffffff810d2a9b>] do_sync_readv_writev+0xbc/0xfb
[ 64.302303] [<ffffffff81170840>] ? security_file_permission+0x80/0x89
[ 64.302472] [<ffffffff810d3116>] do_readv_writev+0xb6/0x182
[ 64.302641] [<ffffffff812f8896>] ? sys_connect+0x78/0x9e
[ 64.302822] [<ffffffff810d3355>] vfs_readv+0x3e/0x49
[ 64.302989] [<ffffffff810d341f>] sys_readv+0x48/0x72
[ 64.303158] [<ffffffff813b323b>] system_call_fastpath+0x16/0x1b
[ 64.303326] ---[ end trace 713534840f2a9415 ]---
[ 180.120065] ------------[ cut here ]------------
[ 180.120154] WARNING: at lib/list_debug.c:89 list_empty+0x79/0x85()
[ 180.120213] list_empty corruption. ffff88025fefd9d0<-ffff88025fefd9d0->ffff880235647c20 is half-empty.
[ 180.120306] Modules linked in: autofs4 sunrpc ipv6 dm_mirror dm_region_hash dm_log dm_multipath dm_mod video sbs sbshc battery ac lp parport sg option usb_wwan ide_cd_mod cdrom usbserial serio_raw floppy snd_hda_intel snd_hda_codec snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device button snd_pcm_oss snd_mixer_oss snd_pcm snd_timer i2c_i801 i2c_core snd soundcore snd_page_alloc shpchp pcspkr ehci_hcd ohci_hcd uhci_hcd
[ 180.122547] Pid: 6241, comm: sh Tainted: G W 2.6.38 #1
[ 180.122603] Call Trace:
[ 180.122670] [<ffffffff81037ba4>] warn_slowpath_common+0x80/0x98
[ 180.122728] [<ffffffff81037c50>] warn_slowpath_fmt+0x41/0x43
[ 180.122785] [<ffffffff81055b73>] ? local_clock+0x2b/0x3c
[ 180.122841] [<ffffffff811af661>] list_empty+0x79/0x85
[ 180.122906] [<ffffffff8105085c>] __wake_up_bit+0x1c/0x3d
[ 180.122964] [<ffffffff810938d4>] unlock_page+0x25/0x29
[ 180.123020] [<ffffffff810aaa16>] __do_fault+0x3da/0x411
[ 180.123077] [<ffffffff810ab5c0>] handle_pte_fault+0x289/0x79a
[ 180.123146] [<ffffffff813ad8d8>] ? _raw_spin_unlock+0x26/0x2a
[ 180.123205] [<ffffffff810acf19>] handle_mm_fault+0x1c6/0x1de
[ 180.123263] [<ffffffff813b0931>] do_page_fault+0x3cc/0x3f1
[ 180.123320] [<ffffffff813adb40>] ? restore_args+0x0/0x30
[ 180.123388] [<ffffffff811ab27d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 180.123445] [<ffffffff813add2f>] page_fault+0x1f/0x30
[ 180.123501] ---[ end trace 713534840f2a9416 ]---
On 21/03/11 16:52 -0700, ext Andrew Morton wrote:
> On Tue, 15 Mar 2011 15:08:42 +0200
> Phil Carmody <[email protected]> wrote:
> > +int list_empty(const struct list_head *head)
> > +{
> > + if ((head->prev == LIST_POISON2) || (head->prev == LIST_POISON1))
> > + WARN(1, "list_empty performed on a node "
> > + "at %p removed from a list.\n", head);
> > + else
> > + WARN((head->prev == head) != (head->next == head),
> > + "list_empty corruption. %p<-%p->%p is half-empty.\n",
> > + head->prev, head, head->next);
> The second warning here is triggering maybe a hundred times from all
> over the place just when booting the kernel.
>
> Here's the first two:
>
> [ 64.295941] WARNING: at lib/list_debug.c:89 list_empty+0x79/0x85()
> [ 64.296129] list_empty corruption. ffff880255bcb788<-ffff880255bcb788->ffff88024c3a3c20 is half-empty.
OK, so the patch is working as expected. Perhaps my expectations were wrong.
Looking at list.h I was sure that lists should always be either circular or
poisoned both ends. The above is a rho-shape, this == prev.
Traditional list_empty() returns false on such a node, so it should be
possible to list_del() it. But then next->prev will be set to this->prev
which is this. So this will never be deleted from the list. That situation
rings warning bells in my head. Which I guess is what the patch was trying
to concretise.
I presume the above are x86_64, I'll see if I can get access to such a
machine in the next few days, or reproduce it on one of the architectures
I do have here.
Phil
On Mon, Mar 21, 2011 at 04:52:06PM -0700, Andrew Morton wrote:
> On Tue, 15 Mar 2011 15:08:42 +0200
> Phil Carmody <[email protected]> wrote:
>
> > Heed the notice in list_del: "Note: list_empty() on entry does not
> > return true after this, the entry is in an undefined state.", and
> > check for precisely that condition.
> >
> > There are currently a few instances in the code of this sequence:
> > if(!list_empty(pnode))
> > list_del(pnode);
> > which seems to be useless or dangerous if intended to protect from
> > repeated del's. And given that I've seen an oops pointing to a
> > dereference of poison in such a list_empty, I'm veering towards
> > dangerous. This patch would make such errors obvious.
> >
> > Nothing is changed in the non-DEBUG_LIST build.
> >
> > ...
> >
> > +
> > +/**
> > + * list_empty - tests whether a list is empty
> > + * @head: the list to test.
> > + */
> > +int list_empty(const struct list_head *head)
> > +{
> > + if ((head->prev == LIST_POISON2) || (head->prev == LIST_POISON1))
> > + WARN(1, "list_empty performed on a node "
> > + "at %p removed from a list.\n", head);
> > + else
> > + WARN((head->prev == head) != (head->next == head),
> > + "list_empty corruption. %p<-%p->%p is half-empty.\n",
> > + head->prev, head, head->next);
> > +
> > + return head->next == head;
> > +}
> > +EXPORT_SYMBOL(list_empty);
>
> The second warning here is triggering maybe a hundred times from all
> over the place just when booting the kernel.
>
> Here's the first two:
>
>
> [ 64.295941] WARNING: at lib/list_debug.c:89 list_empty+0x79/0x85()
> [ 64.296129] list_empty corruption. ffff880255bcb788<-ffff880255bcb788->ffff88024c3a3c20 is half-empty.
It looks like a race between __list_del() and list_empty().
--
Kirill A. Shutemov