2011-06-01 10:44:50

by Igor Mammedov

Subject: [PATCH] memcg: do not expose uninitialized mem_cgroup_per_node to world

Freshly allocated 'mem_cgroup_per_node' list entries must be
initialized before the rest of the kernel can see them. Otherwise
zero-initialized list fields can lead to a race condition in
mem_cgroup_force_empty_list:
pc = list_entry(list->prev, struct page_cgroup, lru);
where 'pc' will be something like 0xfffffffc if list->prev is 0,
causing a page fault later when 'pc' is dereferenced.

Signed-off-by: Igor Mammedov <[email protected]>
---
mm/memcontrol.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bd9052a..ee7cb4c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4707,7 +4707,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 	if (!pn)
 		return 1;

-	mem->info.nodeinfo[node] = pn;
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
 		mz = &pn->zoneinfo[zone];
 		for_each_lru(l)
@@ -4716,6 +4715,7 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 		mz->on_tree = false;
 		mz->mem = mem;
 	}
+	mem->info.nodeinfo[node] = pn;
 	return 0;
 }

--
1.7.1
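
[For readers following along: a minimal C sketch of the allocation path as it
looks after this patch, reconstructed from the two hunks above and from the
fuller listing quoted later in the thread. The allocation call and the elided
error handling are assumptions, not the verbatim kernel source; the point is
only the ordering: fully initialize the per-zone lists first, publish the node
pointer last.]

static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
{
	struct mem_cgroup_per_node *pn;
	struct mem_cgroup_per_zone *mz;
	enum lru_list l;
	int zone;

	/* Allocation details are a sketch; the debug listing later in the
	 * thread shows a memset(pn, 0, sizeof(*pn)) after the allocation. */
	pn = kmalloc_node(sizeof(*pn), GFP_KERNEL, node);
	if (!pn)
		return 1;
	memset(pn, 0, sizeof(*pn));

	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
		mz = &pn->zoneinfo[zone];
		for_each_lru(l)
			INIT_LIST_HEAD(&mz->lists[l]);
		mz->usage_in_excess = 0;
		mz->on_tree = false;
		mz->mem = mem;
	}
	/* Publish only after every list head is initialized, so other CPUs
	 * can never observe a zeroed list_head through nodeinfo. */
	mem->info.nodeinfo[node] = pn;
	return 0;
}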


2011-06-01 12:39:20

by Michal Hocko

Subject: Re: [PATCH] memcg: do not expose uninitialized mem_cgroup_per_node to world

On Wed 01-06-11 12:44:04, Igor Mammedov wrote:
> Freshly allocated 'mem_cgroup_per_node' list entries must be
> initialized before the rest of the kernel can see them. Otherwise
> zero initialized list fields can lead to race condition at
> mem_cgroup_force_empty_list:
> pc = list_entry(list->prev, struct page_cgroup, lru);
> where 'pc' will be something like 0xfffffffc if list->prev is 0
> and cause page fault later when 'pc' is dereferenced.

Have you ever seen such a race? I do not see how this could happen.
mem_cgroup_force_empty_list is called only from
mem_cgroup_force_empty_write (aka echo whatever > group/force_empty)
or mem_cgroup_pre_destroy when the group is destroyed.

The initialization code is, however, called before a group is
given out for use AFAICS.

I am not saying that the change is bad, I like it, but I do not think it
is a fix for a potential race condition.

>
> Signed-off-by: Igor Mammedov <[email protected]>
> ---
> mm/memcontrol.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bd9052a..ee7cb4c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4707,7 +4707,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
> if (!pn)
> return 1;
>
> - mem->info.nodeinfo[node] = pn;
> for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> mz = &pn->zoneinfo[zone];
> for_each_lru(l)
> @@ -4716,6 +4715,7 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
> mz->on_tree = false;
> mz->mem = mem;
> }
> + mem->info.nodeinfo[node] = pn;
> return 0;
> }
>
> --
> 1.7.1
>

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-01 13:08:00

by Igor Mammedov

Subject: Re: [PATCH] memcg: do not expose uninitialized mem_cgroup_per_node to world

On 06/01/2011 02:39 PM, Michal Hocko wrote:
> On Wed 01-06-11 12:44:04, Igor Mammedov wrote:
>> Freshly allocated 'mem_cgroup_per_node' list entries must be
>> initialized before the rest of the kernel can see them. Otherwise
>> zero initialized list fields can lead to race condition at
>> mem_cgroup_force_empty_list:
>> pc = list_entry(list->prev, struct page_cgroup, lru);
>> where 'pc' will be something like 0xfffffffc if list->prev is 0
>> and cause page fault later when 'pc' is dereferenced.
> Have you ever seen such a race? I do not see how this could happen.
> mem_cgroup_force_empty_list is called only from
> mem_cgroup_force_empty_write (aka echo whatever> group/force_empty)
> or mem_cgroup_pre_destroy when the group is destroyed.
>
> The initialization code is, however, called before a group is
> given for use AFICS.
>
> I am not saying tha the change is bad, I like it, but I do not think it
> is a fix for potential race condition.
>

Yes, I've seen it (RHBZ#700565). It causes random crashes
in a virt env occasionally. It's easier to reproduce if you overcommit
cpu.

>> Signed-off-by: Igor Mammedov<[email protected]>
>> ---
>> mm/memcontrol.c | 2 +-
>> 1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index bd9052a..ee7cb4c 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -4707,7 +4707,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
>> if (!pn)
>> return 1;
>>
>> - mem->info.nodeinfo[node] = pn;
>> for (zone = 0; zone< MAX_NR_ZONES; zone++) {
>> mz =&pn->zoneinfo[zone];
>> for_each_lru(l)
>> @@ -4716,6 +4715,7 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
>> mz->on_tree = false;
>> mz->mem = mem;
>> }
>> + mem->info.nodeinfo[node] = pn;
>> return 0;
>> }
>>
>> --
>> 1.7.1
>>

2011-06-01 13:41:58

by Michal Hocko

Subject: Re: [PATCH] memcg: do not expose uninitialized mem_cgroup_per_node to world

[Let's CC some cgroup people]

On Wed 01-06-11 15:07:40, Igor Mammedov wrote:
> On 06/01/2011 02:39 PM, Michal Hocko wrote:
> >On Wed 01-06-11 12:44:04, Igor Mammedov wrote:
> >>Freshly allocated 'mem_cgroup_per_node' list entries must be
> >>initialized before the rest of the kernel can see them. Otherwise
> >>zero initialized list fields can lead to race condition at
> >>mem_cgroup_force_empty_list:
> >> pc = list_entry(list->prev, struct page_cgroup, lru);
> >>where 'pc' will be something like 0xfffffffc if list->prev is 0
> >>and cause page fault later when 'pc' is dereferenced.
> >
> >Have you ever seen such a race? I do not see how this could happen.
> >mem_cgroup_force_empty_list is called only from
> >mem_cgroup_force_empty_write (aka echo whatever> group/force_empty)
> >or mem_cgroup_pre_destroy when the group is destroyed.
> >
> >The initialization code is, however, called before a group is
> >given for use AFAICS.
> >
> >I am not saying tha the change is bad, I like it, but I do not think it
> >is a fix for potential race condition.
> >
>
> Yes I've seen it (RHBZ#700565).

I am not subscribed so I will not get there.

> It causes random crashes in virt env ocasionally. It's easier to
> reproduce if you overcommit cpu.

Hmm, I have missed that cgroup_create calls the create callback for all
subsystems that the group should be attached to. But this shouldn't be
a problem, as cgroup_call_pre_destroy (called from cgroup_rmdir) cannot
be called before the directory is actually created.
cgroup_create_dir is called after the cgroup has been created for all
subsystems, so the directory entry will show up only after everything is
initialized AFAICS. So I still do not see how we can race.

Anyway, if you are able to reproduce the problem then I would say that
the problem is somewhere in the generic cgroup code rather than in the
memory controller. The patch just papers over a real bug IMO. But I may
be wrong, as I am definitely not an expert in the area.

>
> >>Signed-off-by: Igor Mammedov<[email protected]>
> >>---
> >> mm/memcontrol.c | 2 +-
> >> 1 files changed, 1 insertions(+), 1 deletions(-)
> >>
> >>diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>index bd9052a..ee7cb4c 100644
> >>--- a/mm/memcontrol.c
> >>+++ b/mm/memcontrol.c
> >>@@ -4707,7 +4707,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
> >> if (!pn)
> >> return 1;
> >>
> >>- mem->info.nodeinfo[node] = pn;
> >> for (zone = 0; zone< MAX_NR_ZONES; zone++) {
> >> mz =&pn->zoneinfo[zone];
> >> for_each_lru(l)
> >>@@ -4716,6 +4715,7 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
> >> mz->on_tree = false;
> >> mz->mem = mem;
> >> }
> >>+ mem->info.nodeinfo[node] = pn;
> >> return 0;
> >> }
> >>
> >>--
> >>1.7.1

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-01 13:50:05

by Igor Mammedov

Subject: Re: [PATCH] memcg: do not expose uninitialized mem_cgroup_per_node to world

On 06/01/2011 02:39 PM, Michal Hocko wrote:
> I am not saying tha the change is bad, I like it, but I do not think it
> is a fix for potential race condition.
And yes, I agree that it is a workaround rather than a real fix of a race
condition which shouldn't exist in the first place. But given my very
limited knowledge of cgroups and the difficulty of reproducing the crash
after adding/enabling additional debugging, that patch is what
can fix the problem for now.
And maybe more experienced guys will look at the code and fix it
in the right way. Well, at least I hope for that.

2011-06-01 14:39:42

by Igor Mammedov

Subject: Re: [PATCH] memcg: do not expose uninitialized mem_cgroup_per_node to world

On 06/01/2011 03:41 PM, Michal Hocko wrote:
> [Let's CC some cgroup people]
>
> On Wed 01-06-11 15:07:40, Igor Mammedov wrote:
>> Yes I've seen it (RHBZ#700565).
> I am not subscribed so I will not get there.
>
Sorry, I hadn't realized that the BZ wasn't public; just fixed it.
It is public now.

OOPS backtrace looks like this:

Stopping cgconfig service: BUG: unable to handle kernel paging request at fffffffc
IP: [<c05235b3>] mem_cgroup_force_empty+0x123/0x4a0
*pdpt = 00000000016a0001 *pde = 000000000000a067 *pte = 0000000000000000
Oops: 0000 [#1] SMP
last sysfs file: /sys/module/nf_conntrack/refcnt
Modules linked in: xt_CHECKSUM tun bridge stp llc autofs4 sunrpc ipt_REJECT ip6t_REJECT ipv6 dm_mirror dm_region_hash dm_log uinput microcode xen_netfront sg i2c_piix4 i2c_core ext4 mbcache jbd2 sr_mod cdrom xen_blkfront ata_generic pata_acpi ata_piix dm_mod [last unloaded: nf_conntrack]

Pid: 2300, comm: cgclear Not tainted (2.6.32-131.0.10.el6.i686 #1) HVM domU
EIP: 0060:[<c05235b3>] EFLAGS: 00010206 CPU: 0
EIP is at mem_cgroup_force_empty+0x123/0x4a0
EAX: 00000206 EBX: fffffff4 ECX: c0a3f1e0 EDX: 00000206
ESI: 00000206 EDI: 00000000 EBP: f343ca00 ESP: f34e7e84
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process cgclear (pid: 2300, ti=f34e6000 task=f35baab0 task.ti=f34e6000)
Stack:
ffffffff 00000001 00000000 f34e7eb8 00000100 00000000 c0a3f1e0 c05a5af5
<0> f343ca00 f343cc00 00000000 00000000 00000000 00000000 00000000 00000000
<0> c0a3e7c0 f35172c0 00000005 00000000 f35172d0 f35baab0 00000000 f35172c0
Call Trace:
[<c05a5af5>] ? may_link+0xc5/0x130
[<c049b246>] ? cgroup_rmdir+0x96/0x3f0
[<c0473f20>] ? autoremove_wake_function+0x0/0x40
[<c05339ce>] ? vfs_rmdir+0x9e/0xd0
[<c0536806>] ? do_rmdir+0xc6/0xe0
[<c0506702>] ? do_munmap+0x1f2/0x2c0
[<c04adecc>] ? audit_syscall_entry+0x21c/0x240
[<c04adbe6>] ? audit_syscall_exit+0x216/0x240
[<c0409adf>] ? sysenter_do_call+0x12/0x28
Code: 89 7c 24 20 8b 54 95 08 31 ff 89 44 24 14 81 c2 00 01 00 00 89 54 24 10 e9 83 00 00 00 8d 76 00 8b 44 24 18 89 f2 e8 7d 12 30 00<8b> 73 08 8b 7c 24 24 8b 07 8b 40 18 85 c0 89 44 24 08 0f 84 c5
EIP: [<c05235b3>] mem_cgroup_force_empty+0x123/0x4a0 SS:ESP 0068:f34e7e84
CR2: 00000000fffffffc
---[ end trace 5700dda23f74f94a ]---

2011-06-01 15:20:45

by Michal Hocko

Subject: Re: [PATCH] memcg: do not expose uninitialized mem_cgroup_per_node to world

On Wed 01-06-11 16:39:08, Igor Mammedov wrote:
> On 06/01/2011 03:41 PM, Michal Hocko wrote:
> >[Let's CC some cgroup people]
> >
> >On Wed 01-06-11 15:07:40, Igor Mammedov wrote:
> >>Yes I've seen it (RHBZ#700565).
> >I am not subscribed so I will not get there.
> >
> Sorry, I've not realized that BZ wasn't public, just fixed it.
> It is public now.
>
> OOPS backtrace looks like this:
>
> Stopping cgconfig service: BUG: unable to handle kernel paging request at fffffffc

Looks like a pointer underflow from NULL.

> IP: [<c05235b3>] mem_cgroup_force_empty+0x123/0x4a0
> *pdpt = 00000000016a0001 *pde = 000000000000a067 *pte = 0000000000000000
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/module/nf_conntrack/refcnt
> Modules linked in: xt_CHECKSUM tun bridge stp llc autofs4 sunrpc ipt_REJECT ip6t_REJECT ipv6 dm_mirror dm_region_hash dm_log uinput microcode xen_netfront sg i2c_piix4 i2c_core ext4 mbcache jbd2 sr_mod cdrom xen_blkfront ata_generic pata_acpi ata_piix dm_mod [last unloaded: nf_conntrack]
>
> Pid: 2300, comm: cgclear Not tainted (2.6.32-131.0.10.el6.i686 #1) HVM domU

I realize that the issue is hard to reproduce but have you tried it with
the .39 vanilla?
Or at least try it with fce66477 (just a blind shot).

> EIP: 0060:[<c05235b3>] EFLAGS: 00010206 CPU: 0
> EIP is at mem_cgroup_force_empty+0x123/0x4a0
> EAX: 00000206 EBX: fffffff4 ECX: c0a3f1e0 EDX: 00000206
> ESI: 00000206 EDI: 00000000 EBP: f343ca00 ESP: f34e7e84
> DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> Process cgclear (pid: 2300, ti=f34e6000 task=f35baab0 task.ti=f34e6000)
> Stack:
> ffffffff 00000001 00000000 f34e7eb8 00000100 00000000 c0a3f1e0 c05a5af5
> <0> f343ca00 f343cc00 00000000 00000000 00000000 00000000 00000000 00000000
> <0> c0a3e7c0 f35172c0 00000005 00000000 f35172d0 f35baab0 00000000 f35172c0
> Call Trace:
> [<c05a5af5>] ? may_link+0xc5/0x130
> [<c049b246>] ? cgroup_rmdir+0x96/0x3f0
> [<c0473f20>] ? autoremove_wake_function+0x0/0x40
> [<c05339ce>] ? vfs_rmdir+0x9e/0xd0
> [<c0536806>] ? do_rmdir+0xc6/0xe0
> [<c0506702>] ? do_munmap+0x1f2/0x2c0
> [<c04adecc>] ? audit_syscall_entry+0x21c/0x240
> [<c04adbe6>] ? audit_syscall_exit+0x216/0x240
> [<c0409adf>] ? sysenter_do_call+0x12/0x28
> Code: 89 7c 24 20 8b 54 95 08 31 ff 89 44 24 14 81 c2 00 01 00 00 89 54 24 10 e9 83 00 00 00 8d 76 00 8b 44 24 18 89 f2 e8 7d 12 30 00<8b> 73 08 8b 7c 24 24 8b 07 8b 40 18 85 c0 89 44 24 08 0f 84 c5
> EIP: [<c05235b3>] mem_cgroup_force_empty+0x123/0x4a0 SS:ESP 0068:f34e7e84
> CR2: 00000000fffffffc
Code: 89 7c 24 20 8b 54 95 08 31 ff 89 44 24 14 81 c2 00 01 00 00 89 54 24 10 e9 83 00 00 00 8d 76 00 8b 44 24 18 89 f2 e8 7d 12 30 00 <8b> 73 08 8b 7c 24 24 8b 07 8b 40 18 85 c0 89 44 24 08 0f 84 c5
All code
========
0: 89 7c 24 20 mov %edi,0x20(%esp)
4: 8b 54 95 08 mov 0x8(%ebp,%edx,4),%edx
8: 31 ff xor %edi,%edi
a: 89 44 24 14 mov %eax,0x14(%esp)
e: 81 c2 00 01 00 00 add $0x100,%edx
14: 89 54 24 10 mov %edx,0x10(%esp)
18: e9 83 00 00 00 jmp 0xa0
1d: 8d 76 00 lea 0x0(%esi),%esi
20: 8b 44 24 18 mov 0x18(%esp),%eax
24: 89 f2 mov %esi,%edx
26: e8 7d 12 30 00 call 0x3012a8
2b:* 8b 73 08 mov 0x8(%ebx),%esi <-- trapping instruction
2e: 8b 7c 24 24 mov 0x24(%esp),%edi
32: 8b 07 mov (%edi),%eax
34: 8b 40 18 mov 0x18(%eax),%eax
37: 85 c0 test %eax,%eax
39: 89 44 24 08 mov %eax,0x8(%esp)
3d: 0f .byte 0xf
3e: 84 c5 test %al,%ch

Could you send your config file? I cannot find out much from this.
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-01 16:46:30

by Igor Mammedov

Subject: Re: [PATCH] memcg: do not expose uninitialized mem_cgroup_per_node to world

On 06/01/2011 05:20 PM, Michal Hocko wrote:
> On Wed 01-06-11 16:39:08, Igor Mammedov wrote:
>> On 06/01/2011 03:41 PM, Michal Hocko wrote:
>>> [Let's CC some cgroup people]
>>>
>>> On Wed 01-06-11 15:07:40, Igor Mammedov wrote:
>>>> Yes I've seen it (RHBZ#700565).
>>> I am not subscribed so I will not get there.
>>>
>> Sorry, I've not realized that BZ wasn't public, just fixed it.
>> It is public now.
>>
>> OOPS backtrace looks like this:
>>
>> Stopping cgconfig service: BUG: unable to handle kernel paging request at fffffffc
> Looks like one pointer underflow from NULL.
>
Yep, it is, because the list entry fields aren't initialized yet.

>> IP: [<c05235b3>] mem_cgroup_force_empty+0x123/0x4a0
>> *pdpt = 00000000016a0001 *pde = 000000000000a067 *pte = 0000000000000000
>> Oops: 0000 [#1] SMP
>> last sysfs file: /sys/module/nf_conntrack/refcnt
>> Modules linked in: xt_CHECKSUM tun bridge stp llc autofs4 sunrpc ipt_REJECT ip6t_REJECT ipv6 dm_mirror dm_region_hash dm_log uinput microcode xen_netfront sg i2c_piix4 i2c_core ext4 mbcache jbd2 sr_mod cdrom xen_blkfront ata_generic pata_acpi ata_piix dm_mod [last unloaded: nf_conntrack]
>>
>> Pid: 2300, comm: cgclear Not tainted (2.6.32-131.0.10.el6.i686 #1) HVM domU
> I realize that the issue is hard to reproduce but have you tried it with
> the .39 vanilla?

Not yet.
But vanilla has the same code and down the road may dereference an
invalid 'pc' pointer too. That's why I posted the patch here: theoretically
the OOPS could happen there as well.

> Or at least try it with fce66477 (just a blind shot).
We have this commit. It's probably another problem.

The trouble manifests itself here:

mem_cgroup_force_empty_list:
pc = list_entry(list->prev, struct page_cgroup, lru);
when list->prev == 0 we have invalid 'pc' pointer => crash


>> EIP: 0060:[<c05235b3>] EFLAGS: 00010206 CPU: 0
>> EIP is at mem_cgroup_force_empty+0x123/0x4a0
>> EAX: 00000206 EBX: fffffff4 ECX: c0a3f1e0 EDX: 00000206
>> ESI: 00000206 EDI: 00000000 EBP: f343ca00 ESP: f34e7e84
>> DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
>> Process cgclear (pid: 2300, ti=f34e6000 task=f35baab0 task.ti=f34e6000)
>> Stack:
>> ffffffff 00000001 00000000 f34e7eb8 00000100 00000000 c0a3f1e0 c05a5af5
>> <0> f343ca00 f343cc00 00000000 00000000 00000000 00000000 00000000 00000000
>> <0> c0a3e7c0 f35172c0 00000005 00000000 f35172d0 f35baab0 00000000 f35172c0
>> Call Trace:
>> [<c05a5af5>] ? may_link+0xc5/0x130
>> [<c049b246>] ? cgroup_rmdir+0x96/0x3f0
>> [<c0473f20>] ? autoremove_wake_function+0x0/0x40
>> [<c05339ce>] ? vfs_rmdir+0x9e/0xd0
>> [<c0536806>] ? do_rmdir+0xc6/0xe0
>> [<c0506702>] ? do_munmap+0x1f2/0x2c0
>> [<c04adecc>] ? audit_syscall_entry+0x21c/0x240
>> [<c04adbe6>] ? audit_syscall_exit+0x216/0x240
>> [<c0409adf>] ? sysenter_do_call+0x12/0x28
>> Code: 89 7c 24 20 8b 54 95 08 31 ff 89 44 24 14 81 c2 00 01 00 00 89 54 24 10 e9 83 00 00 00 8d 76 00 8b 44 24 18 89 f2 e8 7d 12 30 00<8b> 73 08 8b 7c 24 24 8b 07 8b 40 18 85 c0 89 44 24 08 0f 84 c5
>> EIP: [<c05235b3>] mem_cgroup_force_empty+0x123/0x4a0 SS:ESP 0068:f34e7e84
>> CR2: 00000000fffffffc
> Code: 89 7c 24 20 8b 54 95 08 31 ff 89 44 24 14 81 c2 00 01 00 00 89 54 24 10 e9 83 00 00 00 8d 76 00 8b 44 24 18 89 f2 e8 7d 12 30 00<8b> 73 08 8b 7c 24 24 8b 07 8b 40 18 85 c0 89 44 24 08 0f 84 c5
> All code
> ========
> 0: 89 7c 24 20 mov %edi,0x20(%esp)
> 4: 8b 54 95 08 mov 0x8(%ebp,%edx,4),%edx
> 8: 31 ff xor %edi,%edi
> a: 89 44 24 14 mov %eax,0x14(%esp)
> e: 81 c2 00 01 00 00 add $0x100,%edx
> 14: 89 54 24 10 mov %edx,0x10(%esp)
> 18: e9 83 00 00 00 jmp 0xa0
> 1d: 8d 76 00 lea 0x0(%esi),%esi
> 20: 8b 44 24 18 mov 0x18(%esp),%eax
> 24: 89 f2 mov %esi,%edx
> 26: e8 7d 12 30 00 call 0x3012a8
> 2b:* 8b 73 08 mov 0x8(%ebx),%esi<-- trapping instruction
> 2e: 8b 7c 24 24 mov 0x24(%esp),%edi
> 32: 8b 07 mov (%edi),%eax
> 34: 8b 40 18 mov 0x18(%eax),%eax
> 37: 85 c0 test %eax,%eax
> 39: 89 44 24 08 mov %eax,0x8(%esp)
> 3d: 0f .byte 0xf
> 3e: 84 c5 test %al,%ch
>
> Could you send your config file. I cannot find out much from this.

Attached. It's the released RHEL6.1 kernel, although the problem still exists
in recent versions as well.


Attachments:
config-2.6.32-131.0.15.el6.i686 (102.50 kB)

2011-06-01 23:10:07

by Hiroyuki Kamezawa

Subject: Re: [PATCH] memcg: do not expose uninitialized mem_cgroup_per_node to world

>pc = list_entry(list->prev, struct page_cgroup, lru);

Hmm, I disagree that your patch is a fix for mainline. At least, a cgroup
before completion of create() is not populated to userland, and you will
never be able to rmdir() it because you can't find it.


>26: e8 7d 12 30 00 call 0x3012a8
>2b:* 8b 73 08 mov 0x8(%ebx),%esi <-- trapping instruction
>2e: 8b 7c 24 24 mov 0x24(%esp),%edi
>32: 8b 07 mov (%edi),%eax

Hm, what is the call 0x3012a8 ?

Thanks,
-Kame

2011-06-03 12:35:59

by Igor Mammedov

Subject: Re: [PATCH] memcg: do not expose uninitialized mem_cgroup_per_node to world

On 06/02/2011 01:10 AM, Hiroyuki Kamezawa wrote:
>> pc = list_entry(list->prev, struct page_cgroup, lru);
> Hmm, I disagree your patch is a fix for mainline. At least, a cgroup
> before completion of
> create() is not populated to userland and you never be able to rmdir()
> it because you can't
> find it.
>
>
> >26: e8 7d 12 30 00 call 0x3012a8
> >2b:* 8b 73 08 mov 0x8(%ebx),%esi<-- trapping
> instruction
> >2e: 8b 7c 24 24 mov 0x24(%esp),%edi
> >32: 8b 07 mov (%edi),%eax
>
> Hm, what is the call 0x3012a8 ?
>
        pc = list_entry(list->prev, struct page_cgroup, lru);
        if (busy == pc) {
                list_move(&pc->lru, list);
                busy = 0;
                spin_unlock_irqrestore(&zone->lru_lock, flags);
                continue;
        }
        spin_unlock_irqrestore(&zone->lru_lock, flags);   <---- is call 0x3012a8
        ret = mem_cgroup_move_parent(pc, mem, GFP_KERNEL);

and mov 0x8(%ebx),%esi
is the dereference of 'pc' in the inlined mem_cgroup_move_parent
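
Incidentally, the numbers line up if one assumes the 2.6.32-era layout of
struct page_cgroup below (a reconstruction, not a quote of the running
kernel's headers):

/* Assumed 32-bit layout of struct page_cgroup (2.6.32 era):
 *
 *	unsigned long		flags;		offset 0x0
 *	struct mem_cgroup	*mem_cgroup;	offset 0x4
 *	struct page		*page;		offset 0x8
 *	struct list_head	lru;		offset 0xc
 *
 * list_entry(NULL, struct page_cgroup, lru) == (void *)-0xc == 0xfffffff4,
 * and reading pc->page is then "mov 0x8(%ebx),%esi" with %ebx == 0xfffffff4,
 * i.e. a load from 0xfffffffc -- exactly the CR2 reported in the oops.
 */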

I've looked at the vmcore once more and indeed there isn't any parallel task
that touches the cgroups code path.
I will investigate whether xen is to blame for the incorrect data in place.

Thanks very much for your opinion.
> Thanks,
> -Kame

2011-06-03 13:00:50

by Hiroyuki Kamezawa

Subject: Re: [PATCH] memcg: do not expose uninitialized mem_cgroup_per_node to world

2011/6/3 Igor Mammedov <[email protected]>:
> On 06/02/2011 01:10 AM, Hiroyuki Kamezawa wrote:
>>>
>>> pc = list_entry(list->prev, struct page_cgroup, lru);
>>
>> Hmm, I disagree your patch is a fix for mainline. At least, a cgroup
>> before completion of
>> create() is not populated to userland and you never be able to rmdir()
>> it because you can't
>> find it.
>>
>>
>> >26: e8 7d 12 30 00 call 0x3012a8
>> >2b:* 8b 73 08 mov 0x8(%ebx),%esi <-- trapping instruction
>> >2e: 8b 7c 24 24 mov 0x24(%esp),%edi
>> >32: 8b 07 mov (%edi),%eax
>>
>> Hm, what is the call 0x3012a8 ?
>>
>         pc = list_entry(list->prev, struct page_cgroup, lru);
>         if (busy == pc) {
>                 list_move(&pc->lru, list);
>                 busy = 0;
>                 spin_unlock_irqrestore(&zone->lru_lock, flags);
>                 continue;
>         }
>         spin_unlock_irqrestore(&zone->lru_lock, flags); <---- is call 0x3012a8
>         ret = mem_cgroup_move_parent(pc, mem, GFP_KERNEL);
>
> and mov 0x8(%ebx),%esi
> is dereferencing of 'pc' in inlined mem_cgroup_move_parent
>
>
Ah, thank you for the input.. so it panicked at accessing pc->page, and "pc"
was 0xfffffff4.
It means list->prev was NULL.

> I've looked at vmcore once more and indeed there isn't any parallel task
> that touches cgroups code path.
> Will investigate if it is xen to blame for incorrect data in place.
>
> Thanks very much for your opinion.

What is curious to me is the fact that "list->prev" is NULL.
I can see why you doubt the initialization code.... the list pointer never
contains NULL once it's in use....
It smells like memory corruption or something to me. If you have a vmcore,
what does the problematic mem_cgroup_per_zone(node) contain?

Thanks,
-Kame

2011-06-07 13:26:31

by Igor Mammedov

Subject: Re: [PATCH] memcg: do not expose uninitialized mem_cgroup_per_node to world

Sorry for the late reply,

On 06/03/2011 03:00 PM, Hiroyuki Kamezawa wrote:
> 2011/6/3 Igor Mammedov<[email protected]>:
>> On 06/02/2011 01:10 AM, Hiroyuki Kamezawa wrote:
>>>> pc = list_entry(list->prev, struct page_cgroup, lru);
>>> Hmm, I disagree your patch is a fix for mainline. At least, a cgroup
>>> before completion of
>>> create() is not populated to userland and you never be able to rmdir()
>>> it because you can't
>>> find it.
>>>
>>>
>>> >26: e8 7d 12 30 00 call 0x3012a8
>>> >2b:* 8b 73 08 mov 0x8(%ebx),%esi<-- trapping
>>> instruction
>>> >2e: 8b 7c 24 24 mov 0x24(%esp),%edi
>>> >32: 8b 07 mov (%edi),%eax
>>>
>>> Hm, what is the call 0x3012a8 ?
>>>
>> pc = list_entry(list->prev, struct page_cgroup, lru);
>> if (busy == pc) {
>> list_move(&pc->lru, list);
>> busy = 0;
>> spin_unlock_irqrestore(&zone->lru_lock, flags);
>> continue;
>> }
>> spin_unlock_irqrestore(&zone->lru_lock, flags);<---- is
>> call 0x3012a8
>> ret = mem_cgroup_move_parent(pc, mem, GFP_KERNEL);
>>
>> and mov 0x8(%ebx),%esi
>> is dereferencing of 'pc' in inlined mem_cgroup_move_parent
>>
> Ah, thank you for input..then panicd at accessing pc->page and "pc"
> was 0xfffffff4.
> it means list->prev was NULL.
>
yes, that's the case.
>> I've looked at vmcore once more and indeed there isn't any parallel task
>> that touches cgroups code path.
>> Will investigate if it is xen to blame for incorrect data in place.
>>
>> Thanks very much for your opinion.
> What curious to me is that the fact "list->prev" is NULL.
> I can see why you doubt the initialization code ....the list pointer never
> contains NULL once it's used....
> it smells like memory corruption or some to me. If you have vmcore,
> what the problematic mem_cgroup_per_zone(node) contains ?

It has all zeros except for the last field:

crash> rd f3446a00 62
f3446a00: 00000000 00000000 00000000 00000000 ................
f3446a10: 00000000 00000000 00000000 00000000 ................
f3446a20: 00000000 00000000 00000000 00000000 ................
f3446a30: 00000000 00000000 00000000 00000000 ................
f3446a40: 00000000 00000000 00000000 00000000 ................
f3446a50: 00000000 00000000 00000000 00000000 ................
f3446a60: 00000000 00000000 00000000 00000000 ................
f3446a70: 00000000 00000000 f36ef800 f3446a7c ..........n.|jD.
f3446a80: f3446a7c f3446a84 f3446a84 f3446a8c |jD..jD..jD..jD.
f3446a90: f3446a8c f3446a94 f3446a94 f3446a9c .jD..jD..jD..jD.
f3446aa0: f3446a9c 00000000 00000000 00000000 .jD.............
f3446ab0: 00000000 00000000 00000000 00000000 ................
f3446ac0: 00000000 00000000 00000000 00000000 ................
f3446ad0: 00000000 00000000 00000000 00000000 ................
f3446ae0: 00000000 00000000 00000000 00000000 ................
f3446af0: 00000000 f36ef800

crash> struct mem_cgroup f36ef800
struct mem_cgroup {
...
info = {
nodeinfo = {0xf3446a00}
},
...

It looks like a very targeted corruption of the first zone except for
the last field, while the second zone and the rest are perfectly
normal (i.e. have empty initialized lists).


PS:
It is most easily reproduced on a xen hvm 32bit guest under heavy
vcpu contention for real cpu resources (i.e. I had to overcommit
cpus and run several cpu hog tasks on the host to make the guest
crash on a reboot cycle).
And from the last experiments, the crash happens only on hosts that
don't have the hap feature, or if hap is disabled in the hypervisor.

> Thanks,
> -Kame

2011-06-08 03:42:29

by Kamezawa Hiroyuki

Subject: Re: [PATCH] memcg: do not expose uninitialized mem_cgroup_per_node to world

On Tue, 07 Jun 2011 15:25:59 +0200
Igor Mammedov <[email protected]> wrote:

> Sorry for late reply,
>
> On 06/03/2011 03:00 PM, Hiroyuki Kamezawa wrote:
> > 2011/6/3 Igor Mammedov<[email protected]>:
> >> On 06/02/2011 01:10 AM, Hiroyuki Kamezawa wrote:
> >>>> pc = list_entry(list->prev, struct page_cgroup, lru);
> >>> Hmm, I disagree your patch is a fix for mainline. At least, a cgroup
> >>> before completion of
> >>> create() is not populated to userland and you never be able to rmdir()
> >>> it because you can't
> >>> find it.
> >>>
> >>>
> >>> >26: e8 7d 12 30 00 call 0x3012a8
> >>> >2b:* 8b 73 08 mov 0x8(%ebx),%esi<-- trapping
> >>> instruction
> >>> >2e: 8b 7c 24 24 mov 0x24(%esp),%edi
> >>> >32: 8b 07 mov (%edi),%eax
> >>>
> >>> Hm, what is the call 0x3012a8 ?
> >>>
> >> pc = list_entry(list->prev, struct page_cgroup, lru);
> >> if (busy == pc) {
> >> list_move(&pc->lru, list);
> >> busy = 0;
> >> spin_unlock_irqrestore(&zone->lru_lock, flags);
> >> continue;
> >> }
> >> spin_unlock_irqrestore(&zone->lru_lock, flags);<---- is
> >> call 0x3012a8
> >> ret = mem_cgroup_move_parent(pc, mem, GFP_KERNEL);
> >>
> >> and mov 0x8(%ebx),%esi
> >> is dereferencing of 'pc' in inlined mem_cgroup_move_parent
> >>
> > Ah, thank you for input..then panicd at accessing pc->page and "pc"
> > was 0xfffffff4.
> > it means list->prev was NULL.
> >
> yes, that's the case.
> >> I've looked at vmcore once more and indeed there isn't any parallel task
> >> that touches cgroups code path.
> >> Will investigate if it is xen to blame for incorrect data in place.
> >>
> >> Thanks very much for your opinion.
> > What curious to me is that the fact "list->prev" is NULL.
> > I can see why you doubt the initialization code ....the list pointer never
> > contains NULL once it's used....
> > it smells like memory corruption or some to me. If you have vmcore,
> > what the problematic mem_cgroup_per_zone(node) contains ?
>
> it has all zeros except for last field:
>
> crash> rd f3446a00 62
> f3446a00: 00000000 00000000 00000000 00000000 ................
> f3446a10: 00000000 00000000 00000000 00000000 ................
> f3446a20: 00000000 00000000 00000000 00000000 ................
> f3446a30: 00000000 00000000 00000000 00000000 ................
> f3446a40: 00000000 00000000 00000000 00000000 ................
> f3446a50: 00000000 00000000 00000000 00000000 ................
> f3446a60: 00000000 00000000 00000000 00000000 ................
> f3446a70: 00000000 00000000 f36ef800 f3446a7c ..........n.|jD.
> f3446a80: f3446a7c f3446a84 f3446a84 f3446a8c |jD..jD..jD..jD.
> f3446a90: f3446a8c f3446a94 f3446a94 f3446a9c .jD..jD..jD..jD.
> f3446aa0: f3446a9c 00000000 00000000 00000000 .jD.............
> f3446ab0: 00000000 00000000 00000000 00000000 ................
> f3446ac0: 00000000 00000000 00000000 00000000 ................
> f3446ad0: 00000000 00000000 00000000 00000000 ................
> f3446ae0: 00000000 00000000 00000000 00000000 ................
> f3446af0: 00000000 f36ef800
>
> crash> struct mem_cgroup f36ef800
> struct mem_cgroup {
> ...
> info = {
> nodeinfo = {0xf3446a00}
> },
> ...
>
> It looks like a very targeted corruption of the first zone except of
> the last field, while the second zone and the rest are perfectly
> normal (i.e. have empty initialized lists).
>

Hmm, ok, thank you. Then, mem_cgroup_per_zone[] was initialized once.
In this kind of case, I tend to check the slab header of memory object f3446a00,
or check whether f3446a00 is a live slab object or not.

Thanks,
-Kame
>
> PS:
> It most easily reproduced only on xen hvm 32bit guest under heavy
> vcpus contention for real cpus resources (i.e. I had to overcommit
> cpus and run several cpu hog tasks on host to make guest crash on
> reboot cycle).
> And from last experiments, crash happens only on on hosts that
> doesn't have hap feature or if hap is disabled in hypervisor.
>
> > Thanks,
> > -Kame
>
>

2011-06-08 21:10:45

by Andrew Morton

Subject: Re: [PATCH] memcg: do not expose uninitialized mem_cgroup_per_node to world


The original patch:

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4707,7 +4707,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 	if (!pn)
 		return 1;

-	mem->info.nodeinfo[node] = pn;
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
 		mz = &pn->zoneinfo[zone];
 		for_each_lru(l)
@@ -4716,6 +4715,7 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 		mz->on_tree = false;
 		mz->mem = mem;
 	}
+	mem->info.nodeinfo[node] = pn;
 	return 0;
 }

looks like a really good idea. But it needs a new changelog and I'd be
a bit reluctant to merge it as it appears that the patch removes our
only known way of reproducing a bug.

So for now I think I'll queue the patch up unchangelogged so the issue
doesn't get forgotten about.

2011-06-08 23:51:21

by Kamezawa Hiroyuki

Subject: Re: [PATCH] memcg: do not expose uninitialized mem_cgroup_per_node to world

On Wed, 8 Jun 2011 14:09:51 -0700
Andrew Morton <[email protected]> wrote:

>
> The original patch:
>
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4707,7 +4707,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
> if (!pn)
> return 1;
>
> - mem->info.nodeinfo[node] = pn;
> for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> mz = &pn->zoneinfo[zone];
> for_each_lru(l)
> @@ -4716,6 +4715,7 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
> mz->on_tree = false;
> mz->mem = mem;
> }
> + mem->info.nodeinfo[node] = pn;
> return 0;
> }
>
> looks like a really good idea. But it needs a new changelog and I'd be
> a bit reluctant to merge it as it appears that the aptch removes our
> only known way of reproducing a bug.
>
> So for now I think I'll queue the patch up unchangelogged so the issue
> doesn't get forgotten about.
>

Hmm, queued as a cleanup? If so, I'll Ack.

Thanks,
-Kame

2011-06-09 08:12:33

by Igor Mammedov

Subject: Re: [PATCH] memcg: do not expose uninitialized mem_cgroup_per_node to world

On 06/08/2011 05:35 AM, KAMEZAWA Hiroyuki wrote:
> On Tue, 07 Jun 2011 15:25:59 +0200
> Igor Mammedov<[email protected]> wrote:
>
>> Sorry for late reply,
>>
>> On 06/03/2011 03:00 PM, Hiroyuki Kamezawa wrote:
>>> 2011/6/3 Igor Mammedov<[email protected]>:
>>>> On 06/02/2011 01:10 AM, Hiroyuki Kamezawa wrote:
>>>>>> pc = list_entry(list->prev, struct page_cgroup, lru);
>>>>> Hmm, I disagree your patch is a fix for mainline. At least, a cgroup
>>>>> before completion of
>>>>> create() is not populated to userland and you never be able to rmdir()
>>>>> it because you can't
>>>>> find it.
>>>>>
>>>>>
>>>>> >26: e8 7d 12 30 00 call 0x3012a8
>>>>> >2b:* 8b 73 08 mov 0x8(%ebx),%esi<-- trapping
>>>>> instruction
>>>>> >2e: 8b 7c 24 24 mov 0x24(%esp),%edi
>>>>> >32: 8b 07 mov (%edi),%eax
>>>>>
>>>>> Hm, what is the call 0x3012a8 ?
>>>>>
>>>> pc = list_entry(list->prev, struct page_cgroup, lru);
>>>> if (busy == pc) {
>>>> list_move(&pc->lru, list);
>>>> busy = 0;
>>>> spin_unlock_irqrestore(&zone->lru_lock, flags);
>>>> continue;
>>>> }
>>>> spin_unlock_irqrestore(&zone->lru_lock, flags);<---- is
>>>> call 0x3012a8
>>>> ret = mem_cgroup_move_parent(pc, mem, GFP_KERNEL);
>>>>
>>>> and mov 0x8(%ebx),%esi
>>>> is dereferencing of 'pc' in inlined mem_cgroup_move_parent
>>>>
>>> Ah, thank you for input..then panicd at accessing pc->page and "pc"
>>> was 0xfffffff4.
>>> it means list->prev was NULL.
>>>
>> yes, that's the case.
>>>> I've looked at vmcore once more and indeed there isn't any parallel task
>>>> that touches cgroups code path.
>>>> Will investigate if it is xen to blame for incorrect data in place.
>>>>
>>>> Thanks very much for your opinion.
>>> What curious to me is that the fact "list->prev" is NULL.
>>> I can see why you doubt the initialization code ....the list pointer never
>>> contains NULL once it's used....
>>> it smells like memory corruption or some to me. If you have vmcore,
>>> what the problematic mem_cgroup_per_zone(node) contains ?
>> it has all zeros except for last field:
>>
>> crash> rd f3446a00 62
>> f3446a00: 00000000 00000000 00000000 00000000 ................
>> f3446a10: 00000000 00000000 00000000 00000000 ................
>> f3446a20: 00000000 00000000 00000000 00000000 ................
>> f3446a30: 00000000 00000000 00000000 00000000 ................
>> f3446a40: 00000000 00000000 00000000 00000000 ................
>> f3446a50: 00000000 00000000 00000000 00000000 ................
>> f3446a60: 00000000 00000000 00000000 00000000 ................
>> f3446a70: 00000000 00000000 f36ef800 f3446a7c ..........n.|jD.
>> f3446a80: f3446a7c f3446a84 f3446a84 f3446a8c |jD..jD..jD..jD.
>> f3446a90: f3446a8c f3446a94 f3446a94 f3446a9c .jD..jD..jD..jD.
>> f3446aa0: f3446a9c 00000000 00000000 00000000 .jD.............
>> f3446ab0: 00000000 00000000 00000000 00000000 ................
>> f3446ac0: 00000000 00000000 00000000 00000000 ................
>> f3446ad0: 00000000 00000000 00000000 00000000 ................
>> f3446ae0: 00000000 00000000 00000000 00000000 ................
>> f3446af0: 00000000 f36ef800
>>
>> crash> struct mem_cgroup f36ef800
>> struct mem_cgroup {
>> ...
>> info = {
>> nodeinfo = {0xf3446a00}
>> },
>> ...
>>
>> It looks like a very targeted corruption of the first zone except of
>> the last field, while the second zone and the rest are perfectly
>> normal (i.e. have empty initialized lists).
>>
> Hmm, ok, thank you. Then, mem_cgroup_pre_zone[] was initialized once.
> In this kind of case, I tend to check slab header of memory object f3446a00,
> or check whether f3446a00 is an alive slab object or not.
It looks like f3446a00 is a live/allocated object:

crash> kmem f3446a00
CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE
f7000c80 size-512 512 2251 2616 327 4k
SLAB MEMORY TOTAL ALLOCATED FREE
f3da6540 f3446000 8 1 7
FREE / [ALLOCATED]
[f3446a00]

PAGE PHYSICAL MAPPING INDEX CNT FLAGS
c1fa58c0 33446000 0 70 1 2800080


However, I have a related crash that can lead to uninitialized lists in
the first entry
(i.e. to what we see at f3446a00); the debug kernel sometimes crashes in
alloc_mem_cgroup_per_zone_info:

XXX: pn: f208dc00, phy: 3208dc00
XXX: pn: f2e85a00, phy: 32e85a00
BUG: unable to handle kernel paging request at 9b74e240
IP: [<c080b95f>] mem_cgroup_create+0xef/0x350
*pdpt = 0000000033542001 *pde = 0000000000000000
Oops: 0002 [#1] SMP
...

Pid: 1823, comm: libvirtd Tainted: G ---------------- T
(2.6.32.700565 #21) HVM domU
EIP: 0060:[<c080b95f>] EFLAGS: 00210297 CPU: 3
EIP is at mem_cgroup_create+0xef/0x350
EAX: 9b74e240 EBX: f2e85a00 ECX: 00000001 EDX: 00000001
ESI: a88c8840 EDI: a88c8840 EBP: f201deb4 ESP: f201de8c
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process libvirtd (pid: 1823, ti=f201c000 task=f3642ab0 task.ti=f201c000)
Stack:
c09579b2 f2e85a00 32e85a00 f3455800 00000000 f2e85a00 f2c14ac0 c0a5a820
<0> fffffff4 f2c14ac0 f201def8 c049d3a7 00000000 00000000 00000000 000001ed
<0> f2c14ac8 f5fa4400 f24fe954 f3502000 f2c14e40 f24f5608 f3502010 f2c14ac0
Call Trace:
[<c049d3a7>] cgroup_mkdir+0xf7/0x450
[<c05318e3>] vfs_mkdir+0x93/0xf0
[<c0533787>] ? lookup_hash+0x27/0x30
[<c053390e>] sys_mkdirat+0xde/0x100
[<c04b5d4d>] ? call_rcu_sched+0xd/0x10
[<c04b5d58>] ? call_rcu+0x8/0x10
[<c047ab9f>] ? __put_cred+0x2f/0x50
[<c0524ded>] ? sys_faccessat+0x14d/0x180
[<c0523fb7>] ? filp_close+0x47/0x70
[<c0533950>] sys_mkdir+0x20/0x30
[<c0409b5f>] sysenter_do_call+0x12/0x28


static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
{
...
        memset(pn, 0, sizeof(*pn));

        for (zone = 0; zone < MAX_NR_ZONES; zone++) {
                mz = &pn->zoneinfo[zone];
                for_each_lru(l)
                        INIT_LIST_HEAD(&mz->lists[l]);   <- crash here
                mz->usage_in_excess = 0;
                mz->on_tree = false;
                mz->mem = mem;
        }
...


crash> dis 0xc080b93e 15
0xc080b93e <mem_cgroup_create+206>:  movl   $0x0,-0x18(%ebp)
0xc080b945 <mem_cgroup_create+213>:  mov    %esi,-0x1c(%ebp)
0xc080b948 <mem_cgroup_create+216>:  imul   $0x7c,-0x18(%ebp),%edi
0xc080b94c <mem_cgroup_create+220>:  xor    %ecx,%ecx
0xc080b94e <mem_cgroup_create+222>:  xor    %edx,%edx
0xc080b950 <mem_cgroup_create+224>:  lea    (%edi,%edx,8),%esi
0xc080b953 <mem_cgroup_create+227>:  add    $0x1,%ecx
0xc080b956 <mem_cgroup_create+230>:  lea    (%ebx,%esi,1),%eax
0xc080b959 <mem_cgroup_create+233>:  add    $0x1,%edx
0xc080b95c <mem_cgroup_create+236>:  cmp    $0x5,%ecx
0xc080b95f <mem_cgroup_create+239>:  mov    %eax,(%ebx,%esi,1)
0xc080b962 <mem_cgroup_create+242>:  mov    %eax,0x4(%eax)
0xc080b965 <mem_cgroup_create+245>:  jne    0xc080b950
0xc080b967 <mem_cgroup_create+247>:  mov    -0x14(%ebp),%eax
0xc080b96a <mem_cgroup_create+250>:  movl   $0x0,0x6c(%eax)

EDI on the first iteration should be 0; however, it is a88c8840 according to the Oops
dump, and looking at -0x18(%ebp) in the core shows 0 as it should be:

crash> x/xw 0xf201deb4-0x18
0xf201de9c: 0x00000000

So it looks like EDI was incorrectly restored by Xen, or -0x18(%ebp) had that
weird value at the moment when 0xc080b948 was executed.

It is possible that the invalid EDI value and the following

0xc080b950 <mem_cgroup_create+224>:  lea    (%edi,%edx,8),%esi

lead to some accessible page, and the writes

0xc080b95f <mem_cgroup_create+239>:  mov    %eax,(%ebx,%esi,1)
0xc080b962 <mem_cgroup_create+242>:  mov    %eax,0x4(%eax)

silently go to that page (see https://bugzilla.redhat.com/show_bug.cgi?id=700565#c36).
Then, after the init lists loop, it uses the correct pn offset from
-0x14(%ebp) and initialises the rest of the structure's fields on the correct page.

mz->usage_in_excess = 0;
mz->on_tree = false;
mz->mem = mem;

0xc080b967 <mem_cgroup_create+247>:  mov    -0x14(%ebp),%eax   <-
0xc080b96a <mem_cgroup_create+250>:  movl   $0x0,0x6c(%eax)
0xc080b971 <mem_cgroup_create+257>:  movl   $0x0,0x70(%eax)
0xc080b978 <mem_cgroup_create+264>:  movb   $0x0,0x74(%eax)
0xc080b97c <mem_cgroup_create+268>:  mov    -0x1c(%ebp),%edx
0xc080b97f <mem_cgroup_create+271>:  mov    %edx,0x78(%eax)
0xc080b982 <mem_cgroup_create+274>:  add    $0x7c,%eax
0xc080b985 <mem_cgroup_create+277>:  addl   $0x1,-0x18(%ebp)
0xc080b989 <mem_cgroup_create+281>:  cmpl   $0x4,-0x18(%ebp)
0xc080b98d <mem_cgroup_create+285>:  mov    %eax,-0x14(%ebp)
0xc080b990 <mem_cgroup_create+288>:  jne    0xc080b948

which could lead to the zeroed list entries of the first zone
and the originally reported Oops in mem_cgroup_force_empty.
Afterwards it looks like:

0xc080b985 <mem_cgroup_create+277>:  addl   $0x1,-0x18(%ebp)

-0x18(%ebp) is read correctly and the remaining 3 mz entries are initialized as
expected.

So the question is why and how
0xc080b948 <mem_cgroup_create+216>:  imul   $0x7c,-0x18(%ebp),%edi
may be screwed up.
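
For reference, a hedged mapping of the disassembly above back to the source
(register roles inferred from the listing; 0x7c is taken to be
sizeof(struct mem_cgroup_per_zone)):

/* Inferred register roles in the compiled init loop:
 *
 *	-0x18(%ebp)	zone
 *	%edi		zone * 0x7c	(byte offset of pn->zoneinfo[zone])
 *	%edx		lru index l
 *	%esi		%edi + l*8	(byte offset of mz->lists[l] within pn)
 *	%ebx		pn
 *	%eax		&mz->lists[l]
 *
 * so the pair of stores
 *	mov %eax,(%ebx,%esi,1)	->	mz->lists[l].next = &mz->lists[l];
 *	mov %eax,0x4(%eax)	->	mz->lists[l].prev = &mz->lists[l];
 * is INIT_LIST_HEAD(&mz->lists[l]). If %edi holds garbage (a88c8840 here)
 * instead of zone * 0x7c, both stores land on an unrelated page, while the
 * later field writes go through the separately kept pointer in -0x14(%ebp)
 * and still hit the real pn -- leaving exactly the zeroed lists seen in the
 * earlier vmcore dump.
 */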

PS:
However, a memory search for the writes of the first entry that went astray,
i.e. the sequence f3446a00 f3446a00, in a couple of vmcores didn't give
any positive results.


> Thanks,
> -Kame
>> PS:
>> It most easily reproduced only on xen hvm 32bit guest under heavy
>> vcpus contention for real cpus resources (i.e. I had to overcommit
>> cpus and run several cpu hog tasks on host to make guest crash on
>> reboot cycle).
>> And from last experiments, crash happens only on on hosts that
>> doesn't have hap feature or if hap is disabled in hypervisor.
>>
>>> Thanks,
>>> -Kame
>>

2011-06-09 12:37:03

by Stefano Stabellini

Subject: Possible shadow bug (was: Re: [PATCH] memcg: do not expose uninitialized mem_cgroup_per_node to world)

CC'ing xen-devel and Tim.

This is a comment from a previous email in the thread:

> It most easily reproduced only on xen hvm 32bit guest under heavy vcpus
> contention for real cpus resources (i.e. I had to overcommit cpus and
> run several cpu hog tasks on host to make guest crash on reboot cycle).
> And from last experiments, crash happens only on on hosts that doesn't
> have hap feature or if hap is disabled in hypervisor.

it makes me think that it is a shadow pagetables bug; see details below.
You can find more details on it following this thread on the lkml.




On Thu, 9 Jun 2011, Igor Mammedov wrote:
> On 06/08/2011 05:35 AM, KAMEZAWA Hiroyuki wrote:
> > On Tue, 07 Jun 2011 15:25:59 +0200
> > Igor Mammedov<[email protected]> wrote:
> >
> >> Sorry for late reply,
> >>
> >> On 06/03/2011 03:00 PM, Hiroyuki Kamezawa wrote:
> >>> 2011/6/3 Igor Mammedov<[email protected]>:
> >>>> On 06/02/2011 01:10 AM, Hiroyuki Kamezawa wrote:
> >>>>>> pc = list_entry(list->prev, struct page_cgroup, lru);
> >>>>> Hmm, I disagree your patch is a fix for mainline. At least, a cgroup
> >>>>> before completion of
> >>>>> create() is not populated to userland and you never be able to rmdir()
> >>>>> it because you can't
> >>>>> find it.
> >>>>>
> >>>>>
> >>>>> >26: e8 7d 12 30 00 call 0x3012a8
> >>>>> >2b:* 8b 73 08 mov 0x8(%ebx),%esi<-- trapping
> >>>>> instruction
> >>>>> >2e: 8b 7c 24 24 mov 0x24(%esp),%edi
> >>>>> >32: 8b 07 mov (%edi),%eax
> >>>>>
> >>>>> Hm, what is the call 0x3012a8 ?
> >>>>>
> >>>> pc = list_entry(list->prev, struct page_cgroup, lru);
> >>>> if (busy == pc) {
> >>>> list_move(&pc->lru, list);
> >>>> busy = 0;
> >>>> spin_unlock_irqrestore(&zone->lru_lock, flags);
> >>>> continue;
> >>>> }
> >>>> spin_unlock_irqrestore(&zone->lru_lock, flags);<---- is
> >>>> call 0x3012a8
> >>>> ret = mem_cgroup_move_parent(pc, mem, GFP_KERNEL);
> >>>>
> >>>> and mov 0x8(%ebx),%esi
> >>>> is dereferencing of 'pc' in inlined mem_cgroup_move_parent
> >>>>
> >>> Ah, thank you for input..then panicd at accessing pc->page and "pc"
> >>> was 0xfffffff4.
> >>> it means list->prev was NULL.
> >>>
> >> yes, that's the case.
> >>>> I've looked at vmcore once more and indeed there isn't any parallel task
> >>>> that touches cgroups code path.
> >>>> Will investigate if it is xen to blame for incorrect data in place.
> >>>>
> >>>> Thanks very much for your opinion.
> >>> What curious to me is that the fact "list->prev" is NULL.
> >>> I can see why you doubt the initialization code ....the list pointer never
> >>> contains NULL once it's used....
> >>> it smells like memory corruption or some to me. If you have vmcore,
> >>> what the problematic mem_cgroup_per_zone(node) contains ?
> >> it has all zeros except for last field:
> >>
> >> crash> rd f3446a00 62
> >> f3446a00: 00000000 00000000 00000000 00000000 ................
> >> f3446a10: 00000000 00000000 00000000 00000000 ................
> >> f3446a20: 00000000 00000000 00000000 00000000 ................
> >> f3446a30: 00000000 00000000 00000000 00000000 ................
> >> f3446a40: 00000000 00000000 00000000 00000000 ................
> >> f3446a50: 00000000 00000000 00000000 00000000 ................
> >> f3446a60: 00000000 00000000 00000000 00000000 ................
> >> f3446a70: 00000000 00000000 f36ef800 f3446a7c ..........n.|jD.
> >> f3446a80: f3446a7c f3446a84 f3446a84 f3446a8c |jD..jD..jD..jD.
> >> f3446a90: f3446a8c f3446a94 f3446a94 f3446a9c .jD..jD..jD..jD.
> >> f3446aa0: f3446a9c 00000000 00000000 00000000 .jD.............
> >> f3446ab0: 00000000 00000000 00000000 00000000 ................
> >> f3446ac0: 00000000 00000000 00000000 00000000 ................
> >> f3446ad0: 00000000 00000000 00000000 00000000 ................
> >> f3446ae0: 00000000 00000000 00000000 00000000 ................
> >> f3446af0: 00000000 f36ef800
> >>
> >> crash> struct mem_cgroup f36ef800
> >> struct mem_cgroup {
> >> ...
> >> info = {
> >> nodeinfo = {0xf3446a00}
> >> },
> >> ...
> >>
> >> It looks like a very targeted corruption of the first zone except of
> >> the last field, while the second zone and the rest are perfectly
> >> normal (i.e. have empty initialized lists).
> >>
> > Hmm, ok, thank you. Then, mem_cgroup_pre_zone[] was initialized once.
> > In this kind of case, I tend to check slab header of memory object f3446a00,
> > or check whether f3446a00 is an alive slab object or not.
> It looks like f3446a00 alive/allocated object
>
> crash> kmem f3446a00
> CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE
> f7000c80 size-512 512 2251 2616 327 4k
> SLAB MEMORY TOTAL ALLOCATED FREE
> f3da6540 f3446000 8 1 7
> FREE / [ALLOCATED]
> [f3446a00]
>
> PAGE PHYSICAL MAPPING INDEX CNT FLAGS
> c1fa58c0 33446000 0 70 1 2800080
>
>
> However I have a related crash that can lead to not initialized lists of
> the first entry
> (i.e. to what we see at f3446a00), debug kernel sometimes will crash at
> alloc_mem_cgroup_per_zone_info:
>
> XXX: pn: f208dc00, phy: 3208dc00
> XXX: pn: f2e85a00, phy: 32e85a00
> BUG: unable to handle kernel paging request at 9b74e240
> IP: [<c080b95f>] mem_cgroup_create0x+0xef/0x350
> *pdpt = 0000000033542001 *pde = 0000000000000000
> Oops: 0002 [#1] SMP
> ...
>
> Pid: 1823, comm: libvirtd Tainted: G ---------------- T
> (2.6.32.700565 #21) HVM domU
> EIP: 0060:[<c080b95f>] EFLAGS: 00210297 CPU: 3
> EIP is at mem_cgroup_create+0xef/0x350
> EAX: 9b74e240 EBX: f2e85a00 ECX: 00000001 EDX: 00000001
> ESI: a88c8840 EDI: a88c8840 EBP: f201deb4 ESP: f201de8c
> DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> Process libvirtd (pid: 1823, ti=f201c000 task=f3642ab0 task.ti=f201c000)
> Stack:
> c09579b2 f2e85a00 32e85a00 f3455800 00000000 f2e85a00 f2c14ac0 c0a5a820
> <0> fffffff4 f2c14ac0 f201def8 c049d3a7 00000000 00000000 00000000 000001ed
> <0> f2c14ac8 f5fa4400 f24fe954 f3502000 f2c14e40 f24f5608 f3502010 f2c14ac0
> Call Trace:
> [<c049d3a7>] cgroup_mkdir+0xf7/0x450
> [<c05318e3>] vfs_mkdir+0x93/0xf0
> [<c0533787>] ? lookup_hash+0x27/0x30
> [<c053390e>] sys_mkdirat+0xde/0x100
> [<c04b5d4d>] ? call_rcu_sched+0xd/0x10
> [<c04b5d58>] ? call_rcu+0x8/0x10
> [<c047ab9f>] ? __put_cred+0x2f/0x50
> [<c0524ded>] ? sys_faccessat+0x14d/0x180
> [<c0523fb7>] ? filp_close+0x47/0x70
> [<c0533950>] sys_mkdir+0x20/0x30
> [<c0409b5f>] sysenter_do_call+0x12/0x28
>
>
> static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
> {
> ...
> memset(pn, 0, sizeof(*pn));
>
> for (zone = 0; zone< MAX_NR_ZONES; zone++) {
> mz =&pn->zoneinfo[zone];
> for_each_lru(l)
> INIT_LIST_HEAD(&mz->lists[l]);<- crash here
> mz->usage_in_excess = 0;
> mz->on_tree = false;
> mz->mem = mem;
> }
> ...
>
>
> crash> dis 0xc080b93e 15
> 0xc080b93e<mem_cgroup_create+206>: movl $0x0,-0x18(%ebp)
> 0xc080b945<mem_cgroup_create+213>: mov %esi,-0x1c(%ebp)
> 0xc080b948<mem_cgroup_create+216>: imul $0x7c,-0x18(%ebp),%edi
> 0xc080b94c<mem_cgroup_create+220>: xor %ecx,%ecx
> 0xc080b94e<mem_cgroup_create+222>: xor %edx,%edx
> 0xc080b950<mem_cgroup_create+224>: lea (%edi,%edx,8),%esi
> 0xc080b953<mem_cgroup_create+227>: add $0x1,%ecx
> 0xc080b956<mem_cgroup_create+230>: lea (%ebx,%esi,1),%eax
> 0xc080b959<mem_cgroup_create+233>: add $0x1,%edx
> 0xc080b95c<mem_cgroup_create+236>: cmp $0x5,%ecx
> 0xc080b95f<mem_cgroup_create+239>: mov %eax,(%ebx,%esi,1)
> 0xc080b962<mem_cgroup_create+242>: mov %eax,0x4(%eax)
> 0xc080b965<mem_cgroup_create+245>: jne 0xc080b950
> 0xc080b967<mem_cgroup_create+247>: mov -0x14(%ebp),%eax
> 0xc080b96a<mem_cgroup_create+250>: movl $0x0,0x6c(%eax)
>
> EDI on the first iteration should be 0 however it is a88c8840 according to Oops
> dump and looking at -0x18(%ebp) in core shows 0 as it should be:
>
> crash> x/xw 0xf201deb4-0x18
> 0xf201de9c: 0x00000000
>
> so it looks like EDI is incorrectly restored by Xen or at the moment when 0xc080b948
> was executed -0x18(%ebp) had that weird value.
>
> It is possible that invalid EDI value and following
>
> 0xc080b950<mem_cgroup_create+224>: lea (%edi,%edx,8),%esi
>
> <https://bugzilla.redhat.com/show_bug.cgi?id=700565#c36>lead to some
> accessible page and writes
>
> 0xc080b95f<mem_cgroup_create+239>: mov %eax,(%ebx,%esi,1)
> 0xc080b962<mem_cgroup_create+242>: mov %eax,0x4(%eax)
>
> silently go to that page. Than after init lists loop it uses correct pn offset from
> -0x14(%ebp) and initialises the rest fields of structure on the correct page.
>
> mz->usage_in_excess = 0;
> mz->on_tree = false;
> mz->mem = mem;
>
> 0xc080b967<mem_cgroup_create+247>: mov -0x14(%ebp),%eax<-
> 0xc080b96a<mem_cgroup_create+250>: movl $0x0,0x6c(%eax)
> 0xc080b971<mem_cgroup_create+257>: movl $0x0,0x70(%eax)
> 0xc080b978<mem_cgroup_create+264>: movb $0x0,0x74(%eax)
> 0xc080b97c<mem_cgroup_create+268>: mov -0x1c(%ebp),%edx
> 0xc080b97f<mem_cgroup_create+271>: mov %edx,0x78(%eax)
> 0xc080b982<mem_cgroup_create+274>: add $0x7c,%eax
> 0xc080b985<mem_cgroup_create+277>: addl $0x1,-0x18(%ebp)
> 0xc080b989<mem_cgroup_create+281>: cmpl $0x4,-0x18(%ebp)
> 0xc080b98d<mem_cgroup_create+285>: mov %eax,-0x14(%ebp)
> 0xc080b990<mem_cgroup_create+288>: jne 0xc080b948
>
> which could lead to the 0-ed list entries of the first zone
> and the originally reported Oops in mem_cgroup_force_empty.
> Afterwards it looks like:
>
> 0xc080b985<mem_cgroup_create+277>: addl $0x1,-0x18(%ebp)
>
> -0x18(%ebp) is read correctly and the rest of 3 mz entries are initialized as
> expected.
>
> So question is why and how
> 0xc080b948<mem_cgroup_create+216>: imul $0x7c,-0x18(%ebp),%edi
> may be screwed up
>
> PS:
> However, memory search for the went astray writes of the first entry
> i.e. sequesnce f3446a00 f3446a00 in a couple of vmcores didn't give
> any positive results.
>
>
> > Thanks,
> > -Kame
> >> PS:
> >> It most easily reproduced only on xen hvm 32bit guest under heavy
> >> vcpus contention for real cpus resources (i.e. I had to overcommit
> >> cpus and run several cpu hog tasks on host to make guest crash on
> >> reboot cycle).
> >> And from last experiments, crash happens only on on hosts that
> >> doesn't have hap feature or if hap is disabled in hypervisor.
> >>
> >>> Thanks,
> >>> -Kame
> >>
>

2011-06-09 15:01:40

by Tim Deegan

Subject: Re: [Xen-devel] Possible shadow bug (was: Re: [PATCH] memcg: do not expose uninitialized mem_cgroup_per_node to world)

At 13:40 +0100 on 09 Jun (1307626812), Stefano Stabellini wrote:
> CC'ing xen-devel and Tim.
>
> This is a comment from a previous email in the thread:
>
> > It most easily reproduced only on xen hvm 32bit guest under heavy vcpus
> > contention for real cpus resources (i.e. I had to overcommit cpus and
> > run several cpu hog tasks on host to make guest crash on reboot cycle).
> > And from last experiments, crash happens only on on hosts that doesn't
> > have hap feature or if hap is disabled in hypervisor.
>
> it makes me think that it is a shadow pagetables bug; see details below.
> You can find more details on it following this thread on the lkml.

Oh dear. I'm having a look at the linux code now to try and understand
the behaviour. In the meantime, what version of Xen was this on? If
you're willing to try recompiling Xen with some small patches that
disable the "cleverer" parts of the shadow pagetable code, that might
indicate something. (Of course, it might just change the timing to
obscure a real linux bug too.)

The only time I've seen a corruption like this, with a mapping
transiently going to the wrong frame, it turned out to be caused by
32-bit pagetable-handling code writing a PAE PTE with a single 64-bit
write (which is not atomic on x86-32), and the TLB happening to see the
intermediate, half-written entry. I doubt that there's any bug like
that in linux, though, or we'd surely have seen it before now.
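
For reference, the safe idiom splits the 64-bit PTE store into two 32-bit
halves so a walker can never see a half-written present entry. A compilable
sketch modeled on Linux's PAE set_pte (an illustration, not the Xen or RHEL
code under suspicion; updating an already-present entry instead needs an
atomic 64-bit exchange such as cmpxchg8b, or a clear-first sequence):

#include <stdint.h>

typedef struct { uint32_t pte_low, pte_high; } pte_t;

/* Sketch: on x86 a compiler barrier suffices here, since stores are
 * not reordered with other stores. */
#define wmb()	__asm__ __volatile__("" ::: "memory")

/* UNSAFE: a plain 64-bit assignment compiles to two 32-bit stores on
 * x86-32, and a concurrent page walk may observe the intermediate state:
 * the present bit (in the low word) set while the high word is stale. */

/* Safe for filling a non-present entry: make the word carrying the
 * present bit visible last. */
void set_pte_pae(volatile pte_t *ptep, pte_t pte)
{
	ptep->pte_high = pte.pte_high;
	wmb();
	ptep->pte_low = pte.pte_low;
}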

Cheers,

Tim.

--
Tim Deegan <[email protected]>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd. (Company #02937203, SL9 0BG)

2011-06-09 16:48:42

by Igor Mammedov

Subject: Re: [Xen-devel] Possible shadow bug

On 06/09/2011 05:01 PM, Tim Deegan wrote:
> At 13:40 +0100 on 09 Jun (1307626812), Stefano Stabellini wrote:
>> CC'ing xen-devel and Tim.
>>
>> This is a comment from a previous email in the thread:
>>
>>> It most easily reproduced only on xen hvm 32bit guest under heavy vcpus
>>> contention for real cpus resources (i.e. I had to overcommit cpus and
>>> run several cpu hog tasks on host to make guest crash on reboot cycle).
>>> And from last experiments, crash happens only on on hosts that doesn't
>>> have hap feature or if hap is disabled in hypervisor.
>> it makes me think that it is a shadow pagetables bug; see details below.
>> You can find more details on it following this thread on the lkml.
> Oh dear. I'm having a look at the linux code now to try and understand
> the behaviour. In the meantime, what version of Xen was this on? If
It's rhel5.6 xen. I've tried to test on SLES 11, which has 4.0.1 xen; however,
I wasn't able to reproduce the problem. (I'm not sure if hap was turned off in
this case.) More detailed info can be found at RHBZ#700565.

> you're willing to try recompiling Xen with some small patches that
> disable the "cleverer" parts of the shadow pagetable code that might
> indicate something. (Of course, it might just change the timing to
> obscure a real linux bug too.)
>
I haven't got to this part yet. But it looks like it's the only option left.

> The only time I've seen a corruption like this, with a mapping
> transiently going to the wrong frame, it turned out to be caused by
> 32-bit pagetable-handling code writing a PAE PTE with a single 64-bit
> write (which is not atomic on x86-32), and the TLB happening to see the
> intermediate, half-written entry. I doubt that there's any bug like
> that in linux, though, or we'd surely have seen it before now.
>
> Cheers,
>
> Tim.
>


--
Thanks,
Igor

2011-06-10 10:01:44

by Tim Deegan

Subject: Re: [Xen-devel] Possible shadow bug

Hi,

At 18:47 +0200 on 09 Jun (1307645229), Igor Mammedov wrote:
> It's rhel5.6 xen. I've tried to test on SLES 11 that has 4.0.1 xen, however
> wasn't able to reproduce problem. (I'm not sure if hap was turned
> off in this case). More detailed info can be found at RHBZ#700565

The best way to be sure whether HAP is in use is to connect to the
serial line, hit ^A^A^A to switch input to Xen, and hit 'q' to dump
per-domain state. The printout for the guest domain should either say
"paging assistance: shadow refcounts translate external"
or
"paging assistance: hap refcounts translate external".

(If you don't have serial you can get the same info by running
"xm debug-keys q" and then "xm dmesg" to read the output.)

> >you're willing to try recompiling Xen with some small patches that
> >disable the "cleverer" parts of the shadow pagetable code, that might
> >indicate something. (Of course, it might just change the timing to
> >obscure a real linux bug too.)
> >
> Haven't got to this part yet, but it looks like it's the only option left.

Actually, looking at the disassembly you posted, it looks more like it
might be an emulator bug in Xen; if Xen finds itself emulating the IMUL
instruction and either gets the logic wrong or does the memory access
wrong, it could cause that failure. And one reason that Xen emulates
instructions is if the memory operand is on a pagetable that's shadowed
(which might be a page that was recently a pagetable).
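
As a rough illustration, and emphatically not Xen's actual emulator
code, emulating IMUL with a memory operand means the emulator must
fetch the operand from guest memory itself and reproduce the signed
multiply; a wrong-width read or a dropped sign extension in either
step silently corrupts the destination register the guest then trusts:

#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of emulating "imul r32, r/m32":
 * dst = (int32_t)dst * (int32_t)mem32, truncated to 32 bits.
 * 'guest_operand' stands in for the emulator's guest-memory access,
 * which must read exactly 4 bytes from the right guest address. */
static uint32_t emulate_imul32(uint32_t dst, const void *guest_operand)
{
	int32_t src;

	memcpy(&src, guest_operand, sizeof(src));
	return (uint32_t)((int32_t)dst * src);	/* signed multiply */
}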

ISTR that even though the RHEL xen reports a 3.0.x version it has quite
a lot of backports in it. Does it have this patch?
http://hg.uk.xensource.com/xen-3.1-testing.hg/rev/e8fca4c42d05

Cheers,

Tim.

--
Tim Deegan <[email protected]>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd. (Company #02937203, SL9 0BG)

2011-06-10 10:10:18

by Tim Deegan

[permalink] [raw]
Subject: Re: [Xen-devel] Possible shadow bug

At 11:01 +0100 on 10 Jun (1307703699), Tim Deegan wrote:
> Actually, looking at the disassembly you posted, it looks more like it
> might be an emulator bug in Xen; if Xen finds itself emulating the IMUL
> instruction and either gets the logic wrong or does the memory access
> wrong, it could cause that failure. And one reason that Xen emulates
> instructions is if the memory operand is on a pagetable that's shadowed
> (which might be a page that was recently a pagetable).
>
> ISTR that even though the RHEL xen reports a 3.0.x version it has quite
> a lot of backports in it. Does it have this patch?
> http://hg.uk.xensource.com/xen-3.1-testing.hg/rev/e8fca4c42d05

Oops, that URL doesn't work; I meant this:
http://xenbits.xen.org/xen-3.1-testing.hg/rev/e8fca4c42d05

Tim.

--
Tim Deegan <[email protected]>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd. (Company #02937203, SL9 0BG)

2011-06-10 12:24:07

by Pasi Kärkkäinen

[permalink] [raw]
Subject: Re: [Xen-devel] Possible shadow bug

On Fri, Jun 10, 2011 at 11:10:11AM +0100, Tim Deegan wrote:
> At 11:01 +0100 on 10 Jun (1307703699), Tim Deegan wrote:
> > Actually, looking at the disassembly you posted, it looks more like it
> > might be an emulator bug in Xen; if Xen finds itself emulating the IMUL
> > instruction and either gets the logic wrong or does the memory access
> > wrong, it could cause that failure. And one reason that Xen emulates
> > instructions is if the memory operand is on a pagetable that's shadowed
> > (which might be a page that was recently a pagetable).
> >
> > ISTR that even though the RHEL xen reports a 3.0.x version it has quite
> > a lot of backports in it. Does it have this patch?
> > http://hg.uk.xensource.com/xen-3.1-testing.hg/rev/e8fca4c42d05
>
> Oops, that URL doesn't work; I meant this:
> http://xenbits.xen.org/xen-3.1-testing.hg/rev/e8fca4c42d05
>

RHEL5 Xen (hypervisor) reports its version as 3.1.2-xyz.

-- Pasi

2011-06-10 12:40:52

by Tim Deegan

[permalink] [raw]
Subject: Re: [Xen-devel] Possible shadow bug

At 14:48 +0300 on 10 Jun (1307717301), Pasi Kärkkäinen wrote:
> On Fri, Jun 10, 2011 at 11:10:11AM +0100, Tim Deegan wrote:
> > At 11:01 +0100 on 10 Jun (1307703699), Tim Deegan wrote:
> > > ISTR that even though the RHEL xen reports a 3.0.x version it has quite
> > > a lot of backports in it. Does it have this patch?
> > > http://hg.uk.xensource.com/xen-3.1-testing.hg/rev/e8fca4c42d05
> >
> > Oops, that URL doesn't work; I meant this:
> > http://xenbits.xen.org/xen-3.1-testing.hg/rev/e8fca4c42d05
> >
>
> RHEL5 Xen (hypervisor) reports its version as 3.1.2-xyz.

Based on a quick scrobble through the CentOS 5.6 SRPMs it looks like a
3.1.0 hypervisor with a bunch of extra patches, but not this one. This
is very likely the cause of the crash in mem_cgroup_create(), and
probably the corruptions too. That would explain why they didn't happen
on a 4.0.x SLES11 Xen, but not really why the original patch in this
thread made it go away.

Cheers,

Tim.

--
Tim Deegan <[email protected]>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd. (Company #02937203, SL9 0BG)

2011-06-10 13:55:59

by Igor Mammedov

[permalink] [raw]
Subject: Re: [Xen-devel] Possible shadow bug

On 06/10/2011 12:10 PM, Tim Deegan wrote:
> At 11:01 +0100 on 10 Jun (1307703699), Tim Deegan wrote:
>> Actually, looking at the disassembly you posted, it looks more like it
>> might be an emulator bug in Xen; if Xen finds itself emulating the IMUL
>> instruction and either gets the logic wrong or does the memory access
>> wrong, it could cause that failure. And one reason that Xen emulates
>> instructions is if the memory operand is on a pagetable that's shadowed
>> (which might be a page that was recently a pagetable).
>>
>> ISTR that even though the RHEL xen reports a 3.0.x version it has quite
>> a lot of backports in it. Does it have this patch?
>> http://hg.uk.xensource.com/xen-3.1-testing.hg/rev/e8fca4c42d05
> Oops, that URL doesn't work; I meant this:
> http://xenbits.xen.org/xen-3.1-testing.hg/rev/e8fca4c42d05
>
> Tim.
>
Tim, thank you very much!
We were missing that changeset, and applying it solved the problem.

--
Thanks,
Igor

2011-06-10 15:39:45

by Igor Mammedov

[permalink] [raw]
Subject: Re: [Xen-devel] Possible shadow bug

On 06/10/2011 02:40 PM, Tim Deegan wrote:
> At 14:48 +0300 on 10 Jun (1307717301), Pasi Kärkkäinen wrote:
>> On Fri, Jun 10, 2011 at 11:10:11AM +0100, Tim Deegan wrote:
>>> At 11:01 +0100 on 10 Jun (1307703699), Tim Deegan wrote:
>>>> ISTR that even though the RHEL xen reports a 3.0.x version it has quite
>>>> a lot of backports in it. Does it have this patch?
>>>> http://hg.uk.xensource.com/xen-3.1-testing.hg/rev/e8fca4c42d05
>>> Oops, that URL doesn't work; I meant this:
>>> http://xenbits.xen.org/xen-3.1-testing.hg/rev/e8fca4c42d05
>>>
>> RHEL5 Xen (hypervisor) reports its version as 3.1.2-xyz.
> Based on a quick scrobble through the CentOS 5.6 SRPMs it looks like a
> 3.1.0 hypervisor with a bunch of extra patches, but not this one. This
> is very likely the cause of the crash in mem_cgroup_create(), and
> probably the corruptions too. That would explain why they didn't happen
> on a 4.0.x SLES11 Xen, but not really why the original patch in this
> thread made it go away.
>
Maybe it changes the timing so that imul is executed with the correct
memory content? Putting an extra printk inside the zone loop, or flushing
the tlb before it, also makes the problem go away. Or maybe the problem
just becomes invisible and memory is corrupted somewhere else.

PS:
Well, never mind. I do not know what I'm talking about.
> Cheers,
>
> Tim.
>


--
Thanks,
Igor

2011-06-10 16:58:13

by Igor Mammedov

[permalink] [raw]
Subject: Re: [PATCH] memcg: do not expose uninitialized mem_cgroup_per_node to world

On 06/08/2011 11:09 PM, Andrew Morton wrote:
> The original patch:
>
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4707,7 +4707,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
> if (!pn)
> return 1;
>
> - mem->info.nodeinfo[node] = pn;
> for (zone = 0; zone< MAX_NR_ZONES; zone++) {
> mz =&pn->zoneinfo[zone];
> for_each_lru(l)
> @@ -4716,6 +4715,7 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
> mz->on_tree = false;
> mz->mem = mem;
> }
> + mem->info.nodeinfo[node] = pn;
> return 0;
> }
>
> looks like a really good idea. But it needs a new changelog and I'd be
> a bit reluctant to merge it as it appears that the aptch removes our
> only known way of reproducing a bug.
>
> So for now I think I'll queue the patch up unchangelogged so the issue
> doesn't get forgotten about.
>
The problem was in rhel's xen hypervisor: it was missing the fix for imul
emulation. Details here:
http://lists.xensource.com/archives/html/xen-devel/2011-06/msg00801.html
Thanks to Tim Deegan and everyone who was involved in the discussion.
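
For reference, a minimal sketch of the initialize-then-publish pattern
that the reordered patch follows; the names below are illustrative
kernel-style code, not the real memcg structures, and a publication
racing with lockless readers would also want smp_wmb() (or
rcu_assign_pointer()) as shown:

#include <linux/slab.h>
#include <linux/list.h>
#include <linux/numa.h>

struct node_info {
	struct list_head lru;
};

static struct node_info *nodeinfo[MAX_NUMNODES];	/* globally visible */

static int alloc_node_info(int node)
{
	struct node_info *ni = kzalloc(sizeof(*ni), GFP_KERNEL);

	if (!ni)
		return -ENOMEM;

	INIT_LIST_HEAD(&ni->lru);	/* fully initialize first ... */
	smp_wmb();			/* ... order init before publication ... */
	nodeinfo[node] = ni;		/* ... then let the world see it */
	return 0;
}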

--
Thanks,
Igor