Memory hotplug can bring a new node online at runtime, after which the reap timer may
select this new node as a reap node. For each existing kmem_list, we therefore need to
ensure that the alien cache entry corresponding to the newly added node is NULL.
Otherwise, reap_alien() may hit a BUG when it touches the newly added node.
Signed-off-by: Haicheng Li <[email protected]>
---
mm/slab.c | 7 +++----
1 files changed, 3 insertions(+), 4 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index 7dfa481..a9486a0 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -966,18 +966,17 @@ static void *alternate_node_alloc(struct kmem_cache *, gfp_t);
static struct array_cache **alloc_alien_cache(int node, int limit, gfp_t gfp)
{
struct array_cache **ac_ptr;
- int memsize = sizeof(void *) * nr_node_ids;
+ int memsize = sizeof(void *) * MAX_NUMNODES;
int i;
if (limit > 1)
limit = 12;
ac_ptr = kmalloc_node(memsize, gfp, node);
if (ac_ptr) {
+ memset(ac_ptr, 0, memsize);
for_each_node(i) {
- if (i == node || !node_online(i)) {
- ac_ptr[i] = NULL;
+ if (i == node || !node_online(i))
continue;
- }
ac_ptr[i] = alloc_arraycache(node, limit, 0xbaadf00d, gfp);
if (!ac_ptr[i]) {
for (i--; i >= 0; i--)
--
1.6.0.rc1
Haicheng Li <[email protected]> writes:
> Memory hotplug would online new node in runtime, then reap timer will
> add this new node as a reap node. In such case, for each existing
> kmem_list, we need to ensure that the alien cache entry corresponding
> to this new added node is NULL. Otherwise, it might cause BUG when
> reap_alien() affecting the new added node.
>
> Signed-off-by: Haicheng Li <[email protected]>
Acked-by: Andi Kleen <[email protected]>
IMHO a 2.6.33 and even stable candidate
-Andi
> ---
> mm/slab.c | 7 +++----
> 1 files changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/mm/slab.c b/mm/slab.c
> index 7dfa481..a9486a0 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -966,18 +966,17 @@ static void *alternate_node_alloc(struct kmem_cache *, gfp_t);
> static struct array_cache **alloc_alien_cache(int node, int limit, gfp_t gfp)
> {
> struct array_cache **ac_ptr;
> - int memsize = sizeof(void *) * nr_node_ids;
> + int memsize = sizeof(void *) * MAX_NUMNODES;
> int i;
>
> if (limit > 1)
> limit = 12;
> ac_ptr = kmalloc_node(memsize, gfp, node);
> if (ac_ptr) {
> + memset(ac_ptr, 0, memsize);
> for_each_node(i) {
> - if (i == node || !node_online(i)) {
> - ac_ptr[i] = NULL;
> + if (i == node || !node_online(i))
> continue;
> - }
> ac_ptr[i] = alloc_arraycache(node, limit, 0xbaadf00d, gfp);
> if (!ac_ptr[i]) {
> for (i--; i >= 0; i--)
> --
> 1.6.0.rc1
--
[email protected] -- Speaking for myself only.
On Tue, 22 Dec 2009, Haicheng Li wrote:
> struct array_cache **ac_ptr;
> - int memsize = sizeof(void *) * nr_node_ids;
> + int memsize = sizeof(void *) * MAX_NUMNODES;
> int i;
Why does the alien cache pointer array size have to be increased? node ids
beyond nr_node_ids cannot be used.
>
> if (limit > 1)
> limit = 12;
> ac_ptr = kmalloc_node(memsize, gfp, node);
Use kzalloc to ensure zeroed memory.
> if (ac_ptr) {
> + memset(ac_ptr, 0, memsize);
> for_each_node(i) {
> - if (i == node || !node_online(i)) {
> - ac_ptr[i] = NULL;
> + if (i == node || !node_online(i))
> continue;
> - }
> ac_ptr[i] = alloc_arraycache(node, limit, 0xbaadf00d,
> gfp);
> if (!ac_ptr[i]) {
> for (i--; i >= 0; i--)
>
On Tue, 2009-12-22 at 20:38 +0800, Haicheng Li wrote:
> ac_ptr = kmalloc_node(memsize, gfp, node);
> if (ac_ptr) {
> + memset(ac_ptr, 0, memsize);
Please use kzalloc_node here.
I'm not sure what's going on with nr_node_ids vs MAX_NUMNODES, so I think
we need to see an answer to Christoph's question before going forward
with this.
--
http://selenic.com : development and support for Mercurial and Linux
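For readers comparing the two review suggestions, here is a minimal userspace sketch of the difference between "allocate, then memset" and a single zeroing allocation. It uses malloc()/calloc() as stand-ins for kmalloc_node()/kzalloc_node(); the sketch_* names are hypothetical and not part of mm/slab.c.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Two-step form, analogous to kmalloc_node() followed by memset():
 * the caller must remember to clear the array.
 * (Hypothetical helper, userspace only.)
 */
static void **sketch_alloc_then_clear(size_t n)
{
	void **p;

	assert(n > 0);
	p = malloc(n * sizeof(*p));
	if (p)
		memset(p, 0, n * sizeof(*p));
	return p;
}

/*
 * One-step form, analogous to kzalloc_node(): the allocator hands back
 * zeroed memory. This relies on all-bits-zero being a NULL pointer,
 * which holds on the platforms Linux supports.
 */
static void **sketch_alloc_zeroed(size_t n)
{
	assert(n > 0);
	return calloc(n, sizeof(void *));
}
```

Both helpers return an array whose entries all read as NULL; the one-step form simply leaves no window in which the array is allocated but not yet cleared.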
Christoph & Matt,
Thanks for the review. Node ids beyond nr_node_ids can be used in the case of
memory hot-add.
Let me explain:
First, the original nr_node_ids = 1 + nid of the highest POSSIBLE node.
Second, consider hot-adding memory that sits on a newly added node:
1. When the ACPI event is triggered:
acpi_memory_device_add() -> acpi_memory_enable_device() -> add_memory() -> node_set_online()
node_state[N_ONLINE] is updated to include the new node,
and the id of this new node is beyond nr_node_ids.
2. Then, when reap_timer is scheduled:
cache_reap() -> next_reap_node() -> node = next_node(node, node_online_map)
the newly added node can be selected as the reap node of a CPU, for example CPU X.
3. When the reap_timer of CPU X fires again:
cache_reap() -> reap_alien()
it accesses the newly added node as the reap_node of CPU X.
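The round-robin step in the sequence above can be sketched in userspace C as follows. This is only a model of how next_reap_node() walks the online map; the sketch_* names and SKETCH_MAX_NODES are hypothetical stand-ins, not kernel identifiers.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_NODES 8	/* hypothetical stand-in for MAX_NUMNODES */

/* Hypothetical stand-in for node_online_map: one flag per node id. */
struct sketch_nodemask {
	bool online[SKETCH_MAX_NODES];
};

/* First online node, or SKETCH_MAX_NODES if no node is online. */
static int sketch_first_node(const struct sketch_nodemask *m)
{
	for (int i = 0; i < SKETCH_MAX_NODES; i++)
		if (m->online[i])
			return i;
	return SKETCH_MAX_NODES;
}

/* Next online node after 'node', or SKETCH_MAX_NODES if there is none. */
static int sketch_next_node(int node, const struct sketch_nodemask *m)
{
	for (int i = node + 1; i < SKETCH_MAX_NODES; i++)
		if (m->online[i])
			return i;
	return SKETCH_MAX_NODES;
}

/*
 * Round-robin step modelled on next_reap_node(): advance to the next
 * online node and wrap around to the first one at the end of the map.
 * Nothing here compares the result against nr_node_ids.
 */
static int sketch_next_reap_node(int node, const struct sketch_nodemask *m)
{
	assert(node >= 0 && node < SKETCH_MAX_NODES);
	node = sketch_next_node(node, m);
	if (node >= SKETCH_MAX_NODES)
		node = sketch_first_node(m);
	return node;
}
```

With nodes 0 and 1 online, the walk alternates between 0 and 1. Once a hot-added node 3 is marked online, the step from node 1 yields 3, even if arrays elsewhere were sized for only two node ids.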
I caught this BUG in our memory hot-add testing. The test scenario: the machine
originally has 2 nodes enabled, and memory is then hot-added on the 3rd node.
The BUG is:
[ 141.667487] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
[ 141.667782] IP: [<ffffffff810b8a64>] cache_reap+0x71/0x236
[ 141.667969] PGD 0
[ 141.668129] Oops: 0000 [#1] SMP
[ 141.668357] last sysfs file: /sys/class/scsi_host/host4/proc_name
[ 141.668469] CPU
[ 141.668630] Modules linked in: ipv6 autofs4 rfcomm l2cap crc16 bluetooth rfkill
binfmt_misc dm_mirror dm_region_hash dm_log dm_multipath dm_mod video output sbs sbshc fan
battery ac parport_pc lp parport joydev usbhid sr_mod cdrom thermal processor thermal_sys
container button rtc_cmos rtc_core rtc_lib i2c_i801 i2c_core pcspkr uhci_hcd ohci_hcd
ehci_hcd usbcore
[ 141.671659] Pid: 126, comm: events/27 Not tainted 2.6.32 #9 Server
[ 141.671771] RIP: 0010:[<ffffffff810b8a64>] [<ffffffff810b8a64>] cache_reap+0x71/0x236
[ 141.671981] RSP: 0018:ffff88027e81bdf0 EFLAGS: 00010206
[ 141.672089] RAX: 0000000000000002 RBX: 0000000000000078 RCX: ffff88047d86e580
[ 141.672204] RDX: ffff88047dfcbc00 RSI: ffff88047f13f6c0 RDI: ffff88047d9136c0
[ 141.672319] RBP: ffff88027e81be30 R08: 0000000000000001 R09: 0000000000000001
[ 141.672433] R10: 0000000000000000 R11: 0000000000000086 R12: ffff88047d87c200
[ 141.672548] R13: ffff88047d87d680 R14: ffffffff810b89f3 R15: 0000000000000002
[ 141.672663] FS: 0000000000000000(0000) GS:ffff88028b5a0000(0000) knlGS:0000000000000000
[ 141.672807] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 141.672917] CR2: 0000000000000078 CR3: 0000000001001000 CR4: 00000000000006e0
[ 141.673032] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 141.673147] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 141.673262] Process events/27 (pid: 126, threadinfo ffff88027e81a000, task
ffff88027f3ea040)
[ 141.673406] Stack:
[ 141.673503] ffff88027e81be30 ffff88028b5b05a0 0000000100000000 ffff88027e81be80
[ 141.673808] <0> ffff88028b5b5b40 ffff88028b5b05a0 ffffffff810b89f3 fffffffff00000c6
[ 141.674265] <0> ffff88027e81bec0 ffffffff81057394 ffffffff8105733e ffffffff81369f3a
[ 141.674813] Call Trace:
[ 141.674915] [<ffffffff810b89f3>] ? cache_reap+0x0/0x236
[ 141.675028] [<ffffffff81057394>] worker_thread+0x17a/0x27b
[ 141.675138] [<ffffffff8105733e>] ? worker_thread+0x124/0x27b
[ 141.675256] [<ffffffff81369f3a>] ? thread_return+0x3e/0xee
[ 141.675369] [<ffffffff8105a244>] ? autoremove_wake_function+0x0/0x38
[ 141.675482] [<ffffffff8105721a>] ? worker_thread+0x0/0x27b
[ 141.675593] [<ffffffff8105a146>] kthread+0x7d/0x87
[ 141.675707] [<ffffffff81012daa>] child_rip+0xa/0x20
[ 141.675817] [<ffffffff81012710>] ? restore_args+0x0/0x30
[ 141.675927] [<ffffffff8105a0c9>] ? kthread+0x0/0x87
[ 141.676035] [<ffffffff81012da0>] ? child_rip+0x0/0x20
[ 141.676142] Code: a4 c5 68 08 00 00 65 48 8b 04 25 00 e4 00 00 48 8b 04 18 49 8b 4c 24
78 48 85 c9 74 5b 41 89 c7 48 98 48 8b 1c c1 48 85 db 74 4d <83> 3b 00 74 48 48 83 3d ff
d4 65 00 00 75 04 0f 0b eb fe fa 66
[ 141.680610] RIP [<ffffffff810b8a64>] cache_reap+0x71/0x236
[ 141.680785] RSP <ffff88027e81bdf0>
[ 141.680886] CR2: 0000000000000078
[ 141.681016] ---[ end trace b1e17069ef81fe83 ]--
The corresponding assembly code is:
ffffffff810b8a3f: 65 48 8b 04 25 00 e4 mov %gs:0xe400,%rax
ffffffff810b8a46: 00 00
ffffffff810b8a48: 48 8b 04 18 mov (%rax,%rbx,1),%rax
ffffffff810b8a4c: 49 8b 4c 24 78 mov 0x78(%r12),%rcx
ffffffff810b8a51: 48 85 c9 test %rcx,%rcx
ffffffff810b8a54: 74 5b je ffffffff810b8ab1 <cache_reap+0xbe>
ffffffff810b8a56: 41 89 c7 mov %eax,%r15d
ffffffff810b8a59: 48 98 cltq
ffffffff810b8a5b: 48 8b 1c c1 mov (%rcx,%rax,8),%rbx
ffffffff810b8a5f: 48 85 db test %rbx,%rbx
ffffffff810b8a62: 74 4d je ffffffff810b8ab1 <cache_reap+0xbe>
ffffffff810b8a64: 83 3b 00 cmpl $0x0,(%rbx)
The oops point is 0xffffffff810b8a64, which corresponds to $KSRC/mm/slab.c:1035:
1025 /*
1026 * Called from cache_reap() to regularly drain alien caches round robin.
1027 */
1028 static void reap_alien(struct kmem_cache *cachep, struct kmem_list3 *l3)
1029 {
1030 int node = __get_cpu_var(reap_node);
1031
1032 if (l3->alien) {
1033 struct array_cache *ac = l3->alien[node];
1034
1035 if (ac && ac->avail && spin_trylock_irq(&ac->lock)) {
1036 __drain_alien_cache(cachep, ac, node);
1037 spin_unlock_irq(&ac->lock);
1038 }
1039 }
1040 }
RAX: 0000000000000002 -> node
RBX: 0000000000000078 -> ac
(%rbx) -> ac->avail
The value of ac is random and invalid, so dereferencing ac->avail causes the oops.
The reap_node (the 3rd node) is the node newly added by memory hot-add. However, for an
old kmem_list, l3->alien has only 2 cache entries (nr_node_ids = 2), so l3->alien[2] is
out of bounds.
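To make the out-of-bounds read concrete, here is a small userspace sketch of a pointer array sized by the node count at allocation time, with a lookup that refuses indexes past that size. It is an illustration only and not a fix proposed in this thread; every sketch_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct array_cache. */
struct sketch_array_cache {
	unsigned int avail;
};

/* Hypothetical stand-in for l3->alien, remembering its own length. */
struct sketch_alien {
	size_t nr_entries;	/* value of nr_node_ids at allocation time */
	struct sketch_array_cache **entries;
};

/* Allocate a zeroed pointer array with one slot per node id. */
static int sketch_alien_init(struct sketch_alien *a, size_t nr_node_ids)
{
	assert(nr_node_ids > 0);
	a->entries = calloc(nr_node_ids, sizeof(*a->entries));
	if (!a->entries)
		return -1;
	a->nr_entries = nr_node_ids;
	return 0;
}

/*
 * Bounds-checked lookup: a node id at or beyond nr_entries has no slot,
 * so report "no alien cache" instead of reading past the array.
 */
static struct sketch_array_cache *
sketch_alien_lookup(const struct sketch_alien *a, int node)
{
	if (node < 0 || (size_t)node >= a->nr_entries)
		return NULL;
	return a->entries[node];
}

/* Release the pointer array (the caches themselves are owned elsewhere). */
static void sketch_alien_free(struct sketch_alien *a)
{
	free(a->entries);
	a->entries = NULL;
	a->nr_entries = 0;
}
```

In the scenario above, the array has 2 slots and the reap node is 2, so an unchecked read returns whatever happens to follow the allocation, which matches the random ac value in the oops.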
Christoph Lameter wrote:
> On Tue, 22 Dec 2009, Haicheng Li wrote:
>
>> struct array_cache **ac_ptr;
>> - int memsize = sizeof(void *) * nr_node_ids;
>> + int memsize = sizeof(void *) * MAX_NUMNODES;
>> int i;
>
> Why does the alien cache pointer array size have to be increased? node ids
> beyond nr_node_ids cannot be used.
>
>
>> if (limit > 1)
>> limit = 12;
>> ac_ptr = kmalloc_node(memsize, gfp, node);
>
> Use kzalloc to ensure zeroed memory.
>
>> if (ac_ptr) {
>> + memset(ac_ptr, 0, memsize);
>> for_each_node(i) {
>> - if (i == node || !node_online(i)) {
>> - ac_ptr[i] = NULL;
>> + if (i == node || !node_online(i))
>> continue;
>> - }
>> ac_ptr[i] = alloc_arraycache(node, limit, 0xbaadf00d,
>> gfp);
>> if (!ac_ptr[i]) {
>> for (i--; i >= 0; i--)
>>
>
Memory hotplug can bring a new node online at runtime, after which the reap timer
may select this new node as a reap node. For each existing kmem_list, we therefore
need to ensure that the alien cache entry corresponding to the newly added node
is NULL.
Otherwise, reap_alien() may hit a BUG when it touches the newly added node.
V2: use kzalloc_node() to ensure zeroed memory.
CC: Pekka Enberg <[email protected]>
Acked-by: Andi Kleen <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Reviewed-by: Matt Mackall <[email protected]>
Signed-off-by: Haicheng Li <[email protected]>
---
mm/slab.c | 8 +++-----
1 files changed, 3 insertions(+), 5 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index 7dfa481..000e9ed 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -966,18 +966,16 @@ static void *alternate_node_alloc(struct kmem_cache *, gfp_t);
static struct array_cache **alloc_alien_cache(int node, int limit, gfp_t gfp)
{
struct array_cache **ac_ptr;
- int memsize = sizeof(void *) * nr_node_ids;
+ int memsize = sizeof(void *) * MAX_NUMNODES;
int i;
if (limit > 1)
limit = 12;
- ac_ptr = kmalloc_node(memsize, gfp, node);
+ ac_ptr = kzalloc_node(memsize, gfp, node);
if (ac_ptr) {
for_each_node(i) {
- if (i == node || !node_online(i)) {
- ac_ptr[i] = NULL;
+ if (i == node || !node_online(i))
continue;
- }
ac_ptr[i] = alloc_arraycache(node, limit, 0xbaadf00d, gfp);
if (!ac_ptr[i]) {
for (i--; i >= 0; i--)
--
1.5.3.8
On 23/12/2009 07:52, Haicheng Li wrote:
> Christoph & Matt,
>
> Thanks for the review. Node ids beyond nr_node_ids could be used in the
> case of
> memory hotadding.
>
> Let me explain here:
> Firstly, original nr_node_ids = 1 + nid of highest POSSIBLE node.
>
> Secondly, consider hotplug-adding the memories that are on a new_added
> node:
> 1. when acpi event is triggered:
> acpi_memory_device_add() -> acpi_memory_enable_device() -> add_memory()
> -> node_set_online()
>
> The node_state[N_ONLINE] is updated with this new node added.
> And the id of this new node is beyond nr_node_ids.
>
Then this is a violation of the first statement:
nr_node_ids = 1 + nid of highest POSSIBLE node.
If your system allows hotplugging of new nodes, then POSSIBLE nodes should include them
at boot time.
Same thing for cpus and nr_cpu_ids. If a cpu is added, then its id MUST be < nr_cpu_ids.
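The invariant being argued here can be sketched in userspace C as below. It is a model of "nr_node_ids = 1 + nid of highest POSSIBLE node" over a plain boolean array; sketch_nr_node_ids() is a hypothetical helper, not the kernel's setup code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Compute 1 + the highest node id marked possible.
 * 'possible' is a hypothetical stand-in for node_possible_map,
 * with one flag per node id and 'max_nodes' entries in total.
 */
static unsigned int sketch_nr_node_ids(const bool *possible, size_t max_nodes)
{
	unsigned int highest = 0;

	assert(possible != NULL);
	for (size_t node = 0; node < max_nodes; node++)
		if (possible[node])
			highest = (unsigned int)node;
	return highest + 1;
}
```

If firmware reports only the two populated nodes as possible, the result is 2 and a later hot-added node 3 falls outside it. If node 3 is declared possible at boot, the result is 4 and arrays sized by it already have a slot for that node.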
> Then, this is a violation of the first statement :
>
> nr_node_ids = 1 + nid of highest POSSIBLE node.
>
> If your system allows hotplugging of new nodes, then POSSIBLE nodes should include them
> at boot time.
Agreed, nr_node_ids must be possible nodes.
It should have been set up by the SRAT parser (modulo regressions)
Haicheng, did you verify with printk it's really incorrect at this point?
-Andi
--
[email protected] -- Speaking for myself only.
Andi Kleen wrote:
>> Then, this is a violation of the first statement :
>>
>> nr_node_ids = 1 + nid of highest POSSIBLE node.
>>
>> If your system allows hotplugging of new nodes, then POSSIBLE nodes should include them
>> at boot time.
>
> Agreed, nr_node_ids must be possible nodes.
>
> It should have been set up by the SRAT parser (modulo regressions)
>
> Haicheng, did you verify with printk it's really incorrect at this point?
Yup. See below debug patch & Oops info.
If the SRAT parser is supposed to detect every possible node (even a node whose cpu+mem
is not populated on the motherboard), then this is an ACPI parser issue or a BIOS issue
rather than a slab issue. In that case, this patch would only be a workaround for buggy
system boards, and we would need to look into the ACPI SRAT parser code as well :).
---
diff --git a/mm/slab.c b/mm/slab.c
index 7dfa481..3a4e1f4 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1032,6 +1032,9 @@ static void reap_alien(struct kmem_cache *cachep, struct kmem_list3 *l3)
if (l3->alien) {
struct array_cache *ac = l3->alien[node];
+ if (node >= nr_node_ids)
+ printk("node=%d, nr_node_ids=%d, ac=%p\n",
+ node, nr_node_ids, ac);
if (ac && ac->avail && spin_trylock_irq(&ac->lock)) {
__drain_alien_cache(cachep, ac, node);
spin_unlock_irq(&ac->lock);
---
[ 151.732864] node=3, nr_node_ids=2, ac=(null)
[ 151.732873] node=3, nr_node_ids=2, ac=(null)
[ 151.732882] node=3, nr_node_ids=2, ac=(null)
[ 151.732889] node=3, nr_node_ids=2, ac=(null)
[ 151.732897] node=3, nr_node_ids=2, ac=000000004b31f78f
[ 151.732941] BUG: unable to handle kernel paging request at 000000004b31f78f
[ 151.741026] IP: [<ffffffff810bd460>] cache_reap+0x8d/0x252
[ 151.747363] PGD 0
[ 151.749793] Oops: 0000 [#1] SMP
[ 151.753658] last sysfs file: /sys/kernel/kexec_crash_loaded
[ 151.759990] CPU
[ 151.762509] Modules linked in: ipv6 autofs4 rfcomm l2cap crc16 bluetooth rfkill
binfmt_misc dm_mirror dm_region_hash dm_log dm_multipath dm_mod video output sbs sbshc fan
battery ac parport_pc lp parport joydev usbhid sr_mod cdrom processor thermal thermal_sys
container button rtc_cmos rtc_core rtc_lib i2c_i801 i2c_core pcspkr uhci_hcd ohci_hcd
ehci_hcd usbcore
[ 151.802035] Pid: 120, comm: events/21 Not tainted 2.6.32-haicheng-cpuhp #34 Server
[ 151.810911] RIP: 0010:[<ffffffff810bd460>] [<ffffffff810bd460>] cache_reap+0x8d/0x252
[ 151.815485] RSP: 0018:ffff88027e81ddf0 EFLAGS: 00010202
[ 151.815491] RAX: 000000000000003d RBX: 000000004b31f78f RCX: 0000000000000000
[ 151.815496] RDX: ffff88027f3f5040 RSI: 0000000000000001 RDI: 0000000000000286
[ 151.815503] RBP: ffff88027e81de30 R08: 0000000000000002 R09: ffffffff8105ee06
[ 151.815507] R10: ffff88027e81dbe0 R11: ffffffff81066722 R12: ffff88047f223080
[ 151.815513] R13: ffff88047dd201c0 R14: 0000000000000003 R15: fffffffff00000c6
[ 151.815518] FS: 0000000000000000(0000) GS:ffff88028b540000(0000) knlGS:0000000000000000
[ 151.815524] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 151.815528] CR2: 000000004b31f78f CR3: 0000000001001000 CR4: 00000000000006e0
[ 151.815533] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 151.815538] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 151.815547] Process events/21 (pid: 120, threadinfo ffff88027e81c000, task
ffff88027f3f5040)
[ 151.815550] Stack:
[ 151.817896] ffff88027e81de30 ffff88028b5517c0 0000000100000000 ffff88027e81de80
[ 151.817908] <0> ffff88028b556d80 ffff88028b5517c0 ffffffff810bd3d3 fffffffff00000c6
[ 151.817918] <0> ffff88027e81dec0 ffffffff81058d0c ffffffff81058cb6 ffffffff81376fea
[ 151.817928] Call Trace:
[ 151.820772] [<ffffffff810bd3d3>] ? cache_reap+0x0/0x252
[ 151.820786] [<ffffffff81058d0c>] worker_thread+0x17a/0x27b
[ 151.820793] [<ffffffff81058cb6>] ? worker_thread+0x124/0x27b
[ 151.820806] [<ffffffff81376fea>] ? thread_return+0x3e/0xee
[ 151.820816] [<ffffffff8105bbbc>] ? autoremove_wake_function+0x0/0x38
[ 151.820827] [<ffffffff81058b92>] ? worker_thread+0x0/0x27b
[ 151.820833] [<ffffffff8105babe>] kthread+0x7d/0x87
[ 151.820848] [<ffffffff81012daa>] child_rip+0xa/0x20
[ 151.820857] [<ffffffff81012710>] ? restore_args+0x0/0x30
[ 151.820863] [<ffffffff8105ba41>] ? kthread+0x0/0x87
[ 151.820874] [<ffffffff81012da0>] ? child_rip+0x0/0x20
[ 151.820879] Code: 77 48 63 c6 41 89 f6 48 8b 1c c2 8b 15 be 28 6e 00 39 d6 7c 11 48 89
d9 48 c7 c7 97 98 4c 81 31 c0 e8 23 bf f8 ff 48 85 db 74 4d <83> 3b 00 74 48 48 83 3d 83
ab 66 00 00 75 04 0f 0b eb fe fa 66
[ 151.845235] RIP [<ffffffff810bd460>] cache_reap+0x8d/0x252
[ 151.845255] RSP <ffff88027e81ddf0>
[ 151.845260] CR2: 000000004b31f78f
[ 151.845415] ---[ end trace be6e21fde5d02b06 ]---
On Wed, 23 Dec 2009, Haicheng Li wrote:
> > It should have been set up by the SRAT parser (modulo regressions)
> >
> > Haicheng, did you verify with printk it's really incorrect at this point?
>
> Yup. See below debug patch & Oops info.
>
> If we can make sure that SRAT parser must be able to detect out all possible
> node (even the node, cpu+mem, is not populated on the motherboard), it would
> be ACPI Parser issue or BIOS issue rather than a slab issue. In such case, I
> think this patch might become a workaround for buggy system board; and we
> might need to look into ACPI SRAT parser code as well:).
Right. Let's fix the SRAT / ACPI issue. Code elsewhere also dimensions
arrays to nr_node_ids.
On Wed, 23 Dec 2009, Haicheng Li wrote:
> @@ -966,18 +966,16 @@ static void *alternate_node_alloc(struct kmem_cache *,
> gfp_t);
> static struct array_cache **alloc_alien_cache(int node, int limit, gfp_t gfp)
> {
> struct array_cache **ac_ptr;
> - int memsize = sizeof(void *) * nr_node_ids;
> + int memsize = sizeof(void *) * MAX_NUMNODES;
> int i;
Remove this change and I will ack the patch.