There is a race between the assignment of `table->data` and the write
through that pointer in __do_proc_doulongvec_minmax(). Fix this by
duplicating the `table` and only updating the duplicate. Also introduce
a helper, proc_hugetlb_doulongvec_minmax(), to simplify the code.
The following oops was seen:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
Code: Bad RIP value.
...
Call Trace:
? set_max_huge_pages+0x3da/0x4f0
? alloc_pool_huge_page+0x150/0x150
? proc_doulongvec_minmax+0x46/0x60
? hugetlb_sysctl_handler_common+0x1c7/0x200
? nr_hugepages_store+0x20/0x20
? copy_fd_bitmaps+0x170/0x170
? hugetlb_sysctl_handler+0x1e/0x20
? proc_sys_call_handler+0x2f1/0x300
? unregister_sysctl_table+0xb0/0xb0
? __fd_install+0x78/0x100
? proc_sys_write+0x14/0x20
? __vfs_write+0x4d/0x90
? vfs_write+0xef/0x240
? ksys_write+0xc0/0x160
? __ia32_sys_read+0x50/0x50
? __close_fd+0x129/0x150
? __x64_sys_write+0x43/0x50
? do_syscall_64+0x6c/0x200
? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
Signed-off-by: Muchun Song <[email protected]>
---
mm/hugetlb.c | 27 +++++++++++++++++++++------
1 file changed, 21 insertions(+), 6 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a301c2d672bf..818d6125af49 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3454,6 +3454,23 @@ static unsigned int allowed_mems_nr(struct hstate *h)
}
#ifdef CONFIG_SYSCTL
+static int proc_hugetlb_doulongvec_minmax(struct ctl_table *table, int write,
+					  void *buffer, size_t *length,
+					  loff_t *ppos, unsigned long *out)
+{
+	struct ctl_table dup_table;
+
+	/*
+	 * In order to avoid races with __do_proc_doulongvec_minmax(), we
+	 * can duplicate the @table and alter the duplicate of it.
+	 */
+	dup_table = *table;
+	dup_table.data = out;
+	dup_table.maxlen = sizeof(unsigned long);
+
+	return proc_doulongvec_minmax(&dup_table, write, buffer, length, ppos);
+}
+
 static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
 			 struct ctl_table *table, int write,
 			 void *buffer, size_t *length, loff_t *ppos)
@@ -3465,9 +3482,8 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
 	if (!hugepages_supported())
 		return -EOPNOTSUPP;
-	table->data = &tmp;
-	table->maxlen = sizeof(unsigned long);
-	ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
+	ret = proc_hugetlb_doulongvec_minmax(table, write, buffer, length, ppos,
+					     &tmp);
 	if (ret)
 		goto out;
@@ -3510,9 +3526,8 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
 	if (write && hstate_is_gigantic(h))
 		return -EINVAL;
-	table->data = &tmp;
-	table->maxlen = sizeof(unsigned long);
-	ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
+	ret = proc_hugetlb_doulongvec_minmax(table, write, buffer, length, ppos,
+					     &tmp);
 	if (ret)
 		goto out;
--
2.11.0
On Sat, 22 Aug 2020 17:53:28 +0800 Muchun Song <[email protected]> wrote:
> There is a race between the assignment of `table->data` and the write
> through that pointer in __do_proc_doulongvec_minmax().
Where does __do_proc_doulongvec_minmax() write to table->data?
I think you're saying that there is a race between the assignment of
ctl_table->data in hugetlb_sysctl_handler_common() and the assignment
of the same ctl_table->data in hugetlb_overcommit_handler()?
Or not, maybe I'm being thick. Can you please describe the race more
carefully and completely?
> Fix this by duplicating the `table` and only updating the duplicate.
> Also introduce a helper, proc_hugetlb_doulongvec_minmax(), to simplify
> the code.
>
On 8/24/20 1:59 PM, Andrew Morton wrote:
> On Sat, 22 Aug 2020 17:53:28 +0800 Muchun Song <[email protected]> wrote:
>
>> There is a race between the assignment of `table->data` and the write
>> through that pointer in __do_proc_doulongvec_minmax().
>
> Where does __do_proc_doulongvec_minmax() write to table->data?
>
> I think you're saying that there is a race between the assignment of
> ctl_table->data in hugetlb_sysctl_handler_common() and the assignment
> of the same ctl_table->data in hugetlb_overcommit_handler()?
>
> Or not, maybe I'm being thick. Can you please describe the race more
> carefully and completely?
>
I too am looking at this now and do not completely understand the race.
It could be that:
hugetlb_sysctl_handler_common
...
table->data = &tmp;
and, do_proc_doulongvec_minmax()
...
return __do_proc_doulongvec_minmax(table->data, table, write, ...
with __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, ...
...
i = (unsigned long *) data;
...
*i = val;
So, __do_proc_doulongvec_minmax can be dereferencing and writing to the pointer
in one thread when hugetlb_sysctl_handler_common is setting it in another?
Another confusing part of the message is the stack trace which includes
...
? set_max_huge_pages+0x3da/0x4f0
? alloc_pool_huge_page+0x150/0x150
which are 'downstream' from these routines. I don't understand why these
are in the trace.
If the race is with the pointer set and dereference/write, then this type of
fix is OK. However, if you really have two 'sysadmin type' global operations
racing then one or both are not going to get what they expected. Instead of
changing the code to 'handle the race', I think it might be acceptable to just
put a big semaphore around it.
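For illustration only, the sort of thing I have in mind is the untested
sketch below (hugetlb_sysctl_lock is a made-up name, and the overcommit
handler would need the same treatment):

static DEFINE_MUTEX(hugetlb_sysctl_lock);	/* hypothetical new lock */

int hugetlb_sysctl_handler(struct ctl_table *table, int write,
			   void *buffer, size_t *length, loff_t *ppos)
{
	int ret;

	/*
	 * Serialize the handlers so table->data cannot be reassigned
	 * while another writer is still inside proc_doulongvec_minmax().
	 */
	mutex_lock(&hugetlb_sysctl_lock);
	ret = hugetlb_sysctl_handler_common(false, table, write,
					    buffer, length, ppos);
	mutex_unlock(&hugetlb_sysctl_lock);

	return ret;
}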
--
Mike Kravetz
Hi Andrew and Mike,
On Sat, Aug 22, 2020 at 5:53 PM Muchun Song <[email protected]> wrote:
>
> There is a race between the assignment of `table->data` and the write
> through that pointer in __do_proc_doulongvec_minmax(). Fix this by
> duplicating the `table` and only updating the duplicate. Also introduce
> a helper, proc_hugetlb_doulongvec_minmax(), to simplify the code.
I am sorry, I didn't explain in enough detail how the race happens.
CPU0:                                        CPU1:
                                             proc_sys_write
hugetlb_sysctl_handler                         proc_sys_call_handler
  hugetlb_sysctl_handler_common                  hugetlb_sysctl_handler
    table->data = &tmp;                            hugetlb_sysctl_handler_common
                                                     table->data = &tmp;
    proc_doulongvec_minmax
      do_proc_doulongvec_minmax              sysctl_head_finish
        __do_proc_doulongvec_minmax
          i = table->data;
          *i = val; // corrupt CPU1 stack
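For illustration only (not kernel code), a userspace analogue of the same
pattern would be something like the sketch below, where shared_data plays
the role of table->data and each thread's local tmp plays the role of the
on-stack tmp (all names are made up):

#include <pthread.h>

static unsigned long *shared_data;		/* plays the role of table->data */

static void *sysctl_writer(void *arg)
{
	unsigned long tmp = 0;			/* plays the role of the local tmp */

	shared_data = &tmp;			/* table->data = &tmp; */
	/* window: the other thread may re-point shared_data here */
	*shared_data = (unsigned long)arg;	/* __do_proc_doulongvec_minmax: *i = val; */

	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, sysctl_writer, (void *)128UL);
	pthread_create(&t2, NULL, sysctl_writer, (void *)0UL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	return 0;
}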
>
> The following oops was seen:
>
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> #PF: supervisor instruction fetch in kernel mode
> #PF: error_code(0x0010) - not-present page
> Code: Bad RIP value.
Here we can see the "Bad RIP value", which shows that the stack frame was
corrupted by another thread.
> ...
> Call Trace:
> ? set_max_huge_pages+0x3da/0x4f0
> ? alloc_pool_huge_page+0x150/0x150
> ? proc_doulongvec_minmax+0x46/0x60
> ? hugetlb_sysctl_handler_common+0x1c7/0x200
> ? nr_hugepages_store+0x20/0x20
> ? copy_fd_bitmaps+0x170/0x170
> ? hugetlb_sysctl_handler+0x1e/0x20
> ? proc_sys_call_handler+0x2f1/0x300
> ? unregister_sysctl_table+0xb0/0xb0
> ? __fd_install+0x78/0x100
> ? proc_sys_write+0x14/0x20
> ? __vfs_write+0x4d/0x90
> ? vfs_write+0xef/0x240
> ? ksys_write+0xc0/0x160
> ? __ia32_sys_read+0x50/0x50
> ? __close_fd+0x129/0x150
> ? __x64_sys_write+0x43/0x50
> ? do_syscall_64+0x6c/0x200
> ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
> Signed-off-by: Muchun Song <[email protected]>
> ---
> mm/hugetlb.c | 27 +++++++++++++++++++++------
> 1 file changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a301c2d672bf..818d6125af49 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3454,6 +3454,23 @@ static unsigned int allowed_mems_nr(struct hstate *h)
> }
>
> #ifdef CONFIG_SYSCTL
> +static int proc_hugetlb_doulongvec_minmax(struct ctl_table *table, int write,
> + void *buffer, size_t *length,
> + loff_t *ppos, unsigned long *out)
> +{
> + struct ctl_table dup_table;
> +
> + /*
> + * In order to avoid races with __do_proc_doulongvec_minmax(), we
> + * can duplicate the @table and alter the duplicate of it.
> + */
> + dup_table = *table;
> + dup_table.data = out;
> + dup_table.maxlen = sizeof(unsigned long);
> +
> + return proc_doulongvec_minmax(&dup_table, write, buffer, length, ppos);
> +}
> +
> static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
> struct ctl_table *table, int write,
> void *buffer, size_t *length, loff_t *ppos)
> @@ -3465,9 +3482,8 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
> if (!hugepages_supported())
> return -EOPNOTSUPP;
>
> - table->data = &tmp;
> - table->maxlen = sizeof(unsigned long);
> - ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
> + ret = proc_hugetlb_doulongvec_minmax(table, write, buffer, length, ppos,
> + &tmp);
> if (ret)
> goto out;
>
> @@ -3510,9 +3526,8 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
> if (write && hstate_is_gigantic(h))
> return -EINVAL;
>
> - table->data = &tmp;
> - table->maxlen = sizeof(unsigned long);
> - ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
> + ret = proc_hugetlb_doulongvec_minmax(table, write, buffer, length, ppos,
> + &tmp);
> if (ret)
> goto out;
>
> --
> 2.11.0
>
--
Yours,
Muchun
On Tue, Aug 25, 2020 at 5:21 AM Mike Kravetz <[email protected]> wrote:
>
> On 8/24/20 1:59 PM, Andrew Morton wrote:
> > On Sat, 22 Aug 2020 17:53:28 +0800 Muchun Song <[email protected]> wrote:
> >
> >> There is a race between the assignment of `table->data` and the write
> >> through that pointer in __do_proc_doulongvec_minmax().
> >
> > Where does __do_proc_doulongvec_minmax() write to table->data?
> >
> > I think you're saying that there is a race between the assignment of
> > ctl_table->data in hugetlb_sysctl_handler_common() and the assignment
> > of the same ctl_table->data in hugetlb_overcommit_handler()?
> >
> > Or not, maybe I'm being thick. Can you please describe the race more
> > carefully and completely?
> >
>
> I too am looking at this now and do not completely understand the race.
> It could be that:
>
> hugetlb_sysctl_handler_common
> ...
> table->data = &tmp;
>
> and, do_proc_doulongvec_minmax()
> ...
> return __do_proc_doulongvec_minmax(table->data, table, write, ...
> with __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, ...
> ...
> i = (unsigned long *) data;
> ...
> *i = val;
>
> So, __do_proc_doulongvec_minmax can be dereferencing and writing to the pointer
> in one thread when hugetlb_sysctl_handler_common is setting it in another?
Yes, you are right.
>
> Another confusing part of the message is the stack trace which includes
> ...
> ? set_max_huge_pages+0x3da/0x4f0
> ? alloc_pool_huge_page+0x150/0x150
>
> which are 'downstream' from these routines. I don't understand why these
> are in the trace.
I am also confused. But this issue can easily be reproduced by letting more
than one thread write to `/proc/sys/vm/nr_hugepages`. With this patch applied,
the issue can no longer be reproduced.
>
> If the race is with the pointer set and dereference/write, then this type of
> fix is OK. However, if you really have two 'sysadmin type' global operations
> racing then one or both are not going to get what they expected. Instead of
In our team, several developers share one test server. I guess the panic
happens when two people write to `/proc/sys/vm/nr_hugepages` at the same
time.
> changing the code to 'handle the race', I think it might be acceptable to just
> put a big semaphore around it.
That is also a good way to fix this issue. Thanks.
> --
> Mike Kravetz
--
Yours,
Muchun
> Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
I don't think the Fixes line is correct. The original patch
just used a global variable and didn't have this race.
It must have been added later in some other patch that added
hugetlb_sysctl_handler_common.
-Andi
On 8/24/20 8:01 PM, Muchun Song wrote:
> On Tue, Aug 25, 2020 at 5:21 AM Mike Kravetz <[email protected]> wrote:
>>
>> I too am looking at this now and do not completely understand the race.
>> It could be that:
>>
>> hugetlb_sysctl_handler_common
>> ...
>> table->data = &tmp;
>>
>> and, do_proc_doulongvec_minmax()
>> ...
>> return __do_proc_doulongvec_minmax(table->data, table, write, ...
>> with __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, ...
>> ...
>> i = (unsigned long *) data;
>> ...
>> *i = val;
>>
>> So, __do_proc_doulongvec_minmax can be dereferencing and writing to the pointer
>> in one thread when hugetlb_sysctl_handler_common is setting it in another?
>
> Yes, you are right.
>
>>
>> Another confusing part of the message is the stack trace which includes
>> ...
>> ? set_max_huge_pages+0x3da/0x4f0
>> ? alloc_pool_huge_page+0x150/0x150
>>
>> which are 'downstream' from these routines. I don't understand why these
>> are in the trace.
>
> I am also confused. But this issue can be reproduced easily by letting more
> than one thread write to `/proc/sys/vm/nr_hugepages`. With this patch applied,
> the issue can not be reproduced and disappears.
There certainly is an issue here, as one thread can modify data in another
thread's stack. However, I am having a hard time seeing what causes the
'kernel NULL pointer dereference'.
I tried to reproduce the issue myself but was unsuccessful. I have 16 threads
writing to /proc/sys/vm/nr_hugepages in an infinite loop. After several hours
of running, I did not hit the issue. Just curious, what architecture is the
system? Any special config or compiler options?
If you can easily reproduce, can you post the detailed oops message?
The 'NULL pointer' seems strange because after the first assignment to
table->data the value should never be NULL. Certainly it can be modified
by another thread, but I can not see how it can be NULL. At the beginning
of __do_proc_doulongvec_minmax, there is a check for NULL pointer with:
	if (!data || !table->maxlen || !*lenp || (*ppos && !write)) {
		*lenp = 0;
		return 0;
	}
I looked at the code my compiler produced for __do_proc_doulongvec_minmax.
It appears to use the same value/register for the pointer throughout the
routine. IOW, I do not see how the pointer can be NULL for the assignment
when the routine does:
*i = val;
Again, your analysis/patch points out a real issue. I just want to get
a better understanding to make sure there is not another issue causing
the NULL pointer dereference.
--
Mike Kravetz
Hi Andi,
On Tue, Aug 25, 2020 at 11:34 PM Andi Kleen <[email protected]> wrote:
>
> > Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
>
> I don't think the Fixes line is correct. The original patch
> just used a global variable and didn't have this race.
> It must have been added later in some other patch that added
> hugetlb_sysctl_handler_common.
I don't agree with you. Commit e5ff215941d5 does not use a global variable
here; it already points table->data at a local variable. Below is the
relevant code snippet from that patch. Thanks.
@@ -1037,8 +1109,19 @@ int hugetlb_sysctl_handler(struct ctl_table *table, int write,
 			   struct file *file, void __user *buffer,
 			   size_t *length, loff_t *ppos)
 {
+	struct hstate *h = &default_hstate;
+	unsigned long tmp;
Here `tmp` is a local variable on the stack.
+
+	if (!write)
+		tmp = h->max_huge_pages;
+
+	table->data = &tmp;
+	table->maxlen = sizeof(unsigned long);
 	proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
-	max_huge_pages = set_max_huge_pages(max_huge_pages);
+
+	if (write)
+		h->max_huge_pages = set_max_huge_pages(h, tmp);
+
 	return 0;
 }
>
> -Andi
--
Yours,
Muchun
On Wed, Aug 26, 2020 at 8:03 AM Mike Kravetz <[email protected]> wrote:
>
> On 8/24/20 8:01 PM, Muchun Song wrote:
> > On Tue, Aug 25, 2020 at 5:21 AM Mike Kravetz <[email protected]> wrote:
> >>
> >> I too am looking at this now and do not completely understand the race.
> >> It could be that:
> >>
> >> hugetlb_sysctl_handler_common
> >> ...
> >> table->data = &tmp;
> >>
> >> and, do_proc_doulongvec_minmax()
> >> ...
> >> return __do_proc_doulongvec_minmax(table->data, table, write, ...
> >> with __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, ...
> >> ...
> >> i = (unsigned long *) data;
> >> ...
> >> *i = val;
> >>
> >> So, __do_proc_doulongvec_minmax can be dereferencing and writing to the pointer
> >> in one thread when hugetlb_sysctl_handler_common is setting it in another?
> >
> > Yes, you are right.
> >
> >>
> >> Another confusing part of the message is the stack trace which includes
> >> ...
> >> ? set_max_huge_pages+0x3da/0x4f0
> >> ? alloc_pool_huge_page+0x150/0x150
> >>
> >> which are 'downstream' from these routines. I don't understand why these
> >> are in the trace.
> >
> > I am also confused. But this issue can be reproduced easily by letting more
> > than one thread write to `/proc/sys/vm/nr_hugepages`. With this patch applied,
> > the issue can not be reproduced and disappears.
>
> There certainly is an issue here as one thread can modify data in another.
> However, I am having a hard time seeing what causes the 'kernel NULL pointer
> dereference'.
If you write 0 to '/proc/sys/vm/nr_hugepages', you will get:
  kernel NULL pointer dereference, address: 0000000000000000
If you write 1024 to '/proc/sys/vm/nr_hugepages', you will get:
  kernel NULL pointer dereference, address: 0000000000000400
The faulting address is exactly the value written to
'/proc/sys/vm/nr_hugepages': the stray write lands on the other task's
stack, and the written value is later dereferenced there.
>
> I tried to reproduce the issue myself but was unsuccessful. I have 16 threads
> writing to /proc/sys/vm/nr_hugepages in an infinite loop. After several hours
> running, I did not hit the issue. Just curious, what architecture is the
> system? any special config or compiler options?
>
> If you can easily reproduce, can you post the detailed oops message?
>
> The 'NULL pointer' seems strange because after the first assignment to
> table->data the value should never be NULL. Certainly it can be modified
> by another thread, but I can not see how it can be NULL. At the beginning
> of __do_proc_doulongvec_minmax, there is a check for NULL pointer with:
CPU0:                                        CPU1:
                                             proc_sys_write
hugetlb_sysctl_handler                         proc_sys_call_handler
  hugetlb_sysctl_handler_common                  hugetlb_sysctl_handler
    table->data = &tmp;                            hugetlb_sysctl_handler_common
                                                     table->data = &tmp;
    proc_doulongvec_minmax
      do_proc_doulongvec_minmax              sysctl_head_finish
        __do_proc_doulongvec_minmax
          i = table->data;
          *i = val; // corrupt CPU1 stack
If the val written is 0, the clobbered pointer is NULL, hence the NULL
pointer dereference.
>
>	if (!data || !table->maxlen || !*lenp || (*ppos && !write)) {
>		*lenp = 0;
>		return 0;
>	}
>
> I looked at the code my compiler produced for __do_proc_doulongvec_minmax.
> It appears to use the same value/register for the pointer throughout the
> routine. IOW, I do not see how the pointer can be NULL for the assignment
> when the routine does:
>
> *i = val;
>
> Again, your analysis/patch points out a real issue. I just want to get
> a better understanding to make sure there is not another issue causing
> the NULL pointer dereference.
Below is my test script. Eight threads run this script in parallel. In my
QEMU guest, it panics easily. Thanks.
#!/bin/sh
while :
do
	echo 128 > /proc/sys/vm/nr_hugepages
	echo 0 > /proc/sys/vm/nr_hugepages
done
> --
> Mike Kravetz
--
Yours,
Muchun
On 8/25/20 7:47 PM, Muchun Song wrote:
>
> CPU0:                                        CPU1:
>                                              proc_sys_write
> hugetlb_sysctl_handler                         proc_sys_call_handler
>   hugetlb_sysctl_handler_common                  hugetlb_sysctl_handler
>     table->data = &tmp;                            hugetlb_sysctl_handler_common
>                                                      table->data = &tmp;
>     proc_doulongvec_minmax
>       do_proc_doulongvec_minmax              sysctl_head_finish
>         __do_proc_doulongvec_minmax
>           i = table->data;
>           *i = val; // corrupt CPU1 stack
Thanks Muchun!
Can you please add this to the commit message?
Also, when looking closer at the patch I do not think setting table->maxlen
is necessary in these routines. maxlen is set when the hugetlb ctl_table
entries are defined and initialized. This is not something you introduced.
The unnecessary assignments are in the existing code. However, there is no
need to carry them forward.
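(For reference, the nr_hugepages entry in kernel/sysctl.c is already defined
roughly like this -- quoted from memory, so treat it as illustrative rather
than exact:)

	{
		.procname	= "nr_hugepages",
		.data		= NULL,
		.maxlen		= sizeof(unsigned long),
		.mode		= 0644,
		.proc_handler	= hugetlb_sysctl_handler,
	},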
--
Mike Kravetz
On Fri, Aug 28, 2020 at 5:51 AM Mike Kravetz <[email protected]> wrote:
>
> On 8/25/20 7:47 PM, Muchun Song wrote:
> >
> > CPU0:                                        CPU1:
> >                                              proc_sys_write
> > hugetlb_sysctl_handler                         proc_sys_call_handler
> >   hugetlb_sysctl_handler_common                  hugetlb_sysctl_handler
> >     table->data = &tmp;                            hugetlb_sysctl_handler_common
> >                                                      table->data = &tmp;
> >     proc_doulongvec_minmax
> >       do_proc_doulongvec_minmax              sysctl_head_finish
> >         __do_proc_doulongvec_minmax
> >           i = table->data;
> >           *i = val; // corrupt CPU1 stack
>
> Thanks Muchun!
> Can you please add this to the commit message.
OK, I will do that. Thanks.
>
> Also, when looking closer at the patch I do not think setting table->maxlen
> is necessary in these routines. maxlen is set when the hugetlb ctl_table
> entries are defined and initialized. This is not something you introduced.
> The unnecessary assignments are in the existing code. However, there is no
> need to carry them forward.
Yeah, I agree with you. I will remove the unnecessary assignment of
table->maxlen.
>
> --
> Mike Kravetz
--
Yours,
Muchun