There are some memory leaks due to a missing put_task_struct().
Fixes: 7f26482a872c ("locking/percpu-rwsem: Remove the embedded rwsem")
Signed-off-by: Qian Cai <[email protected]>
---
kernel/locking/percpu-rwsem.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index a008a1ba21a7..6f487e5d923f 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -123,8 +123,10 @@ static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry,
struct percpu_rw_semaphore *sem = key;
/* concurrent against percpu_down_write(), can get stolen */
- if (!__percpu_rwsem_trylock(sem, reader))
+ if (!__percpu_rwsem_trylock(sem, reader)) {
+ put_task_struct(p);
return 1;
+ }
list_del_init(&wq_entry->entry);
smp_store_release(&wq_entry->private, NULL);
--
2.21.0 (Apple Git-122.2)
On Thu, Mar 26, 2020 at 11:10:57PM -0400, Qian Cai wrote:
> There are some memory leaks due to a missing put_task_struct().
This is an absolutely inadequate changelog. There is no explanation of
what the actual race is or why this patch is correct.
> Fixes: 7f26482a872c ("locking/percpu-rwsem: Remove the embedded rwsem")
> Signed-off-by: Qian Cai <[email protected]>
> ---
> kernel/locking/percpu-rwsem.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
> index a008a1ba21a7..6f487e5d923f 100644
> --- a/kernel/locking/percpu-rwsem.c
> +++ b/kernel/locking/percpu-rwsem.c
> @@ -123,8 +123,10 @@ static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry,
> struct percpu_rw_semaphore *sem = key;
>
> /* concurrent against percpu_down_write(), can get stolen */
> - if (!__percpu_rwsem_trylock(sem, reader))
> + if (!__percpu_rwsem_trylock(sem, reader)) {
> + put_task_struct(p);
> return 1;
> + }
If the trylock fails, someone else got the lock and we remain on the
waitqueue. It seems like a very bad idea to put the task while it
remains on the waitqueue, no?
>
> list_del_init(&wq_entry->entry);
> smp_store_release(&wq_entry->private, NULL);
> --
> 2.21.0 (Apple Git-122.2)
>
> On Mar 27, 2020, at 5:37 AM, Peter Zijlstra <[email protected]> wrote:
>
> If the trylock fails, someone else got the lock and we remain on the
> waitqueue. It seems like a very bad idea to put the task while it
> remains on the waitqueue, no?
Interesting, I thought this was more straightforward to see, but I may be wrong as always. At the beginning of percpu_rwsem_wake_function() it calls get_task_struct(), but if the trylock fails, the entry remains on the waitqueue. The next wakeup will then run percpu_rwsem_wake_function() again, calling get_task_struct() once more and increasing the refcount. Can you enlighten me as to where put_task_struct() is called, on the waitqueue path or elsewhere, to balance the refcount in this case?
> On Mar 27, 2020, at 6:19 AM, Qian Cai <[email protected]> wrote:
>
>
>
>> On Mar 27, 2020, at 5:37 AM, Peter Zijlstra <[email protected]> wrote:
>>
>> If the trylock fails, someone else got the lock and we remain on the
>> waitqueue. It seems like a very bad idea to put the task while it
>> remains on the waitqueue, no?
>
> Interesting, I thought this was more straightforward to see, but I may be wrong as always. At the beginning of percpu_rwsem_wake_function() it calls get_task_struct(), but if the trylock failed, it will remain in the waitqueue. However, it will run percpu_rwsem_wake_function() again with get_task_struct() to increase the refcount. Can you enlighten me where it will call put_task_struct() in waitqueue or elsewhere to balance the refcount in this case?
I am pretty confident that the linux-next commit
7f26482a872c ("locking/percpu-rwsem: Remove the embedded rwsem")
introduced memory leaks.
I put a debugging patch here:
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index a008a1ba21a7..857602ef54f1 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -123,8 +123,10 @@ static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry,
struct percpu_rw_semaphore *sem = key;
/* concurrent against percpu_down_write(), can get stolen */
- if (!__percpu_rwsem_trylock(sem, reader))
+ if (!__percpu_rwsem_trylock(sem, reader)) {
+ printk("KK __percpu_rwsem_trylock\n");
return 1;
+ }
list_del_init(&wq_entry->entry);
smp_store_release(&wq_entry->private, NULL);
Once those printk()s triggered, it ended up with task_struct leaks:
unreferenced object 0xc000200df1422280 (size 8192):
comm "read_all", pid 12975, jiffies 4297309144 (age 5351.480s)
hex dump (first 32 bytes):
02 00 00 00 00 00 00 00 10 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<00000000f5c5fa2d>] copy_process+0x26c/0x1920
[<0000000099229290>] _do_fork+0xac/0xb20
[<00000000d40a7825>] __do_sys_clone+0x98/0xe0
[<00000000c7cd06a4>] ppc_clone+0x8/0xc
unreferenced object 0xc00020047ef8eb80 (size 120):
comm "read_all", pid 12975, jiffies 4297309144 (age 5351.480s)
hex dump (first 32 bytes):
02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<000000004def8a44>] prepare_creds+0x38/0x110
[<0000000037a68116>] copy_creds+0xbc/0x1d0
[<0000000016b7471c>] copy_process+0x454/0x1920
[<0000000099229290>] _do_fork+0xac/0xb20
[<00000000d40a7825>] __do_sys_clone+0x98/0xe0
[<00000000c7cd06a4>] ppc_clone+0x8/0xc
unreferenced object 0xc000200d96f80800 (size 1384):
comm "read_all", pid 12975, jiffies 4297309144 (age 5351.480s)
hex dump (first 32 bytes):
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
10 08 f8 96 0d 20 00 c0 10 08 f8 96 0d 20 00 c0 ..... ....... ..
backtrace:
[<000000008894d13b>] copy_process+0xa40/0x1920
[<0000000099229290>] _do_fork+0xac/0xb20
[<00000000d40a7825>] __do_sys_clone+0x98/0xe0
[<00000000c7cd06a4>] ppc_clone+0x8/0xc
unreferenced object 0xc000001e91ba4000 (size 16384):
comm "read_all", pid 12982, jiffies 4297309462 (age 5348.300s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 08 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<000000009689397b>] kzalloc.constprop.48+0x1c/0x30
[<000000001753eb18>] task_numa_fault+0xac8/0x1260
[<0000000047bb80b1>] __handle_mm_fault+0x12cc/0x1b00
[<00000000c0a4c8ba>] handle_mm_fault+0x298/0x450
[<000000003465b20d>] __do_page_fault+0x2b8/0xf90
[<000000005037fec9>] handle_page_fault+0x10/0x30
unreferenced object 0xc0002015fe4aaa80 (size 8192):
comm "read_all", pid 13157, jiffies 4297353979 (age 4903.130s)
hex dump (first 32 bytes):
02 00 00 00 00 00 00 00 10 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<00000000f5c5fa2d>] copy_process+0x26c/0x1920
[<0000000099229290>] _do_fork+0xac/0xb20
[<00000000d40a7825>] __do_sys_clone+0x98/0xe0
[<00000000c7cd06a4>] ppc_clone+0x8/0xc
unreferenced object 0xc00020047ef8f080 (size 120):
comm "read_all", pid 13157, jiffies 4297353979 (age 4903.130s)
hex dump (first 32 bytes):
02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<000000004def8a44>] prepare_creds+0x38/0x110
[<0000000037a68116>] copy_creds+0xbc/0x1d0
[<0000000016b7471c>] copy_process+0x454/0x1920
[<0000000099229290>] _do_fork+0xac/0xb20
[<00000000d40a7825>] __do_sys_clone+0x98/0xe0
[<00000000c7cd06a4>] ppc_clone+0x8/0xc
unreferenced object 0xc0002012a9388f00 (size 1384):
comm "read_all", pid 13157, jiffies 4297353979 (age 4903.130s)
hex dump (first 32 bytes):
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
10 8f 38 a9 12 20 00 c0 10 8f 38 a9 12 20 00 c0 ..8.. ....8.. ..
backtrace:
[<000000008894d13b>] copy_process+0xa40/0x1920
[<0000000099229290>] _do_fork+0xac/0xb20
[<00000000d40a7825>] __do_sys_clone+0x98/0xe0
[<00000000c7cd06a4>] ppc_clone+0x8/0xc
unreferenced object 0xc000001c86704000 (size 16384):
comm "read_all", pid 13164, jiffies 4297354081 (age 4902.110s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 08 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<000000009689397b>] kzalloc.constprop.48+0x1c/0x30
[<000000001753eb18>] task_numa_fault+0xac8/0x1260
[<0000000047bb80b1>] __handle_mm_fault+0x12cc/0x1b00
[<00000000c0a4c8ba>] handle_mm_fault+0x298/0x450
[<000000003465b20d>] __do_page_fault+0x2b8/0xf90
[<000000005037fec9>] handle_page_fault+0x10/0x30
On Fri, Mar 27, 2020 at 06:19:37AM -0400, Qian Cai wrote:
>
>
> > On Mar 27, 2020, at 5:37 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > If the trylock fails, someone else got the lock and we remain on the
> > waitqueue. It seems like a very bad idea to put the task while it
> > remains on the waitqueue, no?
>
> Interesting, I thought this was more straightforward to see,
It is indeed as straightforward as you explain; but when doing 10
things at once, and having just dug through some low-level arch assembly
code for the previous email, even obvious things might sometimes need
a little explaining :/
So please, always try and err on the side of being a little verbose when
writing changelogs, esp. when concerning locking / concurrency; you
really can't be clear enough.
> but I may
> be wrong as always. At the beginning of percpu_rwsem_wake_function()
> it calls get_task_struct(), but if the trylock failed, it will remain
> in the waitqueue. However, it will run percpu_rwsem_wake_function()
> again with get_task_struct() to increase the refcount. Can you
> enlighten me where it will call put_task_struct() in waitqueue or
> elsewhere to balance the refcount in this case?
See, had that explanation been part of the changelog, my brain would've
probably been able to kick itself in gear and actually spot the problem.
Yes, you're right.
That said, I wonder if we can just move the get_task_struct() call like
below; after all the race we're guarding against is percpu_rwsem_wait()
observing !private, terminating the wait and doing a quick exit() while
percpu_rwsem_wake_function() then does wake_up_process(p) as a
use-after-free.
Hmm?
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index a008a1ba21a7..8bbafe3e5203 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -118,14 +118,15 @@ static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry,
unsigned int mode, int wake_flags,
void *key)
{
- struct task_struct *p = get_task_struct(wq_entry->private);
bool reader = wq_entry->flags & WQ_FLAG_CUSTOM;
struct percpu_rw_semaphore *sem = key;
+ struct task_struct *p;
/* concurrent against percpu_down_write(), can get stolen */
if (!__percpu_rwsem_trylock(sem, reader))
return 1;
+ p = get_task_struct(wq_entry->private);
list_del_init(&wq_entry->entry);
smp_store_release(&wq_entry->private, NULL);
> On Mar 30, 2020, at 7:18 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Mar 27, 2020 at 06:19:37AM -0400, Qian Cai wrote:
>>
>>
>>> On Mar 27, 2020, at 5:37 AM, Peter Zijlstra <[email protected]> wrote:
>>>
>>> If the trylock fails, someone else got the lock and we remain on the
>>> waitqueue. It seems like a very bad idea to put the task while it
>>> remains on the waitqueue, no?
>>
>> Interesting, I thought this was more straightforward to see,
>
> It is indeed as straightforward as you explain; but when doing 10
> things at once, and having just dug through some low-level arch assembly
> code for the previous email, even obvious things might sometimes need
> a little explaining :/
>
> So please, always try and err on the side of being a little verbose when
> writing changelogs, esp. when concerning locking / concurrency; you
> really can't be clear enough.
>
>> but I may
>> be wrong as always. At the beginning of percpu_rwsem_wake_function()
>> it calls get_task_struct(), but if the trylock failed, it will remain
>> in the waitqueue. However, it will run percpu_rwsem_wake_function()
>> again with get_task_struct() to increase the refcount. Can you
>> enlighten me where it will call put_task_struct() in waitqueue or
>> elsewhere to balance the refcount in this case?
>
> See, had that explanation been part of the changelog, my brain would've
> probably been able to kick itself in gear and actually spot the problem.
>
> Yes, you're right.
>
> That said, I wonder if we can just move the get_task_struct() call like
> below; after all the race we're guarding against is percpu_rwsem_wait()
> observing !private, terminating the wait and doing a quick exit() while
> percpu_rwsem_wake_function() then does wake_up_process(p) as a
> use-after-free.
Looks good to me. If no one has any objection, I'll clean up the commit log
and send out a v2 for it.
>
> Hmm?
>
> diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
> index a008a1ba21a7..8bbafe3e5203 100644
> --- a/kernel/locking/percpu-rwsem.c
> +++ b/kernel/locking/percpu-rwsem.c
> @@ -118,14 +118,15 @@ static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry,
> unsigned int mode, int wake_flags,
> void *key)
> {
> - struct task_struct *p = get_task_struct(wq_entry->private);
> bool reader = wq_entry->flags & WQ_FLAG_CUSTOM;
> struct percpu_rw_semaphore *sem = key;
> + struct task_struct *p;
>
> /* concurrent against percpu_down_write(), can get stolen */
> if (!__percpu_rwsem_trylock(sem, reader))
> return 1;
>
> + p = get_task_struct(wq_entry->private);
> list_del_init(&wq_entry->entry);
> smp_store_release(&wq_entry->private, NULL);
>