LinuxLists.cc - [PATCH] mm,oom_reaper: avoid run queue_oom

2023-11-22 12:47:27

Subject: [PATCH] mm,oom_reaper: avoid run queue_oom_reaper if task is not oom

The function queue_oom_reaper tests and sets tsk->signal->oom_mm->flags.
However, it is necessary to check if 'tsk' is an OOM victim before
executing 'queue_oom_reaper' because the variable may be NULL.

We encountered such an issue, and the log is as follows:
[3701:11_see]Out of memory: Killed process 3154 (system_server)
total-vm:23662044kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB,
UID:1000 pgtables:4056kB oom_score_adj:-900
[3701:11_see][RB/E]rb_sreason_str_set: sreason_str set null_pointer
[3701:11_see][RB/E]rb_sreason_str_set: sreason_str set unknown_addr
[3701:11_see]Unable to handle kernel NULL pointer dereference at virtual
address 0000000000000328
[3701:11_see]user pgtable: 4k pages, 39-bit VAs, pgdp=00000000821de000
[3701:11_see][0000000000000328] pgd=0000000000000000,
p4d=0000000000000000,pud=0000000000000000
[3701:11_see]tracing off
[3701:11_see]Internal error: Oops: 96000005 [#1] PREEMPT SMP
[3701:11_see]Call trace:
[3701:11_see] queue_oom_reaper+0x30/0x170
[3701:11_see] __oom_kill_process+0x590/0x860
[3701:11_see] oom_kill_process+0x140/0x274
[3701:11_see] out_of_memory+0x2f4/0x54c
[3701:11_see] __alloc_pages_slowpath+0x5d8/0xaac
[3701:11_see] __alloc_pages+0x774/0x800
[3701:11_see] wp_page_copy+0xc4/0x116c
[3701:11_see] do_wp_page+0x4bc/0x6fc
[3701:11_see] handle_pte_fault+0x98/0x2a8
[3701:11_see] __handle_mm_fault+0x368/0x700
[3701:11_see] do_handle_mm_fault+0x160/0x2cc
[3701:11_see] do_page_fault+0x3e0/0x818
[3701:11_see] do_mem_abort+0x68/0x17c
[3701:11_see] el0_da+0x3c/0xa0
[3701:11_see] el0t_64_sync_handler+0xc4/0xec
[3701:11_see] el0t_64_sync+0x1b4/0x1b8
[3701:11_see]tracing off

Signed-off-by: Gao Xu <[email protected]>
---
mm/oom_kill.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e6071fde..3754ab4b6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -984,7 +984,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
 	}
 	rcu_read_unlock();

-	if (can_oom_reap)
+	if (can_oom_reap && tsk_is_oom_victim(victim))
 		queue_oom_reaper(victim);

 	mmdrop(mm);
--
2.17.1

2023-11-22 21:47:46

by Andrew Morton

Subject: Re: [PATCH] mm,oom_reaper: avoid run queue_oom_reaper if task is not oom

On Wed, 22 Nov 2023 12:46:44 +0000 gaoxu <[email protected]> wrote:

> The function queue_oom_reaper tests and sets tsk->signal->oom_mm->flags.
> However, it is necessary to check if 'tsk' is an OOM victim before
> executing 'queue_oom_reaper' because the variable may be NULL.
>
> We encountered such an issue, and the log is as follows:
> [3701:11_see]Out of memory: Killed process 3154 (system_server)
> total-vm:23662044kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB,
> UID:1000 pgtables:4056kB oom_score_adj:-900
> [3701:11_see][RB/E]rb_sreason_str_set: sreason_str set null_pointer
> [3701:11_see][RB/E]rb_sreason_str_set: sreason_str set unknown_addr
> [3701:11_see]Unable to handle kernel NULL pointer dereference at virtual
> address 0000000000000328

Well that isn't good. How frequently does this happen and can you
suggest why some quite old code is suddenly causing problems? What is
your workload doing that others' do not do?

> [3701:11_see]user pgtable: 4k pages, 39-bit VAs, pgdp=00000000821de000
> [3701:11_see][0000000000000328] pgd=0000000000000000,
> p4d=0000000000000000,pud=0000000000000000
> [3701:11_see]tracing off
> [3701:11_see]Internal error: Oops: 96000005 [#1] PREEMPT SMP
> [3701:11_see]Call trace:
> [3701:11_see] queue_oom_reaper+0x30/0x170
> [3701:11_see] __oom_kill_process+0x590/0x860
> [3701:11_see] oom_kill_process+0x140/0x274
> [3701:11_see] out_of_memory+0x2f4/0x54c
> [3701:11_see] __alloc_pages_slowpath+0x5d8/0xaac
> [3701:11_see] __alloc_pages+0x774/0x800
> [3701:11_see] wp_page_copy+0xc4/0x116c
> [3701:11_see] do_wp_page+0x4bc/0x6fc
> [3701:11_see] handle_pte_fault+0x98/0x2a8
> [3701:11_see] __handle_mm_fault+0x368/0x700
> [3701:11_see] do_handle_mm_fault+0x160/0x2cc
> [3701:11_see] do_page_fault+0x3e0/0x818
> [3701:11_see] do_mem_abort+0x68/0x17c
> [3701:11_see] el0_da+0x3c/0xa0
> [3701:11_see] el0t_64_sync_handler+0xc4/0xec
> [3701:11_see] el0t_64_sync+0x1b4/0x1b8
> [3701:11_see]tracing off
>
> Signed-off-by: Gao Xu <[email protected]>

I'll queue this for -stable backporting, assuming review is agreeable.
Can we please identify a suitable Fixes: target to tell -stable
maintainers which kernels need the fix? It looks like this goes back a
long way.

2023-11-23 08:53:47

by Michal Hocko

Subject: Re: [PATCH] mm,oom_reaper: avoid run queue_oom_reaper if task is not oom

On Wed 22-11-23 12:46:44, gaoxu wrote:
> The function queue_oom_reaper tests and sets tsk->signal->oom_mm->flags.
> However, it is necessary to check if 'tsk' is an OOM victim before
> executing 'queue_oom_reaper' because the variable may be NULL.
>
> We encountered such an issue, and the log is as follows:
> [3701:11_see]Out of memory: Killed process 3154 (system_server)
> total-vm:23662044kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB,
> UID:1000 pgtables:4056kB oom_score_adj:-900

> [3701:11_see][RB/E]rb_sreason_str_set: sreason_str set null_pointer
> [3701:11_see][RB/E]rb_sreason_str_set: sreason_str set unknown_addr

What are these?

> [3701:11_see]Unable to handle kernel NULL pointer dereference at virtual
> address 0000000000000328
> [3701:11_see]user pgtable: 4k pages, 39-bit VAs, pgdp=00000000821de000
> [3701:11_see][0000000000000328] pgd=0000000000000000,
> p4d=0000000000000000,pud=0000000000000000
> [3701:11_see]tracing off
> [3701:11_see]Internal error: Oops: 96000005 [#1] PREEMPT SMP
> [3701:11_see]Call trace:
> [3701:11_see] queue_oom_reaper+0x30/0x170

Could you resolve this offset into the code line please?

> [3701:11_see] __oom_kill_process+0x590/0x860
> [3701:11_see] oom_kill_process+0x140/0x274
> [3701:11_see] out_of_memory+0x2f4/0x54c
> [3701:11_see] __alloc_pages_slowpath+0x5d8/0xaac
> [3701:11_see] __alloc_pages+0x774/0x800
> [3701:11_see] wp_page_copy+0xc4/0x116c
> [3701:11_see] do_wp_page+0x4bc/0x6fc
> [3701:11_see] handle_pte_fault+0x98/0x2a8
> [3701:11_see] __handle_mm_fault+0x368/0x700
> [3701:11_see] do_handle_mm_fault+0x160/0x2cc
> [3701:11_see] do_page_fault+0x3e0/0x818
> [3701:11_see] do_mem_abort+0x68/0x17c
> [3701:11_see] el0_da+0x3c/0xa0
> [3701:11_see] el0t_64_sync_handler+0xc4/0xec
> [3701:11_see] el0t_64_sync+0x1b4/0x1b8
> [3701:11_see]tracing off
>
> Signed-off-by: Gao Xu <[email protected]>
> ---
> mm/oom_kill.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9e6071fde..3754ab4b6 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -984,7 +984,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
> }
> rcu_read_unlock();
>
> - if (can_oom_reap)
> + if (can_oom_reap && tsk_is_oom_victim(victim))
> queue_oom_reaper(victim);

I do not understand. We always send SIGKILL and call
mark_oom_victim(victim) on the victim task before reaching this point.
How can tsk_is_oom_victim() ever be false?
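
The invariant described above can be sketched in user space. This is a
simplified model, not the kernel code: the *_model types and functions are
hypothetical stand-ins, and the real mark_oom_victim() does more work and
uses atomic operations. It only illustrates that once the victim has been
marked, a check based on signal->oom_mm is expected to succeed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for mm_struct, signal_struct and task_struct. */
struct mm_model { unsigned long flags; };
struct signal_model { struct mm_model *oom_mm; };
struct task_model { struct signal_model *signal; };

/*
 * Model of the marking step: remember the victim's mm the first time only.
 * Assumption: plain stores are used here for brevity; the kernel performs
 * this update atomically.
 */
static void mark_oom_victim_model(struct task_model *tsk, struct mm_model *mm)
{
    assert(tsk != NULL && tsk->signal != NULL);
    if (tsk->signal->oom_mm == NULL)
        tsk->signal->oom_mm = mm;
}

/* Model of the check used by the patch: true once oom_mm has been set. */
static bool tsk_is_oom_victim_model(const struct task_model *tsk)
{
    return tsk->signal->oom_mm != NULL;
}
```

In this model the check can only fail if the marking step was skipped or
if oom_mm was cleared afterwards, which is why the question is what could
do either on the real code path.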

>
> mmdrop(mm);
> --
> 2.17.1
>
>

--
Michal Hocko
SUSE Labs

2023-11-24 02:52:56

by gaoxu

Subject: Re: [PATCH] mm,oom_reaper: avoid run queue_oom_reaper if task is not oom

On Wed, 22 Nov 2023 21:47:44 +0000 Andrew Morton wrote:
> On Wed, 22 Nov 2023 12:46:44 +0000 gaoxu <[email protected]> wrote:

>> The function queue_oom_reaper tests and sets tsk->signal->oom_mm->flags.
>> However, it is necessary to check if 'tsk' is an OOM victim before
>> executing 'queue_oom_reaper' because the variable may be NULL.
>>
>> We encountered such an issue, and the log is as follows:
>> [3701:11_see]Out of memory: Killed process 3154 (system_server)
>> total-vm:23662044kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB,
>> UID:1000 pgtables:4056kB oom_score_adj:-900
>> [3701:11_see][RB/E]rb_sreason_str_set: sreason_str set null_pointer
>> [3701:11_see][RB/E]rb_sreason_str_set: sreason_str set unknown_addr
>> [3701:11_see]Unable to handle kernel NULL pointer dereference at
>> virtual address 0000000000000328

> Well that isn't good. How frequently does this happen and can you suggest why some quite old code is suddenly causing problems?
> What is your workload doing that others' do not do?
This is a low-probability issue. We ran monkey testing for a month,
and the problem occurred only once.
The OOM condition itself was caused by a dma-buf memory leak in the surfaceflinger process.

I have not found the root cause of this problem.
The physical memory of the process killed by the OOM killer had already been released
(anon-rss, file-rss and shmem-rss are all 0kB in the log below), which suggests a race
between process termination and the OOM kill.
OOM kill log:
Out of memory: Killed process 3154 (system_server) total-vm:23662044kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB,
UID:1000 pgtables:4056kB oom_score_adj:-900

>> [3701:11_see]user pgtable: 4k pages, 39-bit VAs, pgdp=00000000821de000
>> [3701:11_see][0000000000000328] pgd=0000000000000000,
>> p4d=0000000000000000,pud=0000000000000000
>> [3701:11_see]tracing off
>> [3701:11_see]Internal error: Oops: 96000005 [#1] PREEMPT SMP
>> [3701:11_see]Call trace:
>> [3701:11_see] queue_oom_reaper+0x30/0x170 [3701:11_see]
>> __oom_kill_process+0x590/0x860 [3701:11_see]
>> oom_kill_process+0x140/0x274 [3701:11_see] out_of_memory+0x2f4/0x54c
>> [3701:11_see] __alloc_pages_slowpath+0x5d8/0xaac
>> [3701:11_see] __alloc_pages+0x774/0x800 [3701:11_see]
>> wp_page_copy+0xc4/0x116c [3701:11_see] do_wp_page+0x4bc/0x6fc
>> [3701:11_see] handle_pte_fault+0x98/0x2a8 [3701:11_see]
>> __handle_mm_fault+0x368/0x700 [3701:11_see]
>> do_handle_mm_fault+0x160/0x2cc [3701:11_see] do_page_fault+0x3e0/0x818
>> [3701:11_see] do_mem_abort+0x68/0x17c [3701:11_see] el0_da+0x3c/0xa0
>> [3701:11_see] el0t_64_sync_handler+0xc4/0xec [3701:11_see]
>> el0t_64_sync+0x1b4/0x1b8 [3701:11_see]tracing off
>>
>> Signed-off-by: Gao Xu <[email protected]>

> I'll queue this for -stable backporting, assuming review is agreeable.
> Can we please identify a suitable Fixes: target to tell -stable maintainers which kernels need the fix? It looks like this goes back a long way.
The problem occurred on Linux version 5.15.78. The function __oom_kill_process is identical
in 5.15.78 and in the latest kernel, so the problem is likely common to both versions.

2023-11-24 03:16:08

by gaoxu

Subject: Re: [PATCH] mm,oom_reaper: avoid run queue_oom_reaper if task is not oom

On Thu, 23 Nov 2023 08:51 Michal Hocko <[email protected]> wrote:
> On Wed 22-11-23 12:46:44, gaoxu wrote:
>> The function queue_oom_reaper tests and sets tsk->signal->oom_mm->flags.
>> However, it is necessary to check if 'tsk' is an OOM victim before
>> executing 'queue_oom_reaper' because the variable may be NULL.
>>
>> We encountered such an issue, and the log is as follows:
>> [3701:11_see]Out of memory: Killed process 3154 (system_server)
>> total-vm:23662044kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB,
>> UID:1000 pgtables:4056kB oom_score_adj:-900
>
>> [3701:11_see][RB/E]rb_sreason_str_set: sreason_str set null_pointer
>> [3701:11_see][RB/E]rb_sreason_str_set: sreason_str set unknown_addr
>
> What are these?
These are log messages that we added ourselves.

>> [3701:11_see]Unable to handle kernel NULL pointer dereference at
>> virtual address 0000000000000328 [3701:11_see]user pgtable: 4k pages,
>> 39-bit VAs, pgdp=00000000821de000 [3701:11_see][0000000000000328]
>> pgd=0000000000000000,
>> p4d=0000000000000000,pud=0000000000000000
>> [3701:11_see]tracing off
>> [3701:11_see]Internal error: Oops: 96000005 [#1] PREEMPT SMP
>> [3701:11_see]Call trace:
>> [3701:11_see] queue_oom_reaper+0x30/0x170
>
> Could you resolve this offset into the code line please?
Due to the additional code we added for log purposes, the line numbers may not correspond to the original Linux code.

static void queue_oom_reaper(struct task_struct *tsk)
{
	/* mm is already queued? */
	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags)) // the NULL pointer dereference occurred here
		return;
	...
}
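
For readers unfamiliar with the "already queued?" idiom in the snippet
above, here is a minimal user-space sketch using C11 atomics. The names
test_and_set_bit_model, queue_once_model and MODEL_REAP_QUEUED_BIT are
hypothetical, and the bit number is made up; the sketch only shows how an
atomic test-and-set lets exactly one caller queue the work. It does not
model the pointer chain that faulted.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Made-up bit number, for illustration only. */
#define MODEL_REAP_QUEUED_BIT 0u

/* Atomically set bit nr in *word and report whether it was already set. */
static bool test_and_set_bit_model(unsigned int nr, atomic_ulong *word)
{
    assert(nr < sizeof(unsigned long) * CHAR_BIT);
    unsigned long mask = 1UL << nr;
    unsigned long old = atomic_fetch_or(word, mask);
    return (old & mask) != 0;
}

/*
 * Returns true if this caller is the one that should queue the work,
 * false if the flag was already set by an earlier caller.
 */
static bool queue_once_model(atomic_ulong *flags)
{
    if (test_and_set_bit_model(MODEL_REAP_QUEUED_BIT, flags))
        return false; /* already queued */
    return true;
}
```

The first call on a zeroed flags word returns true and every later call
returns false, so the queueing step runs at most once per flags word.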
>> [3701:11_see] __oom_kill_process+0x590/0x860 [3701:11_see]
>> oom_kill_process+0x140/0x274 [3701:11_see] out_of_memory+0x2f4/0x54c
>> [3701:11_see] __alloc_pages_slowpath+0x5d8/0xaac
>> [3701:11_see] __alloc_pages+0x774/0x800 [3701:11_see]
>> wp_page_copy+0xc4/0x116c [3701:11_see] do_wp_page+0x4bc/0x6fc
>> [3701:11_see] handle_pte_fault+0x98/0x2a8 [3701:11_see]
>> __handle_mm_fault+0x368/0x700 [3701:11_see]
>> do_handle_mm_fault+0x160/0x2cc [3701:11_see] do_page_fault+0x3e0/0x818
>> [3701:11_see] do_mem_abort+0x68/0x17c [3701:11_see] el0_da+0x3c/0xa0
>> [3701:11_see] el0t_64_sync_handler+0xc4/0xec [3701:11_see]
>> el0t_64_sync+0x1b4/0x1b8 [3701:11_see]tracing off
>>
>> Signed-off-by: Gao Xu <[email protected]>
>> ---
>> mm/oom_kill.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 9e6071fde..3754ab4b6
>> 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -984,7 +984,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
>> }
>> rcu_read_unlock();
>>
>> - if (can_oom_reap)
>> + if (can_oom_reap && tsk_is_oom_victim(victim))
>> queue_oom_reaper(victim);
>
> I do not understand. We always do send SIGKILL and call mark_oom_victim(victim); on victim task when reaching out here. How can tsk_is_oom_victim can ever be false?
This is a low-probability issue, as it only occurred once during the monkey testing.
I haven't been able to find the root cause either.

>>
>> mmdrop(mm);
>> --
>> 2.17.1
>>
>>
>
>--
> Michal Hocko
> SUSE Labs

2023-11-24 09:34:18

by Michal Hocko

Subject: Re: Re: [PATCH] mm,oom_reaper: avoid run queue_oom_reaper if task is not oom

On Fri 24-11-23 02:52:34, gaoxu wrote:
> On Wed, 22 Nov 2023 21:47:44 +0000 Andrew Morton wrote:
> > On Wed, 22 Nov 2023 12:46:44 +0000 gaoxu <[email protected]> wrote:
>
> >> The function queue_oom_reaper tests and sets tsk->signal->oom_mm->flags.
> >> However, it is necessary to check if 'tsk' is an OOM victim before
> >> executing 'queue_oom_reaper' because the variable may be NULL.
> >>
> >> We encountered such an issue, and the log is as follows:
> >> [3701:11_see]Out of memory: Killed process 3154 (system_server)
> >> total-vm:23662044kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB,
> >> UID:1000 pgtables:4056kB oom_score_adj:-900
> >> [3701:11_see][RB/E]rb_sreason_str_set: sreason_str set null_pointer
> >> [3701:11_see][RB/E]rb_sreason_str_set: sreason_str set unknown_addr
> >> [3701:11_see]Unable to handle kernel NULL pointer dereference at
> >> virtual address 0000000000000328
>
> > Well that isn't good. How frequently does this happen and can you suggest why some quite old code is suddenly causing problems?
> > What is your workload doing that others' do not do?
> This is a low probability issue. We conducted monkey testing for a month,
> and this problem occurred only once.
> The cause of the OOM error is the process surfaceflinger has encountered dma-buf memory leak.
>
> I have not found the root cause of this problem.
> The physical memory of the process killed by OOM has been released, indicating that the issue may have occurred due to a concurrency problem
> between process termination and OOM kill.
> oom kill log:
> Out of memory: Killed process 3154 (system_server) total-vm:23662044kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB,
> UID:1000 pgtables:4056kB oom_score_adj:-900
>
> >> [3701:11_see]user pgtable: 4k pages, 39-bit VAs, pgdp=00000000821de000
> >> [3701:11_see][0000000000000328] pgd=0000000000000000,
> >> p4d=0000000000000000,pud=0000000000000000
> >> [3701:11_see]tracing off
> >> [3701:11_see]Internal error: Oops: 96000005 [#1] PREEMPT SMP
> >> [3701:11_see]Call trace:
> >> [3701:11_see] queue_oom_reaper+0x30/0x170 [3701:11_see]
> >> __oom_kill_process+0x590/0x860 [3701:11_see]
> >> oom_kill_process+0x140/0x274 [3701:11_see] out_of_memory+0x2f4/0x54c
> >> [3701:11_see] __alloc_pages_slowpath+0x5d8/0xaac
> >> [3701:11_see] __alloc_pages+0x774/0x800 [3701:11_see]
> >> wp_page_copy+0xc4/0x116c [3701:11_see] do_wp_page+0x4bc/0x6fc
> >> [3701:11_see] handle_pte_fault+0x98/0x2a8 [3701:11_see]
> >> __handle_mm_fault+0x368/0x700 [3701:11_see]
> >> do_handle_mm_fault+0x160/0x2cc [3701:11_see] do_page_fault+0x3e0/0x818
> >> [3701:11_see] do_mem_abort+0x68/0x17c [3701:11_see] el0_da+0x3c/0xa0
> >> [3701:11_see] el0t_64_sync_handler+0xc4/0xec [3701:11_see]
> >> el0t_64_sync+0x1b4/0x1b8 [3701:11_see]tracing off
> >>
> >> Signed-off-by: Gao Xu <[email protected]>
>
> > I'll queue this for -stable backporting, assuming review is agreeable.
> > Can we please identify a suitable Fixes: target to tell -stable maintainers which kernels need the fix? It looks like this goes back a long way.
> The problem occurred on Linux version 5.15.78, There is no difference between the latest kernel version code and Linux version 5.15.78 in the
> Function __oom_kill_process, so this problem is likely common to both versions.

__oom_kill_process is not the only involved part. The exit path plays a
really huge role there as well. I do understand that this was one off
and likely hard to reproduce but without knowing that the current Linus
tree can trigger this, we cannot really do much, I am afraid.

--
Michal Hocko
SUSE Labs

2023-11-24 09:37:46

by Michal Hocko

Subject: Re: Re: [PATCH] mm,oom_reaper: avoid run queue_oom_reaper if task is not oom

On Fri 24-11-23 03:15:46, gaoxu wrote:
[...]
> >> [3701:11_see]Unable to handle kernel NULL pointer dereference at
> >> virtual address 0000000000000328 [3701:11_see]user pgtable: 4k pages,
> >> 39-bit VAs, pgdp=00000000821de000 [3701:11_see][0000000000000328]
> >> pgd=0000000000000000,
> >> p4d=0000000000000000,pud=0000000000000000
> >> [3701:11_see]tracing off
> >> [3701:11_see]Internal error: Oops: 96000005 [#1] PREEMPT SMP
> >> [3701:11_see]Call trace:
> >> [3701:11_see] queue_oom_reaper+0x30/0x170
> >
> > Could you resolve this offset into the code line please?
> Due to the additional code we added for log purposes, the line numbers may not correspond to the original Linux code.
>
> static void queue_oom_reaper(struct task_struct *tsk)
> {
> /* mm is already queued? */
> if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags)) //a null pointer exception occurred
> return;

Did you manage to narrow down which of the dereferences this
corresponds to? Is it tsk->signal == NULL or signal->oom_mm == NULL?
The faulting address doesn't match either with my configs.
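
The reasoning behind this question can be sketched as follows: when a
pointer in the chain is NULL, the faulting address equals the offset of
the member accessed through it, so comparing the address against
offsetof() values hints at which pointer was NULL. The struct layouts and
padding sizes below are hypothetical and deliberately tiny; real offsets
depend on the kernel version and config, which is why they have to be
taken from the actual build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts; the padding sizes are invented for illustration. */
struct mm_model {
    char pad[0x40];
    unsigned long flags;
};

struct signal_model {
    char pad[0x18];
    struct mm_model *oom_mm;
};

/* The two candidate offsets must differ for the comparison to be useful. */
static_assert(offsetof(struct signal_model, oom_mm) !=
              offsetof(struct mm_model, flags),
              "model offsets must be distinguishable");

enum null_suspect { SUSPECT_SIGNAL, SUSPECT_OOM_MM, SUSPECT_NEITHER };

/* Map a faulting address to the pointer that was most likely NULL. */
static enum null_suspect classify_fault(uintptr_t fault_addr)
{
    if (fault_addr == offsetof(struct signal_model, oom_mm))
        return SUSPECT_SIGNAL;  /* signal was NULL, read of ->oom_mm faulted */
    if (fault_addr == offsetof(struct mm_model, flags))
        return SUSPECT_OOM_MM;  /* oom_mm was NULL, access to ->flags faulted */
    return SUSPECT_NEITHER;     /* matches neither offset in this layout */
}
```

With these invented layouts, an address that equals neither offset is
reported as SUSPECT_NEITHER, which mirrors the situation described above
where the reported address matches neither dereference.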

[...]

> >> --- a/mm/oom_kill.c
> >> +++ b/mm/oom_kill.c
> >> @@ -984,7 +984,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
> >> }
> >> rcu_read_unlock();
> >>
> >> - if (can_oom_reap)
> >> + if (can_oom_reap && tsk_is_oom_victim(victim))
> >> queue_oom_reaper(victim);
> >
> > I do not understand. We always do send SIGKILL and call mark_oom_victim(victim); on victim task when reaching out here. How can tsk_is_oom_victim can ever be false?
> This is a low-probability issue, as it only occurred once during the monkey testing.
> I haven't been able to find the root cause either.

OK, was there any non-standard code running during this test?
In any case I do not see how this patch could be correct. If, for some
reason, we managed to release the signal structure or something else, then
we need to understand whether this is a locking or reference counting
issue. I do not really see how this would be possible. But this check
right here doesn't really make sense.

Andrew please drop the patch from your tree.
--
Michal Hocko
SUSE Labs

2023-11-25 06:47:54

by gaoxu

Subject: Re: Re: [PATCH] mm,oom_reaper: avoid run queue_oom_reaper if task is not oom

On Fri, 24 Nov 2023 09:31 Michal Hocko wrote:
>On Fri 24-11-23 03:15:46, gaoxu wrote:
>[...]
>> >> [3701:11_see]Unable to handle kernel NULL pointer dereference at
>> >> virtual address 0000000000000328 [3701:11_see]user pgtable: 4k
>> >> pages, 39-bit VAs, pgdp=00000000821de000
>> >> [3701:11_see][0000000000000328] pgd=0000000000000000,
>> >> p4d=0000000000000000,pud=0000000000000000
>> >> [3701:11_see]tracing off
>> >> [3701:11_see]Internal error: Oops: 96000005 [#1] PREEMPT SMP
>> >> [3701:11_see]Call trace:
>> >> [3701:11_see] queue_oom_reaper+0x30/0x170
>> >
>> > Could you resolve this offset into the code line please?
>> Due to the additional code we added for log purposes, the line numbers may not correspond to the original Linux code.
>>
>> static void queue_oom_reaper(struct task_struct *tsk) {
>> /* mm is already queued? */
>> if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags)) //a null pointer exception occurred
>> return;
>
>Did you manage to narrow it down to which of the dereference this corresponds to? Is it tsk->signal == NULL or signal->oom_mm == NULL.
>The faulting address doesn't match neither with my configs.

[...]

>> >> --- a/mm/oom_kill.c
>> >> +++ b/mm/oom_kill.c
>> >> @@ -984,7 +984,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
>> >> }
>> >> rcu_read_unlock();
>> >>
>> >> - if (can_oom_reap)
>> >> + if (can_oom_reap && tsk_is_oom_victim(victim))
>> >> queue_oom_reaper(victim);
>> >
>> > I do not understand. We always do send SIGKILL and call mark_oom_victim(victim); on victim task when reaching out here. How can tsk_is_oom_victim can ever be false?
>> This is a low-probability issue, as it only occurred once during the monkey testing.
>> I haven't been able to find the root cause either.
>
>OK, was there any non-standard code running during this test?
>In any case I do not see how this patch could be correct. If, for some reason we managed to release the signal structure or something else then we need to understand whether this is a locking or reference counting issue. I do not really see how this would be possible. But this check right here doesn't really make sense.

No non-standard code was running during this test.
The OOM condition was caused by a dma-buf memory leak in the surfaceflinger process.
This problem is likely caused by a race. I will try to construct a scenario in which an OOM kill and a process kill run concurrently in order to reproduce the issue,
and if I discover anything, I will report it here.
Thank you, Michal and Andrew, for analyzing and discussing the issue.

>Andrew please drop the patch from your tree.