Message-ID: <43a6c470-9fc2-6195-9a25-5321d17540e5@redhat.com>
Date: Mon, 17 Jan 2022 17:56:28 -0500
From: Nico Pache
Subject: Re: [PATCH v3] mm/oom: do not oom reap task with an unresolved robust futex
To: Waiman Long, Michal Hocko
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org, jsavitz@redhat.com, peterz@infradead.org, tglx@linutronix.de, mingo@redhat.com, dvhart@infradead.org, dave@stgolabs.net, andrealmeid@collabora.com
References: <20220114180135.83308-1-npache@redhat.com>

On 1/17/22 11:05, Waiman Long wrote:
> On 1/17/22 03:52, Michal Hocko wrote:
>> On Fri 14-01-22 13:01:35, Nico Pache wrote:
>>> In the case that two or more processes share a futex located within
>>> a shared mmaped region, such as a process that shares a lock between
>>> itself and child processes, we have observed that when a process
>>> holding the lock is oom killed, at least one waiter is never alerted
>>> to this new development and simply continues to wait.
>>>
>>> This is visible via pthreads by checking the __owner field of the
>>> pthread_mutex_t structure within a waiting process, perhaps with gdb.
>>>
>>> We identify reproduction of this issue by checking a waiting process
>>> of a test program and viewing the contents of the pthread_mutex_t,
>>> taking note of the value in the owner field, and then checking dmesg
>>> to see if the owner has already been killed.
>> I believe we really need to find out why the original holder of the
>> futex is not woken up to release the lock when exiting.
>
> For a robust futex lock holder or waiter that is to be killed, it is
> not the responsibility of the task itself to wake up and release the
> lock. It is the kernel that recognizes that the task is holding or
> waiting for the robust futex and cleans things up.
>
>>> As mentioned by Michal in his patchset introducing the oom reaper,
>>> commit aac4536355496 ("mm, oom: introduce oom reaper"), the purpose
>>> of the oom reaper is to try and free memory more quickly; however,
>>> in the case that a robust futex is being used, we want to avoid
>>> utilizing the concurrent oom reaper. This is due to a race that can
>>> occur between the SIGKILL handling the robust futex, and the oom
>>> reaper freeing the memory needed to maintain the robust list.
>> The OOM reaper only unmaps private memory. It doesn't touch shared
>> mappings. So how could it interfere?
>>
> The futex itself may be in shared memory; however, the robust list
> entry can be in private memory. So when the robust list is being
> scanned in this case, we can be in a use-after-free situation.

I believe this is true. The userspace allocation for the pthread occurs
as a private mapping:
https://elixir.bootlin.com/glibc/latest/source/nptl/allocatestack.c#L368
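To make the setup concrete, a stripped-down version of the scenario
(much simpler than the full reproducer linked below) looks roughly like
the following: a robust, process-shared mutex in a MAP_SHARED region,
whose holder is killed while holding it. When the kernel's robust-futex
cleanup runs as intended, the surviving process gets EOWNERDEAD instead
of hanging. Build with -pthread:

#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	/* The lock word itself lives in shared memory. */
	pthread_mutex_t *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
				  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (m == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	pthread_mutexattr_t attr;
	pthread_mutexattr_init(&attr);
	pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
	pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
	pthread_mutex_init(m, &attr);

	pid_t pid = fork();
	if (pid == 0) {
		/* Child: take the lock and die without unlocking,
		 * standing in for the oom-killed lock holder. */
		pthread_mutex_lock(m);
		raise(SIGKILL);
		_exit(1); /* not reached */
	}
	waitpid(pid, NULL, 0);

	/* With working robust-futex cleanup, the kernel marks the futex
	 * owner-died on the child's exit and we see EOWNERDEAD here; in
	 * the failure mode described above, a waiter hangs instead. */
	int ret = pthread_mutex_lock(m);
	if (ret == EOWNERDEAD) {
		printf("owner died, recovering lock\n");
		pthread_mutex_consistent(m);
	}
	pthread_mutex_unlock(m);
	return 0;
}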
>>> In the case that the oom victim is utilizing a robust futex, and
>>> the SIGKILL has not yet handled the futex death, the
>>> tsk->robust_list should be non-NULL. This issue can be tricky to
>>> reproduce, but with the modifications of this patch, we have found
>>> it to be impossible to reproduce.
>> We really need a deeper analysis of the underlying problem because
>> right now I do not really see why the oom reaper should interfere
>> with a shared futex.
> As I said above, the robust list processing can involve private memory.
>>
>>> Add a check that tsk->robust_list is non-NULL in wake_oom_reaper()
>>> to return early and prevent waking the oom reaper.
>>>
>>> Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
>>>
>>> Co-developed-by: Joel Savitz
>>> Signed-off-by: Joel Savitz
>>> Signed-off-by: Nico Pache
>>> ---
>>>  mm/oom_kill.c | 15 +++++++++++++++
>>>  1 file changed, 15 insertions(+)
>>>
>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>>> index 1ddabefcfb5a..3cdaac9c7de5 100644
>>> --- a/mm/oom_kill.c
>>> +++ b/mm/oom_kill.c
>>> @@ -667,6 +667,21 @@ static void wake_oom_reaper(struct task_struct *tsk)
>>>      if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
>>>          return;
>>>
>>> +#ifdef CONFIG_FUTEX
>>> +    /*
>>> +     * If the ooming task's SIGKILL has not finished handling the
>>> +     * robust futex it is not correct to reap the mm concurrently.
>>> +     * Do not wake the oom reaper when the task still contains a
>>> +     * robust list.
>>> +     */
>>> +    if (tsk->robust_list)
>>> +        return;
>>> +#ifdef CONFIG_COMPAT
>>> +    if (tsk->compat_robust_list)
>>> +        return;
>>> +#endif
>>> +#endif
>> If this turns out to be really needed, which I do not really see at
>> the moment, then this is not the right way to handle this situation.
>> The oom victim could get stuck and the oom killer wouldn't be able to
>> move forward. If anything the victim would need to get MMF_OOM_SKIP
>> set.

I will try this, but I don't immediately see any difference between
this return case and setting the bit, putting the task on the
oom_reaper_list, and then skipping it based on the flag. Do you mind
explaining how this could lead to the oom killer getting stuck?

Cheers,
-- Nico

> There can be other ways to do that, but letting the normal kill signal
> processing finish its job and properly invoke futex_cleanup() is
> certainly one possible solution.
>
> Cheers,
> Longman
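As a footnote to the private-memory point above: the list head that the
exit-time cleanup walks is registered per thread via set_robust_list(2),
and glibc points it into the thread descriptor that lives in the private
stack/TLS mapping linked earlier. A hypothetical, stripped-down
registration is sketched below (it replaces the list glibc registered
for the thread, so it is demo-only):

#include <linux/futex.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* An empty robust list: the head points at itself. It lives in the
 * process's private data segment, even when the futexes it will track
 * sit in shared mappings. */
static struct robust_list_head head = {
	.list		= { &head.list },
	.futex_offset	= 0,
};

int main(void)
{
	/* Hand the (private) list head to the kernel; at exit, the
	 * kernel's exit_robust_list() walks it to mark held futexes
	 * FUTEX_OWNER_DIED and wake any waiters. */
	if (syscall(SYS_set_robust_list, &head, sizeof(head)) != 0) {
		perror("set_robust_list");
		return 1;
	}
	printf("robust list head registered at %p (private memory)\n",
	       (void *)&head);
	return 0;
}

If the oom reaper frees the mapping holding that head (or the entries
chained off it) before the exit-time walk runs, that walk is exactly the
use-after-free described above.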