Date: Mon, 3 Jan 2022 13:11:39 +0100
From: Michal Hocko
To: Suren Baghdasaryan
Cc: Johannes Weiner, Andrew Morton, David Rientjes, Matthew Wilcox,
    Roman Gushchin, Rik van Riel, Minchan Kim, "Kirill A. Shutemov",
    Andrea Arcangeli, Christian Brauner, Christoph Hellwig, Oleg Nesterov,
    David Hildenbrand, Jann Horn, Shakeel Butt, Andy Lutomirski,
    Christian Brauner, Florian Weimer, Jan Engelhardt, Tim Murray,
    linux-mm, LKML, kernel-team
Subject: Re: [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu 30-12-21 09:29:40, Suren Baghdasaryan wrote:
> On Thu, Dec 30, 2021 at 12:24 AM Michal Hocko wrote:
> >
> > On Wed 29-12-21 21:59:55, Suren Baghdasaryan wrote:
> > [...]
> > > After some more digging I think there are two acceptable options:
> > >
> > > 1. Call unlock_range() under mmap_write_lock and then downgrade it to
> > > read lock so that both exit_mmap() and __oom_reap_task_mm() can unmap
> > > vmas in parallel like this:
> > >
> > >     if (mm->locked_vm) {
> > >         mmap_write_lock(mm);
> > >         unlock_range(mm->mmap, ULONG_MAX);
> > >         mmap_write_downgrade(mm);
> > >     } else
> > >         mmap_read_lock(mm);
> > > ...
> > >     unmap_vmas(&tlb, vma, 0, -1);
> > >     mmap_read_unlock(mm);
> > >     mmap_write_lock(mm);
> > >     free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > > ...
> > >     mm->mmap = NULL;
> > >     mmap_write_unlock(mm);
> > >
> > > This way exit_mmap() might block __oom_reap_task_mm() but for a much
> > > shorter time during unlock_range() call.
> >
> > IIRC unlock_range depends on page lock at some stage and that can mean
> > this will block for a long time or for ever when the holder of the lock
> > depends on a memory allocation. This was the primary problem why the oom
> > reaper skips over mlocked vmas.
>
> Oh, I missed that detail. I thought __oom_reap_task_mm() skips locked
> vmas only to avoid destroying pgds from under follow_page().
>
> > >
> > > 2. Introduce another vm_flag mask similar to VM_LOCKED which is set
> > > before munlock_vma_pages_range() clears VM_LOCKED so that
> > > __oom_reap_task_mm() can identify vmas being unlocked and skip them.
> > >
> > > Option 1 seems cleaner to me because it keeps the locking pattern
> > > around unlock_range() in exit_mmap() consistent with all other places
> > > it is used (in mremap() and munmap()) with mmap_write_lock taken.
> > > WDYT?
> >
> > It would be really great to make unlock_range oom reaper aware IMHO.
>
> What exactly do you envision? Say unlock_range() knows that it's
> racing with __oom_reap_task_mm() and that calling follow_page() is
> unsafe without locking, what should it do?

My original plan was to make the page lock conditional and use
trylocking from the oom reaper (aka lockless context). It is OK to
simply bail out and leave some mlocked memory behind if there is
contention on a specific page. The overall objective is to free as much
memory as possible, not all of it.

IIRC Hugh was not a fan of this approach and he has mentioned that the
lock might not even be really needed and that the area would benefit
from a cleanup rather than oom reaper specific hacks. I do tend to agree
with that. I just never managed to find any spare time for that though,
and heavily mlocked oom victims tend to be really rare.
-- 
Michal Hocko
SUSE Labs
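
[Illustration, not part of the original message.] The "trylock and bail out on contention" idea described above can be sketched in plain userspace C. This is only an analogy under stated assumptions: struct demo_page, demo_page_trylock() and demo_reap_pages() are hypothetical names invented for this sketch, a C11 atomic_flag stands in for the page lock, and nothing here is kernel code or the approach that was eventually merged.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a page; not a kernel structure. */
struct demo_page {
    atomic_flag lock;   /* plays the role of the page lock */
    bool mlocked;       /* page is still accounted as mlocked */
    bool freed;         /* page has been reclaimed by the reaper */
};

/* Put a page into its initial state: unlocked, mlocked, not freed. */
void demo_page_init(struct demo_page *p)
{
    atomic_flag_clear(&p->lock);
    p->mlocked = true;
    p->freed = false;
}

/* Non-blocking lock attempt: returns true only if we took the lock. */
bool demo_page_trylock(struct demo_page *p)
{
    return !atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire);
}

void demo_page_unlock(struct demo_page *p)
{
    atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/*
 * Reap as many pages as possible without ever waiting on a page lock.
 * A contended page is skipped and stays mlocked; the goal is to free
 * as much memory as possible, not all of it.
 * Returns the number of pages reaped by this call.
 */
size_t demo_reap_pages(struct demo_page *pages, size_t n)
{
    size_t reaped = 0;

    assert(pages != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        struct demo_page *p = &pages[i];

        if (p->freed)
            continue;               /* already reclaimed earlier */
        if (!demo_page_trylock(p))
            continue;               /* contended: bail out on this page */

        p->mlocked = false;         /* "munlock" the page */
        p->freed = true;            /* and reclaim it */
        reaped++;
        demo_page_unlock(p);
    }
    return reaped;
}
```

The point of the design is that the reaper never sleeps on a lock whose holder may itself be waiting for memory; a later pass can pick up whatever was skipped once the contention is gone.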