Message-ID: <919f547e-beb7-34b7-7835-9e1625600323@suse.cz>
Date:   Fri, 26 Nov 2021 15:50:15 +0100
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
 Thunderbird/91.3.2
Content-Language: en-US
To:     NeilBrown <neilb@suse.de>,
        Andrew Morton <akpm@linux-foundation.org>
Cc:     Uladzislau Rezki <urezki@gmail.com>,
        Michal Hocko <mhocko@kernel.org>,
        Dave Chinner <david@fromorbit.com>,
        Christoph Hellwig <hch@lst.de>, linux-fsdevel@vger.kernel.org,
        linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>,
        Ilya Dryomov <idryomov@gmail.com>,
        Jeff Layton <jlayton@kernel.org>,
        Michal Hocko <mhocko@suse.com>
References: <20211122153233.9924-1-mhocko@kernel.org>
 <20211122153233.9924-3-mhocko@kernel.org> <YZ06nna7RirAI+vJ@pc638.lan>
 <20211123170238.f0f780ddb800f1316397f97c@linux-foundation.org>
 <163772381628.1891.9102201563412921921@noble.neil.brown.name>
 <20211123194833.4711add38351d561f8a1ae3e@linux-foundation.org>
 <163773141164.1891.1440920123016055540@noble.neil.brown.name>
From:   Vlastimil Babka <vbabka@suse.cz>
Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
In-Reply-To: <163773141164.1891.1440920123016055540@noble.neil.brown.name>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Precedence: bulk

On 11/24/21 06:23, NeilBrown wrote:
>> 
>> I forget why radix_tree_preload used a cpu-local store rather than a
>> per-task one.
>> 
>> Plus "what order pages would you like" and "on which node" and "in
>> which zone", etc...
> 
> "what order" - only order-0 I hope.  I'd hazard a guess that 90% of
> current NOFAIL allocations only need one page (providing slub is used -
> slab seems to insist on high-order pages sometimes).

Yeah, AFAIK SLUB can prefer higher orders than SLAB, but it also allows
falling back to the smallest order that is sufficient (thus order-0 unless
the objects are larger than a page).
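
To make the fallback concrete, here is a rough userspace sketch. It is not
the actual SLUB code: all sketch_* names, the 4 KiB page size and the
"max_ok_order" fragmentation model are assumptions made up for
illustration. It only shows the shape of "try the preferred order, then
retry with the minimum order that still fits the object":

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Smallest order such that (PAGE_SIZE << order) >= size. */
static unsigned int sketch_min_order(size_t size)
{
	unsigned int order = 0;

	while (((size_t)SKETCH_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/*
 * Hypothetical stand-in for the page allocator: it succeeds only for
 * orders up to max_ok_order, which crudely models fragmentation.
 */
static int sketch_try_alloc(unsigned int order, unsigned int max_ok_order)
{
	return order <= max_ok_order;
}

/* Returns the order obtained, or -1 if even the minimum order fails. */
static int sketch_alloc_slab_order(size_t obj_size, unsigned int preferred,
				   unsigned int max_ok_order)
{
	unsigned int min = sketch_min_order(obj_size);

	if (preferred < min)
		preferred = min;

	/* First attempt: the preferred (possibly high) order. */
	if (sketch_try_alloc(preferred, max_ok_order))
		return (int)preferred;

	/* Fallback: the smallest order that can hold one object. */
	if (sketch_try_alloc(min, max_ok_order))
		return (int)min;

	return -1;
}
```

With a 192-byte object, a preferred order of 3 and only order-0 pages
available, the sketch ends up with order 0, which is the case relevant to
the "only order-0" estimate above.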

> "which node" - whichever.  Unless __GFP_HARDWALL is set, alloc_page()
> will fall-back to "whichever" anyway, and NOFAIL with HARDWALL is
> probably a poor choice.
> "which zone" - NORMAL.  I cannot find any NOFAIL allocations that want
> DMA.  fs/ntfs asks for __GFP_HIGHMEM with NOFAIL, but that doesn't
> *require* highmem.
> 
> Of course, before designing this interface too precisely we should check
> if anyone can use it.  From a quick look through some of the 100-ish
> users of __GFP_NOFAIL I'd guess that mempools would help - the
> preallocation should happen at init-time, not request-time.  Maybe if we
> made mempools even more light weight .... though that risks allocating a
> lot of memory that will never get used.
> 
> This brings me back to the idea that
>     alloc_page(wait and reclaim allowed)
> should only fail on OOM_KILL.  That way kernel threads are safe, and
> user-threads are free to return ENOMEM knowing it won't get to

Hm, I thought that's already pretty much how the "too small to fail" rule
works today. IIRC it has exactly that gotcha: an OOM kill can make such an
allocation fail. But I believe that approach is rather fragile. If you
encounter such an allocation that does not check the resulting page for
NULL, you can only guess which of these is true:

- the author simply forgot to check at all
- the author relied on "too small to fail" without realizing the gotcha
- when the code was written, it was verified that it can only run in
  kernel thread context, never in user context, and
  - that is still true
  - that stopped being true at some later point
  - it might be hard to even decide which is the case

IIRC at some point we tried to abolish the "too small to fail" rule because
of this, but Linus rejected that. But the opposite - making it a hard
guarantee in all cases - also didn't happen, so...
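
To illustrate why an unchecked call site is ambiguous, here is a toy
userspace model. It is not kernel code: struct sketch_task, sketch_alloc()
and the flags are hypothetical, SKETCH_COSTLY_ORDER merely mirrors
PAGE_ALLOC_COSTLY_ORDER, and "the allocator keeps retrying" is modelled
simply as success:

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_COSTLY_ORDER 3	/* mirrors PAGE_ALLOC_COSTLY_ORDER */

/* Hypothetical, heavily simplified view of the calling task. */
struct sketch_task {
	bool is_kthread;	/* kernel threads are not OOM-kill targets */
	bool oom_killed;	/* task was selected by the OOM killer */
};

static char sketch_page[1];	/* dummy object standing in for a page */

/*
 * Toy model of an allocation that may sleep and reclaim.
 * Returns NULL only where this sketch assumes a real failure is possible.
 */
static void *sketch_alloc(const struct sketch_task *t, unsigned int order,
			  bool nofail, bool memory_available)
{
	if (memory_available)
		return sketch_page;

	/* NOFAIL: would loop until it succeeds; modelled as success. */
	if (nofail)
		return sketch_page;

	/* "Too small to fail": retried indefinitely ... */
	if (order <= SKETCH_COSTLY_ORDER && !t->oom_killed)
		return sketch_page;

	/* ... except for an OOM victim, or for costly orders. */
	return NULL;
}
```

In this model an order-0 allocation from a kernel thread never returns
NULL, while the very same call from a user task that got OOM killed does.
A caller that skips the NULL check is therefore only correct for as long
as the "kernel thread only" assumption holds, and nothing at the call
site documents that.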

> user-space.  If user-thread code really needs NOFAIL, it punts to a
> workqueue and waits - aborting the wait if it is killed, while the work
> item still runs eventually.
> 
> NeilBrown
>