Subject: Re: [PATCH 1/2] mm: clarify __GFP_MEMALLOC usage
To: NeilBrown, Michal Hocko
CC: David Rientjes, Andrew Morton, Joel Fernandes, "Paul E. McKenney", LKML
McKenney" , , LKML References: <20200403083543.11552-1-mhocko@kernel.org> <20200403083543.11552-2-mhocko@kernel.org> <87blo8xnz2.fsf@notabene.neil.brown.name> <20200406070137.GC19426@dhcp22.suse.cz> <4f861f07-4b47-8ddc-f783-10201ea302d3@nvidia.com> <875zecw1n6.fsf@notabene.neil.brown.name> From: John Hubbard X-Nvconfidentiality: public Message-ID: <34cb5019-9fab-4dfd-db48-c3f88fe5614c@nvidia.com> Date: Mon, 6 Apr 2020 18:21:39 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.6.0 MIME-Version: 1.0 In-Reply-To: <875zecw1n6.fsf@notabene.neil.brown.name> X-Originating-IP: [172.20.13.39] X-ClientProxiedBy: HQMAIL107.nvidia.com (172.20.187.13) To DRHQMAIL107.nvidia.com (10.27.9.16) Content-Type: text/plain; charset="utf-8"; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=nvidia.com; s=n1; t=1586222448; bh=LYXDUK+GJft3mnfsdxuky7tVzUNlHTrLGp5zc2o/mDo=; h=X-PGP-Universal:Subject:To:CC:References:From:X-Nvconfidentiality: Message-ID:Date:User-Agent:MIME-Version:In-Reply-To: X-Originating-IP:X-ClientProxiedBy:Content-Type:Content-Language: Content-Transfer-Encoding; b=Jl/QqwDKYXHNIQohfKGwFXrYefebguhhbbIxJJtXWn2OBMU2XSKkvCx9qNgcIb/nS XqJru4pynZpuMwAtLpfjpZIRVoJBQUWu6qG19aN3w4VWYFL8DRqitD2ArHjWsz7lEb ZhCgVC9skTQdSs2jfmiMpjl6BZbe3Jse6/tVWkSplo0FqnJKPUdN6lzaLgenWyWLBB dsQKEHwk1WbcBYwWwnGhiHgHI0kTmPyq4mSYlN7bGrhb5bYSd+nt/Mkd4A5Kna0Ih3 PHKFGIDW2IYetxeHBldIHbi4ayfJd3VGaDRvRkCSLY72eZ6Tt6ah1IGd4iwFDDLr3F 1n19FqbN5jQYQ== Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 4/6/20 6:00 PM, NeilBrown wrote: ... >>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h >>> index e5b817cb86e7..9cacef1a3ee0 100644 >>> --- a/include/linux/gfp.h >>> +++ b/include/linux/gfp.h >>> @@ -110,6 +110,11 @@ struct vm_area_struct; >>> * the caller guarantees the allocation will allow more memory to be freed >>> * very shortly e.g. process exiting or swapping. Users either should >>> * be the MM or co-ordinating closely with the VM (e.g. swap over NFS). >>> + * Users of this flag have to be extremely careful to not deplete the reserve >>> + * completely and implement a throttling mechanism which controls the consumption >>> + * of the reserve based on the amount of freed memory. >>> + * Usage of a pre-allocated pool (e.g. mempool) should be always considered before >>> + * using this flag. > > I think this version is pretty good. > >>> * >>> * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves. >>> * This takes precedence over the %__GFP_MEMALLOC flag if both are set. >>> >> >> Hi Michal and all, >> >> How about using approximately this wording instead? I found Neil's wording to be >> especially helpful so I mixed it in. (Also fixed a couple of slight 80-col overruns.) >> >> diff --git a/include/linux/gfp.h b/include/linux/gfp.h >> index be2754841369..c247a911d8c7 100644 >> --- a/include/linux/gfp.h >> +++ b/include/linux/gfp.h >> @@ -111,6 +111,15 @@ struct vm_area_struct; >> * very shortly e.g. process exiting or swapping. Users either should >> * be the MM or co-ordinating closely with the VM (e.g. swap over NFS). >> * >> + * To be extra clear: users of __GFP_MEMALLOC must be working to free other >> + * memory, and that other memory needs to be freed "soon"; specifically, before >> + * the reserve is exhausted. 
>> + * balances the amount of __GFP_MEMALLOC memory used against the amount that the
>> + * caller is about to free.
>
> I don't like this change. "balances the amount ... is about to free"
> doesn't say anything about time, so it doesn't seem to be about throttling.
>
> I think it is hard to write rules because the rules are a bit spongey.
>
> With mempools, we have a nice clear rule. When you allocate from a
> mempool, you must have a clear path to freeing that allocation which will
> not block on memory allocation, except from a subordinate mempool. This
> implies a partial ordering between mempools. When you have layered
> block devices, the path through the layers from filesystem down to
> hardware defines the order. It isn't enforced, but it is quite easy to
> reason about.
>
> GFP_MEMALLOC effectively provides multiple mempools, so it could
> theoretically deadlock if multiple long dependency chains happened:
> i.e., if 1000 threads each make a GFP_MEMALLOC allocation and then
> need to make another one before the first can be freed, then you
> hit problems. There is no formal way to guarantee that this doesn't
> happen. We just say "be gentle", minimize the users of this flag,
> and keep more memory in reserve than we really need.
> Note that 'threads' here might not be Linux tasks. If you have an IO
> request that proceeds asynchronously, moving from queue to queue and
> being handled by different tasks, then each one is a "thread" for the
> purpose of understanding the mem-alloc dependency.
>
> So maybe what I really should focus on is not how quickly things happen,
> but how many happen concurrently. The idea of throttling is to allow
> previous requests to complete before we start too many more.
>
> With swap-over-NFS, some of the things that might need to be allocated
> are routing table entries. These scale with the number of NFS servers
> rather than the number of IO requests, so they are not going to cause
> concurrency problems.
> We also need memory to store replies, but these never exceed the number
> of pending requests, so there is limited concurrency there.
> NFS can send a lot of requests in parallel, but the main limit is the
> RPC "slot table", and while that grows dynamically, it does so with
> GFP_NOFS, so it can block or fail (I wonder if that should explicitly
> disable the use of the reserves).
>
> So there is a limit on concurrency imposed by non-GFP_MEMALLOC allocations.
>
> So ... maybe the documentation should say that boundless concurrency of
> allocations (i.e. one module allocating a boundless number of times
> before previous allocations are freed) must be avoided.
>

Well, that's a good discussion you just wrote above, and I think it
demonstrates that it's hard to describe the situation in just a couple
of sentences. With that in mind, perhaps it's best to take the above
notes as a starting point, adjust them slightly, and drop them into
Documentation/core-api/memory-allocation.rst? Then the comments here
could refer to it.

thanks,
--
John Hubbard
NVIDIA
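
As a concrete illustration of the mempool ordering rule Neil describes
above, here is a minimal sketch. Nothing in it comes from the patch under
discussion: the two pools, the request structs, the pool sizes, and
submit_to_hardware() are invented stand-ins for two layers of a
block-device stack; only the mempool_*() calls are real kernel API.

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/mempool.h>
#include <linux/slab.h>

/* Hypothetical request types for the two layers. */
struct fs_req { int tag; };
struct blk_req { int tag; };

static mempool_t *fs_req_pool;	/* upper layer */
static mempool_t *blk_req_pool;	/* subordinate layer */

static void submit_to_hardware(struct blk_req *breq);	/* hypothetical */

static int __init pools_init(void)
{
	/* A few pre-allocated elements guarantee forward progress. */
	fs_req_pool = mempool_create_kmalloc_pool(4, sizeof(struct fs_req));
	blk_req_pool = mempool_create_kmalloc_pool(4, sizeof(struct blk_req));
	if (!fs_req_pool || !blk_req_pool)
		return -ENOMEM;
	return 0;
}

static void handle_fs_request(void)
{
	/* May sleep until an element is available, but never fails. */
	struct fs_req *req = mempool_alloc(fs_req_pool, GFP_NOIO);
	struct blk_req *breq;

	/*
	 * The path that frees 'req' may allocate from the subordinate
	 * pool, but must never allocate from fs_req_pool again; that
	 * would break the partial ordering and can deadlock once the
	 * pool is down to its last pre-allocated element.
	 */
	breq = mempool_alloc(blk_req_pool, GFP_NOIO);
	submit_to_hardware(breq);

	/* On completion, free in the reverse order of allocation. */
	mempool_free(breq, blk_req_pool);
	mempool_free(req, fs_req_pool);
}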
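
And a similarly hedged sketch of the bounded-concurrency idea: cap the
number of __GFP_MEMALLOC allocations in flight, so earlier reserve users
must free before new ones are admitted. The semaphore-based throttle, the
limit of 16, and every name below are invented for illustration; this is
not an existing kernel mechanism.

#include <linux/init.h>
#include <linux/semaphore.h>
#include <linux/slab.h>

#define MAX_RESERVE_USERS 16	/* arbitrary cap, for illustration only */

static struct semaphore reserve_slots;

static int __init reserve_throttle_init(void)
{
	sema_init(&reserve_slots, MAX_RESERVE_USERS);
	return 0;
}

/* Called on a memory-freeing path, e.g. writeback of swap pages. */
static void *alloc_for_writeback(size_t size)
{
	void *p;

	/* Sleeps until one of the earlier reserve users has freed. */
	down(&reserve_slots);
	p = kmalloc(size, GFP_NOIO | __GFP_MEMALLOC);
	if (!p)
		up(&reserve_slots);	/* give the slot back on failure */
	return p;
}

static void free_for_writeback(void *p)
{
	kfree(p);
	up(&reserve_slots);	/* admit the next reserve user */
}

This bounds reserve consumption at MAX_RESERVE_USERS outstanding
allocations no matter how many requests arrive, which is one way to meet
the "boundless concurrency must be avoided" rule without measuring time
at all.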