Subject: Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits
From: Mina Almasry
Date: Fri, 11 Oct 2019 13:41:36 -0700
To: Mike Kravetz
Cc: Aneesh Kumar, shuah, David Rientjes, Shakeel Butt, Greg Thelen,
 Andrew Morton, khalid.aziz@oracle.com, open list, linux-mm@kvack.org,
 linux-kselftest@vger.kernel.org, cgroups@vger.kernel.org, Michal Koutný
References: <20190919222421.27408-1-almasrymina@google.com>
 <3c73d2b7-f8d0-16bf-b0f0-86673c3e9ce3@oracle.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Oct 11, 2019 at 12:10 PM Mina Almasry wrote:
>
> On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz wrote:
> >
> > On 9/19/19 3:24 PM, Mina Almasry wrote:
> > > Patch series implements hugetlb_cgroup reservation usage and limits,
> > > which track hugetlb reservations rather than hugetlb memory faulted
> > > in. Details of the approach are in patch 1/7.
> >
> > Thanks for your continued efforts Mina.
> >
> > One thing that has bothered me with this approach from the beginning
> > is that hugetlb reservations are related to, but somewhat distinct
> > from, hugetlb allocations. The original (existing) hugetlb cgroup
> > implementation does not take reservations into account. This is an
> > issue you are trying to address by adding cgroup support for hugetlb
> > reservations. However, this new reservation cgroup ignores hugetlb
> > allocations at fault time.
> >
> > I 'think' the whole purpose of any hugetlb cgroup is to manage the
> > allocation of hugetlb pages. Both the existing cgroup code and the
> > reservation approach have what I think are some serious flaws.
> > Consider a system with 100 hugetlb pages available. A sysadmin has
> > two groups, A and B, and wants to limit hugetlb usage to 50 pages
> > each.
> >
> > With the existing implementation, a task in group A could create an
> > mmap of 100 pages in size and reserve all 100 pages. Since the pages
> > are 'reserved', nobody in group B can allocate ANY huge pages. This
> > is true even though no pages have been allocated in A (or B).
> >
> > With the reservation implementation, a task in group A could use
> > MAP_NORESERVE and allocate all 100 pages without taking any
> > reservations.
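(Aside: that MAP_NORESERVE behavior is easy to demonstrate from
userspace. A minimal sketch, assuming 2 MB default huge pages have been
preallocated via /proc/sys/vm/nr_hugepages; the 100-page size just
mirrors the scenario above:)

/* Fault in hugetlb pages without ever taking a reservation. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL << 20) /* assumes 2 MB default huge page size */
#define NR_PAGES   100UL       /* illustrative; matches the example */

int main(void)
{
	size_t len = NR_PAGES * HPAGE_SIZE;

	/* MAP_NORESERVE: no hugetlb reservation is taken at mmap() time,
	 * so a reservation-only cgroup limit is never charged here. */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
		       MAP_NORESERVE, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Each touch faults in a real huge page, draining the global
	 * pool that was never reserved (SIGBUS if the pool runs dry). */
	for (size_t i = 0; i < NR_PAGES; i++)
		p[i * HPAGE_SIZE] = 1;

	munmap(p, len);
	return 0;
}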
> >
> > As mentioned in your documentation, it would be possible to use both
> > the existing (allocation) and new reservation cgroups together.
> > Perhaps if both are set up for the 50/50 split, things would work a
> > little better.
> >
> > However, instead of creating a new reservation cgroup, how about
> > adding reservation support to the existing allocation cgroup? One
> > could even argue that a reservation is an allocation, as it sets
> > aside huge pages that can only be used for a specific purpose. Here
> > is something that may work.
> >
> > Starting with the existing allocation cgroup:
> > - When hugetlb pages are reserved, the cgroup of the task making the
> >   reservations is charged. Tracking for the charged cgroup is done
> >   in the reservation map in the same way proposed by this patch set.
> > - At page fault time,
> >   - If a reservation already exists for that specific area, do not
> >     charge the faulting task. No tracking in the page, just the
> >     reservation map.
> >   - If no reservation exists, charge the cgroup of the faulting
> >     task. Tracking of this information is in the page itself, as
> >     implemented today.
> > - When the hugetlb object is removed, compare the reservation map
> >   with any allocated pages. If cgroup tracking information exists in
> >   the page, uncharge that cgroup. Otherwise, uncharge the cgroup (if
> >   any) in the reservation map.
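(A compilable toy model of the accounting scheme proposed above. The
struct fields and helpers are invented for illustration, not the actual
kernel or patchset API; it only models who gets charged at reserve and
fault time, and who gets uncharged at free time:)

#include <stdio.h>

struct cgroup { const char *name; long charged; };

/* Per-mapping state: stands in for the resv_map. */
struct mapping {
	struct cgroup *resv_cg; /* cgroup charged at reservation time */
	long reserved;          /* pages reserved up front */
};

static void reserve(struct mapping *m, struct cgroup *cg, long npages)
{
	m->resv_cg = cg;
	m->reserved = npages;
	cg->charged += npages; /* charge at reservation time */
}

/* Returns the cgroup to uncharge at free time, or NULL when the
 * reserve-time charge covers this page (tracked in the resv_map). */
static struct cgroup *fault(struct mapping *m, struct cgroup *cg, long idx)
{
	if (idx < m->reserved)
		return NULL;   /* consuming an existing reservation */
	cg->charged += 1;      /* NORESERVE-style fault: charge now */
	return cg;             /* would be tracked in the page itself */
}

int main(void)
{
	struct cgroup a = { "A", 0 };
	struct mapping shared = { 0 }, noreserve = { 0 };

	reserve(&shared, &a, 50);                    /* 50 charged */
	fault(&shared, &a, 10);                      /* no extra charge */
	struct cgroup *u = fault(&noreserve, &a, 0); /* charged at fault */

	printf("A charged for %ld pages (uncharge owner: %s)\n",
	       a.charged, u ? u->name : "resv_map");
	return 0;
}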
>
> Sorry for the late response here. I've been prototyping the
> suggestions from this conversation:
>
> 1. Supporting cgroup-v2 on the current controller seems trivial.
> Basically just specifying the dfl files seems to do it, and my tests
> on top of cgroup-v2 don't see any problems so far at least. In light
> of this I'm not sure it's best to create a new controller per se.
> It seems like that would duplicate a lot of code with the current
> controller, so I've tentatively stuck to the plan in my current
> patchset: a new counter on the existing controller.
>
> 2. I've been working on transitioning the new counter to the behavior
> Mike specified in the email I'm responding to. So far I have a flow
> that works for shared mappings but not private mappings:
>
> - On reservation, charge the new counter and store the info in the
>   resv_map. The counter gets uncharged when the resv_map entry gets
>   removed (works fine).
> - On alloc_huge_page(), check if there is a reservation for the page
>   being allocated. If not, charge the new counter and store the
>   information in the resv_map. The counter still gets uncharged when
>   the resv_map entry gets removed.
>
> The above works for all shared mappings and reserved private
> mappings, but I'm having trouble supporting private NORESERVE
> mappings. Charging can work the same as for shared mappings: charge
> the new counter on reservation and on allocations that do not have a
> reservation. But the question still comes up: where to store the
> counter to uncharge this page? I thought of a couple of things that
> don't seem to work:
>
> 1. I thought of putting the counter in resv_map->reservation_counter,
> so that it gets uncharged on vm_op_close. But private NORESERVE
> mappings don't even have a resv_map allocated for them.
>
> 2. I thought of detecting in free_huge_page that the page being freed
> belonged to a private NORESERVE mapping, and uncharging the
> hugetlb_cgroup in the page itself, but free_huge_page only gets a
> struct page * and I can't seem to find a way to detect that the page
> comes from a private NORESERVE mapping from only the struct page *.
>
> Mike, note that your suggestion above to check whether the page's
> hugetlb_cgroup is NULL doesn't work if we want to keep the current
> counter working the same: the page will always have a hugetlb_cgroup
> pointer, and it contains the old counter. Any ideas how to apply this
> new counter behavior to private NORESERVE mappings? Is there maybe a
> flag I can set on the pages at allocation time that I can read at
> free time to know whether to uncharge the hugetlb_cgroup or not?

Reading the code and asking around a bit, it seems the pointer to the
hugetlb_cgroup is in page[2].private. Is it reasonable to use
page[3].private to store the hugetlb_cgroup to uncharge for the new
counter, and to increment HUGETLB_CGROUP_MIN_ORDER to 3? I think that
would solve my problem: when allocating a private NORESERVE page, set
page[3].private to the hugetlb_cgroup to uncharge; then in
free_huge_page, check page[3].private, and if it is non-NULL, uncharge
the new counter on it.
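For illustration, a sketch of what those accessors could look like,
modeled on the existing helpers that keep the hugetlb_cgroup pointer in
page[2].private. The *_rsvd names and the uncharge helper are
hypothetical, and this is kernel-context pseudo-code, not a standalone
program:

/* Reservation-counter analogues of the page[2].private accessors.
 * Bumping HUGETLB_CGROUP_MIN_ORDER to 3 guarantees the compound page
 * is large enough to have a page[3] to borrow. */
#define HUGETLB_CGROUP_MIN_ORDER 3

static inline struct hugetlb_cgroup *
hugetlb_cgroup_from_page_rsvd(struct page *page)
{
	if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
		return NULL;
	return (struct hugetlb_cgroup *)page[3].private;
}

static inline void set_hugetlb_cgroup_rsvd(struct page *page,
					   struct hugetlb_cgroup *h_cg)
{
	if (compound_order(page) >= HUGETLB_CGROUP_MIN_ORDER)
		page[3].private = (unsigned long)h_cg;
}

/* In free_huge_page(): a non-NULL page[3].private means this page was
 * charged to the new counter at allocation time (the private
 * NORESERVE case), so uncharge it here and clear the pointer. */
static void uncharge_rsvd_counter(struct page *page, unsigned long nr_pages)
{
	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_page_rsvd(page);

	if (h_cg) {
		hugetlb_cgroup_uncharge_rsvd(h_cg, nr_pages); /* hypothetical */
		set_hugetlb_cgroup_rsvd(page, NULL);
	}
}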