Date: Mon, 8 Nov 2021 13:04:56 -0800
Message-Id: <20211108210456.1745788-1-almasrymina@google.com>
Subject: [PATCH v4] hugetlb: Add hugetlb.*.numa_stat file
From: Mina Almasry
Cc: Mina Almasry, Mike Kravetz, Andrew Morton, Shuah Khan, Miaohe Lin,
    Oscar Salvador, Michal Hocko, Muchun Song, David Rientjes, Shakeel Butt,
    Jue Wang, Yang Yao, Joanna Li, Cannon Matthews,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
To: unlisted-recipients:; (no To-header on input)

For hugetlb-backed jobs/VMs it's critical to understand the NUMA
information of the memory backing these jobs in order to deliver optimal
performance. Currently this can technically be queried from
/proc/self/numa_maps, but there are significant issues with that
approach. Namely:

1. Memory can be mapped or unmapped, while numa_maps only reflects what
   is currently mapped.
2. numa_maps are per process and need to be aggregated across all
   processes in the cgroup. For shared memory this is more involved, as
   userspace needs to make sure it doesn't double count shared mappings.
3. I believe querying numa_maps requires holding the mmap_lock, which
   adds to the contention on this lock.

For these reasons I propose simply adding a hugetlb.*.numa_stat file,
which shows the NUMA information of the cgroup's hugetlb memory,
similarly to memory.numa_stat.

On cgroup-v2:

   cat /sys/fs/cgroup/unified/test/hugetlb.2MB.numa_stat
   total=2097152 N0=2097152 N1=0

On cgroup-v1:

   cat /sys/fs/cgroup/hugetlb/test/hugetlb.2MB.numa_stat
   total=2097152 N0=2097152 N1=0
   hierarchical_total=2097152 N0=2097152 N1=0
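As an illustration of how such a file might be consumed, below is a
minimal userspace sketch (not part of the patch itself; the cgroup-v2
mount point, the "test" cgroup name and the 2MB hstate are assumptions
about the local setup and need to be adjusted):

  /* Dump each key/value token of hugetlb.2MB.numa_stat. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  int main(void)
  {
  	/* Assumed path; adjust for the local cgroup mount and name. */
  	const char *path =
  		"/sys/fs/cgroup/unified/test/hugetlb.2MB.numa_stat";
  	char buf[4096];
  	FILE *f = fopen(path, "r");

  	if (!f) {
  		perror("fopen");
  		return 1;
  	}

  	/* Lines look like: "total=<bytes> N0=<bytes> N1=<bytes> ..." */
  	while (fgets(buf, sizeof(buf), f)) {
  		char *tok = strtok(buf, " \n");

  		while (tok) {
  			char *eq = strchr(tok, '=');

  			if (eq)
  				printf("%.*s -> %llu bytes\n",
  				       (int)(eq - tok), tok,
  				       strtoull(eq + 1, NULL, 10));
  			tok = strtok(NULL, " \n");
  		}
  	}
  	fclose(f);
  	return 0;
  }

Each token is a key=value pair, with the (possibly hierarchical) total
first and one N<node>= entry per node with memory, so the same loop
handles both the cgroup-v1 and cgroup-v2 output.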
This patch was tested manually by allocating hugetlb memory and querying
the hugetlb.*.numa_stat file of the cgroup and its parents.

Cc: Mike Kravetz
Cc: Andrew Morton
Cc: Shuah Khan
Cc: Miaohe Lin
Cc: Oscar Salvador
Cc: Michal Hocko
Cc: Muchun Song
Cc: David Rientjes
Cc: Shakeel Butt
Cc: Jue Wang
Cc: Yang Yao
Cc: Joanna Li
Cc: Cannon Matthews
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Mina Almasry

---

Changes in v4:
- Removed unnecessary braces.
- Usage is now counted in pages instead of bytes.
- Reverted unneeded changes to write_to_hugetlbfs.c

Changes in v3:
- Fixed typos (sorry!)
- Used conventional locations for cgroups mount points in docs/commit
  message.
- Updated docs.
- Handle kzalloc_node failure, and proper deallocation of per node data.
- Use struct_size() to calculate the struct size.
- Use nr_node_ids instead of MAX_NUMNODES.
- Updated comments per multi-line comment pattern.

Changes in v2:
- Fix warning Reported-by: kernel test robot

---
 .../admin-guide/cgroup-v1/hugetlb.rst   |   4 +
 Documentation/admin-guide/cgroup-v2.rst |   5 +
 include/linux/hugetlb.h                 |   4 +-
 include/linux/hugetlb_cgroup.h          |   7 ++
 mm/hugetlb_cgroup.c                     | 113 ++++++++++++++++--
 5 files changed, 122 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
index 338f2c7d7a1c..0fa724d82abb 100644
--- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
+++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
@@ -29,12 +29,14 @@ Brief summary of control files::
   hugetlb.<hugepagesize>.max_usage_in_bytes  # show max "hugepagesize" hugetlb usage recorded
   hugetlb.<hugepagesize>.usage_in_bytes      # show current usage for "hugepagesize" hugetlb
   hugetlb.<hugepagesize>.failcnt             # show the number of allocation failure due to HugeTLB usage limit
+  hugetlb.<hugepagesize>.numa_stat           # show the numa information of the hugetlb memory charged to this cgroup
 
 For a system supporting three hugepage sizes (64k, 32M and 1G), the control
 files include::
 
   hugetlb.1GB.limit_in_bytes
   hugetlb.1GB.max_usage_in_bytes
+  hugetlb.1GB.numa_stat
   hugetlb.1GB.usage_in_bytes
   hugetlb.1GB.failcnt
   hugetlb.1GB.rsvd.limit_in_bytes
@@ -43,6 +45,7 @@ files include::
   hugetlb.1GB.rsvd.failcnt
   hugetlb.64KB.limit_in_bytes
   hugetlb.64KB.max_usage_in_bytes
+  hugetlb.64KB.numa_stat
   hugetlb.64KB.usage_in_bytes
   hugetlb.64KB.failcnt
   hugetlb.64KB.rsvd.limit_in_bytes
@@ -51,6 +54,7 @@ files include::
   hugetlb.64KB.rsvd.failcnt
   hugetlb.32MB.limit_in_bytes
   hugetlb.32MB.max_usage_in_bytes
+  hugetlb.32MB.numa_stat
   hugetlb.32MB.usage_in_bytes
   hugetlb.32MB.failcnt
   hugetlb.32MB.rsvd.limit_in_bytes
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 4d8c27eca96b..356847f8f008 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2252,6 +2252,11 @@ HugeTLB Interface Files
 	are local to the cgroup i.e. not hierarchical. The file modified event
 	generated on this file reflects only the local events.
 
+  hugetlb.<hugepagesize>.numa_stat
+	Similar to memory.numa_stat, it shows the numa information of the
+	hugetlb pages of <hugepagesize> in this cgroup. Only active in
+	use hugetlb pages are included. The per-node values are in bytes.
+
 Misc
 ----
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 1faebe1cd0ed..0445faaa636e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -613,8 +613,8 @@ struct hstate {
 #endif
 #ifdef CONFIG_CGROUP_HUGETLB
 	/* cgroup control files */
-	struct cftype cgroup_files_dfl[7];
-	struct cftype cgroup_files_legacy[9];
+	struct cftype cgroup_files_dfl[8];
+	struct cftype cgroup_files_legacy[10];
 #endif
 	char name[HSTATE_NAME_LEN];
 };
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index c137396129db..54ff6ec68ed3 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -36,6 +36,11 @@ enum hugetlb_memory_event {
 	HUGETLB_NR_MEMORY_EVENTS,
 };
 
+struct hugetlb_cgroup_per_node {
+	/* hugetlb usage in bytes over all hstates. */
+	unsigned long usage[HUGE_MAX_HSTATE];
+};
+
 struct hugetlb_cgroup {
 	struct cgroup_subsys_state css;
 
@@ -57,6 +62,8 @@ struct hugetlb_cgroup {
 
 	/* Handle for "hugetlb.events.local" */
 	struct cgroup_file events_local_file[HUGE_MAX_HSTATE];
+
+	struct hugetlb_cgroup_per_node *nodeinfo[];
 };
 
 static inline struct hugetlb_cgroup *
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 5383023d0cca..4717465f5307 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -126,29 +126,58 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
 	}
 }
 
+static void hugetlb_cgroup_free(struct hugetlb_cgroup *h_cgroup)
+{
+	int node;
+
+	for_each_node(node)
+		kfree(h_cgroup->nodeinfo[node]);
+	kfree(h_cgroup);
+}
+
 static struct cgroup_subsys_state *
 hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
 	struct hugetlb_cgroup *parent_h_cgroup = hugetlb_cgroup_from_css(parent_css);
 	struct hugetlb_cgroup *h_cgroup;
+	int node;
+
+	h_cgroup = kzalloc(struct_size(h_cgroup, nodeinfo, nr_node_ids),
+			   GFP_KERNEL);
 
-	h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
 	if (!h_cgroup)
 		return ERR_PTR(-ENOMEM);
 
 	if (!parent_h_cgroup)
 		root_h_cgroup = h_cgroup;
 
+	/*
+	 * TODO: this routine can waste much memory for nodes which will
+	 *	 never be onlined. It's better to use memory hotplug callback
+	 *	 function.
+	 */
+	for_each_node(node) {
+		/* Set node_to_alloc to -1 for offline nodes. */
+		int node_to_alloc =
+			node_state(node, N_NORMAL_MEMORY) ? node : -1;
+		h_cgroup->nodeinfo[node] =
+			kzalloc_node(sizeof(struct hugetlb_cgroup_per_node),
+				     GFP_KERNEL, node_to_alloc);
+		if (!h_cgroup->nodeinfo[node])
+			goto fail_alloc_nodeinfo;
+	}
+
 	hugetlb_cgroup_init(h_cgroup, parent_h_cgroup);
 	return &h_cgroup->css;
+
+fail_alloc_nodeinfo:
+	hugetlb_cgroup_free(h_cgroup);
+	return ERR_PTR(-ENOMEM);
 }
 
 static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css)
 {
-	struct hugetlb_cgroup *h_cgroup;
-
-	h_cgroup = hugetlb_cgroup_from_css(css);
-	kfree(h_cgroup);
+	hugetlb_cgroup_free(hugetlb_cgroup_from_css(css));
 }
 
 /*
@@ -292,7 +321,8 @@ static void __hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
 		return;
 
 	__set_hugetlb_cgroup(page, h_cg, rsvd);
-	return;
+	if (!rsvd && h_cg)
+		h_cg->nodeinfo[page_to_nid(page)]->usage[idx] += nr_pages;
 }
 
 void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
@@ -331,7 +361,8 @@ static void __hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
 
 	if (rsvd)
 		css_put(&h_cg->css);
-
+	else
+		h_cg->nodeinfo[page_to_nid(page)]->usage[idx] -= nr_pages;
 	return;
 }
 
@@ -421,6 +452,58 @@ enum {
 	RES_RSVD_FAILCNT,
 };
 
+static int hugetlb_cgroup_read_numa_stat(struct seq_file *seq, void *dummy)
+{
+	int nid;
+	struct cftype *cft = seq_cft(seq);
+	int idx = MEMFILE_IDX(cft->private);
+	bool legacy = MEMFILE_ATTR(cft->private);
+	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(seq_css(seq));
+	struct cgroup_subsys_state *css;
+	unsigned long usage;
+
+	if (legacy) {
+		/* Add up usage across all nodes for the non-hierarchical total. */
+		usage = 0;
+		for_each_node_state(nid, N_MEMORY)
+			usage += h_cg->nodeinfo[nid]->usage[idx];
+		seq_printf(seq, "total=%lu", usage * PAGE_SIZE);
+
+		/* Simply print the per-node usage for the non-hierarchical total. */
+		for_each_node_state(nid, N_MEMORY)
+			seq_printf(seq, " N%d=%lu", nid,
+				   h_cg->nodeinfo[nid]->usage[idx] * PAGE_SIZE);
+		seq_putc(seq, '\n');
+	}
+
+	/*
+	 * The hierarchical total is pretty much the value recorded by the
+	 * counter, so use that.
+	 */
+	seq_printf(seq, "%stotal=%lu", legacy ? "hierarchical_" : "",
+		   page_counter_read(&h_cg->hugepage[idx]) * PAGE_SIZE);
+
+	/*
+	 * For each node, traverse the css tree to obtain the hierarchical
+	 * node usage.
+	 */
+	for_each_node_state(nid, N_MEMORY) {
+		usage = 0;
+		rcu_read_lock();
+		css_for_each_descendant_pre(css, &h_cg->css) {
+			usage += hugetlb_cgroup_from_css(css)
+					 ->nodeinfo[nid]
+					 ->usage[idx];
+		}
+		rcu_read_unlock();
+		seq_printf(seq, " N%d=%lu", nid, usage * PAGE_SIZE);
+	}
+
+	seq_putc(seq, '\n');
+
+	return 0;
+}
+
 static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
 				   struct cftype *cft)
 {
@@ -671,8 +754,14 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
 				    events_local_file[idx]);
 	cft->flags = CFTYPE_NOT_ON_ROOT;
 
-	/* NULL terminate the last cft */
+	/* Add the numa stat file */
 	cft = &h->cgroup_files_dfl[6];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
+	cft->seq_show = hugetlb_cgroup_read_numa_stat;
+	cft->flags = CFTYPE_NOT_ON_ROOT;
+
+	/* NULL terminate the last cft */
+	cft = &h->cgroup_files_dfl[7];
 	memset(cft, 0, sizeof(*cft));
 
 	WARN_ON(cgroup_add_dfl_cftypes(&hugetlb_cgrp_subsys,
@@ -742,8 +831,14 @@ static void __init __hugetlb_cgroup_file_legacy_init(int idx)
 	cft->write = hugetlb_cgroup_reset;
 	cft->read_u64 = hugetlb_cgroup_read_u64;
 
+	/* Add the numa stat file */
+	cft = &h->cgroup_files_legacy[8];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
+	cft->private = MEMFILE_PRIVATE(idx, 1);
+	cft->seq_show = hugetlb_cgroup_read_numa_stat;
+
 	/* NULL terminate the last cft */
-	cft = &h->cgroup_files_legacy[8];
+	cft = &h->cgroup_files_legacy[9];
 	memset(cft, 0, sizeof(*cft));
 
 	WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
--
2.34.0.rc0.344.g81b53c2807-goog