From: Gang Li <ligang.bdlg@bytedance.com>
To: John Hubbard, Jonathan Corbet, Ingo Molnar, Peter Zijlstra, Juri Lelli,
 Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
 Daniel Bristot de Oliveira, Valentin Schneider
Cc: linux-api@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, linux-doc@vger.kernel.org, Gang Li
Subject: [PATCH v6 2/2] sched/numa: add per-process numa_balancing
Date: Wed, 12 Apr 2023 22:11:26 +0800
Message-Id: <20230412141127.59741-1-ligang.bdlg@bytedance.com>
X-Mailer: git-send-email 2.32.0
In-Reply-To: <20230412140701.58337-1-ligang.bdlg@bytedance.com>
References: <20230412140701.58337-1-ligang.bdlg@bytedance.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Add PR_NUMA_BALANCING to prctl.

A large number of page faults can cause a performance loss while NUMA
balancing is running. Processes that care about worst-case performance
therefore need NUMA balancing disabled. Others, on the contrary, can
accept a temporary performance loss in exchange for higher average
performance, so enabling NUMA balancing is the better choice for them.

NUMA balancing can currently only be controlled globally through
/proc/sys/kernel/numa_balancing. For the cases above, we want to
disable/enable numa_balancing per process instead.

Set the per-process NUMA balancing mode:

	prctl(PR_NUMA_BALANCING, PR_SET_NUMA_BALANCING_DISABLED); /* disable */
	prctl(PR_NUMA_BALANCING, PR_SET_NUMA_BALANCING_ENABLED);  /* enable */
	prctl(PR_NUMA_BALANCING, PR_SET_NUMA_BALANCING_DEFAULT);  /* follow global */

Get the numa_balancing state:

	prctl(PR_NUMA_BALANCING, PR_GET_NUMA_BALANCING, &ret);
	cat /proc/<pid>/status | grep NumaB_mode

Cc: linux-api@vger.kernel.org
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Acked-by: John Hubbard
---
 Documentation/filesystems/proc.rst   |  2 ++
 fs/proc/task_mmu.c                   | 20 ++++++++++++
 include/linux/mm_types.h             |  3 ++
 include/linux/sched/numa_balancing.h | 45 ++++++++++++++++++++++++++
 include/uapi/linux/prctl.h           |  8 +++++
 kernel/fork.c                        |  4 +++
 kernel/sched/fair.c                  |  9 +++---
 kernel/sys.c                         | 47 ++++++++++++++++++++++++++++
 mm/mprotect.c                        |  6 ++--
 9 files changed, 138 insertions(+), 6 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index bfefcbb8f82b..c9897674fc5e 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -193,6 +193,7 @@ read the file /proc/PID/status::
   VmLib: 1412 kB
   VmPTE: 20 kb
   VmSwap: 0 kB
+  NumaB_mode: default
   HugetlbPages: 0 kB
   CoreDumping: 0
   THP_enabled: 1
@@ -275,6 +276,7 @@ It's slow but very precise.
  VmPTE        size of page table entries
  VmSwap       amount of swap used by anonymous private data
               (shmem swap usage is not included)
+ NumaB_mode   numa balancing mode, set by prctl(PR_NUMA_BALANCING, ...)
  HugetlbPages size of hugetlb memory portions
  CoreDumping  process's memory is currently being dumped
               (killing the process may lead to a corrupted core)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 38b19a757281..3f7263226645 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -19,6 +19,8 @@
 #include
 #include
 #include
+#include
+#include
 #include
 #include
@@ -75,6 +77,24 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		    " kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
 	SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
 	seq_puts(m, " kB\n");
+#ifdef CONFIG_NUMA_BALANCING
+	seq_puts(m, "NumaB_mode:\t");
+	switch (mm->numa_balancing_mode) {
+	case PR_SET_NUMA_BALANCING_DEFAULT:
+		seq_puts(m, "default");
+		break;
+	case PR_SET_NUMA_BALANCING_DISABLED:
+		seq_puts(m, "disabled");
+		break;
+	case PR_SET_NUMA_BALANCING_ENABLED:
+		seq_puts(m, "enabled");
+		break;
+	default:
+		seq_puts(m, "unknown");
+		break;
+	}
+	seq_putc(m, '\n');
+#endif
 	hugetlb_report_usage(m, mm);
 }
 #undef SEQ_PUT_DEC
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3fc9e680f174..bd539d8c1103 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -740,6 +740,9 @@ struct mm_struct {
 		/* numa_scan_seq prevents two threads remapping PTEs. */
 		int numa_scan_seq;
+
+		/* Controls whether NUMA balancing is active for this mm. */
+		int numa_balancing_mode;
 #endif
 		/*
 		 * An operation with batched TLB flushing is going on. Anything
diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index 3988762efe15..fa360d17f52e 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -8,6 +8,8 @@
  */
 #include
+#include
+#include
 #define TNF_MIGRATED	0x01
 #define TNF_NO_GROUP	0x02
@@ -16,12 +18,47 @@
 #define TNF_MIGRATE_FAIL	0x10
 #ifdef CONFIG_NUMA_BALANCING
+DECLARE_STATIC_KEY_FALSE(sched_numa_balancing);
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
 extern pid_t task_numa_group_id(struct task_struct *p);
 extern void set_numabalancing_state(bool enabled);
 extern void task_numa_free(struct task_struct *p, bool final);
 extern bool should_numa_migrate_memory(struct task_struct *p, struct page *page,
 				       int src_nid, int dst_cpu);
+static inline bool numa_balancing_enabled(struct task_struct *p)
+{
+	if (!static_branch_unlikely(&sched_numa_balancing))
+		return false;
+
+	if (p->mm)
+		switch (p->mm->numa_balancing_mode) {
+		case PR_SET_NUMA_BALANCING_ENABLED:
+			return true;
+		case PR_SET_NUMA_BALANCING_DISABLED:
+			return false;
+		default:
+			break;
+		}
+
+	return sysctl_numa_balancing_mode;
+}
+
+static inline int numa_balancing_mode(struct mm_struct *mm)
+{
+	if (!static_branch_unlikely(&sched_numa_balancing))
+		return NUMA_BALANCING_DISABLED;
+
+	if (mm)
+		switch (mm->numa_balancing_mode) {
+		case PR_SET_NUMA_BALANCING_ENABLED:
+			return sysctl_numa_balancing_mode == NUMA_BALANCING_DISABLED ?
+			       NUMA_BALANCING_NORMAL : sysctl_numa_balancing_mode;
+		case PR_SET_NUMA_BALANCING_DISABLED:
+			return NUMA_BALANCING_DISABLED;
+		case PR_SET_NUMA_BALANCING_DEFAULT:
+		default:
+			break;
+		}
+
+	return sysctl_numa_balancing_mode;
+}
 #else
 static inline void task_numa_fault(int last_node, int node, int pages,
 				   int flags)
@@ -42,6 +79,14 @@ static inline bool should_numa_migrate_memory(struct task_struct *p,
 {
 	return true;
 }
+static inline int numa_balancing_mode(struct mm_struct *mm)
+{
+	return 0;
+}
+static inline bool numa_balancing_enabled(struct task_struct *p)
+{
+	return false;
+}
 #endif
 #endif /* _LINUX_SCHED_NUMA_BALANCING_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index f23d9a16507f..7f452f677c61 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -294,4 +294,12 @@ struct prctl_mm_map {
 #define PR_SET_MEMORY_MERGE		67
 #define PR_GET_MEMORY_MERGE		68
+
+/* Set/get enabled per-process numa_balancing */
+#define PR_NUMA_BALANCING		69
+# define PR_SET_NUMA_BALANCING_DISABLED	0
+# define PR_SET_NUMA_BALANCING_ENABLED	1
+# define PR_SET_NUMA_BALANCING_DEFAULT	2
+# define PR_GET_NUMA_BALANCING		3
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 80dca376a536..534ba3566ac0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -99,6 +99,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -1281,6 +1282,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	init_tlb_flush_pending(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+	mm->numa_balancing_mode = PR_SET_NUMA_BALANCING_DEFAULT;
 #endif
 	mm_init_uprobes_state(mm);
 	hugetlb_count_init(mm);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a29ca11bead2..50edc4d89c64 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -47,6 +47,7 @@
 #include
 #include
 #include
+#include
 #include
@@ -2842,7 +2843,7 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 	struct numa_group *ng;
 	int priv;
-	if (!static_branch_likely(&sched_numa_balancing))
+	if (!numa_balancing_enabled(p))
 		return;
 	/* for example, ksmd faulting in a user's mm */
@@ -3220,7 +3221,7 @@ static void update_scan_period(struct task_struct *p, int new_cpu)
 	int src_nid = cpu_to_node(task_cpu(p));
 	int dst_nid = cpu_to_node(new_cpu);
-	if (!static_branch_likely(&sched_numa_balancing))
+	if (!numa_balancing_enabled(p))
 		return;
 	if (!p->mm || !p->numa_faults || (p->flags & PF_EXITING))
@@ -8455,7 +8456,7 @@ static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 	unsigned long src_weight, dst_weight;
 	int src_nid, dst_nid, dist;
-	if (!static_branch_likely(&sched_numa_balancing))
+	if (!numa_balancing_enabled(p))
 		return -1;
 	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
@@ -12061,7 +12062,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		entity_tick(cfs_rq, se, queued);
 	}
-	if (static_branch_unlikely(&sched_numa_balancing))
+	if (numa_balancing_enabled(curr))
 		task_tick_numa(rq, curr);
 	update_misfit_status(curr, rq);
diff --git a/kernel/sys.c b/kernel/sys.c
index a2bd2b9f5683..d3df9fab1858 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -61,6 +61,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -2118,6 +2119,35 @@ static int prctl_set_auxv(struct mm_struct *mm, unsigned long addr,
 	return 0;
 }
+#ifdef CONFIG_NUMA_BALANCING
+static int prctl_pid_numa_balancing_write(int numa_balancing)
+{
+	int old_numa_balancing;
+
+	if (numa_balancing != PR_SET_NUMA_BALANCING_DEFAULT &&
+	    numa_balancing != PR_SET_NUMA_BALANCING_DISABLED &&
+	    numa_balancing != PR_SET_NUMA_BALANCING_ENABLED)
+		return -EINVAL;
+
+	old_numa_balancing = xchg(&current->mm->numa_balancing_mode, numa_balancing);
+
+	if (numa_balancing == old_numa_balancing)
+		return 0;
+
+	if (numa_balancing == PR_SET_NUMA_BALANCING_ENABLED)
+		static_branch_inc(&sched_numa_balancing);
+	else if (old_numa_balancing == PR_SET_NUMA_BALANCING_ENABLED)
+		static_branch_dec(&sched_numa_balancing);
+
+	return 0;
+}
+
+static int prctl_pid_numa_balancing_read(void)
+{
+	return current->mm->numa_balancing_mode;
+}
+#endif
+
 static int prctl_set_mm(int opt, unsigned long addr,
 			unsigned long arg4, unsigned long arg5)
 {
@@ -2674,6 +2704,23 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		error = set_syscall_user_dispatch(arg2, arg3, arg4,
 						  (char __user *) arg5);
 		break;
+#ifdef CONFIG_NUMA_BALANCING
+	case PR_NUMA_BALANCING:
+		switch (arg2) {
+		case PR_SET_NUMA_BALANCING_DEFAULT:
+		case PR_SET_NUMA_BALANCING_DISABLED:
+		case PR_SET_NUMA_BALANCING_ENABLED:
+			error = prctl_pid_numa_balancing_write((int)arg2);
+			break;
+		case PR_GET_NUMA_BALANCING:
+			error = put_user(prctl_pid_numa_balancing_read(),
+					 (int __user *)arg3);
+			break;
+		default:
+			error = -EINVAL;
+			break;
+		}
+		break;
+#endif
 #ifdef CONFIG_SCHED_CORE
 	case PR_SCHED_CORE:
 		error = sched_core_share_pid(arg2, arg3, arg4, arg5);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index afdb6723782e..eb1098f790f2 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -30,6 +30,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -165,10 +166,11 @@ static long change_pte_range(struct mmu_gather *tlb,
 			 * Skip scanning top tier node if normal numa
			 * balancing is disabled
			 */
-			if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
+			if (!(numa_balancing_mode(vma->vm_mm) & NUMA_BALANCING_NORMAL) &&
 			    toptier)
 				continue;
-			if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
+			if (numa_balancing_mode(vma->vm_mm) &
+			    NUMA_BALANCING_MEMORY_TIERING &&
 			    !toptier)
 				xchg_page_access_time(page,
 						      jiffies_to_msecs(jiffies));
-- 
2.20.1