To: Peter Zijlstra, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com, Ingo Molnar
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org
From: 王贇 <yun.wang@linux.alibaba.com>
Subject: [RFC PATCH 0/5] NUMA Balancer Suite
Message-ID: <209d247e-c1b2-3235-2722-dd7c1f896483@linux.alibaba.com>
Date: Mon, 22 Apr 2019 10:10:10 +0800
We have the NUMA Balancing feature, which keeps trying to move a task's
pages to the node where it executes most, yet it still has two gaps:

  * page cache is not handled
  * there is no cgroup-level balancing

Suppose we have a box with 4 CPUs split across two nodes, and two
cgroups A & B each running 4 tasks; the placement below is easily
observed:

	NODE0			|	NODE1
				|
	CPU0	CPU1		|	CPU2	CPU3
	task_A0	task_A1		|	task_A2	task_A3
	task_B0	task_B1		|	task_B2	task_B3

usually with equal memory consumption on each node, when the tasks
behave similarly.

In this case NUMA balancing tries to move the pages of task_A0,1 &
task_B0,1 to node 0 and the pages of task_A2,3 & task_B2,3 to node 1,
but page cache ends up placed randomly, depending on which CPU first
read or wrote each file.

Now consider another placement:

	NODE0			|	NODE1
				|
	CPU0	CPU1		|	CPU2	CPU3
	task_A0	task_A1		|	task_B0	task_B1
	task_A2	task_A3		|	task_B2	task_B3

By swapping the cpu & memory resources of task_A0,1 and task_B0,1, the
workload of cgroup A now runs entirely on node 0 and that of cgroup B
entirely on node 1. Resource consumption is unchanged, and related
tasks can share a closer cpu cache, but the page cache is still
randomly located.

Now what if the workloads generate lots of page cache, and most of the
memory accesses are page cache writes?

Page cache generated by task_A0 on NODE1 won't follow it to NODE0; but
if task_A0 had already been on NODE0 before it read/wrote the files,
the caches would have been allocated there. So how do we make that
happen?

Usually this is solved by binding the workload to a single node: if
cgroup A is bound to CPU0,1, then all the caches it generates will be
on NODE0 and the numa bonus is maximal (a cpuset sketch of such a
binding follows at the end of this letter). However, this requires
very careful administration of each specific workload; if, as in our
case, A & B have CPU requirements that vary between 0% and 400%, then
binding them to a single node is a bad idea.

So what we need is a way to detect memory topology at the cgroup
level, and to migrate cpu/mem resources toward the node holding most
of a cgroup's caches, as long as that node has resources to spare.

This patch set introduces:
  * advanced per-cgroup numa statistics
  * a per-cgroup preferred numa node
  * the Numa Balancer module

which together allow easy and flexible numa resource assignment, to
gain as much numa bonus as possible.

Michael Wang (5):
  numa: introduce per-cgroup numa balancing locality statistic
  numa: append per-node execution info in memory.numa_stat
  numa: introduce per-cgroup preferred numa node
  numa: introduce numa balancer infrastructure
  numa: numa balancer

 drivers/Makefile             |   1 +
 drivers/numa/Makefile        |   1 +
 drivers/numa/numa_balancer.c | 715 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/memcontrol.h   |  99 ++++++
 include/linux/sched.h        |   9 +-
 kernel/sched/debug.c         |   8 +
 kernel/sched/fair.c          |  41 +++
 mm/huge_memory.c             |   7 +-
 mm/memcontrol.c              | 246 +++++++++++++++
 mm/memory.c                  |   9 +-
 mm/mempolicy.c               |   4 +
 11 files changed, 1133 insertions(+), 7 deletions(-)
 create mode 100644 drivers/numa/Makefile
 create mode 100644 drivers/numa/numa_balancer.c

-- 
2.14.4.44.g2045bb6
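
For reference, here is what the manual binding mentioned above could
look like through the existing cpuset controller. This is only a
sketch of the approach the series tries to make unnecessary; it
assumes a cgroup-v1 cpuset hierarchy mounted at /sys/fs/cgroup/cpuset
and the 2-node topology from the example, with $PID_A/$PID_B standing
in for the workloads' task ids:

  # pin cgroup A to CPU0,1 and node 0 memory, B to CPU2,3 and node 1
  mkdir /sys/fs/cgroup/cpuset/A /sys/fs/cgroup/cpuset/B

  echo 0-1 > /sys/fs/cgroup/cpuset/A/cpuset.cpus
  echo 0   > /sys/fs/cgroup/cpuset/A/cpuset.mems

  echo 2-3 > /sys/fs/cgroup/cpuset/B/cpuset.cpus
  echo 1   > /sys/fs/cgroup/cpuset/B/cpuset.mems

  # move the tasks in; since cpuset.mems also constrains page cache
  # allocation, caches the tasks generate now stay on their node
  echo $PID_A > /sys/fs/cgroup/cpuset/A/tasks
  echo $PID_B > /sys/fs/cgroup/cpuset/B/tasks

This gains the full numa bonus but, as argued above, only works while
the workload fits inside one node.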
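
On the statistics side, the per-memcg memory.numa_stat file already
reports per-node page counts, and patch 2 appends per-node execution
info to the same file. The first four lines below show the existing,
documented cgroup-v1 format (values are made up for a 2-node box); the
exact name and layout of the appended line are defined by patch 2, so
the last line is only a placeholder:

  $ cat /sys/fs/cgroup/memory/A/memory.numa_stat
  total=200 N0=120 N1=80
  file=60 N0=45 N1=15
  anon=140 N0=75 N1=65
  unevictable=0 N0=0 N1=0
  ...                    <- per-node execution info appended by patch 2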