Date: Wed, 9 Mar 2022 16:30:51 +0800
Subject: Re: [RFC PATCH v2 0/4] Introduce group balancer
To: Tejun Heo
Cc: Zefan Li, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Johannes Weiner, Michael Wang, Cruz Zhao,
Masahiro Yamada, Nathan Chancellor, Kees Cook, Andrew Morton, Vlastimil Babka, "Gustavo A. R. Silva", Arnd Bergmann, Miguel Ojeda, Chris Down, Vipin Sharma, Daniel Borkmann, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
References: <20220308092629.40431-1-dtcccc@linux.alibaba.com>
From: Tianchen Ding

On 2022/3/9 01:13, Tejun Heo wrote:
> Hello,
>
> On Tue, Mar 08, 2022 at 05:26:25PM +0800, Tianchen Ding wrote:
>> Modern platform are growing fast on CPU numbers. To achieve better
>> utility of CPU resource, multiple apps are starting to sharing the CPUs.
>>
>> What we need is a way to ease confliction in share mode,
>> make groups as exclusive as possible, to gain both performance
>> and resource efficiency.
>>
>> The main idea of group balancer is to fulfill this requirement
>> by balancing groups of tasks among groups of CPUs, consider this
>> as a dynamic demi-exclusive mode. Task trigger work to settle it's
>> group into a proper partition (minimum predicted load), then try
>> migrate itself into it. To gradually settle groups into the most
>> exclusively partition.
>>
>> GB can be seen as an optimize policy based on load balance,
>> it obeys the main idea of load balance and makes adjustment
>> based on that.
>>
>> Our test on ARM64 platform with 128 CPUs shows that,
>> throughput of sysbench memory is improved about 25%,
>> and redis-benchmark is improved up to about 10%.
>
> The motivation makes sense to me but I'm not sure this is the right way to
> architecture it. We already have the framework to do all these - the sched
> domains and the load balancer. Architecturally, what the suggested patchset
> is doing is building a separate load balancer on top of cpuset after using
> cpuset to disable the existing load balancer, which is rather obviously
> convoluted.
>

The "sched domains and the load balancer" you mentioned are ways to "balance" tasks within each domain. This patchset, however, aims to "group" tasks together to keep caches hot and reduce contention, which is a different goal from load balancing. See the commit log of patch 3/4 and this link:
https://lore.kernel.org/all/11d4c86a-40ef-6ce5-6d08-e9d0bc9b512a@linux.alibaba.com/

> * AFAICS, none of what the suggested code does is all that complicated or
> needs a lot of input from userspace. it should be possible to parametrize
> the existing load balancer to behave better.
>

Group balancer mainly needs two inputs from userspace: CPU partition info and cgroup info.
CPU partition info does need user input (and may be a bit complicated). In return, users are free to choose the division method (it can follow NUMA nodes, clusters, caches, etc.).
Cgroup info doesn't need extra input. It's naturally configured.

It does parametrize the existing load balancer to behave better.
Group balancer is a kind of optimization policy: it should obey the basic policy (load balance) and improve on it.
The relationship between the load balancer and group balancer is explained in detail at the link above.

> * If, for some reason, you need more customizable behavior in terms of cpu
> allocation, which is what cpuset is for, maybe it'd be better to build the
> load balancer in userspace. That'd fit way better with how cgroup is used
> in general and with threaded cgroups, it should fit nicely with everything
> else.
>

We put group balancer in kernel space because this new policy does not depend on userspace apps.
It's a "general" feature.
Doing "dynamic cpuset" in userspace may also introduce performance issues, since it may need to bind and unbind different cpusets several times, and it is too strict (compared with our "soft bind").
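
To make "settle a group into the partition with the minimum predicted load" more concrete, here is a small userspace model in Python. It is only a sketch under stated assumptions, not the code from the patchset: the Partition class, predicted_load(), pick_partition(), the per-CPU averaging and the load numbers are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Hypothetical CPU partition: an inclusive CPU range plus per-group load."""
    first_cpu: int
    last_cpu: int
    group_load: dict = field(default_factory=dict)  # group name -> load

    def total_load(self) -> float:
        return sum(self.group_load.values())

    def nr_cpus(self) -> int:
        return self.last_cpu - self.first_cpu + 1


def predicted_load(part: Partition, group: str, load: float) -> float:
    """Per-CPU load of `part` if `group` (carrying `load`) were settled in it.

    Assumption: the prediction is simply the load of all other groups in the
    partition plus the candidate group's load, averaged over the CPUs.
    """
    others = part.total_load() - part.group_load.get(group, 0.0)
    return (others + load) / part.nr_cpus()


def pick_partition(parts: list, group: str, load: float) -> int:
    """Return the index of the partition with the minimum predicted load."""
    return min(range(len(parts)),
               key=lambda i: predicted_load(parts[i], group, load))


# Two made-up partitions on a 128-CPU machine.
parts = [
    Partition(0, 63, {"redis": 40.0}),
    Partition(64, 127, {"sysbench": 10.0}),
]
# A new group prefers the less loaded partition (index 1 here).
preferred = pick_partition(parts, "mysql", 20.0)
```

In this model the result is only a preference ("soft bind"): tasks of the group would try to migrate into the chosen partition, while ordinary load balancing can still place them elsewhere. A cpuset, by contrast, would forbid that outright.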