Message-ID: <014c8afe-e57f-0f31-32bb-cf4ff3d3cb95@linux.alibaba.com>
Date: Thu, 10 Mar 2022 13:47:34 +0800
Subject: Re: [RFC PATCH v2 0/4] Introduce group balancer
From: Tianchen Ding <dtcccc@linux.alibaba.com>
To: Tejun Heo
Cc: Zefan Li, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
 Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
 Daniel Bristot de Oliveira, Johannes Weiner, Michael Wang, Cruz Zhao,
 Masahiro Yamada, Nathan Chancellor, Kees Cook, Andrew Morton,
 Vlastimil Babka, "Gustavo A. R. Silva", Arnd Bergmann, Miguel Ojeda,
 Chris Down, Vipin Sharma, Daniel Borkmann, linux-kernel@vger.kernel.org,
 cgroups@vger.kernel.org
References: <20220308092629.40431-1-dtcccc@linux.alibaba.com>

On 2022/3/10 02:00, Tejun Heo wrote:
> Hello,
>
> On Wed, Mar 09, 2022 at 04:30:51PM +0800, Tianchen Ding wrote:
>> "the sched domains and the load balancer" you mentioned are the ways to
>> "balance" tasks on each domain. However, this patchset aims to "group"
>> them together to win hot cache and less competition, which is different
>> from the load balancer. See the commit log of patch 3/4 and this link:
>> https://lore.kernel.org/all/11d4c86a-40ef-6ce5-6d08-e9d0bc9b512a@linux.alibaba.com/
>
> I read that but it doesn't make a whole lot of sense to me.
> As Peter noted, we already have issues with cross NUMA node balancing
> interacting with in-node balancing, which likely indicates that it needs
> a more unified solution rather than a more fragmented one. I have a hard
> time seeing how adding yet another layer on top helps the situation.
>
>>> * If, for some reason, you need more customizable behavior in terms of
>>> cpu allocation, which is what cpuset is for, maybe it'd be better to
>>> build the load balancer in userspace. That'd fit way better with how
>>> cgroup is used in general and with threaded cgroups, it should fit
>>> nicely with everything else.
>>
>> We put the group balancer in kernel space because this new policy does
>> not depend on userspace apps. It's a "general" feature.
>
> Well, it's general for use cases which are happy with the two knobs that
> you defined for your use case.
>
>> Doing "dynamic cpuset" in userspace may also introduce performance
>> issues, since it may need to bind and unbind different cpusets several
>> times, and is too strict (compared with our "soft bind").
>
> My bet is that you're gonna be able to get just about the same bench
> results with userspace diddling with thread cgroup membership. Why not
> try that first? The interface is already there. I have a hard time
> seeing the justification for hard coding this into the kernel at this
> stage.
>
> Thanks.

Well, I understand your point: put this in userspace. We have considered
that, but found it hard to do. If you have any better idea, please share
it with us. :-)

To build the group balancer in userspace, we would need to:
1) gather load info from each rq periodically;
2) decide how to set cpuset.cpus for each cgroup.

However, both steps are problematic.

For 1), we have to choose how frequently to collect this info, which
affects both overhead and accuracy. If the load changes sharply right
after we sample it, our data are stale and the decision may be wrong.
(In the kernel, faster action can be taken.)

We believe 2) is harder. The specific policy may be complex and vary
across scenarios; there is no general method. E.g., with 16 cpus and 4
cgroups, how do we decide when to set one of them to 0-3 (when busy) or
to 0-7 (when some of the other cgroups are idle)? If cgroupA has many
more threads than cgroupB/C/D, and we want to satisfy cgroupA as far as
possible (on the premise of fairness for B/C/D), dynamically enlarging
its cpuset (when B/C/D are partly idle) and shrinking it (when B/C/D are
busy) requires a complex policy. In this example, fairness and
performance can be provided by the existing scheduler, but when it comes
to keeping caches hot and decreasing competition, both the in-kernel
scheduler and userspace actions struggle. (A rough sketch of such a
userspace loop is appended at the end of this mail.)

What's more, in many cloud computing scenarios there may be hundreds or
thousands of containers, far more than the number of partitions, and
these containers may be dynamically created and destroyed at any time.
Driving such a policy from userspace is not practical.

These problems become easy in kernel space: we get the info directly
from the scheduler and help revise its decisions at some key points, or
do some supporting work (e.g., task migration when possible).

Thanks.
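
For concreteness, here is a minimal sketch of the userspace loop in
steps 1) and 2) above; it illustrates the approach argued against here,
not the group balancer itself. It assumes cgroup v2 is mounted at
/sys/fs/cgroup with the cpu and cpuset controllers enabled; the cgroup
names (cgroupA-cgroupD), the cpu ranges, the sampling period, and the
idle threshold are all hypothetical stand-ins for a real policy.

/*
 * Userspace "dynamic cpuset" sketch: sample B/C/D's cumulative cpu usage
 * from cpu.stat every PERIOD, then enlarge or shrink cgroupA's
 * cpuset.cpus accordingly. Names and thresholds are illustrative only.
 */
#include <stdio.h>
#include <unistd.h>

static const char *others[] = { "cgroupB", "cgroupC", "cgroupD" };
#define NOTHERS 3
#define PERIOD  1     /* sampling interval (seconds); see problem 1) */
#define IDLE    2.0   /* B/C/D together using <2 cpus: "partly idle" */

/* 1) gather load info: cumulative usage_usec from a group's cpu.stat */
static long long usage_usec(const char *grp)
{
	char path[128], line[256];
	long long usec = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/cpu.stat", grp);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "usage_usec %lld", &usec) == 1)
			break;
	fclose(f);
	return usec;
}

/* 2) apply the decision by rewriting the group's cpuset.cpus */
static void set_cpus(const char *grp, const char *range)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/cpuset.cpus", grp);
	f = fopen(path, "w");
	if (f) {
		fprintf(f, "%s\n", range);
		fclose(f);
	}
}

int main(void)
{
	long long prev[NOTHERS], cur;
	int i;

	for (i = 0; i < NOTHERS; i++)
		prev[i] = usage_usec(others[i]);

	for (;;) {
		double used = 0.0;

		sleep(PERIOD);
		for (i = 0; i < NOTHERS; i++) {
			cur = usage_usec(others[i]);
			if (prev[i] < 0 || cur < 0)
				return 1;
			/* cpus consumed by this group over the interval */
			used += (double)(cur - prev[i]) / (PERIOD * 1e6);
			prev[i] = cur;
		}
		/* the 0-3 vs 0-7 decision from the example above */
		set_cpus("cgroupA", used < IDLE ? "0-7" : "0-3");
	}
	return 0;
}

Even this toy version shows both problems: its view of the load is only
as fresh as the last sample, and the single IDLE threshold hard-codes a
policy that would have to differ per scenario and per number of groups.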