Received: by 2002:a05:6358:bb9e:b0:b9:5105:a5b4 with SMTP id df30csp5711277rwb; Wed, 7 Sep 2022 07:02:15 -0700 (PDT) X-Google-Smtp-Source: AA6agR7cBR8FqK3SL2zmowE5fzZ7OTeo6rMMLbsD997R1cPwkDnXGvNcPFn7XOB0Ng/0T5z9wo/l X-Received: by 2002:a17:907:75f6:b0:771:b827:938 with SMTP id jz22-20020a17090775f600b00771b8270938mr325810ejc.582.1662559335435; Wed, 07 Sep 2022 07:02:15 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1662559335; cv=none; d=google.com; s=arc-20160816; b=c7BbHEXTQ49yDX/0zhAqbL9HOI/JIAquIr4S4xh+pJN3WdWCJPnNxtfiSUbimo0HGZ RHTyEriB5v7FxH+YaVTQEFyMe45PFw2Ztdm88AThz2UIG2plTQOPbtef0ibTcqaCt1K0 uk8cat7d8FFnS+qzgWGFmgJBLXpCC5jOOYkcoRb7LLeewoXeE/JV4GW8ZXzDSpKlE4vN 1kQg5kAeb5VuByaqMw8lCz0Kpve6vy4rCK0ADwF5VKH6g8Xc27xllYWJTW2OJn71hYSU W8nbbI8O25r9lWCGZq+BIfh6zc84PfnMGTUZG8hRN5aAyNBIKSkfb4xFA3N75sy3+31s xA/A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to:from :references:cc:to:subject:user-agent:mime-version:date:message-id :dkim-signature; bh=MgiF2QhoVH7izaAC6v9HdSeip3/dnqOw2S5EnYEL7zk=; b=jvgsCZfe+9G+Hdb8/P+SWbnEt48SV799ylL0zI5nhFjVEX0Dbm0Pm5Xea7rwpDb4hv 6oGDzgGNyJrAntYQ0xZwwYOjGHhbRUkCHt+0sRcZ2Zj7uzKMcus6jwLUo+yPvIXFqnpe 623mvqDzK6mNC3l7BeVGUq/c4sFwi0Wej323CNwKnCKf+SdRNWfQup98GF7RpGVIqTUz arG9OgKvEnjkqhxK5ClIsOfnxD7WBKQmkYJbz5YeHwHCcXZp4iBLl5MHHXcpS/XpvXvZ qwfK9yU8jziMKDwoOQpKBDyMM+NlbfoKhVW+DKF7+mfY3l0g7E3LpKAJl/6bGCzRTs3g X7lQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@bytedance-com.20210112.gappssmtp.com header.s=20210112 header.b=tyBysNta; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=bytedance.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id dt13-20020a170907728d00b007422127d084si12145811ejc.742.2022.09.07.07.01.49; Wed, 07 Sep 2022 07:02:15 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@bytedance-com.20210112.gappssmtp.com header.s=20210112 header.b=tyBysNta; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=bytedance.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229833AbiIGNuu (ORCPT + 99 others); Wed, 7 Sep 2022 09:50:50 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33668 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229780AbiIGNur (ORCPT ); Wed, 7 Sep 2022 09:50:47 -0400 Received: from mail-pl1-x636.google.com (mail-pl1-x636.google.com [IPv6:2607:f8b0:4864:20::636]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6FDFBA50E4 for ; Wed, 7 Sep 2022 06:50:35 -0700 (PDT) Received: by mail-pl1-x636.google.com with SMTP id t3so9537049ply.2 for ; Wed, 07 Sep 2022 06:50:34 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance-com.20210112.gappssmtp.com; s=20210112; h=content-transfer-encoding:in-reply-to:from:references:cc:to:subject :user-agent:mime-version:date:message-id:from:to:cc:subject:date; bh=MgiF2QhoVH7izaAC6v9HdSeip3/dnqOw2S5EnYEL7zk=; b=tyBysNtakVLwLosd6NM/vK1CqUudzQqZPlC72cw1rfa2hg2wgZvjK0mTMetrBlukH7 tZb+67+N/zPM5Tmb+bOM1Sykm7wQgEVgj8BUBD8vCGwxoSYiRU/hUjveit3Ud+/2bmBC ZR7Ssg78La9RTA7wxRojAleaPa7l0715zdhc2pbBLmjwJTFUPto3Fv10Vvy6GhelSdFG A+8ZMtgwXszLUMihVXzN7kstNCgHJEzSn/Aw+XPw44wzu6MDLjAWgNOu8Ej6gsuX4eOm RTYQun6Tw78tSBtvWIMsySt1k72KvjHw8OpB3O1h7WMhOXb/pDiZ/L+kDny2Jt3KubrT H/GQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:in-reply-to:from:references:cc:to:subject :user-agent:mime-version:date:message-id:x-gm-message-state:from:to :cc:subject:date; bh=MgiF2QhoVH7izaAC6v9HdSeip3/dnqOw2S5EnYEL7zk=; b=xCsGpjXy6qxH4C4sPbhSQMYBjyov1bPXH3affkYbRM5Z2AGfMdpTBA0S+Bwc+CCeyk 2fByHYd9uyIUcG/hVzy+7UwS1eHzEUTUc6f73Nn9iZ9R0rCBz3tBv3Tp51ln84X+208b ZJ/RW25npcOiyTEpf4+St82LRQy3sc6JBsJRLoUDkHYerqZsio6zAoMK04AV9qf8yxCH uw3VVuxVXsJRk4zf9ceFuOW4rOipFJ3nDx2s5N7d/+3a2sKjMmrNPumAAyYN4+Y6JSVd 0+FjKviCsA+RPKXBBlZeNS3EraI3J7g7jaTI9PHF+CmEmByXHIVMNg7B2fOEtB7kHMX+ cZgw== X-Gm-Message-State: ACgBeo0BOl5TpRtahqX4IIyIdfN2cJXR1Q0a3OflrzY9ObeoO7s43Pgn Dt04QxrOfk556PRCYRNn+b67xT6sS3u7Xg== X-Received: by 2002:a17:90a:c1:b0:1f4:f757:6b48 with SMTP id v1-20020a17090a00c100b001f4f7576b48mr30371570pjd.56.1662558633815; Wed, 07 Sep 2022 06:50:33 -0700 (PDT) Received: from [10.255.85.171] ([139.177.225.235]) by smtp.gmail.com with ESMTPSA id a7-20020aa78e87000000b0053bb934eaa9sm8324011pfr.201.2022.09.07.06.50.30 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 07 Sep 2022 06:50:33 -0700 (PDT) Message-ID: <93d76370-6c43-5560-9a5f-f76a8cc979e0@bytedance.com> Date: Wed, 7 Sep 2022 21:50:24 +0800 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Thunderbird/91.13.0 Subject: Re: [External] Re: [PATCH] cgroup/cpuset: Add a new isolated mems.policy type. To: Michal Hocko Cc: hannes@cmpxchg.org, roman.gushchin@linux.dev, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org, lizefan.x@bytedance.com, wuyun.abel@bytedance.com References: <20220904040241.1708-1-hezhongkun.hzk@bytedance.com> <0e5f380b-9201-0f56-9144-ce8449491fc8@bytedance.com> From: Zhongkun He In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-6.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,NICE_REPLY_A,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >> Hi Michal, thanks for your reply. >> >>>> Say parent has a stronger requirement (say bind) than a child(prefer)? >> >> Yes, combine all these together. > > What is the semantic of the resulting policy? > >> The parent's task will use 'bind', child's >> use 'prefer'.This is the current implementation, and we can discuss and >> modify it together if there are other suggestions. >> >> 1:Existing shortcomings >> >> In our use case, the application and the control plane are two separate >> systems. When the application is created, it doesn't know how to use memory, >> and it doesn't care. The control plane will decide the memory usage policy >> based on different reasons (the attributes of the application itself, the >> priority, the remaining resources of the system). Currently, numactl is used >> to set it at program startup, and the child process will inherit the >> mempolicy. > > Yes this is common practice I have seen so far. > >> But we can't dynamically adjust the memory policy, except >> restart, the memory policy will not change. > > Do you really need to change the policy itself or only the effective > nodemask? I mean what is your usecase to go from say mbind to preferred > policy? Do you need any other policy than bind and preferred? > >> 2:Our goals >> >> For the above reasons, we want to create a mempolicy at the cgroup level. >> Usually processes under a cgroup have the same priority and attributes, and >> we can dynamically adjust the memory allocation strategy according to the >> remaining resources of the system. For example, a low-priority cgroup uses >> the 'bind:2-3' policy, and a high-priority cgroup uses bind:0-1. When >> resources are insufficient, it will be changed to bind:3, bind:0-2 by >> control plane, etc.Further more, more mempolicy can be extended, such as >> allocating memory according to node weight, etc. > > Yes, I do understand that you want to change the node affinity and that > is already possible with cpuset cgroup. The existing constrain is that > the policy is hardcoded mbind IIRC. So you cannot really implement a dynamic > preferred policy which would make some sense to me. The question is how > to implement that with a sensible semantic. It is hard to partition the > system into several cgroups if subset allows to spill over to others. > Say something like the following > root (nodes=0-3) > / \ > A (0, 1) B (2, 3) > > if both are MBIND then this makes sense because they are kinda isolated > (at least for user allocations) but if B is PREFERRED and therefore > allowed to use nodes 0 and 1 then it can deplete the memory from A and > therefore isolation doesn't work at all. > > I can imagine that the all cgroups would use PREFERRED policy and then > nobody can expect anything and the configuration is mostly best effort. > But it feels like this is an abuse of the cgroup interface and a proper > syscall interface is likely due. Would it make more sense to add > pidfd_set_mempolicy and allow sufficiently privileged process to > manipulate default memory policy of a remote process? Hi Michal, thanks for your reply. > Do you really need to change the policy itself or only the effective > nodemask? Do you need any other policy than bind and preferred? Yes, we need to change the policy, not only his nodemask. we really want policy is interleave, and extend it to weight-interleave. Say something like the following node weight interleave: 0-3 1:1:1:1 default one by one weight-interleave: 0-3 1:2:4:6 alloc pages by weight (User set weight.) In the actual usecase, the remaining resources of each node are different, and the use of interleave cannot maximize the use of resources. Back to the previous question. >The question is how to implement that with a sensible semantic. Thanks for your analysis and suggestions.It is really difficult to add policy directly to cgroup for the hierarchical enforcement. It would be a good idea to add pidfd_set_mempolicy. Also, there is a new idea. We can try to separate the elements of mempolicy and use them independently. Mempolicy has two meanings: nodes:which nodes to use(nodes,0-3), we can use cpuset's effective_mems directly. mode:how to use them(bind,prefer,etc). change the mode to a cpuset->flags,such as CS_INTERLEAVE。 task_struct->mems_allowed is equal to cpuset->effective_mems,which is hierarchical enforcement。CS_INTERLEAVE can also be updated into tasks, just like other flags(CS_SPREAD_PAGE). When a process needs to allocate memory, it can find the appropriate node to allocate pages according to the flag and mems_allowed. thanks.