From: Waiman Long
To: Tejun Heo
Cc: Michal Koutný, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
 linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
 linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, Juri Lelli,
 Valentin Schneider, Frederic Weisbecker, Mrunal Patel, Ryan Phillips,
 Brent Rowsell, Peter Hunt, Phil Auld
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition
Date: Mon, 5 Jun 2023 16:00:39 -0400
On 6/5/23 14:03, Tejun Heo wrote:
> Hello, Waiman.
>
> On Sun, May 28, 2023 at 05:18:50PM -0400, Waiman Long wrote:
>> On 5/22/23 15:49, Tejun Heo wrote:
>> Sorry for the late reply as I had been off for almost 2 weeks due to PTO.
> And me too. Just moved.
>
>>> Why is the syntax different from .cpus? Wouldn't it be better to keep
>>> them the same?
>> Unlike cpuset.cpus, cpuset.cpus.reserve is supposed to contain CPUs that
>> are used in multiple partitions. Also, automatic reservation of adjacent
>> partitions can happen in parallel. That is why I think it will be safer if
> Ah, I see, this is because cpu.reserve is only in the root cgroup, so you
> can't say that the knob is owned by the parent cgroup and thus access is
> controlled that way.
>
> ...
>>>>     There are two types of partitions - adjacent and remote.  The
>>>>     parent of an adjacent partition must be a valid partition root.
>>>>     Partition roots of adjacent partitions are all clustered around
>>>>     the root cgroup.  Creation of an adjacent partition is done by
>>>>     writing the desired partition type into "cpuset.cpus.partition".
>>>>
>>>>     A remote partition does not require a partition root parent.
>>>>     So a remote partition can be formed far from the root cgroup.
>>>>     However, its creation is a 2-step process.  The CPUs needed
>>>>     by a remote partition ("cpuset.cpus" of the partition root)
>>>>     have to be written into "cpuset.cpus.reserve" of the root
>>>>     cgroup first.  After that, "isolated" can be written into
>>>>     "cpuset.cpus.partition" of the partition root to form a remote
>>>>     isolated partition, which is the only supported remote partition
>>>>     type for now.
>>>>
>>>>     All remote partitions are terminal, as adjacent partitions cannot
>>>>     be created underneath them.
>>> Can you elaborate this extra restriction a bit further?
>> Are you referring to the fact that only remote isolated partitions are
>> supported? I do not preclude the support of load-balanced remote
>> partitions. I keep it to isolated partitions for now for ease of
>> implementation, and I am not currently aware of a use case where such a
>> remote partition type is needed.
>>
>> If you are talking about remote partitions being terminal, it is mainly
>> because it can be more tricky to support hierarchical adjacent partitions
>> underneath one, especially if it is not isolated. We can certainly
>> support that if a use case arises. I just don't want to implement code
>> that nobody is really going to use.
>>
>> BTW, with the current way the remote partition is created, it is not
>> possible to have another remote partition underneath it.
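(To make the 2-step process concrete: assuming cgroup v2 is mounted at
/sys/fs/cgroup and a made-up cgroup a/b/rt is to become a remote isolated
partition, the flow would look roughly like this:)

    # The target cgroup already has its CPUs set, e.g.:
    echo 2-3 > /sys/fs/cgroup/a/b/rt/cpuset.cpus

    # Step 1: reserve those CPUs in the root cgroup's reserve file.
    echo 2-3 > /sys/fs/cgroup/cpuset.cpus.reserve

    # Step 2: turn the cgroup into a remote isolated partition.
    echo isolated > /sys/fs/cgroup/a/b/rt/cpuset.cpus.partition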
> The fact that the control is spread across a root-only file and a
> per-cgroup file seems hacky to me. e.g. How would it interact with
> namespacing? Are there reasons why this can't be properly hierarchical
> other than the amount of work needed? For example:
>
> cpuset.cpus.exclusive is a per-cgroup file and represents the mask of CPUs
> that the cgroup holds exclusively. The mask is always a subset of
> cpuset.cpus. The parent loses access to a CPU when the CPU is given to a
> child by setting the CPU in the child's cpus.exclusive, and the CPU can't
> be given to more than one child. IOW, exclusive CPUs are available only to
> the leaf cgroups that have them set in their .exclusive file.
>
> When a cgroup is turned into a partition, its cpuset.cpus and
> cpuset.cpus.exclusive should be the same. For backward compatibility, if
> the cgroup's parent is already a partition, cpuset will automatically
> attempt to add all cpus in cpuset.cpus into cpuset.cpus.exclusive.
>
> I could well be missing something important but I'd really like to see
> something like the above where the reservation feature blends in with the
> rest of cpuset.

It can certainly be made hierarchical as you suggest, but that does
increase complexity from both the user and the kernel points of view. From
the user point of view, there is one more knob to manage hierarchically,
and one that is not used that often. From the kernel point of view, we may
need one more cpumask per cpuset, as the current subparts_cpus is used to
track automatic reservation; another cpumask would be needed to hold the
extra exclusive CPUs not allocated through automatic reservation.

You describe this new control file as the list of CPUs exclusively owned by
a cgroup. Creating a partition is, in fact, allocating exclusive CPUs to a
cgroup, so the new file overlaps with the cpuset.cpus.partition file.
Should a write to cpuset.cpus.exclusive fail if those exclusive CPUs cannot
be granted, or is the exclusive list only meaningful once a valid partition
can be formed? Either way, we need to properly manage the dependency
between these two control files. Alternatively, I have no problem exposing
cpuset.cpus.exclusive as a read-only file; it is only a bit problematic if
we need to make it writable.

As for namespacing, you do raise a good point. I was thinking mostly from a
whole-system point of view, as the use case I am aware of does not need
that. To allow delegation of exclusive CPUs to a child cgroup, that cgroup
has to be a partition root itself. One compromise I can think of is to
allow automatic reservation only in such a scenario. In that case, I would
need to support a remote load-balanced partition as well as hierarchical
sub-partitions underneath it. That can be done with some extra code on top
of the existing v2 patchset without introducing too much complexity. IOW,
remote partitions would only be allowed at the whole-system level, where
one has access to the cgroup root; distribution of exclusive CPUs within a
container could only be done via adjacent partitions with automatic
reservation. Will that be a good enough compromise from your point of view?

Cheers,
Longman
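P.S. To make sure I am reading your cpuset.cpus.exclusive proposal
correctly, here is a rough sketch of the proposed semantics in shell terms
(the cgroup name and mount point are made up, and this is proposed, not
existing, interface):

    # Parent gives CPUs 4-7 exclusively to child p1. The parent loses
    # access to them, and no sibling may claim them.
    echo 4-7 > /sys/fs/cgroup/p1/cpuset.cpus
    echo 4-7 > /sys/fs/cgroup/p1/cpuset.cpus.exclusive

    # Turning p1 into a partition then requires that cpuset.cpus and
    # cpuset.cpus.exclusive match.
    echo root > /sys/fs/cgroup/p1/cpuset.cpus.partition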