Date: Thu, 14 Oct 2021 12:14:28 -1000
From: Tejun Heo
To: Pratik Sampat
Cc: Christian Brauner, bristot@redhat.com, christian@brauner.io,
	ebiederm@xmission.com, lizefan.x@bytedance.com, hannes@cmpxchg.org,
	mingo@kernel.org, juri.lelli@redhat.com, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org,
	containers@lists.linux.dev, containers@lists.linux-foundation.org,
	pratik.r.sampat@gmail.com
Subject: Re: [RFC 0/5] kernel: Introduce CPU Namespace
References: <20211009151243.8825-1-psampat@linux.ibm.com>
 <20211011101124.d5mm7skqfhe5g35h@wittgenstein>

Hello,

On Tue, Oct 12, 2021 at 02:12:18PM +0530, Pratik Sampat wrote:
> > > The control and the display interface is fairly disjoint with each
> > > other.
> > > Restrictions can be set through control interfaces like cgroups,
>
> > A task wouldn't really opt-in to cpu isolation with CLONE_NEWCPU; it
> > would only affect resource reporting. So it would be one half of the
> > semantics of a namespace.
>
> I completely agree with you on this: fundamentally, a namespace should
> isolate both the resource and the reporting. As you mentioned too,
> cgroups handles the resource isolation while this namespace handles
> the reporting, and this seems to break the semantics of what a
> namespace should really be.
>
> The CPU resource is unique in that sense, at least in this context,
> which makes it tricky to design an interface that presents coherent
> information.

It's only unique in the sense that you're trying to place CPU
distribution into the namespace framework when the resource in question
isn't distributed that way. All three of the major local resources -
CPU, memory and IO - are in the same boat. Physical computing resources
don't lend themselves naturally to accounting and distribution by
segmenting _name_ spaces, which ultimately just show and hide names.
This direction is a dead-end.

> I too think that having a brand new interface altogether and teaching
> userspace about it is a much cleaner approach. Along the same lines,
> if we were to do that, we could also add more useful metrics to that
> interface, like a ballpark number of threads to saturate usage, as
> well as gather more such metrics as suggested by Tejun Heo.
>
> My only concern for this would be: if today applications aren't
> modifying their code to read the existing cgroup interface and would
> rather resort to userspace side-channel solutions like LXCFS, or to
> wrapping things up in Kata containers, would it now be compelling
> enough to introduce yet another interface?

While I'm sympathetic to the compatibility argument, identifying
available resources was never well-defined with the existing
interfaces.
Most of the available information is about what hardware is available,
but there's no consistent way of knowing what the software environment
is like. Is the application the only one on the system? How much memory
should be set aside for system management, monitoring and other
administrative operations?

In practice, the numbers that are available can serve as starting
points, on top of which application- and environment-specific knowledge
has to be applied to actually determine deployable configurations,
which in turn would go through iterative adjustments unless the
workload is self-sizing.

Given such variability in requirements, I'm not sure what numbers
should be baked into the "namespaced" system metrics. Some numbers,
e.g. the number of CPUs, may be mapped from cpuset configuration, but
even that requires quite a few assumptions about how cpuset is
configured and about the expectations the applications would have,
while other numbers - e.g. available memory - are a total non-starter.

If we try to fake these numbers for containers, what's likely to happen
is that the service owners would end up tuning workload size against
whatever number the kernel is showing, factoring in all the
environmental factors knowingly or just through iteration. And that's
not *really* an interface which provides compatibility. We're just
piping new numbers which don't mean what they used to mean, and whose
meanings can change depending on configuration, through existing
interfaces, and letting users figure out what to do with them.

To achieve compatibility where applications don't need to be changed, I
don't think there is a solution which doesn't involve going through
userspace. For other cases, and in the long term, the right direction
is providing well-defined resource metrics that applications can make
sense of and use to size themselves dynamically.
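To make the cpuset point concrete, here is a rough sketch (illustrative
only, not part of the RFC; the helper name and the parsing are mine) of
the kind of mapping a userspace shim such as LXCFS has to perform from
cgroup v2's cpu.max and cpuset.cpus.effective files, and of the
assumptions it bakes in - e.g. that bandwidth quota divided by period is
a meaningful "CPU count" for the workload:

```python
def effective_cpus(cpu_max: str, cpuset_effective: str) -> float:
    """Ballpark CPU count derived from cgroup v2 settings.

    cpu_max:          contents of cpu.max, e.g. "200000 100000" or
                      "max 100000" (quota and period in microseconds)
    cpuset_effective: contents of cpuset.cpus.effective, e.g. "0-3,8"
    """
    # Count the CPUs in the cpuset list, e.g. "0-3,8" -> 5.
    count = 0
    for part in cpuset_effective.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            count += int(hi) - int(lo) + 1
        else:
            count += 1

    quota, period = cpu_max.split()
    if quota == "max":        # no bandwidth limit; cpuset is the cap
        return float(count)
    # Bandwidth limit expressed as CPUs' worth of time per period,
    # further capped by how many CPUs the cpuset actually allows.
    return min(count, int(quota) / int(period))
```

Even this tiny sketch has to decide which of two unrelated knobs "wins"
and silently assumes the period is sane, which is exactly the kind of
configuration-dependent guesswork described above.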
> While I concur with Tejun Heo's comment in the mail thread that
> overloading existing interfaces of sys and proc, which were originally
> designed for system-wide resources, may not be a great idea:
>
> > There is a fundamental problem with trying to represent a resource
> > shared environment controlled with cgroup using system-wide
> > interfaces including procfs
>
> a fundamental question we probably need to ascertain could be:
> today, is it incorrect for applications to look at sysfs and procfs
> to get resource information, regardless of their runtime environment?

Well, it's incomplete even without containerization. Containerization
just amplifies the shortcomings. All of these problems existed well
before cgroups / namespaces. How would you know how much resource you
can consume on a system just by looking at hardware resources, without
implicit knowledge of what else is on the system? It's just that with
containerization we are now more likely to load systems dynamically.

> Also, if an application were to only be able to view the resources
> based on the restrictions set, regardless of the interface - would
> there be a disadvantage for them if they could only see an overloaded,
> context-sensitive view rather than the whole-system view?

Can you elaborate further? I have a hard time understanding what's
being asked.

Thanks.

-- 
tejun