Date: Mon, 18 Oct 2021 20:59:16 +0530
Subject: Re: [RFC 0/5] kernel: Introduce CPU Namespace
To: Tejun Heo
Cc: Christian Brauner, bristot@redhat.com, christian@brauner.io,
 ebiederm@xmission.com, lizefan.x@bytedance.com, hannes@cmpxchg.org,
 mingo@kernel.org, juri.lelli@redhat.com, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org,
 containers@lists.linux.dev, containers@lists.linux-foundation.org,
 pratik.r.sampat@gmail.com
References: <20211009151243.8825-1-psampat@linux.ibm.com>
 <20211011101124.d5mm7skqfhe5g35h@wittgenstein>
From: Pratik Sampat
X-Mailing-List: linux-kernel@vger.kernel.org

On 15/10/21 3:44 am, Tejun Heo wrote:
> Hello,
>
> On Tue, Oct 12, 2021 at 02:12:18PM +0530, Pratik Sampat wrote:
>>>> The control and the display interface is fairly disjoint with each
>>>> other. Restrictions can be set through control interfaces like cgroups,
>>> A task wouldn't really opt-in to cpu isolation with CLONE_NEWCPU it
>>> would only affect resource reporting. So it would be one half of the
>>> semantics of a namespace.
>>>
>> I completely agree with you on this, fundamentally a namespace should
>> isolate both the resource as well as the reporting. As you mentioned
>> too, cgroups handles the resource isolation while this namespace
>> handles the reporting and this seems to break the semantics of what a
>> namespace should really be.
>>
>> The CPU resource is unique in that sense, at least in this context,
>> which makes it tricky to design an interface that presents coherent
>> information.
> It's only unique in the context that you're trying to place CPU distribution
> into the namespace framework when the resource in question isn't distributed
> that way. All of the three major local resources - CPU, memory and IO - are
> in the same boat. Computing resources, the physical ones, don't render
> themselves naturally to accounting and distributing by segmenting _name_
> spaces which ultimately just shows and hides names. This direction is a
> dead-end.
>
>> I too think that having a brand new interface all together and teaching
>> userspace about it is a much cleaner approach.
>> On the same lines, if we were to do that, we could also add more useful
>> metrics in that interface like ballpark number of threads to saturate
>> usage as well as gather more such metrics as suggested by Tejun Heo.
>>
>> My only concern for this would be that if today applications aren't
>> modifying their code to read the existing cgroup interface and would
>> rather resort to using userspace side-channel solutions like LXCFS or
>> wrapping them up in kata containers, would it now be compelling enough
>> to introduce yet another interface?
> While I'm sympathetic to compatibility argument, identifying available
> resources was never well-defined with the existing interfaces. Most of the
> available information is what hardware is available but there's no
> consistent way of knowing what the software environment is like. Is the
> application the only one on the system? How much memory should be set aside
> for system management, monitoring and other administrative operations?
>
> In practice, the numbers that are available can serve as the starting points
> on top of which application and environment specific knowledge has to be
> applied to actually determine deployable configurations, which in turn would
> go through iterative adjustments unless the workload is self-sizing.
>
> Given such variability in requirements, I'm not sure what numbers should be
> baked into the "namespaced" system metrics. Some numbers, e.g., number of
> CPUs may be mapped from cpuset configuration but even that requires
> quite a bit of assumptions about how cpuset is configured and the
> expectations the applications would have while other numbers - e.g.
> available memory - is a total non-starter.
>
> If we try to fake these numbers for containers, what's likely to happen is
> that the service owners would end up tuning workload size against whatever
> number the kernel is showing factoring in all the environmental factors
> knowingly or just through iterations. And that's not *really* an interface
> which provides compatibility. We're just piping new numbers which don't
> really mean what they used to mean and whose meanings can change depending
> on configuration through existing interfaces and letting users figure out
> what to do with the new numbers.
>
> To achieve compatibility where applications don't need to be changed, I
> don't think there is a solution which doesn't involve going through
> userspace. For other cases and long term, the right direction is providing
> well-defined resource metrics that applications can make sense of and use to
> size themselves dynamically.

I agree that major local resources like CPUs and memory cannot be
distributed cleanly in a namespace semantic. Memory, like CPU, faces
similar coherency issues: /proc/meminfo can differ from the
restrictions actually in place.
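
To make the mismatch concrete, here is a minimal sketch that prints the
system-wide numbers next to the cgroup limits. It assumes a cgroup v2
unified hierarchy with the task's own cgroup visible at /sys/fs/cgroup;
that path and all function names below are illustrative assumptions, not
part of the RFC.

```python
import os
from pathlib import Path
from typing import Optional

# Assumption: cgroup v2 unified hierarchy, with the task's own cgroup
# visible at this mount point (typical inside a cgroup namespace).
CGROUP_ROOT = Path("/sys/fs/cgroup")


def parse_meminfo_total(text: str) -> int:
    """Return MemTotal in bytes from /proc/meminfo-style text."""
    for line in text.splitlines():
        if line.startswith("MemTotal:"):
            # Line looks like "MemTotal:       16384000 kB" (units of 1024 bytes).
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def parse_memory_max(text: str) -> Optional[int]:
    """Return the memory.max limit in bytes, or None when it reads "max"."""
    value = text.strip()
    return None if value == "max" else int(value)


def parse_cpu_max(text: str) -> Optional[float]:
    """Return cpu.max ("$MAX $PERIOD") as a number of CPUs, or None if unlimited."""
    quota, period = text.split()
    return None if quota == "max" else int(quota) / int(period)


def read_or_none(path: Path) -> Optional[str]:
    """Read a file, returning None if it is missing or unreadable."""
    try:
        return path.read_text()
    except OSError:
        return None


def report() -> None:
    """Print the system-wide view next to the cgroup limits, if present."""
    meminfo = read_or_none(Path("/proc/meminfo"))
    mem_limit = read_or_none(CGROUP_ROOT / "memory.max")
    cpu_limit = read_or_none(CGROUP_ROOT / "cpu.max")

    if meminfo is not None:
        print("MemTotal (system-wide):", parse_meminfo_total(meminfo), "bytes")
    if mem_limit is not None:
        limit = parse_memory_max(mem_limit)
        print("memory.max (cgroup):", "unlimited" if limit is None else limit)
    print("CPUs (system-wide):", os.cpu_count())
    if cpu_limit is not None:
        cpus = parse_cpu_max(cpu_limit)
        print("cpu.max (cgroup):", "unlimited" if cpus is None else cpus)


if __name__ == "__main__":
    report()
```

Even where the two views disagree, the cgroup numbers are only upper
bounds; they say nothing about what else is competing for the same
resources.
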

While a CPU namespace may not be the preferred way of solving this
problem, the prototype RFC is rather meant for understanding the
problems related to it, as well as other potential directions that we
could explore for solving it.

Also, I agree with your point about variability of requirements. Even
if the interface we provide is consistent with the limits set, if
applications still have to derive metrics from it or from other kernel
information regardless, then the interface would not be useful.
If the solution to this problem lies in userspace, then I'm all for it
as well. However, the intention is to probe whether this could
potentially be solved cleanly in the kernel.

>> While I concur with Tejun Heo's comment in the mail thread and overloading
>> existing interfaces of sys and proc which were originally designed for
>> system wide resources, may not be a great idea:
>>
>>> There is a fundamental problem with trying to represent a resource shared
>>> environment controlled with cgroup using system-wide interfaces including
>>> procfs
>> A fundamental question we probably need to ascertain could be -
>> Today, is it incorrect for applications to look at the sys and procfs to
>> get resource information, regardless of their runtime environment?
> Well, it's incomplete even without containerization. Containerization just
> amplifies the shortcomings. All of these problems existed well before
> cgroups / namespaces. How would you know how much resource you can consume
> on a system just looking at hardware resources without implicit knowledge of
> what else is on the system? It's just that we are now more likely to load
> systems dynamically with containerization.

Yes, these shortcomings exist even without containerization. On a
dynamically loaded multi-tenant system it becomes very difficult to
determine the maximum amount of resources that can be requested before
we hurt our own performance.
cgroups and namespace mechanics help containers give some structure to
the maximum amount of resources that they can consume. However,
applications are unable to leverage that in some cases, especially if
they are more inclined to look at more traditional system-wide
interfaces like sys and proc.

>> Also, if an application were to only be able to view the resources
>> based on the restrictions set regardless of the interface - would there
>> be a disadvantage for them if they could only see an overloaded context
>> sensitive view rather than the whole system view?
> Can you elaborate further? I have a hard time understanding what's being
> asked.

My question is essentially about the implications of overloading the
definitions of existing interfaces to be context sensitive.
The way the prototype works today, it does not interfere with the
information when the system boots or even when it is run in a new
namespace. The effects are only observed when restrictions are applied
to it. Therefore, what would potentially break if interfaces like these
were made to divulge information based on restrictions rather than the
whole system view?

Thanks
Pratik