Date: Fri, 11 Feb 2005 01:21:12 -0800
From: Paul Jackson
To: Chandra Seetharaman
Cc: colpatch@us.ibm.com, dino@in.ibm.com, mbligh@aracnet.com,
    pwil3058@bigpond.net.au, frankeh@watson.ibm.com, dipankar@in.ibm.com,
    akpm@osdl.org, ckrm-tech@lists.sourceforge.net, efocht@hpce.nec.com,
    lse-tech@lists.sourceforge.net, hch@infradead.org, steiner@sgi.com,
    jbarnes@sgi.com, sylvain.jeaugey@bull.net, djh@sgi.com,
    linux-kernel@vger.kernel.org, Simon.Derr@bull.net, ak@suse.de,
    sivanich@sgi.com
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu
    and memory placement
Message-Id: <20050211012112.4913a3e2.pj@sgi.com>
In-Reply-To: <20050211024606.GB19997@chandralinux.beaverton.ibm.com>

[ For those who have already reached a conclusion on this subject,
  there is little that is new below.  It's just cast in a different
  light, as an analysis of how well the CKRM cpuset/memset task class
  that Chandra describes meets the needs of cpusets.  The conclusion
  is: not well.  A pickup truck and a motorcycle both have their uses.
  It's just difficult to combine them in a useful fashion.  Feel free
  to skim or skip the rest of this message. -pj ]

Chandra writes:
> If I missed some feature of cpuset that shows a bigger problem, please
> let me know.

Perhaps it would be better if first you ask yourself what features your
cpuset/memset task classes provide beyond what's available in the basic
sched_setaffinity (for cpu) and mbind/set_mempolicy (for memory) calls.
Offhand, I don't see any.

But, I will grant, with my apologies, that I wrote the above more in
irritation than in a sincere effort to explain.  So, let me come at
this through another door.

Since it seems apparent by now that both numa placement and workload
management cause some form of mutually exclusive brain damage to their
practitioners, making it difficult for either to understand the other,
let me:

  1) describe the important properties of cpusets,
  2) examine how well your proposal provides such, and
  3) examine its additional costs compared to cpusets.
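For concreteness, here is roughly what those existing per-task calls
look like from user space -- a minimal sketch, with made-up cpu and
node numbers and no error handling, not code from either patch (link
with -lnuma for <numaif.h>):

/*
 * Sketch of the basic placement calls: pin the calling task to a
 * few cpus, and bind its memory allocation to one node.
 */
#define _GNU_SOURCE
#include <sched.h>	/* sched_setaffinity(), CPU_ZERO, CPU_SET */
#include <numaif.h>	/* set_mempolicy(), MPOL_BIND */

int main(void)
{
	cpu_set_t cpus;
	unsigned long nodes = 1UL << 1;		/* memory node 1 only */
	int c;

	/* Pin the calling task to cpus 4-7. */
	CPU_ZERO(&cpus);
	for (c = 4; c <= 7; c++)
		CPU_SET(c, &cpus);
	sched_setaffinity(0, sizeof(cpus), &cpus);

	/* Restrict this task's future memory allocations to node 1. */
	set_mempolicy(MPOL_BIND, &nodes, sizeof(nodes) * 8);

	return 0;
}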
1. The important properties of cpusets.
=======================================

Cpusets facilitate integrated processor and memory placement of jobs
on large systems, especially useful on numa systems, where the
co-ordinated placement of jobs on cpus and memory is important,
sometimes critical, to obtaining good performance.

It is becoming increasingly obvious, as Intel, IBM and AMD push more
and more cores into one package at one end, and as NEC, IBM, Bull, SGI
and others push more and more packages into single image systems at
the other end, that complex layered numa topologies are here to stay,
in increasing number and complexity.

Cpusets helps manage numa placement of jobs in a way that numa folks
seem to find makes sense.  The names of key interface elements, and
the opening remarks in commentary and documentation, are specific and
relevant to the needs of those doing numa placement.

It does so with a minimal, low cost patch in the main kernel.  Running
diffstat on the cpuset* patches in 2.6.11-rc1-mm2 shows the following
summary stats:

  19 files changed, 2362 insertions(+), 253 deletions(-)

The runtime costs are nearly zero, consisting in the usual case on any
hot paths of a usage counter increment at fork, a usage counter
decrement at exit, a usually inconsequential bitmask test in
mm/page_alloc.c, and a generation number check in the mm/mempolicy.c
alloc_page_vma() wrapper to __alloc_pages().

Cpusets handles any number of CPUs and Memory Nodes, with no practical
hard limit imposed by the API or data types.

Cpusets can be used in combination with a workload manager such as
CKRM.  You can use cpusets to create "soft partitions" that are
subsets of the entire system, and then in each such partition you can
run a separate instance of a workload manager to obtain the desired
resource sharing.

Cpusets may provide a practical API to support administrative
refinements of scheduler domains, along more optimal natural job
boundaries, instead of just along automatic, artificial architecture
boundaries.  Matthew and Nick both seem to be making mumblings in this
direction, but the jury is still out.  Indeed, we're still
investigating.  I have not heard of anyone proposing to integrate CKRM
and sched domains in this manner, nor do I expect to.

There is no reason to artificially limit the depth of the cpuset
hierarchy, which represents subsets of subsets of cpus and nodes.  The
rules (invariants) of cpusets have been carefully chosen so as to
never require any global or wide ranging analysis of the cpuset
hierarchy in order to enforce.  Each child must be a subset of its
parent, and exclusive cpusets cannot overlap their siblings.  That's
about it.  Both rules can be evaluated locally, using just the nearest
relatives of an affected cpuset.

An essential feature of the cpuset proposal is its file system model
of the 'nested subsets of cpus and nodes'.  This provides a name
space, and permission model, that supports sensible administration of
numa friendly subsets of the compute resources of large systems in
complex administration environments.  A system can be dynamically
'partitioned' and 'sub-partitioned', with sensible names and
permissions for the partitions, while maintaining the benefits of a
single system image.  This is a classic use of a kernel: to manage a
system wide resource with a name space, structure rules, resource
attributes, and a permission/access model.  (A usage sketch follows
at the end of this section.)

In sum, cpusets provides substantial benefit beyond the individual
sched_setaffinity/mbind/set_mempolicy calls for managing the numa
placement of jobs on large systems, at modest cost in code size,
runtime, maintenance and intellectual mastery.
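As promised, a sketch of what that file system model looks like in
use.  This assumes the cpuset fs is mounted at /dev/cpuset and uses
the control file names from the cpuset patch (cpus, mems, tasks); the
cpuset name and placement values below are made up for illustration:

/*
 * Create a named, permission-controlled 'soft partition', give it
 * cpus and memory nodes, and move the current task into it.
 */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static void put(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");	/* each control file is plain ascii */

	if (f) {
		fputs(val, f);
		fclose(f);
	}
}

int main(void)
{
	char pid[16];

	mkdir("/dev/cpuset/batchjobs", 0755);

	put("/dev/cpuset/batchjobs/cpus", "4-7");	/* cpus allowed  */
	put("/dev/cpuset/batchjobs/mems", "1");		/* nodes allowed */

	/* Move this task (and hence its future children) into it. */
	snprintf(pid, sizeof(pid), "%d", (int)getpid());
	put("/dev/cpuset/batchjobs/tasks", pid);

	return 0;
}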
2. How much of the above does your proposal provide?
====================================================

Not much.

As best as I can tell, it provides an alternative to the existing numa
cpu and memory calls, at the cost of considerable code, complexity and
obtuseness above and beyond cpusets.  That additional complexity may
well be necessary for the more difficult job it is trying to
accomplish.  But it is not necessary for the simpler task of numa
placement of jobs on named, controlled subsets of cpus and memory
nodes.

Your proposal doesn't provide a distinguished "numa computation unit"
(cpu + memory), but rather tends to lose those two elements in a
longer list of task class elements.

I can't tell if it's just because you didn't take much time to study
cpusets, or if it's due to more essential limitations of the CKRM
implementation, but you got the subsetting and exclusive rules wrong
(or at least different) -- see the sketch at the end of this section
for what those two rules amount to.

The CKRM documentation and the names of key flags and such are not
intuitive to those doing numa work.  If one comes at CKRM from the
perspective of someone trying to solve a numa placement problem, the
interfaces, documentation and naming really don't make sense.  Even if
your architecture is more general and powerful, I suspect your
presentation is not widely accessible outside those with a workload
focus.  Or perhaps I'm just more dimwitted than most.  It's difficult
for me to know which.  But certainly both Matthew and I have struggled
to make sense of CKRM from a numa perspective.

You state you'd have a 128 CPU limitation.  I don't know why that
would be, but it would be a critical limitation for SGI -- no small
problem.

As explained below, with your proposal one could not readily do both
workload management and numa placement at the same time, because the
task class hierarchy needed for the two is not the same.

As noted above, while there seems to be a decent chance that cpusets
will provide some benefit to scheduler domains, allowing the option of
organizing sched domains along actual job usage lines instead of
artificial architecture lines, I have seen no suggestion that CKRM
task classes have that potential to improve sched domains.

Elsewhere I recall you've had to impose fairly modest bounds on the
depth of your class hierarchy, because your resource balancing rules
are expensive to evaluate across deep, large trees.  The cpuset
hierarchy has no such constraint.

Your task class hierarchy, if hijacked for numa placement, might
provide the kernel managed naming, structure and access control of
dynamic (soft) numa partitions that cpusets does.  I haven't looked
closely at the permission model of CKRM to see if it matches the needs
of cpusets, so I can't speak to that detail.

In sum, your cpuset/memset CKRM proposal provides few, if any, of the
additional benefits to numa placement work that cpusets provides over
the existing affinity and numa system calls.
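As an aside, to pin down what "the subsetting and exclusive rules"
amount to, here is a hedged sketch of the two invariants from section
1, with plain unsigned long bitmasks standing in for the kernel's
cpumask_t/nodemask_t; the structure and helper names are hypothetical,
not taken from either patch:

/*
 * The two cpuset invariants, each checkable using only a cpuset's
 * nearest relatives -- no global analysis of the hierarchy needed.
 */
struct cpuset {
	unsigned long cpus_allowed;	/* one bit per cpu  */
	unsigned long mems_allowed;	/* one bit per node */
	int cpu_exclusive;
	struct cpuset *parent;
};

/* Rule 1: a cpuset's cpus and mems must be subsets of its parent's. */
static int subset_of_parent(const struct cpuset *cs)
{
	const struct cpuset *p = cs->parent;

	if (!p)				/* the root cpuset has no parent */
		return 1;
	return (cs->cpus_allowed & ~p->cpus_allowed) == 0 &&
	       (cs->mems_allowed & ~p->mems_allowed) == 0;
}

/*
 * Rule 2: an exclusive cpuset's cpus may not overlap a sibling's
 * (the caller applies this against each sibling in turn).
 */
static int exclusive_ok(const struct cpuset *cs, const struct cpuset *sib)
{
	if (!cs->cpu_exclusive)
		return 1;
	return (cs->cpus_allowed & sib->cpus_allowed) == 0;
}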
3. What are the additional costs of your proposal over cpusets?
===============================================================

Your proposal, while it seems to offer little advantage for numa
placement over what we already have without cpusets, comes at a
substantially greater cost than cpusets.

The CKRM patch is five times the size of the cpuset patch, with
diffstat on the ckrm-e17.2610.patch showing:

  65 files changed, 13020 insertions(+), 19 deletions(-)

The CKRM runtime, from what I can tell on the lmbench slide from OLS
2004, costs several percent of available cycles.

You propose to include the cpu/mem placement hierarchy in the task
class hierarchy.  This presents difficulties.  Essentially, they are
not the same hierarchies.  A job's placement is independent of its
priority.  Both high and low priority jobs may well require proper
numa placement, and both high and low priority tasks may well run
within the same cpuset.  So if your task class hierarchy is hijacked
for numa placement, it will not serve you well for workload
management.

On a system that required numa placement using something like cpusets,
the five times larger size of the kernel patch required for CKRM would
be entirely unjustified, as CKRM would only be usable for its
cpuset-like capabilities.  Much of what you have now in CKRM would be
useless for cpuset work.  As you observed in your proposal, you would
need new cpuset related rules for the subset and exclusive properties.

Cpusets needs no new scheduler hook at all -- only the existing
cpus_allowed check that Ingo added, years ago.  You propose having the
scheduler check the appropriate cpu mask in the task class, which
would definitely increase the cache footprint size of the scheduler.

The papers for CKRM speak of providing policy driven classification
and differentiated service.  The focus is on managing resource
sharing, to allow different classes of tasks to get controlled
allocations of proportions of shared resources.  Cpusets is not about
sharing proportions of a common resource, but rather about dedicating
entire resources.  Granted, mathematically, there might be a mapping
between these two.  But it is certainly an impediment to those having
to understand something if it is implemented by abusing something far
larger and quite foreign in intention.

This flows through to the names of the specific files in the directory
representing a cpuset or class.  The names for CKRM class directories
are necessarily rather generic and abstract, whereas those for cpusets
directly represent the particular need of placing tasks on cpus and
memory nodes.  For someone doing numa placement, the latter are much
easier to understand.  And as noted above, since you can't do both at
the same time (both use the CKRM infrastructure for its traditional
workload management and use it for numa placement), it's not like the
administrator of such a system gains anything from the more abstract
names if they are just using it for cpusets (numa placement).

There is no synergy in the kernel hooks required in the scheduler and
memory allocator.  The hooks required by cpusets check bitmasks in
order to allow or prohibit scheduling a task on a CPU, or allocating a
page from a particular node to a task.  These are quite distinct from
the hooks required by CKRM when used as a fair share scheduler and
workload manager, which requires adding delays to tasks in order to
obtain the desired proportion of resource usage between classes.
Similarly, the CKRM memory allocator hooks manage the number of pages
in use by each task class and/or the rate of page faults, while the
cpuset memory allocator hooks manage which memory nodes are available
to satisfy an allocation request.

The share usage hooks that monitor each resource, and its usage by
each class, are useless for cpusets, which has no dependency on
resource usage.  In cpusets, a task can use as much of its allowed
CPUs and Memory Nodes as it likes, without throttling.  There is no
feedback loop based on rates of resource usage per class.

Most of the hooks required by the CKRM classification engine to check
for possible changes in a task's class, such as in fork, exec, setuid,
listen, and other points where a kernel object might change, are not
needed for cpusets.
The cpuset patch only requires such state change hooks in fork, exit
and allocation, and only needs to increment or decrement a usage count
in the fork and exit, and check a generation number in allocation.

Cpusets has no use for a kernel classification engine.  Outside of the
trivial, automatic propagation of cpusets in fork and exit, the only
changes in cpusets are mandated from user space.  Nor do cpusets have
any need for the kernel to support externally defined policy rules.

Cpusets has no use for the classification engine's callback mechanism.
In cpusets, no events that might affect state, such as fork, exit,
reclassifications, changes in uid, or resource rate usage samples,
need to be reported to any state agent, and there is no state agent,
nor any communication channel thereto.

Cpusets has no use for a facility that lets server tasks tell some
external classifier what phase they are operating in.

Cpusets has no need for some workload manager to be sampling resource
consumption and task state to determine resource consumption.

Cpusets has no need to track, in user space or kernel, the state of
tasks after they exit.

Cpusets has no use for delays, nor for tracking them in the task
struct.  Cpusets has no need for the hooks at the entry to, and exit
from, memory allocation routines to distinguish delays due to memory
allocation from those due to application i/o.

Cpusets has no need for sampling task state at fixed intervals, and
our big iron scientific customers would without a doubt not tolerate a
scan of the entire set of tasks every second for such resource and
task state data collection.  Such a scan does _not_ scale well on big
honkin numa boxes.  Whereas CKRM requires something like relayfs to
pass back to user space the constant stream of such data, cpusets has
no such needs and no such data.

Certainly, none of the network hooks that CKRM requires to provide
differentiated service across priority classes would be of any use in
a system (ab)using CKRM to provide cpuset style numa placement.

It is true that both cpusets and CKRM make good use of the Linux
kernel's virtual file system (vfs).  Cpusets uses vfs to model the
hierarchy of 'soft partitions' in the system.  CKRM uses vfs to model
a resource priority hierarchy, essentially replacing a single 'task
priority' with hierarchical resource allocations, managing what
proportion, out of what is available, of fungible resources such as
ticks, cycles, bytes or data transfers a given class of tasks is
allowed to use in the aggregate.  Just because two facilities use vfs
is certainly not sufficient basis for deciding that they should be
combined into one facility.

The shares and stats control files in each task_class directory are
not needed by cpusets, but new control files, for cpus_allowed and
mems_allowed, are needed.  That, or the existing names have to be
overloaded, at the cost of obfuscating the interface.

The kernel hooks for cpusets are fewer, simpler and more specific than
those for CKRM.  Our high performance customers would want the cpuset
hooks compiled in, not the more generic ones for CKRM (which they
could not easily use for any other workload management purpose anyway,
if the task class hierarchy were hijacked for the needs of cpusets, as
noted above).

The development costs of cpusets so far, which are perhaps the best
predictor we have of future costs, have been substantially lower than
they have been for CKRM.

In sum, your proposal costs a lot more than cpusets, by a variety of
metrics.
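To put some flesh on the "nearly zero" runtime cost claimed in section
1 and the fork/exit/allocation hooks just described, here is a hedged
sketch in plain C11, with illustrative names throughout (the actual
patch uses the kernel's atomic_t and fields in the task struct):

/*
 * The only hot-path work cpusets adds: a counter bump at fork, a
 * counter drop at exit, and a generation-checked bitmask test at
 * allocation time.
 */
#include <stdatomic.h>

struct cpuset_lite {
	atomic_int count;		/* tasks attached to this cpuset */
	atomic_int mems_generation;	/* bumped whenever mems change   */
	unsigned long mems_allowed;	/* one bit per memory node       */
};

/* fork: the child inherits its parent's cpuset; one increment. */
void cpuset_fork(struct cpuset_lite *cs)
{
	atomic_fetch_add(&cs->count, 1);
}

/* exit: one decrement. */
void cpuset_exit(struct cpuset_lite *cs)
{
	atomic_fetch_sub(&cs->count, 1);
}

/*
 * allocation: refresh the task's cached mask only when the generation
 * number has moved (rare), then one bitmask test decides whether a
 * given memory node may satisfy the request.
 */
int cpuset_node_allowed(struct cpuset_lite *cs, int *task_gen,
			unsigned long *task_mems, int node)
{
	int gen = atomic_load(&cs->mems_generation);

	if (*task_gen != gen) {		/* slow path, off the hot loop */
		*task_mems = cs->mems_allowed;
		*task_gen = gen;
	}
	return (*task_mems >> node) & 1;	/* the hot-path test */
}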
=================================================

In summary, I find that your cpuset/memset CKRM proposal provides
little or no benefit past the simpler cpu and memory placement calls
already available, while costing substantially more, in a variety of
ways, than my cpuset proposal, when evaluated for its usefulness for
numa placement.

(Of course, if evaluated for suitability for workload management, the
table is turned, and your CKRM patch provides essential capability
that my cpuset patch could never dream of doing.)

Moreover, the additional workload management benefits that your CKRM
facility provides, and that some of my customers might want to use in
combination with numa placement, would probably become unavailable to
them if we integrated cpusets and CKRM, because cpusets would have to
hijack the task class hierarchy for its own nefarious purposes.

Such an attempt to integrate cpusets and CKRM would be a major setback
for cpusets, substantially increasing its costs and reducing its
value, probably well past the point of it even being worth pursuing
further in the mainstream kernel.  Adding all that foreign logic of
cpusets to the CKRM patch probably wouldn't help CKRM much either.
The CKRM patch is already one that requires a bright mind and some
careful thought to master.  Adding cpuset numa placement logic, which
is typically different in detail, would add a complexity burden to the
CKRM code that would serve no one well.

> Note that I am not pitching for a marriage

We agree.  I just took more words to say it ;).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson  1.650.933.1373, 1.925.600.0401