Date: Wed, 19 Jun 2002 22:24:44 -0700
From: Larry McVoy
To: "Eric W. Biederman"
Cc: Linus Torvalds, Cort Dougan, Benjamin LaHaise, Rusty Russell,
    Robert Love, Linux Kernel Mailing List
Subject: Re: latest linus-2.5 BK broken

On Wed, Jun 19, 2002 at 09:57:35PM -0600, Eric W. Biederman wrote:
> I totally agree; mostly I was playing devil's advocate.  The model
> actually in my head is when you have multiple kernels, but they talk
> well enough that the applications only have to care in areas where it
> doesn't make a performance difference (there's got to be one of those).
....
> The compute cluster problem is an interesting one.  The big items
> I see on the todo list are:
>
>  - Scalable fast distributed file system (Lustre looks like a
>    possibility)
>  - Sub-application-level checkpointing.
>
> Services like schedulers already exist.
>
> Basically the job of a cluster scheduler gets much easier, and the
> scheduler more powerful, once it gets the ability to suspend jobs.
> Checkpointing buys three things: the ability to preempt jobs, the
> ability to migrate processes, and the ability to recover from failed
> nodes (assuming the failed hardware didn't corrupt your job's
> checkpoint).
>
> Once solutions to the cluster problems become well understood, I
> wouldn't be surprised if some of the supporting services started to
> live in the kernel, like nfsd.  Parts of the distributed filesystem
> certainly will.

http://www.bitmover.com/cc-pitch

I've been trying to get Linus to listen to this for years and he keeps
on flogging the tired SMP horse instead.  DEC did it, and Sun has been
passing around these slides for a few weeks, so maybe they'll do it
too.  Then Linux can join the party after it has become a fine-grained,
locked-to-hell-and-back, soft-"realtime", NUMA-enabled, bloated piece
of crap like all the other kernels, and we'll get to go through the
"let's reinvent Unix for the 3rd time in 40 years" exercise all over
again.  What fun.  Not.

Sorry to be grumpy.  Go read the slides; I'll be at OLS, and I'd be
happy to talk it over with anyone who wants to think about it.  Paul
McKenney from IBM came down to San Francisco to talk to me about it,
put me through an 8 or 9 hour session which felt like a PhD exam, and
after trying to poke holes in it grudgingly let on that maybe it was a
good idea.  He was kind enough to write up what he took away from it;
here it is.
--lm

From: "Paul McKenney"
To: lm@bitmover.com, tytso@mit.edu
Subject: Greatly enjoyed our discussion yesterday!
Date: Fri, 9 Nov 2001 18:48:56 -0800

Hello!

I greatly enjoyed our discussion yesterday!  Here are the pieces of it
that I recall; I know that you will not be shy about correcting any
errors and omissions.
						Thanx, Paul


		       Larry McVoy's SMP Clusters
		     Discussion on November 8, 2001
		Larry McVoy, Ted Ts'o, and Paul McKenney

What is SMP Clusters?

SMP Clusters is a method of partitioning an SMP (symmetric
multiprocessing) machine's CPUs, memory, and I/O devices so that
multiple "OSlets" run on this machine.  Each OSlet owns and controls
its partition.  A given partition is expected to contain from 4-8 CPUs,
its share of memory, and its share of I/O devices.  A machine large
enough for SMP Clusters to be profitably applied is expected to have
enough of the standard I/O adapters (e.g., Ethernet, SCSI, FC) that
each OSlet would have at least one of each.

Each OSlet has the same data structures that an isolated OS would have
for the same amount of resources.  Unless interactions with other
OSlets are required, an OSlet runs very nearly the same code over very
nearly the same data as would a standalone OS.

Although each OSlet is in most ways its own machine, the full set of
OSlets appears as one OS to any user programs running on any of the
OSlets.  In particular, processes on one OSlet can share memory with
processes on other OSlets, can send signals to processes on other
OSlets, can communicate via pipes and Unix-domain sockets with
processes on other OSlets, and so on.  Performance of operations
spanning multiple OSlets may be somewhat slower than operations local
to a single OSlet, but the difference will not be noticeable except to
users who are engaged in careful performance analysis.

The goals of the SMP Clusters approach are:

1. Allow the core kernel code to use simple locking designs.
2. Present applications with a single-system view.
3. Maintain good (linear!) scalability.
4. Not degrade the performance of a single CPU beyond that of a
   standalone OS running on the same resources.
5. Minimize modification of core kernel code.  Modified or rewritten
   device drivers, filesystems, and architecture-specific code is
   permitted, perhaps even encouraged.  ;-)

OS Boot

Early-boot code/firmware must partition the machine and prepare tables
for each OSlet that describe the resources that OSlet owns.  Each OSlet
must be made aware of the existence of all the other OSlets, and will
need some facility to allow efficient determination of which OSlet a
given resource belongs to (for example, to determine which OSlet a
given page is owned by; a rough sketch of such a lookup appears after
the list of shared operations below).  At some point in the boot
sequence, each OSlet creates a "proxy task" for each of the other
OSlets that provides shared services to them.

Issues:

1. Some systems may require device probing to be done by a central
   program, possibly before the OSlets are spawned.  Systems that react
   in an unfriendly manner to failed probes might be in this class.
2. Interrupts must be set up very carefully.  On some systems, the
   interrupt system may constrain the ways in which the machine can be
   partitioned.

Shared Operations

This section describes some possible implementations of, and issues
with, a number of the shared operations.  Shared operations include:

1. Page fault on memory owned by some other OSlet.
2. Manipulation of processes running on some other OSlet.
3. Access to devices owned by some other OSlet.
4. Reception of network packets intended for some other OSlet.
5. SysV msgq and sema operations on msgq and sema objects accessed by
   processes running on multiple OSlets.
6. Access to filesystems owned by some other OSlet.  The /tmp directory
   gets special mention.
7. Pipes connecting processes in different OSlets.
8. Creation of processes that are to run on a different OSlet than
   their parent.
9. Processing of exit()/wait() pairs involving processes running on
   different OSlets.
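As an aside on the OS Boot section above: a minimal sketch of what the
boot-time resource-ownership lookup might look like.  This is purely
illustrative; the structure and function names are invented and are not
part of any existing interface.

/*
 * Illustrative sketch only: answering "which OSlet owns this physical
 * page?" from the boot-time partition tables.  All names are made up.
 */
#include <stddef.h>

#define MAX_MEM_RANGES	64

struct oslet_mem_range {
	unsigned long	first_pfn;	/* first physical page frame in range */
	unsigned long	last_pfn;	/* last physical page frame (inclusive) */
	int		owner;		/* id of the OSlet owning this range */
};

/* Filled in from the firmware/early-boot partition tables. */
static struct oslet_mem_range mem_map[MAX_MEM_RANGES];
static int nr_mem_ranges;

/* Return the owning OSlet id for a page frame, or -1 if unknown. */
static int page_owner(unsigned long pfn)
{
	int i;

	for (i = 0; i < nr_mem_ranges; i++) {
		if (pfn >= mem_map[i].first_pfn && pfn <= mem_map[i].last_pfn)
			return mem_map[i].owner;
	}
	return -1;
}

A real implementation would presumably use a sorted table or a per-page
owner array rather than a linear scan; the point is only that ownership
must be cheap to determine, since it sits on the page-fault path
described next.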
Page Fault

As noted earlier, each OSlet maintains a proxy process for each other
OSlet (so that for an SMP Cluster made up of N OSlets, there are
N*(N-1) proxy processes).  When a process in OSlet A wishes to map a
file belonging to OSlet B, it makes a request to B's proxy process
corresponding to OSlet A.  The proxy process maps the desired file and
takes a page fault at the desired address (translated as needed, since
the file will usually not be mapped to the same location in the proxy
and client processes), forcing the page into OSlet B's memory.  The
proxy process then passes the corresponding physical address back to
the client process, which maps it.

Issues:

o How to coordinate pageout?  Two approaches:
  1. Use mlock() in the proxy process so that only the client process
     can do the pageout.
  2. Make the two OSlets coordinate their pageouts.  This is more
     complex, but will be required in some form or another to prevent
     OSlets from "ganging up" on one of their number and exhausting its
     memory.

o When OSlet A ejects the memory from its working set, where does it
  put it?
  1. Throw it away, and go to the proxy process as needed to get it
     back.
  2. Augment the core VM as needed to track the "guest" memory.  This
     may be needed for performance, but...

o Some code is required in the pagein() path to figure out that the
  proxy must be used.
  1. Larry stated that he is willing to be punched in the nose to get
     this code in.  ;-)  The amount of this code is minimized by
     creating SMP-Clusters-specific filesystems, which have their own
     functions for mapping and releasing pages.  (Does this really
     cover OSlet A's paging out of this memory?)

o How are pagein()s going to be even halfway fast if IPC to the proxy
  is involved?  (A rough sketch of such a request/reply path follows
  this section.)
  1. Just do it.  Page faults should not be all that frequent with
     today's memory sizes.  (But then why do we care so much about
     page-fault performance???)
  2. Use "doors" (from Sun), which are very similar to protected
     procedure call (from K42/Tornado/Hurricane).  The idea is that the
     CPU in OSlet A that is handling the page fault temporarily
     -becomes- a member of OSlet B by using OSlet B's page tables for
     the duration.  This raises some interesting issues:
     a. What happens if a process wants to block while "doored"?  Does
        it switch back to being an OSlet A process?
     b. What happens if a process takes an interrupt (which corresponds
        to OSlet A) while doored (and thus using OSlet B's page
        tables)?
        i.  Prevent this by disabling interrupts while doored.  This
            could pose problems with relatively long VM code paths.
        ii. Switch back to OSlet A's page tables upon interrupt, and
            back to OSlet B's page tables upon return from interrupt.
            On machines not supporting ASIDs, take a TLB-flush hit in
            both directions.  This also likely requires common text (at
            least for low-level interrupt handling) across all OSlets,
            making it more difficult to support OSlets running
            different versions of the OS.  Furthermore, the last time
            that Paul suggested adding instructions to the interrupt
            path, several people politely informed him that this would
            require a nose punching.  ;-)
     c. If a bunch of OSlets simultaneously decide to invoke their
        proxies on a particular OSlet, that OSlet sees lock contention
        proportional to the number of CPUs on the system rather than to
        the number in a single OSlet.  Some approaches to handle this:
        i.  Stripe -everything- and rely on entropy to save you.  This
            may still have problems with hotspots (e.g., which of the
            OSlets has the root of the root filesystem?).
        ii. Use some sort of queued lock to limit the number of CPUs
            that can be running proxy processes in a given OSlet.  This
            does not really help scaling, but would make the contention
            less destructive to the victim OSlet.

o How to balance memory usage across the OSlets?
  1. Don't bother; let paging deal with it.  Paul's previous experience
     with this philosophy was not encouraging.  (You can end up with
     one OSlet thrashing due to the memory load placed on it by other
     OSlets, which themselves see no memory pressure.)
  2. Use some global memory-pressure scheme to even things out.  This
     seems possible, but Paul is concerned about the complexity of the
     approach.  If it is taken, make sure someone with some
     control-theory experience is involved.
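To make the proxy-based pagein path concrete, here is a rough sketch of
the request/reply that OSlet A might send to its proxy in OSlet B.
Everything here is invented for illustration (the message layout, the
oslet_ipc_call() primitive, and remote_pagein() itself); it is not an
existing interface, and it deliberately ignores the door-based variant.

/*
 * Illustrative sketch only: a client-side cross-OSlet pagein request.
 * The IPC primitive is assumed, not real.
 */
#include <stdint.h>

struct oslet_map_request {
	uint32_t	client_oslet;	/* OSlet A: where the faulting process runs */
	uint32_t	owner_oslet;	/* OSlet B: owner of the file and its pages */
	uint64_t	file_id;	/* identifies the file within OSlet B */
	uint64_t	file_offset;	/* offset of the faulting page */
};

struct oslet_map_reply {
	int32_t		error;		/* 0 on success */
	uint64_t	phys_addr;	/* physical address of the now-resident page */
};

/* Assumed cross-OSlet IPC (or "door") call; invented for this sketch. */
extern int oslet_ipc_call(uint32_t target_oslet,
			  const struct oslet_map_request *req,
			  struct oslet_map_reply *rep);

/*
 * Called from OSlet A's pagein() path once it has determined (e.g., via
 * an ownership lookup like page_owner() above) that the page belongs to
 * another OSlet.  On success the caller maps *phys_addr into the
 * faulting process.
 */
static int remote_pagein(uint32_t self, uint32_t owner,
			 uint64_t file_id, uint64_t offset,
			 uint64_t *phys_addr)
{
	struct oslet_map_request req = {
		.client_oslet = self,
		.owner_oslet  = owner,
		.file_id      = file_id,
		.file_offset  = offset,
	};
	struct oslet_map_reply rep;
	int err;

	err = oslet_ipc_call(owner, &req, &rep);
	if (err)
		return err;
	if (rep.error)
		return rep.error;
	*phys_addr = rep.phys_addr;
	return 0;
}

Whether this is fast enough is exactly the issue raised above; the
door-based alternative replaces the IPC round trip with a temporary
switch to OSlet B's page tables.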
Manipulation of Processes Running on Some Other OSlet

The general idea here is to implement something similar to a vproc
layer.  This is common code, and thus requires someone to sacrifice
their nose.  There was some discussion of other things that this would
be useful for, but I have lost them.  Manipulations discussed included
signals and job control.

Issues:

o Should process information be replicated across the OSlets for
  performance reasons?  If so, how much, and how is it synchronized?
  1. No, just use doors.  See the discussion above.
  2. Yes.  No discussion of synchronization methods.  (Hey, we had to
     leave -something- for later!)

Access to Devices Owned by Some Other OSlet

Larry mentioned a /rdev, but if we discussed any details of this, I
have lost them.  Presumably, one would use some sort of IPC or doors to
make this work.

Reception of Network Packets Intended for Some Other OSlet

An OSlet receives a packet and realizes that it is destined for a
process running in some other OSlet.  How is this handled without
rewriting most of the networking stack?  The general approach was to
add a NAT-like layer that inspects the packet and determines which
OSlet it is destined for.  The packet is then forwarded to the correct
OSlet and subjected to full IP-stack processing there.

Issues:

o If the address map in the kernel is not to be manipulated on each
  packet reception, there needs to be a circular buffer in each OSlet
  for each of the other OSlets (again, N*(N-1) buffers; a sketch of one
  such buffer follows this section).  In order to keep the buffers from
  needing to be exceedingly large, packets must be bcopy()ed into this
  buffer by the OSlet that received the packet, and then bcopy()ed out
  by the OSlet containing the target process.  This could add a fair
  amount of overhead.
  1. Just accept the overhead.  Rely on this being an uncommon case
     (see the next issue).
  2. Come up with some other approach, possibly involving the user
     address space of the proxy process.  We could not articulate such
     an approach, but it was late and we were tired.

o If there are two processes that share the FD on which the packet
  could be received, and these two processes are in two different
  OSlets, and neither is in the OSlet that received the packet, what
  the heck do you do???
  1. Prevent this from happening by refusing to allow processes holding
     a TCP connection open to move to another OSlet.  This could result
     in load-balance problems in some workloads, though neither Paul
     nor Ted was able to come up with a good example on the spot
     (seeing as BAAN has not been doing really well of late).  To
     indulge in l'esprit d'escalier... how about a timesharing system
     that users access from the network?  A single user would have to
     log on twice to run a job that consumed more than one OSlet if
     each process in the job might legitimately need access to stdin.
  2. Do all protocol processing on the OSlet on which the packet was
     received, and straighten things out when delivering the packet
     data to the receiving process.  This likely requires changes to
     common code, hence someone to volunteer their nose.
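For concreteness, here is a minimal sketch of one of the per-OSlet-pair
circular buffers mentioned above, with the two bcopy()s showing up as
the two memcpy() calls.  All of it is invented for illustration;
locking, wakeups, and error handling are omitted.

/*
 * Illustrative sketch only: a fixed-size ring that the receiving OSlet
 * fills and the destination OSlet drains.
 */
#include <string.h>
#include <sys/types.h>

#define FWD_RING_SLOTS	256
#define FWD_MAX_PKT	2048	/* assumed upper bound on forwarded packet size */

struct fwd_slot {
	size_t		len;
	unsigned char	data[FWD_MAX_PKT];
};

struct fwd_ring {
	unsigned int	head;	/* next slot the receiving OSlet fills */
	unsigned int	tail;	/* next slot the destination OSlet drains */
	struct fwd_slot	slots[FWD_RING_SLOTS];
};

/* Receiving OSlet: copy the packet into the ring.  0 on success, -1 if full. */
static int fwd_enqueue(struct fwd_ring *r, const void *pkt, size_t len)
{
	struct fwd_slot *s = &r->slots[r->head % FWD_RING_SLOTS];

	if (len > FWD_MAX_PKT || r->head - r->tail == FWD_RING_SLOTS)
		return -1;
	memcpy(s->data, pkt, len);	/* first of the two copies */
	s->len = len;
	r->head++;
	return 0;
}

/* Destination OSlet: copy the next packet out for local IP-stack processing. */
static ssize_t fwd_dequeue(struct fwd_ring *r, void *buf, size_t buflen)
{
	struct fwd_slot *s;

	if (r->tail == r->head)
		return -1;		/* ring is empty */
	s = &r->slots[r->tail % FWD_RING_SLOTS];
	if (s->len > buflen)
		return -1;
	memcpy(buf, s->data, s->len);	/* second copy */
	r->tail++;
	return (ssize_t)s->len;
}

The double copy is precisely the overhead weighed in option 1 above.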
SysV msgq and sema Operations

We didn't discuss these.  None of us seems to be a SysV fan, but these
must be made to work regardless.  Larry says that shm should be
implemented in terms of mmap(), so that this case reduces to the
page-mapping case discussed above.  Of course, one would need a
filesystem large enough to back the largest possible shmget().  Paul
supposes that one could dynamically create a memory filesystem to avoid
problems here, but is in no way volunteering his nose to this cause.

Access to Filesystems Owned by Some Other OSlet

For the most part, this reduces to the mmap case.  However,
partitioning popular filesystems over the OSlets could be very helpful.
Larry mentioned that this had been prototyped.  Paul cannot remember
whether Larry promised to send papers or other documentation, but duly
requests them after the fact.

Larry suggests having a local /tmp, so that /tmp is in effect private
to each OSlet.  There would be a /gtmp that would be a globally visible
/tmp equivalent.  We went round and round on software compatibility,
with Paul suggesting a hashed filesystem as an alternative.  Larry
eventually pointed out that one could just issue different mount
commands to get a global filesystem on /tmp and a per-OSlet /ltmp.
This would allow people to determine their own risk/performance
trade-off.

Pipes Connecting Processes in Different OSlets

This was mentioned, but I have forgotten the details.  My vague
recollection is that some nose-punching was required, but I must defer
to Larry and Ted.  Ditto for Unix-domain sockets.

Creation of Processes on a Different OSlet Than Their Parent

There would be an inherited attribute that would prevent fork() or
exec() from creating its child on a different OSlet.  This attribute
would be set by default to prevent too many surprises.  Things like
make(1) would clear this attribute to allow amazingly fast kernel
builds (a hypothetical sketch of this appears after the desiderata
below).  There would also be a system call that would cause the child
to be placed on a specified OSlet (Paul suggested use of HP's "launch
policy" concept to avoid adding yet another dimension to the exec()
combinatorial explosion).

The discussion of packet reception led Larry to suggest that cross-OSlet
process creation would be prohibited if the parent and child share a
socket.  See above for the load-balancing concern and the corresponding
l'esprit d'escalier.

Processing of exit()/wait() Pairs Crossing OSlet Boundaries

We didn't discuss this.  My guess is that vproc deals with it.  Some
care is required when optimizing for this: if one hands a child off to
a remote parent that dies before doing a wait(), one would not want one
of the init processes getting a nasty surprise.  (Yes, there are
separate init processes for each OSlet.  We did not talk about the
implications of this, which might arise if one needed to send a signal
intended to be received by all of the replicated processes.)

Other Desiderata

1. Ability of the surviving OSlets to continue running after one of
   their number fails.  Paul was quite skeptical of this.  Larry
   suggested that the "door" mechanism could use a dynamic-linking
   strategy.  Paul remained skeptical.  ;-)
2. Ability to run different versions of the OS on different OSlets.
   There is some discussion of this above.
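To make the inherited placement attribute concrete, here is a
hypothetical userspace sketch.  The oslet_ctl() call and its
OSLET_SET_SPREAD request are invented for illustration; nothing like
them exists, and a real design might instead extend an existing
interface or use a launch-policy-style mechanism as discussed above.

/*
 * Illustrative sketch only: a make(1)-like parent clearing the default
 * "children stay on my OSlet" behavior before forking its jobs.
 */
#include <sys/types.h>
#include <unistd.h>

#define OSLET_SET_SPREAD	1	/* invented request: may children cross OSlets? */

/* Assumed syscall wrapper; invented for this sketch. */
extern int oslet_ctl(int request, int value);

static pid_t spawn_spread_out(void (*job)(void))
{
	pid_t pid;

	/*
	 * Clear the inherited stay-local attribute: children may now be
	 * placed on whichever OSlet the kernel chooses.
	 */
	oslet_ctl(OSLET_SET_SPREAD, 1);

	pid = fork();
	if (pid == 0) {
		job();		/* child runs the work, possibly on another OSlet */
		_exit(0);
	}
	return pid;
}

Since the attribute is inherited, a build tool would only need to
change it once; every descendant compiler process could then be spread
across the OSlets.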
The Score

Paul agreed that SMP Clusters could be implemented.  He was not sure
that it could achieve good performance, but could not prove otherwise.
Although he suspected that the complexity might be less than that of
the proprietary highly parallel Unixes, he was not convinced that it
would be less than that of Linux itself, given the Linux community's
emphasis on simplicity in addition to performance.
-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm