Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
    id S1758427AbZCRTH4 (ORCPT );
    Wed, 18 Mar 2009 15:07:56 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org
    id S1758881AbZCRTHg (ORCPT );
    Wed, 18 Mar 2009 15:07:36 -0400
Received: from brinza.cc.columbia.edu ([128.59.29.8]:48067
    "EHLO brinza.cc.columbia.edu" rhost-flags-OK-OK-OK-OK)
    by vger.kernel.org with ESMTP id S1754694AbZCRTHe (ORCPT );
    Wed, 18 Mar 2009 15:07:34 -0400
Message-ID: <49C145A1.3000506@cs.columbia.edu>
Date: Wed, 18 Mar 2009 15:04:01 -0400
From: Oren Laadan
Organization: Columbia University
User-Agent: Thunderbird 2.0.0.19 (X11/20090105)
MIME-Version: 1.0
To: Mike Waychison
CC: Linus Torvalds, Andrew Morton, linux-api@vger.kernel.org,
    containers@lists.linux-foundation.org, hpa@zytor.com,
    linux-kernel@vger.kernel.org, Dave Hansen, linux-mm@kvack.org,
    viro@zeniv.linux.org.uk, mingo@elte.hu, mpm@selenic.com,
    tglx@linutronix.de, Sukadev Bhattiprolu, Alexey Dobriyan,
    xemul@openvz.org
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
References: <1234475483.30155.194.camel@nimitz>
    <20090212141014.2cd3d54d.akpm@linux-foundation.org>
    <1234479845.30155.220.camel@nimitz>
    <20090226155755.GA1456@x200.localdomain>
    <20090310215305.GA2078@x200.localdomain>
    <49B775B4.1040800@free.fr>
    <20090312145311.GC12390@us.ibm.com>
    <1236891719.32630.14.camel@bahia>
    <20090312212124.GA25019@us.ibm.com>
    <604427e00903122129y37ad791aq5fe7ef2552415da9@mail.gmail.com>
    <20090313053458.GA28833@us.ibm.com>
    <49BAC6AF.9090607@google.com>
    <49BADFCE.8020207@cs.columbia.edu>
    <49C1435B.1090809@google.com>
In-Reply-To: <49C1435B.1090809@google.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-No-Spam-Score: Local
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 4945
Lines: 118

Mike Waychison wrote:
> Oren Laadan wrote:
>>
>> Mike Waychison wrote:
>>> Linus Torvalds wrote:
>>>> On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:
>>>>
>>>>> Ying Han [yinghan@google.com] wrote:
>>>>> | Hi Serge:
>>>>> | I made a patch based on Oren's tree recently which implements a
>>>>> | new syscall clone_with_pid. I tested with checkpoint/restart
>>>>> | process tree and it works as expected.
>>>>>
>>>>> Yes, I think we had a version of clone() with pid a while ago.
>>>> Are people _at_all_ thinking about security?
>>>>
>>>> Obviously not.
>>>>
>>>> There's no way we can do anything like this. Sure, it's trivial to
>>>> do inside the kernel. But it also sounds like a _wonderful_ attack
>>>> vector against badly written user-land software that sends signals
>>>> and has small races.
>>> I'm not really sure how this is different than a malicious app going
>>> off and spawning thousands of threads in an attempt to hit a target
>>> pid from a security pov. Sure, it makes it easier, but it's not like
>>> there is anything in place to close the attack vector.
>>>
>>>> Quite frankly, from having followed the discussion(s) over the last
>>>> few weeks about checkpoint/restart in various forms, my reaction to
>>>> just about _all_ of this is that people pushing this are pretty damn
>>>> borderline.
>>>> I think you guys are working on all the wrong problems.
>>>> Let's face it, we're not going to _ever_ checkpoint any kind of
>>>> general case process. Just TCP makes that fundamentally impossible
>>>> in the general case, and there are lots and lots of other cases too
>>>> (just something as totally _trivial_ as all the files in the
>>>> filesystem that don't get rolled back).
>>> In some instances such as ours, TCP is probably the easiest thing to
>>> migrate. In an rpc-based cluster application, TCP is nothing more
>>> than an RPC channel and applications already have to handle RPC
>>> channel failure and re-establishment.
>>>
>>> I agree that this is not the 'general case' as you mention above,
>>> however. This is the bit that sorta bothers me with the way the
>>> implementation has been going so far on this list. The
>>> implementation that folks are building on top of Oren's patchset
>>> tries to be everything to everybody. For our purposes, we need to
>>> have the flexibility of choosing *how* we checkpoint. The line seems
>>> to be arbitrarily drawn at the kernel being responsible for
>>> checkpointing and restoring all resources associated with a task, and
>>> leaving userland with nothing more than transporting filesystem
>>> bits. This approach isn't flexible enough: Consider the case where
>>> we want to stub out most of the TCP file descriptors with
>>> ECONNRESETed sockets because we know that they are RPC sockets and
>>> can re-establish themselves, but we want to use some other mechanism
>>> for TCP sockets we don't know much about. The current monolithic
>>> approach has zero flexibility for doing anything like this, and I
>>> can't figure out how we could even fit anything like this in.
>>
>> The flexibility exists, but wasn't spelled out, so here it is:
>>
>> 1) Similar to madvise(), I envision a cradvice() that could tell the
>> c/r something about specific resources, e.g.:
>>   * cradvice(CR_ADV_MEM, ptr, len) -> don't save that memory, it's
>>     scratch
>>   * cradvice(CR_ADV_SOCK, fd, CR_ADV_SOCK_RESET) -> reset connection
>>     on restart
>> etc .. (never mind the exact interface right now)
>>
>> 2) Tasks can ask to be notified (e.g. register a signal) when a
>> checkpoint or a restart completes successfully. At that time they can
>> do their private house-keeping if they know better.
>>
>> 3) If restoring some resource is significantly easier in user space
>> (e.g. a file-descriptor of some special device which user space knows
>> how to re-initialize), then the restarting task can prepare it ahead
>> of time, and call:
>>   * cradvice(CR_ADV_USERFD, fd, 0) -> use the fd in place instead of
>>     trying to restore it yourself.
>
> This would be called by the embryo process (mktree.c?) before calling
> sys_restart?

Yes.

>
>>
>> Method #3 is what I used in Zap to implement distributed checkpoints,
>> where it is so much easier to recreate all network connections in
>> user space than putting that logic into the kernel.
>>
>> Now, on the other hand, doing the c/r from userland is much less
>> flexible than in the kernel (e.g. epollfd, futex state and much more)
>> and requires exposing a tremendous amount of in-kernel data to user
>> space. And we all know that exposing internals is always a one-way
>> ticket :(
>>
>> [...]
>>
>> Oren.
>>
>>
>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
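
Illustration (not part of the original message): the three cradvice()
calls listed in the message can be mocked up in plain user-space C to
show how per-resource hints might steer a checkpoint/restart engine.
This is only a sketch under stated assumptions. cradvice() was a
proposal in this thread and is not an existing syscall. The names
CR_ADV_MEM, CR_ADV_SOCK, CR_ADV_SOCK_RESET and CR_ADV_USERFD come from
the message. Their numeric values, the hint table, the cr_fd_action()
and cr_mem_is_scratch() helpers, and the fd numbers 5 and 7 are all
invented here.

```c
#include <stddef.h>
#include <stdint.h>

/* Advice classes named in the thread. The numeric values are invented. */
enum cr_advice {
    CR_ADV_MEM = 1,     /* target = address, arg = length: scratch memory */
    CR_ADV_SOCK = 2,    /* target = fd, arg = socket policy */
    CR_ADV_USERFD = 3   /* target = fd, arg unused: user space supplies fd */
};
#define CR_ADV_SOCK_RESET 1UL   /* invented value */

/* Hypothetical: what a c/r engine might do with a file descriptor. */
enum cr_fd_action { CR_FD_SAVE, CR_FD_RESET, CR_FD_USER };

struct cr_hint {
    enum cr_advice kind;
    unsigned long target;
    unsigned long arg;
};

#define CR_MAX_HINTS 16
struct cr_hint cr_hints[CR_MAX_HINTS];
size_t cr_nhints;

/* User-space stand-in for the proposed call: it only records the hint.
 * Returns 0 on success, -1 when the table is full. */
int cradvice(enum cr_advice kind, unsigned long target, unsigned long arg)
{
    if (cr_nhints == CR_MAX_HINTS)
        return -1;
    cr_hints[cr_nhints].kind = kind;
    cr_hints[cr_nhints].target = target;
    cr_hints[cr_nhints].arg = arg;
    cr_nhints++;
    return 0;
}

/* Policy lookup for one fd; the default is to save and restore it. */
enum cr_fd_action cr_fd_action(int fd)
{
    enum cr_fd_action action = CR_FD_SAVE;
    size_t i;

    for (i = 0; i < cr_nhints; i++) {
        const struct cr_hint *h = &cr_hints[i];

        if (h->kind == CR_ADV_MEM || h->target != (unsigned long)fd)
            continue;
        if (h->kind == CR_ADV_USERFD)
            action = CR_FD_USER;
        else if (h->kind == CR_ADV_SOCK && h->arg == CR_ADV_SOCK_RESET)
            action = CR_FD_RESET;
    }
    return action;
}

/* Returns 1 if addr lies inside a range marked as scratch, else 0. */
int cr_mem_is_scratch(const void *addr)
{
    uintptr_t a = (uintptr_t)addr;
    size_t i;

    for (i = 0; i < cr_nhints; i++) {
        const struct cr_hint *h = &cr_hints[i];

        if (h->kind == CR_ADV_MEM && a >= h->target &&
            a - h->target < h->arg)
            return 1;
    }
    return 0;
}

char cr_scratch[4096];

/* The three calls from the message, with made-up fd numbers 5 and 7. */
void cr_demo(void)
{
    cradvice(CR_ADV_MEM, (unsigned long)(uintptr_t)cr_scratch,
             sizeof cr_scratch);
    cradvice(CR_ADV_SOCK, 5, CR_ADV_SOCK_RESET);
    cradvice(CR_ADV_USERFD, 7, 0);
}
```

After cr_demo() runs, cr_fd_action(5) reports a reset-on-restart
socket, cr_fd_action(7) reports a user-supplied descriptor, and any
other fd falls back to the default save-and-restore path. The point of
the sketch is only that the default stays "kernel handles everything"
while a task can opt individual resources out, which is the kind of
flexibility asked for earlier in the thread.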