Date: Tue, 23 Nov 2010 12:53:26 -0500 (EST)
From: Oren Laadan
To: Gene Cooperman
Cc: Tejun Heo, Kapil Arya, linux-kernel@vger.kernel.org, xemul@sw.ru,
    "Eric W. Biederman", Linux Containers
Subject: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
In-Reply-To: <20101121082143.GB21672@sundance.ccs.neu.edu>

On Sun, 21 Nov 2010, Gene Cooperman wrote:

> As Kapil and I wrote before, we benefited greatly from having talked with Oren,
> and learning some more about the context of the discussion. We were able
> to understand better the good technical points that Oren was making.
> Since the comparison table below concerns DMTCP, we'd like to
> state some additional technical points that could affect the conclusions.
>
> > category       linux-cr                       userspace
> > --------------------------------------------------------------------------------
> > PERFORMANCE    has _zero_ runtime overhead    visible overhead due to syscalls
> >                                               interposition and state tracking
> >                                               even w/o checkpoints;
>
> In our experiments so far, the overhead of system calls has been
> unmeasurable. We never wrap read() or write(), in order to keep overhead low.
> We also never wrap pthread synchronization primitives such as locks,
> for the same reason. The other system calls are used much less often, and so
> the overhead has been too small to measure in our experiments.

Syscall interception will have a visible effect on applications that use
the intercepted syscalls heavily. You may not observe overhead with HPC
workloads, but do you have numbers for server apps? Apps that use
fork/clone and pipes extensively? Threading benchmarks, etc.? Compare
that to the absolute zero runtime overhead of linux-cr.

> > OPTIMIZATIONS  many optimizations possible    limited, less effective
> >                only in kernel, for downtime,  w/ much larger overhead.
> >                image size, live-migration
>
> As above, we believe that the overhead while running is negligible. I'm

For the HPC apps that you use.

> assuming that image size refers to in-kernel advantages for incremental
> checkpointing. This is useful for apps where the modified pages tend
> not to dominate. We agree with this point. As an orthogonal point,
> by default DMTCP compresses all checkpoint images using gzip on the fly.
> This is useful even when most pages are modified between checkpoints.
> Still, as Oren writes, Linux C/R could also add a userland component
> to compress checkpoint images on the fly.

This is not a "userland component"; it's simply
"checkpoint | gzip > image.out"...
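For illustration, a rough sketch of such a pipeline driven from a tiny
helper. This assumes a checkpoint syscall along the lines of the
proposed sys_checkpoint(pid, fd, flags) and uses a placeholder syscall
number, so take it as a sketch rather than working code against any
particular tree:

/*
 * ckpt-gz: "checkpoint | gzip > image.gz" done from a small helper.
 * The checkpoint syscall writes the image to whatever fd it is given,
 * so handing it the write end of a pipe into gzip compresses the
 * stream on the fly -- no in-kernel compression needed.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>

#ifndef __NR_checkpoint
#define __NR_checkpoint -1	/* placeholder: real number comes from linux-cr */
#endif

static long do_checkpoint(pid_t pid, int fd, unsigned long flags)
{
	return syscall(__NR_checkpoint, pid, fd, flags);
}

int main(int argc, char **argv)
{
	int pipefd[2];
	pid_t child;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	if (pipe(pipefd) < 0) {
		perror("pipe");
		return 1;
	}

	child = fork();
	if (child < 0) {
		perror("fork");
		return 1;
	}
	if (child == 0) {
		/* gzip reads the raw image from the pipe, writes to stdout */
		close(pipefd[1]);
		dup2(pipefd[0], STDIN_FILENO);
		close(pipefd[0]);
		execlp("gzip", "gzip", "-c", (char *)NULL);
		perror("execlp");
		_exit(127);
	}

	close(pipefd[0]);
	if (do_checkpoint(atoi(argv[1]), pipefd[1], 0) < 0)
		perror("checkpoint");
	close(pipefd[1]);	/* gzip sees EOF and flushes */
	waitpid(child, NULL, 0);
	return 0;
}

Restart would be the mirror image: gunzip feeding the read end of a
pipe into the restart syscall.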
> Next, live migration is a question that we simply haven't thought much
> about. If it's important, we could think about what userland approaches might
> exist, but we have no near-term plans to tackle live migration.

As it is, live-migration _is_ a very important use case.

> > OPERATION      applications run unmodified    to do c/r, needs 'controller'
> >                                               task (launch and manage _entire_
> >                                               execution) - point of failure.
> >                                               restricts how a system is used.
>
> We'd like to clarify what may be some misconceptions. The DMTCP
> controller does not launch or manage any tasks. The DMTCP controller
> is stateless, and is only there to provide a barrier, namespace server,
> and single point of contact to relay ckpt/restart commands. Recall that
> the DMTCP controller handles processes across hosts --- not just on a
> single host.

The controller is another point of failure. I already pointed out that
the (controlled) application crashes when your controller dies, and you
mentioned it's a bug that should be fixed. But then there will always be
a risk for another bug, and another... You also mentioned that if the
controller dies, the app will continue to run, but will no longer be
checkpointable (IIUC).

The point is that the controller is another point of failure, and makes
the execution/checkpoint intrusive. It also adds security and
user-management issues, as you'll need one (or more?) controller per
user (right now it's one for all, no?), and so on. Plus, because the
restarted apps get their virtualized IDs from the controller, they
cannot now "see" existing/new processes that may get the "same" pids
(the virtualization is not in the kernel).
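To make that last point concrete, userspace pid virtualization is
typically done with interposed wrappers along these lines. This is a
simplified sketch with made-up names (pidvirt_register() and friends),
not DMTCP's actual code:

/*
 * pidvirt.c -- illustrative sketch of userspace pid virtualization via
 * library interposition (hypothetical names, NOT DMTCP's implementation).
 * Build as a shared object and LD_PRELOAD it:
 *   gcc -shared -fPIC -o pidvirt.so pidvirt.c -ldl
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <signal.h>
#include <sys/types.h>

#define MAX_PIDS 1024

/* virtual pid (what the app saw before restart) -> real pid (now) */
static struct { pid_t virt; pid_t real; } pid_table[MAX_PIDS];
static int pid_count;

/* Called by the (hypothetical) restart logic for every restored task. */
void pidvirt_register(pid_t virt, pid_t real)
{
	if (pid_count < MAX_PIDS) {
		pid_table[pid_count].virt = virt;
		pid_table[pid_count].real = real;
		pid_count++;
	}
}

static pid_t virt_to_real(pid_t virt)
{
	int i;

	for (i = 0; i < pid_count; i++)
		if (pid_table[i].virt == virt)
			return pid_table[i].real;
	return virt;	/* unknown pid: pass it through unchanged */
}

/*
 * Interposed kill(2): the app uses virtual pids, the kernel only
 * understands real ones.  Any pid not in the table -- e.g. a new,
 * unrelated process that happens to reuse an old number -- is
 * invisible to this layer.
 */
int kill(pid_t pid, int sig)
{
	static int (*real_kill)(pid_t, int);

	if (!real_kill)
		real_kill = (int (*)(pid_t, int))dlsym(RTLD_NEXT, "kill");
	return real_kill(virt_to_real(pid), sig);
}

With in-kernel c/r, by contrast, the restored tasks simply get their
original pids back inside a pid namespace, so no such translation layer
is needed.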
> Also, in any computation involving multiple processes, _every_ process
> of the computation is a point of failure. If any process of the computation
> dies, then the simple application strategy is to give up and revert to an
> earlier checkpoint. There are techniques by which an app or DMTCP can
> recreate certain failed processes. DMTCP doesn't currently recreate
> a dead controller (no demand for it), but it's not hard to do technically.

The point is that you _add_ a point of failure: you make the
"checkpoint" operation a possible reason for the application to crash.
In contrast, in linux-cr the checkpoint is idempotent and harmless,
because it does not make the applications execute. Instead, it merely
observes their state.

> > PREEMPTIVE     checkpoint at any time, use    processes must be runnable and
> >                auxiliary task to save state;  "collaborate" for checkpoint;
> >                non-intrusive: failure does    long task coordination time
> >                not impact checkpointees.      with many tasks/threads. alters
> >                                               state of checkpointee if fails.
> >                                               e.g. cannot checkpoint when in
> >                                               vfork(), ptrace states, etc.
>
> Our current support of vfork and ptrace has some of the issues that Oren points
> out. One example occurs if a process is in the kernel, and a ptrace state has
> changed. If it was important for some application, we would either have
> to think of some "hack", or follow Tejun's alternative suggestion to work
> with the developers to add further kernel support. The kernel developers
> on this list can estimate the difficulties of kernel support better than I can.
>
> > COVERAGE       save/restore _all_ task state; needs new ABI for everything:
> >                identify shared resources;     can expose state, provide means to
> >                extend for new kernel          restore state (e.g. TCP protocol
> >                features easily                options negotiated with peers)
>
> Currently, the only kernel support used by DMTCP is system calls (wrappers),
> /proc/*/fd, /proc/*/maps, /proc/*/cmdline, /proc/*/exe, /proc/*/stat. (I think
> I've named them all now.) The kernel developers will know better
> than us what other kernel state one might want to support for C/R, and what
> types of applications would need that.
>
> > RELIABILITY    checkpoint w/ single syscall;  non-atomic, cannot find leaks
> >                atomic operation. guaranteed   to determine restartability
> >                restartability for containers
>
> My understanding is that the guarantees apply for Linux containers, but not
> for a tree of processes. Does this imply that linux-cr would have some
> of the same reliability issues as DMTCP for a tree of processes? (I mean
> the question sincerely, and am not intending to be rude.) In any case,
> won't DMTCP and Linux C/R have to handle orthogonal reliability issues
> such as external databases, time virtualization, and other examples
> from our previous post?

There are two points in the claim above:

1) linux-cr can checkpoint with a single syscall - it's atomic. This
gives you more guarantees about the consistency of the checkpointed
application(s), and fewer "opportunities" for the operation as a whole
to fail.

2) restartability - guaranteed for full-container checkpoints only.
There is no "reliability" issue with c/r of non-containers - it's a
matter of definition: it depends on what you require from the userspace
application and what sort of "glue" you have for it.

And I request again - let's leave out the questions of "time
virtualization" and "external databases" - how are they different for
the VM virtualization solution? They are completely orthogonal to the
question we are debating.

Thanks,

Oren.

> > USERSPACE GLUE possible                       possible
> >
> > SECURITY       root and non-root modes        root and non-root modes
> >                native support for LSM
> >
> > MAINTENANCE    changes mainly for features    changes mainly for features;
> >                                               create new ABI for features
> >
> > And by all means, I intend to cooperate with Gene to see how to
> > make the other part of DMTCP, namely the userspace "glue", work on
> > top of linux-cr to have the benefits of all worlds !
>
> This is true, and we strongly welcome the cooperation. We don't know how
> this experiment will turn out, but the only way to find out is to sincerely
> try it. Whether we succeed or fail, we will learn something either way!
>
>   - Gene and Kapil