Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758292AbZDNSWX (ORCPT ); Tue, 14 Apr 2009 14:22:23 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1755588AbZDNSWM (ORCPT ); Tue, 14 Apr 2009 14:22:12 -0400 Received: from jalapeno.cc.columbia.edu ([128.59.29.5]:41104 "EHLO jalapeno.cc.columbia.edu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755392AbZDNSWL (ORCPT ); Tue, 14 Apr 2009 14:22:11 -0400 Message-ID: <49E4D3A9.6020100@cs.columbia.edu> Date: Tue, 14 Apr 2009 14:19:21 -0400 From: Oren Laadan Organization: Columbia University User-Agent: Mozilla-Thunderbird 2.0.0.19 (X11/20090103) MIME-Version: 1.0 To: Alexey Dobriyan CC: akpm@linux-foundation.org, containers@lists.linux-foundation.org, xemul@parallels.com, serue@us.ibm.com, dave@linux.vnet.ibm.com, mingo@elte.hu, hch@infradead.org, torvalds@linux-foundation.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 10/30] cr: core stuff References: <20090410023539.GK27788@x200.localdomain> <49E41D7B.8030003@cs.columbia.edu> <20090414160003.GD27461@x200.localdomain> In-Reply-To: <20090414160003.GD27461@x200.localdomain> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-No-Spam-Score: Local Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 12526 Lines: 302 Alexey Dobriyan wrote: > On Tue, Apr 14, 2009 at 01:22:03AM -0400, Oren Laadan wrote: >> Alexey Dobriyan wrote: >>> * add struct file_operations::checkpoint >>> >>> The point of hook is to serialize enough information to allow restoration >>> of an opened file. >>> >>> The idea (good one!) is that the code which supplies struct file_operations >>> know better what to do with file. >> Actually, credit is due to Dave Hansen (or Christoph Hellwig, or both?). >> >>> Hook gets C/R context (a cookie more or less) on which dump code can >>> cr_write() and small restrictions on what to write: globally unique object id >>> and correct object length to allow jumping through objects. >>> >>> For usual files on on-disk filesystem add generic_file_checkpoint() >>> >>> Add ext3 opened regular files and directories for start. >>> >>> No ->checkpoint, checkpointing is aborted -- deny by default. >>> >>> FIXME: unlinked, but opened files aren't supported yet. >>> >>> * C/R image design >>> >>> The thing should be flexible -- kernel internals changes every day, so we can't >>> really afford a format with much enforced structure. >>> >>> Image consists of header, object images and terminator. >>> >>> Image header consists of immutable part and mutable part (for future). >>> >>> Immutable header part is magic and image version: "LinuxC/R" + __le32 >>> >>> Image version determines everything including image header's mutable part. >>> Image version is going to be bumped at earliest opportunity following changes >>> in kernel internals. >>> >>> So far image header mutable part consists of arch of the kernel which dumped >>> the image (i386, x86_64, ...) and kernel version as found in utsname. >>> >>> Kernel version as string is for distributions. Distro can support C/R for >>> their own kernels, but can't realistically be expected to bump image version -- >>> this will conflict with mainline kernels having used same version. We also don't >>> want requests for private parts of image version space. >> So far so good, like in our patch-set. >> >> You also need to address differences in configuration (kernel could >> have been recompiled) and runtime environment (boot params, etc). >> >> We deferred this issue to a later time. >> >>> Distro expected to keep image version alone and on restart(2) check utsname >>> version and compare it against previously release kernel versions and based >>> on that turn on compatibility code. >> Are you suggesting that conversion of a checkpoint image from an older >> version to a newer version be done in the kernel ? > > For mainline kernel it's completely unrealistic to support all backwards > compatibility code for previous versions. Some mythical userspace > program will convert images. > > But it's completely realistic and much easier for distro kernel because > distro kernel doesn't generally include patches with significant in-kernel > internals changes, so they simply can support > '2.6.26-1-amd64' => '2.6.26-2-amd64' situation. > > Distros can write conversion program too, but I don't expect they will. > >> It may work for a few versions, and then you'll get a spaghetti of >> #ifdef's in the code, together with a plethora of legacy code. > > Expectation is for one kernel branch like RHEL5 kernel updates during > RHEL5 lifecycle. > > For RHEL5 => RHEL6, it's up to them what to do. > > Anyway distro can add compat code _anyway_, for this we help them with > this image format tweak, so they won't bug mainline with "reserve bit 31 > for Red Hat". > > Image version is kept small (__le32) for this reason too :-) > >> It is much better/easier to handle checkpoint image transformations >> in user space. The kernel will only understand its "current" version >> (for some definition of version). >> >>> Object image is very flexible, the only required parts are a) object type (u32) >>> and b) object total length (u32, [knocks wood]) which must be at the beginning >>> of an image. The rest is not generic C/R code problem. >>> >>> Object images follow one another without holes. Holes are in theory possible but >>> unneeded. >>> >> When would you need holes ? >> >>> Image ends with terminator object. This is mostly to be sure, that, yes, image >>> wasn't truncated for some reason. >>> >>> >>> * Objects subject to C/R >>> >>> The idea is to not be very smart but directly dump core kernel data structures >>> related to processes. This includes in this patch: >>> >>> struct task_struct >>> struct mm_struct >>> VMAs >>> dirty pages >>> struct file >>> >>> Relations between objects (task_struct has pointer to mm_struct) are fullfilled >>> by dumping pointed to object first, keeping it's position in dumpfile and saving >>> position in a image of pointe? object: >> Unless you use the physical position to actually lseek to there to >> re-read the data, there is no reason to use the actual position. In >> fact it is easier to debug when the shared object identifier is a >> simple counter. >> >> If you do use it to lseek, then it's a poor decision -- sounds fragile: >> what if we change the file (legitimately) adding data in the middle - >> the whole concept breaks. > > Adder of data is expected to understand image format and update all references > just like surgeon is expected to understand human anatomy. > >>> struct cr_image_task_struct { >>> cr_pos_t cr_pos_mm; >>> ... >>> }; >>> >>> Code so far tries hard to dump objects in certain order so there won't be any loops. >>> This property of process that dumpfile can in theory be O_APPEND, will likely be >>> sacrifised (read: child can ptrace parent) >> The ability to streamline the checkpoint image IMHO is invaluable. >> It's the unix way (TM) of doing things; it makes the process pipe-able. >> >> You can do many nice things when the checkpoint can be streamed: you >> can compress, sign, encrypt etc on the fly without taking additional >> diskspace. You can transfer over the network (e.g. for migration), >> or store remotely without explicit file system support. You can easily >> transform the stream from one c/r version to another etc. >> >> This should be a design principle. In my experience I never hit a wall >> that forced me to "sacrifice" this decision. >> >>> sacrifised (read: child can ptrace parent) >> Hmmm... if all tasks are created in user space, then this specific >> becomes a no-brainer ! > > No! Actually yes :) > > A ptraces B. Container is checkpointed. > > Kernel realizes ptrace is going on. A and B in theory can have any > realitionship. > > Consequently, kernel doesn't know in which order to dump A and B. > > And there is no such order: > *) A can be parent of B (you dump A, B), > *) A can be child of B (you want to dump B, A, but this conflicts with > ->real_parent order) > *) A and B just tasks (any order). Current code does not support ptrace() - which has a multitude if tidy-bits issues to solve during restart regardless. However, creating tasks in userspace uses (and will uses) only "real" process relationships, not ptrace-relationships, when it comes to decide on the fork/clone order. Technically, that can be done in checkpoint (dumping the task tree) or in restart-user-space (rearranging the data before fork/clone). > > I'm showing that whole issue can be avoided: If the issue can be avoided, then why would you need to sacrifice the stream-ability of the checkpoint image ? > *) all tasks are simply created regardless of who is parent of whom > (see kernel_thread()) > *) Every task_struct image among other things contains references to > ->real_parent and ->parent. > *) After every task is created it's time to change references: > **) lookup who is ->real_parent, change ->real_parent _by hand_ > not with some "correct clone(2)" order. > **) lookup who is ->parent, change ->parent. > > You're probably escaping all of this with object numbers? (Will be) escaping this by arranging to fork/clone in the proper order. > >>> * add struct vm_operations_struct::checkpoint >>> >>> just like with files, code that creates special VMAs should know what to do with them >>> used. >>> >>> just like with files, deny checkpointing by default >>> >>> So far used to install vDSO to same place. >> VDSO can be a troublemaker; in recent kernels its location in the MM >> can be randomized. > > See arch_setup_additional_pages() patch. > >> It is not necessarily immutable - it can reflect >> ynamic kernel data. It may contain different code on newer versions, >> so must be compared or worked around during restart etc. > > i386 if I'm not mistaken only contain syscall entry code, but, yes, > generally one should check if PC is inside such page. If you up restart on a different kernel that has a different VDSO, then you need to bring the old VDSO with you, and tweak it so it pulls the dynamic kernel data from the right place. Ugh ... :( Oren. > >>> * add checkpoint(2) >>> >>> Done by determining which tasks are subject to checkpointing, freezeing them, >>> collecting pointers to necessary kernel internals (task_struct, mm_struct, ...), >>> doing that checking supported/unsupported status and aborting if necessary, >>> actual dumping, unfreezeing/killing set of tasks. >>> >>> Also in-checkpoint refcount is maintained to abort on possible invisible changes. >>> Now it works: >>> >>> For every collected object (mm_struct) keep numbers of references from >>> other collected objects. It should match object's own refcount. >>> If there is a mismatch, something is likely pinning object, which means >>> there is "leak" to outside which means checkpoint(2) can't realistically and >>> without consequences proceed. >>> >>> This is in some sense independent check. It's designed to protect from internals >>> change when C/R code was forgotten to be updated. >>> >>> Userpsace supplies pid of root task and opened file descriptor of future dump file. >>> Kernel reports 0/-E as usual. >>> >>> Runtime tracking of "checkpointable" property is explicitly not done. >>> This introduces overhead even if checkpoint(2) is not done as shown by proponents. >>> Instead any check is done at checkpoint(2) time and -E is returned if something is >>> suspicious or known to be unsupported. >>> >>> FIXME: more checks especially in cr_check_task_struct(). >>> >>> * add restart(2) >>> >>> Recreate tasks and evething dumped by checkpoint(2) as if nothing happened. >>> >>> The focus is on correct recreating, checking every possibility that target kernel >>> can be on different arch (i386 => x86_64) and target kernel can be very different >>> from source kernel by mistake (i386 => x86_64 COMPAT=n) kernel. >>> >>> restart(2) is done first by creating kernel thread and that demoting it to usual >>> process by adding mm_struct, VMAs, et al. This saves time against method when >>> userspace does fork(2)+restart(2) -- forked mm_struct will be thrown out anyway >>> or at least everything will be unmapped in any case. >> Do have figures to support your claims about "saves time" ? >> >> The *largest* component of the restart time, as you probably know, >> is the time it takes to restore the memory address space (pages, pages) >> of the tasks. >> >> If you do show that this optimization is worth our attention, then it >> takes < 10 lines to change current mktree.c to use CLONE_VM ... voila. >> >> I'm interested in hearing more convincing arguments in favor of kernel >> creations of restarting tasks (see my other post about it). > > OK, in another post. > >>> Restoration is done in current context except CPU registers at last stage. >>> This is because "creation is done by current" is in many, many places, >>> e.g. mmap(2) code. >>> >>> It's expected that filesystem state will be the same. Kernel can't do anything >>> about it expect probably virtual filesystems. If a file is not there anymore, >>> it's not kernel fault, -E will be returned, restart aborted. >>> >>> FIXME: errors aren't propagated correctly out of kernel thread context >> Heh .. I guess they always propagate correctly out of regular task >> context ;) > > :-) > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/