by Justin P. Mattock

On Wed, Aug 20, 2008 at 11:06:50PM -0400, Oren Laadan wrote:
>
> Infrastructure to handle objects that may be shared and referenced by
> multiple tasks or other objects, e.g. open files, memory address space
> etc.
>
> The state of shared objects is saved once. On the first encounter, the
> state is dumped and the object is assigned a unique identifier and also
> stored in a hash table (indexed by its physical kernel address). From
> then on the object will be found in the hash and only its identifier is
> saved.
>
> On restart the identifier is looked up in the hash table; if not found
> then the state is read, the object is created, and added to the hash
> table (this time indexed by its identifier). Otherwise, the object in
> the hash table is used.
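
The two lookup directions described above can be sketched in user space as follows. This is only an illustration, not the code from the patch: `struct objhash`, `ckpt_lookup_or_add()`, `restart_lookup()` and `restart_add()` are hypothetical names, the bucket count and the hash functions are arbitrary assumptions, and a given table is meant to be used in one mode only (by address at checkpoint, by tag at restart).

```c
#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS 64            /* assumption: small power of two for the sketch */

struct obj {
    int tag;                   /* identifier written to the checkpoint image */
    void *ptr;                 /* address of the live object */
    struct obj *next;          /* bucket chain */
};

struct objhash {
    struct obj *bucket[NBUCKETS];
    int next_tag;              /* next identifier handed out at checkpoint */
};

/* Arbitrary illustrative hash functions (not the kernel's hash_ptr/hash_long). */
static unsigned hash_ptr(const void *p)
{
    return (unsigned)(((uintptr_t)p >> 4) % NBUCKETS);
}

static unsigned hash_tag(int tag)
{
    return (unsigned)tag % NBUCKETS;
}

static struct obj *insert(struct objhash *h, unsigned b, void *ptr, int tag)
{
    struct obj *o = malloc(sizeof(*o));

    if (!o)
        return NULL;
    o->tag = tag;
    o->ptr = ptr;
    o->next = h->bucket[b];
    h->bucket[b] = o;
    return o;
}

/*
 * Checkpoint side: table indexed by address. Returns the object's tag,
 * or -1 on allocation failure. *first is set to 1 on the first encounter
 * (the caller must dump the full state) and to 0 afterwards (the caller
 * saves only the tag).
 */
int ckpt_lookup_or_add(struct objhash *h, void *ptr, int *first)
{
    unsigned b = hash_ptr(ptr);
    struct obj *o;

    for (o = h->bucket[b]; o; o = o->next) {
        if (o->ptr == ptr) {
            *first = 0;
            return o->tag;
        }
    }
    o = insert(h, b, ptr, h->next_tag);
    if (!o)
        return -1;
    h->next_tag++;
    *first = 1;
    return o->tag;
}

/*
 * Restart side: table indexed by tag. Returns the object, or NULL if the
 * tag has not been seen yet (the caller then reads the state, creates the
 * object and registers it with restart_add()).
 */
void *restart_lookup(struct objhash *h, int tag)
{
    struct obj *o;

    for (o = h->bucket[hash_tag(tag)]; o; o = o->next)
        if (o->tag == tag)
            return o->ptr;
    return NULL;
}

int restart_add(struct objhash *h, void *ptr, int tag)
{
    return insert(h, hash_tag(tag), ptr, tag) ? 0 : -1;
}
```

The sketch never frees entries; a real implementation would need the equivalent of cr_objhash_free() and reference handling for the stored objects.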

[...]

> diff --git a/checkpoint/ckpt.h b/checkpoint/ckpt.h
> index 0addb63..8b02c4c 100644
> --- a/checkpoint/ckpt.h
> +++ b/checkpoint/ckpt.h
> @@ -29,6 +29,8 @@ struct cr_ctx {
> void *hbuf; /* header: to avoid many alloc/dealloc */
> int hpos;
>
> + struct cr_objhash *objhash;
> +
> struct cr_pgarr *pgarr;
> struct cr_pgarr *pgcur;
>
> @@ -56,6 +58,22 @@ int cr_kread(struct cr_ctx *ctx, void *buf, int count);
> void *cr_hbuf_get(struct cr_ctx *ctx, int n);
> void cr_hbuf_put(struct cr_ctx *ctx, int n);
>
> +/* shared objects handling */
> +
> +enum {
> + CR_OBJ_FILE = 1,
> + CR_OBJ_MAX
> +};
> + +void cr_objhash_free(struct cr_ctx *ctx);
^
|
Strange, isn't it?

> +int cr_objhash_alloc(struct cr_ctx *ctx);
> +void *cr_obj_get_by_tag(struct cr_ctx *ctx, int tag, unsigned short type);
> +int cr_obj_get_by_ptr(struct cr_ctx *ctx, void *ptr, unsigned short type);
> +int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *tag,
> + unsigned short type, unsigned short flags);
> +int cr_obj_add_tag(struct cr_ctx *ctx, void *ptr, int tag,
> + unsigned short type, unsigned short flags);
> +
> struct cr_hdr;
>
> int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
> diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
> new file mode 100644
> index 0000000..aca32c6
> --- /dev/null
> +++ b/checkpoint/objhash.c
> @@ -0,0 +1,193 @@
> +/*
> + * Checkpoint-restart - object hash infrastructure to manage shared objects
> + *
> + * Copyright (C) 2008 Oren Laadan
> + *
> + * This file is subject to the terms and conditions of the GNU General Public
> + * License. See the file COPYING in the main directory of the Linux
> + * distribution for more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/file.h>
> +#include <linux/hash.h>
> +
> +#include "ckpt.h"
> +
> +struct cr_obj {
> + int tag;
> + void *ptr;
> + unsigned short type;
> + unsigned short flags;
> + struct cr_obj *next;
> +};
> +
> +struct cr_objhash {
> + struct cr_obj **hash;
> + int next_tag;
> +};
> +
> +#define CR_OBJHASH_NBITS 10 /* 10 bits = 1K buckets */
> +#define CR_OBJHASH_ORDER 0 /* 1K buckets * 4 bytes/bucket = 1 page */

Only true when PAGE_SIZE == 4K and on 32-bit. Perhaps something like below?

#define CR_OBJHASH_BUCKET_NBITS (BITS_PER_LONG == 64 ? 3 : 2)
#define CR_MIN_OBJHASH_NBITS (PAGE_SHIFT - CR_OBJHASH_BUCKET_NBITS)
#define CR_OBJHASH_NBITS (CR_MIN_OBJHASH_NBITS >= 10 ? CR_MIN_OBJHASH_NBITS : 10)
#define CR_OBJHASH_ORDER (CR_OBJHASH_NBITS + CR_OBJHASH_BUCKET_NBITS - PAGE_SHIFT)
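
To sanity-check the arithmetic, the same expressions can be written as plain functions and evaluated for a few configurations. This is just a user-space sketch: the function names are made up for illustration, and page_shift and bits_per_long are passed as parameters instead of coming from PAGE_SHIFT and BITS_PER_LONG.

```c
/* log2 of the size of one bucket pointer: 8 bytes on 64-bit, 4 on 32-bit */
int bucket_nbits(int bits_per_long)
{
    return bits_per_long == 64 ? 3 : 2;
}

/* Number of hash bits: enough buckets to fill one page, but at least 10 */
int objhash_nbits(int page_shift, int bits_per_long)
{
    int min_nbits = page_shift - bucket_nbits(bits_per_long);

    return min_nbits >= 10 ? min_nbits : 10;
}

/* Page allocation order needed to hold (1 << nbits) bucket pointers */
int objhash_order(int page_shift, int bits_per_long)
{
    return objhash_nbits(page_shift, bits_per_long)
           + bucket_nbits(bits_per_long) - page_shift;
}
```

With 4K pages this gives order 0 on 32-bit and order 1 on 64-bit (1K buckets of 8 bytes need two pages); with 64K pages on 64-bit it gives 13 bits and order 0.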

Louis

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes


2008-08-21 11:06:26


Subject: Re: [RFC v2][PATCH 4/9] Memory management - dump state

Oren Laadan wrote:
> Dave, Serge:
>
> I'm currently away so I must keep this short. I think we have so far
> more discussion than an actual problem. I'm happy to coordinate with
> every interested party to eventually see this work go into main stream.

thanks. We do need a moderator and federator.

> My only concerns are twofold: first, to get more feedback I believe we
> need to get the code a bit more usable; including FDs is an excellent
> way to actually do that. That will add significant value to the patch.

hmm, yes and no.

FDs are a must-have, but I would be more interested in seeing external
checkpoint/restart and signal support first. Why? Because it would
already be usable for most computational programs in HPC, like this stupid
one:

https://www.ccs.uky.edu/~bmadhu/pi/pi1.c

signals are required because it's how 'load' and/or 'system' managers
interact with the jobs they spawn. external checkpoint/restart for the
same reason.

for files, I would first only care about stdio (make sure the streams are
relinked to something safe on restart) and the file state of regular files.
Contents are generally handled externally (deleted files being an annoying
exception).

then, support for OpenMP applications is a nice-to-have, so I'd probably
go on with thread support.

> I think it's important to demonstrate how shared resources and multiple
> processes are handled. FDs demonstrate the former (with a fixed version
> of the recent patchset - I will post soon).

shared resources are only useful in a multiprocess/multitask context.
I'd start working on this first. Here we jump directly into the pid namespace
issues: how do we start a set of processes in a pid namespace? How do we
relink it to its parent pid namespace? Are signals propagated correctly? etc.
But hey, we'll have to solve it one day.

FDs are shared but have many types, which are a pain to handle. (It would
be interesting to see if we can add checkpoint() and restart() operations in
fileops.) So, to demonstrate shared resources, I'd work on sysvipc:
there are fewer types to handle and they force us to think about how we are
going to merge with the sysvipc namespaces.
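
As a rough illustration of the fileops idea, here is a user-space sketch of per-type checkpoint() and restart() hooks. Nothing here is an existing kernel interface: `struct demo_file`, `struct demo_file_ops`, `struct ckpt_buf` and the helpers are hypothetical, and the only state saved for the "regular file" type is a file position.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the saved state of one file */
struct ckpt_buf {
    unsigned char data[64];
    size_t len;
};

struct demo_file;

/* Hypothetical per-type operations, in the spirit of file_operations */
struct demo_file_ops {
    int (*checkpoint)(const struct demo_file *f, struct ckpt_buf *out);
    int (*restart)(struct demo_file *f, const struct ckpt_buf *in);
};

struct demo_file {
    const struct demo_file_ops *ops;
    long pos;                  /* the only state this sketch saves */
};

/* "Regular file" type: save and restore the position */
static int reg_checkpoint(const struct demo_file *f, struct ckpt_buf *out)
{
    memcpy(out->data, &f->pos, sizeof(f->pos));
    out->len = sizeof(f->pos);
    return 0;
}

static int reg_restart(struct demo_file *f, const struct ckpt_buf *in)
{
    if (in->len != sizeof(f->pos))
        return -1;             /* malformed state */
    memcpy(&f->pos, in->data, sizeof(f->pos));
    return 0;
}

const struct demo_file_ops reg_ops = { reg_checkpoint, reg_restart };

/* A type that does not implement the hooks */
const struct demo_file_ops unsupported_ops = { NULL, NULL };

/* Generic entry points: fail explicitly when a type lacks support */
int file_checkpoint(const struct demo_file *f, struct ckpt_buf *out)
{
    if (!f->ops || !f->ops->checkpoint)
        return -1;
    return f->ops->checkpoint(f, out);
}

int file_restart(struct demo_file *f, const struct ckpt_buf *in)
{
    if (!f->ops || !f->ops->restart)
        return -1;
    return f->ops->restart(f, in);
}
```

The point of the dispatch is that each file type carries its own save/restore logic, and a type without hooks is reported as an error at checkpoint time instead of being silently skipped.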

> The latter will increase the size of the patchset significantly, so
> perhaps can indeed wait for now.

hmm, that depends how you do it.

If you restart the whole hierarchy in the kernel, it will surely increase
the patch footprint. However, if you restore the hierarchy from user space
and then let each process restore itself from some binary blob, it should
not. This, of course, means that the binary blob representing the state of
the container (we call it the statefile) is not totally opaque. I see it a bit
like /proc: a directory containing shared states (all namespaces) and task
states. That's something to discuss.

I do prefer the second option for many reasons:

. each process restarts itself from its current context, this makes it easier
to reuse kernel services depending on current.

. user tools can evaluate more precisely what they are going to restart from
the statefile. see this as a generalised 'readelf' that would be run on
the statefile, like we do on a core file today.

> It should not be hard for me to add functionality on top of a more
> basic patchset. The question is, what is "basic" ? Anyway, I will be
> back towards the end of the week. Let's try to discuss this over IRC
> then (e.g. Friday afternoon ?).

IMHO, the first step is to support a 'basic' computational program in
a real environment (under an HPC load manager). Your POC nearly reaches
it, but the user space API (how to launch, checkpoint, restart) needs to
be worked on.

There are some big steps in the development.

Multi-task is a big step which opens plenty of other big steps with
shared resources: mem, ipc, fds, etc. Not all have to be solved,
but unsupported ones should at least be detected.

Network is another one. This is an interesting step to support
distributed applications using MPI over TCP. May be a priority.

there are also plenty of funky kernel resources used by misc servers and
databases that will need special attention.

I'll be happy to start with the basic menu first as I know that it will
be useful for many applications !

Thanks,

C.