On Thu, Apr 21, 2022 at 12:28 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, Apr 21, 2022 at 9:30 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > The pipe part sounds like a horrible hacky thing.
> >
> > I also assume you already tried that, and hit some performance issues.
> > But it does sound like the better interface, more directly what you
> > want.
> >
> > So what are the problems with using process_vm_readv?
The big advantage of vmsplice is that it attaches real user pages to a
pipe, so any subsequent changes to these pages by the process don't
trigger allocations or extra copies of data; vmsplice in this case is
fast. After splicing pages into pipes, we resume the process and splice
the pages from the pipes to a socket or a file. The whole process of
dumping process pages is zero-copy.
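A minimal sketch of this pipeline (illustrative only, not CRIU's actual
code, which runs the vmsplice step from parasite code injected into the
dumped task; error handling is abbreviated):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/* Attach a range of our own pages to a pipe, then move them to out_fd.
 * A real implementation would enlarge the pipe with
 * fcntl(F_SETPIPE_SZ) and keep the pages parked in pipes until the
 * target is resumed. */
static int dump_range(void *addr, size_t len, int out_fd)
{
	int p[2];

	if (pipe(p) < 0)
		return -1;

	while (len > 0) {
		struct iovec iov = { .iov_base = addr, .iov_len = len };

		/* Zero-copy: the pipe now references the user pages. */
		ssize_t in = vmsplice(p[1], &iov, 1, 0);
		if (in <= 0)
			return -1;

		/* Move the pages from the pipe to a file or socket. */
		if (splice(p[0], NULL, out_fd, NULL, in, 0) != in)
			return -1;

		addr = (char *)addr + in;
		len -= in;
	}
	close(p[0]);
	close(p[1]);
	return 0;
}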
>
> Actually, I take that back.
>
> Don't use pipes.
>
> Don't use process_vm_readv().
>
> Use the system call we already have for "snapshot the current VM".
>
> It's called "fork()". It's cheap, it's efficient, and it snapshots the
> whole VM in one go. No stupid extra buffers in pipes, no crazy things
> like that.
>
> So just make your pre-dump code do a simple fork(), let the parent
> continue, and then do the dumping in the child at whatever pace you
> want.
>
> In fact, you might just leave the child process alone, and let it _be_
> that pre-dump.
>
> You can create a new snapshot every once in a while, and kill the
> previous snapshot, if you want to keep the snapshot close to the
> target, and then use the memory tracking to track what has changed
> since.
>
> And you might not want to use plain "fork()", but instead some kind of
> "clone()" variant. You might want to use CLONE_PARENT and some
> non-SIGCHLD exit signal to basically hide the snapshot image from the
> thing you are snapshotting.
>
> Anyway, the "use vmsplice to a pipe to create a snapshot" sounds just
> insane when you have a very traditional system call that is all about
> snapshotting the process.
>
> Maybe a new CLONE_xyz flag could be added to make that memory tracking
> integrate better or whatever.
>
> Any showstoppers?
We considered this approach. CRIU dumps a tree of processes; in many
cases, it's a container with its own pid namespace. In such cases, it
isn't possible to fork helper processes without affecting the behavior
of the dumped processes. First, the helpers will be visible to the
dumped processes. Second, waitid with __WALL will wait for our helpers,
and a dumped process can be very surprised to find a child that it
hasn't created. For the pre-dump, we don't need a true memory snapshot;
we don't care about changed pages. But if we fork a process at the
wrong moment, we can double its memory consumption, and since this
happens in the dumped process's context, we can hit its resource limits
or trigger OOM in the dumped container.
Forking the helper itself can also run into resource limits such as
rlimits or cgroup limits.
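For concreteness, the clone() variant suggested above would look
roughly like the sketch below (the concerns above apply to it too; it
assumes the call runs in the task being snapshotted, and dump_pages()
is a hypothetical placeholder):

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <unistd.h>

static int snapshot_main(void *arg)
{
	/* Runs on a copy-on-write snapshot of the parent's memory; dump
	 * it at whatever pace we want, then exit. */
	/* dump_pages(arg); */
	(void)arg;
	return 0;
}

int take_snapshot(void)
{
	size_t stack_size = 64 * 1024;
	char *stack = malloc(stack_size);	/* leaked in this sketch */

	if (!stack)
		return -1;

	/* CLONE_PARENT reparents the snapshot to the caller's parent,
	 * and leaving SIGCHLD out of the flags gives it no exit signal,
	 * following the suggestion above. The stack grows down, so pass
	 * its top. */
	return clone(snapshot_main, stack + stack_size, CLONE_PARENT, NULL);
}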
Thanks,
Andrei
On Thu, Apr 21, 2022 at 10:23 PM Andrei Vagin <[email protected]> wrote:
>
> The big advantage of vmsplice is that it attaches real user pages to a
> pipe, so any subsequent changes to these pages by the process don't
> trigger allocations or extra copies of data; vmsplice in this case is
> fast. After splicing pages into pipes, we resume the process and splice
> the pages from the pipes to a socket or a file. The whole process of
> dumping process pages is zero-copy.
Hmm. What happens if you just use /proc/<pid>/mem?
That just takes a reference to the tsk->mm. No page copies at all.
After that you can do anything you want to that mm.
Well, anything a /proc/<pid>/mem fd allows, which is mainly read and
write. But it stays around for as long as you keep it open, and
fundamentally stays coherent with that mm, because it *is* that mm.
And it doesn't affect anything else, because all it literally has is
that mm_struct pointer.
Linus
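For reference, a minimal sketch of that interface (reading one region
of a target we are allowed to ptrace; the start address would come from
/proc/<pid>/maps):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Copy len bytes at the target's virtual address start into buf. */
static ssize_t read_region(pid_t pid, unsigned long start,
			   char *buf, size_t len)
{
	char path[64];
	int fd;
	ssize_t n;

	snprintf(path, sizeof(path), "/proc/%d/mem", pid);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* The file offset is the target's virtual address. */
	n = pread(fd, buf, len, start);
	close(fd);
	return n;
}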
On Fri, Apr 22, 2022 at 10:23 AM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, Apr 21, 2022 at 10:23 PM Andrei Vagin <[email protected]> wrote:
> >
> > The big advantage of vmsplice is that it attaches real user pages to a
> > pipe, so any subsequent changes to these pages by the process don't
> > trigger allocations or extra copies of data; vmsplice in this case is
> > fast. After splicing pages into pipes, we resume the process and splice
> > the pages from the pipes to a socket or a file. The whole process of
> > dumping process pages is zero-copy.
>
> Hmm. What happens if you just use /proc/<pid>/mem?
>
> That just takes a reference to the tsk->mm. No page copies at all.
> After that you can do anything you want to that mm.
>
> Well, anything a /proc/<pid>/mem fd allows, which is mainly read and
> write. But it stays around for as long as you keep it open, and
> fundamentally stays coherent with that mm, because it *is* that mm.
>
> And it doesn't affect anything else, because all it literally has is
> that mm_struct pointer.
I think the main reason for using vmsplice & splice was zero-copy. I
wrote a small benchmark to compare /proc/pid/mem, process_vm_readv, and
vmsplice. The benchmark emulates how CRIU dumps memory: it creates a
child process and dumps the child's memory into a file. The code is
here:
https://github.com/avagin/procmem.
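For context, the inner loop of the process_vm_readv case looks roughly
like this (a sketch, not the benchmark's actual code; the caller then
writes buf to the dump file, so the data is copied at least twice,
unlike the zero-copy vmsplice path):

#define _GNU_SOURCE
#include <sys/uio.h>
#include <unistd.h>

/* Copy len bytes from the child's address addr into our buffer. */
static ssize_t read_remote(pid_t pid, void *addr, char *buf, size_t len)
{
	struct iovec local = { .iov_base = buf, .iov_len = len };
	struct iovec remote = { .iov_base = addr, .iov_len = len };

	return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}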
Here are results from my laptop:
$ ./procmem [CMD] [DUMP FILE] [BUF_SIZE] [MEM_SIZE]
$ ./procmem splice /tmp/procmem.out 1048576 2147483648
ok 4877 MB/sec
ok 4733 MB/sec
ok 4777 MB/sec
ok 4766 MB/sec
ok 4821 MB/sec
ok 4777 MB/sec
ok 4798 MB/sec
ok 4798 MB/sec
ok 4798 MB/sec
ok 4798 MB/sec
$ ./procmem mem /tmp/procmem.out 1048576 2147483648
ok 3236 MB/sec
ok 2651 MB/sec
ok 3216 MB/sec
ok 3211 MB/sec
ok 3216 MB/sec
ok 3206 MB/sec
ok 3211 MB/sec
ok 3216 MB/sec
ok 3206 MB/sec
ok 3211 MB/sec
$ ./procmem process_vm_readv /tmp/procmem.out 1048576 2147483648
ok 3833 MB/sec
ok 3075 MB/sec
ok 3792 MB/sec
ok 3792 MB/sec
ok 3819 MB/sec
ok 3813 MB/sec
ok 3819 MB/sec
ok 3806 MB/sec
ok 3799 MB/sec
ok 3813 MB/sec
vmsplice & splice is the fastest; /proc/pid/mem is about 30% slower,
and process_vm_readv is about 20% slower.
Thanks,
Andrei