2017-06-01 06:53:14

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE

On Tue, May 30, 2017 at 04:39:41PM +0200, Michal Hocko wrote:
> On Tue 30-05-17 16:04:56, Andrea Arcangeli wrote:
> >
> > UFFDIO_COPY while not being a major slowdown for sure, it's likely
> > measurable at the microbenchmark level because it would add an
> > enter/exit kernel to every 4k memcpy. It's not hard to imagine that as
> > measurable. How that impacts the total precopy time I don't know, it
> > would need to be benchmarked to be sure.
>
> Yes, please!

I've run a simple test (below) that fills 1G of memory either with memcpy
or with ioctl(UFFDIO_COPY) in 4K chunks.
The machine I used has two "Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz" CPUs
and 128G of RAM.
I've averaged the elapsed time reported by /usr/bin/time over 100 runs and
here's what I've got:

memcpy with THP on: 0.3278 sec
memcpy with THP off: 0.5295 sec
UFFDIO_COPY: 0.44 sec

That said, for the CRIU usecase UFFDIO_COPY seems faster than disabling THP
and then doing memcpy.

--
Sincerely yours,
Mike.

----------------------------------------------------------
{
	...

	src = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED)
		fprintf(stderr, "map src failed\n"), exit(1);
	*((unsigned long *)src) = 1;

	if (disable_huge && prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
		fprintf(stderr, "prctl failed\n"), exit(1);

	dst = mmap(NULL, page_size * nr_pages, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (dst == MAP_FAILED)
		fprintf(stderr, "map dst failed\n"), exit(1);

	if (use_uffd && userfaultfd_register(dst))
		fprintf(stderr, "userfault_register failed\n"), exit(1);

	for (i = 0; i < nr_pages; i++) {
		char *address = dst + i * page_size;

		if (use_uffd) {
			struct uffdio_copy uffdio_copy;

			uffdio_copy.dst = (unsigned long)address;
			uffdio_copy.src = (unsigned long)src;
			uffdio_copy.len = page_size;
			uffdio_copy.mode = 0;
			uffdio_copy.copy = 0;

			ret = ioctl(uffd, UFFDIO_COPY, &uffdio_copy);
			if (ret)
				fprintf(stderr, "copy: %d, %d\n", ret, errno),
					exit(1);
		} else {
			memcpy(address, src, page_size);
		}
	}

	return 0;
}


2017-06-01 08:09:15

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE

On Thu 01-06-17 09:53:02, Mike Rapoport wrote:
> On Tue, May 30, 2017 at 04:39:41PM +0200, Michal Hocko wrote:
> > On Tue 30-05-17 16:04:56, Andrea Arcangeli wrote:
> > >
> > > UFFDIO_COPY while not being a major slowdown for sure, it's likely
> > > measurable at the microbenchmark level because it would add an
> > > enter/exit kernel to every 4k memcpy. It's not hard to imagine that as
> > > measurable. How that impacts the total precopy time I don't know, it
> > > would need to be benchmarked to be sure.
> >
> > Yes, please!
>
> I've run a simple test (below) that fills 1G of memory either with memcpy
> or with ioctl(UFFDIO_COPY) in 4K chunks.
> The machine I used has two "Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz" CPUs
> and 128G of RAM.
> I've averaged the elapsed time reported by /usr/bin/time over 100 runs and
> here's what I've got:
>
> memcpy with THP on: 0.3278 sec
> memcpy with THP off: 0.5295 sec
> UFFDIO_COPY: 0.44 sec

I assume that the standard deviation is small?

> That said, for the CRIU usecase UFFDIO_COPY seems faster than disabling THP
> and then doing memcpy.

That is a bit surprising. I didn't think that the userfault syscall
(ioctl) could be faster than a regular #PF, but the fact that
__mcopy_atomic bypasses the page fault path and can be optimized for
the anon case suggests that we can save some cycles for each page, so
the cumulative savings can be visible.

--
Michal Hocko
SUSE Labs

2017-06-01 08:35:13

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE

On Thu, Jun 01, 2017 at 10:09:09AM +0200, Michal Hocko wrote:
> On Thu 01-06-17 09:53:02, Mike Rapoport wrote:
> > On Tue, May 30, 2017 at 04:39:41PM +0200, Michal Hocko wrote:
> > > On Tue 30-05-17 16:04:56, Andrea Arcangeli wrote:
> > > >
> > > > UFFDIO_COPY while not being a major slowdown for sure, it's likely
> > > > measurable at the microbenchmark level because it would add an
> > > > enter/exit kernel to every 4k memcpy. It's not hard to imagine that as
> > > > measurable. How that impacts the total precopy time I don't know, it
> > > > would need to be benchmarked to be sure.
> > >
> > > Yes, please!
> >
> > I've run a simple test (below) that fills 1G of memory either with memcpy
> > or with ioctl(UFFDIO_COPY) in 4K chunks.
> > The machine I used has two "Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz" CPUs
> > and 128G of RAM.
> > I've averaged the elapsed time reported by /usr/bin/time over 100 runs and
> > here's what I've got:
> >
> > memcpy with THP on: 0.3278 sec
> > memcpy with THP off: 0.5295 sec
> > UFFDIO_COPY: 0.44 sec
>
> I assume that the standard deviation is small?

Yes.

> > That said, for the CRIU usecase UFFDIO_COPY seems faster than disabling THP
> > and then doing memcpy.
>
> That is a bit surprising. I didn't think that the userfault syscall
> (ioctl) could be faster than a regular #PF, but the fact that
> __mcopy_atomic bypasses the page fault path and can be optimized for
> the anon case suggests that we can save some cycles for each page, so
> the cumulative savings can be visible.

2017-06-01 13:45:26

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE

On Thu, Jun 01, 2017 at 10:09:09AM +0200, Michal Hocko wrote:
> That is a bit surprising. I didn't think that the userfault syscall
> (ioctl) could be faster than a regular #PF, but the fact that
> __mcopy_atomic bypasses the page fault path and can be optimized for
> the anon case suggests that we can save some cycles for each page, so
> the cumulative savings can be visible.

__mcopy_atomic works not just for anonymous memory; hugetlbfs/shmem
are covered too and there are branches to handle those.

If you were to run more than one precopy pass, UFFDIO_COPY would become
slower than the userland access starting from the second pass.

In light of this, if CRIU can only do one single pass of precopy,
CRIU is probably better off using UFFDIO_COPY than using prctl or
madvise to temporarily turn off THP.

With QEMU, by contrast, we set MADV_HUGEPAGE during precopy on the
destination to maximize the THP utilization for all those 2M naturally
aligned guest regions that aren't re-dirtied on the source, so we're
better off not using UFFDIO_COPY in precopy even during the first
pass, to avoid the kernel enter/exit for subpages that are written to
the destination in an already instantiated THP. At least until we teach
QEMU to map 2M at once when possible (UFFDIO_COPY would then also
require an enhancement, because currently it won't map THP on the
fly).

Thanks,
Andrea

2017-06-02 09:11:45

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE

On Thu, Jun 01, 2017 at 03:45:22PM +0200, Andrea Arcangeli wrote:
> On Thu, Jun 01, 2017 at 10:09:09AM +0200, Michal Hocko wrote:
> > That is a bit surprising. I didn't think that the userfault syscall
> > (ioctl) could be faster than a regular #PF, but the fact that
> > __mcopy_atomic bypasses the page fault path and can be optimized for
> > the anon case suggests that we can save some cycles for each page, so
> > the cumulative savings can be visible.
>
> __mcopy_atomic works not just for anonymous memory; hugetlbfs/shmem
> are covered too and there are branches to handle those.
>
> If you were to run more than one precopy pass, UFFDIO_COPY would become
> slower than the userland access starting from the second pass.
>
> In light of this, if CRIU can only do one single pass of precopy,
> CRIU is probably better off using UFFDIO_COPY than using prctl or
> madvise to temporarily turn off THP.

CRIU does memory tracking differently from QEMU. Every round of pre-copy
in CRIU means we dump the dirty pages into an image file, and the restore
then chooses which image file to use. Either way, we fill the memory only
once, at restore time, hence UFFDIO_COPY would be better than disabling THP.

> With QEMU, by contrast, we set MADV_HUGEPAGE during precopy on the
> destination to maximize the THP utilization for all those 2M naturally
> aligned guest regions that aren't re-dirtied on the source, so we're
> better off not using UFFDIO_COPY in precopy even during the first
> pass, to avoid the kernel enter/exit for subpages that are written to
> the destination in an already instantiated THP. At least until we teach
> QEMU to map 2M at once when possible (UFFDIO_COPY would then also
> require an enhancement, because currently it won't map THP on the
> fly).

--
Sincerely yours,
Mike.