2009-06-11 18:46:46

by Stefan Lankes

Subject: RE: [RFC PATCH 0/4]: affinity-on-next-touch

> Your patches seem to have a lot of overlap with
> Lee Schermerhorn's old migrate memory on cpu migration patches.
> I don't know the status of those.

I analyzed Lee Schermerhorn's migrate memory on cpu migration patches
(http://free.linux.hp.com/~lts/Patches/PageMigration/). I think that Lee
Schermerhorn adds similar functionality to the kernel. He calls the
"affinity-on-next-touch" functionality "migrate_on_fault" and uses the
normal NUMA memory policies in his patches. Therefore, his solution fits
better into the Linux kernel. I tested his patches with our test
applications and got nearly the same performance results.

I found only patches for kernel 2.6.25-rc2-mm1. Is anyone developing
these patches further?

Stefan



2009-06-12 10:25:00

by Andi Kleen

Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch

On Thu, Jun 11, 2009 at 08:45:29PM +0200, Stefan Lankes wrote:
> > Your patches seem to have a lot of overlap with
> > Lee Schermerhorn's old migrate memory on cpu migration patches.
> > I don't know the status of those.
>
> I analyzed Lee Schermerhorn's migrate memory on cpu migration patches
> (http://free.linux.hp.com/~lts/Patches/PageMigration/). I think that Lee
> Schermerhorn adds similar functionality to the kernel. He calls the
> "affinity-on-next-touch" functionality "migrate_on_fault" and uses the
> normal NUMA memory policies in his patches. Therefore, his solution fits
> better into the Linux kernel. I tested his patches with our test
> applications and got nearly the same performance results.

That's great to know.

I didn't think he had a per process setting though, did he?

> I found only patches for kernel 2.6.25-rc2-mm1. Is anyone developing
> these patches further?

Not to my knowledge. Maybe Lee will pick them up again now that there
are more use cases.

If he doesn't have time maybe you could update them?

-Andi
--
[email protected] -- Speaking for myself only.

2009-06-12 11:46:48

by Stefan Lankes

Subject: RE: [RFC PATCH 0/4]: affinity-on-next-touch


> > I analyzed Lee Schermerhorn's migrate memory on cpu migration patches
> > (http://free.linux.hp.com/~lts/Patches/PageMigration/). I think that Lee
> > Schermerhorn adds similar functionality to the kernel. He calls the
> > "affinity-on-next-touch" functionality "migrate_on_fault" and uses the
> > normal NUMA memory policies in his patches. Therefore, his solution fits
> > better into the Linux kernel. I tested his patches with our test
> > applications and got nearly the same performance results.
>
> That's great to know.
>
> I didn't think he had a per process setting though, did he?

He enables support for migration-on-fault via cpusets (echo 1 >
/dev/cpuset/migrate_on_fault).
Afterwards, any process can initiate migration-on-fault via mbind(...,
MPOL_MF_MOVE|MPOL_MF_LAZY).
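
In program form, the sequence looks roughly like the sketch below.
MPOL_MF_LAZY is not in mainline headers, so the value defined here is
only an assumed placeholder; the real value must be taken from the
patched <linux/mempolicy.h>. Build against numactl-devel with -lnuma.

/* lazy-mbind.c: minimal sketch of requesting lazy migration */
#include <numaif.h>             /* mbind(), MPOL_PREFERRED, MPOL_MF_MOVE */
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

#ifndef MPOL_MF_LAZY
#define MPOL_MF_LAZY    (1 << 3)        /* ASSUMED value -- take from patch */
#endif

int main(void)
{
        size_t len = 64UL << 20;                /* 64 MB, page aligned */
        unsigned long nodemask = 1UL << 1;      /* new target: node 1 */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
                return 1;
        memset(buf, 1, len);    /* first touch: pages land on this node */

        /* unmap the ptes now; each page is checked against the new
         * policy and migrated when it is next touched */
        if (mbind(buf, len, MPOL_PREFERRED, &nodemask,
                  sizeof(nodemask) * 8, MPOL_MF_MOVE | MPOL_MF_LAZY))
                perror("mbind(MPOL_MF_LAZY)");

        buf[0] = 2;             /* this touch faults and migrates page 0 */
        return 0;
}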

> > I found only patches for kernel 2.6.25-rc2-mm1. Is anyone developing
> > these patches further?
>
> Not to my knowledge. Maybe Lee will pick them up again now that there
> are more use cases.
>
> If he doesn't have time maybe you could update them?

We are planning to work in this area. I think I could update these
patches.
At the very least, I can support Lee.

Stefan

2009-06-12 12:39:42

by Brice Goglin

Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch

Stefan Lankes wrote:
> He enables support for migration-on-fault via cpusets (echo 1 >
> /dev/cpuset/migrate_on_fault).
> Afterwards, any process can initiate migration-on-fault via mbind(...,
> MPOL_MF_MOVE|MPOL_MF_LAZY).

So mbind(MPOL_MF_LAZY) is taking care of changing page protection so as
to generate page-faults on next-touch? (instead of your madvise)
Is it migrating the whole memory area? Or only single pages?

Then, what's happening with MPOL_MF_LAZY in the kernel? Is it actually
stored in the mempolicy? If so, couldn't another fault later cause
another migration?
Or is MPOL_MF_LAZY filtered out of the policy once the protection of all
PTEs has been changed?
I don't see why we need a new mempolicy here. If we are migrating single
pages, migrate-on-next-touch looks like a page-attribute to me. There
should be nothing to store in a mempolicy/VMA/whatever.

Brice

2009-06-12 13:21:28

by Stefan Lankes

Subject: RE: [RFC PATCH 0/4]: affinity-on-next-touch

> So mbind(MPOL_MF_LAZY) is taking care of changing page protection so as
> to generate page-faults on next-touch? (instead of your madvise)
> Is it migrating the whole memory area? Or only single pages?

mbind removes the pte references. Page migration occurs when a task
accesses one of these unmapped pages. Therefore, Lee's solution migrates a
single page, not the whole area.

You can find further information on slides 19-23 of
http://mirror.linux.org.au/pub/linux.conf.au/2007/video/talks/197.pdf.

> Then, what's happening with MPOL_MF_LAZY in the kernel? Is it actually
> stored in the mempolicy? If so, couldn't another fault later cause
> another migration?
> Or is MPOL_MF_LAZY filtered out of the policy once the protection of
> all PTEs has been changed?
>
> I don't see why we need a new mempolicy here. If we are migrating
> single pages, migrate-on-next-touch looks like a page-attribute to me.
> There should be nothing to store in a mempolicy/VMA/whatever.
>

MPOL_MF_LAZY is used as a flag and does not specify a new policy. Therefore,
MPOL_MF_LAZY isn't stored in a VMA. The flag only tells the mbind system
call that it has to unmap these pages.

Stefan

2009-06-16 02:21:46

by Lee Schermerhorn

Subject: RE: [RFC PATCH 0/4]: affinity-on-next-touch

On Thu, 2009-06-11 at 20:45 +0200, Stefan Lankes wrote:
> > Your patches seem to have a lot of overlap with
> > Lee Schermerhorn's old migrate memory on cpu migration patches.
> > I don't know the status of those.
>
> I analyzed Lee Schermerhorn's migrate memory on cpu migration patches
> (http://free.linux.hp.com/~lts/Patches/PageMigration/). I think that Lee
> Schermerhorn adds similar functionality to the kernel. He calls the
> "affinity-on-next-touch" functionality "migrate_on_fault" and uses the
> normal NUMA memory policies in his patches. Therefore, his solution fits
> better into the Linux kernel. I tested his patches with our test
> applications and got nearly the same performance results.
>
> I found only patches for kernel 2.6.25-rc2-mm1. Is anyone developing
> these patches further?

Sorry for the delay. I was offline for a long weekend.

Regarding the patches: I was rebasing them every few mmotm releases
until I ran into trouble with the memory controller handling of page
migration conflicting with migrating in the fault path and haven't had
time to investigate a solution.

Here's the problem I have:

When migrating a page with the memory controller configured, the migration
code [mem_cgroup_prepare_migration()] tentatively charges the page
against the control group. Then, when migration completes, it calls
mem_cgroup_end_migration() to commit [or cancel?] the charge. Migration
on fault operates on an anon page in the swap cache that has zero pte
references [page_mapcount(page) == 0] in do_swap_page(). do_swap_page()
does a mem_cgroup_try_charge_swapin() that also tentatively charges the
page. I don't try to migrate the page unless this succeeds. No sense
in doing all that work if the cgroup can't afford the page.

But this ends up with nested "tentative charges" against the page when
I call down into the migration code via migrate_misplaced_page(), and I
was having problems getting the reference counting correct. It would bug
out under load.
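
As a deliberately naive illustration of that nesting (userspace C, NOT
kernel code; the helpers below only mimic the memcg calls named above),
two tentative charges end up held for a single page:

/* charge-model.c: toy model of the two-phase charge protocol.
 * A plain counter stands in for the cgroup's page counter. */
#include <stdio.h>

static long charges;

static void try_charge(const char *who)    { charges++; printf("%s: +1\n", who); }
static void commit_charge(const char *who) { printf("%s: commit\n", who); }

int main(void)
{
        try_charge("do_swap_page/try_charge_swapin");   /* fault path */

        /* lazy migration nested inside the fault path */
        try_charge("mem_cgroup_prepare_migration");
        commit_charge("mem_cgroup_end_migration");

        commit_charge("do_swap_page");                  /* fault path ends */

        /* one page is mapped, but two charges were taken and both
         * committed -- the counter is off by one */
        printf("pages: 1, charges held: %ld\n", charges);
        return 0;
}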

What I want to do is see if the page migration code can "atomically
transfer" the page charge [including any tentative charge from
do_swap_page()] down in migrate_page_copy(), the way all other page
state is copied. Haven't had time to see whether this is feasible.

Regards,
Lee

2009-06-16 02:25:42

by Lee Schermerhorn

Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch

On Fri, 2009-06-12 at 12:32 +0200, Andi Kleen wrote:
> On Thu, Jun 11, 2009 at 08:45:29PM +0200, Stefan Lankes wrote:
> > > Your patches seem to have a lot of overlap with
> > > Lee Schermerhorn's old migrate memory on cpu migration patches.
> > > I don't know the status of those.
> >
> > I analyzed Lee Schermerhorn's migrate memory on cpu migration patches
> > (http://free.linux.hp.com/~lts/Patches/PageMigration/). I think that Lee
> > Schermerhorn adds similar functionality to the kernel. He calls the
> > "affinity-on-next-touch" functionality "migrate_on_fault" and uses the
> > normal NUMA memory policies in his patches. Therefore, his solution fits
> > better into the Linux kernel. I tested his patches with our test
> > applications and got nearly the same performance results.
>
> That's great to know.
>
> I didn't think he had a per process setting though, did he?

Hi, Andi.

My patches don't have per process enablement. Rather, I chose to use
per cpuset enablement. I view cpusets as sort of "numa control groups"
and thought this was an appropriate level at which to control this sort
of behavior--analogous to memory_spread_{page|slab}. That probably
needs to be discussed more widely, tho'.

>
> > I found only patches for kernel 2.6.25-rc2-mm1. Is anyone developing
> > these patches further?
>
> Not to my knowledge. Maybe Lee will pick them up again now that there
> are more use cases.
>
> If he doesn't have time maybe you could update them?

As I mentioned earlier, I need to sort out the interaction with the
memory controller. It was changing too fast for me to keep up in the
time I could devote to it.

Lee

2009-06-16 02:39:53

by Lee Schermerhorn

Subject: RE: [RFC PATCH 0/4]: affinity-on-next-touch

On Fri, 2009-06-12 at 13:46 +0200, Stefan Lankes wrote:
> > > I analyzed Lee Schermerhorn's migrate memory on cpu migration patches
> > > (http://free.linux.hp.com/~lts/Patches/PageMigration/). I think that Lee
> > > Schermerhorn adds similar functionality to the kernel. He calls the
> > > "affinity-on-next-touch" functionality "migrate_on_fault" and uses the
> > > normal NUMA memory policies in his patches. Therefore, his solution fits
> > > better into the Linux kernel. I tested his patches with our test
> > > applications and got nearly the same performance results.
> >
> > That's great to know.
> >
> > I didn't think he had a per process setting though, did he?
>
> He enables support for migration-on-fault via cpusets (echo 1 >
> /dev/cpuset/migrate_on_fault).
> Afterwards, any process can initiate migration-on-fault via mbind(...,
> MPOL_MF_MOVE|MPOL_MF_LAZY).

I should have read through the entire thread before responding to Andi's
mail.

>
> > > I found only patches for kernel 2.6.25-rc2-mm1. Is anyone developing
> > > these patches further?
> >
> > Not to my knowledge. Maybe Lee will pick them up again now that there
> > are more use cases.
> >
> > If he doesn't have time maybe you could update them?
>
> We are planning to work in this area. I think I could update these
> patches.
> At the very least, I can support Lee.

I would like to get these patches working with latest mmotm to test on
some newer hardware where I think they will help more. And I would
welcome your support. However, I think we'll need to get Balbir and
Kamezawa-san involved to sort out the interaction with memory control
group.

I can send you the more recent rebase that I've done. This is getting
pretty old now: 2.6.28-rc4-mmotm-081110-081117. I'll try to rebase to
the most recent mmotm [that boots on my platforms], at least so that we
can build and boot with migrate-on-fault disabled, within the next
couple of weeks.

Regards,
Lee

2009-06-16 13:59:13

by Stefan Lankes

Subject: RE: [RFC PATCH 0/4]: affinity-on-next-touch

>
> I would like to get these patches working with latest mmotm to test on
> some newer hardware where I think they will help more. And I would
> welcome your support. However, I think we'll need to get Balbir and
> Kamezawa-san involved to sort out the interaction with memory control
> group.
>
> I can send you the more recent rebase that I've done. This is getting
> pretty old now: 2.6.28-rc4-mmotm-081110-081117. I'll try to rebase to
> the most recent mmotm [that boots on my platforms], at least so that we
> can build and boot with migrate-on-fault disabled, within the next
> couple of weeks.
>

Sounds good! Send me your latest version and I will try to reproduce the
problems. Afterwards, we can try to solve them.

Regards,

Stefan

2009-06-16 15:00:08

by Lee Schermerhorn

Subject: RE: [RFC PATCH 0/4]: affinity-on-next-touch

On Tue, 2009-06-16 at 15:58 +0200, Stefan Lankes wrote:
> >
> > I would like to get these patches working with latest mmotm to test on
> > some newer hardware where I think they will help more. And I would
> > welcome your support. However, I think we'll need to get Balbir and
> > Kamezawa-san involved to sort out the interaction with memory control
> > group.
> >
> > I can send you the more recent rebase that I've done. This is getting
> > pretty old now: 2.6.28-rc4-mmotm-081110-081117. I'll try to rebase to
> > the most recent mmotm [that boots on my platforms], at least so that we
> > can build and boot with migrate-on-fault disabled, within the next
> > couple of weeks.
> >
>
> Sounds good! Send me your latest version and I will try to reproduce the
> problems. Afterwards, we can try to solve them.
>

Stefan:

I've placed the last rebased version in :

http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-081110/

As I recall, this version DOES bug out because of reference count
problems due to interaction with the memory controller.

As I said, I'll try to rebase to latest mmotm real soon. I need to test
whether it will boot on my test systems first.

Lee


2009-06-17 01:24:29

by Kamezawa Hiroyuki

Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch

On Tue, 16 Jun 2009 10:59:55 -0400
Lee Schermerhorn <[email protected]> wrote:

> On Tue, 2009-06-16 at 15:58 +0200, Stefan Lankes wrote:
> > >
> > > I would like to get these patches working with latest mmotm to test on
> > > some newer hardware where I think they will help more. And I would
> > > welcome your support. However, I think we'll need to get Balbir and
> > > Kamezawa-san involved to sort out the interaction with memory control
> > > group.
> > >
> > > I can send you the more recent rebase that I've done. This is getting
> > > pretty old now: 2.6.28-rc4-mmotm-081110-081117. I'll try to rebase to
> > > the most recent mmotm [that boots on my platforms], at least so that we
> > > can build and boot with migrate-on-fault disabled, within the next
> > > couple of weeks.
> > >
> >
> > Sounds good! Send me your latest version and I will try to reproduce the
> > problems. Afterwards, we can try to solve them.
> >
>
> Stefan:
>
> I've placed the last rebased version in :
>
> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-081110/
>
> As I recall, this version DOES bug out because of reference count
> problems due to interaction with the memory controller.
>
Please report precisely if memcg has a bug.
An example test case is welcome.

Thanks,
-Kame

2009-06-17 07:45:35

by Stefan Lankes

Subject: RE: [RFC PATCH 0/4]: affinity-on-next-touch


> I've placed the last rebased version in :
>
> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-
> 081110/
>

OK! I will try to reproduce the problem.

Stefan

2009-06-17 12:05:29

by Lee Schermerhorn

Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch

On Wed, 2009-06-17 at 10:22 +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 16 Jun 2009 10:59:55 -0400
> Lee Schermerhorn <[email protected]> wrote:
>
> > On Tue, 2009-06-16 at 15:58 +0200, Stefan Lankes wrote:
> > > >
> > > > I would like to get these patches working with latest mmotm to test on
> > > > some newer hardware where I think they will help more. And I would
> > > > welcome your support. However, I think we'll need to get Balbir and
> > > > Kamezawa-san involved to sort out the interaction with memory control
> > > > group.
> > > >
> > > > I can send you the more recent rebase that I've done. This is getting
> > > > pretty old now: 2.6.28-rc4-mmotm-081110-081117. I'll try to rebase to
> > > > the most recent mmotm [that boots on my platforms], at least so that we
> > > > can build and boot with migrate-on-fault disabled, within the next
> > > > couple of weeks.
> > > >
> > >
> > > Sounds good! Send me your latest version and I will try to reproduce the
> > > problems. Afterwards, we can try to solve them.
> > >
> >
> > Stefan:
> >
> > I've placed the last rebased version in :
> >
> > http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-081110/
> >
> > As I recall, this version DOES bug out because of reference count
> > problems due to interaction with the memory controller.
> >
> Please report precisely if memcg has a bug.
> An example test case is welcome.

Not a memcg bug. Just an implementation choice [2-phase migration
handling: start/end calls] that is problematic for "lazy" page
migration--i.e., "migration-on-touch" in the fault path. I'd be
interested in your opinion on the feasibility of transferring the
"charge" against the page--including the "try charge" from
do_swap_page()--in migrate_page_copy() along with other page state.

I'll try to rebase my lazy migration series to recent mmotm [as time
permits] in the "near future" and gather more info on the problem I was
having.

Regards,
Lee

2009-06-18 04:37:48

by Lee Schermerhorn

Subject: RE: [RFC PATCH 0/4]: affinity-on-next-touch

On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
> > I've placed the last rebased version in :
> >
> > http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-
> > 081110/
> >
>
> OK! I will try to reproduce the problem.

Stefan:

Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
[along with my shared policy series atop which they sit in my tree].
Patches reside in:

http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/


I did a quick test. I'm afraid the patches have suffered some "bit rot"
vis a vis mainline/mmotm over the past several months. Two possibly
related issues:

1) lazy migration doesn't seem to work. Looks like
mbind(<some-policy>+MPOL_MF_MOVE+MPOL_MF_LAZY) is not unmapping the
pages so, of course, migrate on fault won't work. I suspect the
reference count handling has changed since I last tried this. [Note one
of the patch conflicts was in the MPOL_MF_LAZY addition to the mbind
flag definitions in mempolicy.h and I may have botched the resolution
thereof.]

2) When the pages get freed on exit/unmap, they are still PageLocked()
and free_pages_check()/bad_page() bugs out with bad page state.

Note: This is independent of memcg--i.e., happens whether or not memcg
configured.

To test this, I created a test cpuset with all nodes/mems/cpus and
enabled migrate_on_fault therein. I then ran an interactive "memtoy"
session there [shown below]. Memtoy is a program I use for ad hoc
testing of various mm features. You can find the latest version [almost
always] at:

http://free.linux.hp.com/~lts/Tools/memtoy-latest.tar.gz

You'll need the numactl-devel package to build--an older one with the V1
api, I think. I need to upgrade it to latest libnuma.

The same directory [Tools] contains a tarball of simple cpuset scripts
to make, query, modify, "enter" and run commands in cpusets. There may
be other versions of such scripts around. If you don't already have
any, feel free to grab them.

Since you've expressed interest in this [as has Kamezawa-san], I'll try
to pay some attention to debugging the patches in my copious spare time.
And, I'd be very interested in anything you discover in your
investigations.

Regards,
Lee

Memtoy-0.19c [for latest MPOL_MF flags defs]:

!!! lines are my annotations:

memtoy pid: 4222
memtoy>mems
mems allowed = 0-3
mems policy = 0-3
memtoy>cpus
cpu affinity mask/ids: 0-7
memtoy>anon a1 8p
memtoy>map a1
memtoy>mbind a1 pref 1
memtoy>touch a1 w
memtoy: touched 8 pages in 0.000 secs
memtoy>where a1
a 0x00007f51ae757000 0x000000008000 0x000000000000 rw- default a1
page offset +00 +01 +02 +03 +04 +05 +06 +07
0: 1 1 1 1 1 1 1 1
memtoy>mbind a1 pref+move 2
memtoy: migration of a1 [8 pages] took 0.000secs.

memtoy>where a1
a 0x00007f51ae757000 0x000000008000 0x000000000000 rw- default a1
page offset +00 +01 +02 +03 +04 +05 +06 +07
0: 2 2 2 2 2 2 2 2

!!! direct migration [still] works! Try lazy:

memtoy>mbind a1 pref+move+lazy 3
memtoy: unmap of a1 [8 pages] took 0.000secs.
memtoy>where a1

!!! "where" command uses get_mempolicy() w/ MPOL_ADDR|MPOL_NODE flags to
fetch page location. Will call get_user_pages() and refault pages.
Should migrate to node 3, but:

a 0x00007f51ae757000 0x000000008000 0x000000000000 rw- default a1
page offset +00 +01 +02 +03 +04 +05 +06 +07
0: 2 2 2 2 2 2 2 2
!!! didn't move
memtoy>exit


On console I see, for each of 8 pages of segment a1:

BUG: Bad page state in process memtoy pfn:67515f
page:ffffea001699ccc8 flags:0a0000000010001d count:0 mapcount:0
mapping:(null) index:7f51ae75e
Pid: 4222, comm: memtoy Not tainted 2.6.30-mmotm-090612-1220+spol+lpm #6
Call Trace:
[<ffffffff810a787a>] bad_page+0xaa/0x130
[<ffffffff810a8719>] free_hot_cold_page+0x199/0x1d0
[<ffffffff810a8774>] __pagevec_free+0x24/0x30
[<ffffffff810ac96a>] release_pages+0x1ca/0x210
[<ffffffff810c8b7d>] free_pages_and_swap_cache+0x8d/0xb0
[<ffffffff810c0505>] exit_mmap+0x145/0x160
[<ffffffff81044177>] mmput+0x47/0xa0
[<ffffffff81048854>] exit_mm+0xf4/0x130
[<ffffffff81049c58>] do_exit+0x188/0x810
[<ffffffff81337194>] ? do_page_fault+0x184/0x310
[<ffffffff8104a31e>] do_group_exit+0x3e/0xa0
[<ffffffff8104a392>] sys_exit_group+0x12/0x20
[<ffffffff8100bd2b>] system_call_fastpath+0x16/0x1b


Page flags 0x10001d: locked, referenced, uptodate, dirty, swapbacked.
'locked' is the bad state.

2009-06-18 19:04:56

by Lee Schermerhorn

Subject: RE: [RFC PATCH 0/4]: affinity-on-next-touch

On Thu, 2009-06-18 at 00:37 -0400, Lee Schermerhorn wrote:
> On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
> > > I've placed the last rebased version in :
> > >
> > > http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-
> > > 081110/
> > >
> >
> > OK! I will try to reproduce the problem.
>
> Stefan:
>
> Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
> [along with my shared policy series atop which they sit in my tree].
> Patches reside in:
>
> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
>

I have updated the migrate-on-fault tarball in the above location to fix
part of the problems I was seeing. See below.

>
> I did a quick test. I'm afraid the patches have suffered some "bit rot"
> vis a vis mainline/mmotm over the past several months. Two possibly
> related issues:
>
> 1) lazy migration doesn't seem to work. Looks like
> mbind(<some-policy>+MPOL_MF_MOVE+MPOL_MF_LAZY) is not unmapping the
> pages so, of course, migrate on fault won't work. I suspect the
> reference count handling has changed since I last tried this. [Note one
> of the patch conflicts was in the MPOL_MF_LAZY addition to the mbind
> flag definitions in mempolicy.h and I may have botched the resolution
> thereof.]
>
> 2) When the pages get freed on exit/unmap, they are still PageLocked()
> and free_pages_check()/bad_page() bugs out with bad page state.
>
> Note: This is independent of memcg--i.e., happens whether or not memcg
> configured.
>
<snip>

OK. Found time to look at this. Turns out I hadn't tested since
trylock_page() was introduced. I did a one-for-one replacement of the
old API [TestSetPageLocked()], not noticing that the sense of the return
was inverted. Thus, I was bailing out of the migrate_pages_unmap_only()
loop with the page locked, thinking someone else had locked it and would
take care of it. Since the page wasn't unmapped from the page table[s],
of course it wouldn't migrate on fault--wouldn't even fault!

Fixed this.
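
For anyone else tripping over the same change, here is a userspace
analogue of the inverted return sense (not kernel code; the two helpers
below only mimic the old and new page-lock APIs):

/* locksense.c: old TestSetPageLocked() returned nonzero if the page
 * was ALREADY locked (caller must bail); new trylock_page() returns
 * nonzero if the lock was ACQUIRED (caller may proceed). */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_flag page_lock = ATOMIC_FLAG_INIT;

static bool TestSetPageLocked(void)     /* old: true == already locked */
{
        return atomic_flag_test_and_set(&page_lock);
}

static bool trylock_page(void)          /* new: true == we got the lock */
{
        return !atomic_flag_test_and_set(&page_lock);
}

int main(void)
{
        /* old idiom:   if (TestSetPageLocked(page)) continue;
         * new idiom:   if (!trylock_page(page))     continue;
         * keeping the old shape with the new call skips exactly the
         * pages that were just locked -- the bug described above */
        if (trylock_page())
                printf("lock acquired: safe to unmap this page\n");
        (void)TestSetPageLocked;
        return 0;
}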

Now: lazy migration works w/ or w/o memcg configured, but NOT with the
swap resource controller configured. I'll look at that as time permits.

Lee

2009-06-19 15:27:18

by Lee Schermerhorn

Subject: RE: [RFC PATCH 0/4]: affinity-on-next-touch

On Thu, 2009-06-18 at 15:04 -0400, Lee Schermerhorn wrote:
> On Thu, 2009-06-18 at 00:37 -0400, Lee Schermerhorn wrote:
> > On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
> > > > I've placed the last rebased version in :
> > > >
> > > > http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-
> > > > 081110/
> > > >
> > >
> > > OK! I will try to reproduce the problem.
> >
> > Stefan:
> >
> > Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
> > [along with my shared policy series atop which they sit in my tree].
> > Patches reside in:
> >
> > http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
> >
>
> I have updated the migrate-on-fault tarball in the above location to fix
> part of the problems I was seeing. See below.
>
> >
> > I did a quick test. I'm afraid the patches have suffered some "bit rot"
> > vis a vis mainline/mmotm over the past several months. Two possibly
> > related issues:
> >
> > 1) lazy migration doesn't seem to work. Looks like
> > mbind(<some-policy>+MPOL_MF_MOVE+MPOL_MF_LAZY) is not unmapping the
> > pages so, of course, migrate on fault won't work. I suspect the
> > reference count handling has changed since I last tried this. [Note one
> > of the patch conflicts was in the MPOL_MF_LAZY addition to the mbind
> > flag definitions in mempolicy.h and I may have botched the resolution
> > thereof.]
> >
> > 2) When the pages get freed on exit/unmap, they are still PageLocked()
> > and free_pages_check()/bad_page() bugs out with bad page state.
> >
> > Note: This is independent of memcg--i.e., happens whether or not memcg
> > configured.
> >
> <snip>
>
> OK. Found time to look at this. Turns out I hadn't tested since
> trylock_page() was introduced. I did a one-for-one replacement of the
> old API [TestSetPageLocked()], not noticing that the sense of the return
> was inverted. Thus, I was bailing out of the migrate_pages_unmap_only()
> loop with the page locked, thinking someone else had locked it and would
> take care of it. Since the page wasn't unmapped from the page table[s],
> of course it wouldn't migrate on fault--wouldn't even fault!
>
> Fixed this.
>
> Now: lazy migration works w/ or w/o memcg configured, but NOT with the
> swap resource controller configured. I'll look at that as time permits.

Update: I now can't reproduce the lazy migration failure with the swap
resource controller configured. Perhaps I had booted the wrong kernel
for the test reported above. Now the updated patch series mentioned
above seems to be working with both memory and swap resource controllers
configured for simple memtoy driven lazy migration.

Lee

2009-06-19 15:41:24

by Balbir Singh

Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch

* Lee Schermerhorn <[email protected]> [2009-06-19 11:26:53]:

> On Thu, 2009-06-18 at 15:04 -0400, Lee Schermerhorn wrote:
> > On Thu, 2009-06-18 at 00:37 -0400, Lee Schermerhorn wrote:
> > > On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
> > > > > I've placed the last rebased version in :
> > > > >
> > > > > http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-
> > > > > 081110/
> > > > >
> > > >
> > > > OK! I will try to reproduce the problem.
> > >
> > > Stefan:
> > >
> > > Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
> > > [along with my shared policy series atop which they sit in my tree].
> > > Patches reside in:
> > >
> > > http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
> > >
> >
> > I have updated the migrate-on-fault tarball in the above location to fix
> > part of the problems I was seeing. See below.
> >
> > >
> > > I did a quick test. I'm afraid the patches have suffered some "bit rot"
> > > vis a vis mainline/mmotm over the past several months. Two possibly
> > > related issues:
> > >
> > > 1) lazy migration doesn't seem to work. Looks like
> > > mbind(<some-policy>+MPOL_MF_MOVE+MPOL_MF_LAZY) is not unmapping the
> > > pages so, of course, migrate on fault won't work. I suspect the
> > > reference count handling has changed since I last tried this. [Note one
> > > of the patch conflicts was in the MPOL_MF_LAZY addition to the mbind
> > > flag definitions in mempolicy.h and I may have botched the resolution
> > > thereof.]
> > >
> > > 2) When the pages get freed on exit/unmap, they are still PageLocked()
> > > and free_pages_check()/bad_page() bugs out with bad page state.
> > >
> > > Note: This is independent of memcg--i.e., happens whether or not memcg
> > > configured.
> > >
> > <snip>
> >
> > OK. Found time to look at this. Turns out I hadn't tested since
> > trylock_page() was introduced. I did a one-for-one replacement of the
> > old API [TestSetPageLocked()], not noticing that the sense of the return
> > was inverted. Thus, I was bailing out of the migrate_pages_unmap_only()
> > loop with the page locked, thinking someone else had locked it and would
> > take care of it. Since the page wasn't unmapped from the page table[s],
> > of course it wouldn't migrate on fault--wouldn't even fault!
> >
> > Fixed this.
> >
> > Now: lazy migration works w/ or w/o memcg configured, but NOT with the
> > swap resource controller configured. I'll look at that as time permits.
>
> Update: I now can't reproduce the lazy migration failure with the swap
> resource controller configured. Perhaps I had booted the wrong kernel
> for the test reported above. Now the updated patch series mentioned
> above seems to be working with both memory and swap resource controllers
> configured for simple memtoy driven lazy migration.

Excellent, I presume that you are using the latest mmotm or mainline.
We've had some swap cache leakage fixes go in, but those are not as
serious (they can potentially cause OOM in a cgroup when the leak
occurs).


--
Balbir

2009-06-19 16:00:04

by Lee Schermerhorn

Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch

On Fri, 2009-06-19 at 21:11 +0530, Balbir Singh wrote:
> * Lee Schermerhorn <[email protected]> [2009-06-19 11:26:53]:
>
> > On Thu, 2009-06-18 at 15:04 -0400, Lee Schermerhorn wrote:
> > > On Thu, 2009-06-18 at 00:37 -0400, Lee Schermerhorn wrote:
> > > > On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
> > > > > > I've placed the last rebased version in :
> > > > > >
> > > > > > http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-
> > > > > > 081110/
> > > > > >
> > > > >
> > > > > OK! I will try to reproduce the problem.
> > > >
> > > > Stefan:
> > > >
> > > > Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
> > > > [along with my shared policy series atop which they sit in my tree].
> > > > Patches reside in:
> > > >
> > > > http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
> > > >
> > >
> > > I have updated the migrate-on-fault tarball in the above location to fix
> > > part of the problems I was seeing. See below.
> > >
> > > >
> > > > I did a quick test. I'm afraid the patches have suffered some "bit rot"
> > > > vis a vis mainline/mmotm over the past several months. Two possibly
> > > > related issues:
> > > >
> > > > 1) lazy migration doesn't seem to work. Looks like
> > > > mbind(<some-policy>+MPOL_MF_MOVE+MPOL_MF_LAZY) is not unmapping the
> > > > pages so, of course, migrate on fault won't work. I suspect the
> > > > reference count handling has changed since I last tried this. [Note one
> > > > of the patch conflicts was in the MPOL_MF_LAZY addition to the mbind
> > > > flag definitions in mempolicy.h and I may have botched the resolution
> > > > thereof.]
> > > >
> > > > 2) When the pages get freed on exit/unmap, they are still PageLocked()
> > > > and free_pages_check()/bad_page() bugs out with bad page state.
> > > >
> > > > Note: This is independent of memcg--i.e., happens whether or not memcg
> > > > configured.
> > > >
> > > <snip>
> > >
> > > OK. Found time to look at this. Turns out I hadn't tested since
> > > trylock_page() was introduced. I did a one-for-one replacement of the
> > > old API [TestSetPageLocked()], not noticing that the sense of the return
> > > was inverted. Thus, I was bailing out of the migrate_pages_unmap_only()
> > > loop with the page locked, thinking someone else had locked it and would
> > > take care of it. Since the page wasn't unmapped from the page table[s],
> > > of course it wouldn't migrate on fault--wouldn't even fault!
> > >
> > > Fixed this.
> > >
> > > Now: lazy migration works w/ or w/o memcg configured, but NOT with the
> > > swap resource controller configured. I'll look at that as time permits.
> >
> > Update: I now can't reproduce the lazy migration failure with the swap
> > resource controller configured. Perhaps I had booted the wrong kernel
> > for the test reported above. Now the updated patch series mentioned
> > above seems to be working with both memory and swap resource controllers
> > configured for simple memtoy driven lazy migration.
>
> Excellent, I presume that you are using the latest mmotm or mainline.
> We've had some swap cache leakage fixes go in, but those are not as
> serious (they can potentially cause OOM in a cgroup when the leak
> occurs).

Yes, I'm using the 12jun mmotm atop 2.6.30. I use the mmotm timestamp
in my kernel versions to show the base I'm using. E.g., see the url
above.

Lee

2009-06-19 21:19:46

by Stefan Lankes

Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch



Lee Schermerhorn wrote:
> On Thu, 2009-06-18 at 15:04 -0400, Lee Schermerhorn wrote:
>> On Thu, 2009-06-18 at 00:37 -0400, Lee Schermerhorn wrote:
>>> On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
>>>>> I've placed the last rebased version in :
>>>>>
>>>>> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-
>>>>> 081110/
>>>>>
>>>> OK! I will try to reproduce the problem.
>>> Stefan:
>>>
>>> Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
>>> [along with my shared policy series atop which they sit in my tree].
>>> Patches reside in:
>>>
>>> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
>>>
>> I have updated the migrate-on-fault tarball in the above location to fix
>> part of the problems I was seeing. See below.
>>
>>> I did a quick test. I'm afraid the patches have suffered some "bit rot"
>>> vis a vis mainline/mmotm over the past several months. Two possibly
>>> related issues:
>>>
>>> 1) lazy migration doesn't seem to work. Looks like
>>> mbind(<some-policy>+MPOL_MF_MOVE+MPOL_MF_LAZY) is not unmapping the
>>> pages so, of course, migrate on fault won't work. I suspect the
>>> reference count handling has changed since I last tried this. [Note one
>>> of the patch conflicts was in the MPOL_MF_LAZY addition to the mbind
>>> flag definitions in mempolicy.h and I may have botched the resolution
>>> thereof.]
>>>
>>> 2) When the pages get freed on exit/unmap, they are still PageLocked()
>>> and free_pages_check()/bad_page() bugs out with bad page state.
>>>
>>> Note: This is independent of memcg--i.e., happens whether or not memcg
>>> configured.
>>>
>> <snip>
>>
>> OK. Found time to look at this. Turns out I hadn't tested since
>> trylock_page() was introduced. I did a one-for-one replacement of the
>> old API [TestSetPageLocked()], not noticing that the sense of the return
>> was inverted. Thus, I was bailing out of the migrate_pages_unmap_only()
>> loop with the page locked, thinking someone else had locked it and would
>> take care of it. Since the page wasn't unmapped from the page table[s],
>> of course it wouldn't migrate on fault--wouldn't even fault!
>>
>> Fixed this.
>>
>> Now: lazy migration works w/ or w/o memcg configured, but NOT with the
>> swap resource controller configured. I'll look at that as time permits.
>
> Update: I now can't reproduce the lazy migration failure with the swap
> resource controller configured. Perhaps I had booted the wrong kernel
> for the test reported above. Now the updated patch series mentioned
> above seems to be working with both memory and swap resource controllers
> configured for simple memtoy driven lazy migration.
>

The current version of your patch works fine on my system. I tested the
patches with our test applications and got very good performance results!

Stefan

2009-06-20 07:24:08

by Brice Goglin

Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch

Lee Schermerhorn wrote:
> My patches don't have per process enablement. Rather, I chose to use
> per cpuset enablement. I view cpusets as sort of "numa control groups"
> and thought this was an appropriate level at which to control this sort
> of behavior--analogous to memory_spread_{page|slab}. That probably
> needs to be discussed more widely, tho'.
>

Could you explain why you actually want to enable/disable
migrate-on-fault on a cpuset (or process) basis? Why would an
administrator want to disable it? Aren't the existing cpuset memory
restriction abilities enough?

Brice

2009-06-22 12:33:50

by Brice Goglin

Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch

Lee Schermerhorn wrote:
> On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
>
>>> I've placed the last rebased version in :
>>>
>>> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-
>>> 081110/
>>>
>>>
>> OK! I will try to reproduce the problem.
>>
>
> Stefan:
>
> Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
> [along with my shared policy series atop which they sit in my tree].
> Patches reside in:
>
> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
>
>

I gave this patchset a try and indeed it seems to work fine, thanks a
lot. But the migration performance isn't very good. I am seeing about
540MB/s when doing mbind+touch_all_pages on large buffers on a
quad-Barcelona machine. move_pages gets 640MB/s there. And my own
next-touch implementation was near 800MB/s in the past.
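
The measurement is of roughly the following kind (a sketch; node numbers
and buffer size are arbitrary assumptions, build with gcc -lnuma):

/* mig-bw.c: direct move_pages() bandwidth measurement */
#include <numa.h>       /* numa_alloc_onnode() */
#include <numaif.h>     /* move_pages(), MPOL_MF_MOVE */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        long psz = sysconf(_SC_PAGESIZE);
        unsigned long count = (256UL << 20) / psz;      /* 256 MB */
        char *buf = numa_alloc_onnode(count * psz, 0);  /* start on node 0 */
        void **pages = malloc(count * sizeof(*pages));
        int *nodes = malloc(count * sizeof(*nodes));
        int *status = malloc(count * sizeof(*status));
        struct timespec t0, t1;
        unsigned long i;
        double secs;

        for (i = 0; i < count; i++) {
                buf[i * psz] = 1;               /* fault in on node 0 */
                pages[i] = buf + i * psz;
                nodes[i] = 1;                   /* migrate all to node 1 */
        }

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (move_pages(0, count, pages, nodes, status, MPOL_MF_MOVE))
                perror("move_pages");
        clock_gettime(CLOCK_MONOTONIC, &t1);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.0f MB/s\n", count * psz / secs / (1 << 20));
        return 0;
}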

I wonder if there is a more general migration performance degradation in
the latest Linus git. move_pages performance was supposed to increase by
15% (to more than 700MB/s) thanks to commit dfa33d45, but I don't seem to
see the improvement with git or mmotm. Also, migrate_pages performance
seems to have decreased, but that regression might be older than 2.6.30. I
need to find some time to git bisect all this; otherwise it's hard to
compare the performance of your migrate-on-fault with other older
implementations :)

When do you plan to actually submit all your patches for inclusion?

thanks,
Brice

2009-06-22 13:50:17

by Lee Schermerhorn

Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch

On Sat, 2009-06-20 at 09:24 +0200, Brice Goglin wrote:
> Lee Schermerhorn wrote:
> > My patches don't have per process enablement. Rather, I chose to use
> > per cpuset enablement. I view cpusets as sort of "numa control groups"
> > and thought this was an appropriate level at which to control this sort
> > of behavior--analogous to memory_spread_{page|slab}. That probably
> > needs to be discussed more widely, tho'.
> >
>
> Could you explain why you actually want to enable/disable
> migrate-on-fault on a cpuset (or process) basis? Why would an
> administrator want to disable it? Aren't the existing cpuset memory
> restriction abilities enough?
>
> Brice
>

Hello, Brice:

There are a couple of aspects to this question, I think?

1) why enable/disable at all? why not always enabled?

When I try out some new behavior such as migrate-on-fault, I start with
the assumption [right or wrong] that not all users will want this
behavior. For migrate-on-fault, one probably won't run into it all that
often unless the MPOL_MF_LAZY flag is used to forcibly unmap regions.
However, with swap read-ahead, one could end up with anon pages in the
swap cache with no pte references, and could experience unexpected
migrations. I've learned that some folks really don't like
surprises :). Now, when you consider the "automigration" feature
["auto" here means "self" more than "automatic"], I think it's more
important to be able to enable/disable it. I've not seen any
performance degradation when using it, but I feared that for some
workloads, thrashing could cause such degradation. Page migration isn't
free.

Also, because Linux runs on such a wide range of platforms, I don't want
to burden smaller, embedded systems with the additional code, so I also
try to make the feature source configurable. I know we worry about the
proliferation of config options, but it's easier to remove one after the
fact, I think, than to retrofit it.

2) Why a per cpuset control?

I consider cpusets to be "numa control groups". They constrain
resources on a numa node [and related cpus] granularity, and control
numa related behavior, such as migration when changing cpusets,
spreading page cache and slab pages over nodes in the cpuset, ... In
fact, I think it would have been appropriate to call the cpuset control
group the "numa control group" when cgroups were introduced, but it's
too late for that now.

Finally, and not a reason to include the controls in the mainline, it's
REALLY useful during development. One can boot a test kernel, and only
enable the feature in a test cpuset, limiting the damage of, e.g., a
reference counting bug or such. It's also useful for measuring the
overhead of the patches absent any actual page migrations. However, if
this feature ever makes it to mainline, the community will have its say
on whether these controls should be included and how.

Hope this helps,
Lee


2009-06-22 14:24:20

by Lee Schermerhorn

Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch

On Mon, 2009-06-22 at 14:34 +0200, Brice Goglin wrote:
> Lee Schermerhorn wrote:
> > On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
> >
> >>> I've placed the last rebased version in :
> >>>
> >>> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-
> >>> 081110/
> >>>
> >>>
> >> OK! I will try to reconstruct the problem.
> >>
> >
> > Stefan:
> >
> > Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
> > [along with my shared policy series atop which they sit in my tree].
> > Patches reside in:
> >
> > http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
> >
> >
>
> I gave this patchset a try and indeed it seems to work fine, thanks a
> lot. But the migration performance isn't very good. I am seeing about
> 540MB/s when doing mbind+touch_all_pages on large buffers on a
> quad-Barcelona machine. move_pages gets 640MB/s there. And my own
> next-touch implementation was near 800MB/s in the past.

Interesting. Do you have any idea where the differences come from? Are
you comparing them on the same kernel versions? I don't know the
details of your implementation, but one possible area is the check for
"misplacement". When migrate-on-fault is enabled, I check all pages
with page_mapcount() == 0 for misplacement in the [swap page] fault
path. That, and other filtering to eliminate unnecessary migrations
could cause extra overhead.

Aside: currently, my implementation could migrate a page, only to find
that it will be replaced by a new page due to copy-on-write. I have on
my list to check write access and whether we can reuse the swap page and
avoid the migration if we're going to COW later anyway. This could
improve performance for write accesses, if the snoop traffic doesn't
overshadow any such improvement.

>
> I wonder if there is a more general migration performance degradation in
> the latest Linus git. move_pages performance was supposed to increase by
> 15% (to more than 700MB/s) thanks to commit dfa33d45, but I don't seem to
> see the improvement with git or mmotm. Also, migrate_pages performance
> seems to have decreased, but that regression might be older than 2.6.30. I
> need to find some time to git bisect all this; otherwise it's hard to
> compare the performance of your migrate-on-fault with other older
> implementations :)

Confession: I've not measured migration performance directly. Rather,
I've only observed how applications/benchmarks perform with
migrate-on-fault+automigration enabled. On the platforms available to
me back when I was actively working on this, I did see improvements in
real and user time due to improved locality, especially under heavy load
when interconnect bandwidth is at a premium. Of course, system time
increased because of the migration overheads.

>
> When do you plan to actually submit all your patches for inclusion?

I had/have no immediate plans. I held off on these series while other
mm features--reclaim scalability, memory control groups, ...--seemed
higher priority, and the churn in mm made it difficult to keep these
patches up to date. Now that the patches seem to be working again, I
plan to test them on newer platforms with more "interesting" numa
topologies. If they work well there, and with your interest and
cooperation, perhaps we can try again with some variant or combination
of our approaches.

Regards,
Lee

2009-06-22 14:32:25

by Stefan Lankes

Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch



Brice Goglin wrote:
> Lee Schermerhorn wrote:
>> On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
>>
>>
>> Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
>> [along with my shared policy series atop which they sit in my tree].
>> Patches reside in:
>>
>> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
>>
>>
>
> I gave this patchset a try and indeed it seems to work fine, thanks a
> lot. But the migration performance isn't very good. I am seeing about
> 540MB/s when doing mbind+touch_all_pages on large buffers on a
> quad-Barcelona machine. move_pages gets 640MB/s there. And my own
> next-touch implementation was near 800MB/s in the past.

I used a modified stream benchmark to evaluate the performance of Lee's
version of the next-touch implementation and mine. In this low-level
benchmark, Lee's patch performs better than mine. I think that Brice and I
use the same technique to realize affinity-on-next-touch. Did you use a
different kernel version to evaluate the performance?

Stefan

2009-06-22 14:56:58

by Lee Schermerhorn

Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch

On Mon, 2009-06-22 at 16:32 +0200, Stefan Lankes wrote:
>
> Brice Goglin wrote:
> > Lee Schermerhorn wrote:
> >> On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
> >>
> >>
> >> Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
> >> [along with my shared policy series atop which they sit in my tree].
> >> Patches reside in:
> >>
> >> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
> >>
> >>
> >
> > I gave this patchset a try and indeed it seems to work fine, thanks a
> > lot. But the migration performance isn't very good. I am seeing about
> > 540MB/s when doing mbind+touch_all_pages on large buffers on a
> > quad-Barcelona machine. move_pages gets 640MB/s there. And my own
> > next-touch implementation was near 800MB/s in the past.
>
> I used a modified stream benchmark to evaluate the performance of Lee's
> version of the next-touch implementation and mine. In this low-level
> benchmark, Lee's patch performs better than mine. I think that Brice and I
> use the same technique to realize affinity-on-next-touch. Did you use a
> different kernel version to evaluate the performance?

Hi, Stefan:

I also used a [modified!] stream benchmark to test my patches. One of
the modifications was to dump the time it takes for one pass over the
data arrays to a specific file descriptor, if that file descriptor was
open at start time--e.g., via something like "4>stream_times". Then, I
increased the number of iterations to something large so that I could
run other tests during the stream run. I plotted the "time per
iteration" vs iteration number and could see that after any transient
load, the stream benchmark returned to a good [not sure if maximal]
locality state. The time per iteration was comparable to that of
hand-affinitized threads. Without automigration and hand
affinitization, any transient load would scramble the location of the
threads relative to the data region they were operating on due to load
balancing. The more nodes you have, the less likely you'll end up in a
good state.
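
In sketch form, that modification looks like the following (names are
illustrative, not taken from the actual modified stream sources; run as
./stream 4>stream_times):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <time.h>

#define TIMES_FD 4

static int times_fd_open;

static double now(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void stream_pass(void)
{
        /* stand-in for one copy/scale/add/triad pass over the arrays */
}

int main(void)
{
        int iter;

        /* fd 4 is "open at start time" iff the shell redirected it */
        times_fd_open = (fcntl(TIMES_FD, F_GETFD) != -1);

        for (iter = 0; iter < 1000; iter++) {   /* "something large" */
                double t = now();
                stream_pass();
                if (times_fd_open)
                        dprintf(TIMES_FD, "%d %.6f\n", iter, now() - t);
        }
        return 0;
}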

I was using a parallel kernel make [-j <2*nr_cpus>] as the load. In
addition to the stream returning to good locality, I noticed that the
kernel build completed much faster in the presence of the stream load
with automigration enabled. I reported these results in a presentation
at LCA'07. Slides and video [yuck! :)] are available online at the
LCA'07 site.

Regards,
Lee

2009-06-22 15:28:18

by Brice Goglin

Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch

Lee Schermerhorn wrote:
>> I gave this patchset a try and indeed it seems to work fine, thanks a
>> lot. But the migration performance isn't very good. I am seeing about
>> 540MB/s when doing mbind+touch_all_pages on large buffers on a
>> quad-barcelona machines. move_pages gets 640MB/s there. And my own
>> next-touch implementation were near 800MB/s in the past.
>>
>
> Interesting. Do you have any idea where the differences come from? Are
> you comparing them on the same kernel versions? I don't know the
> details of your implementation, but one possible area is the check for
> "misplacement". When migrate-on-fault is enabled, I check all pages
> with page_mapcount() == 0 for misplacement in the [swap page] fault
> path. That, and other filtering to eliminate unnecessary migrations
> could cause extra overhead.
>

(I'll actually talk about this at the Linux Symposium) I used 2.6.27
initially, with some 2.6.29 patches to fix the throughput of move_pages
for large buffers. So move_pages was getting about 600MB/s there. Then
my own (hacky) next-touch implementation was getting about 800MB/s. The
main difference with your code is that mine only modifies the current
process PTE without touching the other processes if the page is shared.
So my code basically only supports private pages; it duplicates/migrates
them on next-touch. I thought it was faster than move_pages because I
didn't support shared-page migration. But, I found out later that
move_pages could be further improved up to about 750MB/s (it will be in
2.6.31).

So now, I'd expect both the next-touch migration and move_pages to have
similar migration throughput, about 750-800MB/s on my quad-Barcelona
machine. Right now, I'm seeing less than that for both, so there might
be a deeper problem. Actually, looking at COW performance when the new
page is allocated on a remote numa node, I also see the throughput much
lower in 2.6.29+ (about 720MB/s) than in 2.6.27 (about 850MB/s). Maybe a
regression in the low-level page copy routine?
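
A rough sketch of that COW measurement (node numbers and buffer size are
arbitrary assumptions; build with gcc -lnuma):

/* cow-bw.c: remote-node copy-on-write throughput */
#include <numa.h>       /* numa_run_on_node() */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        size_t len = 256UL << 20;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct timespec t0, t1;
        double secs;

        if (buf == MAP_FAILED)
                return 1;
        numa_run_on_node(0);
        memset(buf, 1, len);            /* first touch: pages on node 0 */

        if (fork() == 0) {
                numa_run_on_node(1);    /* write from the remote node */
                clock_gettime(CLOCK_MONOTONIC, &t0);
                memset(buf, 2, len);    /* each write COWs one page onto
                                           the writer's local node */
                clock_gettime(CLOCK_MONOTONIC, &t1);
                secs = (t1.tv_sec - t0.tv_sec)
                     + (t1.tv_nsec - t0.tv_nsec) / 1e9;
                printf("COW copy: %.0f MB/s\n", len / secs / (1 << 20));
                _exit(0);
        }
        wait(NULL);
        return 0;
}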

Brice

2009-06-22 15:43:00

by Stefan Lankes

Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch



Lee Schermerhorn wrote:
> On Mon, 2009-06-22 at 16:32 +0200, Stefan Lankes wrote:
>> Brice Goglin wrote:
>>> Lee Schermerhorn wrote:
>>>> On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
>>>>
>>>>
>>>> Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
>>>> [along with my shared policy series atop which they sit in my tree].
>>>> Patches reside in:
>>>>
>>>> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
>>>>
>>>>
>>> I gave this patchset a try and indeed it seems to work fine, thanks a
>>> lot. But the migration performance isn't very good. I am seeing about
>>> 540MB/s when doing mbind+touch_all_pages on large buffers on a
>>> quad-Barcelona machine. move_pages gets 640MB/s there. And my own
>>> next-touch implementation was near 800MB/s in the past.
>> I used a modified stream benchmark to evaluate the performance of Lee's
>> version of the next-touch implementation and mine. In this low-level
>> benchmark, Lee's patch performs better than mine. I think that Brice and I
>> use the same technique to realize affinity-on-next-touch. Did you use a
>> different kernel version to evaluate the performance?
>
> Hi, Stefan:
>
> I also used a [modified!] stream benchmark to test my patches. One of
> the modifications was to dump the time it takes for one pass over the
> data arrays to a specific file descriptor, if that file descriptor was
> open at start time--e.g., via something like "4>stream_times". Then, I
> increased the number of iterations to something large so that I could
> run other tests during the stream run. I plotted the "time per
> iteration" vs iteration number and could see that after any transient
> load, the stream benchmark returned to a good [not sure if maximal]
> locality state. The time per iteration was comparable to that of
> hand-affinitized threads. Without automigration and hand
> affinitization, any transient load would scramble the location of the
> threads relative to the data region they were operating on due to load
> balancing. The more nodes you have, the less likely you'll end up in a
> good state.
>
> I was using a parallel kernel make [-j <2*nr_cpus>] as the load. In
> addition to the stream returning to good locality, I noticed that the
> kernel build completed much faster in the presence of the stream load
> with automigration enabled. I reported these results in a presentation
> at LCA'07. Slides and video [yuck! :)] are available online at the
> LCA'07 site.

I think that you use migration-on-fault in the context of automigration.
Brice and I use affinity-on-next-touch/migration-on-fault in another
context. If the access pattern of an application changes, we want to
redistribute the pages in a "nearly" ideal manner. Sometimes it is
difficult to determine the ideal page distribution. In such cases,
affinity-on-next-touch could be an attractive solution. In our test
applications, we add the system call at certain points to use
affinity-on-next-touch and redistribute the pages. Assuming that the next
thread uses these pages very often, we improve the performance of our
test applications.

Regards,

Stefan

2009-06-22 16:39:00

by Lee Schermerhorn

Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch

On Mon, 2009-06-22 at 17:42 +0200, Stefan Lankes wrote:
>
> Lee Schermerhorn wrote:
> > On Mon, 2009-06-22 at 16:32 +0200, Stefan Lankes wrote:
> >> Brice Goglin wrote:
> >>> Lee Schermerhorn wrote:
> >>>> On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
> >>>>
> >>>>
> >>>> Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
> >>>> [along with my shared policy series atop which they sit in my tree].
> >>>> Patches reside in:
> >>>>
> >>>> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
> >>>>
> >>>>
> >>> I gave this patchset a try and indeed it seems to work fine, thanks a
> >>> lot. But the migration performance isn't very good. I am seeing about
> >>> 540MB/s when doing mbind+touch_all_pages on large buffers on a
> >>> quad-Barcelona machine. move_pages gets 640MB/s there. And my own
> >>> next-touch implementation was near 800MB/s in the past.
> >> I used a modified stream benchmark to evaluate the performance of Lee's
> >> version of the next-touch implementation and mine. In this low-level
> >> benchmark, Lee's patch performs better than mine. I think that Brice and I
> >> use the same technique to realize affinity-on-next-touch. Did you use a
> >> different kernel version to evaluate the performance?
> >
> > Hi, Stefan:
> >
> > I also used a [modified!] stream benchmark to test my patches. One of
> > the modifications was to dump the time it takes for one pass over the
> > data arrays to a specific file descriptor, if that file descriptor was
> > open at start time--e.g., via something like "4>stream_times". Then, I
> > increased the number of iterations to something large so that I could
> > run other tests during the stream run. I plotted the "time per
> > iteration" vs iteration number and could see that after any transient
> > load, the stream benchmark returned to a good [not sure if maximal]
> > locality state. The time per iteration was comparable to hand
> > affinitization of the threads. Without automigration and hand
> > affinitization, any transient load would scramble the location of the
> > threads relative to the data region they were operating on due to load
> > balancing. The more nodes you have, the less likely you'll end up in a
> > good state.
> >
> > I was using a parallel kernel make [-j <2*nr_cpus>] as the load. In
> > addition to the stream returning to good locality, I noticed that the
> > kernel build completed much faster in the presence of the stream load
> > with automigration enabled. I reported these results in a presentation
> > at LCA'07. Slides and video [yuck! :)] are available on line at the
> > LCA'07 site.
>
> I think that you use migration-on-fault in the context of automigration.
> Brice and I use affinity-on-next-touch/migration-on-fault in another
> context. If the access pattern of an application changes, we want to
> redistribute the pages in a "nearly" ideal manner. Sometimes it is
> difficult to determine the ideal page distribution. In such cases,
> affinity-on-next-touch could be an attractive solution. In our test
> applications, we add the system call at certain points to use
> affinity-on-next-touch and redistribute the pages. Assuming that the
> next thread to touch these pages uses them very often, we improve the
> performance of our test applications.

I understand. That's one of the motivations for MPOL_MF_LAZY and the
MPOL_NOOP policy mode. It simply unmaps [removes pte refs from] the
pages, priming them for migrate on next touch, if they are "misplaced"
relative to the task touching them. It's useful for testing and that's
my personal primary use case, but I did envision it for use in
applications that know they're entering a new computation phase with
different access patterns.
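
At the user level this amounts to something like the following sketch
(MPOL_NOOP and MPOL_MF_LAZY are not in mainline headers; the values
below are taken from the patched mempolicy.h):

/* Sketch: prime a region for migrate-on-next-touch with the patched
 * mbind().  MPOL_NOOP and MPOL_MF_LAZY come from the patch series,
 * not from mainline.  Link with -lnuma. */
#include <numaif.h>

#ifndef MPOL_NOOP
#define MPOL_NOOP	4
#endif
#ifndef MPOL_MF_LAZY
#define MPOL_MF_LAZY	(1<<3)
#endif

static long prime_for_next_touch(void *addr, unsigned long len)
{
	/* Keep the current policy (MPOL_NOOP) but unmap the pages so
	 * that the next touch migrates them if they are misplaced. */
	return mbind(addr, len, MPOL_NOOP, NULL, 0,
		     MPOL_MF_MOVE | MPOL_MF_LAZY);
}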

Lee

2009-06-22 16:55:35

by Lee Schermerhorn

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch

On Mon, 2009-06-22 at 17:28 +0200, Brice Goglin wrote:
> Lee Schermerhorn wrote:
> >> I gave this patchset a try and indeed it seems to work fine, thanks a
> >> lot. But the migration performance isn't very good. I am seeing about
> >> 540MB/s when doing mbind+touch_all_pages on large buffers on a
> >> quad-barcelona machine. move_pages gets 640MB/s there. And my own
> >> next-touch implementation was near 800MB/s in the past.
> >>
> >
> > Interesting. Do you have any idea where the differences come from? Are
> > you comparing them on the same kernel versions? I don't know the
> > details of your implementation, but one possible area is the check for
> > "misplacement". When migrate-on-fault is enabled, I check all pages
> > with page_mapcount() == 0 for misplacement in the [swap page] fault
> > path. That, and other filtering to eliminate unnecessary migrations
> > could cause extra overhead.
> >
>
> (I'll actually talk about this at the Linux Symposium) I used 2.6.27
> initially, with some 2.6.29 patches to fix the throughput of move_pages
> for large buffers. So move_pages was getting about 600MB/s there. Then
> my own (hacky) next-touch implementation was getting about 800MB/s. The
> main difference with your code is that mine only modifies the current
> process PTE without touching the other processes if the page is shared.

The primary difference should be at unmap time, right? In the fault
path, I only update the pte of the faulting task. That's why I require
the [anon] pages to be in the swap cache [or something similar]. I
don't want to be fixing up other tasks' page tables in the context of
the faulting task's fault handler. If, later, another task touches the
page, it will take a minor fault and find the [possibly migrated] page
in the cache. Hmmm, I guess all tasks WILL incur the minor fault if
they touch the page after the unmap. That could be part of the
difference if you compare on the same kernel version.
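
Conceptually, the check in the fault path looks something like the
sketch below (the two helpers are made-up names for illustration; this
is not the actual patch code):

/* Conceptual sketch of the migrate-on-fault check in the swap-fault
 * path.  page_misplaced() and migrate_to_local_node() are made up. */
static struct page *maybe_migrate_on_fault(struct page *page,
					   struct vm_area_struct *vma,
					   unsigned long addr)
{
	/* Only pages with no remaining pte references are considered;
	 * any other mapper will later take a minor fault and find the
	 * [possibly migrated] page in the swap cache. */
	if (page_mapcount(page) != 0)
		return page;

	/* If the page sits on the wrong node for the faulting task's
	 * policy, migrate it before mapping it into this task. */
	if (page_misplaced(page, vma, addr))		/* made up */
		page = migrate_to_local_node(page);	/* made up */

	return page;
}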

> So my code basically only supports private pages; it duplicates/migrates
> them on next-touch. I thought it was faster than move_pages because I
> didn't support shared-page migration. But, I found out later that
> move_pages could be further improved up to about 750MB/s (it will be in
> 2.6.31).
>
> So now, I'd expect both the next-touch migration and move_pages to have
> similar migration throughput, about 750-800MB/s on my quad-barcelona
> machine. Right now, I'm seeing less than that for both, so there might
> be a problem deeper.

Try booting with cgroup_disable=memory on the command line, if you have
the memory resource controller configured in. See what that does to
your measurements.

> Actually, looking at COW performance when the new
> page is allocated on a remote numa node, I also see the throughput much
> lower in 2.6.29+ (about 720MB/s) than in 2.6.27 (about 850MB/s). Maybe a
> regression in the low-level page copy routine?

??? I would expect low level page copying to be highly optimized per
arch, and also fairly stable. Based on recent experience, I'd more
likely suspect the mm housekeeping overheads--e.g., global and per memcg
lru management, ... We've seen a lot of new code in this area in the past
few releases.

Lee

2009-06-22 17:06:00

by Brice Goglin

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch

Lee Schermerhorn wrote:
> The primary difference should be at unmap time, right? In the fault
> path, I only update the pte of the faulting task. That's why I require
> the [anon] pages to be in the swap cache [or something similar]. I
> don't want to be fixing up other tasks' page tables in the context of
> the faulting task's fault handler. If, later, another task touches the
> page, it will take a minor fault and find the [possibly migrated] page
> in the cache. Hmmm, I guess all tasks WILL incur the minor fault if
> they touch the page after the unmap. That could be part of the
> difference if you compare on the same kernel version.
>

Agreed.

> Try booting with cgroup_disable=memory on the command line, if you have
> the memory resource controller configured in. See what that does to
> your measurements.
>

It doesn't seem to help. I'll try to bisect and find where the
performance dropped.

> ??? I would expect low level page copying to be highly optimized per
> arch, and also fairly stable.

I just did a quick copy_page benchmark and didn't see any performance
difference between 2.6.27 and mmotm.
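
A user-space sketch of that kind of page-copy test (assuming libnuma;
not the exact program I ran; build with -lnuma -lrt) might look like:

/* Times 4 KiB copies from a buffer on node 0 into a buffer on node 1. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <numa.h>

#define PAGE_SZ 4096UL
#define NPAGES  65536UL			/* 256 MiB per buffer */

int main(void)
{
	char *src, *dst;
	struct timespec t0, t1;
	unsigned long i;
	double sec;

	if (numa_available() < 0)
		return 1;
	src = numa_alloc_onnode(NPAGES * PAGE_SZ, 0);
	dst = numa_alloc_onnode(NPAGES * PAGE_SZ, 1);

	memset(src, 1, NPAGES * PAGE_SZ);	/* fault the pages in */
	memset(dst, 1, NPAGES * PAGE_SZ);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NPAGES; i++)
		memcpy(dst + i * PAGE_SZ, src + i * PAGE_SZ, PAGE_SZ);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.1f MB/s\n", NPAGES * PAGE_SZ / sec / 1e6);
	return 0;
}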

thanks,
Brice

2009-06-22 17:59:46

by Stefan Lankes

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch



Brice Goglin wrote:
> Lee Schermerhorn wrote:
>> The primary difference should be at unmap time, right? In the fault
>> path, I only update the pte of the faulting task. That's why I require
>> the [anon] pages to be in the swap cache [or something similar]. I
>> don't want to be fixing up other tasks' page tables in the context of
>> the faulting task's fault handler. If, later, another task touches the
>> page, it will take a minor fault and find the [possibly migrated] page
>> in the cache. Hmmm, I guess all tasks WILL incur the minor fault if
>> they touch the page after the unmap. That could be part of the
>> difference if you compare on the same kernel version.
>>
>
> Agreed.
>
>> Try booting with cgroup_disable=memory on the command line, if you have
>> the memory resource controller configured in. See what that does to
>> your measurements.
>>
>
> It doesn't seem to help. I'll try to bisect and find where the
> performance dropped.
>

I am not able to reproduce the performance drop on my system.
Could you send me your low-level benchmark?

Stefan

2009-06-22 19:09:59

by Brice Goglin

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch

Stefan Lankes wrote:
> I am not able to reproduce the performance drop on my system.
> Could you send me your low-level benchmark?

It's attached. As you can see, it's fairly trivial. It just does several
iterations of mbind+touch_all_pages for different power-of-two buffer
sizes. Just replace mbind with madvise in the inner loop if you want to
try it with your affinity-on-next-touch.
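
The shape of the inner loop is roughly the following sketch (this is
not the attached file itself; MPOL_MF_LAZY is from the patched headers;
build with -lnuma -lrt):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <numaif.h>

#ifndef MPOL_MF_LAZY
#define MPOL_MF_LAZY	(1<<3)
#endif

#define PAGE_SZ 4096UL

int main(void)
{
	unsigned long pages, nodemask = 1;	/* node 0 */

	for (pages = 2; pages <= 262144; pages *= 2) {
		size_t len = pages * PAGE_SZ;
		struct timespec t0, t1;
		unsigned long i;
		void *mem;
		char *buf;

		posix_memalign(&mem, PAGE_SZ, len);
		buf = mem;

		/* initialization: bind the pages to node 0 and fault
		 * them in there */
		mbind(buf, len, MPOL_BIND, &nodemask,
		      8 * sizeof(nodemask), MPOL_MF_MOVE);
		for (i = 0; i < pages; i++)
			buf[i * PAGE_SZ] = 1;

		/* measured part: prime for next-touch migration, then
		 * touch every page once */
		clock_gettime(CLOCK_MONOTONIC, &t0);
		mbind(buf, len, MPOL_PREFERRED, NULL, 0,
		      MPOL_MF_MOVE | MPOL_MF_LAZY);
		for (i = 0; i < pages; i++)
			buf[i * PAGE_SZ] = 1;
		clock_gettime(CLOCK_MONOTONIC, &t1);

		printf("%lu %ld\n", pages,
		       (t1.tv_sec - t0.tv_sec) * 1000000000L +
		       (t1.tv_nsec - t0.tv_nsec));
		free(mem);
	}
	return 0;
}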

Which kernels are you using when comparing your next-touch
implementation with Lee's patchset?

Brice


Attachments:
next-touch-mof-cost.c (2.00 kB)

2009-06-22 20:16:20

by Stefan Lankes

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch



Brice Goglin wrote:
> Stefan Lankes wrote:
> >> I am not able to reproduce the performance drop on my system.
>> Could you send me your low-level benchmark?
>
> It's attached. As you can see, it's fairly trivial. It just does several
> iterations of mbind+touch_all_pages for different power-of-two buffer
> sizes. Just replace mbind with madvise in the inner loop if you want to
> try it with your affinity-on-next-touch.

I use MPOL_NOOP instead of MPOL_PREFERRED. On my system, MPOL_NOOP is
defined as 4 (-> include/linux/mempolicy.h).

By the way, do you also add Lee's "shared policy" patches? These patches
add MPOL_MF_SHARED, which is specified as 3. Afterwards, you have to
define MPOL_MF_LAZY as 4.

I got the following performance results with MPOL_NOOP:

# Nb_pages Cost(ns)
2 44539
4 44695
8 53937
16 61625
32 87757
64 135070
128 233812
256 428539
512 870476
1024 1695859
2048 3280695
4096 6450328
8192 12719187
16384 25377750
32768 50431375
65536 101970000
131072 216200500
262144 511706000

I got the following performance results with MPOL_PREFERRED:

# Nb_pages Cost(ns)
2 50742
4 58656
8 79929
16 117171
32 195304
64 354851
128 744835
256 1354476
512 2759570
1024 5433304
2048 10173390
4096 20178453
8192 36452343
16384 71077375
32768 141738000
65536 281460250
131072 576971000
262144 1231694000

> Which kernels are you using when comparing your next-touch
> implementation with Lee's patchset?
>

The current mmotm tree.

Regards,

Stefan

2009-06-22 20:34:29

by Brice Goglin

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch

Stefan Lankes wrote:
> By the way, do you also add Lee's "shared policy" patches? These
> patches add MPOL_MF_SHARED, which is specified as 3. Afterwards, you
> have to define MPOL_MF_LAZY as 4.

Yes, I applied shared-policy-* since migrate-on-fault doesn't apply
without them :)

But I have the following in include/linux/mempolicy.h after applying all
patches:
#define MPOL_MF_LAZY	(1<<3)	/* Modifies '_MOVE: lazy migrate on fault */
#define MPOL_F_SHARED	(1 << 0)	/* identify shared policies */
Where did you get your F_SHARED=3 and MF_LAZY=4?

> I got the following performance results with MPOL_NOOP:
>
> # Nb_pages Cost(ns)
> 32768 50431375
> 65536 101970000
> 131072 216200500
> 262144 511706000

Is there any migration here? Don't you just have unmap and fault without
migration? In my test program, the initialization does MPOL_BIND, so the
following MPOL_NOOP should just do nothing since the pages are already
correctly placed with regard to the previous MPOL_BIND. I feel like 2us
per page (e.g. 511706000 ns / 262144 pages, about 1.95us) looks too low
for a migration, and it's also very high for just unmap and fault-in.

> I got the following performance results with MPOL_PREFERRED:
>
> # Nb_pages Cost(ns)
> 32768 141738000

That's about 60% faster than on my machine (quad-barcelona 8347HE
1.9GHz). What machine are you running on?

Brice