I received a question from a customer that was trying to move pages via
the mbind system call. In this specific case, the system had two nodes
and all pages in the range were already present on node 0. They then
called mbind with mode MPOL_INTERLEAVE and the MPOL_MF_MOVE_ALL flag. Their
expectation was that half the pages in the range would be moved to node 1
in an interleaved pattern.

In the above situation, no pages actually get moved. This is because mbind
creates a list of pages to be moved via:

	ret = queue_pages_range(mm, start, end, nmask,
				flags | MPOL_MF_INVERT, &pagelist);

No page will be added to the list as queue_folio_required is called for each
page to determine if it resides within the set of nodes. And, all pages are
within the set.

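To illustrate why the list ends up empty, here is a minimal user-space
model of that filter. It is a sketch, not the kernel code. It assumes the
check queues a page when its membership in nmask equals
!(flags & MPOL_MF_INVERT), which is how I read queue_folio_required. The
model_* names, the flag value, and the plain bitmask standing in for
nodemask_t are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative value only; the real flag is internal to the kernel. */
#define MODEL_MF_INVERT (1u << 5)
#define MODEL_MAX_NODES ((int)(sizeof(unsigned long) * 8))

/*
 * Assumed per-page check: without INVERT, queue pages that are in nmask;
 * with INVERT, queue pages that are NOT in nmask.
 */
bool model_queue_required(int nid, unsigned long nmask, unsigned int flags)
{
	bool in_mask = (nmask >> nid) & 1UL;

	return in_mask == !(flags & MODEL_MF_INVERT);
}

/* Count how many pages would be put on the migration list. */
size_t model_count_queued(const int *page_nid, size_t npages,
			  unsigned long nmask, unsigned int flags)
{
	size_t queued = 0;

	for (size_t i = 0; i < npages; i++) {
		assert(page_nid[i] >= 0 && page_nid[i] < MODEL_MAX_NODES);
		if (model_queue_required(page_nid[i], nmask, flags))
			queued++;
	}
	return queued;
}
```

With nmask = {0, 1} and every page on node 0, model_count_queued() returns
0, so nothing is handed to migration. Only pages on a node outside the mask
(node 2, say) would be queued.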
I have reread the mbind man page several times and agree that one might
expect MPOL_INTERLEAVE with MPOL_MF_MOVE_ALL to move pages and create an
interleaved pattern. My question is should we:
- Change mbind so that pages are moved to an interleaved pattern?
- Update the documentation to be more explicit?

I can do either, but just wanted to get opinions before starting.
--
Mike Kravetz
On 5/1/23 20:58, Mike Kravetz wrote:
> I received a question from a customer that was trying to move pages via
> the mbind system call. In this specific case, the system had two nodes
> and all pages in the range were already present on node 0. They then
> called mbind with mode MPOL_INTERLEAVE and the MPOL_MF_MOVE_ALL flag. Their
> expectation was that half the pages in the range would be moved to node 1
> in an interleaved pattern.
>
> In the above situation, no pages actually get moved. This is because mbind
> creates a list of pages to be moved via:
>
> ret = queue_pages_range(mm, start, end, nmask,
> flags | MPOL_MF_INVERT, &pagelist);
>
> No page will be added to the list as queue_folio_required is called for each
> page to determine if it resides within the set of nodes. And, all pages are
> within the set.
>
> I have reread the mbind man page several times and agree that one might
> expect MPOL_INTERLEAVE with MPOL_MF_MOVE_ALL to move pages and create an
> interleaved pattern. My question is should we:
> - Change mbind so that pages are moved to an interleaved pattern?
I guess it could be worth trying, if there's a use case. And hope nobody
else is depending on the current behavior and will complain afterwards :)
> - Update the documentation to be more explicit?
>
> I can do either, but just wanted to get opinions before starting.
On Tue 02-05-23 09:45:40, Vlastimil Babka wrote:
> On 5/1/23 20:58, Mike Kravetz wrote:
> > I received a question from a customer that was trying to move pages via
> > the mbind system call. In this specific case, the system had two nodes
> > and all pages in the range were already present on node 0. They then
> > called mbind with mode MPOL_INTERLEAVE and the MPOL_MF_MOVE_ALL flag. Their
> > expectation was that half the pages in the range would be moved to node 1
> > in an interleaved pattern.
> >
> > In the above situation, no pages actually get moved. This is because mbind
> > creates a list of pages to be moved via:
> >
> > ret = queue_pages_range(mm, start, end, nmask,
> > flags | MPOL_MF_INVERT, &pagelist);
> >
> > No page will be added to the list as queue_folio_required is called for each
> > page to determine if it resides within the set of nodes. And, all pages are
> > within the set.
> >
> > I have reread the mbind man page several times and agree that one might
> > expect MPOL_INTERLEAVE with MPOL_MF_MOVE_ALL to move pages and create an
> > interleaved pattern. My question is should we:
> > - Change mbind so that pages are moved to an interleaved pattern?
>
> I guess it could be worth trying, if there's a use case. And hope nobody
> else is depending on the current behavior and will complain afterwards :)
I am not sure this is worth it wrt. complexity. Essentially it would
require building up the distribution for the whole range first, so two
passes. Also it could become more tricky if the final node mask has
nodes of different distances (it would be a reasonable expectation to
distribute with the minimum total distance, right ;)).
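To make the two-pass point concrete, here is a rough user-space sketch of
what the first (planning) pass could look like. It assumes a naive
round-robin by page index and ignores node distances. plan_interleave()
and its parameters are hypothetical, not anything that exists in the
kernel.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pass 1: decide a target node for every page in the range.
 * Page i is assigned nodes[i % nnodes]. Returns how many pages are not
 * already on their target node, i.e. how many a second pass would have
 * to migrate.
 */
size_t plan_interleave(const int *cur_nid, int *target_nid, size_t npages,
		       const int *nodes, size_t nnodes)
{
	size_t to_move = 0;

	assert(nnodes > 0);
	for (size_t i = 0; i < npages; i++) {
		target_nid[i] = nodes[i % nnodes];
		if (cur_nid[i] != target_nid[i])
			to_move++;
	}
	return to_move;
}
```

For eight pages all on node 0 and nodes = {0, 1}, this marks four pages
for migration. It does not try to minimize distances at all, which is
exactly the part that would get tricky.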
> > - Update the documentation to be more explicit?
Yes, please. While this sounds like a neat feature, I think the
additional complexity is likely not worth it. A strong use case
might make a difference though.
--
Michal Hocko
SUSE Labs
On 05/02/23 15:12, Michal Hocko wrote:
> On Tue 02-05-23 09:45:40, Vlastimil Babka wrote:
> > On 5/1/23 20:58, Mike Kravetz wrote:
> > > I received a question from a customer that was trying to move pages via
> > > the mbind system call. In this specific case, the system had two nodes
> > > and all pages in the range were already present on node 0. They then
> > > called mbind with mode MPOL_INTERLEAVE and the MPOL_MF_MOVE_ALL flag. Their
> > > expectation was that half the pages in the range would be moved to node 1
> > > in an interleaved pattern.
> > >
> > > In the above situation, no pages actually get moved. This is because mbind
> > > creates a list of pages to be moved via:
> > >
> > > ret = queue_pages_range(mm, start, end, nmask,
> > > flags | MPOL_MF_INVERT, &pagelist);
> > >
> > > No page will be added to the list as queue_folio_required is called for each
> > > page to determine if it resides within the set of nodes. And, all pages are
> > > within the set.
> > >
> > > I have reread the mbind man page several times and agree that one might
> > > expect MPOL_INTERLEAVE with MPOL_MF_MOVE_ALL to move pages and create an
> > > interleaved pattern. My question is should we:
> > > - Change mbind so that pages are moved to an interleaved pattern?
> >
> > I guess it could be worth trying, if there's a use case. And hope nobody
> > else is depending on the current behavior and will complain afterwards :)
>
> I am not sure this is worth it wrt. complexity. Essentially it would
> require building up the distribution for the whole range first, so two
> passes. Also it could become more tricky if the final node mask has
> nodes of different distances (it would be a reasonable expectation to
> distribute with the minimum total distance, right ;)).
Yes, I was worried about the complexity of such a change. At a high
level, interleave sounds easy. But, like most things, the details
could add a bunch of complexity.
> > > - Update the documentation to be more explicit?
>
> Yes, please. While this sounds like a neat feature, I think the
> additional complexity is likely not worth it. A strong use case
> might make a difference though.
Well, this user has a workaround. They simply make sure to set the
policy of this area (a shared memory segment) before populating. And,
I don't think they would really be happy with the cost of potentially
migrating hundreds of GB of data.
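For reference, the workaround looks roughly like the sketch below. It
calls mbind through syscall(2) so it does not depend on libnuma.
make_nodemask() and interleave_then_populate() are hypothetical helper
names, and SKETCH_MPOL_INTERLEAVE is assumed to mirror the value of
MPOL_INTERLEAVE in <linux/mempolicy.h>.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SKETCH_MPOL_INTERLEAVE 3	/* assumed to match MPOL_INTERLEAVE */

/* Build a single-word node mask from a list of node ids. */
unsigned long make_nodemask(const int *nodes, size_t n)
{
	unsigned long mask = 0;

	for (size_t i = 0; i < n; i++) {
		/* maxnode below covers one bit less than the word size */
		assert(nodes[i] >= 0 && nodes[i] < (int)(8 * sizeof(mask)) - 1);
		mask |= 1UL << nodes[i];
	}
	return mask;
}

/*
 * Set the interleave policy on [addr, addr + len) first, then touch the
 * memory so pages are allocated under that policy. addr must be page
 * aligned and not yet populated. Returns 0 on success, -1 with errno set
 * if mbind fails.
 */
int interleave_then_populate(void *addr, size_t len, unsigned long mask)
{
	/* flags == 0: only install the policy, do not try to move pages */
	if (syscall(SYS_mbind, addr, len, SKETCH_MPOL_INTERLEAVE,
		    &mask, 8 * sizeof(mask), 0UL) != 0)
		return -1;

	memset(addr, 0, len);	/* first touch allocates the pages */
	return 0;
}
```

The caller would map or attach the segment, call
interleave_then_populate() with the mask for nodes 0 and 1, and only then
start using the memory.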
I'll send out a documentation update.
--
Mike Kravetz