Mel,
Russ Anderson recently introduced a patch into ia64 that changes MCA
behavior. When the MCA is caused by a user reference to a user's memory,
we put an extra reference on the page and kill the user. This leaves
the working memory available for other jobs while causing a leak of the
bad page.
I don't know if Russ has done any testing with hugetlbfs pages. I preface
the remainder of my comments with a huge "I don't know anything"
disclaimer.
With the new hugepages concept, would it be possible to only mark
the default pagesize portion of a hugepage as bad and then return the
remainder of the hugepage for normal use? What would we basically need
to do to accomplish this? Are there patches in the community which we
should wait to see how they progress before we do any work on this front?
Thanks,
Robin Holt
On Wed, 16 Nov 2005, Robin Holt wrote:
> Russ Anderson recently introduced a patch into ia64 that changes MCA
> behavior. When the MCA is caused by a user reference to a user's memory,
> we put an extra reference on the page and kill the user. This leaves
> the working memory available for other jobs while causing a leak of the
> bad page.
>
> I don't know if Russ has done any testing with hugetlbfs pages. I preface
> the remainder of my comments with a huge "I don't know anything"
> disclaimer.
>
> With the new hugepages concept, would it be possible to only mark
> the default pagesize portion of a hugepage as bad and then return the
> remainder of the hugepage for normal use? What would we basically need
> to do to accomplish this? Are there patches in the community which we
> should wait to see how they progress before we do any work on this front?
On IA64 we have one PTE for a huge page in a different region, so we
cannot unmap a page sized section. Other architectures may have PTEs for
each page sized section of a huge page. For those it may make sense
(but then the management of the page is done via the first struct page,
which likely results in some challenging VM issues).
On Wed, 16 Nov 2005, Robin Holt wrote:
> Russ Anderson recently introduced a patch into ia64 that changes MCA
> behavior. When the MCA is caused by a user reference to a user's memory,
> we put an extra reference on the page and kill the user. This leaves
> the working memory available for other jobs while causing a leak of the
> bad page.
>
> I don't know if Russ has done any testing with hugetlbfs pages. I preface
> the remainder of my comments with a huge "I don't know anything"
> disclaimer.
>
Right now, I am not much of an improvement.
> With the new hugepages concept, would it be possible to only mark
> the default pagesize portion of a hugepage as bad and then return the
> remainder of the hugepage for normal use?
The process is dead so the mapping is not a concern. But that huge page is
gone and is no longer useful as a huge page. So, no, it cannot be used for
normal use.
What could be done is something like the following:
1. Go to the struct page that represents the start of the huge page
2. Clear all the flags and fields. Set the count to 1
3. Call __free_pages(smallpage, 0)
Do that for every struct page within the huge page except for the one that
the MCA flagged as bad. The side-effect will be that the buddy allocator
will merge them back to the largest possible blocks and place them on the
free lists. That will give you back all the small pages at least.
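To illustrate the merging side-effect, here is a small userspace model in C.
It is a sketch only, not kernel code: HUGE_ORDER, free_order[] and the toy_*
functions are made-up names, and the "free list" is just an array recording
the order of the free block that starts at each index. It assumes a huge page
made of 16 small pages purely to keep the numbers readable.

```c
#include <assert.h>

#define HUGE_ORDER 4                    /* toy value: 16 small pages per huge page */
#define NR_SMALL   (1u << HUGE_ORDER)

/* free_order[i] is the order of the free block starting at index i, or -1. */
static int free_order[NR_SMALL];

static void toy_init(void)
{
	for (unsigned i = 0; i < NR_SMALL; i++)
		free_order[i] = -1;
}

/* Free one order-0 page and merge with free buddies as far as possible. */
static void toy_free_page(unsigned idx)
{
	int order = 0;

	while (order < HUGE_ORDER) {
		unsigned buddy = idx ^ (1u << order);

		if (free_order[buddy] != order)
			break;                  /* buddy is not a free block of this order */
		free_order[buddy] = -1;         /* take the buddy off the toy free list */
		idx &= buddy;                   /* merged block starts at the lower index */
		order++;
	}
	free_order[idx] = order;
}

/*
 * Free every small page except the bad one and return the number of
 * free blocks that remain after merging.
 */
static unsigned toy_release_huge_page(unsigned bad_idx)
{
	unsigned blocks = 0;

	toy_init();
	for (unsigned i = 0; i < NR_SMALL; i++)
		if (i != bad_idx)
			toy_free_page(i);
	for (unsigned i = 0; i < NR_SMALL; i++)
		if (free_order[i] >= 0)
			blocks++;
	return blocks;
}
```

Calling toy_release_huge_page(5) in this model leaves four free blocks: an
order-2 block at index 0, an order-0 block at 4, an order-1 block at 6 and an
order-3 block at 8. Only the bad page at index 5 stays out of the free lists.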
> What would we basically need
> to do to accomplish this? Are there patches in the community which we
> should wait to see how they progress before we do any work on this front?
>
Not that I am aware of but that does not mean they do not exist.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
On Wed, Nov 16, 2005 at 11:58:13AM -0800, Christoph Lameter wrote:
> On Wed, 16 Nov 2005, Robin Holt wrote:
>
> > Russ Anderson recently introduced a patch into ia64 that changes MCA
> > behavior. When the MCA is caused by a user reference to a user's memory,
> > we put an extra reference on the page and kill the user. This leaves
> > the working memory available for other jobs while causing a leak of the
> > bad page.
> >
> > I don't know if Russ has done any testing with hugetlbfs pages. I preface
> > the remainder of my comments with a huge "I don't know anything"
> > disclaimer.
> >
> > With the new hugepages concept, would it be possible to only mark
> > the default pagesize portion of a hugepage as bad and then return the
> > remainder of the hugepage for normal use? What would we basically need
> > to do to accomplish this? Are there patches in the community which we
> > should wait to see how they progress before we do any work on this front?
>
> On IA64 we have one PTE for a huge page in a different region, so we
> cannot unmap a page sized section. Other architectures may have PTEs for
> each page sized section of a huge page. For those it may make sense
> (but then the management of the page is done via the first page_struct,
> which likely results in some challenging VM issues).
Christoph,
I think you misunderstood me. I was talking about killing the process.
All the mappings get destroyed. I want to reclaim as much of that huge
page as possible.
Once everything is cleared up, I would like to break the huge page back
into normal-size pages and free those.
Thanks,
Robin
On Thu, 17 Nov 2005, Robin Holt wrote:
> On Wed, Nov 16, 2005 at 11:58:13AM -0800, Christoph Lameter wrote:
> > On Wed, 16 Nov 2005, Robin Holt wrote:
> >
> > > Russ Anderson recently introduced a patch into ia64 that changes MCA
> > > behavior. When the MCA is caused by a user reference to a user's memory,
> > > we put an extra reference on the page and kill the user. This leaves
> > > the working memory available for other jobs while causing a leak of the
> > > bad page.
> > >
> > > I don't know if Russ has done any testing with hugetlbfs pages. I preface
> > > the remainder of my comments with a huge "I don't know anything"
> > > disclaimer.
> > >
> > > With the new hugepages concept, would it be possible to only mark
> > > the default pagesize portion of a hugepage as bad and then return the
> > > remainder of the hugepage for normal use? What would we basically need
> > > to do to accomplish this? Are there patches in the community which we
> > > should wait to see how they progress before we do any work on this front?
> >
> > On IA64 we have one PTE for a huge page in a different region, so we
> > cannot unmap a page sized section. Other architectures may have PTEs for
> > each page sized section of a huge page. For those it may make sense
> > (but then the management of the page is done via the first page_struct,
> > which likely results in some challenging VM issues).
>
> Christoph,
>
> I think you misunderstood me. I was talking about killing the process.
> All the mappings get destroyed. I want to reclaim as much of that huge
> page as possible.
>
> Once everything is cleared up, I would like to break the huge page back
> into normal-size pages and free those.
>
Then for each struct page making up that huge page (other than the one the
MCA flagged as bad), clear all the flags (see how the flags are cleared in
mm/page_alloc.c#bad_page()), set the count to 1 (set_page_count(page, 1))
and free it as an order 0 page (__free_pages(page, 0)). Some will end up on
the per-cpu lists and if the lists are full, they will end up on the zone
free lists as expected. This is similar to what happens when the buddy
allocator is being set up so look at
mm/bootmem.c#free_all_bootmem_core(pg_data_t) to get an idea of what has
to happen.
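The per-page loop described above might look roughly like the following
userspace mock-up. This is a sketch under stated assumptions, not kernel code:
struct toy_page is a hypothetical stand-in for struct page, and
toy_free_order0() only imitates the effect of freeing an order 0 page by
dropping the last reference and marking the page as free. Any real version
would need the special hugetlb handling and locking that this ignores.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct page; not the real layout. */
struct toy_page {
	unsigned long flags;
	int count;
	int on_free_list;
};

/* Imitates freeing an order 0 page: drop the reference, then mark it free. */
static void toy_free_order0(struct toy_page *page)
{
	if (--page->count == 0)
		page->on_free_list = 1;
}

/*
 * Walk every small page of the huge page, skip the bad one, reset the
 * rest and free them individually. Returns the number of pages freed.
 */
static size_t toy_split_huge_page(struct toy_page *head, size_t nr_pages,
				  const struct toy_page *bad)
{
	size_t freed = 0;

	for (size_t i = 0; i < nr_pages; i++) {
		struct toy_page *page = &head[i];

		if (page == bad)
			continue;       /* leave the bad page leaked, as today */
		page->flags = 0;        /* clear all the flags */
		page->count = 1;        /* set the count to 1 */
		toy_free_order0(page);  /* free it as an order 0 page */
		freed++;
	}
	return freed;
}
```

For an 8-page toy huge page whose fourth page is bad, the function frees seven
pages and leaves the bad page's count and flags untouched.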
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
On Thu, 17 Nov 2005, Robin Holt wrote:
> > > With the new hugepages concept, would it be possible to only mark
> > > the default pagesize portion of a hugepage as bad and then return the
> > > remainder of the hugepage for normal use? What would we basically need
> > > to do to accomplish this? Are there patches in the community which we
> > > should wait to see how they progress before we do any work on this front?
> >
> > On IA64 we have one PTE for a huge page in a different region, so we
> > cannot unmap a page sized section. Other architectures may have PTEs for
> > each page sized section of a huge page. For those it may make sense
> > (but then the management of the page is done via the first page_struct,
> > which likely results in some challenging VM issues).
>
> I think you misunderstood me. I was talking about killing the process.
> All the mappings get destroyed. I want to reclaim as much of that huge
> page as possible.
You are right. If you can reclaim the whole page (as you can when killing
a process) then you can isolate the bad page and free the rest. There
would have to be special code for that in the hugetlb layer.