When shrinking the size of the hugetlb pool via the nr_hugepages sysctl, we
are careful to keep enough pages around to satisfy reservations. But the
calculation is flawed for the following scenario:
Action                             Pool Counters (Total, Free, Resv)
======                             ==================================
Set pool to 1 page                    1      1      0
Map 1 page MAP_PRIVATE                1      1      0
Touch the page to fault it in         1      0      0
Set pool to 3 pages                   3      2      0
Map 2 pages MAP_SHARED                3      2      2
Set pool to 2 pages                   2      1      2   <-- Mistake, should be 3 2 2
Touch the 2 shared pages              2      0      1   <-- Program crashes here
The last touch above will terminate the process due to lack of huge pages.
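For reference, the scenario can be driven from userspace along these lines
(a sketch only: it assumes hugetlbfs mounted at /mnt/huge and 2MB huge
pages, both illustrative, with error checking omitted for brevity):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

static void set_pool(const char *nr)
{
        int fd = open("/proc/sys/vm/nr_hugepages", O_WRONLY);
        write(fd, nr, strlen(nr));
        close(fd);
}

int main(void)
{
        set_pool("1");                          /* 1 1 0 */

        int fd = open("/mnt/huge/priv", O_CREAT | O_RDWR, 0600);
        char *priv = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE, fd, 0);
        priv[0] = 1;                            /* fault it in: 1 0 0 */

        set_pool("3");                          /* 3 2 0 */

        fd = open("/mnt/huge/shared", O_CREAT | O_RDWR, 0600);
        char *shared = mmap(NULL, 2 * HPAGE_SIZE, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0); /* 3 2 2 */

        set_pool("2");          /* buggy kernel: 2 1 2 instead of 3 2 2 */

        shared[0] = 1;                          /* 2 0 1 */
        shared[HPAGE_SIZE] = 1;  /* SIGBUS here on an unpatched kernel */

        printf("survived -- kernel looks fixed\n");
        return 0;
}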
This patch corrects the calculation so that it factors in pages being used
for private mappings. Andrew, this is a standalone fix suitable for
mainline. It is also now corrected in my latest dynamic pool resizing
patchset which I will send out soon.
Signed-off-by: Adam Litke <[email protected]>
---
mm/hugetlb.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 84c795e..7af3908 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -224,14 +224,14 @@ static void try_to_free_low(unsigned long count)
 	for (i = 0; i < MAX_NUMNODES; ++i) {
 		struct page *page, *next;
 		list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
+			if (count >= nr_huge_pages)
+				return;
 			if (PageHighMem(page))
 				continue;
 			list_del(&page->lru);
 			update_and_free_page(page);
 			free_huge_pages--;
 			free_huge_pages_node[page_to_nid(page)]--;
-			if (count >= nr_huge_pages)
-				return;
 		}
 	}
 }
@@ -251,7 +251,7 @@ static unsigned long set_max_huge_pages(unsigned long count)
 		return nr_huge_pages;
 
 	spin_lock(&hugetlb_lock);
-	count = max(count, resv_huge_pages);
+	count = max(count, resv_huge_pages + nr_huge_pages - free_huge_pages);
 	try_to_free_low(count);
 	while (count < nr_huge_pages) {
 		struct page *page = dequeue_huge_page(NULL, 0);
On Wed, 2007-10-03 at 08:47 -0700, Adam Litke wrote:
> When shrinking the size of the hugetlb pool via the nr_hugepages sysctl, we
> are careful to keep enough pages around to satisfy reservations. But the
> calculation is flawed for the following scenario:
>
> Action                             Pool Counters (Total, Free, Resv)
> ======                             ==================================
> Set pool to 1 page                    1      1      0
> Map 1 page MAP_PRIVATE                1      1      0
> Touch the page to fault it in         1      0      0
> Set pool to 3 pages                   3      2      0
> Map 2 pages MAP_SHARED                3      2      2
> Set pool to 2 pages                   2      1      2   <-- Mistake, should be 3 2 2
> Touch the 2 shared pages              2      0      1   <-- Program crashes here
>
> The last touch above will terminate the process due to lack of huge pages.
>
> This patch corrects the calculation so that it factors in pages being used
> for private mappings. Andrew, this is a standalone fix suitable for
> mainline. It is also now corrected in my latest dynamic pool resizing
> patchset which I will send out soon.
>
> Signed-off-by: Adam Litke <[email protected]>
> ---
>
> mm/hugetlb.c | 6 +++---
> 1 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 84c795e..7af3908 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -224,14 +224,14 @@ static void try_to_free_low(unsigned long count)
>  	for (i = 0; i < MAX_NUMNODES; ++i) {
>  		struct page *page, *next;
>  		list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
> +			if (count >= nr_huge_pages)
> +				return;
>  			if (PageHighMem(page))
>  				continue;
>  			list_del(&page->lru);
>  			update_and_free_page(page);
>  			free_huge_pages--;
>  			free_huge_pages_node[page_to_nid(page)]--;
> -			if (count >= nr_huge_pages)
> -				return;
>  		}
>  	}
>  }
That's an excellent problem description. I'm just a bit hazy on how the
patch fixes it. :)
What is the actual error in this loop? The fact that we can go trying
to free pages when the count is actually OK?
BTW, try_to_free_low(count) kinda sucks for a function name. Is that
count the number of pages we're trying to end up with, or the total
number of low pages that we're trying to free?
Also, as I look at try_to_free_low(), why do we need to #ifdef it out in
the case of !HIGHMEM? If we have CONFIG_HIGHMEM=yes, we still might not
have any _actual_ high memory. So, the loop obviously doesn't *hurt*
when there is no high memory.
> @@ -251,7 +251,7 @@ static unsigned long set_max_huge_pages(unsigned long count)
>  		return nr_huge_pages;
> 
>  	spin_lock(&hugetlb_lock);
> -	count = max(count, resv_huge_pages);
> +	count = max(count, resv_huge_pages + nr_huge_pages - free_huge_pages);
>  	try_to_free_low(count);
>  	while (count < nr_huge_pages) {
>  		struct page *page = dequeue_huge_page(NULL, 0);
The real problem with this line is that "count" is too ambiguous. :)
We could rewrite the original max() line this way:
        if (resv_huge_pages > nr_of_pages_to_end_up_with)
                nr_of_pages_to_end_up_with = resv_huge_pages;
        try_to_make_the_total_nr_of_hpages(nr_of_pages_to_end_up_with);
Which makes it more clear that you're setting the number of total pages
to the number of reserved pages, which is obviously screwy.
OK, so this is actually saying: "count can never go below
resv_huge_pages+nr_huge_pages"?
Could we change try_to_free_low() to free a distinct number of pages?
        if (count > free_huge_pages)
                count = free_huge_pages;
        try_to_free_nr_huge_pages(count);
I feel a bit sketchy about the "resv_huge_pages + nr_huge_pages -
free_huge_pages" logic. Could you elaborate a bit there on what the
rules are?
-- Dave
On Wed, 2007-10-03 at 10:40 -0700, Dave Hansen wrote:
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 84c795e..7af3908 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -224,14 +224,14 @@ static void try_to_free_low(unsigned long count)
> >  	for (i = 0; i < MAX_NUMNODES; ++i) {
> >  		struct page *page, *next;
> >  		list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
> > +			if (count >= nr_huge_pages)
> > +				return;
> >  			if (PageHighMem(page))
> >  				continue;
> >  			list_del(&page->lru);
> >  			update_and_free_page(page);
> >  			free_huge_pages--;
> >  			free_huge_pages_node[page_to_nid(page)]--;
> > -			if (count >= nr_huge_pages)
> > -				return;
> >  		}
> >  	}
> >  }
>
> That's an excellent problem description. I'm just a bit hazy on how the
> patch fixes it. :)
>
> What is the actual error in this loop? The fact that we can go trying
> to free pages when the count is actually OK?
The above hunk serves only to change the behavior of try_to_free_low()
so that rather than always freeing _at_least_ one huge page, it can
return without having freed any pages.
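For reference, this is how try_to_free_low() reads with the hunk applied
(reassembled from the diff context above; the int i declaration is implied
by the loop):

static void try_to_free_low(unsigned long count)
{
        int i;

        for (i = 0; i < MAX_NUMNODES; ++i) {
                struct page *page, *next;
                list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
                        /* check first: we may return having freed nothing */
                        if (count >= nr_huge_pages)
                                return;
                        if (PageHighMem(page))
                                continue;
                        list_del(&page->lru);
                        update_and_free_page(page);
                        free_huge_pages--;
                        free_huge_pages_node[page_to_nid(page)]--;
                }
        }
}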
> BTW, try_to_free_low(count) kinda sucks for a function name. Is that
> count the number of pages we're trying to end up with, or the total
> number of low pages that we're trying to free?
I agree the name sucks, but this is a bugfix patch.
> Also, as I look at try_to_free_low(), why do we need to #ifdef it out in
> the case of !HIGHMEM? If we have CONFIG_HIGHMEM=yes, we still might not
> have any _actual_ high memory. So, the loop obviously doesn't *hurt*
> when there is no high memory.
Maybe, but that's not really in scope for what this patch is trying
to accomplish.
> > @@ -251,7 +251,7 @@ static unsigned long set_max_huge_pages(unsigned long count)
> >  		return nr_huge_pages;
> > 
> >  	spin_lock(&hugetlb_lock);
> > -	count = max(count, resv_huge_pages);
> > +	count = max(count, resv_huge_pages + nr_huge_pages - free_huge_pages);
> >  	try_to_free_low(count);
> >  	while (count < nr_huge_pages) {
> >  		struct page *page = dequeue_huge_page(NULL, 0);
>
> The real problem with this line is that "count" is too ambiguous. :)
I agree, count is almost always a bad variable name :)
> We could rewrite the original max() line this way:
>
>         if (resv_huge_pages > nr_of_pages_to_end_up_with)
>                 nr_of_pages_to_end_up_with = resv_huge_pages;
>         try_to_make_the_total_nr_of_hpages(nr_of_pages_to_end_up_with);
>
> Which makes it more clear that you're setting the number of total pages
> to the number of reserved pages, which is obviously screwy.
>
> OK, so this is actually saying: "count can never go below
> resv_huge_pages+nr_huge_pages"?
Not quite. Count can never go below the number of reserved pages plus
pages allocated to MAP_PRIVATE mappings. That number is computed by:
(resv + (total - free)).
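Plugging in the numbers from the failing step in the table: resv = 2,
total = 3, free = 2, so the floor is 2 + (3 - 2) = 3. max(2, 3) = 3, and
the pool stays at three pages as it should, instead of shrinking to two.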
> Could we change try_to_free_low() to free a distinct number of pages?
>
>         if (count > free_huge_pages)
>                 count = free_huge_pages;
>         try_to_free_nr_huge_pages(count);
>
> I feel a bit sketchy about the "resv_huge_pages + nr_huge_pages -
> free_huge_pages" logic. Could you elaborate a bit there on what the
> rules are?
The key is that we don't want to shrink the pool below the number of
pages we are committed to keeping around. Before this patch, we only
accounted for the pages we plan to hand out (reserved huge pages) but
not the ones we've already handed out (total - free). Does that make
sense?
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
On Wed, 2007-10-03 at 13:33 -0500, Adam Litke wrote:
> On Wed, 2007-10-03 at 10:40 -0700, Dave Hansen wrote:
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 84c795e..7af3908 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -224,14 +224,14 @@ static void try_to_free_low(unsigned long count)
> > >  	for (i = 0; i < MAX_NUMNODES; ++i) {
> > >  		struct page *page, *next;
> > >  		list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
> > > +			if (count >= nr_huge_pages)
> > > +				return;
> > >  			if (PageHighMem(page))
> > >  				continue;
> > >  			list_del(&page->lru);
> > >  			update_and_free_page(page);
> > >  			free_huge_pages--;
> > >  			free_huge_pages_node[page_to_nid(page)]--;
> > > -			if (count >= nr_huge_pages)
> > > -				return;
> > >  		}
> > >  	}
> > >  }
> >
> > That's an excellent problem description. I'm just a bit hazy on how the
> > patch fixes it. :)
> >
> > What is the actual error in this loop? The fact that we can go trying
> > to free pages when the count is actually OK?
>
> The above hunk serves only to change the behavior of try_to_free_low()
> so that rather than always freeing _at_least_ one huge page, it can
> return without having freed any pages.
OK, that makes sense. Can you include that in the patch description?
> > BTW, try_to_free_low(count) kinda sucks for a function name. Is that
> > count the number of pages we're trying to end up with, or the total
> > number of low pages that we're trying to free?
>
> I agree the name sucks, but this is a bugfix patch.
Oh, yeah, none of what I was suggesting belongs in this patch. But,
they'd make good followups. I was just trying to get you to look at
what caused that bug to be introduced in the first place.
> > We could rewrite the original max() line this way:
> >
> >         if (resv_huge_pages > nr_of_pages_to_end_up_with)
> >                 nr_of_pages_to_end_up_with = resv_huge_pages;
> >         try_to_make_the_total_nr_of_hpages(nr_of_pages_to_end_up_with);
> >
> > Which makes it more clear that you're setting the number of total pages
> > to the number of reserved pages, which is obviously screwy.
> >
> > OK, so this is actually saying: "count can never go below
> > resv_huge_pages+nr_huge_pages"?
>
> Not quite. Count can never go below the number of reserved pages plus
> pages allocated to MAP_PRIVATE mappings. That number is computed by:
> (resv + (total - free)).
So, (total - free) equals the number of MAP_PRIVATE pages? Does that
imply that all reserved pages are shared and that all shared pages are
reserved?
This would be clearer even with
static inline int nr_map_private_hpages(void)
{
        return total - free;
}
Especially if we could get the "shared" name somewhere into the
reserved variable.
It makes a whole ton of sense to see:
        int total_in_use = nr_shared() + nr_private();
        if (nr_requested_pages < total_in_use)
                uh_oh_cant_do_that();
That'd be a wonderful cleanup.
> > Could we change try_to_free_low() to free a distinct number of pages?
> >
> >         if (count > free_huge_pages)
> >                 count = free_huge_pages;
> >         try_to_free_nr_huge_pages(count);
> >
> > I feel a bit sketchy about the "resv_huge_pages + nr_huge_pages -
> > free_huge_pages" logic. Could you elaborate a bit there on what the
> > rules are?
>
> The key is that we don't want to shrink the pool below the number of
> pages we are committed to keeping around. Before this patch, we only
> accounted for the pages we plan to hand out (reserved huge pages) but
> not the ones we've already handed out (total - free). Does that make
> sense?
Yes. I'm just starting to get the idea that pages are committed in
different ways. So, it makes good sense to have a nice function like
total_committed_hpages(), which explains all of this, and is called to
compare against the evil "count" variable. :)
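Something like this, maybe (a sketch only -- the helper and its name are
hypothetical, not existing code in mm/hugetlb.c):

/* Hypothetical helper, built from the existing hugetlb counters. */
static unsigned long total_committed_hpages(void)
{
        /* pages already handed out (faulted in), shared or private */
        unsigned long in_use = nr_huge_pages - free_huge_pages;

        /* plus pages promised to shared mappings but not yet faulted in */
        return in_use + resv_huge_pages;
}

and then the max() line would read:

        count = max(count, total_committed_hpages());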
-- Dave
On 10/3/07, Adam Litke <[email protected]> wrote:
> The key is that we don't want to shrink the pool below the number of
> pages we are committed to keeping around. Before this patch, we only
> accounted for the pages we plan to hand out (reserved huge pages) but
> not the ones we've already handed out (total - free). Does that make
> sense?
Good catch, Adam.
From what I can see, the statement
        if (count >= nr_huge_pages)
                return nr_huge_pages;
in set_max_huge_pages() is useless because (1) we recalculate the "count"
variable below it; and (2) both try_to_free_low() and the while loop
below the call to try_to_free_low() will terminate correctly. If you
feel like it, please clean it up as well.
If not, I'm fine with that.
Acked-by: Ken Chen <[email protected]>
On 10/3/07, Dave Hansen <[email protected]> wrote:
> > Not quite. Count can never go below the number of reserved pages plus
> > pages allocated to MAP_PRIVATE mappings. That number is computed by:
> > (resv + (total - free)).
>
> So, (total - free) equals the number of MAP_PRIVATE pages? Does that
> imply that all reserved pages are shared and that all shared pages are
> reserved?
no, not quite. In-use huge pages (total - free) can be either private or
shared. resv_huge_pages counts the number of pages committed to shared
mappings but not yet faulted in.
What the equation essentially computes is: resv_huge_pages + nr-huge-pages-in-use.
- Ken