2004-06-23 19:33:32

by Chen, Kenneth W

Subject: More bug fix in mm/hugetlb.c - fix try_to_free_low()

Is it just me having bad luck with hugetlb on x86 lately? Take base
2.6.7, turn on CONFIG_HIGHMEM and CONFIG_HUGETLBFS, and try to configure
the hugetlb pool:

[root@quokka]# echo 100 > /proc/sys/vm/nr_hugepages
[root@quokka]# grep HugePage /proc/meminfo
HugePages_Total: 100
HugePages_Free: 100

[root@quokka]# echo 20 > /proc/sys/vm/nr_hugepages
[root@quokka]# grep HugePage /proc/meminfo
HugePages_Total: 0
HugePages_Free: 0

[root@quokka]# echo 100 > /proc/sys/vm/nr_hugepages
[root@quokka]# grep HugePage /proc/meminfo
HugePages_Total: 100
HugePages_Free: 100

[root@quokka]# echo 0 > /proc/sys/vm/nr_hugepages
[root@quokka]# grep HugePage /proc/meminfo
HugePages_Total: 31
HugePages_Free: 31

The argument "count" passed to try_to_free_low() is the desired size
of the hugetlb page pool. But the implementation treated that argument
as the number of pages to free, and decremented it as well. The result
is essentially random behavior, depending on how many hugetlb pages
sit in the normal versus highmem zones.

A two-line fix in try_to_free_low() would be:

- if (!--count)
- return 0;
+ if (count >= nr_huge_pages)
+ return count;

But more appropriately, that function shouldn't return anything.

diff -Nurp linux-2.6.7.orig/mm/hugetlb.c linux-2.6.7/mm/hugetlb.c
--- linux-2.6.7.orig/mm/hugetlb.c 2004-06-15 22:19:37.000000000 -0700
+++ linux-2.6.7/mm/hugetlb.c 2004-06-23 12:11:31.000000000 -0700
@@ -130,7 +130,7 @@ static void update_and_free_page(struct
}

#ifdef CONFIG_HIGHMEM
-static int try_to_free_low(unsigned long count)
+static void try_to_free_low(unsigned long count)
{
int i;
for (i = 0; i < MAX_NUMNODES; ++i) {
@@ -141,16 +141,14 @@ static int try_to_free_low(unsigned long
list_del(&page->lru);
update_and_free_page(page);
--free_huge_pages;
- if (!--count)
- return 0;
+ if (count >= nr_huge_pages)
+ return;
}
}
- return count;
}
#else
-static inline int try_to_free_low(unsigned long count)
+static inline void try_to_free_low(unsigned long count)
{
- return count;
}
#endif

@@ -170,7 +168,8 @@ static unsigned long set_max_huge_pages(
return nr_huge_pages;

spin_lock(&hugetlb_lock);
- for (count = try_to_free_low(count); count < nr_huge_pages; --free_huge_pages) {
+ try_to_free_low(count);
+ for (; count < nr_huge_pages; --free_huge_pages) {
struct page *page = dequeue_huge_page();
if (!page)
break;



2004-06-23 19:42:49

by William Lee Irwin III

Subject: Re: More bug fix in mm/hugetlb.c - fix try_to_free_low()

On Wed, Jun 23, 2004 at 12:33:00PM -0700, Chen, Kenneth W wrote:
> The argument "count" passed to try_to_free_low() is the desired size
> of the hugetlb page pool. But the implementation treated that argument
> as the number of pages to free, and decremented it as well. The result
> is essentially random behavior, depending on how many hugetlb pages
> sit in the normal versus highmem zones.
> A two-line fix in try_to_free_low() would be:

Thanks for cleaning this up; there hasn't been much apparent interest
here lately, so I've not gotten much in the way of bug reports to work from.

I suspect a general hardening effort should go on here. I'll go about
cleaning up other arches, too.


-- wli

2004-06-23 21:53:56

by Chen, Kenneth W

Subject: RE: More bug fix in mm/hugetlb.c - fix try_to_free_low()

William Lee Irwin III wrote on Wednesday, June 23, 2004 12:43 PM
> On Wed, Jun 23, 2004 at 12:33:00PM -0700, Chen, Kenneth W wrote:
> > The argument "count" passed to try_to_free_low() is the desired size
> > of the hugetlb page pool. But the implementation treated that argument
> > as the number of pages to free, and decremented it as well. The result
> > is essentially random behavior, depending on how many hugetlb pages
> > sit in the normal versus highmem zones.
> > A two-line fix in try_to_free_low() would be:
>
> Thanks for cleaning this up; there hasn't been much apparent interest
> here lately, so I've not gotten much in the way of bug reports to work from.
>
> I suspect a general hardening effort should go on here. I'll go about
> cleaning up other arches, too.

Great! Would you also consider merging the following patch? It adds
per-node huge page stats in sysfs. Diffed relative to base 2.6.7 plus
the two bug fixes in try_to_free_low().

- Ken



diff -Nurp linux-2.6.7.orig/drivers/base/node.c linux-2.6.7/drivers/base/node.c
--- linux-2.6.7.orig/drivers/base/node.c 2004-06-15 22:18:59.000000000 -0700
+++ linux-2.6.7/drivers/base/node.c 2004-06-23 13:58:27.000000000 -0700
@@ -31,12 +31,6 @@ static ssize_t node_read_cpumap(struct s

static SYSDEV_ATTR(cpumap,S_IRUGO,node_read_cpumap,NULL);

-/* Can be overwritten by architecture specific code. */
-int __attribute__((weak)) hugetlb_report_node_meminfo(int node, char *buf)
-{
- return 0;
-}
-
#define K(x) ((x) << (PAGE_SHIFT - 10))
static ssize_t node_read_meminfo(struct sys_device * dev, char * buf)
{
diff -Nurp linux-2.6.7.orig/include/linux/hugetlb.h linux-2.6.7/include/linux/hugetlb.h
--- linux-2.6.7.orig/include/linux/hugetlb.h 2004-06-15 22:19:23.000000000 -0700
+++ linux-2.6.7/include/linux/hugetlb.h 2004-06-23 14:00:08.000000000 -0700
@@ -19,6 +19,7 @@ void zap_hugepage_range(struct vm_area_s
void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
int hugetlb_report_meminfo(char *);
+int hugetlb_report_node_meminfo(int, char *);
int is_hugepage_mem_enough(size_t);
unsigned long hugetlb_total_pages(void);
struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
@@ -72,6 +73,7 @@ static inline unsigned long hugetlb_tota
#define unmap_hugepage_range(vma, start, end) BUG()
#define is_hugepage_mem_enough(size) 0
#define hugetlb_report_meminfo(buf) 0
+#define hugetlb_report_node_meminfo(n, buf) 0
#define mark_mm_hugetlb(mm, vma) do { } while (0)
#define follow_huge_pmd(mm, addr, pmd, write) 0
#define is_aligned_hugepage_range(addr, len) 0
diff -Nurp linux-2.6.7.orig/mm/hugetlb.c linux-2.6.7/mm/hugetlb.c
--- linux-2.6.7.orig/mm/hugetlb.c 2004-06-23 13:37:27.000000000 -0700
+++ linux-2.6.7/mm/hugetlb.c 2004-06-23 14:01:31.000000000 -0700
@@ -15,12 +15,16 @@ const unsigned long hugetlb_zero = 0, hu
static unsigned long nr_huge_pages, free_huge_pages;
unsigned long max_huge_pages;
static struct list_head hugepage_freelists[MAX_NUMNODES];
+static unsigned int nr_huge_pages_node[MAX_NUMNODES];
+static unsigned int free_huge_pages_node[MAX_NUMNODES];
static spinlock_t hugetlb_lock = SPIN_LOCK_UNLOCKED;

static void enqueue_huge_page(struct page *page)
{
- list_add(&page->lru,
- &hugepage_freelists[page_zone(page)->zone_pgdat->node_id]);
+ int nid = page_zone(page)->zone_pgdat->node_id;
+ list_add(&page->lru, &hugepage_freelists[nid]);
+ free_huge_pages++;
+ free_huge_pages_node[nid]++;
}

static struct page *dequeue_huge_page(void)
@@ -38,6 +42,8 @@ static struct page *dequeue_huge_page(vo
page = list_entry(hugepage_freelists[nid].next,
struct page, lru);
list_del(&page->lru);
+ free_huge_pages--;
+ free_huge_pages_node[nid]--;
}
return page;
}
@@ -49,6 +55,10 @@ static struct page *alloc_fresh_huge_pag
page = alloc_pages_node(nid, GFP_HIGHUSER|__GFP_COMP,
HUGETLB_PAGE_ORDER);
nid = (nid + 1) % numnodes;
+ if (page) {
+ nr_huge_pages++;
+ nr_huge_pages_node[page_zone(page)->zone_pgdat->node_id]++;
+ }
return page;
}

@@ -61,7 +71,6 @@ void free_huge_page(struct page *page)

spin_lock(&hugetlb_lock);
enqueue_huge_page(page);
- free_huge_pages++;
spin_unlock(&hugetlb_lock);
}

@@ -76,7 +85,6 @@ struct page *alloc_huge_page(void)
spin_unlock(&hugetlb_lock);
return NULL;
}
- free_huge_pages--;
spin_unlock(&hugetlb_lock);
set_page_count(page, 1);
page[1].mapping = (void *)free_huge_page;
@@ -119,6 +127,7 @@ static void update_and_free_page(struct
{
int i;
nr_huge_pages--;
+ nr_huge_pages_node[page_zone(page)->zone_pgdat->node_id]--;
for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++) {
page[i].flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
@@ -132,7 +141,7 @@ static void update_and_free_page(struct
#ifdef CONFIG_HIGHMEM
static void try_to_free_low(unsigned long count)
{
- int i;
+ int i, nid;
for (i = 0; i < MAX_NUMNODES; ++i) {
struct page *page, *next;
list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
@@ -140,7 +149,9 @@ static void try_to_free_low(unsigned lon
continue;
list_del(&page->lru);
update_and_free_page(page);
- --free_huge_pages;
+ nid = page_zone(page)->zone_pgdat->node_id;
+ free_huge_pages--;
+ free_huge_pages_node[nid]--;
if (count >= nr_huge_pages)
return;
}
@@ -160,8 +171,6 @@ static unsigned long set_max_huge_pages(
return nr_huge_pages;
spin_lock(&hugetlb_lock);
enqueue_huge_page(page);
- free_huge_pages++;
- nr_huge_pages++;
spin_unlock(&hugetlb_lock);
}
if (count >= nr_huge_pages)
@@ -169,7 +178,7 @@ static unsigned long set_max_huge_pages(

spin_lock(&hugetlb_lock);
try_to_free_low(count);
- for (; count < nr_huge_pages; --free_huge_pages) {
+ while (count < nr_huge_pages) {
struct page *page = dequeue_huge_page();
if (!page)
break;
@@ -201,6 +210,15 @@ int hugetlb_report_meminfo(char *buf)
HPAGE_SIZE/1024);
}

+int hugetlb_report_node_meminfo(int nid, char *buf)
+{
+ return sprintf(buf,
+ "Node %d HugePages_Total: %5u\n"
+ "Node %d HugePages_Free: %5u\n",
+ nid, nr_huge_pages_node[nid],
+ nid, free_huge_pages_node[nid]);
+}
+
int is_hugepage_mem_enough(size_t size)
{
return (size + ~HPAGE_MASK)/HPAGE_SIZE <= free_huge_pages;


2004-06-23 22:20:13

by Keith Owens

Subject: Re: More bug fix in mm/hugetlb.c - fix try_to_free_low()

On Wed, 23 Jun 2004 12:42:43 -0700,
William Lee Irwin III <[email protected]> wrote:
>On Wed, Jun 23, 2004 at 12:33:00PM -0700, Chen, Kenneth W wrote:
>> The argument "count" passed to try_to_free_low() is the desired size
>> of the hugetlb page pool. But the implementation treated that argument
>> as the number of pages to free, and decremented it as well. The result
>> is essentially random behavior, depending on how many hugetlb pages
>> sit in the normal versus highmem zones.
>> A two-line fix in try_to_free_low() would be:
>
>Thanks for cleaning this up; there hasn't been much apparent interest
>here lately, so I've not gotten much in the way of bug reports to work from.

While we are discussing hugetlb, what is the official method of
identifying a hugetlb page - at the page level, not through a vma?

When taking a crash dump, hugetlb pages must be treated as user data,
not as kernel pages. LKCD must be able to identify hugetlb pages from
the page struct, dumping cannot assume that any mm context is valid so
vma scans are out. The identification method must work whether the
hugetlb pages are in use or not. In 2.4 LKCD I added PG_hugetlb, but I
would prefer a test that did not require yet another PG flag.

2004-06-23 22:28:04

by Chen, Kenneth W

Subject: RE: More bug fix in mm/hugetlb.c - fix try_to_free_low()

>>>> Keith Owens wrote on Wednesday, June 23, 2004 3:17 PM
>William Lee Irwin III <[email protected]> wrote:
>>On Wed, Jun 23, 2004 at 12:33:00PM -0700, Chen, Kenneth W wrote:
>>> The argument "count" passed to try_to_free_low() is the desired size
>>> of the hugetlb page pool. But the implementation treated that argument
>>> as the number of pages to free, and decremented it as well. The result
>>> is essentially random behavior, depending on how many hugetlb pages
>>> sit in the normal versus highmem zones.
>>> A two-line fix in try_to_free_low() would be:
>>
>>Thanks for cleaning this up; there hasn't been much apparent interest
>>here lately, so I've not gotten much in the way of bug reports to work from.
>
>While we are discussing hugetlb, what is the official method of
>identifying a hugetlb page - at the page level, not through a vma?
>
>When taking a crash dump, hugetlb pages must be treated as user data,
>not as kernel pages. LKCD must be able to identify hugetlb pages from
>the page struct, dumping cannot assume that any mm context is valid so
>vma scans are out. The identification method must work whether the
>hugetlb pages are in use or not. In 2.4 LKCD I added PG_hugetlb, but I
>would prefer a test that did not require yet another PG flag.

There is one flag already in the page structure: PG_compound.

- Ken


2004-06-23 22:38:56

by William Lee Irwin III

Subject: Re: More bug fix in mm/hugetlb.c - fix try_to_free_low()

On Thu, Jun 24, 2004 at 08:17:29AM +1000, Keith Owens wrote:
> While we are discussing hugetlb, what is the official method of
> identifying a hugetlb page - at the page level, not through a vma?
> When taking a crash dump, hugetlb pages must be treated as user data,
> not as kernel pages. LKCD must be able to identify hugetlb pages from
> the page struct, dumping cannot assume that any mm context is valid so
> vma scans are out. The identification method must work whether the
> hugetlb pages are in use or not. In 2.4 LKCD I added PG_hugetlb, but I
> would prefer a test that did not require yet another PG flag.

PG_compound should get you going. You might want to do some extra
checks for order and release function just to be sure.


-- wli