2006-02-21 02:22:18

by David Gibson

Subject: RFC: Block reservation for hugetlbfs

These days, hugepages are demand-allocated at first fault time.
There's a somewhat dubious (and racy) heuristic when making a new
mmap() to check if there are enough available hugepages to fully
satisfy that mapping.

A particularly obvious case where the heuristic breaks down is where a
process maps its hugepages not as a single chunk, but as a bunch of
individually mmap()ed (or shmat()ed) blocks without touching and
instantiating the blocks in between allocations. In this case the
size of each block is compared against the total number of available
hugepages. It's thus easy for the process to become overcommitted,
because each block mapping will succeed, even though the total number
of hugepages required by all blocks exceeds the number available. In
particular, this defeats a program which would otherwise detect the
mapping failure and adjust its hugepage usage downward accordingly.
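
The failure mode is easy to reproduce in a userspace sketch (the pool
size, block sizes and function names below are invented for
illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pool: 8 huge pages exist. */
static unsigned long free_huge_pages = 8;

/* The old heuristic: each mmap() only asks "does THIS block fit right
 * now?", keeping no running total of what earlier mmap()s promised. */
static int heuristic_mmap_ok(unsigned long npages)
{
    return npages <= free_huge_pages;
}

/* Map several blocks the way the overcommitting program does, and
 * return how many pages were promised in total. */
static unsigned long map_blocks(const unsigned long *blocks, size_t n)
{
    unsigned long promised = 0;

    for (size_t i = 0; i < n; i++)
        if (heuristic_mmap_ok(blocks[i]))   /* every block "fits"... */
            promised += blocks[i];
    return promised;    /* ...even when the total exceeds the pool */
}
```

A single 12-page mapping would be refused up front, but three 4-page
mappings are all accepted and the process faults later with SIGBUS.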

The patch below is a draft attempt to address this problem by
strictly reserving a number of physical hugepages for hugetlbfs inodes
which have been mapped, but not instantiated. MAP_SHARED mappings are
thus "safe" - they will fail at mmap() time, not later with a SIGBUS.
MAP_PRIVATE mappings can still SIGBUS.
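
A minimal sketch of the accounting the patch aims for, assuming a
single global pool (counter and function names are chosen to echo the
patch, but the logic here is a simplification):

```c
#include <assert.h>

static unsigned long pool_free = 8;      /* pages not yet handed out */
static unsigned long pool_reserved = 0;  /* promised but uninstantiated */

/* mmap() time: fail up front unless the whole mapping can be reserved. */
static int reserve_pages(unsigned long npages)
{
    if (npages > pool_free - pool_reserved)
        return -1;                  /* -ENOMEM at mmap(), not SIGBUS */
    pool_reserved += npages;
    return 0;
}

/* fault time: a reserved fault always finds a page; an unreserved
 * (e.g. MAP_PRIVATE) fault must not eat into the reserve. */
static int fault_page(int use_reserve)
{
    if (use_reserve)
        pool_reserved--;            /* consume part of the reserve */
    else if (pool_free <= pool_reserved)
        return -1;                  /* only reserved pages remain */
    pool_free--;
    return 0;
}
```

The key invariant is that `pool_free - pool_reserved` is all that
unreserved faults may consume.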

This patch appears to address the problem at hand - it allows DB2 to
start correctly, for instance, which previously suffered the failure
described above. I'm almost certain I'm missing some locking or other
synchronization - I am entirely bewildered as to what I need to hold
to safely update i_blocks as below. Corrections for my ignorance
solicited...

Signed-off-by: David Gibson <[email protected]>

Index: working-2.6/mm/hugetlb.c
===================================================================
--- working-2.6.orig/mm/hugetlb.c 2006-02-17 14:44:04.000000000 +1100
+++ working-2.6/mm/hugetlb.c 2006-02-20 14:10:24.000000000 +1100
@@ -20,7 +20,7 @@
#include <linux/hugetlb.h>

const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
-static unsigned long nr_huge_pages, free_huge_pages;
+static unsigned long nr_huge_pages, free_huge_pages, reserved_huge_pages;
unsigned long max_huge_pages;
static struct list_head hugepage_freelists[MAX_NUMNODES];
static unsigned int nr_huge_pages_node[MAX_NUMNODES];
@@ -94,21 +94,87 @@ void free_huge_page(struct page *page)

struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr)
{
+ struct inode *inode = vma->vm_file->f_dentry->d_inode;
struct page *page;
int i;
+ int use_reserve = 0;
+
+ if (vma->vm_flags & VM_MAYSHARE) {
+ unsigned long idx;
+
+ idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
+ + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+
+ if (idx < inode->i_blocks)
+ use_reserve = 1;
+ }

spin_lock(&hugetlb_lock);
- page = dequeue_huge_page(vma, addr);
- if (!page) {
- spin_unlock(&hugetlb_lock);
- return NULL;
+ if (! use_reserve) {
+ if (free_huge_pages <= reserved_huge_pages)
+ goto fail;
+ } else {
+ reserved_huge_pages--;
}
+
+ page = dequeue_huge_page(vma, addr);
+ if (!page)
+ goto fail;
+
spin_unlock(&hugetlb_lock);
+
set_page_count(page, 1);
page[1].lru.next = (void *)free_huge_page; /* set dtor */
for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
clear_user_highpage(&page[i], addr);
return page;
+
+ fail:
+ WARN_ON(use_reserve); /* reserved allocations shouldn't fail */
+ spin_unlock(&hugetlb_lock);
+ return NULL;
+}
+
+int hugetlb_reserve_for_inode(struct inode *inode, unsigned long npages)
+{
+ struct address_space *mapping = inode->i_mapping;
+ unsigned long pg;
+ long change_in_reserve = 0;
+ struct page *page;
+ int ret = -ENOMEM;
+
+ read_lock_irq(&inode->i_mapping->tree_lock);
+ for (pg = inode->i_blocks; pg < npages; pg++) {
+ page = radix_tree_lookup(&mapping->page_tree, pg);
+ if (! page)
+ change_in_reserve++;
+ }
+
+ for (pg = npages; pg < inode->i_blocks; pg++) {
+ page = radix_tree_lookup(&mapping->page_tree, pg);
+ if (! page)
+ change_in_reserve--;
+ }
+ spin_lock(&hugetlb_lock);
+ if ((change_in_reserve > 0)
+ && (change_in_reserve > (free_huge_pages-reserved_huge_pages))) {
+ printk("Failing: change=%ld free=%lu\n",
+ change_in_reserve, free_huge_pages - reserved_huge_pages);
+ goto out;
+ }
+
+ reserved_huge_pages += change_in_reserve;
+ inode->i_blocks = npages;
+
+ ret = 0;
+
+ out:
+ spin_unlock(&hugetlb_lock);
+ read_unlock_irq(&inode->i_mapping->tree_lock);
+
+/* printk("hugetlb_reserve_for_inode(.., %lu) = %d delta = %ld\n", */
+/* npages, ret, change_in_reserve); */
+ return ret;
}

static int __init hugetlb_init(void)
@@ -225,9 +291,11 @@ int hugetlb_report_meminfo(char *buf)
return sprintf(buf,
"HugePages_Total: %5lu\n"
"HugePages_Free: %5lu\n"
+ "HugePages_Rsvd: %5lu\n"
"Hugepagesize: %5lu kB\n",
nr_huge_pages,
free_huge_pages,
+ reserved_huge_pages,
HPAGE_SIZE/1024);
}

Index: working-2.6/fs/hugetlbfs/inode.c
===================================================================
--- working-2.6.orig/fs/hugetlbfs/inode.c 2006-02-02 10:08:13.000000000 +1100
+++ working-2.6/fs/hugetlbfs/inode.c 2006-02-20 14:10:24.000000000 +1100
@@ -56,48 +56,9 @@ static void huge_pagevec_release(struct
pagevec_reinit(pvec);
}

-/*
- * huge_pages_needed tries to determine the number of new huge pages that
- * will be required to fully populate this VMA. This will be equal to
- * the size of the VMA in huge pages minus the number of huge pages
- * (covered by this VMA) that are found in the page cache.
- *
- * Result is in bytes to be compatible with is_hugepage_mem_enough()
- */
-static unsigned long
-huge_pages_needed(struct address_space *mapping, struct vm_area_struct *vma)
-{
- int i;
- struct pagevec pvec;
- unsigned long start = vma->vm_start;
- unsigned long end = vma->vm_end;
- unsigned long hugepages = (end - start) >> HPAGE_SHIFT;
- pgoff_t next = vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT);
- pgoff_t endpg = next + hugepages;
-
- pagevec_init(&pvec, 0);
- while (next < endpg) {
- if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE))
- break;
- for (i = 0; i < pagevec_count(&pvec); i++) {
- struct page *page = pvec.pages[i];
- if (page->index > next)
- next = page->index;
- if (page->index >= endpg)
- break;
- next++;
- hugepages--;
- }
- huge_pagevec_release(&pvec);
- }
- return hugepages << HPAGE_SHIFT;
-}
-
static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
{
struct inode *inode = file->f_dentry->d_inode;
- struct address_space *mapping = inode->i_mapping;
- unsigned long bytes;
loff_t len, vma_len;
int ret;

@@ -113,13 +74,8 @@ static int hugetlbfs_file_mmap(struct fi
if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
return -EINVAL;

- bytes = huge_pages_needed(mapping, vma);
- if (!is_hugepage_mem_enough(bytes))
- return -ENOMEM;
-
vma_len = (loff_t)(vma->vm_end - vma->vm_start);

- mutex_lock(&inode->i_mutex);
file_accessed(file);
vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
vma->vm_ops = &hugetlb_vm_ops;
@@ -129,13 +85,22 @@ static int hugetlbfs_file_mmap(struct fi
if (!(vma->vm_flags & VM_WRITE) && len > inode->i_size)
goto out;

- ret = 0;
+ if (inode->i_blocks < (len >> HPAGE_SHIFT)) {
+ ret = hugetlb_reserve_for_inode(inode, len >> HPAGE_SHIFT);
+ if (ret) {
+ printk("Reservation failed: %lu pages\n",
+ (unsigned long)(len >> HPAGE_SHIFT));
+ goto out;
+ }
+ }
+
hugetlb_prefault_arch_hook(vma->vm_mm);
if (inode->i_size < len)
inode->i_size = len;
-out:
- mutex_unlock(&inode->i_mutex);

+ ret = 0;
+
+out:
return ret;
}

@@ -227,12 +192,17 @@ static void truncate_huge_page(struct pa
put_page(page);
}

-static void truncate_hugepages(struct address_space *mapping, loff_t lstart)
+static void truncate_hugepages(struct inode *inode, loff_t lstart)
{
+ struct address_space *mapping = &inode->i_data;
const pgoff_t start = lstart >> HPAGE_SHIFT;
struct pagevec pvec;
pgoff_t next;
int i;
+ int ret;
+
+ ret = hugetlb_reserve_for_inode(inode, lstart >> HPAGE_SHIFT);
+ WARN_ON(ret); /* truncation should never cause a reservation failure */

pagevec_init(&pvec, 0);
next = start;
@@ -262,8 +232,7 @@ static void truncate_hugepages(struct ad

static void hugetlbfs_delete_inode(struct inode *inode)
{
- if (inode->i_data.nrpages)
- truncate_hugepages(&inode->i_data, 0);
+ truncate_hugepages(inode, 0);
clear_inode(inode);
}

@@ -296,8 +265,7 @@ static void hugetlbfs_forget_inode(struc
inode->i_state |= I_FREEING;
inodes_stat.nr_inodes--;
spin_unlock(&inode_lock);
- if (inode->i_data.nrpages)
- truncate_hugepages(&inode->i_data, 0);
+ truncate_hugepages(inode, 0);
clear_inode(inode);
destroy_inode(inode);
}
@@ -356,7 +324,7 @@ static int hugetlb_vmtruncate(struct ino
if (!prio_tree_empty(&mapping->i_mmap))
hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
spin_unlock(&mapping->i_mmap_lock);
- truncate_hugepages(mapping, offset);
+ truncate_hugepages(inode, offset);
return 0;
}

Index: working-2.6/include/linux/hugetlb.h
===================================================================
--- working-2.6.orig/include/linux/hugetlb.h 2006-01-16 13:02:29.000000000 +1100
+++ working-2.6/include/linux/hugetlb.h 2006-02-20 14:10:24.000000000 +1100
@@ -22,6 +22,7 @@ int hugetlb_report_meminfo(char *);
int hugetlb_report_node_meminfo(int, char *);
int is_hugepage_mem_enough(size_t);
unsigned long hugetlb_total_pages(void);
+int hugetlb_reserve_for_inode(struct inode *inode, unsigned long npages);
struct page *alloc_huge_page(struct vm_area_struct *, unsigned long);
void free_huge_page(struct page *);
int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


2006-02-21 07:51:21

by Nick Piggin

Subject: Re: RFC: Block reservation for hugetlbfs

David Gibson wrote:
> These days, hugepages are demand-allocated at first fault time.
> There's a somewhat dubious (and racy) heuristic when making a new
> mmap() to check if there are enough available hugepages to fully
> satisfy that mapping.
>
> A particularly obvious case where the heuristic breaks down is where a
> process maps its hugepages not as a single chunk, but as a bunch of
> individually mmap()ed (or shmat()ed) blocks without touching and
> instantiating the blocks in between allocations. In this case the
> size of each block is compared against the total number of available
> hugepages. It's thus easy for the process to become overcommitted,
> because each block mapping will succeed, although the total number of
> hugepages required by all blocks exceeds the number available. In
> particular, this defeats such a program which will detect a mapping
> failure and adjust its hugepage usage downward accordingly.
>
> The patch below is a draft attempt to address this problem, by
> strictly reserving a number of physical hugepages for hugepages inodes
> which have been mapped, but not instatiated. MAP_SHARED mappings are
> thus "safe" - they will fail on mmap(), not later with a SIGBUS.
> MAP_PRIVATE mappings can still SIGBUS.
>
> This patch appears to address the problem at hand - it allows DB2 to
> start correctly, for instance, which previously suffered the failure
> described above. I'm almost certain I'm missing some locking or other
> synchronization - I am entirely bewildered as to what I need to hold
> to safely update i_blocks as below. Corrections for my ignorance
> solicited...
>
> Signed-off-by: David Gibson <[email protected]>
>

This introduces
tree_lock(r) -> hugetlb_lock

And we already have
hugetlb_lock -> lru_lock

So we now have tree_lock(r) -> lru_lock, which would deadlock
against lru_lock -> tree_lock(w), right?

From a quick glance it looks safe, but I'd _really_ rather not
introduce something like this.
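
The ordering argument can be checked mechanically. Below is a toy
lockdep-style order checker (entirely invented for illustration, not
kernel code): record each "A taken before B" pair and test whether the
recorded edges now form a cycle.

```c
#include <assert.h>

enum { TREE_LOCK, HUGETLB_LOCK, LRU_LOCK, NLOCKS };

/* before[a][b] != 0 means lock a has been seen held while taking b */
static int before[NLOCKS][NLOCKS];

static void taken_before(int a, int b) { before[a][b] = 1; }

/* depth-first search for a path a -> ... -> b through recorded edges */
static int reach_rec(int a, int b, int *seen)
{
    if (before[a][b])
        return 1;
    for (int k = 0; k < NLOCKS; k++)
        if (before[a][k] && !seen[k]) {
            seen[k] = 1;
            if (reach_rec(k, b, seen))
                return 1;
        }
    return 0;
}

static int must_precede(int a, int b)
{
    int seen[NLOCKS] = { 0 };

    seen[a] = 1;
    return reach_rec(a, b, seen);
}
```

If `must_precede()` holds in both directions for any pair, the
orderings can deadlock.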

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-02-21 19:25:40

by Dave Hansen

Subject: Re: RFC: Block reservation for hugetlbfs

On Tue, 2006-02-21 at 13:21 +1100, David Gibson wrote:
> The patch below is a draft attempt to address this problem, by
> strictly reserving a number of physical hugepages for hugepages inodes
> which have been mapped, but not instatiated. MAP_SHARED mappings are
> thus "safe" - they will fail on mmap(), not later with a SIGBUS.
> MAP_PRIVATE mappings can still SIGBUS.
...
> + read_lock_irq(&inode->i_mapping->tree_lock);
> + for (pg = inode->i_blocks; pg < npages; pg++) {
> + page = radix_tree_lookup(&mapping->page_tree, pg);
> + if (! page)
> + change_in_reserve++;
> + }
> +
> + for (pg = npages; pg < inode->i_blocks; pg++) {
> + page = radix_tree_lookup(&mapping->page_tree, pg);
> + if (! page)
> + change_in_reserve--;
> + }
> + spin_lock(&hugetlb_lock);

I'm a bit confused by this. The for loops go through and look for
pages that have indexes greater than the current i_blocks, but less than
the number of pages requested by this reservation. With demand
faulting, can there be holes in the file? Will this code account for
them?

I think this would also be a bit more clear if the two for loops were a
bit more explicit in what they are doing. The first is growing the
reservation when "npages > inode->i_blocks" and the second is shrinking
the reservation when "npages < inode->i_blocks", right? Is this
completely clear from the code? ;)

Also, since the operation you are performing is actually counting pages
in the radix tree at certain indexes, would it be sane to have a patch
such as the (completely untested) attached one to do just that, but in
the radix code?

-- Dave


Attachments:
radix-tree-count.patch (1.78 kB)

2006-02-21 23:47:20

by David Gibson

Subject: Re: RFC: Block reservation for hugetlbfs

On Tue, Feb 21, 2006 at 03:18:59PM +1100, Nick Piggin wrote:
> David Gibson wrote:
> >These days, hugepages are demand-allocated at first fault time.
> >There's a somewhat dubious (and racy) heuristic when making a new
> >mmap() to check if there are enough available hugepages to fully
> >satisfy that mapping.
> >
> >A particularly obvious case where the heuristic breaks down is where a
> >process maps its hugepages not as a single chunk, but as a bunch of
> >individually mmap()ed (or shmat()ed) blocks without touching and
> >instantiating the blocks in between allocations. In this case the
> >size of each block is compared against the total number of available
> >hugepages. It's thus easy for the process to become overcommitted,
> >because each block mapping will succeed, although the total number of
> >hugepages required by all blocks exceeds the number available. In
> >particular, this defeats such a program which will detect a mapping
> >failure and adjust its hugepage usage downward accordingly.
> >
> >The patch below is a draft attempt to address this problem, by
> >strictly reserving a number of physical hugepages for hugepages inodes
> >which have been mapped, but not instatiated. MAP_SHARED mappings are
> >thus "safe" - they will fail on mmap(), not later with a SIGBUS.
> >MAP_PRIVATE mappings can still SIGBUS.
> >
> >This patch appears to address the problem at hand - it allows DB2 to
> >start correctly, for instance, which previously suffered the failure
> >described above. I'm almost certain I'm missing some locking or other
> >synchronization - I am entirely bewildered as to what I need to hold
> >to safely update i_blocks as below. Corrections for my ignorance
> >solicited...
> >
> >Signed-off-by: David Gibson <[email protected]>
> >
>
> This introduces
> tree_lock(r) -> hugetlb_lock
>
> And we already have
> hugetlb_lock -> lru_lock
>
> So we now have tree_lock(r) -> lru_lock, which would deadlock
> against lru_lock -> tree_lock(w), right?
>
> From a quick glance it looks safe, but I'd _really_ rather not
> introduce something like this.

Urg.. good point. I hadn't even thought of that consequence - I was
more worried about whether I need i_lock or i_mutex to protect my
updates to i_blocks.

Would hugetlb_lock -> tree_lock(r) be preferable? (I think that's a
possible alternative.)

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

2006-02-21 23:47:11

by David Gibson

Subject: Re: RFC: Block reservation for hugetlbfs

On Tue, Feb 21, 2006 at 11:25:36AM -0800, Dave Hansen wrote:
> On Tue, 2006-02-21 at 13:21 +1100, David Gibson wrote:
> > The patch below is a draft attempt to address this problem, by
> > strictly reserving a number of physical hugepages for hugepages inodes
> > which have been mapped, but not instatiated. MAP_SHARED mappings are
> > thus "safe" - they will fail on mmap(), not later with a SIGBUS.
> > MAP_PRIVATE mappings can still SIGBUS.
> ...
> > + read_lock_irq(&inode->i_mapping->tree_lock);
> > + for (pg = inode->i_blocks; pg < npages; pg++) {
> > + page = radix_tree_lookup(&mapping->page_tree, pg);
> > + if (! page)
> > + change_in_reserve++;
> > + }
> > +
> > + for (pg = npages; pg < inode->i_blocks; pg++) {
> > + page = radix_tree_lookup(&mapping->page_tree, pg);
> > + if (! page)
> > + change_in_reserve--;
> > + }
> > + spin_lock(&hugetlb_lock);
>
> I'm a bit confused by this. The for loops goes through and looks for
> pages that have indexes greater than the current i_blocks, but less than
> the number of pages requested by this reservation. With demand
> faulting, can there be holes in the file? Will this code account for
> them?

There can be holes in terms of which pages are instantiated, and in
terms of what's mapped, but with this patch we always reserve space
from the beginning of the file. This is a simplifying assumption, so
that i_blocks alone can store sufficient information about which pages
are pre-reserved. Otherwise we'd either need individual information
about the reservation status of each page - and there's no obvious
place to store it - or we'd have to deduce it in multiple places by
working through each vma which might map the page. Andy Whitcroft's
old hugepage strict reservation patch used the latter approach; I
invented this i_blocks method because it seemed simpler.
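
As a sketch of that arithmetic (userspace, with a flat array standing
in for the radix tree and invented names): moving the reservation
point from i_blocks to npages only has to count the uninstantiated
pages in whichever range lies between the two.

```c
#include <assert.h>

#define MAXPAGES 16
static int page_present[MAXPAGES];     /* 1 = page already instantiated */
static unsigned long blocks_reserved;  /* plays the role of i_blocks */

/* How many extra pages must be reserved (positive) or may be released
 * (negative) to move the reservation point to npages? */
static long reserve_delta(unsigned long npages)
{
    long delta = 0;

    /* growing: each missing page in [blocks_reserved, npages) needs
     * a fresh reservation */
    for (unsigned long pg = blocks_reserved; pg < npages; pg++)
        if (!page_present[pg])
            delta++;

    /* shrinking: each missing page in [npages, blocks_reserved) gives
     * its reservation back */
    for (unsigned long pg = npages; pg < blocks_reserved; pg++)
        if (!page_present[pg])
            delta--;
    return delta;
}
```

At most one of the two loops does any work, since one of the two
ranges is always empty.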

> I think this would also be a bit more clear if the two for loops were a
> bit more explicit in what they are doing. The first is growing the
> reservation when "npages > inode->i_blocks" and the second is shrinking
> the reservation when "npages < inode->i_blocks", right? Is this
> completely clear from the code? ;)

Ok, I'll add some more explanation here.

> Also, since the operation you are performing is actually counting pages
> in the radix tree at certain indexes, would it be sane to have a patch
> such as the (completely untested) attached one to do just that, but in
> the radix code?

Given that the patch below seems to introduce more code than the
above, but scattered across several places, and the hugepage
accounting code would be its only user, I don't see the point.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

2006-02-22 01:37:26

by Nick Piggin

Subject: Re: RFC: Block reservation for hugetlbfs

David Gibson wrote:
> On Tue, Feb 21, 2006 at 03:18:59PM +1100, Nick Piggin wrote:

>>This introduces
>>tree_lock(r) -> hugetlb_lock
>>
>>And we already have
>>hugetlb_lock -> lru_lock
>>
>>So we now have tree_lock(r) -> lru_lock, which would deadlock
>>against lru_lock -> tree_lock(w), right?
>>
>>From a quick glance it looks safe, but I'd _really_ rather not
>>introduce something like this.
>
>
> Urg.. good point. I hadn't even thought of that consequence - I was
> more worried about whether I need i_lock or i_mutex to protect my
> updates to i_blocks.
>
> Would hugetlb_lock -> tree_lock(r) be any preferable (I think that's a
> possible alternative).
>

Yes I think that should avoid the introduction of new lock dependency.

--
SUSE Labs, Novell Inc.

2006-02-22 02:27:06

by David Gibson

Subject: Re: RFC: Block reservation for hugetlbfs

On Wed, Feb 22, 2006 at 11:38:42AM +1100, Nick Piggin wrote:
> David Gibson wrote:
> >On Tue, Feb 21, 2006 at 03:18:59PM +1100, Nick Piggin wrote:
>
> >>This introduces
> >>tree_lock(r) -> hugetlb_lock
> >>
> >>And we already have
> >>hugetlb_lock -> lru_lock
> >>
> >>So we now have tree_lock(r) -> lru_lock, which would deadlock
> >>against lru_lock -> tree_lock(w), right?
> >>
> >>From a quick glance it looks safe, but I'd _really_ rather not
> >>introduce something like this.
> >
> >
> >Urg.. good point. I hadn't even thought of that consequence - I was
> >more worried about whether I need i_lock or i_mutex to protect my
> >updates to i_blocks.
> >
> >Would hugetlb_lock -> tree_lock(r) be any preferable (I think that's a
> >possible alternative).
> >
>
> Yes I think that should avoid the introduction of new lock dependency.

Err... "Yes" appears to contradict the rest of your statement, since my
suggestion would still introduce a lock dependency, just a different
one. It is not at all obvious to me how to avoid a lock
dependency entirely.

Also, any thoughts on whether I need i_lock or i_mutex or something
else would be handy..

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

2006-02-22 05:16:13

by Nick Piggin

Subject: Re: RFC: Block reservation for hugetlbfs

David Gibson wrote:
> On Wed, Feb 22, 2006 at 11:38:42AM +1100, Nick Piggin wrote:
>
>>David Gibson wrote:
>>
>>>On Tue, Feb 21, 2006 at 03:18:59PM +1100, Nick Piggin wrote:
>>
>>>>This introduces
>>>>tree_lock(r) -> hugetlb_lock
>>>>
>>>>And we already have
>>>>hugetlb_lock -> lru_lock
>>>>
>>>>So we now have tree_lock(r) -> lru_lock, which would deadlock
>>>>against lru_lock -> tree_lock(w), right?
>>>>
>>>>From a quick glance it looks safe, but I'd _really_ rather not
>>>>introduce something like this.
>>>
>>>
>>>Urg.. good point. I hadn't even thought of that consequence - I was
>>>more worried about whether I need i_lock or i_mutex to protect my
>>>updates to i_blocks.
>>>
>>>Would hugetlb_lock -> tree_lock(r) be any preferable (I think that's a
>>>possible alternative).
>>>
>>
>>Yes I think that should avoid the introduction of new lock dependency.
>
>
> Err... "Yes" appears to contradict the rest of you statement, since my
> suggestion would still introduce a lock dependency, just a different
> one one. It is not at all obvious to me how to avoid a lock
> dependency entirely.
>

I mean a new core mm lock dependency (ie. lru_lock -> tree_lock).

But I must have been smoking something last night: for the life
of me I can't see why I thought there was already a hugetlb_lock
-> lru_lock dependency in there...?!

So I retract my statement. What you have there seems OK.


> Also, any thoughts on whether I need i_lock or i_mutex or something
> else would be handy..
>

I'm not much of an fs guy. How come you don't use i_size?

--
SUSE Labs, Novell Inc.

2006-02-24 04:46:38

by David Gibson

Subject: Re: RFC: Block reservation for hugetlbfs

On Wed, Feb 22, 2006 at 02:09:09PM +1100, Nick Piggin wrote:
> David Gibson wrote:
> >On Wed, Feb 22, 2006 at 11:38:42AM +1100, Nick Piggin wrote:
> >
> >>David Gibson wrote:
> >>
> >>>On Tue, Feb 21, 2006 at 03:18:59PM +1100, Nick Piggin wrote:
> >>
> >>>>This introduces
> >>>>tree_lock(r) -> hugetlb_lock
> >>>>
> >>>>And we already have
> >>>>hugetlb_lock -> lru_lock
> >>>>
> >>>>So we now have tree_lock(r) -> lru_lock, which would deadlock
> >>>>against lru_lock -> tree_lock(w), right?
> >>>>
> >>>>From a quick glance it looks safe, but I'd _really_ rather not
> >>>>introduce something like this.
> >>>
> >>>
> >>>Urg.. good point. I hadn't even thought of that consequence - I was
> >>>more worried about whether I need i_lock or i_mutex to protect my
> >>>updates to i_blocks.
> >>>
> >>>Would hugetlb_lock -> tree_lock(r) be any preferable (I think that's a
> >>>possible alternative).
> >>>
> >>
> >>Yes I think that should avoid the introduction of new lock dependency.
> >
> >
> >Err... "Yes" appears to contradict the rest of you statement, since my
> >suggestion would still introduce a lock dependency, just a different
> >one one. It is not at all obvious to me how to avoid a lock
> >dependency entirely.
> >
>
> I mean a new core mm lock depenency (ie. lru_lock -> tree_lock).
>
> But I must have been smoking something last night: for the life
> of me I can't see why I thought there was already a hugetlb_lock
> -> lru_lock dependency in there...?!
>
> So I retract my statement. What you have there seems OK.

Sadly, you weren't smoking something, and it's not OK. As akpm
pointed out later, the lru_lock dependency is via __free_pages(), which
is called from update_and_free_page() with hugetlb_lock held.

> >Also, any thoughts on whether I need i_lock or i_mutex or something
> >else would be handy..
>
> I'm not much of an fs guy. How come you don't use i_size?

i_size is already used for a hard limit on the file size - faulting a
page beyond i_size will SIGBUS, whereas faulting a page beyond
i_blocks just isn't guaranteed. In particular, we always extend
i_size when making a new mapping, whereas we only extend i_blocks
(and thus reserve pages) on a SHARED mapping (because space is being
guaranteed for things in the mapping, not for a random process's
MAP_PRIVATE copy).
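
Those two limits can be summarised in a sketch (simplified; indices
are in huge pages, and all names here are invented):

```c
#include <assert.h>

enum fault_result { FAULT_OK, FAULT_MAYBE, FAULT_SIGBUS };

/* What happens when page idx of a hugetlbfs file is faulted? */
static enum fault_result classify_fault(unsigned long idx,
                                        unsigned long size_pages,     /* i_size */
                                        unsigned long reserved_pages, /* i_blocks */
                                        int shared)
{
    if (idx >= size_pages)
        return FAULT_SIGBUS;    /* beyond i_size: always SIGBUS */
    if (shared && idx < reserved_pages)
        return FAULT_OK;        /* reserved: a page is guaranteed */
    return FAULT_MAYBE;         /* e.g. MAP_PRIVATE: best effort */
}
```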

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

2006-02-24 06:22:11

by Nick Piggin

Subject: Re: RFC: Block reservation for hugetlbfs

David Gibson wrote:
> On Wed, Feb 22, 2006 at 02:09:09PM +1100, Nick Piggin wrote:

>>I mean a new core mm lock depenency (ie. lru_lock -> tree_lock).
>>
>>But I must have been smoking something last night: for the life
>>of me I can't see why I thought there was already a hugetlb_lock
>>-> lru_lock dependency in there...?!
>>
>>So I retract my statement. What you have there seems OK.
>
>
> Sadly, you weren't smoking something, and it's not OK. As akpm
> pointed out later, the lru_lock dependecy is via __free_pages() which
> is called from update_and_free_page() with hugetlb_lock held.
>

You're either thinking of zone->lock, or put_page/page_cache_release.
The former is happy to nest inside anything because it is confined to
page_alloc. The latter AFAIKS is not called from inside the hugepage
lock.

>
>>>Also, any thoughts on whether I need i_lock or i_mutex or something
>>>else would be handy..
>>
>>I'm not much of an fs guy. How come you don't use i_size?
>
>
> i_size is already used for a hard limit on the file size - faulting a
> page beyond i_size will SIGBUS, whereas faulting a page beyond
> i_blocks just isn't guaranteed. In particular, we always extend
> i_size when makiing a new mapping, whereas we only extend i_blocks
> (and thus reserve pages) on a SHARED mapping (because space is being
> guaranteed for things in the mapping, not for a random processes
> MAP_PRIVATE copy).
>

Oh OK, I misread how you're using it. I thought you wanted the whole
mapping to be guaranteed.

The other thing you might be able to do is use hugetlbfs inode
private data so you don't have to overload vfs things?

--
SUSE Labs, Novell Inc.

2006-02-27 00:19:11

by David Gibson

Subject: Re: RFC: Block reservation for hugetlbfs

On Fri, Feb 24, 2006 at 05:22:06PM +1100, Nick Piggin wrote:
> David Gibson wrote:
> >On Wed, Feb 22, 2006 at 02:09:09PM +1100, Nick Piggin wrote:
>
> >>I mean a new core mm lock depenency (ie. lru_lock -> tree_lock).
> >>
> >>But I must have been smoking something last night: for the life
> >>of me I can't see why I thought there was already a hugetlb_lock
> >>-> lru_lock dependency in there...?!
> >>
> >>So I retract my statement. What you have there seems OK.
> >
> >
> >Sadly, you weren't smoking something, and it's not OK. As akpm
> >pointed out later, the lru_lock dependecy is via __free_pages() which
> >is called from update_and_free_page() with hugetlb_lock held.
>
> You're either thinking of zone->lock, or put_page/page_cache_release.
> The former is happy to nest inside anything because it is confined to
> page_alloc. The latter AFAIKS is not called from inside the hugepage
> lock.

Um.. I guess so. I know I spotted the zone->lock in the
__free_pages() path, but I also thought I found the lru_lock. Can't
find it now though, guess I was mistaken.

> >>>Also, any thoughts on whether I need i_lock or i_mutex or something
> >>>else would be handy..
> >>
> >>I'm not much of an fs guy. How come you don't use i_size?
> >
> >
> >i_size is already used for a hard limit on the file size - faulting a
> >page beyond i_size will SIGBUS, whereas faulting a page beyond
> >i_blocks just isn't guaranteed. In particular, we always extend
> >i_size when makiing a new mapping, whereas we only extend i_blocks
> >(and thus reserve pages) on a SHARED mapping (because space is being
> >guaranteed for things in the mapping, not for a random processes
> >MAP_PRIVATE copy).
>
> Oh OK I misread how you're using it. I thought you wanted to be
> able to guarantee the whole thing would be guaranteed.

No. Well... "whole thing" is kind of misleading. I don't want to
reserve on MAP_PRIVATE mappings (because I haven't found any semantics
for that which really make any sense). But we do extend i_size on
MAP_PRIVATE mappings, so we can (privately) instantiate things in that
space without getting a SIGBUS.

> The other thing you might be able to do is use hugetlbfs inode
> private data so you don't have to overload vfs things?

Yeah, I'm thinking that would be more sensible. Only curly bit is
that it would need to be set up from fs/hugetlbfs/inode.c, but
accessed from mm/hugetlb.c.
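
The usual way to share per-inode private data across files is the
embedded-inode pattern: keep the data in a filesystem-specific struct
containing the VFS inode, and step back to the container with
offsetof(). A userspace sketch (struct and field names illustrative,
with a stub inode):

```c
#include <assert.h>
#include <stddef.h>

struct inode { unsigned long i_ino; };  /* stand-in for the VFS inode */

struct hugetlbfs_inode_info {
    unsigned long prereserved_pages;    /* would replace the i_blocks overload */
    struct inode vfs_inode;             /* embedded VFS inode */
};

/* Recover the containing struct from a pointer to the embedded inode;
 * this is what a HUGETLBFS_I()-style accessor would do. */
static struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
{
    return (struct hugetlbfs_inode_info *)
        ((char *)inode - offsetof(struct hugetlbfs_inode_info, vfs_inode));
}
```

The accessor lives in a shared header, so fs/hugetlbfs/inode.c can
allocate the container while mm/hugetlb.c reaches the private fields
through the plain inode pointer it already has.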

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson