From: Zhang Yanmin <[email protected]>
In function hugetlb_no_page, backout path always calls hugetlb_put_quota.
It's incorrect when find_lock_page gets the page or the new page is added
into page cache.
In addition, if the vma->vm_flags doesn't include VM_SHARED, the quota
shouldn't be decreased.
My patch against 2.6.16-rc5-mm2 fixes it.
Signed-off-by: Zhang Yanmin <[email protected]>
---
diff -Nraup linux-2.6.16-rc5-mm2/mm/hugetlb.c linux-2.6.16-rc5-mm2_quota/mm/hugetlb.c
--- linux-2.6.16-rc5-mm2/mm/hugetlb.c 2006-03-06 18:48:18.000000000 +0800
+++ linux-2.6.16-rc5-mm2_quota/mm/hugetlb.c 2006-03-06 18:48:58.000000000 +0800
@@ -562,11 +562,12 @@ int hugetlb_no_page(struct mm_struct *mm
retry:
page = find_lock_page(mapping, idx);
if (!page) {
- if (hugetlb_get_quota(mapping))
+ if (vma->vm_flags & VM_SHARED && hugetlb_get_quota(mapping))
goto out;
page = alloc_huge_page(vma, address);
if (!page) {
- hugetlb_put_quota(mapping);
+ if (vma->vm_flags & VM_SHARED)
+ hugetlb_put_quota(mapping);
ret = VM_FAULT_OOM;
goto out;
}
@@ -613,7 +614,6 @@ out:
backout:
spin_unlock(&mm->page_table_lock);
- hugetlb_put_quota(mapping);
unlock_page(page);
put_page(page);
goto out;
Zhang, Yanmin wrote on Sunday, March 05, 2006 10:22 PM
> In function hugetlb_no_page, backout path always calls hugetlb_put_quota.
> It's incorrect when find_lock_page gets the page or the new page is added
> into page cache.
While I acknowledge the bug, this patch is not complete. It makes file
system quota consistent with respect to page cache state. But such quota
(more severely, the page cache state) is still buggy, for example under
ftruncate case: if one ftrucate hugetlb file and then tries to fault a
page outside ftruncate area, a new hugetlb page is allocated and then
added into page cache along with file system quota; and at the end
returning VM_FAULT_SIGBUS. In this case, kernel traps an unreachable
page until possibly next mmap that extends it. That need to be fixed.
Which means we will be adding back conditional call to
hugetlb_put_quota(mapping) in the backout path.
> In addition, if the vma->vm_flags doesn't include VM_SHARED, the quota
> shouldn't be decreased.
Why? Private hugetlb page should be charged against the quota. Or is
there a better reason not to do so?
- Ken
On Mon, 2006-03-06 at 16:15, Chen, Kenneth W wrote:
> Zhang, Yanmin wrote on Sunday, March 05, 2006 10:22 PM
> > In function hugetlb_no_page, backout path always calls hugetlb_put_quota.
> > It's incorrect when find_lock_page gets the page or the new page is added
> > into page cache.
>
> While I acknowledge the bug, this patch is not complete. It makes file
> system quota consistent with respect to page cache state. But such quota
> (more severely, the page cache state) is still buggy, for example under
> ftruncate case: if one ftrucate hugetlb file and then tries to fault a
> page outside ftruncate area, a new hugetlb page is allocated and then
> added into page cache along with file system quota; and at the end
> returning VM_FAULT_SIGBUS. In this case, kernel traps an unreachable
> page until possibly next mmap that extends it. That need to be fixed.
I have another patch to fix it. The second patch is to delete checking
(!(vma->vm_flags & VM_WRITE) && len > inode->i_size) in function
hugetlbfs_file_mmap, and add a checking in hugetlb_no_page,
to implement a capability for application to mmap
an zero-length huge page area. It's useful for process to protect one area.
As a side effect, the second patch also fixes the problem you said.
> Which means we will be adding back conditional call to
> hugetlb_put_quota(mapping) in the backout path.
>
>
> > In addition, if the vma->vm_flags doesn't include VM_SHARED, the quota
> > shouldn't be decreased.
>
> Why? Private hugetlb page should be charged against the quota. Or is
> there a better reason not to do so?
I checked tmpfs and other fs. Most of them don't charge private
page against the quota.
On Mon, 2006-03-06 at 17:06, Zhang, Yanmin wrote:
> On Mon, 2006-03-06 at 16:15, Chen, Kenneth W wrote:
> > Zhang, Yanmin wrote on Sunday, March 05, 2006 10:22 PM
> > > In function hugetlb_no_page, backout path always calls hugetlb_put_quota.
> > > It's incorrect when find_lock_page gets the page or the new page is added
> > > into page cache.
> >
> > While I acknowledge the bug, this patch is not complete. It makes file
> > system quota consistent with respect to page cache state. But such quota
> > (more severely, the page cache state) is still buggy, for example under
> > ftruncate case: if one ftrucate hugetlb file and then tries to fault a
> > page outside ftruncate area, a new hugetlb page is allocated and then
> > added into page cache along with file system quota; and at the end
> > returning VM_FAULT_SIGBUS. In this case, kernel traps an unreachable
> > page until possibly next mmap that extends it. That need to be fixed.
> I have another patch to fix it. The second patch is to delete checking
> (!(vma->vm_flags & VM_WRITE) && len > inode->i_size) in function
> hugetlbfs_file_mmap, and add a checking in hugetlb_no_page,
> to implement a capability for application to mmap
> an zero-length huge page area. It's useful for process to protect one area.
> As a side effect, the second patch also fixes the problem you said.
>
Here is the second patch.
From: Zhang Yanmin <[email protected]>
Sometimes, applications need below call to be successful although
"/mnt/hugepages/file1" doesn't exist.
fd = open("/mnt/hugepages/file1", O_CREAT|O_RDWR, 0755);
*addr = mmap(NULL, 0x1024*1024*256, PROT_NONE, MAP_SHARED, fd, 0);
As for regular pages (or files), above call does work, but as for huge pages,
above call would fail because hugetlbfs_file_mmap would fail if
(!(vma->vm_flags & VM_WRITE) && len > inode->i_size).
This capability on huge page is useful on ia64 when the process wants to
protect one area on region 4, so other threads couldn't read/write this
area.
My patch against 2.6.16-rc5-mm3 implements it.
In addition, my patch fixes a bug caught by Ken. Consider below scenario:
Process A mmaps a hugetlb file with 5 huge pages, then process B truncates
the file with length of 2 huge pages. Later on, process A still could fault
on huge page 3-5, and these huge pages would be added into page cache.
Then, no processes could access these huge pages. My patch fixes it by
moving checking (idx >= size) to the begining of function hugetlb_no_page.
Thanks for Ken's good suggestions.
Signed-off-by: Zhang Yanmin <[email protected]>
---
diff -Nraup linux-2.6.16-rc5-mm3/fs/hugetlbfs/inode.c linux-2.6.16-rc5-mm3_huge_check/fs/hugetlbfs/inode.c
--- linux-2.6.16-rc5-mm3/fs/hugetlbfs/inode.c 2006-03-08 17:59:10.000000000 +0800
+++ linux-2.6.16-rc5-mm3_huge_check/fs/hugetlbfs/inode.c 2006-03-08 18:17:15.000000000 +0800
@@ -84,8 +84,6 @@ static int hugetlbfs_file_mmap(struct fi
ret = -ENOMEM;
len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
- if (!(vma->vm_flags & VM_WRITE) && len > inode->i_size)
- goto out;
if (vma->vm_flags & VM_MAYSHARE)
if (hugetlb_extend_reservation(info, len >> HPAGE_SHIFT) != 0)
diff -Nraup linux-2.6.16-rc5-mm3/mm/hugetlb.c linux-2.6.16-rc5-mm3_huge_check/mm/hugetlb.c
--- linux-2.6.16-rc5-mm3/mm/hugetlb.c 2006-03-08 17:59:11.000000000 +0800
+++ linux-2.6.16-rc5-mm3_huge_check/mm/hugetlb.c 2006-03-08 18:01:21.000000000 +0800
@@ -555,6 +555,10 @@ int hugetlb_no_page(struct mm_struct *mm
idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+ size = i_size_read(mapping->host) >> HPAGE_SHIFT;
+ if (idx >= size)
+ return ret;
+
/*
* Use page lock to guard against racing truncation
* before we get page_table_lock.
@@ -588,9 +592,6 @@ retry:
}
spin_lock(&mm->page_table_lock);
- size = i_size_read(mapping->host) >> HPAGE_SHIFT;
- if (idx >= size)
- goto backout;
ret = VM_FAULT_MINOR;
if (!pte_none(*ptep))
From: Zhang Yanmin <[email protected]>
Currently, ftruncate on hugetlb files couldn't extend them. My patch enables it.
This patch is against 2.6.16-rc5-mm3 and on the top of the patch which
implements mmap on zero-length hugetlb files with PROT_NONE.
Thanks for Ken's good idea.
Signed-off-by: Zhang Yanmin <[email protected]>
---
diff -Nraup linux-2.6.16-rc5-mm3_huge_check/fs/hugetlbfs/inode.c linux-2.6.16-rc5-mm3_truncate/fs/hugetlbfs/inode.c
--- linux-2.6.16-rc5-mm3_huge_check/fs/hugetlbfs/inode.c 2006-03-08 18:17:15.000000000 +0800
+++ linux-2.6.16-rc5-mm3_truncate/fs/hugetlbfs/inode.c 2006-03-08 18:50:40.000000000 +0800
@@ -308,18 +308,22 @@ static int hugetlb_vmtruncate(struct ino
unsigned long pgoff;
struct address_space *mapping = inode->i_mapping;
- if (offset > inode->i_size)
- return -EINVAL;
-
BUG_ON(offset & ~HPAGE_MASK);
pgoff = offset >> HPAGE_SHIFT;
-
- inode->i_size = offset;
- spin_lock(&mapping->i_mmap_lock);
- if (!prio_tree_empty(&mapping->i_mmap))
- hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
- spin_unlock(&mapping->i_mmap_lock);
- truncate_hugepages(inode, offset);
+ if (offset > inode->i_size) {
+ if (hugetlb_extend_reservation(HUGETLBFS_I(inode), pgoff) != 0)
+ return -ENOMEM;
+ inode->i_size = offset;
+ }
+ else {
+
+ inode->i_size = offset;
+ spin_lock(&mapping->i_mmap_lock);
+ if (!prio_tree_empty(&mapping->i_mmap))
+ hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
+ spin_unlock(&mapping->i_mmap_lock);
+ truncate_hugepages(inode, offset);
+ }
return 0;
}
On Wed, 2006-03-08 at 14:21, Andrew Morton wrote:
> "Zhang, Yanmin" <[email protected]> wrote:
> >
> > - if (!prio_tree_empty(&mapping->i_mmap))
> > - hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
> > - spin_unlock(&mapping->i_mmap_lock);
> > - truncate_hugepages(inode, offset);
> > + if (offset > inode->i_size) {
>
> whitespace is broken here.
Sorry for the stupid problem.
> } else {
>
> > +
> > + inode->i_size = offset;
>
> i_size_write(), unless i_sem is held. I guess it is, so OK.
Yes.
>
> > + spin_lock(&mapping->i_mmap_lock);
> > + if (!prio_tree_empty(&mapping->i_mmap))
> > + hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
> > + spin_unlock(&mapping->i_mmap_lock);
> > + truncate_hugepages(inode, offset);
> > + }
> > return 0;
Here is the new patch.
From: Zhang Yanmin <[email protected]>
Currently, ftruncate on hugetlb files couldn't extend them. My patch enables it.
This patch is against 2.6.16-rc5-mm3 and on the top of the patch which
implements mmap on zero-length hugetlb files with PROT_NONE.
Thanks for Ken's good idea.
Signed-off-by: Zhang Yanmin <[email protected]>
---
diff -Nraup linux-2.6.16-rc5-mm3_huge_check/fs/hugetlbfs/inode.c linux-2.6.16-rc5-mm3_truncate/fs/hugetlbfs/inode.c
--- linux-2.6.16-rc5-mm3_huge_check/fs/hugetlbfs/inode.c 2006-03-08 18:17:15.000000000 +0800
+++ linux-2.6.16-rc5-mm3_truncate/fs/hugetlbfs/inode.c 2006-03-08 22:43:54.000000000 +0800
@@ -308,18 +308,20 @@ static int hugetlb_vmtruncate(struct ino
unsigned long pgoff;
struct address_space *mapping = inode->i_mapping;
- if (offset > inode->i_size)
- return -EINVAL;
-
BUG_ON(offset & ~HPAGE_MASK);
pgoff = offset >> HPAGE_SHIFT;
-
- inode->i_size = offset;
- spin_lock(&mapping->i_mmap_lock);
- if (!prio_tree_empty(&mapping->i_mmap))
- hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
- spin_unlock(&mapping->i_mmap_lock);
- truncate_hugepages(inode, offset);
+ if (offset > inode->i_size) {
+ if (hugetlb_extend_reservation(HUGETLBFS_I(inode), pgoff) != 0)
+ return -ENOMEM;
+ inode->i_size = offset;
+ } else {
+ inode->i_size = offset;
+ spin_lock(&mapping->i_mmap_lock);
+ if (!prio_tree_empty(&mapping->i_mmap))
+ hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
+ spin_unlock(&mapping->i_mmap_lock);
+ truncate_hugepages(inode, offset);
+ }
return 0;
}
Zhang, Yanmin wrote on Tuesday, March 07, 2006 7:25 PM
> Currently, ftruncate on hugetlb files couldn't extend them. My patch enables it.
>
> This patch is against 2.6.16-rc5-mm3 and on the top of the patch which
> implements mmap on zero-length hugetlb files with PROT_NONE.
> -
> - inode->i_size = offset;
> - spin_lock(&mapping->i_mmap_lock);
> - if (!prio_tree_empty(&mapping->i_mmap))
> - hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
> - spin_unlock(&mapping->i_mmap_lock);
> - truncate_hugepages(inode, offset);
> + if (offset > inode->i_size) {
> + if (hugetlb_extend_reservation(HUGETLBFS_I(inode), pgoff) != 0)
> + return -ENOMEM;
> + inode->i_size = offset;
> + }
> + else {
> +
> + inode->i_size = offset;
> + spin_lock(&mapping->i_mmap_lock);
> + if (!prio_tree_empty(&mapping->i_mmap))
> + hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
> + spin_unlock(&mapping->i_mmap_lock);
> + truncate_hugepages(inode, offset);
> + }
> return 0;
> }
Hmm?? I don't think you need to extend the reservation when extending
hugetlb file via ftruncate. You don't have any vma that pass beyond
current size. So making a reservation is a wrong thing to do here.
- Ken
Zhang, Yanmin wrote on Tuesday, March 07, 2006 10:54 PM
> From: Zhang Yanmin <[email protected]>
>
> Currently, ftruncate on hugetlb files couldn't extend them. My patch
enables it.
>
> This patch is against 2.6.16-rc5-mm3 and on the top of the patch which
> implements mmap on zero-length hugetlb files with PROT_NONE.
Reservation should be already done at the mmap time.
Like this?
Signed-off-by: Ken Chen <[email protected]>
--- linux-2.6.15/fs/hugetlbfs/inode.c.orig 2006-03-08
11:39:49.782708398 -0800
+++ linux-2.6.15/fs/hugetlbfs/inode.c 2006-03-08 11:51:04.382309509
-0800
@@ -337,20 +337,18 @@ hugetlb_vmtruncate_list(struct prio_tree
}
}
-/*
- * Expanding truncates are not allowed.
- */
static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
{
unsigned long pgoff;
struct address_space *mapping = inode->i_mapping;
- if (offset > inode->i_size)
- return -EINVAL;
-
BUG_ON(offset & ~HPAGE_MASK);
pgoff = offset >> HPAGE_SHIFT;
+ if (offset > inode->i_size) {
+ inode->i_size = offset;
+ return 0;
+ }
inode->i_size = offset;
spin_lock(&mapping->i_mmap_lock);
if (!prio_tree_empty(&mapping->i_mmap))
On Wed, Mar 08, 2006 at 10:28:41AM -0800, Chen, Kenneth W wrote:
> Zhang, Yanmin wrote on Tuesday, March 07, 2006 7:25 PM
> > Currently, ftruncate on hugetlb files couldn't extend them. My patch enables it.
> >
> > This patch is against 2.6.16-rc5-mm3 and on the top of the patch which
> > implements mmap on zero-length hugetlb files with PROT_NONE.
>
> > -
> > - inode->i_size = offset;
> > - spin_lock(&mapping->i_mmap_lock);
> > - if (!prio_tree_empty(&mapping->i_mmap))
> > - hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
> > - spin_unlock(&mapping->i_mmap_lock);
> > - truncate_hugepages(inode, offset);
> > + if (offset > inode->i_size) {
> > + if (hugetlb_extend_reservation(HUGETLBFS_I(inode), pgoff) != 0)
> > + return -ENOMEM;
> > + inode->i_size = offset;
> > + }
> > + else {
> > +
> > + inode->i_size = offset;
> > + spin_lock(&mapping->i_mmap_lock);
> > + if (!prio_tree_empty(&mapping->i_mmap))
> > + hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
> > + spin_unlock(&mapping->i_mmap_lock);
> > + truncate_hugepages(inode, offset);
> > + }
> > return 0;
> > }
>
> Hmm?? I don't think you need to extend the reservation when extending
> hugetlb file via ftruncate. You don't have any vma that pass beyond
> current size. So making a reservation is a wrong thing to do here.
Fwiw, I think truncate *should* extend the reservation. We have a
separate thread arguing about whether we should be reserving by inode
length, as I've implemented, or by which ranges are actually mapped
(as apw's old path implemented). As long as it *is* by inode length -
so it's conceptually all about the logical file in hugetlbfs, not
about any of its mappings - I think it makes sense for an extending
truncate() to extend the reservation. It's not reserving them for any
particular mapping, it's reserving them for page cache pages.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
David Gibson wrote on Wednesday, March 08, 2006 3:58 PM
> > Hmm?? I don't think you need to extend the reservation when extending
> > hugetlb file via ftruncate. You don't have any vma that pass beyond
> > current size. So making a reservation is a wrong thing to do here.
>
> Fwiw, I think truncate *should* extend the reservation. We have a
> separate thread arguing about whether we should be reserving by inode
> length, as I've implemented, or by which ranges are actually mapped
> (as apw's old path implemented). As long as it *is* by inode length -
> so it's conceptually all about the logical file in hugetlbfs, not
> about any of its mappings - I think it makes sense for an extending
> truncate() to extend the reservation. It's not reserving them for any
> particular mapping, it's reserving them for page cache pages.
But you already make reservation at mmap time. If you reserve it again
when extending the file, won't you double count?
- Ken
On Wed, Mar 08, 2006 at 04:12:13PM -0800, Chen, Kenneth W wrote:
> David Gibson wrote on Wednesday, March 08, 2006 3:58 PM
> > > Hmm?? I don't think you need to extend the reservation when extending
> > > hugetlb file via ftruncate. You don't have any vma that pass beyond
> > > current size. So making a reservation is a wrong thing to do here.
> >
> > Fwiw, I think truncate *should* extend the reservation. We have a
> > separate thread arguing about whether we should be reserving by inode
> > length, as I've implemented, or by which ranges are actually mapped
> > (as apw's old path implemented). As long as it *is* by inode length -
> > so it's conceptually all about the logical file in hugetlbfs, not
> > about any of its mappings - I think it makes sense for an extending
> > truncate() to extend the reservation. It's not reserving them for any
> > particular mapping, it's reserving them for page cache pages.
>
> But you already make reservation at mmap time. If you reserve it again
> when extending the file, won't you double count?
Well, I'd generally expect extending truncate() to come before mmap(),
but in any case hugetlb_extend_reservation() is safe against double
counting (it's idempotent if called twice with the same number of
pages). The semantics are "ensure the this many pages total are
guaranteed available, that is, either reserved or already
instantiated".
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
David Gibson wrote on Wednesday, March 08, 2006 4:23 PM
> > But you already make reservation at mmap time. If you reserve it again
> > when extending the file, won't you double count?
>
> Well, I'd generally expect extending truncate() to come before mmap(),
> but in any case hugetlb_extend_reservation() is safe against double
> counting (it's idempotent if called twice with the same number of
> pages). The semantics are "ensure the this many pages total are
> guaranteed available, that is, either reserved or already
> instantiated".
It's kind of peculiar that kernel reserve hugetlb page at the time of
extending truncate. Maybe there is a close correlation between mmap
size to the file size. But these two aren't the same and and shouldn't
be mixed.
- Ken
On Wed, Mar 08, 2006 at 04:33:10PM -0800, Chen, Kenneth W wrote:
> David Gibson wrote on Wednesday, March 08, 2006 4:23 PM
> > > But you already make reservation at mmap time. If you reserve it again
> > > when extending the file, won't you double count?
> >
> > Well, I'd generally expect extending truncate() to come before mmap(),
> > but in any case hugetlb_extend_reservation() is safe against double
> > counting (it's idempotent if called twice with the same number of
> > pages). The semantics are "ensure the this many pages total are
> > guaranteed available, that is, either reserved or already
> > instantiated".
>
> It's kind of peculiar that kernel reserve hugetlb page at the time of
> extending truncate. Maybe there is a close correlation between mmap
> size to the file size. But these two aren't the same and and shouldn't
> be mixed.
Indeed. The fundamental basis of my approach as opposed to that in
apw's old accounting patches is that reservation is related to
filesize, not mmap() size. The only connection to mmap() is that
(shared writable) mmap() on hugetlbfs also extends filesize and
reservation size as a side effect.
There's a slight wrinkle in the different between i_size and reserved
pages. i_size can be extended beyond the "reserved portion" of the
file. We need that behaviour because i_size also acts as a hard
offset limit for MAP_PRIVATE mappings, but privately instantiated
pages don't come from the file's reservation.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson