2006-01-10 19:22:36

by Adam Litke

[permalink] [raw]
Subject: Hugetlb: Shared memory race

I have discovered a race caused by the interaction of demand faulting
with the hugetlb overcommit accounting patch. Attached is a workaround
for the problem. Can anyone suggest a better approach to solving the
race I'll describe below? If not, would the attached workaround be
acceptable?

The race occurs when multiple threads shmat a hugetlb area and begin
faulting in it's pages. During a hugetlb fault, hugetlb_no_page checks
for the page in the page cache. If not found, it allocates (and zeroes)
a new page and tries to add it to the page cache. If this fails, the
huge page is freed and we retry the page cache lookup (assuming someone
else beat us to the add_to_page_cache call).

The above works fine, but due to the large window (while zeroing the
huge page) it is possible that many threads could be "borrowing" pages
only to return them later. This causes free_hugetlb_pages to be lower
than the logical number of free pages and some threads trying to shmat
can falsely fail the accounting check.

The workaround disables the accounting check that happens at shmat time.
It was already done at shmget time (which is the normal semantics
anyway).

Signed-off-by: Adam Litke <[email protected]>

inode.c | 10 ++++++++++
1 files changed, 10 insertions(+)
diff -upN reference/fs/hugetlbfs/inode.c current/fs/hugetlbfs/inode.c
--- reference/fs/hugetlbfs/inode.c
+++ current/fs/hugetlbfs/inode.c
@@ -74,6 +74,14 @@ huge_pages_needed(struct address_space *
pgoff_t next = vma->vm_pgoff;
pgoff_t endpg = next + ((end - start) >> PAGE_SHIFT);

+ /*
+ * Accounting for shared memory segments is done at shmget time
+ * so we can skip the check now to avoid a race where hugetlb_no_page
+ * is zeroing hugetlb pages not yet in the page cache.
+ */
+ if (vma->vm_file->f_dentry->d_inode->i_blocks != 0)
+ return 0;
+
pagevec_init(&pvec, 0);
while (next < endpg) {
if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE))
@@ -832,6 +840,8 @@ struct file *hugetlb_zero_setup(size_t s

d_instantiate(dentry, inode);
inode->i_size = size;
+ /* Mark this file is used for shared memory */
+ inode->i_blocks = 1;
inode->i_nlink = 0;
file->f_vfsmnt = mntget(hugetlbfs_vfsmount);
file->f_dentry = dentry;

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center


2006-01-10 19:45:23

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Hugetlb: Shared memory race

On Tue, Jan 10, 2006 at 01:22:31PM -0600, Adam Litke wrote:
> I have discovered a race caused by the interaction of demand faulting
> with the hugetlb overcommit accounting patch. Attached is a workaround
> for the problem. Can anyone suggest a better approach to solving the
> race I'll describe below? If not, would the attached workaround be
> acceptable?
> The race occurs when multiple threads shmat a hugetlb area and begin
> faulting in it's pages. During a hugetlb fault, hugetlb_no_page checks
> for the page in the page cache. If not found, it allocates (and zeroes)
> a new page and tries to add it to the page cache. If this fails, the
> huge page is freed and we retry the page cache lookup (assuming someone
> else beat us to the add_to_page_cache call).
> The above works fine, but due to the large window (while zeroing the
> huge page) it is possible that many threads could be "borrowing" pages
> only to return them later. This causes free_hugetlb_pages to be lower
> than the logical number of free pages and some threads trying to shmat
> can falsely fail the accounting check.
> The workaround disables the accounting check that happens at shmat time.
> It was already done at shmget time (which is the normal semantics
> anyway).

So that's where the ->i_blocks bit came from. This is too hacky for me.
Disabling the check raises the spectre of failures when there shouldn't
be. I'd rather have a more invasive fix than a workaround, however tiny.


-- wli

2006-01-11 22:02:46

by Adam Litke

[permalink] [raw]
Subject: [PATCH 1/2] hugetlb: Delay page zeroing for faulted pages

I've come up with a much better idea to resolve the issue I mention
below. The attached patch changes hugetlb_no_page to allocate unzeroed
huge pages initially. For shared mappings, we wait until after
inserting the page into the page_cache succeeds before we zero it. This
has a side benefit of preventing the wasted zeroing that happened often
in the original code. The page_lock should guard against someone else
using the page before it has been zeroed (but correct me if I am wrong
here). The patch doesn't completely close the race (there is a much
smaller window without the zeroing though). The next patch should close
the race window completely.

On Tue, 2006-01-10 at 13:22 -0600, Adam Litke wrote:
> The race occurs when multiple threads shmat a hugetlb area and begin
> faulting in it's pages. During a hugetlb fault, hugetlb_no_page checks
> for the page in the page cache. If not found, it allocates (and zeroes)
> a new page and tries to add it to the page cache. If this fails, the
> huge page is freed and we retry the page cache lookup (assuming someone
> else beat us to the add_to_page_cache call).
>
> The above works fine, but due to the large window (while zeroing the
> huge page) it is possible that many threads could be "borrowing" pages
> only to return them later. This causes free_hugetlb_pages to be lower
> than the logical number of free pages and some threads trying to shmat
> can falsely fail the accounting check.

Signed-off-by: Adam Litke <[email protected]>

hugetlb.c | 26 +++++++++++++++++++++++---
1 files changed, 23 insertions(+), 3 deletions(-)
diff -upN reference/mm/hugetlb.c current/mm/hugetlb.c
--- reference/mm/hugetlb.c
+++ current/mm/hugetlb.c
@@ -92,10 +92,10 @@ void free_huge_page(struct page *page)
spin_unlock(&hugetlb_lock);
}

-struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr)
+struct page *alloc_unzeroed_huge_page(struct vm_area_struct *vma,
+ unsigned long addr)
{
struct page *page;
- int i;

spin_lock(&hugetlb_lock);
page = dequeue_huge_page(vma, addr);
@@ -106,8 +106,26 @@ struct page *alloc_huge_page(struct vm_a
spin_unlock(&hugetlb_lock);
set_page_count(page, 1);
page[1].mapping = (void *)free_huge_page;
+
+ return page;
+}
+
+void zero_huge_page(struct page *page)
+{
+ int i;
+
for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
clear_highpage(&page[i]);
+}
+
+struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct page *page;
+
+ page = alloc_unzeroed_huge_page(vma, addr);
+ if (page)
+ zero_huge_page(page);
+
return page;
}

@@ -441,7 +459,7 @@ retry:
if (!page) {
if (hugetlb_get_quota(mapping))
goto out;
- page = alloc_huge_page(vma, address);
+ page = alloc_unzeroed_huge_page(vma, address);
if (!page) {
hugetlb_put_quota(mapping);
goto out;
@@ -460,6 +478,8 @@ retry:
}
} else
lock_page(page);
+
+ zero_huge_page(page);
}

spin_lock(&mm->page_table_lock);


--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2006-01-11 22:24:26

by Adam Litke

[permalink] [raw]
Subject: [PATCH 2/2] hugetlb: synchronize alloc with page cache insert

On Wed, 2006-01-11 at 16:02 -0600, Adam Litke wrote:
> here). The patch doesn't completely close the race (there is a much
> smaller window without the zeroing though). The next patch should close
> the race window completely.

My only concern is if I am using the correct lock for the job here.

hugetlb.c | 14 ++++++++++----
1 files changed, 10 insertions(+), 4 deletions(-)
diff -upN reference/mm/hugetlb.c current/mm/hugetlb.c
--- reference/mm/hugetlb.c
+++ current/mm/hugetlb.c
@@ -445,6 +445,7 @@ int hugetlb_no_page(struct mm_struct *mm
struct page *page;
struct address_space *mapping;
pte_t new_pte;
+ int shared = vma->vm_flags & VM_SHARED;

mapping = vma->vm_file->f_mapping;
idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
@@ -454,26 +455,31 @@ int hugetlb_no_page(struct mm_struct *mm
* Use page lock to guard against racing truncation
* before we get page_table_lock.
*/
-retry:
page = find_lock_page(mapping, idx);
if (!page) {
if (hugetlb_get_quota(mapping))
goto out;
+
+ if (shared)
+ spin_lock(&mapping->host->i_lock);
+
page = alloc_unzeroed_huge_page(vma, address);
if (!page) {
hugetlb_put_quota(mapping);
+ if (shared)
+ spin_unlock(&mapping->host->i_lock);
goto out;
}

- if (vma->vm_flags & VM_SHARED) {
+ if (shared) {
int err;

err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
+ spin_unlock(&mapping->host->i_lock);
if (err) {
put_page(page);
hugetlb_put_quota(mapping);
- if (err == -EEXIST)
- goto retry;
+ BUG_ON(-EEXIST);
goto out;
}
} else


--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2006-01-11 22:43:20

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [PATCH 1/2] hugetlb: Delay page zeroing for faulted pages

On Wed, Jan 11, 2006 at 04:02:40PM -0600, Adam Litke wrote:
> I've come up with a much better idea to resolve the issue I mention
> below. The attached patch changes hugetlb_no_page to allocate unzeroed
> huge pages initially. For shared mappings, we wait until after
> inserting the page into the page_cache succeeds before we zero it. This
> has a side benefit of preventing the wasted zeroing that happened often
> in the original code. The page_lock should guard against someone else
> using the page before it has been zeroed (but correct me if I am wrong
> here). The patch doesn't completely close the race (there is a much
> smaller window without the zeroing though). The next patch should close
> the race window completely.

Looks better to me.

Signed-off-by: William Irwin <[email protected]>


-- wli

2006-01-11 22:52:26

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [PATCH 2/2] hugetlb: synchronize alloc with page cache insert

On Wed, Jan 11, 2006 at 04:24:23PM -0600, Adam Litke wrote:
> My only concern is if I am using the correct lock for the job here.

->i_lock looks rather fishy. It may have been necessary when ->i_blocks
was used for nefarious purposes, but without ->i_blocks fiddling, it's
not needed unless I somehow missed the addition of some custom fields
to hugetlbfs inodes and their modifications by any of these functions.


-- wli

2006-01-11 23:03:29

by Adam Litke

[permalink] [raw]
Subject: Re: [PATCH 2/2] hugetlb: synchronize alloc with page cache insert

On Wed, 2006-01-11 at 14:52 -0800, William Lee Irwin III wrote:
> On Wed, Jan 11, 2006 at 04:24:23PM -0600, Adam Litke wrote:
> > My only concern is if I am using the correct lock for the job here.
>
> ->i_lock looks rather fishy. It may have been necessary when ->i_blocks
> was used for nefarious purposes, but without ->i_blocks fiddling, it's
> not needed unless I somehow missed the addition of some custom fields
> to hugetlbfs inodes and their modifications by any of these functions.

Nope, all the i_blocks stuff is gone. I was just looking for a
spin_lock for serializing all allocations for a particular hugeltbfs
file and i_lock seemed to fit that bill. It could be said, however,
that the locking strategy used in the patch protects a section of code,
not a data structure (which can be a bad idea). Any thoughts on a less
"fishy" locking strategy for this case?

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2006-01-11 23:25:20

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [PATCH 2/2] hugetlb: synchronize alloc with page cache insert

On Wed, 2006-01-11 at 14:52 -0800, William Lee Irwin III wrote:
>> ->i_lock looks rather fishy. It may have been necessary when ->i_blocks
>> was used for nefarious purposes, but without ->i_blocks fiddling, it's
>> not needed unless I somehow missed the addition of some custom fields
>> to hugetlbfs inodes and their modifications by any of these functions.

On Wed, Jan 11, 2006 at 05:03:25PM -0600, Adam Litke wrote:
> Nope, all the i_blocks stuff is gone. I was just looking for a
> spin_lock for serializing all allocations for a particular hugeltbfs
> file and i_lock seemed to fit that bill. It could be said, however,
> that the locking strategy used in the patch protects a section of code,
> not a data structure (which can be a bad idea). Any thoughts on a less
> "fishy" locking strategy for this case?

That's not really something that needs to be synchronized per se. hugetlb
data structures need protection against concurrent modification, but
they have that from the functions you're calling.


-- wli

2006-01-11 23:47:03

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [PATCH 2/2] hugetlb: synchronize alloc with page cache insert

Adam Litke wrote on Wednesday, January 11, 2006 3:03 PM
> Nope, all the i_blocks stuff is gone. I was just looking for a
> spin_lock for serializing all allocations for a particular hugeltbfs
> file and i_lock seemed to fit that bill.

I hope you are aware of the consequence of serializing page allocation:
It won't scale on large numa machine. I don't have any issue per se at
the moment. But in the past, SGI folks had screamed their heads off
for people doing something like that.

- Ken

2006-01-12 00:41:16

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [PATCH 2/2] hugetlb: synchronize alloc with page cache insert

Adam Litke wrote on Wednesday, January 11, 2006 2:24 PM
> > here). The patch doesn't completely close the race (there is a much
> > smaller window without the zeroing though). The next patch should close
> > the race window completely.
>
> My only concern is if I am using the correct lock for the job here.

I don't think so.


> @@ -454,26 +455,31 @@ int hugetlb_no_page(struct mm_struct *mm
> * Use page lock to guard against racing truncation
> * before we get page_table_lock.
> */
> -retry:
> page = find_lock_page(mapping, idx);
> if (!page) {
> if (hugetlb_get_quota(mapping))
> goto out;
> +
> + if (shared)
> + spin_lock(&mapping->host->i_lock);
> +
> page = alloc_unzeroed_huge_page(vma, address);
> if (!page) {
> hugetlb_put_quota(mapping);
> + if (shared)
> + spin_unlock(&mapping->host->i_lock);
> goto out;
> }

What if two processes fault on the same page and races with find_lock_page(),
both find page not in the page cache. The process won the race proceed to
allocate last hugetlb page. While the other will exit with SIGBUS. In theory,
both processes should be OK.

- Ken

2006-01-12 01:05:28

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [PATCH 2/2] hugetlb: synchronize alloc with page cache insert

On Wed, Jan 11, 2006 at 04:40:37PM -0800, Chen, Kenneth W wrote:
> What if two processes fault on the same page and races with find_lock_page(),
> both find page not in the page cache. The process won the race proceed to
> allocate last hugetlb page. While the other will exit with SIGBUS.
> In theory, both processes should be OK.

This is supposed to fix the incarnation of that as a preexisting
problem, but you're right, there is no fallback or retry for the case
of hugepage queue exhaustion. For some reason I saw a phantom page
allocator fallback in the hugepage allocator changes.

Looks like back to the drawing board for this pair of patches, though
I'd be more than happy to get a solution to this.


-- wli

2006-01-12 17:26:10

by Adam Litke

[permalink] [raw]
Subject: Re: [PATCH 2/2] hugetlb: synchronize alloc with page cache insert

On Wed, 2006-01-11 at 17:05 -0800, William Lee Irwin III wrote:
> On Wed, Jan 11, 2006 at 04:40:37PM -0800, Chen, Kenneth W wrote:
> > What if two processes fault on the same page and races with find_lock_page(),
> > both find page not in the page cache. The process won the race proceed to
> > allocate last hugetlb page. While the other will exit with SIGBUS.
> > In theory, both processes should be OK.
>
> This is supposed to fix the incarnation of that as a preexisting
> problem, but you're right, there is no fallback or retry for the case
> of hugepage queue exhaustion. For some reason I saw a phantom page
> allocator fallback in the hugepage allocator changes.
>
> Looks like back to the drawing board for this pair of patches, though
> I'd be more than happy to get a solution to this.

I still think patch 1 (delayed zeroing) is a good thing to have. It
will definitely improve performance for multi-threaded hugetlb
applications by avoiding unnecessary hugetlb page zeroing. It also
shrinks the race window we have been talking about to a tiny fraction of
what it was. This should ease the problem while we figure out a way to
handle the "last free page" case.

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2006-01-12 19:08:35

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [PATCH 2/2] hugetlb: synchronize alloc with page cache insert

Adam Litke wrote on Thursday, January 12, 2006 9:26 AM
> On Wed, 2006-01-11 at 17:05 -0800, William Lee Irwin III wrote:
> > On Wed, Jan 11, 2006 at 04:40:37PM -0800, Chen, Kenneth W wrote:
> > > What if two processes fault on the same page and races with find_lock_page(),
> > > both find page not in the page cache. The process won the race proceed to
> > > allocate last hugetlb page. While the other will exit with SIGBUS.
> > > In theory, both processes should be OK.
> >
> > This is supposed to fix the incarnation of that as a preexisting
> > problem, but you're right, there is no fallback or retry for the case
> > of hugepage queue exhaustion. For some reason I saw a phantom page
> > allocator fallback in the hugepage allocator changes.
> >
> > Looks like back to the drawing board for this pair of patches, though
> > I'd be more than happy to get a solution to this.
>
> I still think patch 1 (delayed zeroing) is a good thing to have. It
> will definitely improve performance for multi-threaded hugetlb
> applications by avoiding unnecessary hugetlb page zeroing. It also
> shrinks the race window we have been talking about to a tiny fraction of
> what it was. This should ease the problem while we figure out a way to
> handle the "last free page" case.

Sorry, I don't think patch 1 by itself is functionally correct. It opens
a can of worms with race window all over the place. It does more damage
than what it is trying to solve. Here is one case:

1 thread fault on hugetlb page, allocate a non-zero page, insert into the
page cache, then proceed to zero it. While in the middle of the zeroing,
2nd thread comes along fault on the same hugetlb page. It find it in the
page cache, went ahead install a pte and return to the user. User code
modify some parts of the hugetlb page while the 1st thread is still
zeroing. A potential silent data corruption.

The scenario could be even worst that after 1st thread finish zeroing, find
a pte in the page table is already instantiated and then it proceed to back
out that newly allocated hugetlb page.

What is needed here is the code that does find-alloc-insert should be one
atomic step under corner cases. I was thinking using a semaphore to
protect the code sequence. But I haven't come up with a reasonable
solution yet.

- Ken



2006-01-12 19:49:08

by Adam Litke

[permalink] [raw]
Subject: RE: [PATCH 2/2] hugetlb: synchronize alloc with page cache insert

On Thu, 2006-01-12 at 11:07 -0800, Chen, Kenneth W wrote:
> Sorry, I don't think patch 1 by itself is functionally correct. It opens
> a can of worms with race window all over the place. It does more damage
> than what it is trying to solve. Here is one case:
>
> 1 thread fault on hugetlb page, allocate a non-zero page, insert into the
> page cache, then proceed to zero it. While in the middle of the zeroing,
> 2nd thread comes along fault on the same hugetlb page. It find it in the
> page cache, went ahead install a pte and return to the user. User code
> modify some parts of the hugetlb page while the 1st thread is still
> zeroing. A potential silent data corruption.

I don't think the above case is possible because of find_lock_page().
The second thread would wait on the page to be unlocked by the thread
zeroing it before it could proceed.

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2006-01-12 20:07:15

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [PATCH 2/2] hugetlb: synchronize alloc with page cache insert

Adam Litke wrote on Thursday, January 12, 2006 11:49 AM
> On Thu, 2006-01-12 at 11:07 -0800, Chen, Kenneth W wrote:
> > Sorry, I don't think patch 1 by itself is functionally correct. It opens
> > a can of worms with race window all over the place. It does more damage
> > than what it is trying to solve. Here is one case:
> >
> > 1 thread fault on hugetlb page, allocate a non-zero page, insert into the
> > page cache, then proceed to zero it. While in the middle of the zeroing,
> > 2nd thread comes along fault on the same hugetlb page. It find it in the
> > page cache, went ahead install a pte and return to the user. User code
> > modify some parts of the hugetlb page while the 1st thread is still
> > zeroing. A potential silent data corruption.
>
> I don't think the above case is possible because of find_lock_page().
> The second thread would wait on the page to be unlocked by the thread
> zeroing it before it could proceed.

I think you are correct. Sorry for the noise.

- Ken