2005-10-18 21:15:21

by Rohit Seth

Subject: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5

Linus,

[PATCH]: Handle spurious page fault for hugetlb region

The hugetlb pages are currently pre-faulted: at the time of mmap of
hugepages, we populate the new PTEs. It is possible that the HW has already
cached some of the unused PTEs internally. These stale entries never get a
chance to be purged in the existing control flow.

This patch extends the check in the page fault code for hugepages. Check if
a faulted address falls within the size of the hugetlb file backing it. We
return VM_FAULT_MINOR for these cases (assuming that the arch-specific
page-faulting code purges the stale entry on the archs that need it).

Thanks,
-rohit

Signed-off-by: Rohit Seth <[email protected]>


--- linux-2.6.14-rc4-git5-x86/include/linux/hugetlb.h 2005-10-18 13:14:24.879947360 -0700
+++ b/include/linux/hugetlb.h 2005-10-18 13:13:55.711381656 -0700
@@ -155,11 +155,24 @@
{
file->f_op = &hugetlbfs_file_operations;
}
+
+static inline int valid_hugetlb_file_off(struct vm_area_struct *vma,
+ unsigned long address)
+{
+ struct inode *inode = vma->vm_file->f_dentry->d_inode;
+ loff_t file_off = address - vma->vm_start;
+
+ file_off += (vma->vm_pgoff << PAGE_SHIFT);
+
+ return (file_off < inode->i_size);
+}
+
#else /* !CONFIG_HUGETLBFS */

#define is_file_hugepages(file) 0
#define set_file_hugepages(file) BUG()
#define hugetlb_zero_setup(size) ERR_PTR(-ENOSYS)
+#define valid_hugetlb_file_off(vma, address) 0

#endif /* !CONFIG_HUGETLBFS */

--- linux-2.6.14-rc4-git5-x86/mm/memory.c 2005-10-18 12:24:14.153647328 -0700
+++ b/mm/memory.c 2005-10-18 10:39:41.980162632 -0700
@@ -2045,8 +2045,18 @@

inc_page_state(pgfault);

- if (is_vm_hugetlb_page(vma))
- return VM_FAULT_SIGBUS; /* mapping truncation does this. */
+ if (unlikely(is_vm_hugetlb_page(vma))) {
+ if (valid_hugetlb_file_off(vma, address))
+ /* We get here only if there was a stale(zero) TLB entry
+ * (because of HW prefetching).
+ * Low-level arch code (if needed) should have already
+ * purged the stale entry as part of this fault handling.
+ * Here we just return.
+ */
+ return VM_FAULT_MINOR;
+ else
+ return VM_FAULT_SIGBUS; /* mapping truncation does this. */
+ }

/*
* We need the page table lock to synchronize with kswapd


2005-10-18 21:34:27

by Andrew Morton

Subject: Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5

"Seth, Rohit" <[email protected]> wrote:
>
> Linus,
>
> [PATCH]: Handle spurious page fault for hugetlb region
>
> The hugetlb pages are currently pre-faulted. At the time of mmap of
> hugepages, we populate the new PTEs. It is possible that HW has already cached
> some of the unused PTEs internally.

What's an "unused pte"? One which maps a regular-sized page at the same
virtual address? How can such a thing come about, and why isn't it already
a problem for regular-sized pages? From where does the hardware prefetch
the pte contents?

IOW: please tell us more about this hardware pte-fetcher.

> These stale entries never get a chance to
> be purged in existing control flow.

I'd have thought that invalidating those ptes at mmap()-time would be a
more consistent approach.

> This patch extends the check in page fault code for hugepages. Check if
> a faulted address falls within the size of the hugetlb file backing it. We
> return VM_FAULT_MINOR for these cases (assuming that the arch specific
> page-faulting code purges the stale entry for the archs that need it).

Do you have an example of the code which does this purging?

> --- linux-2.6.14-rc4-git5-x86/include/linux/hugetlb.h 2005-10-18 13:14:24.879947360 -0700
> +++ b/include/linux/hugetlb.h 2005-10-18 13:13:55.711381656 -0700
> @@ -155,11 +155,24 @@
> {
> file->f_op = &hugetlbfs_file_operations;
> }
> +
> +static inline int valid_hugetlb_file_off(struct vm_area_struct *vma,
> + unsigned long address)
> +{
> + struct inode *inode = vma->vm_file->f_dentry->d_inode;
> + loff_t file_off = address - vma->vm_start;
> +
> + file_off += (vma->vm_pgoff << PAGE_SHIFT);
> +
> + return (file_off < inode->i_size);
> +}

I suppose we should use i_size_read() here.

> + if (valid_hugetlb_file_off(vma, address))
> + /* We get here only if there was a stale(zero) TLB entry
> + * (because of HW prefetching).
> + * Low-level arch code (if needed) should have already
> + * purged the stale entry as part of this fault handling.
> + * Here we just return.
> + */

If the low-level code has purged the stale pte then it knows what's
happening. Perhaps it shouldn't call into handle_mm_fault() at all?

2005-10-19 00:04:57

by Rohit Seth

Subject: Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5

On Tue, 2005-10-18 at 14:34 -0700, Andrew Morton wrote:
> "Seth, Rohit" <[email protected]> wrote:
> >
> > Linus,
> >
> > [PATCH]: Handle spurious page fault for hugetlb region
> >
> > The hugetlb pages are currently pre-faulted. At the time of mmap of
> > hugepages, we populate the new PTEs. It is possible that HW has already cached
> > some of the unused PTEs internally.
>
> What's an "unused pte"? One which maps a regular-sized page at the same
> virtual address? How can such a thing come about, and why isn't it already
> a problem for regular-sized pages? From where does the hardware prefetch
> the pte contents?
>

An unused pte is one where *pte == 0: basically an entry in the leaf page
table that does not map anything. If such a pte ends up mapping a normal page
then you get a page fault and the HW/handler does the right thing (in terms
of purging old entries).


> IOW: please tell us more about this hardware pte-fetcher.
>
> > These stale entries never get a chance to
> > be purged in existing control flow.
>
> I'd have thought that invalidating those ptes at mmap()-time would be a
> more consistent approach.

That would add too many unconditional purges (which are very
expensive operations) during mmap. And since we are only talking about
speculative prefetches done by the HW, IMO we should do this as
lazily as possible (only if required).

>
> > This patch extends the check in page fault code for hugepages. Check if
> > a faulted address falls within the size of the hugetlb file backing it. We
> > return VM_FAULT_MINOR for these cases (assuming that the arch specific
> > page-faulting code purges the stale entry for the archs that need it).
>
> Do you have an example of the code which does this purging?
>

The page_not_present vector on IA-64, for example.

> > --- linux-2.6.14-rc4-git5-x86/include/linux/hugetlb.h 2005-10-18 13:14:24.879947360 -0700
> > +++ b/include/linux/hugetlb.h 2005-10-18 13:13:55.711381656 -0700
> > @@ -155,11 +155,24 @@
> > {
> > file->f_op = &hugetlbfs_file_operations;
> > }
> > +
> > +static inline int valid_hugetlb_file_off(struct vm_area_struct *vma,
> > + unsigned long address)
> > +{
> > + struct inode *inode = vma->vm_file->f_dentry->d_inode;
> > + loff_t file_off = address - vma->vm_start;
> > +
> > + file_off += (vma->vm_pgoff << PAGE_SHIFT);
> > +
> > + return (file_off < inode->i_size);
> > +}
>
> I suppose we should use i_size_read() here.

You are right.

>
> > + if (valid_hugetlb_file_off(vma, address))
> > + /* We get here only if there was a stale(zero) TLB entry
> > + * (because of HW prefetching).
> > + * Low-level arch code (if needed) should have already
> > + * purged the stale entry as part of this fault handling.
> > + * Here we just return.
> > + */
>
> If the low-level code has purged the stale pte then it knows what's
> happening. Perhaps it shouldn't call into handle_mm_fault() at all?

Well, at that time the code does not know whether the address belongs to
a hugetlb file. The archs that need those purges in low-level code need
them for all (for example) page-not-present faults.


2005-10-19 00:25:38

by Andrew Morton

Subject: Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5

Rohit Seth <[email protected]> wrote:
>
> On Tue, 2005-10-18 at 14:34 -0700, Andrew Morton wrote:
> > "Seth, Rohit" <[email protected]> wrote:
> > >
> > > Linus,
> > >
> > > [PATCH]: Handle spurious page fault for hugetlb region
> > >
> > > The hugetlb pages are currently pre-faulted. At the time of mmap of
> > > hugepages, we populate the new PTEs. It is possible that HW has already cached
> > > some of the unused PTEs internally.
> >
> > What's an "unused pte"? One which maps a regular-sized page at the same
> > virtual address? How can such a thing come about, and why isn't it already
> > a problem for regular-sized pages? From where does the hardware prefetch
> > the pte contents?
> >
>
> Unused pte is where *pte == 0. Basically entries in the leaf page table
> that do not map anything.

Oh. Then I'm still not understanding.

> If such a pte ends up mapping a normal page
> then you get a page fault and HW/handler does the right thing (in terms
> of purging old entries).

For starters, we don't actually _use_ pte's for the hugepage. We use an
entry in a pmd page. Does that concept still apply for ia64?

If so, are you saying that the hardware prefetching is happening at the pmd
level? If not, how can it happen at the pte level when the hardware
doesn't know where the pte page _is_?

Has this problem been observed in testing?

>
> > IOW: please tell us more about this hardware pte-fetcher.
> >
> > > These stale entries never get a chance to
> > > be purged in existing control flow.
> >
> > I'd have thought that invalidating those ptes at mmap()-time would be a
> > more consistent approach.
>
> That would add too many unconditional purges (which are very
> expensive operations) during mmap. And since we are only talking about
> speculative prefetches done by the HW, IMO we should do this as
> lazily as possible (only if required).

Is an mmap of a hugepage region very common? I guess it is, on ia32, or on
64-bit apps which are (poorly?) designed to also run on ia32.

> > If the low-level code has purged the stale pte then it knows what's
> > happening. Perhaps it shouldn't call into handle_mm_fault() at all?
>
> Well, at that time the code does not know whether the address belongs to
> a hugetlb file. The archs that need those purges in low-level code need
> them for all (for example) page-not-present faults.

I still don't understand what's different about hugepages here. If the
prefetching problem also occurs on regular pages and is handled OK for
regular pages, why do hugepages need special treatment?

2005-10-19 03:50:34

by Rohit Seth

Subject: Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5

On Tue, 2005-10-18 at 17:25 -0700, Andrew Morton wrote:
> Rohit Seth <[email protected]> wrote:
> >
> >
> > Unused pte is where *pte == 0. Basically entries in the leaf page table
> > that do not map anything.
>
> Oh. Then I'm still not understanding.


Following is the scenario:

Kernel is going to prefault the huge page for user virtual address V.

Let us say the pointer pte points to the page table entry (for x86 it is
PDE/PMD) which maps the hugepage.

At this point *pte == 0. The CPU also has this entry cached in its TLB.
Meaning, unless this entry is purged or displaced, for virtual address V
the CPU will generate a page fault (as the P bit is not set, and assuming
this fault has the highest precedence).

Kernel updates the *pte so that it now maps the hugepage at virtual
address V to physical address P.

Later, when the user process makes a reference to V, the processor takes a
PAGE_FAULT because of the stale TLB entry.

>
> > If such a pte ends up mapping a normal page
> > then you get a page fault and HW/handler does the right thing (in terms
> > of purging old entries).
>
> For starters, we don't actually _use_ pte's for the hugepage. We use an
> entry in a pmd page. Does that concept still apply for ia64?
>

By pte entry, I mean the entry that maps the hugepage (the leaf-level
entry). This entry could be sitting in the PDE/PMD for x86, or in the VHPT
for IA-64. It is the entry the processor uses for the translation to get
to the huge page.

> If so, are you saying that the hardware prefetching is happening at the pmd
> level? If not, how can it happen at the pte level when the hardware
> doesn't know where the pte page _is_?
>

IA-64, for example, directly maps the last level of the page table. The
processor directly uses this table to find translations. So there is no
pmd here.

But it is entirely possible on the x86 arch as well that any level of PDE
(PMD) or PTE could be getting prefetched. That is why the architecture
requires invalidate/purge operations whenever any of these entries changes.

> Has this problem been observed in testing?
>

Yes. On IA-64.

>
> > > I'd have thought that invalidating those ptes at mmap()-time would be a
> > > more consistent approach.
> >
> > That would add too many unconditional purges (which are very
> > expensive operations) during mmap. And since we are only talking about
> > speculative prefetches done by the HW, IMO we should do this as
> > lazily as possible (only if required).
>
> Is an mmap of a hugepage region very common? I guess it is, on ia32, or on
> 64-bit apps which are (poorly?) designed to also run on ia32.
>
>
On x86 it surely is. For 64-bit there are 100G or more of mmaps...

And then you would have to do this purge for all the processes that try
to mmap hugepages.

> > > If the low-level code has purged the stale pte then it knows what's
> > > happening. Perhaps it shouldn't call into handle_mm_fault() at all?
> >
> > Well, at that time the code does not know whether the address belongs to
> > a hugetlb file. The archs that need those purges in low-level code need
> > them for all (for example) page-not-present faults.
>
> I still don't understand what's different about hugepages here. If the
> prefetching problem also occurs on regular pages and is handled OK for
> regular pages, why do hugepages need special treatment?

The prefetching problem is handled OK for regular pages because we can
handle page faults corresponding to those pages. That is currently not
true for hugepages. Currently the kernel assumes that PAGE_FAULT
happening against a hugetlb page is caused by truncate and returns
SIGBUS.




2005-10-19 04:08:07

by Andrew Morton

Subject: Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5

Rohit Seth <[email protected]> wrote:
>
> The prefetching problem is handled OK for regular pages because we can
> handle page faults corresponding to those pages. That is currently not
> true for hugepages. Currently the kernel assumes that PAGE_FAULT
> happening against a hugetlb page is caused by truncate and returns
> SIGBUS.

Doh. No fault handler. The penny finally drops.

Adam, I think this patch is temporary?


From: "Seth, Rohit" <[email protected]>

We prefault hugepages at mmap() time, but hardware TLB prefetching may mean
that the TLB has NULL pagetable entries in the places where the pagetable
in fact has the desired virtual->physical translation.

For regular pages this problem is resolved via the resulting pagefault, in
the pagefault handler. But hugepages don't support pagefaults - they're
supposed to be prefaulted.

So we need minimal pagefault handling for these stale hugepage TLB entries.

An alternative is to invalidate the relevant TLB entries at hugepage
mmap()-time, but this is apparently too expensive.

Note: Adam Litke <[email protected]>'s demand-paging-for-hugepages patches are
now in -mm. If/when these are merged up, this fix should probably be
reverted.

Signed-off-by: Rohit Seth <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

include/linux/hugetlb.h | 13 +++++++++++++
mm/memory.c | 14 ++++++++++++--
2 files changed, 25 insertions(+), 2 deletions(-)

diff -puN include/linux/hugetlb.h~handling-spurious-page-fault-for-hugetlb-region include/linux/hugetlb.h
--- devel/include/linux/hugetlb.h~handling-spurious-page-fault-for-hugetlb-region 2005-10-18 21:04:34.000000000 -0700
+++ devel-akpm/include/linux/hugetlb.h 2005-10-18 21:04:34.000000000 -0700
@@ -155,11 +155,24 @@ static inline void set_file_hugepages(st
{
file->f_op = &hugetlbfs_file_operations;
}
+
+static inline int valid_hugetlb_file_off(struct vm_area_struct *vma,
+ unsigned long address)
+{
+ struct inode *inode = vma->vm_file->f_dentry->d_inode;
+ loff_t file_off = address - vma->vm_start;
+
+ file_off += (vma->vm_pgoff << PAGE_SHIFT);
+
+ return (file_off < inode->i_size);
+}
+
#else /* !CONFIG_HUGETLBFS */

#define is_file_hugepages(file) 0
#define set_file_hugepages(file) BUG()
#define hugetlb_zero_setup(size) ERR_PTR(-ENOSYS)
+#define valid_hugetlb_file_off(vma, address) 0

#endif /* !CONFIG_HUGETLBFS */

diff -puN mm/memory.c~handling-spurious-page-fault-for-hugetlb-region mm/memory.c
--- devel/mm/memory.c~handling-spurious-page-fault-for-hugetlb-region 2005-10-18 21:04:34.000000000 -0700
+++ devel-akpm/mm/memory.c 2005-10-18 21:04:34.000000000 -0700
@@ -2045,8 +2045,18 @@ int __handle_mm_fault(struct mm_struct *

inc_page_state(pgfault);

- if (is_vm_hugetlb_page(vma))
- return VM_FAULT_SIGBUS; /* mapping truncation does this. */
+ if (unlikely(is_vm_hugetlb_page(vma))) {
+ if (valid_hugetlb_file_off(vma, address))
+ /* We get here only if there was a stale(zero) TLB entry
+ * (because of HW prefetching).
+ * Low-level arch code (if needed) should have already
+ * purged the stale entry as part of this fault handling.
+ * Here we just return.
+ */
+ return VM_FAULT_MINOR;
+ else
+ return VM_FAULT_SIGBUS; /* mapping truncation does this. */
+ }

/*
* We need the page table lock to synchronize with kswapd
_

2005-10-19 14:33:25

by Adam Litke

Subject: Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5

On Tue, 2005-10-18 at 21:07 -0700, Andrew Morton wrote:
> Rohit Seth <[email protected]> wrote:
> >
> > The prefetching problem is handled OK for regular pages because we can
> > handle page faults corresponding to those pages. That is currently not
> > true for hugepages. Currently the kernel assumes that PAGE_FAULT
> > happening against a hugetlb page is caused by truncate and returns
> > SIGBUS.
>
> Doh. No fault handler. The penny finally drops.
>
> Adam, I think this patch is temporary?

Yeah, looks like it can be dropped now since we handle this case (in
hugetlb_fault) as:

> size = i_size_read(mapping->host) >> HPAGE_SHIFT;
> if (idx >= size)
> goto backout;
>
> ret = VM_FAULT_MINOR;
> if (!pte_none(*pte))
> goto backout;

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2005-10-19 15:23:54

by Hugh Dickins

Subject: Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5

On Tue, 18 Oct 2005, Rohit Seth wrote:
> On Tue, 2005-10-18 at 17:25 -0700, Andrew Morton wrote:
> > Rohit Seth <[email protected]> wrote:
> > >
> > >
> > > Unused pte is where *pte == 0. Basically entries in the leaf page
> > > table that do not map anything.
> >
> > Oh. Then I'm still not understanding.

I've had great trouble understanding it too.
This mail was the most helpful, thanks.

> Following is the scenario:
>
> Kernel is going to prefault the huge page for user virtual address V.
>
> Let us say the pointer pte points to the page table entry (for x86 it is
> PDE/PMD) which maps the hugepage.
>
> At this point *pte == 0. CPU also has this entry cached in its TLB.

I thought that the CPU never caches !present entries in the TLB?
Or is that true of i386 (and x86_64), but untrue of ia64?
Or do you have some new model or errata on some CPU where it's true?
Or, final ghastly possibility ;), am I simply altogether wrong?

> Meaning, unless this entry is purged or displaced, for virtual address V

When you say "purged", is that what we elsewhere call "flushed"
in relation to the TLB, or something else?

> CPU will generate the page fault (as the P bit is not set and assuming
> this fault has the highest precedence).
>
> Kernel updates the *pte so that it now maps the hugepage at virtual
> address V to physical address P.
>
> Later when the user process make a reference to V, because of stale TLB
> entry, the processor gets PAGE_FAULT.

You seem to be saying that strictly, we ought to flush TLB even when we
make a page present where none was before, but that the likelihood of it
being needed is so low, and the overhead of TLB flush so high, and the
existing code almost everywhere recovering safely from this condition,
that the most effective thing to do is just fix up the hugetlb case.
Is that correct?

> > Has this problem been observed in testing?
>
> Yes. On IA-64.

But not on i386 or x86_64.

> > I still don't understand what's different about hugepages here. If the
> > prefetching problem also occurs on regular pages and is handled OK for
> > regular pages, why do hugepages need special treatment?

Okay, I get that part of the puzzle.

> The prefetching problem is handled OK for regular pages because we can

But here you mention prefetching, and I think in another of these
explanations you mention that this only happens via speculative prefetch.

I thought prefetch was designed to have no such visible effects?
Same series of doubts as with !present entries in the TLB; but after
looking at the ia64 fault handler, that does seem to have stuff about
speculative loads, so I'm guessing i386 and x86_64 prefetch does not
cause faults (modulo errata), but ia64 does.

> handle page faults corresponding to those pages. That is currently not
> true for hugepages. Currently the kernel assumes that PAGE_FAULT
> happening against a hugetlb page is caused by truncate and returns
> SIGBUS.

Testing against i_size is not good enough, but I'll explain that
in other mail against Andrew's latest version of the patch.

Once I started to understand this thread, I thought you were quite
wrong to be changing hugetlb fault handling, thought I'd find several
other places which would need fixing too e.g. kmap_atomic, remap_pfn_range.

But no, I've found no others. Either miraculously, or by good design,
all the kernel misfaults should be seamlessly handled by the lazy vmalloc
path (on i386 anyway: I don't know what happens for ia64 there), and the
userspace misfaults by handle_pte_fault's pte_present check. I think.

Even with Nick's PageReserved changes in -mm, where he might have chosen
to put a warning on VM_RESERVED higher in the fault path, it looks like
it needs no change for this issue.

Hugh

2005-10-19 15:48:52

by Hugh Dickins

Subject: Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5

On Tue, 18 Oct 2005, Andrew Morton wrote:
> Rohit Seth <[email protected]> wrote:
> >
> > The prefetching problem is handled OK for regular pages because we can
> > handle page faults corresponding to those pages. That is currently not
> > true for hugepages. Currently the kernel assumes that PAGE_FAULT
> > happening against a hugetlb page is caused by truncate and returns
> > SIGBUS.

Is Rohit's patch intended to be a late 2.6.14 fix? We seem to have done well
without it for several years, and are just on the point of moving away from
prefaulting the hugetlb pages anyway, which will fix it up.

> @@ -2045,8 +2045,18 @@ int __handle_mm_fault(struct mm_struct *
>
> inc_page_state(pgfault);
>
> - if (is_vm_hugetlb_page(vma))
> - return VM_FAULT_SIGBUS; /* mapping truncation does this. */
> + if (unlikely(is_vm_hugetlb_page(vma))) {
> + if (valid_hugetlb_file_off(vma, address))
> + /* We get here only if there was a stale(zero) TLB entry
> + * (because of HW prefetching).
> + * Low-level arch code (if needed) should have already
> + * purged the stale entry as part of this fault handling.
> + * Here we just return.
> + */
> + return VM_FAULT_MINOR;
> + else
> + return VM_FAULT_SIGBUS; /* mapping truncation does this. */
> + }

(I'm not surprised that the low-level arch code fixes this up as part of the
fault handling. I am surprised that it does so only after giving the fault
to the kernel. Sounds like something's gone wrong.)

What happens when the hugetlb file is truncated down and back up after
the mmap? Truncating down will remove a page from the mmap and flush TLB.
Truncating up will let accesses to the gone page pass the valid...off test.
But we've no support for hugetlb faulting in this version: so won't it
get stuck in a tight loop?

If it's decided that this issue is actually an ia64 one, and does need to
be fixed right now, then I'd have thought your idea of fixing it at the
ia64 end better: arch/ia64/mm/fault.c already has code for discarding
faults on speculative loads, and ia64 has an RGN_HPAGE set aside for
hugetlb, so shouldn't it just discard speculative loads on that region?

Hugh

2005-10-19 18:40:26

by Rohit Seth

Subject: Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5

On Wed, 2005-10-19 at 16:23 +0100, Hugh Dickins wrote:

> I thought that the CPU never caches !present entries in the TLB?
> Or is that true of i386 (and x86_64), but untrue of ia64?

IA-64 can prefetch any entry from the VHPT (the last-level page table)
irrespective of its value. You are right that i386 and x86_64 do not
cache !present entries. Though the OS is supposed to handle those faults
if they happen.

> Or do you have some new model or errata on some CPU where it's true?

No errata here.

> Or, final ghastly possibility ;), am I simply altogether wrong?
>

You are asking the right questions here.

> > Meaning, unless this entry is purged or displaced, for virtual address V
>
> When you say "purged", is that what we elsewhere call "flushed"
> in relation to the TLB, or something else?
>

I should use flush to be consistent.

> > CPU will generate the page fault (as the P bit is not set and assuming
> > this fault has the highest precedence).
> >
> > Kernel updates the *pte so that it now maps the hugepage at virtual
> > address V to physical address P.
> >
> > Later when the user process make a reference to V, because of stale TLB
> > entry, the processor gets PAGE_FAULT.
>
> You seem to be saying that strictly, we ought to flush TLB even when we
> make a page present where none was before, but that the likelihood of it
> being needed is so low, and the overhead of TLB flush so high, and the
> existing code almost everywhere recovering safely from this condition,
> that the most effective thing to do is just fix up the hugetlb case.
> Is that correct?
>

Yes. At least for the architectures that can cache any translation in
their TLB. IA-64 is again a good example here: it flushes the entry only
at fault time, so that the next time around you get the updated entry
(for the cases where the fault happened because of a stale TLB entry).

> > > Has this problem been observed in testing?
> >
> > Yes. On IA-64.
>
> But not on i386 or x86_64.
>

No.

> Same series of doubts as with !present entries in the TLB; but after
> looking at the ia64 fault handler, that does seem to have stuff about
> speculative loads, so I'm guessing i386 and x86_64 prefetch does not
> cause faults (modulo errata), but ia64 does.
>

Those speculative loads on IA-64 (which are more like advanced loads
generated by the compiler in anticipation that they will be helpful) are
different from the prefetches that HW does for TLBs.

Speculative loads never generate any visible fault.

Whereas prefetched TLB entries on i386, x86_64 or IA-64 can cause a fault
if they are not flushed after updates.

> Once I started to understand this thread, I thought you were quite
> wrong to be changing hugetlb fault handling, thought I'd find several
> other places which would need fixing too e.g. kmap_atomic, remap_pfn_range.
>
> But no, I've found no others. Either miraculously, or by good design,
> all the kernel misfaults should be seamlessly handled by the lazy vmalloc
> path (on i386 anyway: I don't know what happens for ia64 there), and the
> userspace misfaults by handle_pte_fault's pte_present check. I think.
>

Good OS design :-) Though on IA-64 there was recently a similar issue
for the vmalloc area that got fixed in low-level arch-specific code.

-rohit

2005-10-19 18:58:27

by Rohit Seth

Subject: Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5

On Wed, 2005-10-19 at 16:48 +0100, Hugh Dickins wrote:
> On Tue, 18 Oct 2005, Andrew Morton wrote:
> > Rohit Seth <[email protected]> wrote:
> > >
> > > The prefetching problem is handled OK for regular pages because we can
> > > handle page faults corresponding to those pages. That is currently not
> > > true for hugepages. Currently the kernel assumes that PAGE_FAULT
> > > happening against a hugetlb page is caused by truncate and returns
> > > SIGBUS.
>
> Is Rohit's intended to be a late 2.6.14 fix? We seem to have done well
> without it for several years, and are just on the point of changing to
> prefaulting the hugetlb pages anyway, which will fix it up.
>

I understand this is (very) late in the cycle for 2.6.14. But this issue
is happening on existing HW. Moreover, if demand paging of hugepages is
going to make it into the next release from Linus, then it makes all the
more sense to fill in the gap for prefaulting in this release itself.

It is also quite possible that some OSVs use 2.6.14 as the base for
their next releases. FWIW this patch is applicable to all previous
kernel versions as well.

That "done well for years" logic does not really hold. There is always
newer HW and SW that brings up new issues in existing code that happened
to be running well for years.

> >
> What happens when the hugetlb file is truncated down and back up after
> the mmap? Truncating down will remove a page from the mmap and flush TLB.
> Truncating up will let accesses to the gone page pass the valid...off test.
> But we've no support for hugetlb faulting in this version: so won't it
> get stuck in a tight loop?
>

First of all, there is currently no support for extending the hugetlb
file size using truncate in 2.6.14. (I agree that part is broken.) So
the above scenario cannot happen.

For the prefaulting hugepage case that exists today, I would assume that
if someone does add support for extending the hugetlb file size using
truncate, it would update the PTEs of all the processes that have those
hugepages mapped (just as, when truncating down, the kernel currently
clears the ptes from all the processes' page tables).

> If it's decided that this issue is actually an ia64 one, and does need to
> be fixed right now, then I'd have thought your idea of fixing it at the
> ia64 end better: arch/ia64/mm/fault.c already has code for discarding
> faults on speculative loads, and ia64 has an RGN_HPAGE set aside for
> hugetlb, so shouldn't it just discard speculative loads on that region?
>

This has nothing to do with speculative load fault on IA-64.

The first place the kernel finds out whether a page belongs to a hugepage
in the page-fault path is in mm/memory.c. Though the duplicated logic
could be added in the arch-specific tree.

I wonder if other archs like ppc are also exposed to this behavior.

-rohit


2005-10-19 20:01:40

by Hugh Dickins

Subject: Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5

Many thanks for your helpful answers in other mail: much appreciated.

On Wed, 19 Oct 2005, Rohit Seth wrote:
> On Wed, 2005-10-19 at 16:48 +0100, Hugh Dickins wrote:
> >
> > What happens when the hugetlb file is truncated down and back up after
> > the mmap? Truncating down will remove a page from the mmap and flush TLB.
> > Truncating up will let accesses to the gone page pass the valid...off test.
> > But we've no support for hugetlb faulting in this version: so won't it
> > get stuck in a tight loop?
>
> First of all, there is currently no support for extending the hugetlb
> file size using truncate in 2.6.14. (I agree that part is broken.) So
> the above scenario cannot happen.

I was forgetting that extending ftruncate wasn't supported in 2.6.14 and
earlier, yes. But I'm afraid the above scenario can still happen there:
extending is done, not by ftruncate, but by (somewhere else) mmapping the
larger size. So your fix may still cause a tight infinite fault loop.

> For the prefaulting hugepage case that exists today, I would assume that
> support for extending the hugetlb file size using truncate, if someone
> does add it, would update the PTEs of all the processes that have those
> hugepages mapped (just as, when truncating down, the kernel currently
> clears the PTEs from all the processes' page tables).

Sorry, no, you're dreaming there. And I very much doubt we'd want
to implement such a behaviour! (But yes, it's a good example of why
several of us are quite unhappy with the extend-to-fill-mmap behaviour.)

> > If it's decided that this issue is actually an ia64 one, and does need to
> > be fixed right now, then I'd have thought your idea of fixing it at the
> > ia64 end better: arch/ia64/mm/fault.c already has code for discarding
> > faults on speculative loads, and ia64 has an RGN_HPAGE set aside for
> > hugetlb, so shouldn't it just discard speculative loads on that region?
>
> This has nothing to do with speculative load fault on IA-64.

As you made clear in your other mail: sorry for wasting your time
with my ignorant confusions!

Hugh

2005-10-19 20:19:52

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5

Hugh Dickins <[email protected]> wrote:
>
> On Wed, 19 Oct 2005, Rohit Seth wrote:
> > On Wed, 2005-10-19 at 16:48 +0100, Hugh Dickins wrote:
> > >
> > > What happens when the hugetlb file is truncated down and back up after
> > > the mmap? Truncating down will remove a page from the mmap and flush TLB.
> > > Truncating up will let accesses to the gone page pass the valid...off test.
> > But we've no support for hugetlb faulting in this version: so won't it
> > get stuck in a tight loop?
> >
> > First of all, there is currently no support of extending the hugetlb
> > file size using truncate in 2.6.14. (I agree that part is broken). So
> > the above scenario can not happen.
>
> I was forgetting that extending ftruncate wasn't supported in 2.6.14 and
> earlier, yes. But I'm afraid the above scenario can still happen there:
> extending is done, not by ftruncate, but by (somewhere else) mmapping the
> larger size. So your fix may still cause a tight infinite fault loop.

Will it? Whenever we mmap a hugetlbfs file we prepopulate the entire vma
with hugepages. So I don't think there's ever any part of an address space
which is a) inside a hugepage vma and b) doesn't have a hugepage backing
it.

2005-10-19 20:29:10

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5

On Wed, 19 Oct 2005, Andrew Morton wrote:
> Hugh Dickins <[email protected]> wrote:
> >
> > I was forgetting that extending ftruncate wasn't supported in 2.6.14 and
> > earlier, yes. But I'm afraid the above scenario can still happen there:
> > extending is done, not by ftruncate, but by (somewhere else) mmapping the
> > larger size. So your fix may still cause a tight infinite fault loop.
>
> Will it? Whenever we mmap a hugetlbfs file we prepopulate the entire vma
> with hugepages. So I don't think there's ever any part of an address space
> which is a) inside a hugepage vma and b) doesn't have a hugepage backing
> it.

The new vma, sure, will be fully populated. But the old vma, in this
or some other process, which was created before the hugetlbfs file was
truncated down, will be left with a hole at the end.

Hugh

2005-10-19 20:53:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5



On Wed, 19 Oct 2005, Rohit Seth wrote:
>
> IA-64 can prefetch any entry from the VHPT (last-level page table)
> irrespective of its value. You are right that i386 and x86_64 do not
> cache !present entries, though the OS is supposed to handle those
> faults if they happen.

Well..

The fact is, the VM layer is designed for systems that do not cache
not-present entries in their TLB. See for example the end of do_no_page()
in mm/memory.c:

	/* no need to invalidate: a not-present page shouldn't be cached */
	update_mmu_cache(vma, address, entry);
	lazy_mmu_prot_update(entry);
	spin_unlock(&mm->page_table_lock);
out:
	return ret;

which _allows_ for hardware that caches not-present pages, but the
architecture needs to catch them in the "update_mmu_cache()".

IOW, the kernel is largely designed for present-only caching, and only has
explicit tlb flush macros for that case.

If ia64 caches non-present TLB entries, then that would seem to be a bug
in the Linux ia64 port:

- include/asm-ia64/pgtable.h:
#define update_mmu_cache(vma, address, pte) do { } while (0)

(Of course, you can and maybe do handle it differently: you can also
decide to just take the TLB fault, and flush the TLB at fault time in your
handler. I don't see that either on ia64, though. Although I didn't look
into any of the asm code, so maybe it's hidden somewhere there).

Linus

2005-10-19 21:59:46

by Tony Luck

[permalink] [raw]
Subject: Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5

On 10/19/05, Linus Torvalds <[email protected]> wrote:
> (Of course, you can and maybe do handle it differently: you can also
> decide to just take the TLB fault, and flush the TLB at fault time in your
> handler. I don't see that either on ia64, though. Although I didn't look
> into any of the asm code, so maybe it's hidden somewhere there).

Yes. In arch/ia64/kernel/ivt.S:

ENTRY(page_not_present)
	DBG_FAULT(20)
	mov r16=cr.ifa
	rsm psr.dt
	/*
	 * The Linux page fault handler doesn't expect non-present
	 * pages to be in the TLB.  Flush the existing entry now, so
	 * we meet that expectation.
	 */
	mov r17=PAGE_SHIFT<<2
	;;
	ptc.l r16,r17

-Tony

2005-10-19 23:46:16

by Rohit Seth

[permalink] [raw]
Subject: Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5

On Wed, 2005-10-19 at 21:28 +0100, Hugh Dickins wrote:
> On Wed, 19 Oct 2005, Andrew Morton wrote:
> > Hugh Dickins <[email protected]> wrote:
> > >
> > > I was forgetting that extending ftruncate wasn't supported in 2.6.14 and
> > > earlier, yes. But I'm afraid the above scenario can still happen there:
> > > extending is done, not by ftruncate, but by (somewhere else) mmapping the
> > > larger size. So your fix may still cause a tight infinite fault loop.
> >
> > Will it? Whenever we mmap a hugetlbfs file we prepopulate the entire vma
> > with hugepages. So I don't think there's ever any part of an address space
> > which is a) inside a hugepage vma and b) doesn't have a hugepage backing
> > it.
>
> The new vma, sure, will be fully populated. But the old vma, in this
> or some other process, which was created before the hugetlbfs file was
> truncated down, will be left with a hole at the end.
>

Excellent catch. This broken truncation thing....

And I don't know what the right solution is for this scenario at this
point for 2.6.14... maybe to actually look at the HUGEPTE corresponding
to the faulting hugetlb address, or to disallow mmaps that grow the
hugetlb file bigger (except the first mmap). I understand that neither
of those sounds too good...

Any suggestions?

-rohit

2005-10-19 23:58:36

by Rohit Seth

[permalink] [raw]
Subject: Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5

On Wed, 2005-10-19 at 13:53 -0700, Linus Torvalds wrote:
>

> The fact is, the VM layer is designed for systems that do not cache
> not-present entries in their TLB. See for example the end of do_no_page()
> in mm/memory.c:
>
> 	/* no need to invalidate: a not-present page shouldn't be cached */
> 	update_mmu_cache(vma, address, entry);
> 	lazy_mmu_prot_update(entry);
> 	spin_unlock(&mm->page_table_lock);
> out:
> 	return ret;
>
> which _allows_ for hardware that caches not-present pages, but the
> architecture needs to catch them in the "update_mmu_cache()".
>

I agree that one way would be to flush the TLB as part of
update_mmu_cache, but that would result in too many extra global
flushes. Instead, on IA-64, we wait till fault time to do local flushes
whenever needed.

> If ia64 caches non-present TLB entries, then that would seem to be a bug
> in the Linux ia64 port:
>
> - include/asm-ia64/pgtable.h:
> #define update_mmu_cache(vma, address, pte) do { } while (0)
>
> (Of course, you can and maybe do handle it differently: you can also
> decide to just take the TLB fault, and flush the TLB at fault time in your
> handler. I don't see that either on ia64, though. Although I didn't look
> into any of the asm code, so maybe it's hidden somewhere there).
>

The low-level page_not_present vector (asm code) flushes the stale entry
that could be sitting in the TLB, resulting in the current page fault.


But anyway, there is now another scenario, which Hugh pointed out in his
last mail, that needs to be taken care of too...

-rohit

2005-10-20 01:29:40

by Rohit Seth

[permalink] [raw]
Subject: Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5

On Wed, 2005-10-19 at 16:53 -0700, Rohit Seth wrote:
> On Wed, 2005-10-19 at 21:28 +0100, Hugh Dickins wrote:
> > On Wed, 19 Oct 2005, Andrew Morton wrote:
> > > Hugh Dickins <[email protected]> wrote:
> > > >
> > > > I was forgetting that extending ftruncate wasn't supported in 2.6.14 and
> > > > earlier, yes. But I'm afraid the above scenario can still happen there:
> > > > extending is done, not by ftruncate, but by (somewhere else) mmapping the
> > > > larger size. So your fix may still cause a tight infinite fault loop.
> > >
> > > Will it? Whenever we mmap a hugetlbfs file we prepopulate the entire vma
> > > with hugepages. So I don't think there's ever any part of an address space
> > > which is a) inside a hugepage vma and b) doesn't have a hugepage backing
> > > it.
> >
> > The new vma, sure, will be fully populated. But the old vma, in this
> > or some other process, which was created before the hugetlbfs file was
> > truncated down, will be left with a hole at the end.
> >
>
> Excellent catch. This broken truncation thing....
>
> And I don't know what the right solution should be for this scenario at
> this point for 2.6.14....may be to actually look at the HUGEPTE
> corresponding to the hugetlb faulting address or don't allow mmaps to
> grow the hugetlb file bigger (except the first mmap). I understand that
> both of them don't sound too good...
>
> Any suggestions.
>


I would like to keep this patch. It at least takes care of bad things
happening behind the application's back due to OS/HW. If the scenario
that you mentioned happens, then the application is knowingly doing
unsupported things (at this point truncate is a broken operation on
hugetlb). The application can be killed in this case.

...trying to avoid any heavy changes at this time for 2.6.14

-rohit

2005-10-20 01:38:33

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5

Rohit Seth <[email protected]> wrote:
>
> > Excellent catch. This broken truncation thing....
> >
> > And I don't know what the right solution should be for this scenario at
> > this point for 2.6.14....may be to actually look at the HUGEPTE
> > corresponding to the hugetlb faulting address or don't allow mmaps to
> > grow the hugetlb file bigger (except the first mmap). I understand that
> > both of them don't sound too good...
> >
> > Any suggestions.
> >
>
>
> I would like to keep this patch. This at least takes care of bad things
> happening behind application's back by OS/HW. If the scenario that you
> mentioned happens then the application is knowingly doing unsupported
> things (at this point truncate is a broken operation on hugetlb). The
> application can be killed in this case.

Yes, we had an infinite-page-fault bug in the vmalloc code ages ago and the
app was indeed killable. It'd be nice to confirm that this is still the
case. If so, the stopgap approach seems reasonable.

As long as the demand-paging patches get in, that is.

2005-10-20 06:18:29

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH]: Handling spurious page fault for hugetlb region for 2.6.14-rc4-git5

On Wed, 19 Oct 2005, Rohit Seth wrote:
> On Wed, 2005-10-19 at 16:53 -0700, Rohit Seth wrote:
> >
> > And I don't know what the right solution should be for this scenario at
> > this point for 2.6.14....may be to actually look at the HUGEPTE
> > corresponding to the hugetlb faulting address or don't allow mmaps to
> > grow the hugetlb file bigger (except the first mmap). I understand that
> > both of them don't sound too good...
>
> I would like to keep this patch. This at least takes care of bad things
> happening behind application's back by OS/HW. If the scenario that you
> mentioned happens then the application is knowingly doing unsupported
> things (at this point truncate is a broken operation on hugetlb). The
> application can be killed in this case.
>
> ...trying to avoid any heavy changes at this time for 2.6.14

I think your first suggestion, actually looking at the hugepte, is the
right one. I've dived into Adam's forthcoming hugetlb fault-on-demand
patch and extracted the minimum we need (but thought we might as well
take all the arguments to hugetlb_fault, even those not yet used).
This works fine so far as I can test it now; how about for you?

[Adam, if this goes in, your patch shouldn't need to touch memory.c or
hugetlb.h; and we hadn't noticed, yours was missing the hugetlb_fault
definition for the #else config, giving compile warning - fixed here.]

Signed-off-by: Hugh Dickins <[email protected]>
---

include/linux/hugetlb.h | 3 +++
mm/hugetlb.c | 22 ++++++++++++++++++++++
mm/memory.c | 2 +-
3 files changed, 26 insertions(+), 1 deletion(-)

--- 2.6.14-rc4-git7/include/linux/hugetlb.h 2005-10-11 12:07:52.000000000 +0100
+++ linux/include/linux/hugetlb.h 2005-10-20 06:27:51.000000000 +0100
@@ -25,6 +25,8 @@ int is_hugepage_mem_enough(size_t);
unsigned long hugetlb_total_pages(void);
struct page *alloc_huge_page(void);
void free_huge_page(struct page *);
+int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, int write_access);

extern unsigned long max_huge_pages;
extern const unsigned long hugetlb_zero, hugetlb_infinity;
@@ -99,6 +101,7 @@ static inline unsigned long hugetlb_tota
do { } while (0)
#define alloc_huge_page() ({ NULL; })
#define free_huge_page(p) ({ (void)(p); BUG(); })
+#define hugetlb_fault(mm, vma, addr, write) ({ BUG(); 0; })

#ifndef HPAGE_MASK
#define HPAGE_MASK 0 /* Keep the compiler happy */
--- 2.6.14-rc4-git7/mm/hugetlb.c 2005-10-11 12:07:55.000000000 +0100
+++ linux/mm/hugetlb.c 2005-10-20 06:27:51.000000000 +0100
@@ -393,6 +393,28 @@ out:
return ret;
}

+/*
+ * On ia64 at least, it is possible to receive a hugetlb fault from a
+ * stale zero entry left in the TLB from earlier hardware prefetching.
+ * Low-level arch code should already have flushed the stale entry as
+ * part of its fault handling, but we do need to accept this minor fault
+ * and return successfully. Whereas the "normal" case is that this is
+ * an access to a hugetlb page which has been truncated off since mmap.
+ */
+int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, int write_access)
+{
+ int ret = VM_FAULT_SIGBUS;
+ pte_t *pte;
+
+ spin_lock(&mm->page_table_lock);
+ pte = huge_pte_offset(mm, address);
+ if (pte && !pte_none(*pte))
+ ret = VM_FAULT_MINOR;
+ spin_unlock(&mm->page_table_lock);
+ return ret;
+}
+
int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page **pages, struct vm_area_struct **vmas,
unsigned long *position, int *length, int i)
--- 2.6.14-rc4-git7/mm/memory.c 2005-10-11 12:07:55.000000000 +0100
+++ linux/mm/memory.c 2005-10-20 06:27:51.000000000 +0100
@@ -2046,7 +2046,7 @@ int __handle_mm_fault(struct mm_struct *
inc_page_state(pgfault);

if (is_vm_hugetlb_page(vma))
- return VM_FAULT_SIGBUS; /* mapping truncation does this. */
+ return hugetlb_fault(mm, vma, address, write_access);

/*
* We need the page table lock to synchronize with kswapd