LinuxLists.cc - PFs on pages pinned with get_user

2009-01-29 08:15:35

Subject: PFs on pages pinned with get_user_pages()

Hi,

could someone please explain to me under which circumstances a page fault,
either generated from kernel code or from userland code, can occur on
pages which are pinned with get_user_pages()?

So far my understanding was that this can _never_ happen, but I seem to
be wrong. Under high memory pressure I get PFs on such pages raised from
kernel code and the PFs are handled by do_swap_page(). When this happens,
page_count() is 3 but page_mapped() returns false.

Thanks in advance,

Frank
--
Dr.-Ing. Frank Mehnert Sun Microsystems http://www.sun.com/


2009-01-29 12:28:28

by Peter Zijlstra

Subject: Re: PFs on pages pinned with get_user_pages()

On Thu, 2009-01-29 at 09:05 +0100, Frank Mehnert wrote:
> Hi,
>
> please could someone explain me under which circumstances a pagefault,
> either generated from kernel code or from userland code, can occur on
> pages which are pinned with get_user_pages()?
>
> So far my understanding was that this can _never_ happen but I seem to
> be wrong. Under high memory pressure I get PFs on such pages raised from
> kernel code and the PFs are handled by do_swap_page(). When this happens,
> page_count is 3 but page_mapped() returns false.

Under memory pressure the page reclaim will first unmap the physical
page from the virtual address range, and then try to free it.

Obviously the freeing bit fails if you hold a reference to it, but the
unmap will work.

After that, userspace will have to (minor) fault the stuff back in.

Also, that same page-reclaim, or pdflush might decide to write out dirty
data, which will also result in (minor) faults when userspace will
re-dirty the pages.

Having a page reference will only avoid the physical page from getting
removed from its current mapping (and thereby also pins the mapping).
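
As an illustration only, the difference between holding a reference and staying mapped might be sketched with a userspace toy model. Every toy_* name below is hypothetical and none of this is kernel code; the counts are simplified and are not meant to reproduce the exact page_count() value reported above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: every toy_* name is hypothetical, none is a kernel API. */
struct toy_page {
	int refcount;	/* stands in for page_count() */
	int mapcount;	/* number of page table entries mapping the page */
};

/* Stands in for page_mapped(): true while at least one PTE maps the page. */
static bool toy_page_mapped(const struct toy_page *p)
{
	return p->mapcount > 0;
}

/* Mapping the page into a process adds a PTE and a reference. */
static void toy_map(struct toy_page *p)
{
	p->mapcount++;
	p->refcount++;
}

/* Stands in for get_user_pages(): it only takes an extra reference. */
static void toy_pin(struct toy_page *p)
{
	p->refcount++;
}

/*
 * Stands in for reclaim: unmap first, then free only if the base
 * reference is the last one left. Returns true if the page was freed.
 */
static bool toy_reclaim(struct toy_page *p)
{
	/* Step 1: the unmap always works; each PTE drops its reference. */
	p->refcount -= p->mapcount;
	p->mapcount = 0;
	assert(p->refcount >= 1);

	/* Step 2: freeing fails while anyone else holds a reference. */
	if (p->refcount > 1)
		return false;
	p->refcount = 0;
	return true;
}
```

Starting from a page with one base reference, calling toy_map() and then toy_pin() followed by toy_reclaim() leaves the page unmapped but not freed, which is the state in which the next access has to fault the mapping back in.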

2009-01-29 13:08:48

by Frank Mehnert

Subject: Re: PFs on pages pinned with get_user_pages()

Peter,

On Thursday 29 January 2009, Peter Zijlstra wrote:
> On Thu, 2009-01-29 at 09:05 +0100, Frank Mehnert wrote:
> > please could someone explain me under which circumstances a pagefault,
> > either generated from kernel code or from userland code, can occur on
> > pages which are pinned with get_user_pages()?
> >
> > So far my understanding was that this can _never_ happen but I seem to
> > be wrong. Under high memory pressure I get PFs on such pages raised from
> > kernel code and the PFs are handled by do_swap_page(). When this happens,
> > page_count is 3 but page_mapped() returns false.
>
> Under memory pressure the page reclaim will first unmap the physical
> page from the virtual address range, and then try to free it.

Which means the page table entry is removed but the physical page
is not swapped out, right?

> Obviously the freeing bit fails if you hold a reference to it, but the
> unmap will work.

Right.

> After that, userspace will have to (minor) fault the stuff back in.

So do_swap_page() only 'restores' the page table entry, and no further
reading from the swapfile is necessary?

> Also, that same page-reclaim, or pdflush might decide to write out dirty
> data, which will also result in (minor) faults when userspace will
> re-dirty the pages.
>
> Having a page reference will only avoid the physical page from getting
> removed from its current mapping (and thereby also pins the mapping).

Question: Is it possible to prevent these minor page faults at all?

Thank you very much for your answer!

Frank
--
Dr.-Ing. Frank Mehnert Sun Microsystems http://www.sun.com/


2009-01-29 13:44:23

by Peter Zijlstra

Subject: Re: PFs on pages pinned with get_user_pages()

On Thu, 2009-01-29 at 14:08 +0100, Frank Mehnert wrote:
> Peter,

(please retain CC's)

> On Thursday 29 January 2009, Peter Zijlstra wrote:
> > On Thu, 2009-01-29 at 09:05 +0100, Frank Mehnert wrote:
> > > please could someone explain me under which circumstances a pagefault,
> > > either generated from kernel code or from userland code, can occur on
> > > pages which are pinned with get_user_pages()?
> > >
> > > So far my understanding was that this can _never_ happen but I seem to
> > > be wrong. Under high memory pressure I get PFs on such pages raised from
> > > kernel code and the PFs are handled by do_swap_page(). When this happens,
> > > page_count is 3 but page_mapped() returns false.
> >
> > Under memory pressure the page reclaim will first unmap the physical
> > page from the virtual address range, and then try to free it.
>
> Which means the page table entry is removed but the physical page
> is not swapped out, right?

Correct.

> > Obviously the freeing bit fails if you hold a reference to it, but the
> > unmap will work.
>
> Right.
>
> > After that, userspace will have to (minor) fault the stuff back in.
>
> So do_swap_page does only 'restore' the page table entry, no further
> reading from the swapfile is necessary?

Indeed.
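
A rough userspace sketch of that distinction, under the assumption that a pinned page never leaves memory: the fault only has to re-establish the mapping (a minor fault), whereas a page that really was evicted must be read back first (a major fault). The toy_* names are hypothetical and this is not how do_swap_page() is actually written.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: the toy_* names are hypothetical, not kernel interfaces. */
enum toy_fault { TOY_FAULT_MINOR, TOY_FAULT_MAJOR };

struct toy_swap_page {
	bool in_memory;	/* physical page still present (e.g. kept alive by a pin) */
	bool mapped;	/* a page table entry currently points at it */
	int disk_reads;	/* how many times the contents were read back from swap */
};

/*
 * Stands in for the swap fault path: if the page never left memory the
 * fault only re-establishes the mapping; otherwise the contents have to
 * be read from the swap device first.
 */
static enum toy_fault toy_swap_fault(struct toy_swap_page *p)
{
	enum toy_fault kind = TOY_FAULT_MINOR;

	assert(!p->mapped);	/* a fault only happens on an unmapped page */
	if (!p->in_memory) {
		p->disk_reads++;
		p->in_memory = true;
		kind = TOY_FAULT_MAJOR;
	}
	p->mapped = true;
	return kind;
}
```

In this model a pinned page always takes the TOY_FAULT_MINOR branch with disk_reads left at zero, yet the fault itself still happens.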

> > Also, that same page-reclaim, or pdflush might decide to write out dirty
> > data, which will also result in (minor) faults when userspace will
> > re-dirty the pages.
> >
> > Having a page reference will only avoid the physical page from getting
> > removed from its current mapping (and thereby also pins the mapping).
>
> Question: Is it possible to prevent these minor page faults at all?

Not without some serious tinkering to the VM -- and in the case of the
dirty fault, not at all.

Why are you asking?

2009-01-29 14:03:22

by Frank Mehnert

Subject: Re: PFs on pages pinned with get_user_pages()

On Thursday 29 January 2009, Peter Zijlstra wrote:
> On Thu, 2009-01-29 at 14:08 +0100, Frank Mehnert wrote:
> > Peter,
>
> (please retain CC's)
>
> > On Thursday 29 January 2009, Peter Zijlstra wrote:
> > > On Thu, 2009-01-29 at 09:05 +0100, Frank Mehnert wrote:
> > > > please could someone explain me under which circumstances a
> > > > pagefault, either generated from kernel code or from userland code,
> > > > can occur on pages which are pinned with get_user_pages()?
> > > >
> > > > So far my understanding was that this can _never_ happen but I seem
> > > > to be wrong. Under high memory pressure I get PFs on such pages
> > > > raised from kernel code and the PFs are handled by do_swap_page().
> > > > When this happens, page_count is 3 but page_mapped() returns false.
> > >
> > > Under memory pressure the page reclaim will first unmap the physical
> > > page from the virtual address range, and then try to free it.
> >
> > Which means the page table entry is removed but the physical page
> > is not swapped out, right?
>
> Correct.
>
> > > Obviously the freeing bit fails if you hold a reference to it, but the
> > > unmap will work.
> >
> > Right.
> >
> > > After that, userspace will have to (minor) fault the stuff back in.

[...]

> > Question: Is it possible to prevent these minor page faults at all?
>
> Not without some serious tinkering to the VM -- and in the case of the
> dirty fault, not at all.
>
> Why are you asking?

I'm one of the VirtualBox developers. We are trying to fix the annoying
kerneloops warning 'BUG: sleeping function called from invalid context'
reported by the Fedora folks. This warning occurs when do_swap_page()
calls lock_page() and in_atomic() returns true.

This warning appears when we touch memory which is pinned with
get_user_pages(). In VT-x/AMD-V mode we are executing some code in the
context of the Linux kernel. To prevent scheduling on the current CPU
core we disable interrupts. preempt_disable() would probably be the
better choice, but this would oops as well if CONFIG_PREEMPT is enabled.

Kind regards,

Frank
--
Dr.-Ing. Frank Mehnert Sun Microsystems http://www.sun.com/


2009-01-29 14:21:19

by Peter Zijlstra

Subject: Re: PFs on pages pinned with get_user_pages()

On Thu, 2009-01-29 at 15:02 +0100, Frank Mehnert wrote:

> I'm one of the VirtualBox developers. We are trying to fix the annoying
> kerneloops warning 'BUG: sleeping function called from invalid context'
> reported by the Fedora folks. This warning occurs when do_swap_page()
> calls lock_page() and in_atomic() returns true.
>
> This warning appears when we touch into memory which is pinned with
> get_user_pages(). In VT-x/AMD-V mode we are executing some code in the
> context of the Linux kernel. To prevent scheduling of the current CPU
> core we disable the interrupts. preempt_disable() would be probably the
> better choice but this would oops as well if CONFIG_PREEMPT is enabled.

But to get there, you'd have to have called handle_mm_fault(), which
requires the mmap_sem, which should also give that might_sleep()
warning.

That aside, is there any reason you have to avoid scheduling? Otherwise
I would just allow it and be done with it.

2009-01-29 14:42:06

by Frank Mehnert

Subject: Re: PFs on pages pinned with get_user_pages()

On Thursday 29 January 2009, Peter Zijlstra wrote:
> On Thu, 2009-01-29 at 15:02 +0100, Frank Mehnert wrote:
> > I'm one of the VirtualBox developers. We are trying to fix the annoying
> > kerneloops warning 'BUG: sleeping function called from invalid context'
> > reported by the Fedora folks. This warning occurs when do_swap_page()
> > calls lock_page() and in_atomic() returns true.
> >
> > This warning appears when we touch into memory which is pinned with
> > get_user_pages(). In VT-x/AMD-V mode we are executing some code in the
> > context of the Linux kernel. To prevent scheduling of the current CPU
> > core we disable the interrupts. preempt_disable() would be probably the
> > better choice but this would oops as well if CONFIG_PREEMPT is enabled.
>
> but to get there, you'd have to have called handle_mm_fault() which
> requires the mmap_sem, which should also give that might_sleep()
> warning.

The stacktrace is

__might_sleep()
lock_page()
handle_mm_fault()
do_page_fault()
error_code

So yes, handle_mm_fault() is called. But I assume that down_read_trylock()
succeeded before we were forced to call down_read().

> That aside, is there any reason you have to avoid scheduling? Otherwise
> I would just allow so and be done with it.

The reason is that our code relies on not being rescheduled to keep the
CPU state in sync with the saved state. I fear it is quite difficult to
change that...

Kind regards,

Frank
--
Dr.-Ing. Frank Mehnert Sun Microsystems http://www.sun.com/


2009-01-29 14:52:59

by Peter Zijlstra

Subject: Re: PFs on pages pinned with get_user_pages()

On Thu, 2009-01-29 at 15:41 +0100, Frank Mehnert wrote:
> On Thursday 29 January 2009, Peter Zijlstra wrote:
> > On Thu, 2009-01-29 at 15:02 +0100, Frank Mehnert wrote:
> > > I'm one of the VirtualBox developers. We are trying to fix the annoying
> > > kerneloops warning 'BUG: sleeping function called from invalid context'
> > > reported by the Fedora folks. This warning occurs when do_swap_page()
> > > calls lock_page() and in_atomic() returns true.
> > >
> > > This warning appears when we touch into memory which is pinned with
> > > get_user_pages(). In VT-x/AMD-V mode we are executing some code in the
> > > context of the Linux kernel. To prevent scheduling of the current CPU
> > > core we disable the interrupts. preempt_disable() would be probably the
> > > better choice but this would oops as well if CONFIG_PREEMPT is enabled.
> >
> > but to get there, you'd have to have called handle_mm_fault() which
> > requires the mmap_sem, which should also give that might_sleep()
> > warning.
>
> The stacktrace is
>
> __might_sleep()
> lock_page()
> handle_mm_fault()
> do_page_fault()
> error_code
>
> So yes, handle_mm_fault() is called. But I assume that down_read_trylock()
> succeeded before we were forced to call down_read().
>
> > That aside, is there any reason you have to avoid scheduling? Otherwise
> > I would just allow so and be done with it.
>
> The reason is that our code expects that to ensure syncing of the CPU
> state with the saved state. I fear it is quite difficult to change that...

Ah, is that what KVM uses the preempt notifiers for? Could you too?
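
For readers unfamiliar with the idea, here is a minimal userspace sketch of the pattern: a pair of callbacks that save the CPU state when the task is scheduled out and reload it when it is scheduled back in. As far as I know, the kernel interface is struct preempt_notifier with a struct preempt_ops (sched_in/sched_out), registered via preempt_notifier_register(); everything named toy_* below is hypothetical and only models that shape.

```c
#include <assert.h>

/* Toy model only: all toy_* names are hypothetical. */
struct toy_notifier;

struct toy_preempt_ops {
	void (*sched_in)(struct toy_notifier *n, int cpu);
	void (*sched_out)(struct toy_notifier *n);
};

struct toy_notifier {
	const struct toy_preempt_ops *ops;
};

struct toy_vcpu {
	struct toy_notifier notifier;	/* must stay first, see toy_to_vcpu() */
	int hw_state;		/* guest state currently held in the CPU */
	int saved_state;	/* copy kept in memory while scheduled out */
	int cpu;		/* CPU the state is loaded on, -1 if none */
};

static struct toy_vcpu *toy_to_vcpu(struct toy_notifier *n)
{
	/* Valid because notifier is the first member of struct toy_vcpu. */
	return (struct toy_vcpu *)n;
}

static void toy_vcpu_sched_out(struct toy_notifier *n)
{
	struct toy_vcpu *v = toy_to_vcpu(n);

	v->saved_state = v->hw_state;	/* sync CPU state to memory */
	v->cpu = -1;
}

static void toy_vcpu_sched_in(struct toy_notifier *n, int cpu)
{
	struct toy_vcpu *v = toy_to_vcpu(n);

	v->hw_state = v->saved_state;	/* reload state on the new CPU */
	v->cpu = cpu;
}

static const struct toy_preempt_ops toy_vcpu_ops = {
	.sched_in = toy_vcpu_sched_in,
	.sched_out = toy_vcpu_sched_out,
};

/* Stands in for the scheduler switching away and later back on new_cpu. */
static void toy_preempt_and_resume(struct toy_notifier *n, int new_cpu)
{
	assert(n->ops && n->ops->sched_in && n->ops->sched_out);
	n->ops->sched_out(n);
	toy_to_vcpu(n)->hw_state = -1;	/* another task clobbers the CPU state */
	n->ops->sched_in(n, new_cpu);
}
```

The point of the pattern is that the code no longer has to forbid scheduling: the saved state stays in sync because the callbacks run at every switch.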

2009-01-29 14:57:12

by Peter Zijlstra

Subject: [PATCH] x86: add might_sleep() to do_page_fault()

VirtualBox calls do_page_fault() from an atomic context but runs into a
might_sleep() way past this point, cure that.

Signed-off-by: Peter Zijlstra <[email protected]>
---
arch/x86/mm/fault.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 67e4df5..bb7f946 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -908,6 +908,11 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
 		}
 		down_read(&mm->mmap_sem);
 	}
+	/*
+	 * The above down_read_trylock() might have succeeded in which case
+	 * we'll have missed the might_sleep() from down_read().
+	 */
+	might_sleep();
 
 	vma = find_vma(mm, address);
 	if (unlikely(!vma)) {

2009-01-29 15:00:10

by Ingo Molnar

Subject: Re: [PATCH] x86: add might_sleep() to do_page_fault()

* Peter Zijlstra <[email protected]> wrote:

> VirtualBox calls do_page_fault() from an atomic context but runs into a
> might_sleep() way past this point, cure that.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> arch/x86/mm/fault.c | 5 +++++
> 1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 67e4df5..bb7f946 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -908,6 +908,11 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
>  		}
>  		down_read(&mm->mmap_sem);
>  	}
> +	/*
> +	 * The above down_read_trylock() might have succeeded in which case
> +	 * we'll have missed the might_sleep() from down_read().
> +	 */
> +	might_sleep();

should go into the 'else' branch i guess? In the down_read() case we
already had the check.

Ingo

2009-01-29 15:02:32

by Peter Zijlstra

Subject: [PATCH v2] x86: add might_sleep() to do_page_fault()

> should go into the 'else' branch i guess? In the down_read() case we
> already had the check.

True.

---
VirtualBox calls do_page_fault() from an atomic context but runs into a
might_sleep() way past this point, cure that.

Signed-off-by: Peter Zijlstra <[email protected]>
---
arch/x86/mm/fault.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 67e4df5..bfac289 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -907,6 +907,12 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
 			return;
 		}
 		down_read(&mm->mmap_sem);
+	} else {
+		/*
+		 * The above down_read_trylock() might have succeeded in which
+		 * case we'll have missed the might_sleep() from down_read().
+		 */
+		might_sleep();
 	}
 
 	vma = find_vma(mm, address);

2009-01-29 15:04:38

by Ingo Molnar

Subject: Re: [PATCH v2] x86: add might_sleep() to do_page_fault()

* Peter Zijlstra <[email protected]> wrote:

>
> > should go into the 'else' branch i guess? In the down_read() case we
> > already had the check.
>
> True.
>
> ---
> VirtualBox calls do_page_fault() from an atomic context but runs into a
> might_sleep() way past this point, cure that.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> arch/x86/mm/fault.c | 6 ++++++
> 1 files changed, 6 insertions(+), 0 deletions(-)

Applied to tip/x86/mm, thanks Peter!

Ingo
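
To make the effect of the v2 patch concrete, here is a small userspace model of the locking step before and after the change. It is a sketch under simplifying assumptions: the toy_* names are hypothetical, the contended branch of the real do_page_fault() does more than call down_read(), and the only thing modelled is whether the atomic-context check fires.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: all toy_* names are hypothetical. */
struct toy_ctx {
	bool atomic;		/* caller runs with interrupts/preemption off */
	bool lock_contended;	/* whether the trylock would fail */
	int warnings;		/* "sleeping function called ..." reports */
};

/* Stands in for might_sleep(): complain if we are in atomic context. */
static void toy_might_sleep(struct toy_ctx *c)
{
	assert(c->warnings >= 0);
	if (c->atomic)
		c->warnings++;
}

static bool toy_down_read_trylock(struct toy_ctx *c)
{
	return !c->lock_contended;
}

/* Stands in for down_read(), which checks before it may block. */
static void toy_down_read(struct toy_ctx *c)
{
	toy_might_sleep(c);
}

/* Before the patch: the check only runs on the contended path. */
static void toy_fault_lock_old(struct toy_ctx *c)
{
	if (!toy_down_read_trylock(c))
		toy_down_read(c);
}

/* After the v2 patch: the uncontended path checks as well. */
static void toy_fault_lock_new(struct toy_ctx *c)
{
	if (!toy_down_read_trylock(c))
		toy_down_read(c);
	else
		toy_might_sleep(c);
}
```

With atomic set and the lock uncontended, toy_fault_lock_old() reports nothing while toy_fault_lock_new() reports once, so the misuse is caught at the fault entry rather than only later in lock_page().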

2009-01-29 16:04:20

by Frank Mehnert

Subject: Re: PFs on pages pinned with get_user_pages()

On Thursday 29 January 2009, Peter Zijlstra wrote:
> > > That aside, is there any reason you have to avoid scheduling? Otherwise
> > > I would just allow so and be done with it.
> >
> > The reason is that our code expects that to ensure syncing of the CPU
> > state with the saved state. I fear it is quite difficult to change
> > that...
>
> Ah, is that what KVM uses the preempt notifiers for? Could you too?

Right, that could be an option.

We will try to change our code, which is a big effort as we try
to keep the code as uniform as possible across the different
hosts we support (Linux, Solaris, Windows, Mac OS X).

Just to be sure: There is no other option than disabling interrupts
or calling preempt_disable() to prevent scheduling?

Kind regards,

Frank
--
Dr.-Ing. Frank Mehnert Sun Microsystems http://www.sun.com/


2009-01-29 16:11:45

by Peter Zijlstra

Subject: Re: PFs on pages pinned with get_user_pages()

On Thu, 2009-01-29 at 17:03 +0100, Frank Mehnert wrote:
> On Thursday 29 January 2009, Peter Zijlstra wrote:
> > > > That aside, is there any reason you have to avoid scheduling? Otherwise
> > > > I would just allow so and be done with it.
> > >
> > > The reason is that our code expects that to ensure syncing of the CPU
> > > state with the saved state. I fear it is quite difficult to change
> > > that...
> >
> > Ah, is that what KVM uses the preempt notifiers for? Could you too?
>
> Right, that could be an option.
>
> We will try to change our code which is a big effort as we try
> to keep the code as unique as possible between the different
> hosts we support (Linux, Solaris, Windows, Mac OS X).
>
> Just to be sure: There is no other option than disabling interrupts
> or calling disable_preemption() to prevent scheduling?

Thing is, lock_page() and down_read() need to be able to schedule(),
so there's no way around that.

So even if there was another way to disable scheduling, you'd still have
the same problem.

2009-01-30 10:34:26

by Frank Mehnert

Subject: Re: PFs on pages pinned with get_user_pages()

On Thursday 29 January 2009, Peter Zijlstra wrote:
> On Thu, 2009-01-29 at 17:03 +0100, Frank Mehnert wrote:
> > On Thursday 29 January 2009, Peter Zijlstra wrote:
> > > > > That aside, is there any reason you have to avoid scheduling?
> > > > > Otherwise I would just allow so and be done with it.
> > > >
> > > > The reason is that our code expects that to ensure syncing of the CPU
> > > > state with the saved state. I fear it is quite difficult to change
> > > > that...
> > >
> > > Ah, is that what KVM uses the preempt notifiers for? Could you too?
> >
> > Right, that could be an option.
> >
> > We will try to change our code which is a big effort as we try
> > to keep the code as unique as possible between the different
> > hosts we support (Linux, Solaris, Windows, Mac OS X).
> >
> > Just to be sure: There is no other option than disabling interrupts
> > or calling disable_preemption() to prevent scheduling?
>
> Thing is, lock_page() and down_read() require to be able to schedule(),
> so there's no way around that.
>
> So even if there was another way to disable scheduling, you'd still have
> the same problem.

Yes, makes sense.

Back to my initial question: The problem arises for us because we depend
on permanent mappings of memory which were

- allocated with alloc_pages() or alloc_page()
- mapped into ring 3 with remap_pfn_range() and
- pinned with get_user_pages()

There are potential page faults when touching these ring-3 mappings
from ring 0. So I assume we could prevent such page faults if we access
that memory through ring-0 mappings, right? Unfortunately, the space for
ring-0 mappings (< 1GB) is smaller than userland (~ 3GB), at least on
32-bit systems.

Kind regards,

Frank
--
Dr.-Ing. Frank Mehnert Sun Microsystems http://www.sun.com/


2009-01-30 10:45:36

by Peter Zijlstra

Subject: Re: PFs on pages pinned with get_user_pages()

On Fri, 2009-01-30 at 11:34 +0100, Frank Mehnert wrote:

> > Thing is, lock_page() and down_read() require to be able to schedule(),
> > so there's no way around that.
> >
> > So even if there was another way to disable scheduling, you'd still have
> > the same problem.
>
> Yes, makes sense.
>
> Back to my initial question: The problem arises for us because we depend
> on permanent mappings of memory which were
>
> - allocated with alloc_pages() or alloc_page()
> - mapped into ring 3 with remap_pfn_range() and
> - pinned with get_user_pages()
>
> There are potential pagefaults when touching into these ring-3-mappings
> from ring 0. So I assume we could prevent such pagefaults if we access
> that memory from ring-0-mappings, right? Unfortunately, the space for
> ring-0-mappings (< 1GB) is smaller than userland (~ 3GB), at least on
> 32-bit systems.

If you only need to access one or two pages, you could kmap_atomic() the
actual pages from ring 0.