I recently discovered that commit 662bbcb2747c2422cf98d3d97619509379eee466
removed/skipped all might_sleep() checks for might_fault() calls when in atomic
context. The reason was to allow calling copy_(to|from)_user while in
pagefault_disabled(), because otherwise CONFIG_DEBUG_ATOMIC_SLEEP would report
false positives. However, we have the inatomic variants of these functions for
this purpose.
The result of this change is that all guest accesses (which correctly use
might_fault()) no longer perform the atomic check when CONFIG_DEBUG_ATOMIC_SLEEP
is enabled. We lost a mighty debugging feature for user access.
As I wasn't able to come up with any other reason why this should be
necessary, I suggest turning the might_sleep() checks on again in the
might_fault() code.
pagefault_disable()/pagefault_enable() seem to be used mainly for futex, perf
events and kmap.
Going over all changes since that commit, it seems like most code already uses
the inatomic variants of copy_(to|from)_user. Code relying on (get|put)_user
during pagefault_disable() doesn't make use of any might_fault() in its
(get|put)_user implementation. Examples:
- arch/m68k/include/asm/futex.h
- arch/parisc/include/asm/futex.h
- arch/sh/include/asm/futex-irq.h
- arch/tile/include/asm/futex.h
So changing might_fault() back to trigger might_sleep() won't change a thing
for them. I hope I haven't missed anything.
I only identified one might_fault() that would get triggered by get_user() on
powerpc and fixed it by using the inatomic variant (not tested). I am not sure
if we need some non-sleeping access_ok() prior to __get_user_inatomic().
While looking at the code I was wondering where the correct place for
might_fault() calls is. It doesn't seem to be very consistent. Examples:
| asm-generic | arm | arm64 | frv | m32r | x86 and s390
---------------------------------------------------------------------------
get_user() | Yes | Yes | Yes | No | Yes | Yes
__get_user() | No | Yes | No | No | Yes | No
put_user() | Yes | Yes | Yes | No | Yes | Yes
__put_user() | No | Yes | No | No | Yes | No
copy_to_user() | Yes | No | No | Yes | Yes | Yes
__copy_to_user() | No | No | No | Yes | No | No
copy_from_user() | Yes | No | No | Yes | Yes | Yes
__copy_from_user() | No | No | No | Yes | No | No
So I would have assumed that the way asm-generic, x86 and s390 (and probably
others) do this is the right way?
So we can speed up multiple calls to e.g. copy_to_user() by doing the access
check manually (and also the might_fault() if relevant), then calling
__copy_to_user().
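For illustration, a minimal sketch of that pattern (the function, its
arguments and the two-field layout are made up here; the three-argument
access_ok() is the form used by kernels of that time):

#include <linux/uaccess.h>

/*
 * Sketch only: check the whole destination once, annotate with
 * might_fault() once, then use the unchecked __copy_to_user() for the
 * individual copies.  Returns 0 or -EFAULT like the checked variants.
 */
static int copy_two_fields(void __user *dst, const void *a, const void *b,
			   size_t len)
{
	if (!access_ok(VERIFY_WRITE, dst, 2 * len))
		return -EFAULT;
	might_fault();

	if (__copy_to_user(dst, a, len))
		return -EFAULT;
	if (__copy_to_user(dst + len, b, len))
		return -EFAULT;
	return 0;
}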
So in general, I conclude that the concept is:
1. __.* variants perform no checking and don't call might_fault()
2. [a-z].* variants perform access checking (access_ok() if implemented)
3. [a-z].* variants call might_fault()
4. .*_inatomic variants usually don't perform access checks
5. .*_inatomic variants don't call might_fault()
6. If common code uses the __.* variants, it has to trigger access_ok() and
call might_fault()
7. For pagefault_disable(), the inatomic variants are to be used
Comments? Opinions?
David Hildenbrand (2):
powerpc/fsl-pci: atomic get_user when pagefault_disabled
mm, sched: trigger might_sleep() in might_fault() when atomic
arch/powerpc/sysdev/fsl_pci.c | 2 +-
include/linux/kernel.h | 8 ++++++--
mm/memory.c | 11 ++++-------
3 files changed, 11 insertions(+), 10 deletions(-)
--
1.8.5.5
Whenever we have page faults disabled, we have to use the atomic variants of
(get|put)_user and copy_(from|to)_user.
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/powerpc/sysdev/fsl_pci.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
index 65d2ed4..c0af4ef 100644
--- a/arch/powerpc/sysdev/fsl_pci.c
+++ b/arch/powerpc/sysdev/fsl_pci.c
@@ -1025,7 +1025,7 @@ int fsl_pci_mcheck_exception(struct pt_regs *regs)
if (is_in_pci_mem_space(addr)) {
if (user_mode(regs)) {
pagefault_disable();
- ret = get_user(regs->nip, &inst);
+ ret = __get_user_inatomic(regs->nip, &inst);
pagefault_enable();
} else {
ret = probe_kernel_address(regs->nip, inst);
--
1.8.5.5
Commit 662bbcb2747c2422cf98d3d97619509379eee466 disabled the in-atomic checks
for all user access code (that uses might_fault()).
That change basically disabled CONFIG_DEBUG_ATOMIC_SLEEP for all user access
functions. However, this is a mighty debugging aid that we want.
If user memory is to be accessed while pagefault_disabled() is set, the atomic
variants of copy_(to|from)_user can be used.
This patch reverts commit 662bbcb2747c2422cf98d3d97619509379eee466, taking care
to preserve the !MMU optimization.
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/kernel.h | 8 ++++++--
mm/memory.c | 11 ++++-------
2 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 3d770f55..1d3397c 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -225,9 +225,13 @@ static inline u32 reciprocal_scale(u32 val, u32 ep_ro)
return (u32)(((u64) val * ep_ro) >> 32);
}
-#if defined(CONFIG_MMU) && \
- (defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_DEBUG_ATOMIC_SLEEP))
+#if defined(CONFIG_MMU) && defined(CONFIG_PROVE_LOCKING)
void might_fault(void);
+#elif defined(CONFIG_MMU) && defined(CONFIG_DEBUG_ATOMIC_SLEEP)
+static inline void might_fault(void)
+{
+ __might_sleep(__FILE__, __LINE__, 0);
+}
#else
static inline void might_fault(void) { }
#endif
diff --git a/mm/memory.c b/mm/memory.c
index 3e50383..fe0c815 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3699,7 +3699,7 @@ void print_vma_addr(char *prefix, unsigned long ip)
up_read(&mm->mmap_sem);
}
-#if defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_DEBUG_ATOMIC_SLEEP)
+#ifdef CONFIG_PROVE_LOCKING
void might_fault(void)
{
/*
@@ -3711,17 +3711,14 @@ void might_fault(void)
if (segment_eq(get_fs(), KERNEL_DS))
return;
+ __might_sleep(__FILE__, __LINE__, 0);
+
/*
* it would be nicer only to annotate paths which are not under
* pagefault_disable, however that requires a larger audit and
* providing helpers like get_user_atomic.
*/
- if (in_atomic())
- return;
-
- __might_sleep(__FILE__, __LINE__, 0);
-
- if (current->mm)
+ if (!in_atomic() && current->mm)
might_lock_read(&current->mm->mmap_sem);
}
EXPORT_SYMBOL(might_fault);
--
1.8.5.5
On Tue, Nov 25, 2014 at 12:43:24PM +0100, David Hildenbrand wrote:
> I recently discovered that commit 662bbcb2747c2422cf98d3d97619509379eee466
> removed/skipped all might_sleep checks for might_fault() calls when in atomic.
Yes. You can add e.g. might_sleep in your code if, for some reason, it is
illegal to call it in an atomic context.
But faults are legal in atomic context if you handle the possible
errors, and faults do not necessarily cause the caller to sleep, so might_fault
should not call might_sleep.
> Reason was to allow calling copy_(to|from)_user while in pagefault_disabled(),
> because otherwise, CONFIG_DEBUG_ATOMIC_SLEEP would find false positives.
That wasn't the only reason, BTW.
Andi Kleen also showed that the compiler did a terrible job optimizing
get/put_user when might_sleep was called.
See e.g. this thread:
https://lkml.org/lkml/2013/8/14/652
There was even an lwn.net article about it, that I don't seem to be
able to locate.
So might_sleep is not appropriate for lightweight operations like __get_user,
which people literally expect to be a single instruction.
I also have a project going which handles very short packets by copying
them into guest memory directly without waking up a thread.
I do it by calling recvmsg on a socket, then switching to
a thread if I get back EFAULT.
Not yet clean enough to upstream but it does seem to cut
the latency down quite a bit, so I'd like to have the option.
Generally, a caller that does something like this under a spinlock:
preempt_disable
pagefault_disable
error = copy_to_user
pagefault_enable
preempt_enable_no_resched
is not doing anything wrong and should not get a warning,
as long as the error is handled correctly later.
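For illustration, a minimal sketch of such a caller (the function name and
the -EAGAIN fallback convention are made up):

#include <linux/uaccess.h>

/*
 * Sketch only: attempt the copy with page faults disabled; on failure
 * report -EAGAIN so the caller can retry from a context that is
 * allowed to sleep (e.g. a worker thread).
 */
static int try_copy_to_user_atomic(void __user *dst, const void *src,
				   size_t len)
{
	unsigned long left;

	pagefault_disable();
	left = copy_to_user(dst, src, len);	/* fails instead of sleeping */
	pagefault_enable();

	return left ? -EAGAIN : 0;
}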
You can also find the discussion around the patches
educational:
http://article.gmane.org/gmane.comp.emulators.kvm.devel/109928
> However
> we have the inatomic variants of these function for this purpose.
Does inatomic install fixups now?
Last I checked it didn't, so it did not satisfy this purpose.
See this comment from x86:
* Copy data from kernel space to user space. Caller must check
* the specified block with access_ok() before calling this function.
* The caller should also make sure he pins the user space address
* so that we don't result in page fault and sleep.
Also - switching to inatomic would scatter if (atomic) checks all
over the code. It's much better to simply call the same
function (e.g. recvmsg) and have it fail gracefully:
after all, we have code to handle get/put_user errors
anyway.
> The result of this change was that all guest access (that correctly uses
> might_fault()) doesn't perform atomic checks when CONFIG_DEBUG_ATOMIC_SLEEP is
> enabled. We lost a mighty debugging feature for user access.
What's the path you are trying to debug?
If your code can fault, then it's safe to call it
from atomic context.
If it can't, it must pin the page. You cannot do access_ok checks and
then assume the access won't fault.
> As I wasn't able to come up with any other reason why this should be
> necessary, I suggest turning the might_sleep() checks on again in the
> might_fault() code.
Faults triggered with pagefault_disabled do not cause the
caller to sleep, merely to fail and return an error,
so might_sleep is simply wrong.
>
> pagefault_disable/pagefault_enable seems to be used mainly for futex, perf event
> and kmap.
>
> Going over all changes since that commit, it seems like most code already uses the
> inatomic variants of copy_(to|from)_user. Code relying on (get|put)_user during
> pagefault_disable() don't make use of any might_fault() in their (get|put)_user
> implementation. Examples:
> - arch/m68k/include/asm/futex.h
> - arch/parisc/include/asm/futex.h
> - arch/sh/include/asm/futex-irq.h
> - arch/tile/include/asm/futex.h
> So changing might_fault() back to trigger might_sleep() won't change a thing for
> them. Hope I haven't missed anything.
Did you check the generated code?
On x86 it seems to me this patchset is definitely going to
slow things down, in fact putting back in a performance regression
that Andi found.
> I only identified one might_fault() that would get triggered by get_user() on
> powerpc and fixed it by using the inatomic variant (not tested). I am not sure
> if we need some non-sleeping access_ok() prior to __get_user_inatomic().
>
> By looking at the code I was wondering where the correct place for might_fault()
> calls is? Doesn't seem to be very consistent. Examples:
>
> | asm-generic | arm | arm64 | frv | m32r | x86 and s390
> ---------------------------------------------------------------------------
> get_user() | Yes | Yes | Yes | No | Yes | Yes
> __get_user() | No | Yes | No | No | Yes | No
> put_user() | Yes | Yes | Yes | No | Yes | Yes
> __put_user() | No | Yes | No | No | Yes | No
> copy_to_user() | Yes | No | No | Yes | Yes | Yes
> __copy_to_user() | No | No | No | Yes | No | No
> copy_from_user() | Yes | No | No | Yes | Yes | Yes
> __copy_from_user() | No | No | No | Yes | No | No
>
I think it would be a mistake to make this change.
Most call-sites handle faults in atomic just fine by
returning error to caller.
> So I would have assume that the way asm-generic, x86 and s390 (+ propably
> others) do this is the right way?
> So we can speed up multiple calls to e.g. copy_to_user() by doing the access
> check manually (and also the might_fault() if relevant), then calling
> __copy_to_user().
>
> So in general, I conclude that the concept is:
> 1. __.* variants perform no checking and don't call might_fault()
> 2. [a-z].* variants perform access checking (access_ok() if implemented)
> 3. [a-z].* variants call might_fault()
> 4. .*_inatomic variants usually don't perform access checks
> 5. .*_inatomic variants don't call might_fault()
> 6. If common code uses the __.* variants, it has to trigger access_ok() and
> call might_fault()
> 7. For pagefault_disable(), the inatomic variants are to be used
inatomic variants don't seem to handle faults, so you
must pin any memory you pass to them.
> Comments? Opinions?
>
If the same address is accessed multiple times, access_ok + __
variant can be used to speed access up a bit.
This is rarely the case, but this is the case for e.g. vhost.
But access_ok does not guarantee that no fault will trigger:
there's really no way to do that ATM except pinning the page.
> David Hildenbrand (2):
> powerpc/fsl-pci: atomic get_user when pagefault_disabled
> mm, sched: trigger might_sleep() in might_fault() when atomic
>
> arch/powerpc/sysdev/fsl_pci.c | 2 +-
> include/linux/kernel.h | 8 ++++++--
> mm/memory.c | 11 ++++-------
> 3 files changed, 11 insertions(+), 10 deletions(-)
>
> --
> 1.8.5.5
Hi Michael,
thanks for your reply! Some general thoughts:
This change was introduced in mid 2013, but we don't have a single user relying
on this code change yet, right?
Disabling might_sleep() for functions that clearly state that they may sleep,
just to get some special case running, feels wrong to me.
> On Tue, Nov 25, 2014 at 12:43:24PM +0100, David Hildenbrand wrote:
> > I recently discovered that commit 662bbcb2747c2422cf98d3d97619509379eee466
> > removed/skipped all might_sleep checks for might_fault() calls when in atomic.
>
> Yes. You can add e.g. might_sleep in your code if, for some reason, it is
> illegal to call it in an atomic context.
> But faults are legal in atomic if you handle the possible
> errors, and faults do not necessary cause caller to sleep, so might_fault
> should not call might_sleep.
My point is that (almost) everywhere we use pagefault_disable, we are already
using the inatomic variants. Also, the documentation of copy_to_user() clearly
states at various points that this function "may sleep":
-> git grep "This function may sleep" yields
"Context: User context only. This function may sleep." e.g. on s390, x86, mips
Patching the might_sleep() out of these functions seems more like a hack to
solve another problem - not using the inatomic variants, or finding them not to
be sufficient for the task?
So calling might_sleep() in these functions seems very right to me.
>
> > Reason was to allow calling copy_(to|from)_user while in pagefault_disabled(),
> > because otherwise, CONFIG_DEBUG_ATOMIC_SLEEP would find false positives.
>
> That wasn't the only reason BTW.
> Andi Kleen also showed that compiler did a terrible job optimizing
> get/put user when might_sleep was called.
> See e.g. this thread:
> https://lkml.org/lkml/2013/8/14/652
> There was even an lwn.net article about it, that I don't seem to be
> able to locate.
Thanks, I'll try to look it up! But:
might_sleep() will only be called when lock debugging / sleep-in-atomic
checking is enabled, so this doesn't seem to be a problem in a production
environment. Or am I missing something?
> So might_sleep is not appropriate for lightweight operations like __get_user,
> which people literally expect to be a single instruction.
I agree that the __.* variants should not call might_fault() (as I mentioned
after the table below).
>
>
> I also have a project going which handles very short packets by copying
> them into guest memory directly without waking up a thread.
> I do it by calling recvmsg on a socket, then switching to
> a thread if I get back EFAULT.
> Not yet clean enough to upstream but it does seem to cut
> the latency down quite a bit, so I'd like to have the option.
>
>
> Generally, a caller that does something like this under a spinlock:
> preempt_disable
> pagefault_disable
> error = copy_to_user
So why can't we use the inatomic variant here? This looks like an atomic
environment to me. Calling a function that states that it may sleep
doesn't feel right to me.
> pagefault_enable
> preempt_enable_no_resched
>
> is not doing anything wrong and should not get a warning,
> as long as error is handled correctly later.
>
> You can also find the discussion around the patches
> educational:
> http://article.gmane.org/gmane.comp.emulators.kvm.devel/109928
>
> > However
> > we have the inatomic variants of these function for this purpose.
>
> Does inatomic install fixups now?
If not, I think this would rather be the way to go.
>
> Last I checked, it didn't so it did not satisfy this purpose.
> See this comment from x86:
>
> * Copy data from kernel space to user space. Caller must check
> * the specified block with access_ok() before calling this function.
> * The caller should also make sure he pins the user space address
> * so that we don't result in page fault and sleep.
>
>
> Also - switching to inatomic would scatter if (atomic) all
> over code. It's much better to simply call the same
> function (e.g. recvmsg) and have it fail gracefully:
> after all, we have code to handle get/put user errors
> anyway.
>
> > The result of this change was that all guest access (that correctly uses
> > might_fault()) doesn't perform atomic checks when CONFIG_DEBUG_ATOMIC_SLEEP is
> > enabled. We lost a mighty debugging feature for user access.
>
> What's the path you are trying to debug?
Well, we had a problem where we held a spin_lock and called
copy_(from|to)_user(). We experienced very random deadlocks that took some guy
almost a week to debug. The simple might_sleep() check would have shown this
error immediately.
I would say that in 99.9% of all copy_to_user() calls we want the might_sleep()
check. pagefault_disable() is a special case that should be
handled differently - in my opinion.
>
> If your code can faults, then it's safe to call
> from atomic context.
> If it can't, it must pin the page. You can not do access_ok checks and
> then assume access won't fault.
>
> > As I wasn't able to come up with any other reason why this should be
> > necessary, I suggest turning the might_sleep() checks on again in the
> > might_fault() code.
>
> Faults triggered with pagefault_disabled do not cause
> caller to sleep, merely to fail and get an error,
> so might_sleep is simply wrong.
>
> >
> > pagefault_disable/pagefault_enable seems to be used mainly for futex, perf event
> > and kmap.
> >
> > Going over all changes since that commit, it seems like most code already uses the
> > inatomic variants of copy_(to|from)_user. Code relying on (get|put)_user during
> > pagefault_disable() don't make use of any might_fault() in their (get|put)_user
> > implementation. Examples:
> > - arch/m68k/include/asm/futex.h
> > - arch/parisc/include/asm/futex.h
> > - arch/sh/include/asm/futex-irq.h
> > - arch/tile/include/asm/futex.h
> > So changing might_fault() back to trigger might_sleep() won't change a thing for
> > them. Hope I haven't missed anything.
>
> Did you check the generated code?
Nope, not yet. But as I said, if lock debugging is not enabled, this should
remain as is - without any might_sleep() checks.
> On x86 it seems to me this patchset is definitely going to
> slow things down, in fact putting back in a performance regression
> that Andi found.
>
>
> > I only identified one might_fault() that would get triggered by get_user() on
> > powerpc and fixed it by using the inatomic variant (not tested). I am not sure
> > if we need some non-sleeping access_ok() prior to __get_user_inatomic().
> >
> > By looking at the code I was wondering where the correct place for might_fault()
> > calls is? Doesn't seem to be very consistent. Examples:
> >
> > | asm-generic | arm | arm64 | frv | m32r | x86 and s390
> > ---------------------------------------------------------------------------
> > get_user() | Yes | Yes | Yes | No | Yes | Yes
> > __get_user() | No | Yes | No | No | Yes | No
> > put_user() | Yes | Yes | Yes | No | Yes | Yes
> > __put_user() | No | Yes | No | No | Yes | No
> > copy_to_user() | Yes | No | No | Yes | Yes | Yes
> > __copy_to_user() | No | No | No | Yes | No | No
> > copy_from_user() | Yes | No | No | Yes | Yes | Yes
> > __copy_from_user() | No | No | No | Yes | No | No
> >
>
> I think it would be a mistake to make this change.
>
> Most call-sites handle faults in atomic just fine by
> returning error to caller.
>
> > So I would have assume that the way asm-generic, x86 and s390 (+ propably
> > others) do this is the right way?
> > So we can speed up multiple calls to e.g. copy_to_user() by doing the access
> > check manually (and also the might_fault() if relevant), then calling
> > __copy_to_user().
> >
> > So in general, I conclude that the concept is:
> > 1. __.* variants perform no checking and don't call might_fault()
> > 2. [a-z].* variants perform access checking (access_ok() if implemented)
> > 3. [a-z].* variants call might_fault()
> > 4. .*_inatomic variants usually don't perform access checks
> > 5. .*_inatomic variants don't call might_fault()
> > 6. If common code uses the __.* variants, it has to trigger access_ok() and
> > call might_fault()
> > 7. For pagefault_disable(), the inatomic variants are to be used
>
> inatomic variants don't seem to handle faults, so you
> must pin any memory you pass to them.
>
Would that be an option for your special case?
>
> > Comments? Opinions?
> >
>
> If the same address is accessed multiple times, access_ok + __
> variant can be used to speed access up a bit.
> This is rarely the case, but this is the case for e.g. vhost.
> But access_ok does not guarantee that no fault will trigger:
> there's really no way to do that ATM except pinning the page.
>
>
> > David Hildenbrand (2):
> > powerpc/fsl-pci: atomic get_user when pagefault_disabled
> > mm, sched: trigger might_sleep() in might_fault() when atomic
> >
> > arch/powerpc/sysdev/fsl_pci.c | 2 +-
> > include/linux/kernel.h | 8 ++++++--
> > mm/memory.c | 11 ++++-------
> > 3 files changed, 11 insertions(+), 10 deletions(-)
> >
> > --
> > 1.8.5.5
>
On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
> > What's the path you are trying to debug?
>
> Well, we had a problem where we held a spin_lock and called
> copy_(from|to)_user(). We experienced very random deadlocks that took some guy
> almost a week to debug. The simple might_sleep() check would have showed this
> error immediately.
This must have been a very old kernel.
A modern kernel will return an error from copy_to_user.
Which is really the point of the patch you are trying to revert.
Hmm, you sent the same mail to me off-list, and I replied there.
Now there's a copy on the list - I'm just going to assume
it's exactly identical and paste my response here as well.
If there are more questions I missed, let me know.
On Wed, Nov 26, 2014 at 09:23:31AM +0100, David Hildenbrand wrote:
> Sorry I haven't put you on cc; I must have lost you while assembling my CC list :)
> Thanks for your answer!
>
> This change was introduced in 2013, and I haven't seen an instance making use
> of your described scenario, right?
My work is still out of tree. RHEL6 shipped some patches that use
this. I don't know whether any instances in-tree use this capability.
But it just doesn't make sense for might_fault to call might_sleep
because a fault does not imply sleep.
> > On Tue, Nov 25, 2014 at 12:43:24PM +0100, David Hildenbrand wrote:
> > > I recently discovered that commit 662bbcb2747c2422cf98d3d97619509379eee466
> > > removed/skipped all might_sleep checks for might_fault() calls when in atomic.
> >
> > Yes. You can add e.g. might_sleep in your code if, for some reason, it is
> > illegal to call it in an atomic context.
> > But faults are legal in atomic if you handle the possible
> > errors, and faults do not necessary cause caller to sleep, so might_fault
> > should not call might_sleep.
>
> I think we should use the inatomic variants for this purpose (as done in the code
> for now) - especially as pagefault_disable has been relying on the preempt
> count for a long time. But see my comment below about existing documentation.
IIUC they are not interchangeable.
*inatomic seems to require that the page is pinned.
*user does not: it installs a fixup so you get an error code if you try
to access an invalid address.
Besides, this would just lead to a ton of
if (atomic) return inatomic else return user
branches in code, for no good purpose.
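Purely to illustrate that objection, a sketch of what call sites would start
to look like (the helper, the in_atomic_ctx flag and the x86-style
__copy_to_user_inatomic() are assumptions):

/*
 * Sketch only: every caller would have to know its own context and
 * pick the variant by hand.  Note that the inatomic variant also skips
 * the access_ok() check, so it is not a drop-in replacement.
 */
static unsigned long copy_out(void __user *dst, const void *src, size_t len,
			      bool in_atomic_ctx)
{
	if (in_atomic_ctx)
		return __copy_to_user_inatomic(dst, src, len);
	return copy_to_user(dst, src, len);
}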
> >
> > > Reason was to allow calling copy_(to|from)_user while in pagefault_disabled(),
> > > because otherwise, CONFIG_DEBUG_ATOMIC_SLEEP would find false positives.
> >
> > That wasn't the only reason BTW.
> > Andi Kleen also showed that compiler did a terrible job optimizing
> > get/put user when might_sleep was called.
>
> might_fault() should never call might_sleep() when lock debugging is off, so
> there should be no performance problem - or am I missing something?
CONFIG_DEBUG_ATOMIC_SLEEP too, no?
> > See e.g. this thread:
> > https://lkml.org/lkml/2013/8/14/652
> > There was even an lwn.net article about it, that I don't seem to be
> > able to locate.
>
> Thanks, will see if I can find it.
>
> > So might_sleep is not appropriate for lightweight operations like __get_user,
> > which people literally expect to be a single instruction.
>
> Yes, as discussed below, __.* variants should never call it.
So that would be even more inconsistent. They fault exactly the same as
the non __ variants.
> >
> >
> > I also have a project going which handles very short packets by copying
> > them into guest memory directly without waking up a thread.
> > I do it by calling recvmsg on a socket, then switching to
> > a thread if I get back EFAULT.
> > Not yet clean enough to upstream but it does seem to cut
> > the latency down quite a bit, so I'd like to have the option.
> >
> >
> > Generally, a caller that does something like this under a spinlock:
> > preempt_disable
> > pagefault_disable
> > error = copy_to_user
> > pagefault_enable
> > preempt_enable_no_resched
> >
> > is not doing anything wrong and should not get a warning,
> > as long as error is handled correctly later.
>
> I think this would be a perfect fit for an inatomic variant, no?
No, because inatomic does not handle faults.
> I mean even the copy_to_user documentation of e.g. s390, x86, mips
> clearly says:
> "Context: User context only.>-This function may sleep."
So the comment is incomplete - I didn't get around to fixing that.
It may sleep, but not in atomic context.
> So calling it from your described scenario is wrong. And, as the interface says,
> it might_sleep() and therefore also triggers the check in might_fault().
Exactly the reverse.
There's no way for it to sleep except on a fault, and that only if
preemption is on.
> >
> > You can also find the discussion around the patches
> > educational:
> > http://article.gmane.org/gmane.comp.emulators.kvm.devel/109928
> >
> > > However
> > > we have the inatomic variants of these function for this purpose.
> >
> > Does inatomic install fixups now?
>
> I think this varies between architectures, but I am no expert. But as 99.9% of
> all pagefault_disable code uses inatomic, I would have guessed that this is
> rather the way to go than simply using the non-atomic variant that clearly
> states on its interface that it might sleep?
Let's go ahead and make the comment more exact then.
> >
> > Last I checked, it didn't so it did not satisfy this purpose.
> > See this comment from x86:
> >
> > * Copy data from kernel space to user space. Caller must check
> > * the specified block with access_ok() before calling this function.
> > * The caller should also make sure he pins the user space address
> > * so that we don't result in page fault and sleep.
> >
> >
> > Also - switching to inatomic would scatter if (atomic) all
> > over code. It's much better to simply call the same
> > function (e.g. recvmsg) and have it fail gracefully:
> > after all, we have code to handle get/put user errors
> > anyway.
> >
> > > The result of this change was that all guest access (that correctly uses
> > > might_fault()) doesn't perform atomic checks when CONFIG_DEBUG_ATOMIC_SLEEP is
> > > enabled. We lost a mighty debugging feature for user access.
> >
> > What's the path you are trying to debug?
> >
>
> Well, if you are holding a spin_lock and call copy_to_guest(), you would love to
> see why you get deadlocks; such bugs are really hard to find (... and might
> take people almost a week to identify ...)
Is copy_to_guest the same as copy_to_user?
I was unable to find it in the tree.
If yes, you will not get deadlocks - it will simply fail.
> > If your code can faults, then it's safe to call
> > from atomic context.
> > If it can't, it must pin the page. You can not do access_ok checks and
> > then assume access won't fault.
> >
> > > As I wasn't able to come up with any other reason why this should be
> > > necessary, I suggest turning the might_sleep() checks on again in the
> > > might_fault() code.
> >
> > Faults triggered with pagefault_disabled do not cause
> > caller to sleep, merely to fail and get an error,
> > so might_sleep is simply wrong.
>
> And my point is that we need a separate function for this scenario
To me it looks like you want to add a bunch of code for the sole purpose
of making it easier to debug.
> (in my
> opinion inatomic) - I mean the caller knows what he is doing, so he can handle
> it properly with inatomic, or am I missing something?
No, the caller gets a pointer from userspace, so it still must be
validated; inatomic does not do this.
> >
> > >
> > > pagefault_disable/pagefault_enable seems to be used mainly for futex, perf event
> > > and kmap.
> > >
> > > Going over all changes since that commit, it seems like most code already uses the
> > > inatomic variants of copy_(to|from)_user. Code relying on (get|put)_user during
> > > pagefault_disable() don't make use of any might_fault() in their (get|put)_user
> > > implementation. Examples:
> > > - arch/m68k/include/asm/futex.h
> > > - arch/parisc/include/asm/futex.h
> > > - arch/sh/include/asm/futex-irq.h
> > > - arch/tile/include/asm/futex.h
> > > So changing might_fault() back to trigger might_sleep() won't change a thing for
> > > them. Hope I haven't missed anything.
> >
> > Did you check the generated code?
> > On x86 it seems to me this patchset is definitely going to
> > slow things down, in fact putting back in a performance regression
> > that Andi found.
>
> Should be optimized out without lock debugging, right?
You can compile it out, but CONFIG_DEBUG_ATOMIC_SLEEP
is pretty common.
> >
> >
> > > I only identified one might_fault() that would get triggered by get_user() on
> > > powerpc and fixed it by using the inatomic variant (not tested). I am not sure
> > > if we need some non-sleeping access_ok() prior to __get_user_inatomic().
> > >
> > > By looking at the code I was wondering where the correct place for might_fault()
> > > calls is? Doesn't seem to be very consistent. Examples:
> > >
> > > | asm-generic | arm | arm64 | frv | m32r | x86 and s390
> > > ---------------------------------------------------------------------------
> > > get_user() | Yes | Yes | Yes | No | Yes | Yes
> > > __get_user() | No | Yes | No | No | Yes | No
> > > put_user() | Yes | Yes | Yes | No | Yes | Yes
> > > __put_user() | No | Yes | No | No | Yes | No
> > > copy_to_user() | Yes | No | No | Yes | Yes | Yes
> > > __copy_to_user() | No | No | No | Yes | No | No
> > > copy_from_user() | Yes | No | No | Yes | Yes | Yes
> > > __copy_from_user() | No | No | No | Yes | No | No
> > >
> >
> > I think it would be a mistake to make this change.
> >
> > Most call-sites handle faults in atomic just fine by
> > returning error to caller.
> >
> > > So I would have assume that the way asm-generic, x86 and s390 (+ propably
> > > others) do this is the right way?
> > > So we can speed up multiple calls to e.g. copy_to_user() by doing the access
> > > check manually (and also the might_fault() if relevant), then calling
> > > __copy_to_user().
> > >
> > > So in general, I conclude that the concept is:
> > > 1. __.* variants perform no checking and don't call might_fault()
> > > 2. [a-z].* variants perform access checking (access_ok() if implemented)
> > > 3. [a-z].* variants call might_fault()
> > > 4. .*_inatomic variants usually don't perform access checks
> > > 5. .*_inatomic variants don't call might_fault()
> > > 6. If common code uses the __.* variants, it has to trigger access_ok() and
> > > call might_fault()
> > > 7. For pagefault_disable(), the inatomic variants are to be used
> >
> > inatomic variants don't seem to handle faults, so you
> > must pin any memory you pass to them.
>
> And we don't have a single code part in the system that relies on this, right?
Hmm, relies on what?
Do you want to change *inatomic to actually handle faults?
That's fine by me, but if you do, won't you need might_fault there?
And then your patch will be wrong, won't it?
> >
> >
> > > Comments? Opinions?
> > >
> >
> > If the same address is accessed multiple times, access_ok + __
> > variant can be used to speed access up a bit.
> > This is rarely the case, but this is the case for e.g. vhost.
> > But access_ok does not guarantee that no fault will trigger:
> > there's really no way to do that ATM except pinning the page.
> >
> >
On Wed, Nov 26, 2014 at 05:17:29PM +0200, Michael S. Tsirkin wrote:
> On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
> > > What's the path you are trying to debug?
> >
> > Well, we had a problem where we held a spin_lock and called
> > copy_(from|to)_user(). We experienced very random deadlocks that took some guy
> > almost a week to debug. The simple might_sleep() check would have showed this
> > error immediately.
>
> This must have been a very old kernel.
> A modern kernel will return an error from copy_to_user.
> Which is really the point of the patch you are trying to revert.
That's assuming you disabled preemption. If you didn't, and take
a spinlock, you have deadlocks even without userspace access.
--
MST
On 26.11.2014 16:17, Michael S. Tsirkin wrote:
> On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
>>> What's the path you are trying to debug?
>>
>> Well, we had a problem where we held a spin_lock and called
>> copy_(from|to)_user(). We experienced very random deadlocks that took some guy
>> almost a week to debug. The simple might_sleep() check would have showed this
>> error immediately.
>
> This must have been a very old kernel.
> A modern kernel will return an error from copy_to_user.
I disagree. copy_to_user will not return an error while holding a spinlock, because it does not know! How should it?
See: spin_lock will call preempt_disable, but that's a no-op for a non-preempt kernel. So the mere fact that we hold a spin_lock is not known by any user access function (or others). No?
Christian
> On Wed, Nov 26, 2014 at 05:17:29PM +0200, Michael S. Tsirkin wrote:
> > On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
> > > > What's the path you are trying to debug?
> > >
> > > Well, we had a problem where we held a spin_lock and called
> > > copy_(from|to)_user(). We experienced very random deadlocks that took some guy
> > > almost a week to debug. The simple might_sleep() check would have showed this
> > > error immediately.
> >
> > This must have been a very old kernel.
> > A modern kernel will return an error from copy_to_user.
> > Which is really the point of the patch you are trying to revert.
>
> That's assuming you disabled preemption. If you didn't, and take
> a spinlock, you have deadlocks even without userspace access.
>
(Thanks for your resent, my first email was sent directly to you ... grml)
This is what happened on our side (very recent kernel):
spin_lock(&lock)
copy_to_user(...)
spin_unlock(&lock)
1. s390 locks/unlocks a spin lock with a compare and swap, using the _cpu id_
as "old value"
2. we slept during copy_to_user()
3. the thread got scheduled onto another cpu
4. spin_unlock failed as the _cpu id_ didn't match (another cpu that locked
the spinlock tried to unlock it).
5. lock remained locked -> deadlock
Christian came up with the following explanation:
Without preemption, spin_lock() will not touch the preempt counter.
pagefault_disable() will always touch it.
Therefore, with preemption disabled, copy_to_user() has no idea that it is
running in atomic context - and will therefore try to sleep.
So copy_to_user() will on s390:
1. run "as atomic" while spin_lock() with preemption enabled.
2. run "as not atomic" while spin_lock() with preemption disabled.
3. run "as atomic" while pagefault_disabled() with preemption enabled or
disabled.
4. run "as not atomic" when really not atomic.
And exactly nr 2. is the thing that produced the deadlock in our scenario and
the reason why I want a might_sleep() :)
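For comparison, a sketch (names made up) of how such code is typically
reworked: snapshot the shared data under the lock and only touch user memory
after the lock has been dropped.

#include <linux/spinlock.h>
#include <linux/uaccess.h>

struct some_state {			/* hypothetical shared data */
	u64 counter;
};

static DEFINE_SPINLOCK(state_lock);
static struct some_state state;

static int report_state(struct some_state __user *dst)
{
	struct some_state snapshot;

	spin_lock(&state_lock);
	snapshot = state;		/* no user access while holding the lock */
	spin_unlock(&state_lock);

	if (copy_to_user(dst, &snapshot, sizeof(snapshot)))
		return -EFAULT;
	return 0;
}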
On Wed, Nov 26, 2014 at 04:30:32PM +0100, Christian Borntraeger wrote:
> Am 26.11.2014 um 16:17 schrieb Michael S. Tsirkin:
> > On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
> >>> What's the path you are trying to debug?
> >>
> >> Well, we had a problem where we held a spin_lock and called
> >> copy_(from|to)_user(). We experienced very random deadlocks that took some guy
> >> almost a week to debug. The simple might_sleep() check would have showed this
> >> error immediately.
> >
>
> > This must have been a very old kernel.
> > A modern kernel will return an error from copy_to_user.
>
> I disagree. copy_to_user will not return while holding a spinlock, because it does not know! How should it?
> See: spin_lock will call preempt_disable, but thats a no-op for a non-preempt kernel. So the mere fact that we hold a spin_lock is not known by any user access function. (or others). No?
>
> Christian
>
>
Well, might_sleep() merely checks the preempt count and irqs_disabled too.
If you want debugging checks to trigger, you need to enable
a bunch of config options. That's not new.
--
MST
On Wed, Nov 26, 2014 at 04:32:07PM +0100, David Hildenbrand wrote:
> > On Wed, Nov 26, 2014 at 05:17:29PM +0200, Michael S. Tsirkin wrote:
> > > On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
> > > > > What's the path you are trying to debug?
> > > >
> > > > Well, we had a problem where we held a spin_lock and called
> > > > copy_(from|to)_user(). We experienced very random deadlocks that took some guy
> > > > almost a week to debug. The simple might_sleep() check would have showed this
> > > > error immediately.
> > >
> > > This must have been a very old kernel.
> > > A modern kernel will return an error from copy_to_user.
> > > Which is really the point of the patch you are trying to revert.
> >
> > That's assuming you disabled preemption. If you didn't, and take
> > a spinlock, you have deadlocks even without userspace access.
> >
>
> (Thanks for your resent, my first email was sent directly to you ... grml)
>
> This is what happened on our side (very recent kernel):
>
> spin_lock(&lock)
> copy_to_user(...)
> spin_unlock(&lock)
That's a deadlock even without copy_to_user - it's
enough for the thread to be preempted and another one
to try taking the lock.
> 1. s390 locks/unlocks a spin lock with a compare and swap, using the _cpu id_
> as "old value"
> 2. we slept during copy_to_user()
> 3. the thread got scheduled onto another cpu
> 4. spin_unlock failed as the _cpu id_ didn't match (another cpu that locked
> the spinlock tried to unlocked it).
> 5. lock remained locked -> deadlock
>
> Christian came up with the following explanation:
> Without preemption, spin_lock() will not touch the preempt counter.
> disable_pfault() will always touch it.
>
> Therefore, with preemption disabled, copy_to_user() has no idea that it is
> running in atomic context - and will therefore try to sleep.
>
> So copy_to_user() will on s390:
> 1. run "as atomic" while spin_lock() with preemption enabled.
> 2. run "as not atomic" while spin_lock() with preemption disabled.
> 3. run "as atomic" while pagefault_disabled() with preemption enabled or
> disabled.
> 4. run "as not atomic" when really not atomic.
>
> And exactly nr 2. is the thing that produced the deadlock in our scenario and
> the reason why I want a might_sleep() :)
IMHO it's not copy_to_user that causes the problem.
It's the misuse of spinlocks with preemption on.
So might_sleep would make you think copy_to_user is
the problem, and e.g. let you paper over it by
moving copy_to_user out.
Enable the lock prover and you will see what the real
issue is, which is that you didn't disable preemption -
and if you did, copy_to_user would be okay.
--
MST
> > This is what happened on our side (very recent kernel):
> >
> > spin_lock(&lock)
> > copy_to_user(...)
> > spin_unlock(&lock)
>
> That's a deadlock even without copy_to_user - it's
> enough for the thread to be preempted and another one
> to try taking the lock.
>
>
> > 1. s390 locks/unlocks a spin lock with a compare and swap, using the _cpu id_
> > as "old value"
> > 2. we slept during copy_to_user()
> > 3. the thread got scheduled onto another cpu
> > 4. spin_unlock failed as the _cpu id_ didn't match (another cpu that locked
> > the spinlock tried to unlocked it).
> > 5. lock remained locked -> deadlock
> >
> > Christian came up with the following explanation:
> > Without preemption, spin_lock() will not touch the preempt counter.
> > disable_pfault() will always touch it.
> >
> > Therefore, with preemption disabled, copy_to_user() has no idea that it is
> > running in atomic context - and will therefore try to sleep.
> >
> > So copy_to_user() will on s390:
> > 1. run "as atomic" while spin_lock() with preemption enabled.
> > 2. run "as not atomic" while spin_lock() with preemption disabled.
> > 3. run "as atomic" while pagefault_disabled() with preemption enabled or
> > disabled.
> > 4. run "as not atomic" when really not atomic.
I should have been clearer at that point:
preemption enabled == kernel compiled with preemption support
preemption disabled == kernel compiled without preemption support
> >
> > And exactly nr 2. is the thing that produced the deadlock in our scenario and
> > the reason why I want a might_sleep() :)
>
> IMHO it's not copy to user that causes the problem.
> It's the misuse of spinlocks with preemption on.
As I said, preemption was off.
>
> So might_sleep would make you think copy_to_user is
> the problem, and e.g. let you paper over it by
> moving copy_to_user out.
Actually, implementing a different way of locking easily fixed the problem for
us. The old might_sleep() checks would have shown us the problem within a few
seconds (I tested it).
>
> Enable lock prover and you will see what the real
> issue is, which is you didn't disable preempt.
> and if you did, copy_to_user would be okay.
>
Our kernel is compiled without preemption and we turned on all lock/atomic
sleep debugging aids. No problem was detected.
----
But the question is whether we shouldn't rather provide a
copy_to_user_nosleep() implementation that can be called from
pagefault_disable() because it won't sleep,
and a
copy_to_user_sleep() implementation that cannot be called from
pagefault_disable().
Another way to fix it would be a reworked pagefault_disable() function that
somehow sets "a flag", so copy_to_user() knows that it is in fact called from a
valid context, not just from "some atomic" context. We could then trigger
might_sleep() when detecting a !pagefault_disable() context; a rough sketch of
that idea follows.
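Rough sketch only - the per-task pagefault_disabled counter and the exact
hook into might_fault() are assumptions, not existing code:

/*
 * Track pagefault_disable() nesting separately from the preempt count,
 * so might_fault() can tell a legitimate pagefault_disable() section
 * apart from some other atomic context and still warn about the latter.
 */
static inline void pagefault_disable(void)
{
	current->pagefault_disabled++;
	barrier();
}

static inline void pagefault_enable(void)
{
	barrier();
	current->pagefault_disabled--;
}

static inline void might_fault(void)
{
	if (current->pagefault_disabled)
		return;		/* faults just fail here, no sleeping */
	might_sleep();		/* catches spinlock/atomic misuse */
}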
On 26.11.2014 16:37, Michael S. Tsirkin wrote:
> On Wed, Nov 26, 2014 at 04:30:32PM +0100, Christian Borntraeger wrote:
>> Am 26.11.2014 um 16:17 schrieb Michael S. Tsirkin:
>>> On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
>>>>> What's the path you are trying to debug?
>>>>
>>>> Well, we had a problem where we held a spin_lock and called
>>>> copy_(from|to)_user(). We experienced very random deadlocks that took some guy
>>>> almost a week to debug. The simple might_sleep() check would have showed this
>>>> error immediately.
>>>
>>
>>> This must have been a very old kernel.
>>> A modern kernel will return an error from copy_to_user.
>>
>> I disagree. copy_to_user will not return while holding a spinlock, because it does not know! How should it?
>> See: spin_lock will call preempt_disable, but thats a no-op for a non-preempt kernel. So the mere fact that we hold a spin_lock is not known by any user access function. (or others). No?
>>
>> Christian
>>
>>
>
> Well might_sleep() merely checks preempt count and irqs_disabled too.
> If you want debugging things to trigger, you need to enable
> a bunch of config options. That's not new.
You miss the point of the whole thread: the problem is that even with debug options enabled, holding a spinlock would not trigger a bug on copy_to_user. So the problem is not the good path; the problem is that a debugging aid for detecting a broken case was lost - even with all kernel debugging enabled.
That is because CONFIG_DEBUG_ATOMIC_SLEEP selects PREEMPT_COUNT. That means spin_lock will then be considered as in_atomic and no message appears. Without CONFIG_DEBUG_ATOMIC_SLEEP, spin_lock will not touch the preempt count, but we also don't see a message because might_fault is now a no-op.
I understand that you don't like David's changes due to the other side effects you have mentioned. So let's focus on how we can fix the debug option. Ok?
Christian
On 26.11.2014 16:47, Michael S. Tsirkin wrote:
> On Wed, Nov 26, 2014 at 04:32:07PM +0100, David Hildenbrand wrote:
>>> On Wed, Nov 26, 2014 at 05:17:29PM +0200, Michael S. Tsirkin wrote:
>>>> On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
>>>>>> What's the path you are trying to debug?
>>>>>
>>>>> Well, we had a problem where we held a spin_lock and called
>>>>> copy_(from|to)_user(). We experienced very random deadlocks that took some guy
>>>>> almost a week to debug. The simple might_sleep() check would have showed this
>>>>> error immediately.
>>>>
>>>> This must have been a very old kernel.
>>>> A modern kernel will return an error from copy_to_user.
>>>> Which is really the point of the patch you are trying to revert.
>>>
>>> That's assuming you disabled preemption. If you didn't, and take
>>> a spinlock, you have deadlocks even without userspace access.
>>>
>>
>> (Thanks for your resent, my first email was sent directly to you ... grml)
>>
>> This is what happened on our side (very recent kernel):
>>
>> spin_lock(&lock)
>> copy_to_user(...)
>> spin_unlock(&lock)
>
> That's a deadlock even without copy_to_user - it's
> enough for the thread to be preempted and another one
> to try taking the lock.
Huh? With CONFIG_PREEMPT spin_lock will disable preemption. (We had preempt = server anyway.)
But please, one step back: the problem is not the good path. The problem is that we lost a debugging aid for a case known to be broken. In other words: our code had a bug. Older kernels detected that kind of bug. With your change we no longer saw the sleeping-while-atomic warning. That's it. See my other mail.
Christian
On Wed, Nov 26, 2014 at 05:02:23PM +0100, David Hildenbrand wrote:
> > > This is what happened on our side (very recent kernel):
> > >
> > > spin_lock(&lock)
> > > copy_to_user(...)
> > > spin_unlock(&lock)
> >
> > That's a deadlock even without copy_to_user - it's
> > enough for the thread to be preempted and another one
> > to try taking the lock.
> >
> >
> > > 1. s390 locks/unlocks a spin lock with a compare and swap, using the _cpu id_
> > > as "old value"
> > > 2. we slept during copy_to_user()
> > > 3. the thread got scheduled onto another cpu
> > > 4. spin_unlock failed as the _cpu id_ didn't match (another cpu that locked
> > > the spinlock tried to unlocked it).
> > > 5. lock remained locked -> deadlock
> > >
> > > Christian came up with the following explanation:
> > > Without preemption, spin_lock() will not touch the preempt counter.
> > > disable_pfault() will always touch it.
> > >
> > > Therefore, with preemption disabled, copy_to_user() has no idea that it is
> > > running in atomic context - and will therefore try to sleep.
> > >
> > > So copy_to_user() will on s390:
> > > 1. run "as atomic" while spin_lock() with preemption enabled.
> > > 2. run "as not atomic" while spin_lock() with preemption disabled.
> > > 3. run "as atomic" while pagefault_disabled() with preemption enabled or
> > > disabled.
> > > 4. run "as not atomic" when really not atomic.
>
> should have been more clear at that point:
> preemption enabled == kernel compiled with preemption support
> preemption disabled == kernel compiled without preemption support
>
> > >
> > > And exactly nr 2. is the thing that produced the deadlock in our scenario and
> > > the reason why I want a might_sleep() :)
> >
> > IMHO it's not copy to user that causes the problem.
> > It's the misuse of spinlocks with preemption on.
>
> As I said, preemption was off.
off -> disabled at compile time?
But the code is broken for people that do enable it.
> >
> > So might_sleep would make you think copy_to_user is
> > the problem, and e.g. let you paper over it by
> > moving copy_to_user out.
>
> Actually implementing different way of locking easily fixed the problem for us.
> The old might_sleep() checks would have given us the problem within a few
> seconds (I tested it).
Or enable CONFIG_PREEMPT, with the same effect (copy_to_user will report
an error).
Do you check the return code from copy_to_user?
If not, then you have another bug ...
> >
> > Enable lock prover and you will see what the real
> > issue is, which is you didn't disable preempt.
> > and if you did, copy_to_user would be okay.
> >
>
> Our kernel is compiled without preemption and we turned on all lock/atomic
> sleep debugging aid. No problem was detected.
But your code is still buggy with preemption on, isn't it?
> ----
> But the question is if we shouldn't rather provide a:
>
> copy_to_user_nosleep() implementation that can be called from
> pagefault_disable() because it won't sleep.
> and a
> copy_to_user_sleep() implementation that cannot be called from
> pagefault_disable().
>
> Another way to fix it would be a reworked pagefault_disable() function that
> somehow sets "a flag", so copy_to_user() knows that it is in fact called from a
> valid context, not just from "some atomic" context. So we could trigger
> might_sleep() when detecting a !pagefault_disable contex
I think all this is just directing people to paper over the
problem. You should normally disable preemption if you take
spinlocks.
Yes, it might happen to work if preempt is compiled out
and you don't trigger the scheduler, but Linux might
add scheduler calls at any point without notice;
code must be preempt-safe.
Maybe add a debug option warning about spinlocks taken
with preempt on.
That would make sense, I think.
--
MST
On 26.11.2014 17:19, Michael S. Tsirkin wrote:
> On Wed, Nov 26, 2014 at 05:02:23PM +0100, David Hildenbrand wrote:
>>>> This is what happened on our side (very recent kernel):
>>>>
>>>> spin_lock(&lock)
>>>> copy_to_user(...)
>>>> spin_unlock(&lock)
>>>
>>> That's a deadlock even without copy_to_user - it's
>>> enough for the thread to be preempted and another one
>>> to try taking the lock.
>>>
>>>
>>>> 1. s390 locks/unlocks a spin lock with a compare and swap, using the _cpu id_
>>>> as "old value"
>>>> 2. we slept during copy_to_user()
>>>> 3. the thread got scheduled onto another cpu
>>>> 4. spin_unlock failed as the _cpu id_ didn't match (another cpu that locked
>>>> the spinlock tried to unlocked it).
>>>> 5. lock remained locked -> deadlock
>>>>
>>>> Christian came up with the following explanation:
>>>> Without preemption, spin_lock() will not touch the preempt counter.
>>>> disable_pfault() will always touch it.
>>>>
>>>> Therefore, with preemption disabled, copy_to_user() has no idea that it is
>>>> running in atomic context - and will therefore try to sleep.
>>>>
>>>> So copy_to_user() will on s390:
>>>> 1. run "as atomic" while spin_lock() with preemption enabled.
>>>> 2. run "as not atomic" while spin_lock() with preemption disabled.
>>>> 3. run "as atomic" while pagefault_disabled() with preemption enabled or
>>>> disabled.
>>>> 4. run "as not atomic" when really not atomic.
>>
>> should have been more clear at that point:
>> preemption enabled == kernel compiled with preemption support
>> preemption disabled == kernel compiled without preemption support
>>
>>>>
>>>> And exactly nr 2. is the thing that produced the deadlock in our scenario and
>>>> the reason why I want a might_sleep() :)
>>>
>>> IMHO it's not copy to user that causes the problem.
>>> It's the misuse of spinlocks with preemption on.
>>
>> As I said, preemption was off.
>
> off -> disabled at compile time?
>
> But the code is broken for people that do enable it.
[...]
> You should normally disable preemption if you take
> spinlocks.
You are telling me that any sequence of
spin_lock
...
spin_unlock
is broken with CONFIG_PREEMPT?
Michael, that is bullshit. spin_lock will take care of CONFIG_PREEMPT just fine.
Only sequences like
spin_lock
...
schedule
...
spin_unlock
are broken.
But as I said, that is not the problem we are discussing here.
Christian
On Wed, Nov 26, 2014 at 05:07:13PM +0100, Christian Borntraeger wrote:
> Am 26.11.2014 um 16:47 schrieb Michael S. Tsirkin:
> > On Wed, Nov 26, 2014 at 04:32:07PM +0100, David Hildenbrand wrote:
> >>> On Wed, Nov 26, 2014 at 05:17:29PM +0200, Michael S. Tsirkin wrote:
> >>>> On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
> >>>>>> What's the path you are trying to debug?
> >>>>>
> >>>>> Well, we had a problem where we held a spin_lock and called
> >>>>> copy_(from|to)_user(). We experienced very random deadlocks that took some guy
> >>>>> almost a week to debug. The simple might_sleep() check would have showed this
> >>>>> error immediately.
> >>>>
> >>>> This must have been a very old kernel.
> >>>> A modern kernel will return an error from copy_to_user.
> >>>> Which is really the point of the patch you are trying to revert.
> >>>
> >>> That's assuming you disabled preemption. If you didn't, and take
> >>> a spinlock, you have deadlocks even without userspace access.
> >>>
> >>
> >> (Thanks for your resent, my first email was sent directly to you ... grml)
> >>
> >> This is what happened on our side (very recent kernel):
> >>
> >> spin_lock(&lock)
> >> copy_to_user(...)
> >> spin_unlock(&lock)
> >
> > That's a deadlock even without copy_to_user - it's
> > enough for the thread to be preempted and another one
> > to try taking the lock.
>
> Huh? With CONFIG_PREEMPT spin_lock will disable preemption. (we had preempt = server anyway).
Are you sure? Can you point me to where it does this, please?
> But please: One step back. The problem is not the good path. The problem is that we lost a debugging aid for a known to be broken case. In other words: Our code had a bug. Older kernels detected that kind of bug. With your change we no longer saw the sleeping while atomic. Thats it. See my other mail.
>
> Christian
You want to add more debugging tools - fine. But this one was
giving users in the field false positives.
The point is that *_user is safe with preempt off.
It returns an error gracefully.
It does not sleep.
It does not trigger the scheduler in that context.
David's patch makes it say it does, so it's wrong.
--
MST
On Wed, Nov 26, 2014 at 05:30:35PM +0100, Christian Borntraeger wrote:
> Am 26.11.2014 um 17:19 schrieb Michael S. Tsirkin:
> > On Wed, Nov 26, 2014 at 05:02:23PM +0100, David Hildenbrand wrote:
> >>>> This is what happened on our side (very recent kernel):
> >>>>
> >>>> spin_lock(&lock)
> >>>> copy_to_user(...)
> >>>> spin_unlock(&lock)
> >>>
> >>> That's a deadlock even without copy_to_user - it's
> >>> enough for the thread to be preempted and another one
> >>> to try taking the lock.
> >>>
> >>>
> >>>> 1. s390 locks/unlocks a spin lock with a compare and swap, using the _cpu id_
> >>>> as "old value"
> >>>> 2. we slept during copy_to_user()
> >>>> 3. the thread got scheduled onto another cpu
> >>>> 4. spin_unlock failed as the _cpu id_ didn't match (another cpu that locked
> >>>> the spinlock tried to unlocked it).
> >>>> 5. lock remained locked -> deadlock
> >>>>
> >>>> Christian came up with the following explanation:
> >>>> Without preemption, spin_lock() will not touch the preempt counter.
> >>>> disable_pfault() will always touch it.
> >>>>
> >>>> Therefore, with preemption disabled, copy_to_user() has no idea that it is
> >>>> running in atomic context - and will therefore try to sleep.
> >>>>
> >>>> So copy_to_user() will on s390:
> >>>> 1. run "as atomic" while spin_lock() with preemption enabled.
> >>>> 2. run "as not atomic" while spin_lock() with preemption disabled.
> >>>> 3. run "as atomic" while pagefault_disabled() with preemption enabled or
> >>>> disabled.
> >>>> 4. run "as not atomic" when really not atomic.
> >>
> >> should have been more clear at that point:
> >> preemption enabled == kernel compiled with preemption support
> >> preemption disabled == kernel compiled without preemption support
> >>
> >>>>
> >>>> And exactly nr 2. is the thing that produced the deadlock in our scenario and
> >>>> the reason why I want a might_sleep() :)
> >>>
> >>> IMHO it's not copy_to_user that causes the problem.
> >>> It's the misuse of spinlocks with preemption on.
> >>
> >> As I said, preemption was off.
> >
> > off -> disabled at compile time?
> >
> > But the code is broken for people that do enable it.
> [...]
> > You should normally disable preemption if you take
> > spinlocks.
>
> You are telling me that any sequence of
> spin_lock
> ...
> spin_unlock
>
> is broken with CONFIG_PREEMPT?
> Michael, that is bullshit. spin_lock will take care of CONFIG_PREEMPT just fine.
>
> Only sequences like
> spin_lock
> ...
> schedule
> ...
> spin_unlock
> are broken.
>
> But as I said. That is not the problem that we are discussing here.
>
> Christian
I'm saying spin_lock without _irqsave is often a bug.
I am also saying this code in mm/fault.c:
__do_page_fault
...
/*
* If we're in an interrupt, have no user context or are running
* in an atomic region then we must not take the fault:
*/
if (unlikely(in_atomic() || !mm)) {
bad_area_nosemaphore(regs, error_code, address);
return;
}
means that a fault won't cause sleep if called in atomic context.
And a bunch of code relies on this.
This is why might_fault does:
* it would be nicer only to annotate paths which are not under
* pagefault_disable, however that requires a larger audit and
* providing helpers like get_user_atomic.
*/
if (in_atomic())
return;
__might_sleep(__FILE__, __LINE__, 0);
If you see this violated, let's figure out why.
--
MST
Am 26.11.2014 um 17:32 schrieb Michael S. Tsirkin:
[...]
>>>> This is what happened on our side (very recent kernel):
>>>>
>>>> spin_lock(&lock)
>>>> copy_to_user(...)
>>>> spin_unlock(&lock)
>>>
>>> That's a deadlock even without copy_to_user - it's
>>> enough for the thread to be preempted and another one
>>> to try taking the lock.
>>
>> Huh? With CONFIG_PREEMPT spin_lock will disable preemption. (we had preempt = server anyway).
>
> Are you sure? Can you point me where it does this please?
spin_lock --> raw_spin_lock --> _raw_spin_lock --> __raw_spin_lock
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
----> preempt_disable(); <-----
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
Michael, please be serious. The whole kernel would be broken if spin_lock did not disable preemption.
>
>> But please: one step back. The problem is not the good path. The problem is that we lost a debugging aid for a case that is known to be broken. In other words: our code had a bug. Older kernels detected that kind of bug. With your change we no longer saw the "sleeping while atomic" warning. That's it. See my other mail.
>>
>> Christian
>
> You want to add more debugging tools, fine.
We don't want to add anything, we want to fix something that used to work.
> But this one was giving users in the field false positives.
So let's try to fix those, ok? If we can't, then tough luck. But coming up with wrong statements is not helpful.
>
> The point is that *_user is safe with preempt off.
> It returns an error gracefully.
> It does not sleep.
> It does not trigger the scheduler in that context.
There are special cases where your statement is true. But it's not true in general.
copy_to_user might fault, and that fault might sleep and reschedule. For example, handle_mm_fault might go down to pud_alloc, pmd_alloc etc., and all these functions could do a GFP_KERNEL allocation. Which might sleep. Which will schedule.
>
>
> David's patch makes it say it does, so it's wrong.
>
>
>
On Wed, Nov 26, 2014 at 05:51:08PM +0100, Christian Borntraeger wrote:
> > But this one was giving users in the field false positives.
>
> So lets try to fix those, ok? If we cant, then tough luck.
Sure.
I think the simplest way might be to make spinlock disable
preemption when CONFIG_DEBUG_ATOMIC_SLEEP is enabled.
As a result, userspace access will fail and caller will
get a nice error.
> But coming up with wrong statements is not helpful.
True. Sorry that I did that.
> >
> > The point is that *_user is safe with preempt off.
> > It returns an error gracefully.
> > It does not sleep.
> > It does not trigger the scheduler in that context.
>
> There are special cases where your statement is true. But its not in general.
> copy_to_user might fault and that fault might sleep and reschedule.
Yes. But not if called in atomic context.
> For example handle_mm_fault might go down to pud_alloc, pmd_alloc etc and all these functions could do an GFP_KERNEL allocation. Which might sleep. Which will schedule.
>
>
> >
> >
> > David's patch makes it say it does, so it's wrong.
> >
> >
> >
Absolutely.
I think you can already debug your case easily, by enabling CONFIG_PREEMPT.
This seems counter-intuitive, and distro debug kernels don't seem to do this.
--
MST
On Wed, Nov 26, 2014 at 07:04:47PM +0200, Michael S. Tsirkin wrote:
> On Wed, Nov 26, 2014 at 05:51:08PM +0100, Christian Borntraeger wrote:
> > > But this one was giving users in the field false positives.
> >
> > So lets try to fix those, ok? If we cant, then tough luck.
>
> Sure.
> I think the simplest way might be to make spinlock disable
> preemption when CONFIG_DEBUG_ATOMIC_SLEEP is enabled.
Specifically maybe DEBUG_ATOMIC_SLEEP should select PREEMPT_COUNT?
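(For illustration, roughly what that could look like in lib/Kconfig.debug; treat
the exact option text and dependencies as assumptions, I didn't check the
current file:)

config DEBUG_ATOMIC_SLEEP
	bool "Sleep inside atomic section checking"
	select PREEMPT_COUNT
	depends on DEBUG_KERNEL
	help
	  If you say Y here, various routines which may sleep will become very
	  noisy if they are called inside atomic sections: when a spinlock is
	  held, inside an rcu read side critical section, inside preempt disabled
	  sections, inside an interrupt, etc...

With PREEMPT_COUNT selected, spin_lock() bumps the preempt counter even on
!CONFIG_PREEMPT kernels, so the fault handler sees in_atomic() and
copy_to_user() fails gracefully instead of sleeping.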
> As a result, userspace access will fail and caller will
> get a nice error.
>
>
>
> > But coming up with wrong statements is not helpful.
>
> True. Sorry that I did that.
>
> > >
> > > The point is that *_user is safe with preempt off.
> > > It returns an error gracefully.
> > > It does not sleep.
> > > It does not trigger the scheduler in that context.
> >
> > There are special cases where your statement is true. But its not in general.
> > copy_to_user might fault and that fault might sleep and reschedule.
>
> Yes. But not if called in atomic context.
>
>
>
> > For example handle_mm_fault might go down to pud_alloc, pmd_alloc etc and all these functions could do an GFP_KERNEL allocation. Which might sleep. Which will schedule.
> >
> >
> > >
> > >
> > > David's patch makes it say it does, so it's wrong.
> > >
> > >
> > >
>
> Absolutely.
> I think you can already debug your case easily, by enabling CONFIG_PREEMPT.
> This seems counter-intuitive, and distro debug kernels don't seem to do this.
>
> --
> MST
On Wed, Nov 26, 2014 at 07:04:47PM +0200, Michael S. Tsirkin wrote:
> On Wed, Nov 26, 2014 at 05:51:08PM +0100, Christian Borntraeger wrote:
> > > But this one was giving users in the field false positives.
> >
> > So lets try to fix those, ok? If we cant, then tough luck.
>
> Sure.
> I think the simplest way might be to make spinlock disable
> > preemption when CONFIG_DEBUG_ATOMIC_SLEEP is enabled.
>
> As a result, userspace access will fail and caller will
> get a nice error.
Yes, _userspace_ now sees unpredictable behaviour, instead of the kernel
emitting a big loud warning to the console.
Please consider this simple example:
int bar(char __user *ptr)
{
...
if (copy_to_user(ptr, ...))
return -EFAULT;
...
}
SYSCALL_DEFINE1(foo, char __user *, ptr)
{
int rc;
...
rc = bar(ptr);
if (rc)
goto out;
...
out:
return rc;
}
The above simple system call just works fine, with and without your change.
However, if somebody (incorrectly) changes sys_foo() to the code below:
spin_lock(&lock);
rc = bar(ptr);
if (rc)
goto out;
out:
spin_unlock(&lock);
return rc;
Broken code like above used to generate warnings. With your change we won't
see any warnings anymore. Instead we get random and bad behaviour:
For !CONFIG_PREEMPT, if the page at ptr is not mapped, the kernel will see
a fault, potentially schedule and potentially deadlock on &lock.
Without _any_ warning anymore.
For CONFIG_PREEMPT, if the page at ptr is mapped, everything works. However, if
the page is not mapped, userspace now all of a sudden will see an invalid(!)
-EFAULT return code, instead of the kernel resolving the page fault.
Yes, the kernel can't resolve the fault since we hold a spinlock. But the
above bogus code did give warnings to give you an idea that something probably
is not correct.
Who on earth is supposed to debug crap like this???
What we really want is:
Code like
spin_lock(&lock);
if (copy_to_user(...))
rc = ...
spin_unlock(&lock);
really *should* generate warnings like it did before.
And *only* code like
spin_lock(&lock);
page_fault_disable();
if (copy_to_user(...))
rc = ...
page_fault_enable();
spin_unlock(&lock);
should not generate warnings, since the author hopefully knew what he did.
We could achieve that by e.g. adding a couple of pagefault-disabled bits
within current_thread_info()->preempt_count, which would allow
pagefault_disable() and pagefault_enable() to modify a different part of
preempt_count than they do now. That way there is a way to tell if pagefaults
have been explicitly disabled or are just a side effect of preemption being
disabled.
This would allow might_fault() to restore its old sane behaviour for the
!page_fault_disabled() case.
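To make that concrete, a rough sketch of the idea (bit positions and names are
made up here, this is not a finished patch):

/* hypothetical: reserve a few bits of preempt_count for pagefault_disable() */
#define PAGEFAULT_DISABLED_SHIFT	22
#define PAGEFAULT_DISABLED_BITS		4
#define PAGEFAULT_DISABLED_OFFSET	(1UL << PAGEFAULT_DISABLED_SHIFT)
#define PAGEFAULT_DISABLED_MASK		\
	(((1UL << PAGEFAULT_DISABLED_BITS) - 1) << PAGEFAULT_DISABLED_SHIFT)

/* true only if pagefault_disable() was called explicitly */
#define pagefault_disabled()	(preempt_count() & PAGEFAULT_DISABLED_MASK)

pagefault_disable()/pagefault_enable() would then add/subtract
PAGEFAULT_DISABLED_OFFSET in addition to what they do today, and might_fault()
could warn based on !pagefault_disabled() instead of bailing out on
in_atomic().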
On Thu, Nov 27, 2014 at 08:09:19AM +0100, Heiko Carstens wrote:
> On Wed, Nov 26, 2014 at 07:04:47PM +0200, Michael S. Tsirkin wrote:
> > On Wed, Nov 26, 2014 at 05:51:08PM +0100, Christian Borntraeger wrote:
> > > > But this one was giving users in the field false positives.
> > >
> > > So lets try to fix those, ok? If we cant, then tough luck.
> >
> > Sure.
> > I think the simplest way might be to make spinlock disable
> > preemption when CONFIG_DEBUG_ATOMIC_SLEEP is enabled.
> >
> > As a result, userspace access will fail and caller will
> > get a nice error.
>
> Yes, _userspace_ now sees unpredictable behaviour, instead of the kernel
> emitting a big loud warning to the console.
So I don't object to adding more debugging at all.
Sure, would be nice.
But the fix is not an unconditional might_sleep
within might_fault; that would trigger false positives.
Rather, detect that you took a spinlock
without disabling preemption.
> Please consider this simple example:
>
> int bar(char __user *ptr)
> {
> ...
> if (copy_to_user(ptr, ...))
> return -EFAULT;
> ...
> }
>
> SYSCALL_DEFINE1(foo, char __user *, ptr)
> {
> int rc;
>
> ...
> rc = bar(ptr);
> if (rc)
> goto out;
> ...
> out:
> return rc;
> }
>
> The above simple system call just works fine, with and without your change,
> however if somebody (incorrectly) changes sys_foo() to the code below:
>
> spin_lock(&lock);
> rc = bar(ptr);
> if (rc)
> goto out;
> out:
> spin_unlock(&lock);
> return rc;
>
> Broken code like above used to generate warnings. With your change we won't
> see any warnings anymore. Instead we get random and bad behaviour:
>
> For !CONFIG_PREEMPT if the page at ptr is not mapped, the kernel will see
> a fault, potentially schedule and potentially deadlock on &lock.
> Without _any_ warning anymore.
>
> For CONFIG_PREEMPT, if the page at ptr is mapped, everything works. However, if
> the page is not mapped, userspace now all of a sudden will see an invalid(!)
> -EFAULT return code, instead of the kernel resolving the page fault.
> Yes, the kernel can't resolve the fault since we hold a spinlock. But the
> above bogus code did give warnings to give you an idea that something probably
> is not correct.
>
> Who on earth is supposed to debug crap like this???
>
> What we really want is:
>
> Code like
> spin_lock(&lock);
> if (copy_to_user(...))
> rc = ...
> spin_unlock(&lock);
> really *should* generate warnings like it did before.
>
> And *only* code like
> spin_lock(&lock);
> page_fault_disable();
> if (copy_to_user(...))
> rc = ...
> page_fault_enable();
> spin_unlock(&lock);
> should not generate warnings, since the author hopefully knew what he did.
>
> We could achieve that by e.g. adding a couple of pagefault disabled bits
> within current_thread_info()->preempt_count, which would allow
> pagefault_disable() and pagefault_enable() to modify a different part of
> preempt_count than it does now, so there is a way to tell if pagefaults have
> been explicitly disabled or are just a side effect of preemption being
> disabled.
> This would allow might_fault() to restore its old sane behaviour for the
> !page_fault_disabled() case.
Exactly. I agree, that would be a useful debugging tool.
In fact this comment in mm/memory.c hints at this:
* it would be nicer only to annotate paths which are not under
* pagefault_disable,
it further says
* however that requires a larger audit and
* providing helpers like get_user_atomic.
but I think that what you outline is a better way to do this.
--
MST
> Code like
> spin_lock(&lock);
> if (copy_to_user(...))
> rc = ...
> spin_unlock(&lock);
> really *should* generate warnings like it did before.
>
> And *only* code like
> spin_lock(&lock);
Is only code like this valid, or also code with the spin_lock() dropped?
(e.g. the access in patch1 if I remember correctly)
So should page_fault_disable() increment the pagefault counter and the preempt
counter or only the first one?
> page_fault_disable();
> if (copy_to_user(...))
> rc = ...
> page_fault_enable();
> spin_unlock(&lock);
> should not generate warnings, since the author hopefully knew what he did.
>
> We could achieve that by e.g. adding a couple of pagefault disabled bits
> within current_thread_info()->preempt_count, which would allow
> pagefault_disable() and pagefault_enable() to modify a different part of
> preempt_count than it does now, so there is a way to tell if pagefaults have
> been explicitly disabled or are just a side effect of preemption being
> disabled.
> This would allow might_fault() to restore its old sane behaviour for the
> !page_fault_disabled() case.
So we would have the pagefault code rely on
in_disabled_pagefault() (pagefault_disabled() ... whatever) instead of
in_atomic().
I agree with this approach, as this is basically what I suggested in one of my
previous mails.
On Thu, Nov 27, 2014 at 09:03:01AM +0100, David Hildenbrand wrote:
> > Code like
> > spin_lock(&lock);
> > if (copy_to_user(...))
> > rc = ...
> > spin_unlock(&lock);
> > really *should* generate warnings like it did before.
> >
> > And *only* code like
> > spin_lock(&lock);
>
> Is only code like this valid or also with the spin_lock() dropped?
> (e.g. the access in patch1 if I remember correctly)
>
> So should page_fault_disable() increment the pagefault counter and the preempt
> counter or only the first one?
Given that a sequence like
page_fault_disable();
if (copy_to_user(...))
rc = ...
page_fault_enable();
is correct code right now, I think page_fault_disable() should increase both.
No need for surprising semantic changes.
> So we would have pagefault code rely on:
>
> in_disabled_pagefault() ( pagefault_disabled() ... whatever ) instead of
> in_atomic().
No, let's be more defensive: the page fault handler should do nothing if
in_atomic(), just like now. But in addition it could have a quick check and
emit a one-time warning if page faults aren't explicitly disabled.
That might help debugging but is more likely to keep the system alive.
might_fault(), however, should call might_sleep() if page faults aren't
disabled, but I think that's what you proposed anyway.
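Roughly what I have in mind for the fault handler side (sketch only, assuming a
pagefault_disabled() helper that tests the new preempt_count bits):

	/* in the arch fault handler, e.g. x86's __do_page_fault() */
	if (unlikely(in_atomic() || !mm)) {
		/*
		 * Atomic context without an explicit pagefault_disable()
		 * is almost certainly a missing annotation or a real bug.
		 */
		WARN_ON_ONCE(!pagefault_disabled());
		bad_area_nosemaphore(regs, error_code, address);
		return;
	}

The warning is emitted once, the fault is still refused, and the system keeps
running.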
> On Thu, Nov 27, 2014 at 09:03:01AM +0100, David Hildenbrand wrote:
> > > Code like
> > > spin_lock(&lock);
> > > if (copy_to_user(...))
> > > rc = ...
> > > spin_unlock(&lock);
> > > really *should* generate warnings like it did before.
> > >
> > > And *only* code like
> > > spin_lock(&lock);
> >
> > Is only code like this valid or also with the spin_lock() dropped?
> > (e.g. the access in patch1 if I remember correctly)
> >
> > So should page_fault_disable() increment the pagefault counter and the preempt
> > counter or only the first one?
>
> Given that a sequence like
>
> page_fault_disable();
> if (copy_to_user(...))
> rc = ...
> page_fault_enable();
>
> is correct code right now I think page_fault_disable() should increase both.
> No need for surprising semantic changes.
>
> > So we would have pagefault code rely on:
> >
> > in_disabled_pagefault() ( pagefault_disabled() ... whatever ) instead of
> > in_atomic().
>
> No, let's be more defensive: the page fault handler should do nothing if
> in_atomic() just like now. But it could have a quick check and emit a one
> time warning if page faults aren't disabled in addition.
> That might help debugging but keeps the system more likely alive.
Sounds sane if we increase both counters!
>
> might_fault() however should call might_sleep() if page faults aren't
> disabled, but that's what you proposed anyway I think.
Yep, sounds good to me. I'll see if I can come up with something.
Thanks!
On Thu, 27 Nov 2014, Heiko Carstens wrote:
> On Thu, Nov 27, 2014 at 09:03:01AM +0100, David Hildenbrand wrote:
> > > Code like
> > > spin_lock(&lock);
> > > if (copy_to_user(...))
> > > rc = ...
> > > spin_unlock(&lock);
> > > really *should* generate warnings like it did before.
> > >
> > > And *only* code like
> > > spin_lock(&lock);
> >
> > Is only code like this valid or also with the spin_lock() dropped?
> > (e.g. the access in patch1 if I remember correctly)
> >
> > So should page_fault_disable() increment the pagefault counter and the preempt
> > counter or only the first one?
>
> Given that a sequence like
>
> page_fault_disable();
> if (copy_to_user(...))
> rc = ...
> page_fault_enable();
>
> is correct code right now I think page_fault_disable() should increase both.
> No need for surprising semantic changes.
OTOH, there is no reason why we need to disable preemption over that
page_fault_disabled() region. There are code paths which really do
not require preemption to be disabled for that.
We have that separated in preempt-rt for obvious reasons and IIRC
Peter Zijlstra tried to disentangle it in mainline some time ago. I
forgot why that never got merged.
We tie way too much stuff to the preemption count already, which is a
nightmare because we have no clear distinction of protection
scopes.
Thanks,
tglx
> OTOH, there is no reason why we need to disable preemption over that
> page_fault_disabled() region. There are code paths which really do
> not require preemption to be disabled for that.
>
> We have that separated in preempt-rt for obvious reasons and IIRC
> Peter Zijlstra tried to disentangle it in mainline some time ago. I
> forgot why that never got merged.
>
Of course, we can completely separate that in our page fault code by doing
pagefault_disabled() checks instead of in_atomic() checks (even in add on
patches later).
> We tie way too much stuff to the preemption count already, which is a
> nightmare because we have no clear distinction of protection
> scopes.
Although it might not be optimal, keeping a separate counter for
pagefault_disable() as part of the preemption counter seems to be the only
doable thing right now. I am not sure if a completely separate counter is even
possible, as it would increase the size of thread_info.
I am working on a prototype right now.
Thanks!
>
> Thanks,
>
> tglx
>
From: David Hildenbrand
...
> Although it might not be optimal, but keeping a separate counter for
> pagefault_disable() as part of the preemption counter seems to be the only
> doable thing right now. I am not sure if a completely separated counter is even
> possible, increasing the size of thread_info.
What about adding (say) 0x10000 for the more restrictive test?
David
> From: David Hildenbrand
> ...
> > Although it might not be optimal, but keeping a separate counter for
> > pagefault_disable() as part of the preemption counter seems to be the only
> > doable thing right now. I am not sure if a completely separated counter is even
> > possible, increasing the size of thread_info.
>
> What about adding (say) 0x10000 for the more restrictive test?
>
> David
>
You mean as part of the preempt counter?
The current layout (on my branch) is
* PREEMPT_MASK: 0x000000ff
* SOFTIRQ_MASK: 0x0000ff00
* HARDIRQ_MASK: 0x000f0000
* NMI_MASK: 0x00100000
* PREEMPT_ACTIVE: 0x00200000
I would have added
* PAGEFAULT_MASK: 0x03C00000
So 4 bits == 16 levels (tbd).
By implementing scope checks in the debug case, as done for the regular
preempt_count_inc()/preempt_count_dec(), we could catch over/underflows.
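For example, something along the lines of the existing CONFIG_DEBUG_PREEMPT
checks (sketch only; assumes a PAGEFAULT_OFFSET define for the lowest bit of
PAGEFAULT_MASK, and the helper names are made up):

static inline void pagefault_count_inc(void)
{
	preempt_count_add(PAGEFAULT_OFFSET);
#ifdef CONFIG_DEBUG_PREEMPT
	/* pagefault_disable() depth about to overflow its 4 bits? */
	DEBUG_LOCKS_WARN_ON((preempt_count() & PAGEFAULT_MASK) == PAGEFAULT_MASK);
#endif
}

static inline void pagefault_count_dec(void)
{
#ifdef CONFIG_DEBUG_PREEMPT
	/* underflow: more pagefault_enable() than pagefault_disable() calls? */
	DEBUG_LOCKS_WARN_ON(!(preempt_count() & PAGEFAULT_MASK));
#endif
	preempt_count_sub(PAGEFAULT_OFFSET);
}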
Thanks,
David
From: David Hildenbrand [mailto:[email protected]]
> > From: David Hildenbrand
> > ...
> > > Although it might not be optimal, but keeping a separate counter for
> > > pagefault_disable() as part of the preemption counter seems to be the only
> > > doable thing right now. I am not sure if a completely separated counter is even
> > > possible, increasing the size of thread_info.
> >
> > What about adding (say) 0x10000 for the more restrictive test?
> >
> > David
> >
>
> You mean as part of the preempt counter?
>
> The current layout (on my branch) is
>
> * PREEMPT_MASK: 0x000000ff
> * SOFTIRQ_MASK: 0x0000ff00
> * HARDIRQ_MASK: 0x000f0000
> * NMI_MASK: 0x00100000
> * PREEMPT_ACTIVE: 0x00200000
>
> I would have added
> * PAGEFAULT_MASK: 0x03C00000
I'm not sure where you'd need to add the bits.
I think the above works because disabling 'HARDIRQ' implicitly
disables 'SOFTIRQ' and 'PREEMPT' (etc), so if 256+ threads
disable PREEMPT everything still works.
So if disabling pagefaults implies that pre-emption is disabled
(but SOFTIRQ is still allowed) then you need to insert your bit(s)
between 0xff00 and 0x00ff.
OTOH if disabling pre-emption implies that pagefaults are disabled
then you'd need to use the lsb and change all the above values.
Which makes me think that 'PREEMPT_ACTIVE' isn't right at all.
Two threads disabling NMIs (or 32 disabling HARDIRQ) won't DTRT.
OTOH I'm only guessing at how this is used.
David
> From: David Hildenbrand [mailto:[email protected]]
> > > From: David Hildenbrand
> > > ...
> > > > Although it might not be optimal, but keeping a separate counter for
> > > > pagefault_disable() as part of the preemption counter seems to be the only
> > > > doable thing right now. I am not sure if a completely separated counter is even
> > > > possible, increasing the size of thread_info.
> > >
> > > What about adding (say) 0x10000 for the more restrictive test?
> > >
> > > David
> > >
> >
> > You mean as part of the preempt counter?
> >
> > The current layout (on my branch) is
> >
> > * PREEMPT_MASK: 0x000000ff
> > * SOFTIRQ_MASK: 0x0000ff00
> > * HARDIRQ_MASK: 0x000f0000
> > * NMI_MASK: 0x00100000
> > * PREEMPT_ACTIVE: 0x00200000
> >
> > I would have added
> > * PAGEFAULT_MASK: 0x03C00000
>
> I'm not sure where you'd need to add the bits.
>
> I think the above works because disabling 'HARDIRQ' implicitly
> disables 'SOFTIRQ' and 'PREEMPT' (etc), so if 256+ threads
> disable PREEMPT everything still works.
AFAIK 256+ levels of preempt will break the system :)
Therefore with CONFIG_DEBUG_PREEMPT we verify that we don't have any
over/underflows.
But such bugs can only be found with CONFIG_DEBUG_PREEMPT enabled.
>
> So if disabling pagefaults implies that pre-emption is disabled
> (but SOFTIRQ is still allowed) then you need to insert your bit(s)
> between 0xff00 and 0x00ff.
> OTOH if disabling pre-emption implies that pagefaults are disabled
> then you'd need to use the lsb and change all the above values.
>
> Which makes me think that 'PREEMPT_ACTIVE' isn't right at all.
> Two threads disabling NMIs (or 32 disabling HARDIRQ) won't DTRT.
By "threads" you mean levels? This is per-thread information.
>
> OTOH I'm only guessing at how this is used.
>
> David
>
>
>
Commit 662bbcb2747c2422cf98d3d97619509379eee466 removed might_sleep() checks
for all user access code (that uses might_fault()).
The reason was to disable wrong "sleep in atomic" warnings in the following
scenario:
pagefault_disable();
rc = copy_to_user(...);
pagefault_enable();
Which is valid, as pagefault_disable() increments the preempt counter and
therefore disables the pagefault handler. copy_to_user() will not sleep and will
return an error code if a page is not available.
However, as all might_sleep() checks are removed, CONFIG_DEBUG_ATOMIC_SLEEP
would no longer detect the following scenario:
spin_lock(&lock);
rc = copy_to_user(...);
spin_unlock(&lock);
If the kernel is compiled with preemption turned on, the preempt counter would
be incremented and copy_to_user() would never sleep. However, with preemption
turned off, the preempt counter will not be touched, and we will therefore sleep
in atomic context. We really want to enable CONFIG_DEBUG_ATOMIC_SLEEP checks for
user access functions again; otherwise horrible deadlocks might be hard to debug.
The root of all evil is that pagefault_disable() acted almost like preempt_disable(),
depending on preemption being turned on/off.
As we now have a fixed pagefault_disable() implementation in place that uses its
own bits in the preempt counter, we can re-enable might_sleep() checks.
This patch reverts commit 662bbcb2747c2422cf98d3d97619509379eee466 taking care
of the !MMU optimization and the new pagefault_disabled() check.
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/kernel.h | 9 +++++++--
mm/memory.c | 15 ++++-----------
2 files changed, 11 insertions(+), 13 deletions(-)
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 3d770f55..64b5f93 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -225,9 +225,14 @@ static inline u32 reciprocal_scale(u32 val, u32 ep_ro)
return (u32)(((u64) val * ep_ro) >> 32);
}
-#if defined(CONFIG_MMU) && \
- (defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_DEBUG_ATOMIC_SLEEP))
+#if defined(CONFIG_MMU) && defined(CONFIG_PROVE_LOCKING)
void might_fault(void);
+#elif defined(CONFIG_MMU) && defined(CONFIG_DEBUG_ATOMIC_SLEEP)
+static inline void might_fault(void)
+{
+ if (unlikely(!pagefault_disabled()))
+ __might_sleep(__FILE__, __LINE__, 0);
+}
#else
static inline void might_fault(void) { }
#endif
diff --git a/mm/memory.c b/mm/memory.c
index 3e50383..0e59db9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3699,7 +3699,7 @@ void print_vma_addr(char *prefix, unsigned long ip)
up_read(&mm->mmap_sem);
}
-#if defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_DEBUG_ATOMIC_SLEEP)
+#ifdef CONFIG_PROVE_LOCKING
void might_fault(void)
{
/*
@@ -3711,17 +3711,10 @@ void might_fault(void)
if (segment_eq(get_fs(), KERNEL_DS))
return;
- /*
- * it would be nicer only to annotate paths which are not under
- * pagefault_disable, however that requires a larger audit and
- * providing helpers like get_user_atomic.
- */
- if (in_atomic())
- return;
-
- __might_sleep(__FILE__, __LINE__, 0);
+ if (unlikely(!pagefault_disabled()))
+ __might_sleep(__FILE__, __LINE__, 0);
- if (current->mm)
+ if (!in_atomic() && current->mm)
might_lock_read(&current->mm->mmap_sem);
}
EXPORT_SYMBOL(might_fault);
--
1.8.5.5
Simple prototype to enable might_sleep() checks in might_fault(), avoiding false
positives for scenarios involving explicit pagefault_disable().
So this should work:
spin_lock(&lock); /* also works if left out */
pagefault_disable()
rc = copy_to_user(...)
pagefault_enable();
spin_unlock(&lock); /* also works if left out */
And this should report a warning again:
spin_lock(&lock);
rc = copy_to_user(...);
spin_unlock(&lock);
Still missing:
- Split of preempt documentation update + preempt_active define reshuffle
- Debug version to test for over/underflows
- Change documentation of user access methods to reflect the real behavior
- Don't touch the preempt counter, only the pagefault disable counter (future
work)
David Hildenbrand (2):
preempt: track pagefault_disable() calls in the preempt counter
mm, sched: trigger might_sleep() in might_fault() when pagefaults are
disabled
include/linux/kernel.h | 9 +++++++--
include/linux/preempt_mask.h | 24 +++++++++++++++++++-----
include/linux/uaccess.h | 21 ++++++++++++++-------
mm/memory.c | 15 ++++-----------
4 files changed, 44 insertions(+), 25 deletions(-)
--
1.8.5.5
Let's track the levels of pagefault_disable() calls in a separate part of the
preempt counter. Also update the regular preempt counter to keep the existing
pagefault infrastructure working (can be disentangled and cleaned up later).
This change is needed to detect whether we are running in a simple atomic
context or in pagefault_disable() context.
Clean up the PREEMPT_ACTIVE defines and fix the preempt count documentation along
the way.
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/preempt_mask.h | 24 +++++++++++++++++++-----
include/linux/uaccess.h | 21 ++++++++++++++-------
2 files changed, 33 insertions(+), 12 deletions(-)
diff --git a/include/linux/preempt_mask.h b/include/linux/preempt_mask.h
index dbeec4d..9d6e7f7 100644
--- a/include/linux/preempt_mask.h
+++ b/include/linux/preempt_mask.h
@@ -4,11 +4,15 @@
#include <linux/preempt.h>
/*
- * We put the hardirq and softirq counter into the preemption
+ * We put the hardirq, softirq and pagefault_disable counter into the preemption
* counter. The bitmask has the following meaning:
*
* - bits 0-7 are the preemption count (max preemption depth: 256)
* - bits 8-15 are the softirq count (max # of softirqs: 256)
+ * - bits 16-19 are the hardirq count (max # of hardirqs: 16)
+ * - bit 20 is the nmi flag
+ * - bit 21 is the preempt_active flag
+ * - bits 22-25 are the pagefault count (max pagefault disable depth: 16)
*
* The hardirq count could in theory be the same as the number of
* interrupts in the system, but we run all interrupt handlers with
@@ -21,16 +25,21 @@
* HARDIRQ_MASK: 0x000f0000
* NMI_MASK: 0x00100000
* PREEMPT_ACTIVE: 0x00200000
+ * PAGEFAULT_MASK: 0x03C00000
*/
#define PREEMPT_BITS 8
#define SOFTIRQ_BITS 8
#define HARDIRQ_BITS 4
#define NMI_BITS 1
+#define PREEMPT_ACTIVE_BITS 1
+#define PAGEFAULT_BITS 4
#define PREEMPT_SHIFT 0
#define SOFTIRQ_SHIFT (PREEMPT_SHIFT + PREEMPT_BITS)
#define HARDIRQ_SHIFT (SOFTIRQ_SHIFT + SOFTIRQ_BITS)
#define NMI_SHIFT (HARDIRQ_SHIFT + HARDIRQ_BITS)
+#define PREEMPT_ACTIVE_SHIFT (NMI_SHIFT + NMI_BITS)
+#define PAGEFAULT_SHIFT (PREEMPT_ACTIVE_SHIFT + PREEMPT_ACTIVE_BITS)
#define __IRQ_MASK(x) ((1UL << (x))-1)
@@ -38,18 +47,17 @@
#define SOFTIRQ_MASK (__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT)
#define HARDIRQ_MASK (__IRQ_MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT)
#define NMI_MASK (__IRQ_MASK(NMI_BITS) << NMI_SHIFT)
+#define PREEMPT_ACTIVE (__IRQ_MASK(PREEMPT_ACTIVE_BITS) << PREEMPT_ACTIVE_SHIFT)
+#define PAGEFAULT_MASK (__IRQ_MASK(PAGEFAULT_BITS) << PAGEFAULT_SHIFT)
#define PREEMPT_OFFSET (1UL << PREEMPT_SHIFT)
#define SOFTIRQ_OFFSET (1UL << SOFTIRQ_SHIFT)
#define HARDIRQ_OFFSET (1UL << HARDIRQ_SHIFT)
#define NMI_OFFSET (1UL << NMI_SHIFT)
+#define PAGEFAULT_OFFSET (1UL << PAGEFAULT_SHIFT)
#define SOFTIRQ_DISABLE_OFFSET (2 * SOFTIRQ_OFFSET)
-#define PREEMPT_ACTIVE_BITS 1
-#define PREEMPT_ACTIVE_SHIFT (NMI_SHIFT + NMI_BITS)
-#define PREEMPT_ACTIVE (__IRQ_MASK(PREEMPT_ACTIVE_BITS) << PREEMPT_ACTIVE_SHIFT)
-
#define hardirq_count() (preempt_count() & HARDIRQ_MASK)
#define softirq_count() (preempt_count() & SOFTIRQ_MASK)
#define irq_count() (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK \
@@ -71,6 +79,12 @@
*/
#define in_nmi() (preempt_count() & NMI_MASK)
+/*
+ * Are we in pagefault_disable context?
+ */
+#define pagefault_disabled() (preempt_count() & PAGEFAULT_MASK)
+
+
#if defined(CONFIG_PREEMPT_COUNT)
# define PREEMPT_CHECK_OFFSET 1
#else
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index ecd3319..a2ba6e6 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -4,18 +4,24 @@
#include <linux/preempt.h>
#include <asm/uaccess.h>
+#define __pagefault_count_inc() preempt_count_add(PAGEFAULT_OFFSET)
+#define __pagefault_count_dec() preempt_count_sub(PAGEFAULT_OFFSET)
+
/*
- * These routines enable/disable the pagefault handler in that
- * it will not take any locks and go straight to the fixup table.
+ * These routines enable/disable the pagefault handler. If disabled, it will
+ * not take any locks and go straight to the fixup table.
+ *
+ * We increase the preempt and the pagefault count, to be able to distinguish
+ * whether we run in simple atomic context or in a real pagefault_disable context.
+ *
+ * For now, after pagefault_disabled() has been called, we run in atomic
+ * context. User access methods will not sleep.
*
- * They have great resemblance to the preempt_disable/enable calls
- * and in fact they are identical; this is because currently there is
- * no other way to make the pagefault handlers do this. So we do
- * disable preemption but we don't necessarily care about that.
*/
static inline void pagefault_disable(void)
{
preempt_count_inc();
+ __pagefault_count_inc();
/*
* make sure to have issued the store before a pagefault
* can hit.
@@ -25,12 +31,13 @@ static inline void pagefault_disable(void)
static inline void pagefault_enable(void)
{
-#ifndef CONFIG_PREEMPT
/*
* make sure to issue those last loads/stores before enabling
* the pagefault handler again.
*/
barrier();
+ __pagefault_count_dec();
+#ifndef CONFIG_PREEMPT
preempt_count_dec();
#else
preempt_enable();
--
1.8.5.5
On Thu, Nov 27, 2014 at 06:10:17PM +0100, David Hildenbrand wrote:
> Commit 662bbcb2747c2422cf98d3d97619509379eee466 removed might_sleep() checks
> for all user access code (that uses might_fault()).
>
> The reason was to disable wrong "sleep in atomic" warnings in the following
> scenario:
> pagefault_disable();
> rc = copy_to_user(...);
> pagefault_enable();
>
> Which is valid, as pagefault_disable() increments the preempt counter and
> therefore disables the pagefault handler. copy_to_user() will not sleep and return
> an invalid return code if a page is not available.
>
> However, as all might_sleep() checks are removed, CONFIG_DEBUG_ATOMIC_SLEEP
> would no longer detect the following scenario:
> spin_lock(&lock);
> rc = copy_to_user(...);
> spin_unlock(&lock);
>
> If the kernel is compiled with preemption turned on, the preempt counter would
> be incremented and copy_to_user() would never sleep. However, with preemption
> turned off, the preempt counter will not be touched, we will therefore sleep in
> atomic context. We really want to enable CONFIG_DEBUG_ATOMIC_SLEEP checks for
> user access functions again, otherwise horrible deadlocks might be hard to debug.
>
> Root of all evil is that pagefault_disable() acted almost as preempt_disable(),
> depending on preemption being turned on/off.
>
> As we now have a fixed pagefault_disable() implementation in place, that uses
> own bits in the preempt counter, we can reenable might_sleep() checks.
>
> This patch reverts commit 662bbcb2747c2422cf98d3d97619509379eee466 taking care
> of the !MMU optimization and the new pagefault_disabled() check.
>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> include/linux/kernel.h | 9 +++++++--
> mm/memory.c | 15 ++++-----------
> 2 files changed, 11 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 3d770f55..64b5f93 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -225,9 +225,14 @@ static inline u32 reciprocal_scale(u32 val, u32 ep_ro)
> return (u32)(((u64) val * ep_ro) >> 32);
> }
>
> -#if defined(CONFIG_MMU) && \
> - (defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_DEBUG_ATOMIC_SLEEP))
> +#if defined(CONFIG_MMU) && defined(CONFIG_PROVE_LOCKING)
> void might_fault(void);
> +#elif defined(CONFIG_MMU) && defined(CONFIG_DEBUG_ATOMIC_SLEEP)
> +static inline void might_fault(void)
> +{
> + if (unlikely(!pagefault_disabled()))
> + __might_sleep(__FILE__, __LINE__, 0);
This __FILE__/__LINE__ will always point at kernel.h.
You want a macro to wrap this up.
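Something like this in the header, then (sketch; keeps the pagefault_disabled()
check but reports the caller's location):

#define might_fault()						\
do {								\
	if (unlikely(!pagefault_disabled()))			\
		__might_sleep(__FILE__, __LINE__, 0);		\
} while (0)

That way __might_sleep() points at the user access call site instead of
include/linux/kernel.h.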
> +}
> #else
> static inline void might_fault(void) { }
> #endif
> diff --git a/mm/memory.c b/mm/memory.c
> index 3e50383..0e59db9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3699,7 +3699,7 @@ void print_vma_addr(char *prefix, unsigned long ip)
> up_read(&mm->mmap_sem);
> }
>
> -#if defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_DEBUG_ATOMIC_SLEEP)
> +#ifdef CONFIG_PROVE_LOCKING
> void might_fault(void)
> {
> /*
> @@ -3711,17 +3711,10 @@ void might_fault(void)
> if (segment_eq(get_fs(), KERNEL_DS))
> return;
>
> - /*
> - * it would be nicer only to annotate paths which are not under
> - * pagefault_disable, however that requires a larger audit and
> - * providing helpers like get_user_atomic.
> - */
> - if (in_atomic())
> - return;
> -
> - __might_sleep(__FILE__, __LINE__, 0);
> + if (unlikely(!pagefault_disabled()))
> + __might_sleep(__FILE__, __LINE__, 0);
>
> - if (current->mm)
> + if (!in_atomic() && current->mm)
> > might_lock_read(&current->mm->mmap_sem);
> }
> EXPORT_SYMBOL(might_fault);
> --
> 1.8.5.5
On Thu, Nov 27, 2014 at 07:24:49PM +0200, Michael S. Tsirkin wrote:
> On Thu, Nov 27, 2014 at 06:10:17PM +0100, David Hildenbrand wrote:
> > Commit 662bbcb2747c2422cf98d3d97619509379eee466 removed might_sleep() checks
> > for all user access code (that uses might_fault()).
> >
> > The reason was to disable wrong "sleep in atomic" warnings in the following
> > scenario:
> > pagefault_disable();
> > rc = copy_to_user(...);
> > pagefault_enable();
> >
> > Which is valid, as pagefault_disable() increments the preempt counter and
> > therefore disables the pagefault handler. copy_to_user() will not sleep and return
> > an invalid return code if a page is not available.
> >
> > However, as all might_sleep() checks are removed, CONFIG_DEBUG_ATOMIC_SLEEP
> > would no longer detect the following scenario:
> > spin_lock(&lock);
> > rc = copy_to_user(...);
> > spin_unlock(&lock);
> >
> > If the kernel is compiled with preemption turned on, the preempt counter would
> > be incremented and copy_to_user() would never sleep. However, with preemption
> > turned off, the preempt counter will not be touched, we will therefore sleep in
> > atomic context. We really want to enable CONFIG_DEBUG_ATOMIC_SLEEP checks for
> > user access functions again, otherwise horrible deadlocks might be hard to debug.
> >
> > Root of all evil is that pagefault_disable() acted almost as preempt_disable(),
> > depending on preemption being turned on/off.
> >
> > As we now have a fixed pagefault_disable() implementation in place, that uses
> > own bits in the preempt counter, we can reenable might_sleep() checks.
> >
> > This patch reverts commit 662bbcb2747c2422cf98d3d97619509379eee466 taking care
> > of the !MMU optimization and the new pagefault_disabled() check.
> >
> > Signed-off-by: David Hildenbrand <[email protected]>
> > ---
> > include/linux/kernel.h | 9 +++++++--
> > mm/memory.c | 15 ++++-----------
> > 2 files changed, 11 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> > index 3d770f55..64b5f93 100644
> > --- a/include/linux/kernel.h
> > +++ b/include/linux/kernel.h
> > @@ -225,9 +225,14 @@ static inline u32 reciprocal_scale(u32 val, u32 ep_ro)
> > return (u32)(((u64) val * ep_ro) >> 32);
> > }
> >
> > -#if defined(CONFIG_MMU) && \
> > - (defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_DEBUG_ATOMIC_SLEEP))
> > +#if defined(CONFIG_MMU) && defined(CONFIG_PROVE_LOCKING)
> > void might_fault(void);
> > +#elif defined(CONFIG_MMU) && defined(CONFIG_DEBUG_ATOMIC_SLEEP)
> > +static inline void might_fault(void)
> > +{
> > + if (unlikely(!pagefault_disabled()))
> > + __might_sleep(__FILE__, __LINE__, 0);
>
> This __FILE__/__FILE__ will always point at kernel.h
>
> You want a macro to wrap this up.
>
> > +}
> > #else
> > static inline void might_fault(void) { }
> > #endif
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 3e50383..0e59db9 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3699,7 +3699,7 @@ void print_vma_addr(char *prefix, unsigned long ip)
> > up_read(&mm->mmap_sem);
> > }
> >
> > -#if defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_DEBUG_ATOMIC_SLEEP)
> > +#ifdef CONFIG_PROVE_LOCKING
> > void might_fault(void)
> > {
> > /*
> > @@ -3711,17 +3711,10 @@ void might_fault(void)
> > if (segment_eq(get_fs(), KERNEL_DS))
> > return;
> >
> > - /*
> > - * it would be nicer only to annotate paths which are not under
> > - * pagefault_disable, however that requires a larger audit and
> > - * providing helpers like get_user_atomic.
> > - */
> > - if (in_atomic())
> > - return;
> > -
> > - __might_sleep(__FILE__, __LINE__, 0);
> > + if (unlikely(!pagefault_disabled()))
> > + __might_sleep(__FILE__, __LINE__, 0);
> >
Same here: so maybe make might_fault a wrapper
around __might_fault as well.
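I.e. something like (sketch only, mirroring the body above but taking the
caller's file/line):

/* include/linux/kernel.h */
void __might_fault(const char *file, int line);
#define might_fault() __might_fault(__FILE__, __LINE__)

/* mm/memory.c */
void __might_fault(const char *file, int line)
{
	if (segment_eq(get_fs(), KERNEL_DS))
		return;
	if (unlikely(!pagefault_disabled()))
		__might_sleep(file, line, 0);
	if (!in_atomic() && current->mm)
		might_lock_read(&current->mm->mmap_sem);
}
EXPORT_SYMBOL(__might_fault);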
> > - if (current->mm)
> > + if (!in_atomic() && current->mm)
> > might_lock_read(&current->mm->mmap_sem);
> > }
> > EXPORT_SYMBOL(might_fault);
> > --
> > 1.8.5.5
> > > -
> > > - __might_sleep(__FILE__, __LINE__, 0);
> > > + if (unlikely(!pagefault_disabled()))
> > > + __might_sleep(__FILE__, __LINE__, 0);
> > >
>
> Same here: so maybe make might_fault a wrapper
> around __might_fault as well.
Yes, I also noticed that. It was part of the original code.
For now I kept this revert as close as possible to
the original patch.
Better fix this in an add-on patch? Or directly in this commit? At least for
the in-header function it is easy to fix in this patch.
Thanks!
On Thu, Nov 27, 2014 at 07:08:42PM +0100, David Hildenbrand wrote:
> > > > -
> > > > - __might_sleep(__FILE__, __LINE__, 0);
> > > > + if (unlikely(!pagefault_disabled()))
> > > > + __might_sleep(__FILE__, __LINE__, 0);
> > > >
> >
> > Same here: so maybe make might_fault a wrapper
> > around __might_fault as well.
>
> Yes, I also noticed that. It was part of the original code.
> For now I kept this revert as close as possible to
> the original patch.
>
> Better fix this in an add-on patch? Or directly in this commit?
IMHO it's up to you really.
> At least for
> the in-header function it is easy to fix in this patch.
>
> Thanks!
Right.
On Thu, 27 Nov 2014, David Hildenbrand wrote:
> > OTOH, there is no reason why we need to disable preemption over that
> > page_fault_disabled() region. There are code paths which really do
> > not require preemption to be disabled for that.
> >
> > We have that separated in preempt-rt for obvious reasons and IIRC
> > Peter Zijlstra tried to disentangle it in mainline some time ago. I
> > forgot why that never got merged.
> >
>
> Of course, we can completely separate that in our page fault code by doing
> pagefault_disabled() checks instead of in_atomic() checks (even in add on
> patches later).
>
> > We tie way too much stuff to the preemption count already, which is a
> > nightmare because we have no clear distinction of protection
> > scopes.
>
> Although it might not be optimal, but keeping a separate counter for
> pagefault_disable() as part of the preemption counter seems to be the only
> doable thing right now.
It needs to be separate, if it should be useful. Otherwise we just
have an extra accounting in preempt_count() which does exactly the same
thing as we have now: disabling preemption.
Now you might say that we could mask out that part when checking
preempt_count, but that won't work on x86, as x86 has the preempt
counter as a per-cpu variable and not as a per-thread one.
But if you want to disentangle pagefault disable from preempt disable
then you must move it to the thread, because it is a property of the
thread. preempt count is very much a per cpu counter as you can only
go through schedule when it becomes 0.
Btw, I find the x86 representation way more clear, because it
documents that preempt count is a per cpu BKL and not a magic thread
property. And sadly that is how preempt count is used ...
> I am not sure if a completely separated counter is even possible,
> increasing the size of thread_info.
And adding a ulong to thread_info is going to create exactly which
problem?
Thanks,
tglx
> On Thu, 27 Nov 2014, David Hildenbrand wrote:
> > > OTOH, there is no reason why we need to disable preemption over that
> > > page_fault_disabled() region. There are code paths which really do
> > > not require preemption to be disabled for that.
> > >
> > > We have that separated in preempt-rt for obvious reasons and IIRC
> > > Peter Zijlstra tried to disentangle it in mainline some time ago. I
> > > forgot why that never got merged.
> > >
> >
> > Of course, we can completely separate that in our page fault code by doing
> > pagefault_disabled() checks instead of in_atomic() checks (even in add on
> > patches later).
> >
> > > We tie way too much stuff to the preemption count already, which is a
> > > nightmare because we have no clear distinction of protection
> > > scopes.
> >
> > Although it might not be optimal, but keeping a separate counter for
> > pagefault_disable() as part of the preemption counter seems to be the only
> > doable thing right now.
>
> It needs to be separate, if it should be useful. Otherwise we just
> have an extra accounting in preempt_count() which does exactly the same
> thing as we have now: disabling preemption.
>
> Now you might say that we could mask out that part when checking
> preempt_count, but that won't work on x86, as x86 has the preempt
> counter as a per-cpu variable and not as a per-thread one.
Ah right, it's per-cpu on x86. So it really belongs to the thread if we want to
disentangle preemption and pagefault_disable.
It would work for now, but not in the long run for x86.
>
> But if you want to disentangle pagefault disable from preempt disable
> then you must move it to the thread, because it is a property of the
> thread. preempt count is very much a per cpu counter as you can only
> go through schedule when it becomes 0.
Thinking about it, this makes perfect sense!
>
> Btw, I find the x86 representation way more clear, because it
> documents that preempt count is a per cpu BKL and not a magic thread
> property. And sadly that is how preempt count is used ...
>
> > I am not sure if a completely separated counter is even possible,
> > increasing the size of thread_info.
>
> And adding a ulong to thread_info is going to create exactly which
> problem?
If we're allowed to increase the size of thread_info - absolutely fine with me!
(I am not sure if some archs have special constraints on the size)
Will see what I can come up with.
Thanks!
>
> Thanks,
>
> tglx
>
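A first idea for the fully separate counter, then (sketch only; the field name
is made up, and it could live in task_struct or in each arch's thread_info):

/* new per-thread field, e.g. unsigned long pagefault_disabled; */

static inline void pagefault_disable(void)
{
	current->pagefault_disabled++;
	/* make sure the store is visible before a user access can fault */
	barrier();
}

static inline void pagefault_enable(void)
{
	/* finish the user accesses before re-enabling the fault handler */
	barrier();
	current->pagefault_disabled--;
}

#define pagefault_disabled()	(current->pagefault_disabled != 0)

This would leave preempt_count() alone entirely, so whether pagefault_disable()
also disables preemption becomes a separate decision.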