Hi all,
I don't know whether this is linux-rt specific or applies to
the mainline too, so I'll repeat some things the linux-rt
readers already know.
Environment:
- Geode LX or Celeron M
- _not_ CONFIG_SMP
- linux 3.4 with realtime patches and full preempt configured
- an application consisting of several mostly RR-class threads
- the application runs with mlockall()
- there is no swap
Problem:
- after several hours to 1-2 weeks some of the threads start to loop
in the following way
0d...0 62811.755382: function: do_page_fault
0....0 62811.755386: function: handle_mm_fault
0....0 62811.755389: function: handle_pte_fault
0d...0 62811.755394: function: do_page_fault
0....0 62811.755396: function: handle_mm_fault
0....0 62811.755398: function: handle_pte_fault
0d...0 62811.755402: function: do_page_fault
0....0 62811.755404: function: handle_mm_fault
0....0 62811.755406: function: handle_pte_fault
and stay in the loop until the RT throttling gets activated.
One of the faulting addresses was in code (after returning
from a syscall), a second one in the stack (inside put_user right
before a syscall ends); both were definitely mapped.
- After the RT throttler activates it somehow magically fixes itself,
probably (not verified) because another _process_ gets scheduled.
When throttled, the RR and FF threads are not allowed to run for
a while (20 ms in my configuration). The livelock lasts around
1-3 seconds, and there is a SCHED_OTHER process that runs every
2 seconds.
- Kernel threads with higher priority than the faulting one (linux-rt
irq threads) run normally. A higher priority user thread from the
same process gets scheduled and then enters the same faulting loop.
- in ps -o min_flt,maj_flt the number of minor page faults
for the offending thread skyrockets to hundreds of thousands
(normally it stays zero as everything is already mapped
when it is started)
- The code in handle_pte_fault proceeds through the
entry = pte_mkyoung(entry);
line and the following
ptep_set_access_flags
returns zero.
- The livelock is extremely timing sensitive - different workloads
cause it either not to happen at all or to happen only far later.
- I was able to make this happen a bit faster (once per ~4 hours)
with the rt thread repeatedly causing the kernel to try to
invoke modprobe to load a missing module - so there is a load
of kworkers launching modprobes (in case anyone wonders how this
can happen: it was a bug in our application where an invalid level
was specified for setsockopt, causing a search for a TCP congestion
module instead of setting SO_LINGER)
- the symptoms are similar to
http://lkml.indiana.edu/hypermail/linux/kernel/1103.0/01364.html
which got fixed by
https://lkml.org/lkml/2011/3/15/516
but this fix does not apply to the processors in question
- the patch below _seems_ to fix it, or at least massively delay it -
the testcase now runs for 2.5 days instead of 4 hours. I doubt
it is the proper patch (it brutally reloads CR3 every time
a thread with a userspace mapping is switched to). I just got the
suspicion that there is some way the kernel forgets to update
the memory mapping when going from a userspace thread through
some kernel ones back to another userspace one, and tried to make
sure the mapping is always reloaded.
- the whole history starts at
http://www.spinics.net/lists/linux-rt-users/msg09758.html
I originally thought the problem was in timerfd and hunted it
in several places until I learned to use the tracing infrastructure
and started to pin it down with trace prints etc :)
- A trace file of the hang is at
http://www.meduna.org/tmp/trace.mmfaulthang.dat.gz
Does this ring a bell with someone?
Thanks
Stano
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 6902152..3d54a15 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -54,21 +54,23 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
if (unlikely(prev->context.ldt != next->context.ldt))
load_LDT_nolock(&next->context);
}
-#ifdef CONFIG_SMP
else {
+#ifdef CONFIG_SMP
percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
BUG_ON(percpu_read(cpu_tlbstate.active_mm) != next);
if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
+#endif
/* We were in lazy tlb mode and leave_mm disabled
* tlb flush IPI delivery. We must reload CR3
* to make sure to use no freed page tables.
*/
load_cr3(next->pgd);
load_LDT_nolock(&next->context);
+#ifdef CONFIG_SMP
}
- }
#endif
+ }
}
#define activate_mm(prev, next)
On Fri, 2013-05-17 at 10:42 +0200, Stanislav Meduna wrote:
> Hi all,
>
> I don't know whether this is linux-rt specific or applies to
> the mainline too, so I'll repeat some things the linux-rt
> readers already know.
>
> Environment:
>
> - Geode LX or Celeron M
> - _not_ CONFIG_SMP
> - linux 3.4 with realtime patches and full preempt configured
> - an application consisting of several mostly RR-class threads
The threads do a mlockall too right? I'm not sure mlock will lock memory
for a new thread's stack.
> - the application runs with mlockall()
With both MCL_FUTURE and MCL_CURRENT set, right?
> - there is no swap
Hmm, doesn't mean that code can't be swapped out, as it is just mapped
from the file it came from. But you'd think mlockall would prevent that.
>
> Problem:
>
> - after several hours to 1-2 weeks some of the threads start to loop
> in the following way
>
> 0d...0 62811.755382: function: do_page_fault
> 0....0 62811.755386: function: handle_mm_fault
> 0....0 62811.755389: function: handle_pte_fault
> 0d...0 62811.755394: function: do_page_fault
> 0....0 62811.755396: function: handle_mm_fault
> 0....0 62811.755398: function: handle_pte_fault
> 0d...0 62811.755402: function: do_page_fault
> 0....0 62811.755404: function: handle_mm_fault
> 0....0 62811.755406: function: handle_pte_fault
>
> and stay in the loop until the RT throttling gets activated.
> One of the faulting addresses was in code (after returning
> from a syscall), a second one in the stack (inside put_user right
> before a syscall ends); both were definitely mapped.
>
> - After the RT throttler activates it somehow magically fixes itself,
> probably (not verified) because another _process_ gets scheduled.
> When throttled, the RR and FF threads are not allowed to run for
> a while (20 ms in my configuration). The livelock lasts around
> 1-3 seconds, and there is a SCHED_OTHER process that runs every
> 2 seconds.
Hmm, if there was a missed TLB flush, and we are faulting due to a bad
TLB table, and it goes into an infinite faulting loop, the only thing
that will stop it is the RT throttle. Then a new task gets scheduled,
and we flush the TLB and everything is fine again.
>
> - Kernel threads with higher priority than the faulting one (linux-rt
> irq threads) run normally. A higher priority user thread from the
> same process gets scheduled and then enters the same faulting loop.
Kernel threads share the mm, and won't cause a reload of CR3.
>
> - in ps -o min_flt,maj_flt the number of minor page faults
> for the offending thread skyrockets to hundreds of thousands
> (normally it stays zero as everything is already mapped
> when it is started)
>
> - The code in handle_pte_fault proceeds through the
> entry = pte_mkyoung(entry);
> line and the following
> ptep_set_access_flags
> returns zero.
>
> - The livelock is extremely timing sensitive - different workloads
> cause it either not to happen at all or to happen only far later.
>
> - I was able to make this happen a bit faster (once per ~4 hours)
> with the rt thread repeatedly causing the kernel to try to
> invoke modprobe to load a missing module - so there is a load
> of kworkers launching modprobes (in case anyone wonders how this
> can happen: it was a bug in our application where an invalid level
> was specified for setsockopt, causing a search for a TCP congestion
> module instead of setting SO_LINGER)
Note that modules are in vmalloc space, and do fault in. But loading
one also changes the PGD.
>
> - the symptoms are similar to
> http://lkml.indiana.edu/hypermail/linux/kernel/1103.0/01364.html
> which got fixed by
> https://lkml.org/lkml/2011/3/15/516
> but this fix does not apply to the processors in question
>
> - the patch below _seems_ to fix it, or at least massively delay it -
> the testcase now runs for 2.5 days instead of 4 hours. I doubt
> it is the proper patch (it brutally reloads CR3 every time
> a thread with a userspace mapping is switched to). I just got the
> suspicion that there is some way the kernel forgets to update
> the memory mapping when going from a userspace thread through
> some kernel ones back to another userspace one, and tried to make
> sure the mapping is always reloaded.
Seems a bit extreme. Looks to me like there's a missing TLB flush somewhere.
Do you have a reproducer you can share? That way, maybe we can all share
the joy.
-- Steve
>
> - the whole history starts at
> http://www.spinics.net/lists/linux-rt-users/msg09758.html
> I originally thought the problem was in timerfd and hunted it
> in several places until I learned to use the tracing infrastructure
> and started to pin it down with trace prints etc :)
>
> - A trace file of the hang is at
> http://www.meduna.org/tmp/trace.mmfaulthang.dat.gz
>
> Does this ring a bell with someone?
>
> Thanks
> Stano
>
>
>
>
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 6902152..3d54a15 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -54,21 +54,23 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> if (unlikely(prev->context.ldt != next->context.ldt))
> load_LDT_nolock(&next->context);
> }
> -#ifdef CONFIG_SMP
> else {
> +#ifdef CONFIG_SMP
> percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
> BUG_ON(percpu_read(cpu_tlbstate.active_mm) != next);
>
> if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
> +#endif
> /* We were in lazy tlb mode and leave_mm disabled
> * tlb flush IPI delivery. We must reload CR3
> * to make sure to use no freed page tables.
> */
> load_cr3(next->pgd);
> load_LDT_nolock(&next->context);
> +#ifdef CONFIG_SMP
> }
> - }
> #endif
> + }
> }
>
> #define activate_mm(prev, next)
On 22.05.2013 02:39, Steven Rostedt wrote:
> The threads do a mlockall too right? I'm not sure mlock will lock memory
> for a new thread's stack.
They don't. However,
https://rt.wiki.kernel.org/index.php/Threaded_RT-application_with_memory_locking_and_stack_handling_example
claims
"Threads started after a call to mlockall(MCL_CURRENT | MCL_FUTURE) will
generate page faults immediately since the new stack is immediately forced
to RAM (due to the MCL_FUTURE flag)."
and since ps -o min_flt reports zero page faults for the threads,
I think that is the case here.
Anyway, both particular addresses were surely mapped long before
the fault.
>> - the application runs with mlockall()
>
> With both MCL_FUTURE and MCL_CURRENT set, right?
Yes.
>> - there is no swap
>
> Hmm, doesn't mean that code can't be swapped out, as it is just mapped
> from the file it came from. But you'd think mlockall would prevent that.
mlockall also forces the stack to be mapped immediately instead of
generating page faults as it incrementally expands.
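For reference, the pattern from that wiki page looks roughly like the
sketch below. This is only an illustration, not our actual application
code; the prefault size and names are arbitrary examples.

#include <sys/mman.h>
#include <pthread.h>
#include <stddef.h>

#define STACK_PREFAULT_SIZE (64 * 1024)         /* arbitrary example size */

static void prefault_stack(void)
{
        volatile char buf[STACK_PREFAULT_SIZE];
        size_t i;

        /* touch every page so the thread stack is resident and locked
         * before any real-time work starts */
        for (i = 0; i < sizeof(buf); i += 4096)
                buf[i] = 0;
}

static void *rt_thread(void *arg)
{
        (void)arg;
        prefault_stack();
        /* ... real-time work ... */
        return NULL;
}

int main(void)
{
        pthread_t tid;

        /* lock current and future mappings; with MCL_FUTURE the stacks of
         * threads created later are forced into RAM when first touched */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
                return 1;

        pthread_create(&tid, NULL, rt_thread, NULL);
        pthread_join(tid, NULL);
        return 0;
}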
> Seems a bit extreme. Looks to me like there's a missing TLB flush somewhere.
Probably.
One interesting thing: the test for "need to reload something"
looks a bit different on the ARM architecture in
arch/arm/include/asm/mmu_context.h:
if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next)) || prev != next) {
and they do something also for the
!CONFIG_SMP && !cpumask_test_and_set_cpu(cpu, mm_cpumask(next))
case. I don't know what exactly the semantics of mm_cpumask is,
but the difference is suspicious.
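For comparison, with !CONFIG_SMP the unpatched x86 switch_mm()
effectively preprocesses down to roughly the following (paraphrased
from 3.4, not a verbatim copy) - note that the prev == next branch
compiles away completely, so switching back to the same mm never
touches CR3:

static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
                             struct task_struct *tsk)
{
        unsigned cpu = smp_processor_id();

        if (likely(prev != next)) {
                cpumask_set_cpu(cpu, mm_cpumask(next));
                load_cr3(next->pgd);            /* reload the page tables */
                cpumask_clear_cpu(cpu, mm_cpumask(prev));
                if (unlikely(prev->context.ldt != next->context.ldt))
                        load_LDT_nolock(&next->context);
        }
        /* prev == next: on UP this does nothing at all */
}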
> Do you have a reproducer you can share? That way, maybe we can all share
> the joy.
Unfortunately not and I have really tried :( If I get new ideas, I will
try again.
Thanks
--
Stano
On 05/21/2013 08:39 PM, Steven Rostedt wrote:
> On Fri, 2013-05-17 at 10:42 +0200, Stanislav Meduna wrote:
>> Hi all,
>>
>> I don't know whether this is linux-rt specific or applies to
>> the mainline too, so I'll repeat some things the linux-rt
>> readers already know.
>>
>> Environment:
>>
>> - Geode LX or Celeron M
>> - _not_ CONFIG_SMP
>> - linux 3.4 with realtime patches and full preempt configured
>> - an application consisting of several mostly RR-class threads
>
> The threads do a mlockall too right? I'm not sure mlock will lock memory
> for a new thread's stack.
>
>> - the application runs with mlockall()
>
> With both MCL_FUTURE and MCL_CURRENT set, right?
>
>> - there is no swap
>
> Hmm, doesn't mean that code can't be swapped out, as it is just mapped
> from the file it came from. But you'd think mlockall would prevent that.
>
>>
>> Problem:
>>
>> - after several hours to 1-2 weeks some of the threads start to loop
>> in the following way
>>
>> 0d...0 62811.755382: function: do_page_fault
>> 0....0 62811.755386: function: handle_mm_fault
>> 0....0 62811.755389: function: handle_pte_fault
>> 0d...0 62811.755394: function: do_page_fault
>> 0....0 62811.755396: function: handle_mm_fault
>> 0....0 62811.755398: function: handle_pte_fault
>> 0d...0 62811.755402: function: do_page_fault
>> 0....0 62811.755404: function: handle_mm_fault
>> 0....0 62811.755406: function: handle_pte_fault
>>
>> and stay in the loop until the RT throttling gets activated.
>> One of the faulting addresses was in code (after returning
>> from a syscall), a second one in the stack (inside put_user right
>> before a syscall ends); both were definitely mapped.
>>
>> - After the RT throttler activates it somehow magically fixes itself,
>> probably (not verified) because another _process_ gets scheduled.
>> When throttled, the RR and FF threads are not allowed to run for
>> a while (20 ms in my configuration). The livelock lasts around
>> 1-3 seconds, and there is a SCHED_OTHER process that runs every
>> 2 seconds.
>
> Hmm, if there was a missed TLB flush, and we are faulting due to a bad
> TLB table, and it goes into an infinite faulting loop, the only thing
> that will stop it is the RT throttle. Then a new task gets scheduled,
> and we flush the TLB and everything is fine again.
That sounds like maybe we DO want a TLB flush on spurious
page faults, so we get rid of this problem.
Last fall we thought this problem could not happen on x86,
but your bug report suggests that it might.
We can get flush_tlb_fix_spurious_fault to do a local TLB
invalidate of just the address in question by removing the
x86-specific dummy version, falling back to the asm-generic
version that does something.
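For reference, the two definitions in question look roughly like this
(paraphrased from the 3.x headers, so treat it as a sketch rather than
an exact quote):

/* arch/x86/include/asm/pgtable.h: x86 stubs it out */
#define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)

/* include/asm-generic/pgtable.h: the generic fallback actually flushes */
#ifndef flush_tlb_fix_spurious_fault
#define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)
#endif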
Can you test the attached patch?
--
All rights reversed
On Wed, May 22, 2013 at 5:33 AM, Rik van Riel <[email protected]> wrote:
>
> That sounds like maybe we DO want a TLB flush on spurious
> page faults, so we get rid of this problem.
Hmm. If it was just the Geode, I wouldn't be surprised. But with a Celeron too?
Anyway, worth testing..
> We can get flush_tlb_fix_spurious_fault to do a local TLB
> invalidate of just the address in question by removing the
> x86-specific dummy version, falling back to the asm-generic
> version that does something.
>
> Can you test the attached patch?
I think you should also remove the
if (flags & FAULT_FLAG_WRITE)
test in handle_pte_fault(). Because if it's spurious, it might happen
on reads too, I think.
RT people - does RT do anything special with the page tables?
Stanislav, the patch you sent out may well work, but it's damned odd.
On UP, we don't do the leave_mm() optimization that makes that code
necessary. So I agree with Rik that it's more likely somewhere else
(and infinite page faults do imply the TLB not getting flushed by the
page fault exception), and your patch might just be working around it
by simply flushing the TLB at least when switching between threads,
which still happens.
Linus
On Wed, 22 May 2013 08:01:43 -0700
Linus Torvalds <[email protected]> wrote:
> On Wed, May 22, 2013 at 5:33 AM, Rik van Riel <[email protected]> wrote:
> > Can you test the attached patch?
>
> I think you should also remove the
>
> if (flags & FAULT_FLAG_WRITE)
>
> test in handle_pte_fault(). Because if it's spurious, it might happen
> on reads too, I think.
Here you are. I wonder if the conditional was put in because we
originally did a global TLB flush (with IPIs) from the spurious
fault handler...
Stanislav, could you add this patch to your test?
---8<---
Subject: [PATCH] mm: fix up a spurious page fault whenever it happens
The kernel currently only handles spurious page faults when they
"should" happen, but potentially this is not the only situation
where they could happen.
The spurious fault handler only flushes an entry from the local
TLB; this should be a rare event with minimal side effects.
This patch removes the conditional, allowing the spurious fault
handler to execute whenever a spurious page fault happens, which
should eliminate infinite page fault loops.
Signed-off-by: Rik van Riel <[email protected]>
Reported-by: Stanislav Meduna <[email protected]>
---
mm/memory.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 6dc1882..962477d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3744,13 +3744,11 @@ int handle_pte_fault(struct mm_struct *mm,
update_mmu_cache(vma, address, pte);
} else {
/*
- * This is needed only for protection faults but the arch code
- * is not yet telling us if this is a protection fault or not.
- * This still avoids useless tlb flushes for .text page faults
- * with threads.
+ * The page table entry is good, but the CPU generated a
+ * spurious fault. Invalidate the corresponding TLB entry
+ * on this CPU, so the next access can succeed.
*/
- if (flags & FAULT_FLAG_WRITE)
- flush_tlb_fix_spurious_fault(vma, address);
+ flush_tlb_fix_spurious_fault(vma, address);
}
unlock:
pte_unmap_unlock(pte, ptl);
On 22.05.2013 19:41, Rik van Riel wrote:
>> I think you should also remove the
>>
>> if (flags & FAULT_FLAG_WRITE)
Done
>>> Can you test the attached patch?
Nope. Fails with the same symptoms, min_flt skyrockets,
the throttler activates and after 2 seconds all is well
again.
This is on the Geode LX; I don't have the Celeron M at hand now.
Thanks
--
Stano
On Wed, 2013-05-22 at 20:04 +0200, Stanislav Meduna wrote:
> On 22.05.2013 19:41, Rik van Riel wrote:
>
> >> I think you should also remove the
> >>
> >> if (flags & FAULT_FLAG_WRITE)
>
> Done
>
> >>> Can you test the attached patch?
>
> Nope. Fails with the same symptoms, min_flt skyrockets,
> the throttler activates and after 2 seconds all is well
> again.
>
> This is on the Geode LX; I don't have the Celeron M at hand now.
>
Did you apply both patches? Without the first one, this one is
meaningless.
-- Steve
On 22.05.2013 20:11, Steven Rostedt wrote:
> Did you apply both patches? Without the first one, this one is
> meaningless.
Sure.
BTW, back when I tried to pinpoint it I also tried adding
flush_tlb_page(vma, address)
at the beginning of handle_pte_fault, which as I read it should
be basically the same. It did not change anything.
I did mention it in some previous mail but forgot
to include it again in the summary - sorry :/
--
Stano
On 05/22/2013 02:21 PM, Stanislav Meduna wrote:
> On 22.05.2013 20:11, Steven Rostedt wrote:
>
>> Did you apply both patches? Without the first one, this one is
>> meaningless.
>
> Sure.
>
> BTW, back when I tried to pinpoint it I also tried adding
> flush_tlb_page(vma, address)
> at the beginning of handle_pte_fault, which as I read it should
> be basically the same. It did not change anything.
I'm stumped.
If the Geode knows how to flush single TLB entries, it
should do that when flush_tlb_page is called.
If it does not know, it should throw an invalid instruction
exception, and not quietly complete the instruction without
doing anything.
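For reference, the stock single-entry flush in the 3.4-era
arch/x86/include/asm/tlbflush.h is just an invlpg; roughly:

static inline void __native_flush_tlb_single(unsigned long addr)
{
        asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
}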
On 05/22/2013 11:35 AM, Rik van Riel wrote:
> On 05/22/2013 02:21 PM, Stanislav Meduna wrote:
>> On 22.05.2013 20:11, Steven Rostedt wrote:
>>
>>> Did you apply both patches? Without the first one, this one is
>>> meaningless.
>>
>> Sure.
>>
>> BTW, back when I tried to pinpoint it I also tried adding
>> flush_tlb_page(vma, address)
>> at the beginning of handle_pte_fault, which as I read it should
>> be basically the same. It did not change anything.
>
> I'm stumped.
>
> If the Geode knows how to flush single TLB entries, it
> should do that when flush_tlb_page is called.
>
> If it does not know, it should throw an invalid instruction
> exception, and not quietly complete the instruction without
> doing anything.
>
Some CPUs have had errata when it comes to flushing large pages that
have been split into small pages by hardware, e.g. due to MTRR
conflicts. In that case, fragments of the large page may have been left
in the TLB.
Could that explain what you are seeing?
-hpa
On 05/22/2013 02:42 PM, H. Peter Anvin wrote:
> On 05/22/2013 11:35 AM, Rik van Riel wrote:
>> On 05/22/2013 02:21 PM, Stanislav Meduna wrote:
>>> On 22.05.2013 20:11, Steven Rostedt wrote:
>>>
>>>> Did you apply both patches? Without the first one, this one is
>>>> meaningless.
>>>
>>> Sure.
>>>
>>> BTW, back when I tried to pinpoint it I also tried adding
>>> flush_tlb_page(vma, address)
>>> at the beginning of handle_pte_fault, which as I read it should
>>> be basically the same. It did not change anything.
>>
>> I'm stumped.
>>
>> If the Geode knows how to flush single TLB entries, it
>> should do that when flush_tlb_page is called.
>>
>> If it does not know, it should throw an invalid instruction
>> exception, and not quietly complete the instruction without
>> doing anything.
>>
>
> Some CPUs have had errata when it comes to flushing large pages that
> have been split into small pages by hardware, e.g. due to MTRR
> conflicts. In that case, fragments of the large page may have been left
> in the TLB.
>
> Could that explain what you are seeing?
That would be testable by changing __native_flush_tlb_single()
to call __flush_tlb(), instead of doing an invlpg instruction.
In other words, make the code look like this, for testing:
static inline void __native_flush_tlb_single(unsigned long addr)
{
__flush_tlb();
}
This on top of the other two patches.
On 22.05.2013 20:35, Rik van Riel wrote:
> I'm stumped.
>
> If the Geode knows how to flush single TLB entries, it
> should do that when flush_tlb_page is called.
>
> If it does not know, it should throw an invalid instruction
> exception, and not quietly complete the instruction without
> doing anything.
Could it be that the problem is not a stale TLB entry, but a page
directory that is somehow invalid, e.g. belonging to the previous
modprobe (or whatever) instead of the running process?
My patch does load_cr3(next->pgd); so it explicitly loads something
there.
> In other words, make the code look like this, for testing:
>
> static inline void __native_flush_tlb_single(unsigned long addr)
> {
> __flush_tlb();
> }
Yup, will try it.
Thanks
--
Stano
On 22.05.2013 20:43, Rik van Riel wrote:
>> Some CPUs have had errata when it comes to flushing large pages that
>> have been split into small pages by hardware, e.g. due to MTRR
>> conflicts. In that case, fragments of the large page may have been left
>> in the TLB.
Can I somehow find out whether this is the case? The memory mapping
for the failing process has two regions slightly larger than
4 MB - code and heap.
The process also does not access any funny memory regions
from userspace - it is basically networking (both TCP/IP
and raw sockets) and crunching of the data received.
No mmapped devices or something like that.
> static inline void __native_flush_tlb_single(unsigned long addr)
> {
> __flush_tlb();
> }
>
> This on top of the other two patches.
It did not crash overnight, but it also does not show any
minor fault counted for the threads, so I'm afraid the situation
just did not happen - there should be at least one visible in
the ps -o min_flt output, right?
I will give it some more testing time.
Thanks
--
Stano
On 05/23/2013 04:07 AM, Stanislav Meduna wrote:
> On 22.05.2013 20:43, Rik van Riel wrote:
>
>>> Some CPUs have had errata when it comes to flushing large pages that
>>> have been split into small pages by hardware, e.g. due to MTRR
>>> conflicts. In that case, fragments of the large page may have been left
>>> in the TLB.
>
> Can I somehow find out whether this is the case? The memory mapping
> for the failing process has two regions slightly larger than
> 4 MB - code and heap.
>
> The process also does not access any funny memory regions
> from userspace - it is basically networking (both TCP/IP
> and raw sockets) and crunching of the data received.
> No mmapped devices or something like that.
>
>> static inline void __native_flush_tlb_single(unsigned long addr)
>> {
>> __flush_tlb();
>> }
>>
>> This on top of the other two patches.
>
> It did not crash overnight, but it also does not show any
> minor fault counted for the threads, so I'm afraid the situation
> just did not happen - there should be at least one visible in
> the ps -o min_flt output, right?
If all the page faults are done by the main thread,
and the TLB gets properly flushed now, the other
threads might not see minor faults.
> I will give it some more testing time.
That is a good idea.
Now to figure out how we properly fix this
issue in the kernel...
We can add a bit in the architecture bits that
we use to check against other CPU and system
errata, and conditionally flush the whole TLB
from __native_flush_tlb_single().
The question is, how do we identify what CPUs
need the extra flushing?
And in what circumstances do they require it?
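Something along these lines, purely as a sketch - the erratum bit name
is invented here, and how the check is best expressed is exactly the
open question:

/* X86_BUG_PARTIAL_INVLPG is a made-up name for a hypothetical erratum bit */
static inline void __native_flush_tlb_single(unsigned long addr)
{
        if (boot_cpu_has(X86_BUG_PARTIAL_INVLPG))
                __flush_tlb();          /* affected CPUs: flush everything */
        else
                asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
}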
On Thu, 2013-05-23 at 08:19 -0400, Rik van Riel wrote:
> We can add a bit in the architecture bits that
> we use to check against other CPU and system
> errata, and conditionally flush the whole TLB
> from __native_flush_tlb_single().
If we find that some CPUs have issues and others do not, and we can
determine this by checking the CPU type at run time, I would strongly
suggest using the jump_label infrastructure to do the branches. I know
this is early to suggest something like this, but I just wanted to put
it in your head ;-)
-- Steve
On Thu, May 23, 2013 at 1:07 AM, Stanislav Meduna <[email protected]> wrote:
>
> It did not crash overnight, but it also does not show any
> minor fault counted for the threads
Page faults that don't cause us to map a page (ie a spurious one, or
one that just updates dirty/accessed bits) don't show up as even minor
faults. Think of the major/minor counts as "mapping activity", not a page
fault count.
So if this is due to some stuck TLB entry, that wouldn't show up anyway.
Linus
On Thu, May 23, 2013 at 7:45 AM, Linus Torvalds
<[email protected]> wrote:
>
> Page faults that don't cause us to map a page (ie a spurious one, or
> one that just updates dirty/accessed bits) don't show up as even minor
> faults. Think of the major/minor counts as "mapping activity", not a page
> fault count.
Actually, I take that back. We always update either min_flt or maj_flt. My bad.
Another question: I'm assuming this is all 32-bit, is it with PAE
enabled? That changes some of the TLB flushing, and we had one bug
related to that, maybe there are others..
Linus
On 23.05.2013 16:50, Linus Torvalds wrote:
> Another question: I'm assuming this is all 32-bit, is it with PAE
> enabled? That changes some of the TLB flushing, and we had one bug
> related to that, maybe there are others..
32 bit, no PAE.
--
Stano
On 05/23/2013 06:29 AM, Steven Rostedt wrote:
> On Thu, 2013-05-23 at 08:19 -0400, Rik van Riel wrote:
>
>> We can add a bit in the architecture bits that
>> we use to check against other CPU and system
>> errata, and conditionally flush the whole TLB
>> from __native_flush_tlb_single().
>
> If we find that some CPUs have issues and others do not, and we can
> determine this by checking the CPU type at run time, I would strongly
> suggest using the jump_label infrastructure to do the branches. I know
> this is early to suggest something like this, but I just wanted to put
> it in your head ;-)
>
We don't even need the jump_label infrastructure -- we have
static_cpu_has*() which actually predates jump_label although it uses
the same underlying ideas.
-hpa
On Thu, 2013-05-23 at 08:06 -0700, H. Peter Anvin wrote:
> We don't even need the jump_label infrastructure -- we have
> static_cpu_has*() which actually predates jump_label although it uses
> the same underlying ideas.
Ah right. I wonder if it would be worth consolidating a lot of these
"modifying of code" infrastructures. Which reminds me, I need to update
text_poke() to do things similar to what ftrace does, and get rid of the
stop machine code.
-- Steve
On 05/23/2013 08:27 AM, Steven Rostedt wrote:
> On Thu, 2013-05-23 at 08:06 -0700, H. Peter Anvin wrote:
>
>> We don't even need the jump_label infrastructure -- we have
>> static_cpu_has*() which actually predates jump_label although it uses
>> the same underlying ideas.
>
> Ah right. I wonder if it would be worth consolidating a lot of these
> "modifying of code" infrastructures. Which reminds me, I need to update
> text_poke() to do things similar to what ftrace does, and get rid of the
> stop machine code.
>
Well, static_cpu_has*() just uses the alternatives infrastructure.
-hpa
On Thu, 2013-05-23 at 10:24 -0700, H. Peter Anvin wrote:
> On 05/23/2013 08:27 AM, Steven Rostedt wrote:
> > On Thu, 2013-05-23 at 08:06 -0700, H. Peter Anvin wrote:
> >
> >> We don't even need the jump_label infrastructure -- we have
> >> static_cpu_has*() which actually predates jump_label although it uses
> >> the same underlying ideas.
> >
> > Ah right. I wonder if it would be worth consolidating a lot of these
> > "modifying of code" infrastructures. Which reminds me, I need to update
> > text_poke() to do things similar to what ftrace does, and get rid of the
> > stop machine code.
> >
>
> Well, static_cpu_has*() just uses the alternatives infrastructure.
And as it's a boot time change only, it's not quite in the category of
jump_labels and function tracing.
-- Steve
On 05/23/2013 10:36 AM, Steven Rostedt wrote:
> On Thu, 2013-05-23 at 10:24 -0700, H. Peter Anvin wrote:
>> On 05/23/2013 08:27 AM, Steven Rostedt wrote:
>>> On Thu, 2013-05-23 at 08:06 -0700, H. Peter Anvin wrote:
>>>
>>>> We don't even need the jump_label infrastructure -- we have
>>>> static_cpu_has*() which actually predates jump_label although it uses
>>>> the same underlying ideas.
>>>
>>> Ah right. I wonder if it would be worth consolidating a lot of these
>>> "modifying of code" infrastructures. Which reminds me, I need to update
>>> text_poke() to do things similar to what ftrace does, and get rid of the
>>> stop machine code.
>>>
>>
>> Well, static_cpu_has*() just uses the alternatives infrastructure.
>
> And as it's a boot time change only, it's not quite in the category of
> jump_labels and function tracing.
>
Right.
-hpa
On 23.05.2013 14:19, Rik van Riel wrote:
>>> static inline void __native_flush_tlb_single(unsigned long addr)
>>> {
>>> __flush_tlb();
>>> }
>
>> I will give it some more testing time.
>
> That is a good idea.
Still no crash, so this one indeed seems to change things.
If I understand it correctly, these patches fix the problem
when it happens and we still don't know why the TLB is stale
in the first place - whether there is (also) a genuine bug
or whether we are hitting some chip errata, right?
For the record the cpuinfo for my present testsystem:
processor : 0
vendor_id : AuthenticAMD
cpu family : 5
model : 10
model name : Geode(TM) Integrated Processor by AMD PCS
stepping : 2
microcode : 0x88a93d
cpu MHz : 498.042
cache size : 128 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu de pse tsc msr cx8 sep pge cmov clflush mmx
mmxext 3dnowext 3dnow
bogomips : 996.08
clflush size : 32
cache_alignment : 32
address sizes : 32 bits physical, 32 bits virtual
power management:
and for the Celeron M where I can unfortunately reproduce
it much less often (days to weeks).
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 13
model name : Intel(R) Celeron(R) M processor 1.00GHz
stepping : 8
cpu MHz : 1000.011
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov clflush dts acpi mmx fxsr sse sse2 ss tm
pbe nx bts
bogomips : 2000.02
clflush size : 64
cache_alignment : 64
address sizes : 32 bits physical, 32 bits virtual
power management:
Thanks
--
Stano
On 24.05.2013 10:29, Stanislav Meduna wrote:
>>>> static inline void __native_flush_tlb_single(unsigned long addr)
>>>> {
>>>> __flush_tlb();
>>>> }
>>
>>> I will give it some more testing time.
>>
>> That is a good idea.
>
> Still no crash, so this one indeed seems to change things.
I take that back - it has now crashed as well, it just took longer.
The min_flt of two threads jumped from zero to 1848 (lower prio)
and 735993 (higher prio, which preempted the first one) respectively;
the hang lasted 1.7 seconds.
--
Stano
On 05/24/2013 04:29 AM, Stanislav Meduna wrote:
> On 23.05.2013 14:19, Rik van Riel wrote:
>
>>>> static inline void __native_flush_tlb_single(unsigned long addr)
>>>> {
>>>> __flush_tlb();
>>>> }
>>
>>> I will give it some more testing time.
>>
>> That is a good idea.
>
> Still no crash, so this one indeed seems to change things.
>
> If I understand it correctly, these patches fix the problem
> when it happens and we still don't know why the TLB is stale
> in the first place - whether there is (also) a genuine bug
> or whether we are hitting some chip errata, right?
Just to rule something out, are you using
transparent huge pages on those systems?
That could result in a mix of 4MB and 4kB
mappings, sometimes of the same memory.
The page tables would only ever contain
one of those mappings, but if we have some
kind of TLB problem, we might preserve a
large mapping across a page breakup, or
a small one across a page collapse...
On 24.05.2013 15:06, Rik van Riel wrote:
> Just to rule something out, are you using
> transparent huge pages on those systems?
On my present test system they are configured in, but I am
not using them.
# cat /proc/meminfo | grep Huge
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 4096 kB
However during my (many) previous experiments the problem
also happened with kernels that did not have it configured.
Thanks
--
Stano
On 24.05.2013 15:55, Stanislav Meduna wrote:
>> Just to rule something out, are you using
>> transparent huge pages on those systems?
>
> On my present test system they are configured in, but I am
> not using them.
Ah, _transparent_ huge pages. No, that is not enabled.
--
Stano
Hi all,
I was able to reproduce the page fault problem with
a relatively simple application, for now on the
Geode platform. It can be downloaded at
http://www.meduna.org/tmp/PageFault.tar.gz
Basically the test application does:
- 4 threads that do nothing but periodically sleep
- 1 thread looping in a timerfd loop doing nothing
- 4 threads doing nonblocking TCP connects to an address
in the local network that does not exist, i.e. all that
happens are ARP requests.
- additionally, a non-existing TCP congestion algorithm is
requested, resulting in repeated futile requests to load the
module (a sketch of such a call follows this list). This looks
to be an important part of reproducing it, but the problem also
occasionally happened with kernels that did not have modules
enabled at all, so it probably just shifts some probabilities.
- the application is statically linked - this might or might
not be relevant, I just wanted the text segment to be bigger
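The congestion-algorithm part boils down to something like the
following sketch (the algorithm name is made up and the real code is
in the tarball above):

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

static void request_bogus_cc(int fd)
{
        static const char name[] = "nosuchalgo";

        /* the algorithm is not built in, so each call like this makes the
         * kernel try to load a tcp_nosuchalgo module via modprobe; the
         * setsockopt itself just fails */
        setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, strlen(name));
}

The original application bug mentioned earlier was presumably of the
same nature: SO_LINGER and TCP_CONGESTION share the same numeric
optname (13), so passing IPPROTO_TCP instead of SOL_SOCKET turns a
linger request into a congestion-algorithm lookup and a modprobe
attempt.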
I know it is a weird mix, I was just trying to mimic what
our application did in the form that was able to trigger
the faults most often.
In my few tests this reproducibly triggered the problem within hours,
a day at most.
My feeling is that the problem is triggered best if there
is little network traffic and no other connections to the
machine, but this is only a subjective feeling.
The kernel configuration, cpuinfo, meminfo and lspci
are included in the tarball. The kernel configuration is not
very clean; it is a kernel intended to work on both the Geode
and the Celeron, and is also a snapshot of what reproduced the
problem best.
The environment is a current 3.4-rt with following tweaks:
chrt -f -p 37 <pid of ksoftirqd/0>
chrt -o -p 0 <pid of irq/14-pata> [because of a pata_cs5536 bug]
renice -15 <pid of irq/14-pata>
ulimit -s 512
Before compiling change the CONNECT_ADDR define to an address
that is in the local LAN but is not present.
Other than this application a lightweight mix of usual Debian
processes is running. There are no servers except openssh and ntp.
A shell script that wakes every 2 seconds and does some
housekeeping is running; that is probably what recovers the system
when it enters the page-fault loop followed by the
RT throttling.
Right now a test with the same kernel with preempt none
is running to see whether the problem also happens with this
application there (due to the timing sensitivity only a positive
result is significant). I have not had a chance to test
on an Intel processor yet.
Thanks
--
Stano
On 16.06.2013 23:34, Stanislav Meduna wrote:
> Right now a test with the same kernel with preempt none
> is running to see whether the problem also happens with this
> application there (due to the timing sensitivity only a positive
> result is significant).
No crash in 2 days running with preempt none...
--
Stano
On Tue, Jun 18, 2013 at 9:13 AM, Stanislav Meduna <[email protected]> wrote:
>
> No crash in 2 days running with preempt none...
Is this UP?
There's the fast_tlb race that Peter fixed in commit 29eb77825cc7
("arch, mm: Remove tlb_fast_mode()"). I'm not seeing how it would
cause infinite TLB faults, but it definitely causes potentially
incoherent TLB contents. And afaik it only happens with
CONFIG_PREEMPT, and on UP systems. Which sounds like it might match
your setup...
Linus
On 19.06.2013 07:20, Linus Torvalds wrote:
>> No crash in 2 days running with preempt none...
>
> Is this UP?
Yes it is.
> There's the fast_tlb race that Peter fixed in commit 29eb77825cc7
> ("arch, mm: Remove tlb_fast_mode()"). I'm not seeing how it would
> cause infinite TLB faults, but it definitely causes potentially
> incoherent TLB contents. And afaik it only happens with
> CONFIG_PREEMPT, and on UP systems. Which sounds like it might match
> your setup...
Oh, thank you for the pointer, this indeed looks interesting.
Unfortunately the patch does not apply at all to the 3.4 I am
using, and I know too little about what is involved here
to backport it. I will test it when (if) it gets into 3.4(-rt)
(or when I find some spare time to play with a newer kernel
on that system).
Thanks
--
Stano
On Wed, Jun 19, 2013 at 09:36:39AM +0200, Stanislav Meduna wrote:
> On 19.06.2013 07:20, Linus Torvalds wrote:
>
> >> No crash in 2 days running with preempt none...
> >
> > Is this UP?
>
> Yes it is.
>
> > There's the fast_tlb race that Peter fixed in commit 29eb77825cc7
> > ("arch, mm: Remove tlb_fast_mode()"). I'm not seeing how it would
> > cause infinite TLB faults, but it definitely causes potentially
> > incoherent TLB contents. And afaik it only happens with
> > CONFIG_PREEMPT, and on UP systems. Which sounds like it might match
> > your setup...
>
> Oh, thank you for the pointer, this indeed looks interesting.
>
> Unfortunately the patch does not apply at all to the 3.4 I am
> using, and I know too little about what is involved here
> to backport it. I will test it when (if) it gets into 3.4(-rt)
> (or when I find some spare time to play with a newer kernel
> on that system).
The easiest way to test this on your system is to make tlb_fast_mode()
return an unconditional 0.
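I.e. roughly this, in include/asm-generic/tlb.h (a testing-only
sketch; the surrounding ifdefs omitted):

static inline int tlb_fast_mode(struct mmu_gather *tlb)
{
        /* never take the fast path: always batch the pages and free them
         * only after the TLB has been flushed */
        return 0;
}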
On 19.06.2013 10:06, Peter Zijlstra wrote:
>> On 19.06.2013 07:20, Linus Torvalds wrote:
>>> There's the fast_tlb race that Peter fixed in commit 29eb77825cc7
>>> ("arch, mm: Remove tlb_fast_mode()"). I'm not seeing how it would
>>> cause infinite TLB faults, but it definitely causes potentially
>>> incoherent TLB contents. And afaik it only happens with
>>> CONFIG_PREEMPT, and on UP systems. Which sounds like it might match
>>> your setup...
> The easiest way to test this on your system is to make tlb_fast_mode()
> return an unconditional 0.
Nope. Got the faults also with tlb_fast_mode() returning 0, this time
after ~10 hours. So there still has to be something...
Regards
--
Stano