Good Morning,
First, I love this patch from a functional standpoint! However I've
run into an issue debugging early boot issues. Anything that causes
the kernel threading system to die (for example here, a NPE) causes
the boot console to halt before it outputs the error.
Would it be possible to have both a kconfig option and a runtime
option to toggle printk_fallback_preferred_direct()? Currently I've
had to force it at all times while debugging.
Thanks again for this amazing work!
Very Respectfully,
Peter Geis
Hi Peter,
On 2022-06-10, Peter Geis <[email protected]> wrote:
> However I've run into an issue debugging early boot issues. Anything
> that causes the kernel threading system to die (for example here, a
> NPE) causes the boot console to halt before it outputs the error.
A null pointer exception should trigger a panic, which will cause printk
to enter direct mode for the oops. It should be just as effective as
before introducing the printing kthreads.
Can you please provide a test case or patch that will trigger the
problem you refer to?
> Would it be possible to have both a kconfig option and a runtime
> option to toggle printk_fallback_preferred_direct()? Currently I've
> had to force it at all times while debugging.
It would be trivial to provide a boot parameter for this behavior, but I
think we first need to see a real problem and also see if we can fix
that problem.
John Ogness
On Fri, Jun 10, 2022 at 11:05 AM John Ogness <[email protected]> wrote:
>
> Hi Peter,
>
> On 2022-06-10, Peter Geis <[email protected]> wrote:
> > However I've run into an issue debugging early boot issues. Anything
> > that causes the kernel threading system to die (for example here, a
> > NPE) causes the boot console to halt before it outputs the error.
>
> A null pointer exception should trigger a panic, which will cause printk
> to enter direct mode for the oops. It should be just as effective as
> before introducing the printing kthreads.
This might be a side effect of the fact that this is on a low powered
arm64 board. I noticed with threading enabled during large bursts the
console drops an excessive amount of messages. It's especially
apparent during the handover from earlycon to the normal console.
>
> Can you please provide a test case or patch that will trigger the
> problem you refer to?
Unfortunately no, unless you have a Rockchip rk356x device with both
DSI and HDMI. We're working on adding DSI support to the new video
output driver, and when both are enabled it tosses a NPE.
>
> > Would it be possible to have both a kconfig option and a runtime
> > option to toggle printk_fallback_preferred_direct()? Currently I've
> > had to force it at all times while debugging.
>
> It would be trivial to provide a boot parameter for this behavior, but I
> think we first need to see a real problem and also see if we can fix
> that problem.
>
> John Ogness
On (22/06/10 17:11), John Ogness wrote:
> It would be trivial to provide a boot parameter for this behavior, but I
> think we first need to see a real problem and also see if we can fix
> that problem.
John, Petr, a quick question
Should a situation when we have only one online CPU be enough of a reason
to do direct printing? Otherwise we might not have CPUs to wakeup khtread on,
e.g. when CPU that printk is in atomic section for too long.
---
@@ -480,6 +480,9 @@ static inline bool allow_direct_printing(void)
if (!printk_kthreads_available)
return true;
+ if (num_online_cpus() == 1)
+ return true;
+
/*
* Prefer direct printing when the system is in a problematic state.
* The context that sets this state will always see the updated value.
On (22/06/10 11:34), Peter Geis wrote:
> > On 2022-06-10, Peter Geis <[email protected]> wrote:
> > > However I've run into an issue debugging early boot issues. Anything
> > > that causes the kernel threading system to die (for example here, a
> > > NPE) causes the boot console to halt before it outputs the error.
> >
> > A null pointer exception should trigger a panic, which will cause printk
> > to enter direct mode for the oops. It should be just as effective as
> > before introducing the printing kthreads.
>
> This might be a side effect of the fact that this is on a low powered
> arm64 board. I noticed with threading enabled during large bursts the
> console drops an excessive amount of messages. It's especially
> apparent during the handover from earlycon to the normal console.
How many CPUs does it have?
On Sat, Jun 11, 2022 at 10:57 PM Sergey Senozhatsky
<[email protected]> wrote:
>
> On (22/06/10 11:34), Peter Geis wrote:
> > > On 2022-06-10, Peter Geis <[email protected]> wrote:
> > > > However I've run into an issue debugging early boot issues. Anything
> > > > that causes the kernel threading system to die (for example here, a
> > > > NPE) causes the boot console to halt before it outputs the error.
> > >
> > > A null pointer exception should trigger a panic, which will cause printk
> > > to enter direct mode for the oops. It should be just as effective as
> > > before introducing the printing kthreads.
> >
> > This might be a side effect of the fact that this is on a low powered
> > arm64 board. I noticed with threading enabled during large bursts the
> > console drops an excessive amount of messages. It's especially
> > apparent during the handover from earlycon to the normal console.
>
> How many CPUs does it have?
Four and all are online when the error occurs.
On 2022-06-10, Peter Geis <[email protected]> wrote:
> This might be a side effect of the fact that this is on a low powered
> arm64 board. I noticed with threading enabled during large bursts the
> console drops an excessive amount of messages. It's especially
> apparent during the handover from earlycon to the normal console.
I guess you have a very small kernel buffer and are generating unusually
high amounts of messages? Is there a reason you cannot use a larger
buffer?
Or maybe you are generating a constant amount of messages that a serial
port could never keep up with unless it was interfering with the system
in such a way as to slow everything down? This is exactly what printk
should _not_ be used for.
John Ogness
On Sun, Jun 12, 2022 at 7:08 PM John Ogness <[email protected]> wrote:
>
> On 2022-06-10, Peter Geis <[email protected]> wrote:
> > This might be a side effect of the fact that this is on a low powered
> > arm64 board. I noticed with threading enabled during large bursts the
> > console drops an excessive amount of messages. It's especially
> > apparent during the handover from earlycon to the normal console.
>
> I guess you have a very small kernel buffer and are generating unusually
> high amounts of messages? Is there a reason you cannot use a larger
> buffer?
The buffer isn't an issue here, everything is available in dmesg when
userspace becomes available. Instead some messages bound for the
serial console are never output.
>
> Or maybe you are generating a constant amount of messages that a serial
> port could never keep up with unless it was interfering with the system
> in such a way as to slow everything down? This is exactly what printk
> should _not_ be used for.
We run a serial console at 1.5m baud which is significantly higher
than most SoCs which default to 115200. I have noticed some timing
differences since the introduction of the threaded console. A
significant amount of information is dumped very early in the boot
process (between 0 and 4 seconds into boot), as most drivers are
probing during this time. It also happens to be when the earlycon
console hands over to the normal console. There is no abnormal
debugging enabled, the output is a standard (non-quiet) boot log. The
question is why is direct mode not triggering during a panic?
>
> John Ogness
On 2022-06-12, Sergey Senozhatsky <[email protected]> wrote:
> Should a situation when we have only one online CPU be enough of a
> reason to do direct printing? Otherwise we might not have CPUs to
> wakeup khtread on, e.g. when CPU that printk is in atomic section for
> too long.
IMHO, no. Especially in that situation, we do not want printk causing
that atomic section to become even longer. If the machine has entered
normal operation, we want printk out of the way.
John Ogness
On 2022-06-12, Peter Geis <[email protected]> wrote:
> The buffer isn't an issue here, everything is available in dmesg when
> userspace becomes available. Instead some messages bound for the
> serial console are never output.
OK. Good to know.
> We run a serial console at 1.5m baud which is significantly higher
> than most SoCs which default to 115200. I have noticed some timing
> differences since the introduction of the threaded console. A
> significant amount of information is dumped very early in the boot
> process (between 0 and 4 seconds into boot), as most drivers are
> probing during this time. It also happens to be when the earlycon
> console hands over to the normal console. There is no abnormal
> debugging enabled, the output is a standard (non-quiet) boot log. The
> question is why is direct mode not triggering during a panic?
Just to be clear, you are not losing any intermediate messages. Only the
tail end of the kernel log was never printed. Is this correct?
This may be the same issue being discussed here [0].
John Ogness
[0] https://lore.kernel.org/all/[email protected]
On (22/06/13 01:08), John Ogness wrote:
> On 2022-06-12, Sergey Senozhatsky <[email protected]> wrote:
> > Should a situation when we have only one online CPU be enough of a
> > reason to do direct printing? Otherwise we might not have CPUs to
> > wakeup khtread on, e.g. when CPU that printk is in atomic section for
> > too long.
>
> IMHO, no. Especially in that situation, we do not want printk causing
> that atomic section to become even longer. If the machine has entered
> normal operation, we want printk out of the way.
At the same time printk throttles itself in such cases: new messages are
not added at much higher pace that they are printed at. So we lower the
chances of missing messages.
On 2022-06-13, Sergey Senozhatsky <[email protected]> wrote:
>> > Should a situation when we have only one online CPU be enough of a
>> > reason to do direct printing? Otherwise we might not have CPUs to
>> > wakeup khtread on, e.g. when CPU that printk is in atomic section for
>> > too long.
>>
>> IMHO, no. Especially in that situation, we do not want printk causing
>> that atomic section to become even longer. If the machine has entered
>> normal operation, we want printk out of the way.
>
> At the same time printk throttles itself in such cases: new messages are
> not added at much higher pace that they are printed at. So we lower the
> chances of missing messages.
That is true if there is only 1 printk caller. For SMP systems with
printing handovers, it might not help at all. I firmly believe that
sprinkling randomness into printk (i.e. system) latencies is not the
answer. We need to keep printk lockless and out of the system's way
unless there is a real emergency happening.
This particular thread is not about missed messages due to printk not
"throttling the system", but rather the kernel buffers not getting
flushed in an emergency. This, of course, needs to be properly handled.
John Ogness
On (22/06/13 10:36), John Ogness wrote:
> >> IMHO, no. Especially in that situation, we do not want printk causing
> >> that atomic section to become even longer. If the machine has entered
> >> normal operation, we want printk out of the way.
> >
> > At the same time printk throttles itself in such cases: new messages are
> > not added at much higher pace that they are printed at. So we lower the
> > chances of missing messages.
>
> That is true if there is only 1 printk caller.
Well, which is the case when num_online_cpus() == 1?
> For SMP systems with printing handovers, it might not help at all.
> I firmly believe that sprinkling randomness into printk (i.e. system)
> latencies is not the answer. We need to keep printk lockless and out
> of the system's way unless there is a real emergency happening.
Yeah sure.
> This particular thread is not about missed messages due to printk not
> "throttling the system", but rather the kernel buffers not getting
> flushed in an emergency. This, of course, needs to be properly handled.
True, but Peter mentioned
"I noticed with threading enabled during large bursts the console
drops an excessive amount of messages. It's especially apparent
during the handover from earlycon to the normal console."
On Mon 2022-06-13 18:05:08, Sergey Senozhatsky wrote:
> On (22/06/13 10:36), John Ogness wrote:
> > >> IMHO, no. Especially in that situation, we do not want printk causing
> > >> that atomic section to become even longer. If the machine has entered
> > >> normal operation, we want printk out of the way.
> > >
> > > At the same time printk throttles itself in such cases: new messages are
> > > not added at much higher pace that they are printed at. So we lower the
> > > chances of missing messages.
> >
> > That is true if there is only 1 printk caller.
>
> Well, which is the case when num_online_cpus() == 1?
Good question.
Well, it would be nice to have the same behavior on single CPU
and SMP systems. Blocking atomic context with slow console is
bad even on single processor system. If there are problems with
lost messages then we will need a solution for SMP anyway.
> > For SMP systems with printing handovers, it might not help at all.
> > I firmly believe that sprinkling randomness into printk (i.e. system)
> > latencies is not the answer. We need to keep printk lockless and out
> > of the system's way unless there is a real emergency happening.
>
> Yeah sure.
>
> > This particular thread is not about missed messages due to printk not
> > "throttling the system", but rather the kernel buffers not getting
> > flushed in an emergency. This, of course, needs to be properly handled.
>
> True, but Peter mentioned
>
> "I noticed with threading enabled during large bursts the console
> drops an excessive amount of messages. It's especially apparent
> during the handover from earlycon to the normal console."
But this is also the situation when softlockups happen. It should
ideally be solved with a big enough buffer.
Another interesting alternative is the Peter Zijlstra's mode
where all messages are printed to the console "immediately".
They are serialized only by the CPU-reentrant lock.
This mode is not good for production system. But it might
be good for debugging. The good thing is that the behavior
is well defined.
I hope that we will get this mode with the atomic consoles.
Best Regards,
Petr
On Sun 2022-06-12 19:30:27, Peter Geis wrote:
> On Sun, Jun 12, 2022 at 7:08 PM John Ogness <[email protected]> wrote:
> >
> > On 2022-06-10, Peter Geis <[email protected]> wrote:
> > > This might be a side effect of the fact that this is on a low powered
> > > arm64 board. I noticed with threading enabled during large bursts the
> > > console drops an excessive amount of messages.
What do you mean that console drops an excessive amount of messages,
please?
Do you see "dropped XXX messages" in the console output?
Or something else?
> > > It's especially
> > > apparent during the handover from earlycon to the normal console.
> >
> > I guess you have a very small kernel buffer and are generating unusually
> > high amounts of messages? Is there a reason you cannot use a larger
> > buffer?
>
> The buffer isn't an issue here, everything is available in dmesg when
> userspace becomes available. Instead some messages bound for the
> serial console are never output.
This is strage. The message "dropped XXX messages" is printed only
when the buffer was full and the oldest messages were overwritten
before they were pushed to the console. If "dmesg" shows all messages
it means that no messages were overwritten.
Best Regards,
Petr
On Mon 2022-06-13 04:29:50, John Ogness wrote:
> On 2022-06-12, Peter Geis <[email protected]> wrote:
> > The buffer isn't an issue here, everything is available in dmesg when
> > userspace becomes available. Instead some messages bound for the
> > serial console are never output.
>
> OK. Good to know.
>
> > We run a serial console at 1.5m baud which is significantly higher
> > than most SoCs which default to 115200. I have noticed some timing
> > differences since the introduction of the threaded console. A
> > significant amount of information is dumped very early in the boot
> > process (between 0 and 4 seconds into boot), as most drivers are
> > probing during this time. It also happens to be when the earlycon
> > console hands over to the normal console. There is no abnormal
> > debugging enabled, the output is a standard (non-quiet) boot log. The
> > question is why is direct mode not triggering during a panic?
>
> Just to be clear, you are not losing any intermediate messages. Only the
> tail end of the kernel log was never printed. Is this correct?
>
> This may be the same issue being discussed here [0].
>
> John Ogness
>
> [0] https://lore.kernel.org/all/[email protected]
If the last messages are missing then it is most likely the same
issue.
Peter, could you please try if the patch at
https://lore.kernel.org/r/YqdSw/fJvnkRbjvc@alley
would make any difference?
Best Regards,
Petr
From: Petr Mladek
> Sent: 13 June 2022 11:14
...
> Another interesting alternative is the Peter Zijlstra's mode
> where all messages are printed to the console "immediately".
> They are serialized only by the CPU-reentrant lock.
>
> This mode is not good for production system. But it might
> be good for debugging. The good thing is that the behavior
> is well defined.
ISTM that all messages should be output (even if this means
that cpu spin waiting for a serial console) until userspace
has a chance to set an option to change the behaviour.
Oh, and can someone stop the distros hiding the console output.
I want to see the 'oops' traceback when I break the kernel!
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Mon, Jun 13, 2022 at 11:11 AM Petr Mladek <[email protected]> wrote:
>
> On Mon 2022-06-13 04:29:50, John Ogness wrote:
> > On 2022-06-12, Peter Geis <[email protected]> wrote:
> > > The buffer isn't an issue here, everything is available in dmesg when
> > > userspace becomes available. Instead some messages bound for the
> > > serial console are never output.
> >
> > OK. Good to know.
> >
> > > We run a serial console at 1.5m baud which is significantly higher
> > > than most SoCs which default to 115200. I have noticed some timing
> > > differences since the introduction of the threaded console. A
> > > significant amount of information is dumped very early in the boot
> > > process (between 0 and 4 seconds into boot), as most drivers are
> > > probing during this time. It also happens to be when the earlycon
> > > console hands over to the normal console. There is no abnormal
> > > debugging enabled, the output is a standard (non-quiet) boot log. The
> > > question is why is direct mode not triggering during a panic?
> >
> > Just to be clear, you are not losing any intermediate messages. Only the
> > tail end of the kernel log was never printed. Is this correct?
> >
> > This may be the same issue being discussed here [0].
> >
> > John Ogness
> >
> > [0] https://lore.kernel.org/all/[email protected]
>
> If the last messages are missing then it is most likely the same
> issue.
>
> Peter, could you please try if the patch at
> https://lore.kernel.org/r/YqdSw/fJvnkRbjvc@alley
> would make any difference?
This patch permits the panic to come through to the console port. Thank you!
Tested-by: Peter Geis <[email protected]>
>
> Best Regards,
> Petr
On Mon 2022-06-13 18:20:03, Peter Geis wrote:
> On Mon, Jun 13, 2022 at 11:11 AM Petr Mladek <[email protected]> wrote:
> >
> > On Mon 2022-06-13 04:29:50, John Ogness wrote:
> > > On 2022-06-12, Peter Geis <[email protected]> wrote:
> > > > The buffer isn't an issue here, everything is available in dmesg when
> > > > userspace becomes available. Instead some messages bound for the
> > > > serial console are never output.
> > >
> > > OK. Good to know.
> > >
> > > > We run a serial console at 1.5m baud which is significantly higher
> > > > than most SoCs which default to 115200. I have noticed some timing
> > > > differences since the introduction of the threaded console. A
> > > > significant amount of information is dumped very early in the boot
> > > > process (between 0 and 4 seconds into boot), as most drivers are
> > > > probing during this time. It also happens to be when the earlycon
> > > > console hands over to the normal console. There is no abnormal
> > > > debugging enabled, the output is a standard (non-quiet) boot log. The
> > > > question is why is direct mode not triggering during a panic?
> > >
> > > Just to be clear, you are not losing any intermediate messages. Only the
> > > tail end of the kernel log was never printed. Is this correct?
> > >
> > > This may be the same issue being discussed here [0].
> > >
> > > John Ogness
> > >
> > > [0] https://lore.kernel.org/all/[email protected]
> >
> > If the last messages are missing then it is most likely the same
> > issue.
> >
> > Peter, could you please try if the patch at
> > https://lore.kernel.org/r/YqdSw/fJvnkRbjvc@alley
> > would make any difference?
>
> This patch permits the panic to come through to the console port. Thank you!
> Tested-by: Peter Geis <[email protected]>
Great news. Thanks a lot for feedback.
I guess that the patch is not final. But it seems to be a good
direction.
Best Regards,
Petr
On Mon 2022-06-13 16:11:19, David Laight wrote:
> From: Petr Mladek
> > Sent: 13 June 2022 11:14
> ...
> > Another interesting alternative is the Peter Zijlstra's mode
> > where all messages are printed to the console "immediately".
> > They are serialized only by the CPU-reentrant lock.
> >
> > This mode is not good for production system. But it might
> > be good for debugging. The good thing is that the behavior
> > is well defined.
>
> ISTM that all messages should be output (even if this means
> that cpu spin waiting for a serial console) until userspace
> has a chance to set an option to change the behaviour.
There are systems that do not boot because of softlockups
caused by slow consoles.
I am pretty sure that we will add the option to disable
printk kthreads sooner or later. But we prefer to improve
the code so that it works out of box. If we add the option
already know then people will disable kthreads and
do not report problematic scenarios.
> Oh, and can someone stop the distros hiding the console output.
> I want to see the 'oops' traceback when I break the kernel!
Different users have different needs. AFAIK, distros hide
the messages because they scare users and overload support.
It is probably more effective to enable them only for debugging
when users notice some malfunction.
I do not know the details. I guess that there were years of
evolution. The default in SUSE was probably set by some
project managers or people taking care of boot splash,
installation, ...
Best Regards,
Petr