If message sizes average larger than expected (more than 32
characters), the data_ring will wrap before the desc_ring. Once the
data_ring wraps, it will start invalidating descriptors. These
invalid descriptors hang around until they are eventually recycled
when the desc_ring wraps. Readers do not care about invalid
descriptors, but they still need to iterate past them. If the
average message size is much larger than 32 characters, then there
will be many invalid descriptors preceding the valid descriptors.
The function prb_first_valid_seq() always begins at the oldest
descriptor and searches for the first valid descriptor. This can
be rather expensive for the above scenario. And, in fact, because
of its heavy usage in /dev/kmsg, there have been reports of long
delays and even RCU stalls.
For code that does not need to search from the oldest record,
replace prb_first_valid_seq() usage with prb_read_valid_*()
functions, which provide a start sequence number to search from.
Fixes: 896fbe20b4e2333fb55 ("printk: use the lockless ringbuffer")
Reported-by: kernel test robot <[email protected]>
Reported-by: J. Avila <[email protected]>
Signed-off-by: John Ogness <[email protected]>
---
patch against next-20210205
kernel/printk/printk.c | 29 ++++++++++++++++++-----------
1 file changed, 18 insertions(+), 11 deletions(-)
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 5a95c688621f..035aae771ea1 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -735,9 +735,9 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
logbuf_lock_irq();
}
- if (user->seq < prb_first_valid_seq(prb)) {
+ if (r->info->seq != user->seq) {
/* our last seen message is gone, return error and reset */
- user->seq = prb_first_valid_seq(prb);
+ user->seq = r->info->seq;
ret = -EPIPE;
logbuf_unlock_irq();
goto out;
@@ -812,6 +812,7 @@ static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
static __poll_t devkmsg_poll(struct file *file, poll_table *wait)
{
struct devkmsg_user *user = file->private_data;
+ struct printk_info info;
__poll_t ret = 0;
if (!user)
@@ -820,9 +821,9 @@ static __poll_t devkmsg_poll(struct file *file, poll_table *wait)
poll_wait(file, &log_wait, wait);
logbuf_lock_irq();
- if (prb_read_valid(prb, user->seq, NULL)) {
+ if (prb_read_valid_info(prb, user->seq, &info, NULL)) {
/* return error when data has vanished underneath us */
- if (user->seq < prb_first_valid_seq(prb))
+ if (info.seq != user->seq)
ret = EPOLLIN|EPOLLRDNORM|EPOLLERR|EPOLLPRI;
else
ret = EPOLLIN|EPOLLRDNORM;
@@ -1559,6 +1560,7 @@ static void syslog_clear(void)
int do_syslog(int type, char __user *buf, int len, int source)
{
+ struct printk_info info;
bool clear = false;
static int saved_console_loglevel = LOGLEVEL_DEFAULT;
int error;
@@ -1629,9 +1631,13 @@ int do_syslog(int type, char __user *buf, int len, int source)
/* Number of chars in the log buffer */
case SYSLOG_ACTION_SIZE_UNREAD:
logbuf_lock_irq();
- if (syslog_seq < prb_first_valid_seq(prb)) {
- /* messages are gone, move to first one */
- syslog_seq = prb_first_valid_seq(prb);
+ if (prb_read_valid_info(prb, syslog_seq, &info, NULL)) {
+ if (info.seq != syslog_seq) {
+ /* messages are gone, move to first one */
+ syslog_seq = info.seq;
+ syslog_partial = 0;
+ }
+ } else {
syslog_partial = 0;
}
if (source == SYSLOG_FROM_PROC) {
@@ -1643,7 +1649,6 @@ int do_syslog(int type, char __user *buf, int len, int source)
error = prb_next_seq(prb) - syslog_seq;
} else {
bool time = syslog_partial ? syslog_time : printk_time;
- struct printk_info info;
unsigned int line_count;
u64 seq;
@@ -3429,9 +3434,11 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
goto out;
logbuf_lock_irqsave(flags);
- if (dumper->cur_seq < prb_first_valid_seq(prb)) {
- /* messages are gone, move to first available one */
- dumper->cur_seq = prb_first_valid_seq(prb);
+ if (prb_read_valid_info(prb, dumper->cur_seq, &info, NULL)) {
+ if (info.seq != dumper->cur_seq) {
+ /* messages are gone, move to first available one */
+ dumper->cur_seq = info.seq;
+ }
}
/* last entry */
--
2.20.1
On (21/02/05 15:23), John Ogness wrote:
> If message sizes average larger than expected (more than 32
> characters), the data_ring will wrap before the desc_ring. Once the
> data_ring wraps, it will start invalidating descriptors. These
> invalid descriptors hang around until they are eventually recycled
> when the desc_ring wraps. Readers do not care about invalid
> descriptors, but they still need to iterate past them. If the
> average message size is much larger than 32 characters, then there
> will be many invalid descriptors preceding the valid descriptors.
>
> The function prb_first_valid_seq() always begins at the oldest
> descriptor and searches for the first valid descriptor. This can
> be rather expensive for the above scenario. And, in fact, because
> of its heavy usage in /dev/kmsg, there have been reports of long
> delays and even RCU stalls.
>
> For code that does not need to search from the oldest record,
> replace prb_first_valid_seq() usage with prb_read_valid_*()
> functions, which provide a start sequence number to search from.
>
> Fixes: 896fbe20b4e2333fb55 ("printk: use the lockless ringbuffer")
> Reported-by: kernel test robot <[email protected]>
Can we please also ask the kernel test robot to test this patch?
-ss
On 2021-02-08, Sergey Senozhatsky <[email protected]> wrote:
> Can we please also ask the kernel test robot to test this patch?
LKP is an automated service. The problem was reported for an older
commit. The new patch will not apply.
I will try to contact the LKP team and see how we can get some sort of
verification.
@Avila: Can you also verify that this patch fixes your issue [0]?
John Ogness
[0] https://lkml.kernel.org/r/[email protected]
Hello John,
Thank you for looking into this! Yes, I can confirm that your patch seems
to have resolved the issue I'd reported.
Thanks,
Avila
On Mon, Feb 8, 2021 at 1:21 AM John Ogness <[email protected]> wrote:
>
> On 2021-02-08, Sergey Senozhatsky <[email protected]> wrote:
> > Can we please also ask the kernel test robot to test this patch?
>
> LKP is an automated service. The problem was reported for an older
> commit. The new patch will not apply.
>
> I will try to contact the LKP team and see how we can get some sort of
> verification.
>
> @Avila: Can you also verify that this patch fixes your issue [0]?
>
> John Ogness
>
> [0] https://lkml.kernel.org/r/[email protected]
On (21/02/05 15:23), John Ogness wrote:
> If message sizes average larger than expected (more than 32
> characters), the data_ring will wrap before the desc_ring. Once the
> data_ring wraps, it will start invalidating descriptors. These
> invalid descriptors hang around until they are eventually recycled
> when the desc_ring wraps. Readers do not care about invalid
> descriptors, but they still need to iterate past them. If the
> average message size is much larger than 32 characters, then there
> will be many invalid descriptors preceding the valid descriptors.
>
> The function prb_first_valid_seq() always begins at the oldest
> descriptor and searches for the first valid descriptor. This can
> be rather expensive for the above scenario. And, in fact, because
> of its heavy usage in /dev/kmsg, there have been reports of long
> delays and even RCU stalls.
>
> For code that does not need to search from the oldest record,
> replace prb_first_valid_seq() usage with prb_read_valid_*()
> functions, which provide a start sequence number to search from.
>
> Fixes: 896fbe20b4e2333fb55 ("printk: use the lockless ringbuffer")
> Reported-by: kernel test robot <[email protected]>
> Reported-by: J. Avila <[email protected]>
> Signed-off-by: John Ogness <[email protected]>
Acked-by: Sergey Senozhatsky <[email protected]>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 5a95c688621f..035aae771ea1 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -735,9 +735,9 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
> logbuf_lock_irq();
> }
>
> - if (user->seq < prb_first_valid_seq(prb)) {
> + if (r->info->seq != user->seq) {
> /* our last seen message is gone, return error and reset */
> - user->seq = prb_first_valid_seq(prb);
Yeah, I can see how this pattern can be expensive, it would have been
less obvious and harder to spot had it been something like this
valid_seq = prb_first_valid_seq(prb);
if (user->seq < valid_seq) {
user->seq = valid_seq;
...
}
Great analysis, John.
I wonder if Intel test robot measures all test execution times; I do
recall "we saw N% performance improvement after patch P" emails, but
not sure if all of the tests are being tracked.
-ss
On Fri 2021-02-05 15:23:28, John Ogness wrote:
> If message sizes average larger than expected (more than 32
> characters), the data_ring will wrap before the desc_ring. Once the
> data_ring wraps, it will start invalidating descriptors. These
> invalid descriptors hang around until they are eventually recycled
> when the desc_ring wraps. Readers do not care about invalid
> descriptors, but they still need to iterate past them. If the
> average message size is much larger than 32 characters, then there
> will be many invalid descriptors preceding the valid descriptors.
>
> The function prb_first_valid_seq() always begins at the oldest
> descriptor and searches for the first valid descriptor. This can
> be rather expensive for the above scenario. And, in fact, because
> of its heavy usage in /dev/kmsg, there have been reports of long
> delays and even RCU stalls.
>
> For code that does not need to search from the oldest record,
> replace prb_first_valid_seq() usage with prb_read_valid_*()
> functions, which provide a start sequence number to search from.
>
> Fixes: 896fbe20b4e2333fb55 ("printk: use the lockless ringbuffer")
> Reported-by: kernel test robot <[email protected]>
> Reported-by: J. Avila <[email protected]>
> Signed-off-by: John Ogness <[email protected]>
> ---
> patch against next-20210205
>
> kernel/printk/printk.c | 29 ++++++++++++++++++-----------
> 1 file changed, 18 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 5a95c688621f..035aae771ea1 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1559,6 +1560,7 @@ static void syslog_clear(void)
>
> int do_syslog(int type, char __user *buf, int len, int source)
> {
> + struct printk_info info;
> bool clear = false;
> static int saved_console_loglevel = LOGLEVEL_DEFAULT;
> int error;
> @@ -1629,9 +1631,13 @@ int do_syslog(int type, char __user *buf, int len, int source)
> /* Number of chars in the log buffer */
> case SYSLOG_ACTION_SIZE_UNREAD:
> logbuf_lock_irq();
> - if (syslog_seq < prb_first_valid_seq(prb)) {
> - /* messages are gone, move to first one */
> - syslog_seq = prb_first_valid_seq(prb);
> + if (prb_read_valid_info(prb, syslog_seq, &info, NULL)) {
> + if (info.seq != syslog_seq) {
> + /* messages are gone, move to first one */
> + syslog_seq = info.seq;
> + syslog_partial = 0;
> + }
> + } else {
> syslog_partial = 0;
I am scratching my head when prb_read_valid_info(prb,
syslog_seq, &info, NULL)) might fail.
It might fail when syslog_seq points to the next message
after the last valid one. In this case, we could return
immediately (after releasing the lock) because there are
zero unread messages.
Anyway, syslog_partial must be zero in this case. syslog_seq
should stay when the last read was partial. And there should
always be at least one valid message in the log buffer
be design.
Do I get it correctly, please?
IMHO, it would deserve a comment and maybe even a warning.
What about something like?
/* Number of chars in the log buffer */
case SYSLOG_ACTION_SIZE_UNREAD:
logbuf_lock_irq();
if (!prb_read_valid_info(prb, syslog_seq, &info, NULL)) {
/* No unread message */
if (syslog_partial) {
/* This should never happen. */
pr_err_once("Unable to read any message even when the last syslog read was partial: %zu", syslog_partial);
syslog_partial = 0;
}
logbuf_unlock_irq();
return 0;
}
if (info.seq != syslog_seq) {
/* messages are gone, move to first one */
syslog_seq = info.seq;
syslog_partial = 0;
}
if (source == SYSLOG_FROM_PROC) {
/*
* Short-cut for poll(/"proc/kmsg") which simply checks
[...]
Best Regards,
Petr
On 2021-02-08, Sergey Senozhatsky <[email protected]> wrote:
>> Can we please also ask the kernel test robot to test this patch?
Oliver Sang from LKP was able to verify that the RCU stall problem is
not seen anymore on their side. See his response below.
Thanks Oliver!
John Ogness
On 2021-02-10, Oliver Sang <[email protected]> wrote:
> On Mon, Feb 08, 2021 at 10:35:27AM +0106, John Ogness wrote:
>> Hello LKP Project,
>>
>> Thank you for your valuable and excellent work!
>>
>> You recently detected a problem:
>>
>> https://lists.01.org/hyperkitty/list/[email protected]/thread/STZF3OODVA5KOG447JR2AJJXREWIPRXD/
>>
>> We have posted a patch to fix the issue:
>>
>> https://lkml.kernel.org/r/[email protected]
>
> Hi John Ogness,
>
> by applying the patch upon below commit:
> commit: b031a684bfd01d633c79d281bd0cf11c2f834ada ("printk: remove logbuf_lock writer-protection of ringbuffer")
>
> we didn't reproduce the previous INFO:rcu_tasks_detected_stalls_on_tasks
> issue in 30 runs:
>
> b031a684bfd01d63: ("printk: remove logbuf_lock writer-protection of ringbuffer")
> 7e926a042bfad8b7: ("printk: avoid prb_first_valid_seq() where possible")
>
> b031a684bfd01d63 7e926a042bfad8b7334b4677d3
> ---------------- --------------------------
> fail:runs %reproduction fail:runs
> | | |
> 10:21 -48% :30 dmesg.INFO:rcu_tasks_detected_stalls_on_tasks
> 19:21 -90% :30 last_state.is_incomplete_run
> 1:21 -5% :30 last_state.post_run
>
>
>>
>> Using a local lkp installation I can verify the problem is fixed. But we
>> would like to know if there possibilities to verify fixes using the LKP
>> test robot? Or is there any way to check that the test robot sees the
>> problem is fixed?
>>
>> Thanks.
>>
>> John Ogness
On 2021-02-09, Petr Mladek <[email protected]> wrote:
>> @@ -1629,9 +1631,13 @@ int do_syslog(int type, char __user *buf, int len, int source)
>> /* Number of chars in the log buffer */
>> case SYSLOG_ACTION_SIZE_UNREAD:
>> logbuf_lock_irq();
>> - if (syslog_seq < prb_first_valid_seq(prb)) {
>> - /* messages are gone, move to first one */
>> - syslog_seq = prb_first_valid_seq(prb);
>> + if (prb_read_valid_info(prb, syslog_seq, &info, NULL)) {
>> + if (info.seq != syslog_seq) {
>> + /* messages are gone, move to first one */
>> + syslog_seq = info.seq;
>> + syslog_partial = 0;
>> + }
>> + } else {
>> syslog_partial = 0;
>
> I am scratching my head when prb_read_valid_info(prb,
> syslog_seq, &info, NULL)) might fail.
It can fail because the descriptor has been invalidated/recycled by
writers and perhaps there is no valid record that has yet come after it.
> It might fail when syslog_seq points to the next message
> after the last valid one. In this case, we could return
> immediately (after releasing the lock) because there are
> zero unread messages.
Yes, we could just return 0 in this case. If we are returning and not
modifying @syslog_seq, then there is no need to reset
@syslog_partial. At some point a reader will notice that the record is
gone and reset @syslog_partial accordingly.
> Anyway, syslog_partial must be zero in this case. syslog_seq
> should stay when the last read was partial. And there should
> always be at least one valid message in the log buffer
> be design.
A record can be invalidated at any time. It is a normal case that a
re-read of a record (to get the rest of the partial) can lead to the
record no longer being available.
> IMHO, it would deserve a comment and maybe even a warning.
I don't think we need a warning. It is something that can happen and it
is not a problem.
> What about something like?
>
> /* Number of chars in the log buffer */
> case SYSLOG_ACTION_SIZE_UNREAD:
> logbuf_lock_irq();
> if (!prb_read_valid_info(prb, syslog_seq, &info, NULL)) {
> /* No unread message */
> if (syslog_partial) {
> /* This should never happen. */
> pr_err_once("Unable to read any message even when the last syslog read was partial: %zu", syslog_partial);
> syslog_partial = 0;
> }
> logbuf_unlock_irq();
> return 0;
> }
I recommend changing your suggestion to:
> if (!prb_read_valid_info(prb, syslog_seq, &info, NULL)) {
> /*
> * No unread messages. No need to check/reset
> * syslog_partial. When a reader does read a new
> * message it will notice and appropriately update
> * syslog_seq and reset syslog_partial.
> */
> logbuf_unlock_irq();
> return 0;
> }
> if (info.seq != syslog_seq) {
> /* messages are gone, move to first one */
> syslog_seq = info.seq;
> syslog_partial = 0;
> }
John Ogness
On Wed 2021-02-10 19:32:10, John Ogness wrote:
> On 2021-02-09, Petr Mladek <[email protected]> wrote:
> >> @@ -1629,9 +1631,13 @@ int do_syslog(int type, char __user *buf, int len, int source)
> >> /* Number of chars in the log buffer */
> >> case SYSLOG_ACTION_SIZE_UNREAD:
> >> logbuf_lock_irq();
> >> - if (syslog_seq < prb_first_valid_seq(prb)) {
> >> - /* messages are gone, move to first one */
> >> - syslog_seq = prb_first_valid_seq(prb);
> >> + if (prb_read_valid_info(prb, syslog_seq, &info, NULL)) {
> >> + if (info.seq != syslog_seq) {
> >> + /* messages are gone, move to first one */
> >> + syslog_seq = info.seq;
> >> + syslog_partial = 0;
> >> + }
> >> + } else {
> >> syslog_partial = 0;
> >
> > I am scratching my head when prb_read_valid_info(prb,
> > syslog_seq, &info, NULL)) might fail.
>
> It can fail because the descriptor has been invalidated/recycled by
> writers and perhaps there is no valid record that has yet come after it.
I see. From some reasons I though that there should always be at
least one message in the commited state. But it is enough when
it is in reusable state. I should have double checked it.
> I recommend changing your suggestion to:
>
> > if (!prb_read_valid_info(prb, syslog_seq, &info, NULL)) {
> > /*
> > * No unread messages. No need to check/reset
> > * syslog_partial. When a reader does read a new
> > * message it will notice and appropriately update
> > * syslog_seq and reset syslog_partial.
> > */
The following comment might be enough after all.
/* No unread messages. */
My main concern was that we cleared syslog_partial and continued.
I thought that we might miss a bug this way. But it seems to
be perfectly fine. I just have to update my mental picture.
Otherwise. the fact that syslog_partial will be fixed by the next
successful call is more or less obvious if we change the code as you
propose.
Please, send an updated patch.
Best Regards,
Petr
> > logbuf_unlock_irq();
> > return 0;
> > }
> > if (info.seq != syslog_seq) {
> > /* messages are gone, move to first one */
> > syslog_seq = info.seq;
> > syslog_partial = 0;
> > }
>
> John Ogness