2022-08-29 09:18:40

by Hans de Goede

[permalink] [raw]
Subject: 6.0 tty regression, NULL pointer deref in flush_to_ldisc

Hi All,

This weekend I noticed that on various Bay Trail based systems which have
their bluetooth HCI connected over an uart (using hci_uart driver /
using the drivers/tty/serial bus) there is a NULL pointer deref in
flush_to_ldisc, see below for the full backtrace.

I *suspect* that this is caused by commit 6bb6fa6908eb
("tty: Implement lookahead to process XON/XOFF timely").

I can cleanly revert this by reverting the following commits:

ab24a01b2765 ("tty: Add closing marker into comment in tty_ldisc.h")
65534736d9a5 ("tty: Use flow-control char function on closing path")
6bb6fa6908eb ("tty: Implement lookahead to process XON/XOFF timely")

ATM I don't have one of the affected systems handy. I will give
a 6.0-rc3 kernel with these 3 commits reverted a try tonight (CEST)
and I'll let you know the results.

Note I can NOT confirm yet that these reverts fix things, so please
don't revert anything yet. I just wanted to give people a headsup
about this issue.

Also maybe we can fix the new lookahead code instead of reverting.
I would be happy to add a patch adding some debugging prints the
systems run fine after the backtrace as long as I don't suspend them
so gathering logs is easy.

Regards,

Hans



[ 28.626537] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 28.626555] #PF: supervisor instruction fetch in kernel mode
[ 28.626563] #PF: error_code(0x0010) - not-present page
[ 28.626569] PGD 0 P4D 0
[ 28.626580] Oops: 0010 [#1] PREEMPT SMP PTI
[ 28.626589] CPU: 2 PID: 8 Comm: kworker/u8:0 Tainted: G C E 6.0.0-rc2+ #102
[ 28.626598] Hardware name: MPMAN Converter9/Converter9, BIOS 5.6.5 07/28/2015
[ 28.626604] Workqueue: events_unbound flush_to_ldisc
[ 28.626617] RIP: 0010:0x0
[ 28.626633] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[ 28.626639] RSP: 0018:ffffacec40087e28 EFLAGS: 00010202
[ 28.626648] RAX: 0000000000000000 RBX: ffff92dc05fee000 RCX: 0000000000000001
[ 28.626654] RDX: 0000000000000000 RSI: ffff92dc05fee020 RDI: ffff92dc07341040
[ 28.626660] RBP: ffff92dc07341048 R08: ffff92dc05fee020 R09: 00000000f1e77022
[ 28.626667] R10: ffffacec40087e30 R11: 000000002f1e7702 R12: ffff92dc07341040
[ 28.626673] R13: ffff92dc07341090 R14: 0000000000000000 R15: 0000000000000001
[ 28.626679] FS: 0000000000000000(0000) GS:ffff92dc7bb00000(0000) knlGS:0000000000000000
[ 28.626687] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 28.626693] CR2: ffffffffffffffd6 CR3: 00000000060c6000 CR4: 00000000001006e0
[ 28.626700] Call Trace:
[ 28.626706] <TASK>
[ 28.626712] flush_to_ldisc+0x178/0x190
[ 28.626728] process_one_work+0x257/0x570
[ 28.626749] worker_thread+0x4f/0x3a0
[ 28.626762] ? process_one_work+0x570/0x570
[ 28.626772] kthread+0xf5/0x120
[ 28.626782] ? kthread_complete_and_exit+0x20/0x20
[ 28.626794] ret_from_fork+0x22/0x30
[ 28.626815] </TASK>
[ 28.626820] Modules linked in: fjes(-) snd_soc_rl6231 snd_intel_sdw_acpi hci_uart dw_dmac soc_button_array dptf_power int3406_thermal snd_soc_core btqca int3401_thermal btrtl processor_thermal_device btbcm processor_thermal_rfim snd_compress processor_thermal_mbox processor_thermal_rapl ac97_bus btintel snd_pcm_dmaengine intel_rapl_common int3403_thermal snd_seq int3400_thermal int340x_thermal_zone snd_seq_device acpi_thermal_rel bluetooth intel_int0002_vgpio(E) kxcjk_1013 atomisp_gc0310(CE) industrialio_triggered_buffer atomisp_ov2680(CE) snd_pcm kfifo_buf atomisp_gmin_platform(CE) industrialio acpi_pad silead(+) videodev mc snd_timer snd ecdh_generic rfkill soundcore mei_txe mei dwc3_pci lpc_ich vfat fat zram mmc_block i915 crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel drm_buddy drm_display_helper cec ttm video wmi(E) sdhci_acpi sdhci mmc_core pwm_lpss_platform pwm_lpss ip6_tables ip_tables i2c_dev ipmi_devintf ipmi_msghandler fuse
[ 28.627005] CR2: 0000000000000000
[ 28.627013] ---[ end trace 0000000000000000 ]---
[ 28.627020] RIP: 0010:0x0
[ 28.627032] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[ 28.627038] RSP: 0018:ffffacec40087e28 EFLAGS: 00010202
[ 28.627047] RAX: 0000000000000000 RBX: ffff92dc05fee000 RCX: 0000000000000001
[ 28.627053] RDX: 0000000000000000 RSI: ffff92dc05fee020 RDI: ffff92dc07341040
[ 28.627059] RBP: ffff92dc07341048 R08: ffff92dc05fee020 R09: 00000000f1e77022
[ 28.627065] R10: ffffacec40087e30 R11: 000000002f1e7702 R12: ffff92dc07341040
[ 28.627071] R13: ffff92dc07341090 R14: 0000000000000000 R15: 0000000000000001
[ 28.627077] FS: 0000000000000000(0000) GS:ffff92dc7bb00000(0000) knlGS:0000000000000000
[ 28.627085] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 28.627091] CR2: ffffffffffffffd6 CR3: 00000000060c6000 CR4: 00000000001006e0


2022-08-29 09:37:53

by Jiri Slaby

[permalink] [raw]
Subject: Weird RIP printed in BUGs [was: 6.0 tty regression, NULL pointer deref in flush_to_ldisc]

Hi,

On 29. 08. 22, 10:37, Hans de Goede wrote:
> [ 28.626537] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 28.626555] #PF: supervisor instruction fetch in kernel mode
> [ 28.626563] #PF: error_code(0x0010) - not-present page
> [ 28.626569] PGD 0 P4D 0
> [ 28.626580] Oops: 0010 [#1] PREEMPT SMP PTI
> [ 28.626589] CPU: 2 PID: 8 Comm: kworker/u8:0 Tainted: G C E 6.0.0-rc2+ #102
> [ 28.626598] Hardware name: MPMAN Converter9/Converter9, BIOS 5.6.5 07/28/2015
> [ 28.626604] Workqueue: events_unbound flush_to_ldisc
> [ 28.626617] RIP: 0010:0x0
> [ 28.626633] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.

Irrelevant to the original report, this new message format (the old
being "Bad RIP value") confuses me. It always makes me think how can RIP
be -42. So can we either:

1) print regs->ip value (0x0000000000000000) instead of prologue value
(regs->ip - 42 = 0xffffffffffffffd6) here? Even though we really pass
"regs->ip - 42" to copy_code()), or

2) don't print "RIP" in that message. So only "at 0xffffffffffffffd6"

? (I can send a patch for whichever is preferred, if anything.)

thanks,
--
js
suse labs

2022-08-29 10:34:48

by Hans de Goede

[permalink] [raw]
Subject: Re: 6.0 tty regression, NULL pointer deref in flush_to_ldisc

Hi,

On 8/29/22 11:36, Ilpo Järvinen wrote:
> On Mon, 29 Aug 2022, Hans de Goede wrote:
>
>> Hi All,
>>
>> This weekend I noticed that on various Bay Trail based systems which have
>> their bluetooth HCI connected over an uart (using hci_uart driver /
>> using the drivers/tty/serial bus) there is a NULL pointer deref in
>> flush_to_ldisc, see below for the full backtrace.
>>
>> I *suspect* that this is caused by commit 6bb6fa6908eb
>> ("tty: Implement lookahead to process XON/XOFF timely").
>>
>> I can cleanly revert this by reverting the following commits:
>>
>> ab24a01b2765 ("tty: Add closing marker into comment in tty_ldisc.h")
>> 65534736d9a5 ("tty: Use flow-control char function on closing path")
>> 6bb6fa6908eb ("tty: Implement lookahead to process XON/XOFF timely")
>>
>> ATM I don't have one of the affected systems handy. I will give
>> a 6.0-rc3 kernel with these 3 commits reverted a try tonight (CEST)
>> and I'll let you know the results.
>>
>> Note I can NOT confirm yet that these reverts fix things, so please
>> don't revert anything yet. I just wanted to give people a headsup
>> about this issue.
>>
>> Also maybe we can fix the new lookahead code instead of reverting.
>> I would be happy to add a patch adding some debugging prints the
>> systems run fine after the backtrace as long as I don't suspend them
>> so gathering logs is easy.
>
> I guess this will help:
>
> https://lore.kernel.org/linux-kernel/[email protected]/

Ah, yes that indeed looks like it should help. I'll give that a try tonight
and report back the result.

Regards,

Hans

2022-08-29 10:36:02

by Ilpo Järvinen

[permalink] [raw]
Subject: Re: 6.0 tty regression, NULL pointer deref in flush_to_ldisc

On Mon, 29 Aug 2022, Hans de Goede wrote:

> Hi All,
>
> This weekend I noticed that on various Bay Trail based systems which have
> their bluetooth HCI connected over an uart (using hci_uart driver /
> using the drivers/tty/serial bus) there is a NULL pointer deref in
> flush_to_ldisc, see below for the full backtrace.
>
> I *suspect* that this is caused by commit 6bb6fa6908eb
> ("tty: Implement lookahead to process XON/XOFF timely").
>
> I can cleanly revert this by reverting the following commits:
>
> ab24a01b2765 ("tty: Add closing marker into comment in tty_ldisc.h")
> 65534736d9a5 ("tty: Use flow-control char function on closing path")
> 6bb6fa6908eb ("tty: Implement lookahead to process XON/XOFF timely")
>
> ATM I don't have one of the affected systems handy. I will give
> a 6.0-rc3 kernel with these 3 commits reverted a try tonight (CEST)
> and I'll let you know the results.
>
> Note I can NOT confirm yet that these reverts fix things, so please
> don't revert anything yet. I just wanted to give people a headsup
> about this issue.
>
> Also maybe we can fix the new lookahead code instead of reverting.
> I would be happy to add a patch adding some debugging prints the
> systems run fine after the backtrace as long as I don't suspend them
> so gathering logs is easy.

I guess this will help:

https://lore.kernel.org/linux-kernel/[email protected]/

--
i.

2022-08-29 21:27:44

by Hans de Goede

[permalink] [raw]
Subject: Re: 6.0 tty regression, NULL pointer deref in flush_to_ldisc

Hi,

On 8/29/22 11:36, Ilpo Järvinen wrote:
> On Mon, 29 Aug 2022, Hans de Goede wrote:
>
>> Hi All,
>>
>> This weekend I noticed that on various Bay Trail based systems which have
>> their bluetooth HCI connected over an uart (using hci_uart driver /
>> using the drivers/tty/serial bus) there is a NULL pointer deref in
>> flush_to_ldisc, see below for the full backtrace.
>>
>> I *suspect* that this is caused by commit 6bb6fa6908eb
>> ("tty: Implement lookahead to process XON/XOFF timely").
>>
>> I can cleanly revert this by reverting the following commits:
>>
>> ab24a01b2765 ("tty: Add closing marker into comment in tty_ldisc.h")
>> 65534736d9a5 ("tty: Use flow-control char function on closing path")
>> 6bb6fa6908eb ("tty: Implement lookahead to process XON/XOFF timely")
>>
>> ATM I don't have one of the affected systems handy. I will give
>> a 6.0-rc3 kernel with these 3 commits reverted a try tonight (CEST)
>> and I'll let you know the results.
>>
>> Note I can NOT confirm yet that these reverts fix things, so please
>> don't revert anything yet. I just wanted to give people a headsup
>> about this issue.
>>
>> Also maybe we can fix the new lookahead code instead of reverting.
>> I would be happy to add a patch adding some debugging prints the
>> systems run fine after the backtrace as long as I don't suspend them
>> so gathering logs is easy.
>
> I guess this will help:
>
> https://lore.kernel.org/linux-kernel/[email protected]/

I can confirm that that patch fixes things, thanks.

Regards,

Hans

2022-09-06 07:41:42

by Jiri Slaby

[permalink] [raw]
Subject: [PATCH] x86/dumpstack: Don't mention RIP in "Code:"

Commit 238c91115cd0 ("x86/dumpstack: Fix misleading instruction pointer
error message") changed the "Code:" line in bug reports when RIP is an
invalid pointer. In particular, the report currently says (for example):

BUG: kernel NULL pointer dereference, address: 0000000000000000
...
RIP: 0010:0x0
Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.

That "Unable to access opcode bytes at RIP 0xffffffffffffffd6." is
quite confusing as RIP value is 0, not -42. That -42 comes from
"regs->ip - PROLOGUE_SIZE", because Code is dumped with some prologue
(and epilogue).

So do not mention "RIP" on this line in this context.

Cc: Mark Mossberg <[email protected]>
Cc: Borislav Petkov <[email protected]>
Signed-off-by: Jiri Slaby <[email protected]>
---
arch/x86/kernel/dumpstack.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index afae4dd77495..b3dba35f466e 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -128,7 +128,7 @@ void show_opcodes(struct pt_regs *regs, const char *loglvl)
/* No access to the user space stack of other tasks. Ignore. */
break;
default:
- printk("%sCode: Unable to access opcode bytes at RIP 0x%lx.\n",
+ printk("%sCode: Unable to access opcode bytes at 0x%lx.\n",
loglvl, prologue);
break;
}
--
2.37.3

2022-09-07 03:53:36

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86/dumpstack: Don't mention RIP in "Code:"

On Tue, Sep 06, 2022 at 09:11:23AM +0200, Jiri Slaby wrote:
> Commit 238c91115cd0 ("x86/dumpstack: Fix misleading instruction pointer
> error message") changed the "Code:" line in bug reports when RIP is an
> invalid pointer. In particular, the report currently says (for example):
>
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> ...
> RIP: 0010:0x0
> Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
>
> That "Unable to access opcode bytes at RIP 0xffffffffffffffd6." is
> quite confusing as RIP value is 0, not -42. That -42 comes from
> "regs->ip - PROLOGUE_SIZE", because Code is dumped with some prologue
> (and epilogue).
>
> So do not mention "RIP" on this line in this context.
>
> Cc: Mark Mossberg <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Signed-off-by: Jiri Slaby <[email protected]>
> ---
> arch/x86/kernel/dumpstack.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
> index afae4dd77495..b3dba35f466e 100644
> --- a/arch/x86/kernel/dumpstack.c
> +++ b/arch/x86/kernel/dumpstack.c
> @@ -128,7 +128,7 @@ void show_opcodes(struct pt_regs *regs, const char *loglvl)
> /* No access to the user space stack of other tasks. Ignore. */
> break;
> default:
> - printk("%sCode: Unable to access opcode bytes at RIP 0x%lx.\n",
> + printk("%sCode: Unable to access opcode bytes at 0x%lx.\n",
> loglvl, prologue);
> break;
> }

I guess... and it says "opcode bytes" to denote that it is trying to
access instructions so yeah, that RIP might be superfluous.

Acked-by: Borislav Petkov <[email protected]>

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Subject: [tip: x86/misc] x86/dumpstack: Don't mention RIP in "Code: "

The following commit has been merged into the x86/misc branch of tip:

Commit-ID: 5258b80e60da6d8908ae2846b234ed8d9d9d4a19
Gitweb: https://git.kernel.org/tip/5258b80e60da6d8908ae2846b234ed8d9d9d4a19
Author: Jiri Slaby <[email protected]>
AuthorDate: Tue, 06 Sep 2022 09:11:23 +02:00
Committer: Borislav Petkov <[email protected]>
CommitterDate: Tue, 20 Sep 2022 16:11:54 +02:00

x86/dumpstack: Don't mention RIP in "Code: "

Commit

238c91115cd0 ("x86/dumpstack: Fix misleading instruction pointer error message")

changed the "Code:" line in bug reports when RIP is an invalid pointer.
In particular, the report currently says (for example):

BUG: kernel NULL pointer dereference, address: 0000000000000000
...
RIP: 0010:0x0
Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.

That

Unable to access opcode bytes at RIP 0xffffffffffffffd6.

is quite confusing as RIP value is 0, not -42. That -42 comes from
"regs->ip - PROLOGUE_SIZE", because Code is dumped with some prologue
(and epilogue).

So do not mention "RIP" on this line in this context.

Signed-off-by: Jiri Slaby <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/kernel/dumpstack.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index afae4dd..b3dba35 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -128,7 +128,7 @@ void show_opcodes(struct pt_regs *regs, const char *loglvl)
/* No access to the user space stack of other tasks. Ignore. */
break;
default:
- printk("%sCode: Unable to access opcode bytes at RIP 0x%lx.\n",
+ printk("%sCode: Unable to access opcode bytes at 0x%lx.\n",
loglvl, prologue);
break;
}