2015-12-03 07:48:05

by Peter Rosin

[permalink] [raw]
Subject: Domain faults when CONFIG_CPU_SW_DOMAIN_PAN is enabled

Hi!

If I enable CONFIG_CPU_SW_DOMAIN_PAN, I sometimes (but not always) get the
following (or very similar) on boot.

Cheers,
Peter

Unhandled fault: page domain fault (0x81b) at 0x00086578
pgd = c2810000
[00086578] *pgd=22819831, *pte=224fe34f, *ppte=224fe83f
Internal error: : 81b [#1] ARM
Modules linked in:
CPU: 0 PID: 907 Comm: ntpd Not tainted 4.3.0+ #29
Hardware name: Atmel SAMA5
task: c398dac0 ti: c2804000 task.ti: c2804000
PC is at memcpy+0x50/0x330
LR is at 0x1
pc : [<c01daff0>]??? lr : [<00000001>]??? psr: 80070013
sp : c2805ed4? ip : 00000007? fp : 000d3808
r10: 00000000? r9 : 00000080? r8 : 00000001
r7 : 00000010? r6 : 00000010? r5 : 003f7374? r4 : 00000000
r3 : 00000002? r2 : ffffffe0? r1 : c2805f38? r0 : 00086578
Flags: Nzcv? IRQs on? FIQs on? Mode SVC_32? ISA ARM? Segment none
Control: 10c53c7d? Table: 22810059? DAC: 00000051
Process ntpd (pid: 907, stack limit = 0xc2804208)
Stack: (0xc2805ed4 to 0xc2806000)
5ec0:????????????????????????????????????????????? 00000000 c2804000 00000000
5ee0: c2805f18 00086578 00086578 c01e76c4 c2819218 c2ff3b74 000d4344 00086578
5f00: 00000000 00000051 0000007c c0010224 c2804000 c004c988 00000002 00000000
5f20: 003f7374 00000010 00000010 00000001 00000007 00000001 01f40000 565ff060
5f40: 000183ef 00002710 00000000 00000000 00000000 00000000 00000000 00000000
5f60: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
5f80: 00000000 00000000 00000000 00000000 00000000 00000000 000d3968 0005e38f
5fa0: 000d4344 c0010060 000d3968 0005e38f 00086578 00000000 ffffffff 00000001
5fc0: 000d3968 0005e38f 000d4344 0000007c 000d3814 00000001 000d4338 000d3808
5fe0: 00081e7c bec558b4 00026d3d b6d09126 00000030 00086578 009e3675 2a090bb6
[<c01daff0>] (memcpy) from [<c01e76c4>] (__copy_to_user_memcpy+0x138/0x17c)
[<c01e76c4>] (__copy_to_user_memcpy) from [<c004c988>] (SyS_adjtimex+0xd4/0xf0)
[<c004c988>] (SyS_adjtimex) from [<c0010060>] (ret_fast_syscall+0x0/0x3c)
Code: f5d1f05c f5d1f07c e8b151f8 e2522020 (e8a051f8)
---[ end trace 04981945a3df2e5e ]---


2015-12-03 08:34:54

by Peter Rosin

[permalink] [raw]
Subject: RE: Domain faults when CONFIG_CPU_SW_DOMAIN_PAN is enabled

I wrote:
> If I enable CONFIG_CPU_SW_DOMAIN_PAN, I sometimes (but not always) get the
> following (or very similar) on boot.

I should have said "if I don't disable", as the option is "default y".

Also, if it survives on boot, below is an example of later trouble (after 30+ minutes on
this occasion).

Cheers,
Peter

Unhandled fault: page domain fault (0x81b) at 0x00190f40
pgd = c2890000
[00190f40] *pgd=22f96831, *pte=21d9934f, *ppte=21d9983f
Internal error: : 81b [#1] ARM
Modules linked in:
CPU: 0 PID: 970 Comm: tse Not tainted 4.3.0+ #29
Hardware name: Atmel SAMA5
task: c3957480 ti: c2d62000 task.ti: c2d62000
PC is at memcpy+0x50/0x330
LR is at 0x0
pc : [<c01daff0>] lr : [<00000000>] psr: 80070013
sp : c2d63de4 ip : 00000000 fp : 00000000
r10: 00000000 r9 : 00000084 r8 : 00000000
r7 : 00000000 r6 : 1e6602e3 r5 : 00000000 r4 : 00000003
r3 : 00000002 r2 : ffffffe4 r1 : c2d63e4c r0 : 00190f40
Flags: Nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
Control: 10c53c7d Table: 22890059 DAC: 00000051
Process tse (pid: 970, stack limit = 0xc2d62208)
Stack: (0xc2d63de4 to 0xc2d64000)
3de0: 00000000 c2d62000 00000000 c2d63e2c 00190f40 00190f40 c01e76c4
3e00: c2f96640 c2fee874 00000000 00190f40 00000051 00000000 c2f7f000 c2ebd000
3e20: c3be3800 c0389620 400a0013 00000002 00000003 00000000 1e6602e3 00000000
3e40: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3e60: 00000000 00000000 00000000 00000000 1e66a56c 000015f9 00000000 00000000
3e80: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3ea0: 00000000 00000000 00000000 00000000 c3be3800 00190f40 c3a281c0 0000000e
3ec0: 00190f40 c038aa6c 00000001 c3be3800 b611b948 00001194 00000000 b5ffed8c
3ee0: c2d62000 c3be3800 00190f40 c3a281c0 0000000e 00190f40 c2d62000 00000000
3f00: 00000000 c038b704 00190f40 00000000 b611b948 00001194 c0844123 00190f40
3f20: c3be1c60 c3a281c0 0000000e 00190f40 c2d62000 c00a8b38 00000000 00000000
3f40: c2f14018 00000000 0000000d 00000000 b5ffedc0 00000000 0000000d 00000000
3f60: b5ffedc0 c00aa5dc 00000000 00000000 00190e10 c3a281c1 0000000e c3a281c0
3f80: c0844123 00190f40 c2d62000 c00a8d54 00190e10 00190dc0 00000000 00000036
3fa0: c0010224 c0010060 00190e10 00190dc0 0000000e c0844123 00190f40 00000000
3fc0: 00190e10 00190dc0 00000000 00000036 00000001 00000001 00000000 00000000
3fe0: b6c974a8 b5ffed7c b6c3d98d b6b2dc26 200a0030 0000000e 00000000 00000000
[<c01daff0>] (memcpy) from [<c01e76c4>] (__copy_to_user_memcpy+0x138/0x17c)
[<c01e76c4>] (__copy_to_user_memcpy) from [<c0389620>] (snd_pcm_sync_ptr+0x18c/0x1d0)
[<c0389620>] (snd_pcm_sync_ptr) from [<c038aa6c>] (snd_pcm_common_ioctl1+0x428/0xe4c)
[<c038aa6c>] (snd_pcm_common_ioctl1) from [<c038b704>] (snd_pcm_playback_ioctl1+0x274/0x5d4)
[<c038b704>] (snd_pcm_playback_ioctl1) from [<c00a8b38>] (do_vfs_ioctl+0x468/0x650)
[<c00a8b38>] (do_vfs_ioctl) from [<c00a8d54>] (SyS_ioctl+0x34/0x5c)
[<c00a8d54>] (SyS_ioctl) from [<c0010060>] (ret_fast_syscall+0x0/0x3c)
Code: f5d1f05c f5d1f07c e8b151f8 e2522020 (e8a051f8)
---[ end trace 680c06373934b76e ]---

2015-12-03 11:00:49

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: Domain faults when CONFIG_CPU_SW_DOMAIN_PAN is enabled

On Thu, Dec 03, 2015 at 08:33:13AM +0000, Peter Rosin wrote:
> I wrote:
> > If I enable CONFIG_CPU_SW_DOMAIN_PAN, I sometimes (but not always) get the
> > following (or very similar) on boot.
>
> I should have said "if I don't disable", as the option is "default y".
>
> Also, if it survives on boot, below is an example of later trouble (after
> 30+ minutes on this occasion).

Please apply this patch so we (might) get a slightly better oops dump,
and then try to reproduce.

diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 7a7c4ce..92789ce 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -95,6 +95,22 @@ void __show_regs(struct pt_regs *regs)
{
unsigned long flags;
char buf[64];
+#ifndef CONFIG_CPU_V7M
+ unsigned int domain;
+#ifdef CONFIG_CPU_SW_DOMAIN_PAN
+ /*
+ * Get the domain register for the parent context. In user
+ * mode, we don't save the DACR, so lets use what it should
+ * be. For other modes, we place it after the pt_regs struct.
+ */
+ if (user_mode(regs))
+ domain = DACR_UACCESS_ENABLE;
+ else
+ domain = *(unsigned int *)(regs + 1);
+#else
+ domain = get_domain();
+#endif
+#endif

show_regs_print_info(KERN_DEFAULT);

@@ -123,21 +139,8 @@ void __show_regs(struct pt_regs *regs)

#ifndef CONFIG_CPU_V7M
{
- unsigned int domain = get_domain();
const char *segment;

-#ifdef CONFIG_CPU_SW_DOMAIN_PAN
- /*
- * Get the domain register for the parent context. In user
- * mode, we don't save the DACR, so lets use what it should
- * be. For other modes, we place it after the pt_regs struct.
- */
- if (user_mode(regs))
- domain = DACR_UACCESS_ENABLE;
- else
- domain = *(unsigned int *)(regs + 1);
-#endif
-
if ((domain & domain_mask(DOMAIN_USER)) ==
domain_val(DOMAIN_USER, DOMAIN_NOACCESS))
segment = "none";
@@ -163,11 +166,11 @@ void __show_regs(struct pt_regs *regs)
buf[0] = '\0';
#ifdef CONFIG_CPU_CP15_MMU
{
- unsigned int transbase, dac = get_domain();
+ unsigned int transbase;
asm("mrc p15, 0, %0, c2, c0\n\t"
: "=r" (transbase));
snprintf(buf, sizeof(buf), " Table: %08x DAC: %08x",
- transbase, dac);
+ transbase, domain);
}
#endif
asm("mrc p15, 0, %0, c1, c0\n" : "=r" (ctrl));

--
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2015-12-03 11:40:01

by Peter Rosin

[permalink] [raw]
Subject: RE: Domain faults when CONFIG_CPU_SW_DOMAIN_PAN is enabled

Russell King wrote:
> On Thu, Dec 03, 2015 at 08:33:13AM +0000, Peter Rosin wrote:
> > I wrote:
> > > If I enable CONFIG_CPU_SW_DOMAIN_PAN, I sometimes (but not always) get the
> > > following (or very similar) on boot.
> >
> > I should have said "if I don't disable", as the option is "default y".
> >
> > Also, if it survives on boot, below is an example of later trouble (after
> > 30+ minutes on this occasion).
>
> Please apply this patch so we (might) get a slightly better oops dump,
> and then try to reproduce.

Sure thing, but it's still "DAC: 00000051"...

Cheers,
Peter

Unhandled fault: page domain fault (0x81b) at 0x00086578
pgd = c280c000
[00086578] *pgd=22f77831, *pte=2347034f, *ppte=2347083f
Internal error: : 81b [#1] ARM
Modules linked in:
CPU: 0 PID: 904 Comm: ntpd Not tainted 4.3.0+ #30
Hardware name: Atmel SAMA5
task: c39bb500 ti: c2fca000 task.ti: c2fca000
PC is at memcpy+0x50/0x330
LR is at 0x1
pc : [<c01daff0>] lr : [<00000001>] psr: 800f0013
sp : c2fcbed4 ip : 00000007 fp : 000d3808
r10: 00000000 r9 : 00000080 r8 : 00000001
r7 : 00000010 r6 : 00000010 r5 : 003f5c28 r4 : 00000000
r3 : 00000002 r2 : ffffffe0 r1 : c2fcbf38 r0 : 00086578
Flags: Nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
Control: 10c53c7d Table: 2280c059 DAC: 00000051
Process ntpd (pid: 904, stack limit = 0xc2fca208)
Stack: (0xc2fcbed4 to 0xc2fcc000)
bec0: 00000000 c2fca000 00000000
bee0: c2fcbf18 00086578 00086578 c01e76c4 c2f77218 c2ff70f4 000d4344 00086578
bf00: 00000000 00000051 0000007c c0010224 c2fca000 c004c988 00000002 00000000
bf20: 003f5c28 00000010 00000010 00000001 00000007 00000001 01f40000 5660283b
bf40: 000ae9dc 00002710 00000000 00000000 00000000 00000000 00000000 00000000
bf60: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bf80: 00000000 00000000 00000000 00000000 00000000 00000000 000d3968 0005e38f
bfa0: 000d4344 c0010060 000d3968 0005e38f 00086578 00000000 ffffffff 00000001
bfc0: 000d3968 0005e38f 000d4344 0000007c 000d3814 00000001 000d4338 000d3808
bfe0: 00081e7c bec3e8b4 00026d3d b6ced126 00000030 00086578 8499c1b5 48743d49
[<c01daff0>] (memcpy) from [<c01e76c4>] (__copy_to_user_memcpy+0x138/0x17c)
[<c01e76c4>] (__copy_to_user_memcpy) from [<c004c988>] (SyS_adjtimex+0xd4/0xf0)
[<c004c988>] (SyS_adjtimex) from [<c0010060>] (ret_fast_syscall+0x0/0x3c)
Code: f5d1f05c f5d1f07c e8b151f8 e2522020 (e8a051f8)
---[ end trace bd9256bb17081c58 ]---

2015-12-03 11:51:50

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: Domain faults when CONFIG_CPU_SW_DOMAIN_PAN is enabled

On Thu, Dec 03, 2015 at 11:38:20AM +0000, Peter Rosin wrote:
> Russell King wrote:
> > On Thu, Dec 03, 2015 at 08:33:13AM +0000, Peter Rosin wrote:
> > > I wrote:
> > > > If I enable CONFIG_CPU_SW_DOMAIN_PAN, I sometimes (but not always) get the
> > > > following (or very similar) on boot.
> > >
> > > I should have said "if I don't disable", as the option is "default y".
> > >
> > > Also, if it survives on boot, below is an example of later trouble (after
> > > 30+ minutes on this occasion).
> >
> > Please apply this patch so we (might) get a slightly better oops dump,
> > and then try to reproduce.
>
> Sure thing, but it's still "DAC: 00000051"...

Thanks, that confirms that something in the uaccess-with-memcpy code
is clearing the DACR back to 0x51.

This has come up several times before, and I really can't spot the
problem in this code, so I've always said to disable the
uaccess-with-memcpy code. Personally, I'd like to see the back
of that code...

--
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2015-12-03 12:09:49

by Peter Rosin

[permalink] [raw]
Subject: RE: Domain faults when CONFIG_CPU_SW_DOMAIN_PAN is enabled

Russell King wrote:
> On Thu, Dec 03, 2015 at 11:38:20AM +0000, Peter Rosin wrote:
> > Russell King wrote:
> > > On Thu, Dec 03, 2015 at 08:33:13AM +0000, Peter Rosin wrote:
> > > > I wrote:
> > > > > If I enable CONFIG_CPU_SW_DOMAIN_PAN, I sometimes (but not always) get the
> > > > > following (or very similar) on boot.
> > > >
> > > > I should have said "if I don't disable", as the option is "default y".
> > > >
> > > > Also, if it survives on boot, below is an example of later trouble (after
> > > > 30+ minutes on this occasion).
> > >
> > > Please apply this patch so we (might) get a slightly better oops dump,
> > > and then try to reproduce.
> >
> > Sure thing, but it's still "DAC: 00000051"...
>
> Thanks, that confirms that something in the uaccess-with-memcpy code
> is clearing the DACR back to 0x51.
>
> This has come up several times before, and I really can't spot the
> problem in this code, so I've always said to disable the
> uaccess-with-memcpy code. Personally, I'd like to see the back
> of that code...

Ok, but it's not me doing crazy things if that's what you are implying
(not saying that you are), because the sama5_defconfig has

CONFIG_UACCESS_WITH_MEMCPY=y

So, something needs to happen or sama5 remains default-broken.

Cheers,
Peter

2015-12-03 13:37:40

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: Domain faults when CONFIG_CPU_SW_DOMAIN_PAN is enabled

On Thu, Dec 03, 2015 at 12:08:11PM +0000, Peter Rosin wrote:
> Russell King wrote:
> > On Thu, Dec 03, 2015 at 11:38:20AM +0000, Peter Rosin wrote:
> > > Russell King wrote:
> > > > On Thu, Dec 03, 2015 at 08:33:13AM +0000, Peter Rosin wrote:
> > > > > I wrote:
> > > > > > If I enable CONFIG_CPU_SW_DOMAIN_PAN, I sometimes (but not always) get the
> > > > > > following (or very similar) on boot.
> > > > >
> > > > > I should have said "if I don't disable", as the option is "default y".
> > > > >
> > > > > Also, if it survives on boot, below is an example of later trouble (after
> > > > > 30+ minutes on this occasion).
> > > >
> > > > Please apply this patch so we (might) get a slightly better oops dump,
> > > > and then try to reproduce.
> > >
> > > Sure thing, but it's still "DAC: 00000051"...
> >
> > Thanks, that confirms that something in the uaccess-with-memcpy code
> > is clearing the DACR back to 0x51.
> >
> > This has come up several times before, and I really can't spot the
> > problem in this code, so I've always said to disable the
> > uaccess-with-memcpy code. Personally, I'd like to see the back
> > of that code...
>
> Ok, but it's not me doing crazy things if that's what you are implying
> (not saying that you are), because the sama5_defconfig has
>
> CONFIG_UACCESS_WITH_MEMCPY=y
>
> So, something needs to happen or sama5 remains default-broken.

I have no solution for this, other than saying that uaccess-with-memcpy
seems to be (for some unknown reason) incompatible with SW PAN.

Out of the two features, I'd go for SW PAN over uaccess-with-memcpy,
but others will have a different opinion. The real solution is to
track down what's going on and why, but I don't think anyone has the
motivation to do that.

I've looked into this problem, and I've been unable to identify what's
going on here. I can see no reason for the DACR to be set to 0x51 in
this code.

The entry path into this code is via __copy_to_user(), which saves and
sets the DACR to 0x55 before calling arm_copy_to_user(), which for
uaccess-with-memcpy() is in arch/arm/lib/uaccess_with_memcpy.c. That
tail-calls into __copy_to_user_memcpy(), which should run with the
DACR set to 0x55.

The only place that the DACR is changed is inside __put_user(), which
saves the DACR before also setting it to 0x55, and then restores the
old DACR value, which, because the old value was 0x55, will have no
effect.

So, I can see no way that the DACR should ever be 0x51 inside
__copy_to_user_memcpy(), but you are seeing such a scenario. I've no
idea how you could ever get a value of 0x51 out of the DACR within this
code.

--
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2015-12-03 16:13:45

by Peter Rosin

[permalink] [raw]
Subject: RE: Domain faults when CONFIG_CPU_SW_DOMAIN_PAN is enabled

Russell King wrote:
> On Thu, Dec 03, 2015 at 12:08:11PM +0000, Peter Rosin wrote:
> > Russell King wrote:
> > > On Thu, Dec 03, 2015 at 11:38:20AM +0000, Peter Rosin wrote:
> > > > Russell King wrote:
> > > > > On Thu, Dec 03, 2015 at 08:33:13AM +0000, Peter Rosin wrote:
> > > > > > I wrote:
> > > > > > > If I enable CONFIG_CPU_SW_DOMAIN_PAN, I sometimes (but not always) get the
> > > > > > > following (or very similar) on boot.
> > > > > >
> > > > > > I should have said "if I don't disable", as the option is "default y".
> > > > > >
> > > > > > Also, if it survives on boot, below is an example of later trouble (after
> > > > > > 30+ minutes on this occasion).
> > > > >
> > > > > Please apply this patch so we (might) get a slightly better oops dump,
> > > > > and then try to reproduce.
> > > >
> > > > Sure thing, but it's still "DAC: 00000051"...
> > >
> > > Thanks, that confirms that something in the uaccess-with-memcpy code
> > > is clearing the DACR back to 0x51.
> > >
> > > This has come up several times before, and I really can't spot the
> > > problem in this code, so I've always said to disable the
> > > uaccess-with-memcpy code. Personally, I'd like to see the back
> > > of that code...
> >
> > Ok, but it's not me doing crazy things if that's what you are implying
> > (not saying that you are), because the sama5_defconfig has
> >
> > CONFIG_UACCESS_WITH_MEMCPY=y
> >
> > So, something needs to happen or sama5 remains default-broken.
>
> I have no solution for this, other than saying that uaccess-with-memcpy
> seems to be (for some unknown reason) incompatible with SW PAN.
>
> Out of the two features, I'd go for SW PAN over uaccess-with-memcpy,
> but others will have a different opinion. The real solution is to
> track down what's going on and why, but I don't think anyone has the
> motivation to do that.
>
> I've looked into this problem, and I've been unable to identify what's
> going on here. I can see no reason for the DACR to be set to 0x51 in
> this code.
>
> The entry path into this code is via __copy_to_user(), which saves and
> sets the DACR to 0x55 before calling arm_copy_to_user(), which for
> uaccess-with-memcpy() is in arch/arm/lib/uaccess_with_memcpy.c. That
> tail-calls into __copy_to_user_memcpy(), which should run with the
> DACR set to 0x55.
>
> The only place that the DACR is changed is inside __put_user(), which
> saves the DACR before also setting it to 0x55, and then restores the
> old DACR value, which, because the old value was 0x55, will have no
> effect.
>
> So, I can see no way that the DACR should ever be 0x51 inside
> __copy_to_user_memcpy(), but you are seeing such a scenario. I've no
> idea how you could ever get a value of 0x51 out of the DACR within this
> code.

Since it seems like a race is at the bottom of the observed problems, I'm
going to look for things that look racy. The things that stand out to me
are:

* uaccess.h:modify_domain() does a read-modify-write on DACR using
get_domain and set_domain, and I don't see any locking. Is that
safe? Why?

* uaccess_with_memcpy.c:__copy_to_user() has a mode in which it copies
"non-atomically" (if faulthandler_disabled() returns 0). If a fault
happens during __copy_to_user, what prevents some other thread from
clobbering DACR?

* In uaccess.h:uaccess_save_and_enable(), what prevents a context
switch between the get_domain and set_domain calls?

Just asking questions, I have no prior experience with this code...

Cheers,
Peter

2015-12-03 16:41:29

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: Domain faults when CONFIG_CPU_SW_DOMAIN_PAN is enabled

On Thu, Dec 03, 2015 at 04:12:06PM +0000, Peter Rosin wrote:
> Since it seems like a race is at the bottom of the observed problems, I'm
> going to look for things that look racy. The things that stand out to me
> are:
>
> * uaccess.h:modify_domain() does a read-modify-write on DACR using
> get_domain and set_domain, and I don't see any locking. Is that
> safe? Why?

It's safe:
* the DACR is per-CPU
* all exceptions preserve the original DACR value when they return.
This is done by storing the DACR value at entry onto the stack, along
with the register set, and restoring it along with the register set
on exit from exception processing, as if "nothing ever happened".
This includes if the exception processing caused a switch to another
thread.

> * uaccess_with_memcpy.c:__copy_to_user() has a mode in which it copies
> "non-atomically" (if faulthandler_disabled() returns 0). If a fault
> happens during __copy_to_user, what prevents some other thread from
> clobbering DACR?

See the second point above. Moreover, if we sleep in down_read(),
then __switch_to() reads the current DACR value and saves it in the
thread information, and will restore that value when resuming the
thread - even if the thread has been migrated to a different CPU.

> * In uaccess.h:uaccess_save_and_enable(), what prevents a context
> switch between the get_domain and set_domain calls?

Nothing, but it doesn't matter, because the DACR register is saved
and restored to preserve its value across all exceptions and thread
switches.

I suspect the only way to nail this down is to litter the uaccess
code (virtually every alternate line) with:

BUG_ON(get_domain() & domain_mask(DOMAIN_USER) ==
domain_val(DOMAIN_USER, DOMAIN_NOACCESS));

to narrow down the exact point where the domain register seemingly
gets reset. Maybe it'll provide some hint.

--
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2015-12-03 17:27:18

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: Domain faults when CONFIG_CPU_SW_DOMAIN_PAN is enabled

On Thu, Dec 03, 2015 at 04:41:18PM +0000, Russell King - ARM Linux wrote:
> On Thu, Dec 03, 2015 at 04:12:06PM +0000, Peter Rosin wrote:
> > * uaccess_with_memcpy.c:__copy_to_user() has a mode in which it copies
> > "non-atomically" (if faulthandler_disabled() returns 0). If a fault
> > happens during __copy_to_user, what prevents some other thread from
> > clobbering DACR?
>
> See the second point above. Moreover, if we sleep in down_read(),
> then __switch_to() reads the current DACR value and saves it in the
> thread information, and will restore that value when resuming the
> thread - even if the thread has been migrated to a different CPU.

I thought this was correct, but it isn't - that's what my original solution
did, but I think when Will reviewed it, we decided it wasn't necessary -
and it isn't necessary for every single case with the exception of this
one. This is exactly what's going wrong: the down_read() in these paths
calls into the scheduler, which switches away. When we come back, the
DACR value is reset by the other thread to 0x51.

There's a few ways to solve this:

1. Make the thread switching code save and restore the DACR register as
it would do for domains. This imposes an overhead on every single
context switch whether or not we happen to be in this _single_
troublesome code. (Patch attached - as there's several, I'm attaching
them.)

2. Add additional code to the uaccess-with-memcpy stuff to reset the
DACR value prior to using memcpy() or memset(). (Patch attached.)

3. Make uaccess-with-memcpy depend on !CPU_SW_DOMAINS_PAN (suggested by
Will)

4. Delete the uaccess-with-memcpy code (also suggested by Will.)

I think the best thing I can do is say... "Discuss amongst yourselves" :)

--
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


Attachments:
(No filename) (1.83 kB)
umemcpy1.diff (1.56 kB)
umemcpy2.diff (1.74 kB)
Download all attachments

2015-12-03 18:28:19

by Nicolas Pitre

[permalink] [raw]
Subject: Re: Domain faults when CONFIG_CPU_SW_DOMAIN_PAN is enabled

On Thu, 3 Dec 2015, Russell King - ARM Linux wrote:

> On Thu, Dec 03, 2015 at 04:41:18PM +0000, Russell King - ARM Linux wrote:
> > On Thu, Dec 03, 2015 at 04:12:06PM +0000, Peter Rosin wrote:
> > > * uaccess_with_memcpy.c:__copy_to_user() has a mode in which it copies
> > > "non-atomically" (if faulthandler_disabled() returns 0). If a fault
> > > happens during __copy_to_user, what prevents some other thread from
> > > clobbering DACR?
> >
> > See the second point above. Moreover, if we sleep in down_read(),
> > then __switch_to() reads the current DACR value and saves it in the
> > thread information, and will restore that value when resuming the
> > thread - even if the thread has been migrated to a different CPU.
>
> I thought this was correct, but it isn't - that's what my original solution
> did, but I think when Will reviewed it, we decided it wasn't necessary -
> and it isn't necessary for every single case with the exception of this
> one. This is exactly what's going wrong: the down_read() in these paths
> calls into the scheduler, which switches away. When we come back, the
> DACR value is reset by the other thread to 0x51.
>
> There's a few ways to solve this:
>
> 1. Make the thread switching code save and restore the DACR register as
> it would do for domains. This imposes an overhead on every single
> context switch whether or not we happen to be in this _single_
> troublesome code. (Patch attached - as there's several, I'm attaching
> them.)
>
> 2. Add additional code to the uaccess-with-memcpy stuff to reset the
> DACR value prior to using memcpy() or memset(). (Patch attached.)
>
> 3. Make uaccess-with-memcpy depend on !CPU_SW_DOMAINS_PAN (suggested by
> Will)
>
> 4. Delete the uaccess-with-memcpy code (also suggested by Will.)
>
> I think the best thing I can do is say... "Discuss amongst yourselves" :)

Personally, I'd advocate for #2 or #4. Prior commit 0f64b247e6 I was
already leaning towards #4.

So if some people are still relying on uaccess-with-memcpy and #2 fixes
it then it's all good. I'd suggest surrounding the DACR accesses with
#ifdef CONFIG_CPU_SW_DOMAIN_PAN in the final patch.


Nicolas

2015-12-03 21:39:29

by Peter Rosin

[permalink] [raw]
Subject: RE: Domain faults when CONFIG_CPU_SW_DOMAIN_PAN is enabled

Russell King wrote:
> On Thu, Dec 03, 2015 at 04:41:18PM +0000, Russell King - ARM Linux wrote:
> > On Thu, Dec 03, 2015 at 04:12:06PM +0000, Peter Rosin wrote:
> > > * uaccess_with_memcpy.c:__copy_to_user() has a mode in which it copies
> > > "non-atomically" (if faulthandler_disabled() returns 0). If a fault
> > > happens during __copy_to_user, what prevents some other thread from
> > > clobbering DACR?
> >
> > See the second point above. Moreover, if we sleep in down_read(),
> > then __switch_to() reads the current DACR value and saves it in the
> > thread information, and will restore that value when resuming the
> > thread - even if the thread has been migrated to a different CPU.
>
> I thought this was correct, but it isn't - that's what my original solution
> did, but I think when Will reviewed it, we decided it wasn't necessary -
> and it isn't necessary for every single case with the exception of this
> one. This is exactly what's going wrong: the down_read() in these paths
> calls into the scheduler, which switches away. When we come back, the
> DACR value is reset by the other thread to 0x51.
>
> There's a few ways to solve this:
>
> 1. Make the thread switching code save and restore the DACR register as
> it would do for domains. This imposes an overhead on every single
> context switch whether or not we happen to be in this _single_
> troublesome code. (Patch attached - as there's several, I'm attaching
> them.)
>
> 2. Add additional code to the uaccess-with-memcpy stuff to reset the
> DACR value prior to using memcpy() or memset(). (Patch attached.)

I took both patches for a quick spin (a dozen boots and one hour uptime
after that for each patch) and no incidents. I have not gathered data,
but the crash on boot feels like it's quite a bit above 50% when there
is a problem so this feels good (I used 5 clean reboots when I bisected
and that worked).

Reported-by: Peter Rosin <[email protected]>
Tested-by: Peter Rosin <[email protected]>

(and please don't forget to cc stable)

> 3. Make uaccess-with-memcpy depend on !CPU_SW_DOMAINS_PAN (suggested by
> Will)
>
> 4. Delete the uaccess-with-memcpy code (also suggested by Will.)
>
> I think the best thing I can do is say... "Discuss amongst yourselves" :)

I have no personal preference on what should be done, I only had
copy_to_user_memcpy enabled since that was what Atmel fed me.

Cheers,
Peter

2015-12-05 13:41:18

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: Domain faults when CONFIG_CPU_SW_DOMAIN_PAN is enabled

On Thu, Dec 03, 2015 at 01:28:12PM -0500, Nicolas Pitre wrote:
> On Thu, 3 Dec 2015, Russell King - ARM Linux wrote:
>
> > On Thu, Dec 03, 2015 at 04:41:18PM +0000, Russell King - ARM Linux wrote:
> > > On Thu, Dec 03, 2015 at 04:12:06PM +0000, Peter Rosin wrote:
> > > > * uaccess_with_memcpy.c:__copy_to_user() has a mode in which it copies
> > > > "non-atomically" (if faulthandler_disabled() returns 0). If a fault
> > > > happens during __copy_to_user, what prevents some other thread from
> > > > clobbering DACR?
> > >
> > > See the second point above. Moreover, if we sleep in down_read(),
> > > then __switch_to() reads the current DACR value and saves it in the
> > > thread information, and will restore that value when resuming the
> > > thread - even if the thread has been migrated to a different CPU.
> >
> > I thought this was correct, but it isn't - that's what my original solution
> > did, but I think when Will reviewed it, we decided it wasn't necessary -
> > and it isn't necessary for every single case with the exception of this
> > one. This is exactly what's going wrong: the down_read() in these paths
> > calls into the scheduler, which switches away. When we come back, the
> > DACR value is reset by the other thread to 0x51.
> >
> > There's a few ways to solve this:
> >
> > 1. Make the thread switching code save and restore the DACR register as
> > it would do for domains. This imposes an overhead on every single
> > context switch whether or not we happen to be in this _single_
> > troublesome code. (Patch attached - as there's several, I'm attaching
> > them.)
> >
> > 2. Add additional code to the uaccess-with-memcpy stuff to reset the
> > DACR value prior to using memcpy() or memset(). (Patch attached.)
> >
> > 3. Make uaccess-with-memcpy depend on !CPU_SW_DOMAINS_PAN (suggested by
> > Will)
> >
> > 4. Delete the uaccess-with-memcpy code (also suggested by Will.)
> >
> > I think the best thing I can do is say... "Discuss amongst yourselves" :)
>
> Personally, I'd advocate for #2 or #4. Prior commit 0f64b247e6 I was
> already leaning towards #4.
>
> So if some people are still relying on uaccess-with-memcpy and #2 fixes
> it then it's all good. I'd suggest surrounding the DACR accesses with
> #ifdef CONFIG_CPU_SW_DOMAIN_PAN in the final patch.

Last time this was discussed, people were unhappy about removing it
as they were seeing performance advantages with it enabled. Of course,
that was with it being buggy.

I think at this point, short of deleting it, I'd opt for (2) so the
overhead is attributed to the appropriate place, and not spread across
the entire system. That should then be the prelude for another round
of performance evaluation to see whether it is still advantageous to
have the uaccess-with-memcpy code before making a final decision whether
to revert (2) and apply (3) instead, or to go with (4).

--
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2015-12-10 00:22:25

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: Domain faults when CONFIG_CPU_SW_DOMAIN_PAN is enabled

On Thu, Dec 03, 2015 at 09:37:51PM +0000, Peter Rosin wrote:
> I took both patches for a quick spin (a dozen boots and one hour uptime
> after that for each patch) and no incidents. I have not gathered data,
> but the crash on boot feels like it's quite a bit above 50% when there
> is a problem so this feels good (I used 5 clean reboots when I bisected
> and that worked).
>
> Reported-by: Peter Rosin <[email protected]>
> Tested-by: Peter Rosin <[email protected]>
>
> (and please don't forget to cc stable)

I've decided to do a more in-depth fix, so that we also solve the issue
that when we schedule in these down_read()s, we don't leak the permissive
domain register setting into the switched-to context.

Can you test this patch please? Thanks.

8<====
Subject: [PATCH] ARM: fix uaccess_with_memcpy() with SW_DOMAIN_PAN

The uaccess_with_memcpy() code is currently incompatible with the SW
PAN code: it takes locks within the region that we've changed the DACR,
potentially sleeping as a result. As we do not save and restore the
DACR across co-operative sleep events, can lead to an incorrect DACR
value later in this code path.

Signed-off-by: Russell King <[email protected]>
---
arch/arm/include/asm/uaccess.h | 4 ++++
arch/arm/lib/uaccess_with_memcpy.c | 29 +++++++++++++++++++++++------
2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/arch/arm/include/asm/uaccess.h b/arch/arm/include/asm/uaccess.h
index 8cc85a4ebec2..35c9db857ebe 100644
--- a/arch/arm/include/asm/uaccess.h
+++ b/arch/arm/include/asm/uaccess.h
@@ -510,10 +510,14 @@ __copy_to_user_std(void __user *to, const void *from, unsigned long n);
static inline unsigned long __must_check
__copy_to_user(void __user *to, const void *from, unsigned long n)
{
+#ifndef CONFIG_UACCESS_WITH_MEMCPY
unsigned int __ua_flags = uaccess_save_and_enable();
n = arm_copy_to_user(to, from, n);
uaccess_restore(__ua_flags);
return n;
+#else
+ return arm_copy_to_user(to, from, n);
+#endif
}

extern unsigned long __must_check
diff --git a/arch/arm/lib/uaccess_with_memcpy.c b/arch/arm/lib/uaccess_with_memcpy.c
index d72b90905132..588bbc288396 100644
--- a/arch/arm/lib/uaccess_with_memcpy.c
+++ b/arch/arm/lib/uaccess_with_memcpy.c
@@ -88,6 +88,7 @@ pin_page_for_write(const void __user *_addr, pte_t **ptep, spinlock_t **ptlp)
static unsigned long noinline
__copy_to_user_memcpy(void __user *to, const void *from, unsigned long n)
{
+ unsigned long ua_flags;
int atomic;

if (unlikely(segment_eq(get_fs(), KERNEL_DS))) {
@@ -118,7 +119,9 @@ __copy_to_user_memcpy(void __user *to, const void *from, unsigned long n)
if (tocopy > n)
tocopy = n;

+ ua_flags = uaccess_save_and_enable();
memcpy((void *)to, from, tocopy);
+ uaccess_restore(ua_flags);
to += tocopy;
from += tocopy;
n -= tocopy;
@@ -145,14 +148,21 @@ arm_copy_to_user(void __user *to, const void *from, unsigned long n)
* With frame pointer disabled, tail call optimization kicks in
* as well making this test almost invisible.
*/
- if (n < 64)
- return __copy_to_user_std(to, from, n);
- return __copy_to_user_memcpy(to, from, n);
+ if (n < 64) {
+ unsigned long ua_flags = uaccess_save_and_enable();
+ n = __copy_to_user_std(to, from, n);
+ uaccess_restore(ua_flags);
+ } else {
+ n = __copy_to_user_memcpy(to, from, n);
+ }
+ return n;
}

static unsigned long noinline
__clear_user_memset(void __user *addr, unsigned long n)
{
+ unsigned long ua_flags;
+
if (unlikely(segment_eq(get_fs(), KERNEL_DS))) {
memset((void *)addr, 0, n);
return 0;
@@ -175,7 +185,9 @@ __clear_user_memset(void __user *addr, unsigned long n)
if (tocopy > n)
tocopy = n;

+ ua_flags = uaccess_save_and_enable();
memset((void *)addr, 0, tocopy);
+ uaccess_restore(ua_flags);
addr += tocopy;
n -= tocopy;

@@ -193,9 +205,14 @@ __clear_user_memset(void __user *addr, unsigned long n)
unsigned long arm_clear_user(void __user *addr, unsigned long n)
{
/* See rational for this in __copy_to_user() above. */
- if (n < 64)
- return __clear_user_std(addr, n);
- return __clear_user_memset(addr, n);
+ if (n < 64) {
+ unsigned long ua_flags = uaccess_save_and_enable();
+ n = __clear_user_std(addr, n);
+ uaccess_restore(ua_flags);
+ } else {
+ n = __clear_user_memset(addr, n);
+ }
+ return n;
}

#if 0
--
2.1.0

--
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2015-12-10 15:31:34

by Peter Rosin

[permalink] [raw]
Subject: RE: Domain faults when CONFIG_CPU_SW_DOMAIN_PAN is enabled

Russell King wrote:
> On Thu, Dec 03, 2015 at 09:37:51PM +0000, Peter Rosin wrote:
> > I took both patches for a quick spin (a dozen boots and one hour uptime
> > after that for each patch) and no incidents. I have not gathered data,
> > but the crash on boot feels like it's quite a bit above 50% when there
> > is a problem so this feels good (I used 5 clean reboots when I bisected
> > and that worked).
> >
> > Reported-by: Peter Rosin <[email protected]>
> > Tested-by: Peter Rosin <[email protected]>
> >
> > (and please don't forget to cc stable)
>
> I've decided to do a more in-depth fix, so that we also solve the issue
> that when we schedule in these down_read()s, we don't leak the permissive
> domain register setting into the switched-to context.
>
> Can you test this patch please? Thanks.

Still looking good.

Cheers,
Peter

2015-12-10 16:21:05

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: Domain faults when CONFIG_CPU_SW_DOMAIN_PAN is enabled

On Thu, Dec 10, 2015 at 03:29:37PM +0000, Peter Rosin wrote:
> Russell King wrote:
> > On Thu, Dec 03, 2015 at 09:37:51PM +0000, Peter Rosin wrote:
> > > I took both patches for a quick spin (a dozen boots and one hour uptime
> > > after that for each patch) and no incidents. I have not gathered data,
> > > but the crash on boot feels like it's quite a bit above 50% when there
> > > is a problem so this feels good (I used 5 clean reboots when I bisected
> > > and that worked).
> > >
> > > Reported-by: Peter Rosin <[email protected]>
> > > Tested-by: Peter Rosin <[email protected]>
> > >
> > > (and please don't forget to cc stable)
> >
> > I've decided to do a more in-depth fix, so that we also solve the issue
> > that when we schedule in these down_read()s, we don't leak the permissive
> > domain register setting into the switched-to context.
> >
> > Can you test this patch please? Thanks.
>
> Still looking good.

Does that mean I can add your reported and tested-by to this latest
patch?

Thanks.

--
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2015-12-10 18:34:27

by Peter Rosin

[permalink] [raw]
Subject: RE: Domain faults when CONFIG_CPU_SW_DOMAIN_PAN is enabled

Russell King wrote:
> On Thu, Dec 10, 2015 at 03:29:37PM +0000, Peter Rosin wrote:
> > Russell King wrote:
> > > On Thu, Dec 03, 2015 at 09:37:51PM +0000, Peter Rosin wrote:
> > > > I took both patches for a quick spin (a dozen boots and one hour
> > > > uptime after that for each patch) and no incidents. I have not
> > > > gathered data, but the crash on boot feels like it's quite a bit
> > > > above 50% when there is a problem so this feels good (I used 5
> > > > clean reboots when I bisected and that worked).
> > > >
> > > > Reported-by: Peter Rosin <[email protected]>
> > > > Tested-by: Peter Rosin <[email protected]>
> > > >
> > > > (and please don't forget to cc stable)
> > >
> > > I've decided to do a more in-depth fix, so that we also solve the
> > > issue that when we schedule in these down_read()s, we don't leak the
> > > permissive domain register setting into the switched-to context.
> > >
> > > Can you test this patch please? Thanks.
> >
> > Still looking good.
>
> Does that mean I can add your reported and tested-by to this latest patch?

Right, I thought that was obvious, sorry for the confusion.

Cheers,
Peter

2015-12-30 16:53:30

by Peter Rosin

[permalink] [raw]
Subject: RE: Domain faults when CONFIG_CPU_SW_DOMAIN_PAN is enabled

[I repeat myself just in case my last message disappeared. It would be
a shame if 4.4 was also regressed because of a missing response.]

I wrote:
> Russell King wrote:
> > On Thu, Dec 10, 2015 at 03:29:37PM +0000, Peter Rosin wrote:
> > > Russell King wrote:
> > > > On Thu, Dec 03, 2015 at 09:37:51PM +0000, Peter Rosin wrote:
> > > > > I took both patches for a quick spin (a dozen boots and one hour
> > > > > uptime after that for each patch) and no incidents. I have not
> > > > > gathered data, but the crash on boot feels like it's quite a bit
> > > > > above 50% when there is a problem so this feels good (I used 5
> > > > > clean reboots when I bisected and that worked).
> > > > >
> > > > > Reported-by: Peter Rosin <[email protected]>
> > > > > Tested-by: Peter Rosin <[email protected]>
> > > > >
> > > > > (and please don't forget to cc stable)
> > > >
> > > > I've decided to do a more in-depth fix, so that we also solve the
> > > > issue that when we schedule in these down_read()s, we don't leak
> > > > the permissive domain register setting into the switched-to context.
> > > >
> > > > Can you test this patch please? Thanks.
> > >
> > > Still looking good.
> >
> > Does that mean I can add your reported and tested-by to this latest patch?
>
> Right, I thought that was obvious, sorry for the confusion.

Reported-by: Peter Rosin <[email protected]>
Tested-by: Peter Rosin <[email protected]>

(and please don't forget to cc stable)

Cheers,
Peter

2015-12-30 16:59:09

by Peter Rosin

[permalink] [raw]
Subject: RE: Domain faults when CONFIG_CPU_SW_DOMAIN_PAN is enabled

I wrote:
> [I repeat myself just in case my last message disappeared. It would be
> a shame if 4.4 was also regressed because of a missing response.]

Crap, I should have checked one more time before hitting send. The patch
is already in there! Sorry for the confusion.

Cheers,
Peter