2015-06-01 17:58:20

by Tony Lindgren

Subject: Re: runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)

* Matthijs van Duin <[email protected]> [150530 08:24]:
> On 29 May 2015 at 17:50, Tony Lindgren <[email protected]> wrote:
> > I believe some TI kernels use strongly-ordered mappings, mainline
> > kernel does not. Which kernel version are you using?
>
> Normally I periodically rebuild based on Robert C Nelson's -bone
> kernel (but with heavily customized config). I also tried a plain
> 4.1.0-rc5-bone3, the generic 4.1.0-rc5-armv7-x0 (the most
> vanilla-looking kernel I could find in my debian package list), and
> for the heck of it also the classic 3.14.43-ti-r66.
>
> In all cases I observed a synchronous bus error (dubiously reported as
> "external abort on non-linefetch (0x1818)") on an AM335x with this
> trivial test:
>
> #include <fcntl.h>
> #include <sys/mman.h>
>
> int main() {
>     int fd = open( "/dev/mem", O_RDWR | O_DSYNC );
>     if( fd < 0 ) return 1;
>     void *ptr = mmap( NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0x42000000 );
>     if( ptr == MAP_FAILED ) return 1;
>     *(volatile int *)ptr = 0;  /* this write triggers the abort */
>     return 0;
> }
>
> I even considered for a moment that maybe the AM335x has some "all
> writes non-posted" thing enabled (which I think is available as a
> switch on OMAP 4/5?). It seemed unlikely, but since most of my
> exploration of interconnect behaviour was done on a DM814x, I
> double-checked by performing the same write in a baremetal test
> program (with that region configured Device-type in the MMU). As
> expected, no data abort occurred, so writes most certainly are posted.
>
> So I have trouble coming up with any explanation for this other than
> the use of strongly-ordered mappings.
>
> (Curiously BTW, omitting O_DSYNC made no difference.)

I think these kernels are missing the configuration for l3-noc
driver?

I tried it on omap4 that has l3-noc configured, and it first produces
"Unhandled fault: external abort on non-linefetch (0x1818) at 0xb6fd7000",
and the L3 interrupt only after that. So yeah, you're right, we can't
use the interrupts here. I somehow remembered we'd get only the L3
interrupt if configured.

Regards,

Tony


2015-06-01 20:33:20

by Matthijs van Duin

Subject: Re: runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)

On 1 June 2015 at 19:58, Tony Lindgren <[email protected]> wrote:
> I think these kernels are missing the configuration for l3-noc
> driver?

Yup. Since I'm pretty sure I have all the necessary info I was hoping
to look into that... somewhere in my copious spare time...

> I tried it on omap4 that has l3-noc configured, and it first produces
> "Unhandled fault: external abort on non-linefetch (0x1818) at 0xb6fd7000",

(Though making a patch to fix that annoyingly wrong and useless
message is higher on my list of priorities)

> and the L3 interrupt only after that. So yeah, you're right, we can't
> use the interrupts here. I somehow remembered we'd get only the L3
> interrupt if configured.

The bus error is not influenced by L3 error reporting config afaik,
and it will always win over the irq: even though the irq is almost
certainly asserted first, it can't be taken until the load/store
instruction completes, and then the fault will take precedence.

While implementing L3 error reporting in my forth system I ran into a
tricky scenario though: it turns out that if an irq occurs while the
cpu is waiting for instruction fetch, it does allow the irq to be
taken. The interrupted fetch is abandoned and any bus error it may
have produced is ignored since exception entry/exit is an implicit
instruction sync barrier. On return it is simply refetched...

Hence, the result from attempting to execute code from an invalid address:
fetching from [invalid]
irq entry (LR=[invalid])
L3 error displayed
irq exit
fetching from [invalid]
irq entry (LR=[invalid])
L3 error displayed
irq exit
fetching from [invalid]
...
(repeat until watchdog expires)
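
The loop above can be mimicked with a toy model. This is only an illustration of the reasoning, not a description of real hardware: it assumes that every fetch from the invalid address raises the L3 error irq before the fetch completes, that one watchdog tick passes per round trip, and that with irqs masked the fetch simply completes with a prefetch abort. The function fetch_invalid() and its parameters are invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum outcome { PREFETCH_ABORT, WATCHDOG_RESET };

/*
 * Model a cpu that keeps fetching from an invalid address.
 * irqs_taken receives the number of times the L3 error irq was entered.
 */
enum outcome fetch_invalid(bool irq_unmasked, int watchdog_ticks,
                           int *irqs_taken)
{
    assert(irqs_taken != NULL);
    *irqs_taken = 0;

    for (int tick = 0; tick < watchdog_ticks; tick++) {
        /* fetching from [invalid]: the L3 error irq is now pending */
        if (!irq_unmasked) {
            /* nothing preempts the fetch, so the bus error is taken */
            return PREFETCH_ABORT;
        }
        /* irq entry (LR=[invalid]): the stalled fetch is abandoned */
        (*irqs_taken)++;
        /* L3 error displayed and cleared; irq exit leads to a refetch */
    }
    return WATCHDOG_RESET;
}
```

With irqs unmasked the model never reaches the abort and only stops when the watchdog budget runs out, which matches the trace above.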


Anyhow, so we still have the puzzling fact that apparently neither of
us was expecting device memory to use a strongly-ordered mapping, yet
getting a bus error on a write (outside MPUSS itself) shows that it
does.

I've tried to read arch/arm/mm/mmu.c to find out why, but so far I'm
feeling hopelessly lost there... (the multitude of ARM architecture
versions/flavors supported aren't helping.)

2015-06-01 20:52:31

by Tony Lindgren

Subject: Re: runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)

* Matthijs van Duin <[email protected]> [150601 13:34]:
> On 1 June 2015 at 19:58, Tony Lindgren <[email protected]> wrote:
> > I think these kernels are missing the configuration for l3-noc
> > driver?
>
> Yup. Since I'm pretty sure I have all the necessary info I was hoping
> to look into that... somewhere in my copious spare time...
>
> > I tried it on omap4 that has l3-noc configured, and it first produces
> > "Unhandled fault: external abort on non-linefetch (0x1818) at 0xb6fd7000",
>
> (Though making a patch to fix that annoyingly wrong and useless
> message is higher on my list of priorities)
>
> > and the L3 interrupt only after that. So yeah, you're right, we can't
> > use the interrupts here. I somehow remembered we'd get only the L3
> > interrupt if configured.
>
> The bus error is not influenced by L3 error reporting config afaik,
> and it will always win over the irq: even though the irq is almost
> certainly asserted first, it can't be taken until the load/store
> instruction completes, and then the fault will take precedence.
>
> While implementing L3 error reporting in my forth system I ran into a
> tricky scenario though: it turns out that if an irq occurs while the
> cpu is waiting for instruction fetch, it does allow the irq to be
> taken. The interrupted fetch is abandoned and any bus error it may
> have produced is ignored since exception entry/exit is an implicit
> instruction sync barrier. On return it is simply refetched...
>
> Hence, the result from attempting to execute code from an invalid address:
> fetching from [invalid]
> irq entry (LR=[invalid])
> L3 error displayed
> irq exit
> fetching from [invalid]
> irq entry (LR=[invalid])
> L3 error displayed
> irq exit
> fetching from [invalid]
> ...
> (repeat until watchdog expires)

OK that must be the case I've seen then. Probably that happens
when a device is not clocked.

> Anyhow, so we still have the puzzling fact that apparently neither of
> us was expecting device memory to use a strongly-ordered mapping, yet
> getting a bus error on a write (outside MPUSS itself) shows that it
> does.

Hmm well it should be just MT_DEVICE for anything Linux ioremaps..
Care to verify that from a device driver that does ioremap on it
first?

> I've tried to read arch/arm/mm/mmu.c to find out why, but so far I'm
> feeling hopelessly lost there... (the multitude of ARM architecture
> versions/flavors supported aren't helping.)

Heh yeah too much hardware churn going on :)

Regards,

Tony

2015-06-02 04:21:51

by Matthijs van Duin

Subject: Re: runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)

On 1 June 2015 at 22:52, Tony Lindgren <[email protected]> wrote:
> OK that must be the case I've seen then. Probably that happens
> when a device is not clocked.

It happens for any interconnect error reported as a result of
instruction fetch, but that is itself not a very common occurrence and
obviously doesn't apply to device memory.

Another case where the L3 error irq may be taken first is if the bus
error is asynchronous. I don't think this combo can be produced on
a dm81xx or am335x, though, mainly because the strictly in-order
Cortex-A8 makes almost every abort synchronous. I'd expect async
aborts to be more common on an A9.

> Hmm well it should be just MT_DEVICE for anything Linux ioremaps..

Yikes, so both /dev/mem and uio are behaving unlike any device driver:
both use remap_pfn_range() after running the vm_page_prot through
pgprot_noncached() to set the memory type to L_PTE_MT_UNCACHED, which
counterintuitively turns out to mean strongly-ordered. o.O Especially
uio is acting inappropriately here imho.
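
For reference, a rough sketch of how user space typically obtains such a mapping through uio. It relies on the documented UIO convention that mapping N of a device is selected with an mmap offset of N pages. The helper name uio_map() and the device path in the usage note are only illustrative, and the caller is assumed to know the region size (normally read from sysfs).

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map region `index` of a UIO device node such as /dev/uio0.
 * Returns NULL on failure. The memory type of the resulting mapping is
 * chosen by the kernel, not by anything passed in here.
 */
volatile void *uio_map(const char *dev_path, unsigned index, size_t size)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p;
    int fd;

    assert(dev_path != NULL);
    if (page <= 0)
        return NULL;

    fd = open(dev_path, O_RDWR);
    if (fd < 0)
        return NULL;

    /* UIO selects the region by page-sized offset: N * page size */
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
             (off_t)index * page);
    close(fd);  /* the mapping outlives the descriptor */
    return p == MAP_FAILED ? NULL : p;
}
```

A call such as uio_map("/dev/uio0", 0, 4096) would then hand back registers that, per the observation above, end up strongly-ordered rather than Device-type.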

But this is problematic... these ranges are already mapped by the
kernel, and ARM explicitly forbids mapping the same physical range
twice with different memory attributes (and it's not the only
architecture to do so). Hmmz...

Anyhow, drifting a bit off-topic here. I'm going to do some more reading
and thinking about this.