LinuxLists.cc - Oops: 17 SMP ARM (v3.16-rc2)

2014-06-25 14:17:04

Subject: Oops: 17 SMP ARM (v3.16-rc2)

Hello kernel people,

I have a similar issue with v3.16-rc2 as previously reported by Waldemar Brodkorb for v3.15-rc4.
https://lkml.org/lkml/2014/5/9/330

We are running a benchmark application, sometimes using perf, with heavy traffic over NFS.
The error is sporadic and it seems to occur more frequently when using perf.

Linux imx6-test0 3.16.0-rc2+ #1 SMP Wed Jun 25 15:04:16 CEST 2014 armv7l armv7l armv7l GNU/Linux

Any help is greatly appreciated.

Best regards,
Mattis Lorentzon

Unable to handle kernel paging request at virtual address ffffffff
pgd = 9e338000
[ffffffff] *pgd=2fffd821, *pte=00000000, *ppte=00000000
Internal error: Oops: 17 [#1] SMP ARM
Modules linked in:
CPU: 0 PID: 146 Comm: stereo Not tainted 3.16.0-rc2+ #1
task: 9e07a700 ti: 81c42000 task.ti: 81c42000
PC is at find_get_entry+0x60/0xfc
LR is at radix_tree_lookup_slot+0x1c/0x2c
pc : [<800a34d8>] lr : [<80290448>] psr: a0000013
sp : 81c43d98 ip : 00000000 fp : 81c43dcc
r10: 00000001 r9 : 9e30e3c0 r8 : 000002a7
r7 : 9f3758a0 r6 : 00000000 r5 : 00000001 r4 : 00000000
r3 : 81c43d84 r2 : 00000000 r1 : 000002a7 r0 : ffffffff
Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
Control: 10c5387d Table: 2e33804a DAC: 00000015
Process stereo (pid: 146, stack limit = 0x81c42240)
Stack: (0x81c43d98 to 0x81c44000)
3d80: 00000000 00000000
3da0: 800a3478 000a6000 81c43ecc 00000000 9f37589c 00000000 806cb02a 000002a7
3dc0: 81c43e04 81c43dd0 800a406c 800a3484 80061ca0 9fc2dfe0 00000013 00000059
3de0: 9f37589c 9f375770 00000300 000002a7 9e30e3c0 000002a7 81c43e94 81c43e08
3e00: 800a50c4 800a4040 00000000 00000000 801d1818 00000000 00001000 00080001
3e20: 000002a6 9f3757f4 00000300 000a7000 00000000 801d1818 9e30e490 9f37567c
3e40: 81c43ee8 81c43ed4 00000000 00000000 804d87e0 80067098 00000004 9f375770
3e60: 81c43e94 81c43e70 801d491c 81c43ee8 9f375770 81c43ed4 9e30e3c0 9e07a700
3e80: 76907000 00000000 81c43ebc 81c43e98 801d1818 800a4dfc 80061ca0 80061b0c
3ea0: 9f375770 00200000 00000000 81c43f78 81c43f44 81c43ec0 800e1348 801d17b8
3ec0: 00100000 81c43ed0 800e1764 76907000 00100000 00000000 000a7000 00059000
3ee0: 81c43ecc 00000001 9e30e3c0 00000000 00000000 00000000 9e07a700 00000000
3f00: 00000000 00000000 00200000 00000000 00100000 00000000 00000000 00000000
3f20: 9e30e3c0 9e30e3c0 76907000 81c43f78 9e30e3c0 00100000 81c43f74 81c43f48
3f40: 800e1adc 800e12b8 00000000 0027cce0 00200000 00000000 9e30e3c0 9e30e3c0
3f60: 00100000 76907000 81c43fa4 81c43f78 800e2200 800e1a58 00200000 00000000
3f80: 0027cce0 00000000 0007cce0 00000003 8000ebc4 81c42000 00000000 81c43fa8
3fa0: 8000ea00 800e21c8 0027cce0 00000000 00000003 76907000 00100000 00000000
3fc0: 0027cce0 00000000 0007cce0 00000003 0142b5a0 00000000 00000000 00000000
3fe0: 00000000 7ec59d94 76dc26ac 76e1762c 60000010 00000003 00000000 00000000
Backtrace:
[<800a3478>] (find_get_entry) from [<800a406c>] (pagecache_get_page+0x38/0x1d8)
r8:000002a7 r7:806cb02a r6:00000000 r5:9f37589c r4:00000000
[<800a4034>] (pagecache_get_page) from [<800a50c4>] (generic_file_read_iter+0x2d4/0x750)
r10:000002a7 r9:9e30e3c0 r8:000002a7 r7:00000300 r6:9f375770 r5:9f37589c
r4:00000059
[<800a4df0>] (generic_file_read_iter) from [<801d1818>] (nfs_file_read+0x6c/0xa8)
r10:00000000 r9:76907000 r8:9e07a700 r7:9e30e3c0 r6:81c43ed4 r5:9f375770
r4:81c43ee8
[<801d17ac>] (nfs_file_read) from [<800e1348>] (new_sync_read+0x9c/0xc4)
r6:81c43f78 r5:00000000 r4:00200000
[<800e12ac>] (new_sync_read) from [<800e1adc>] (vfs_read+0x90/0x150)
r8:00100000 r7:9e30e3c0 r6:81c43f78 r5:76907000 r4:9e30e3c0
[<800e1a4c>] (vfs_read) from [<800e2200>] (SyS_read+0x44/0x98)
r9:76907000 r8:00100000 r7:9e30e3c0 r6:9e30e3c0 r5:00000000 r4:00200000
[<800e21bc>] (SyS_read) from [<8000ea00>] (ret_fast_syscall+0x0/0x48)
r9:81c42000 r8:8000ebc4 r7:00000003 r6:0007cce0 r5:00000000 r4:0027cce0
Code: e1a01008 eb07b3d6 e3500000 0a00001c (e5904000)
---[ end trace bebb56a5d6f464ed ]---

***************************************************************
Consider the environment before printing this message.

To read Autoliv's Information and Confidentiality Notice, follow this link:
http://www.autoliv.com/disclaimer.html
***************************************************************

Attachments:

dmesg.txt (12.56 kB)
config.gz (14.43 kB)

2014-06-26 13:16:10

by Mattis Lorentzon

Subject: RE: Oops: 17 SMP ARM (v3.16-rc2)

Hi again,

The Oops seems to have been introduced somewhere between v3.12 and v3.13:

- The Oops is reproducible within seconds when running Linux 3.16-rc2.
- We have observed the Oops on 8 different hardware units and two different chipsets (Freescale i.MX6 and Xilinx Zynq).
- The Oops has not been seen on Linux 3.12 so it appears to be good.
- The Oops has been seen on Linux 3.13, 3.14, 3.15, 3.16-rc2 so these appear to be bad.

Configs and a couple of Oops reports are attached.

Best regards,
Mattis Lorentzon

Attachments:

oops_v3.13.txt (3.67 kB)
oops_v3.15.txt (9.62 kB)
config-v3.12.gz (12.63 kB)
config-v3.13.gz (12.86 kB)
config-v3.14.gz (12.99 kB)
config-v3.15.gz (14.34 kB)

2014-06-26 14:01:21

by Russell King - ARM Linux

Subject: Re: Oops: 17 SMP ARM (v3.16-rc2)

On Wed, Jun 25, 2014 at 01:55:05PM +0000, Mattis Lorentzon wrote:
> Hello kernel people,

You may wish to also copy [email protected], which is
where ARM kernel people are.

> I have a similar issue with v3.16-rc2 as previously reported by Waldemar Brodkorb for v3.15-rc4.
> https://lkml.org/lkml/2014/5/9/330

This URL returns no useful information. I find that lkml.org is broken
more times than not in recent years. Please use a different archive
site when referring to posts, thanks.

> We are running a benchmark application, sometimes using perf, with heavy
> traffic over NFS.

I have had two iMX6 platforms running root-NFS for about the last six to
nine months with various workloads, and have never seen this oops.
Unfortunately, the description above gives very little information for
what the mechanism to trigger this bug may be. For example, if I wanted
to reproduce it, what would I need to do?

> The error is sporadic and it seems to occur more frequently when using perf.

So it occurs when not using perf?

> Linux imx6-test0 3.16.0-rc2+ #1 SMP Wed Jun 25 15:04:16 CEST 2014 armv7l armv7l armv7l GNU/Linux
>
> Any help is greatly appreciated.
>
> Best regards,
> Mattis Lorentzon
>
> Unable to handle kernel paging request at virtual address ffffffff
> pgd = 9e338000
> [ffffffff] *pgd=2fffd821, *pte=00000000, *ppte=00000000
> Internal error: Oops: 17 [#1] SMP ARM
> Modules linked in:
> CPU: 0 PID: 146 Comm: stereo Not tainted 3.16.0-rc2+ #1
> task: 9e07a700 ti: 81c42000 task.ti: 81c42000
> PC is at find_get_entry+0x60/0xfc
> LR is at radix_tree_lookup_slot+0x1c/0x2c
> pc : [<800a34d8>] lr : [<80290448>] psr: a0000013
> sp : 81c43d98 ip : 00000000 fp : 81c43dcc
> r10: 00000001 r9 : 9e30e3c0 r8 : 000002a7
> r7 : 9f3758a0 r6 : 00000000 r5 : 00000001 r4 : 00000000
> r3 : 81c43d84 r2 : 00000000 r1 : 000002a7 r0 : ffffffff
...
> Code: e1a01008 eb07b3d6 e3500000 0a00001c (e5904000)

Right, so radix_tree_lookup_slot returned 0xffffffff. I've no idea how
that happened, and I'm not about to try reading and trying to understand
that code. However, as that is generic code, I find it unlikely that
the code is buggy. So, I suspect something else must be going on here,
such as a compiler bug or memory corruption.

Your other oops dumps also show various other functions apparently
returning 0xffffffff. I can't believe that there's more than one bug
doing this, so I doubt the problem is in these functions. Something
else must be going on.
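
The reading above, that the faulting load dereferenced the 0xffffffff
left in r0, can be cross-checked against the bracketed word on the
"Code:" line (e5904000). What follows is only a minimal sketch: the
function name decode_ldr_str is made up for illustration, and it handles
just the plain immediate-offset form of the ARM load/store encoding,
ignoring the condition field.

```shell
#!/bin/sh
# Hypothetical helper (not part of any kernel tool): decode an ARM (A32)
# single data transfer word of the simple form  ldr/str Rd, [Rn, #+/-imm12].
decode_ldr_str() {
    word=$(( $1 ))
    class=$(( (word >> 26) & 3 ))     # bits 27:26, 01 = load/store word or byte
    reg_off=$(( (word >> 25) & 1 ))   # bit 25, 0 = immediate offset
    pre=$(( (word >> 24) & 1 ))       # bit 24, 1 = pre-indexed
    up=$(( (word >> 23) & 1 ))        # bit 23, 1 = add offset, 0 = subtract
    byte=$(( (word >> 22) & 1 ))      # bit 22, 1 = byte access
    wback=$(( (word >> 21) & 1 ))     # bit 21, 1 = base register write-back
    load=$(( (word >> 20) & 1 ))      # bit 20, 1 = load, 0 = store
    rn=$(( (word >> 16) & 15 ))       # base register
    rd=$(( (word >> 12) & 15 ))       # transfer register
    imm=$(( word & 4095 ))            # 12-bit immediate offset

    # Only the simplest addressing form is handled in this sketch.
    if [ "$class" -ne 1 ] || [ "$reg_off" -ne 0 ] ||
       [ "$pre" -ne 1 ] || [ "$wback" -ne 0 ]; then
        echo "unsupported encoding"
        return 1
    fi

    if [ "$load" -eq 1 ]; then op=ldr; else op=str; fi
    if [ "$byte" -eq 1 ]; then op="${op}b"; fi
    if [ "$up" -eq 1 ]; then sign=""; else sign="-"; fi

    echo "$op r$rd, [r$rn, #$sign$imm]"
}

decode_ldr_str 0xe5904000
```

The decoded instruction loads through r0, and the register dump shows
r0 : ffffffff, which matches the reported fault address.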

--
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

2014-06-26 14:44:57

by Mattis Lorentzon

Subject: RE: Oops: 17 SMP ARM (v3.16-rc2)

Thank you for your reply,

> On Wed, Jun 25, 2014 at 01:55:05PM +0000, Mattis Lorentzon wrote:
> > I have a similar issue with v3.16-rc2 as previously reported by Waldemar
> Brodkorb for v3.15-rc4.
> > https://lkml.org/lkml/2014/5/9/330
>
> This URL returns no useful information. I find that lkml.org is broken more
> times than not in recent years. Please use a different archive site when
> referring to posts, thanks.

http://lkml.iu.edu/hypermail/linux/kernel/1405.1/01114.html

> I have had two iMX6 platforms running root-NFS for about the last six to nine
> months with various workloads, and have never seen this oops.
> Unfortunately, the description above gives very little information for what
> the mechanism to trigger this bug may be. For example, if I wanted to
> reproduce it, what would I need to do?

We have managed to trigger the Oops by just transferring a large file over NFS:

    cat /mnt/foo > /dev/null

where foo is a file that is approximately 2 GB. There may be some packet loss
on this network; perhaps this differs from your workload?
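
For anyone trying to reproduce this, the single read above can be wrapped
in a loop. This is only a rough sketch: the function name read_repeatedly
and the pass count in the usage line are arbitrary choices for
illustration, and /mnt/foo stands for any large file on the NFS mount.

```shell
#!/bin/sh
# Hypothetical helper: read FILE from start to finish COUNT times,
# discarding the data, and report each completed pass.
read_repeatedly() {
    file=$1
    count=$2
    pass=1
    while [ "$pass" -le "$count" ]; do
        # Same operation as "cat /mnt/foo > /dev/null" in the report.
        if ! cat "$file" > /dev/null; then
            echo "read failed on pass $pass" >&2
            return 1
        fi
        echo "pass $pass of $count completed"
        pass=$((pass + 1))
    done
}

# Example usage (assumed path and count):
#   read_repeatedly /mnt/foo 20
```

An oops would show up on the console or in the kernel log rather than in
the output of this loop. If the file fits in RAM, later passes may be
served partly from the page cache instead of the network.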

> > The error is sporadic and it seems to occur more frequently when using
> perf.
>
> So it occurs when not using perf?

Yes, certainly, see above.

We have done some more investigation; please find it in this mail:

http://lkml.iu.edu/hypermail/linux/kernel/1406.3/02190.html

The Oops seems to have been introduced somewhere between v3.12 and v3.13:

- The Oops is reproducible within seconds when running Linux 3.16-rc2.
- We have observed the Oops on 8 different hardware units and two different chipsets (Freescale i.MX6 and Xilinx Zynq).
- The Oops has not been seen on Linux 3.12 so it appears to be good.
- The Oops has been seen on Linux 3.13, 3.14, 3.15, 3.16-rc2 so these appear to be bad.

Configs and a couple of Oops reports are attached to the linked mail.

Best regards,
Mattis Lorentzon

2014-06-26 15:14:34

by Russell King - ARM Linux

Subject: Re: Oops: 17 SMP ARM (v3.16-rc2)

On Thu, Jun 26, 2014 at 02:44:52PM +0000, Mattis Lorentzon wrote:
> Thank you for your reply,
>
> > On Wed, Jun 25, 2014 at 01:55:05PM +0000, Mattis Lorentzon wrote:
> > > I have a similar issue with v3.16-rc2 as previously reported by Waldemar
> > Brodkorb for v3.15-rc4.
> > > https://lkml.org/lkml/2014/5/9/330
> >
> > This URL returns no useful information. I find that lkml.org is broken more
> > times than not in recent years. Please use a different archive site when
> > referring to posts, thanks.
>
> http://lkml.iu.edu/hypermail/linux/kernel/1405.1/01114.html

I remember that report, but it was never resolved as I think no one has
any idea what is causing these, and no one has any idea where to start
looking.

> We have managed to trigger the Oops by just transferring a large file
> over nfs
> cat /mnt/foo > /dev/null
> where foo is a file that is approximately 2 GB. There may be some
> packet losses on this network, perhaps this differs from your workload?

That's a similar workload to the one which is mentioned in the previous
report. I've just set a similar transfer going, but this will be a 16GB
file.

> We have done some more investigations, please find it in this mail:
>
> http://lkml.iu.edu/hypermail/linux/kernel/1406.3/02190.html

Yes, I saw that before I replied, and my reply was written with that
message in mind. That's what prompted this paragraph in my previous
reply:

"Your other oops dumps also show various other functions apparently
returning 0xffffffff. I can't believe that there's more than one bug
doing this, so I doubt the problem is in these functions. Something
else must be going on."

One of the problems is that there's so much work going on with the
kernel by many different parties, pulling it in various directions,
that no one really has an overview of all the changes, and so no one
has much of a feel what could be the cause of weird bugs like this.

I don't know what to suggest - you could try using git bisect to see
if you can track it down to a particular commit, but it sounds like
that's going to be very time consuming. You mentioned that 3.12
doesn't show the bug, but 3.13 does - so start off telling git bisect
that 3.12 is "good" and 3.13 is "bad".
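
As a rough guide to the effort involved, bisection roughly halves the
candidate range at every step, so the number of kernels to build and test
grows with the logarithm of the commit count. The sketch below assumes a
git checkout of the kernel with the v3.12 and v3.13 tags; the function
names bisect_steps and start_bisect are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the smallest k such that 2^k >= N, which
# approximates the number of bisect steps for N candidate commits.
bisect_steps() {
    n=$1
    steps=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

# Hypothetical helper: mark the endpoints named in this thread inside the
# kernel tree given as $1. Runs in a subshell so the caller's working
# directory is untouched. Not invoked automatically.
start_bisect() (
    cd "$1" || exit 1
    git bisect start &&
    git bisect bad v3.13 &&
    git bisect good v3.12
)

# Example usage (assumed tree location):
#   bisect_steps "$(git -C ~/linux rev-list --count v3.12..v3.13)"
#   start_bisect ~/linux
```

After each build and test, the result is recorded with "git bisect good"
or "git bisect bad". Because the oops is intermittent, a kernel should
probably survive several test runs before it is marked good.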

Hopefully there won't be too many breakages during the 3.13 merge
window (between 3.12 and 3.13-rc1), but I don't have much faith in
that; people seem to have a habit of holding back fixes until -rc1,
which makes _exactly_ this kind of bug much harder for people like
yourselves to track down - or maybe even impossible.

I'm afraid I can't offer very much help beyond this until either I can
reproduce it, or someone manages to identify a particular change which
caused this.

--
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

2014-06-27 11:21:58

by Russell King - ARM Linux

Subject: Re: Oops: 17 SMP ARM (v3.16-rc2)

On Thu, Jun 26, 2014 at 04:14:24PM +0100, Russell King - ARM Linux wrote:
> On Thu, Jun 26, 2014 at 02:44:52PM +0000, Mattis Lorentzon wrote:
> > We have managed to trigger the Oops by just transferring a large file
> > over nfs
> > cat /mnt/foo > /dev/null
> > where foo is a file that is approximately 2 GB. There may be some
> > packet losses on this network, perhaps this differs from your workload?
>
> That's a similar workload to the one which is mentioned in the previous
> report. I've just set a similar transfer going, but this will be a 16GB
> file.

I've run this transfer several times, but so far I've been unable to reproduce
the issue here.

--
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

2014-06-27 16:31:17

by Russell King - ARM Linux

Subject: Re: Oops: 17 SMP ARM (v3.16-rc2)

Hi Fredrik,

On Fri, Jun 27, 2014 at 04:16:57PM +0000, Fredrik Noring wrote:
> Please find below a trace that appeared once with 3.16-rc2. Perhaps it is of
> some interest?

It's not that serious... I know that the FEC ethernet driver is
horrendously racy (I have had a patch set for about the last six months
which fixes some of its problems) but as I've had a lot of patches to
deal with, and it's been pushed to the back of the queue...

The races don't lead to data corruption though, merely timeouts and
some lost packets.

Now because things have changed during the last merge window, I've got
an even bigger problem sorting through that patch set and getting it
back into a submittable state. I've just sent out v2 for it onto the
[email protected] mailing list.

The initial version (marked RFC) attracted very little interest from
testers, or acks. I'd very much like to have some testing of it, so
if you want to try it out, I can provide you with a git URL, patches
or a combined patch.

--
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

2014-06-27 16:46:55

by Fredrik Noring

Subject: RE: Oops: 17 SMP ARM (v3.16-rc2)

Hi Russell,

> On Thu, Jun 26, 2014 at 04:14:24PM +0100, Russell King - ARM Linux wrote:
> > That's a similar workload to the one which is mentioned in the
> > previous report. I've just set a similar transfer going, but this
> > will be a 16GB file.
>
> I've run this transfer several times, but so far I've been unable to reproduce the
> issue here.

Many thanks for testing this. We attempted to bisect, but unfortunately the
result was not conclusive. One reason might be that the config had to be
updated during the process, and so we did not end up with the exact same
configuration (things like e.g. IMX_SDMA in DMA_ENGINE etc.). Some runs
deadlocked without any visible Oops or printout. Some versions did not have
an entirely working console configuration.

Please find below a trace that appeared once with 3.16-rc2. Perhaps it is of
some interest?

(We also had memtester run for days on the i.MX6 hardware, without issues.)

All the best,
Fredrik

------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x270/0x27c()
NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.16.0-rc2 #19
Backtrace:
[<80012390>] (dump_backtrace) from [<8001266c>] (show_stack+0x18/0x1c)
r6:00000108 r5:00000000 r4:8064e29c r3:00000000
[<80012654>] (show_stack) from [<8049791c>] (dump_stack+0x8c/0x9c)
[<80497890>] (dump_stack) from [<80024f4c>] (warn_slowpath_common+0x74/0x90)
r5:00000009 r4:80631d70
[<80024ed8>] (warn_slowpath_common) from [<80024fa0>] (warn_slowpath_fmt+0x38/0x40)
r8:806320c0 r7:9d85a254 r6:9d879000 r5:9d85a000 r4:00000000
[<80024f6c>] (warn_slowpath_fmt) from [<803b8ff0>] (dev_watchdog+0x270/0x27c)
r3:9d85a000 r2:805c4790
[<803b8d80>] (dev_watchdog) from [<8002f280>] (call_timer_fn+0x6c/0xe4)
r10:80630008 r9:9d85a000 r8:803b8d80 r7:00000100 r6:80630000 r5:00000001
r4:80631dd8
[<8002f214>] (call_timer_fn) from [<8002fec8>] (run_timer_softirq+0x1d4/0x254)
r10:803b8d80 r9:806320c0 r8:9d85a000 r7:00000000 r6:80631e28 r5:80667040
r4:9d85a284
[<8002fcf4>] (run_timer_softirq) from [<8002945c>] (__do_softirq+0x17c/0x30c)
r10:00000001 r9:80632080 r8:40000001 r7:80630000 r6:00000100 r5:80632084
r4:00000020
[<800292e0>] (__do_softirq) from [<80029920>] (irq_exit+0xd0/0x114)
r10:80630000 r9:80665f19 r8:00000001 r7:f4000100 r6:00000000 r5:80630008
r4:80630000
[<80029850>] (irq_exit) from [<8000f348>] (handle_IRQ+0x4c/0x98)
r5:0000001d r4:8062ce44
[<8000f2fc>] (handle_IRQ) from [<80008614>] (gic_handle_irq+0x34/0x64)
r6:80631f20 r5:80638a40 r4:f400010c r3:000000a0
[<800085e0>] (gic_handle_irq) from [<800131c4>] (__irq_svc+0x44/0x58)
Exception stack(0x80631f20 to 0x80631f68)
1f20: 00000001 00000001 00000000 8063b6f0 8063852c 806384d8 80665f19 804a0040
1f40: 00000001 80665f19 80630000 80631f74 00000000 80631f68 800614b8 8000f6a8
1f60: 200f0013 ffffffff
r7:80631f54 r6:ffffffff r5:200f0013 r4:8000f6a8
[<8000f67c>] (arch_cpu_idle) from [<8005cbf8>] (cpu_startup_entry+0x10c/0x164)
[<8005caec>] (cpu_startup_entry) from [<80492b68>] (rest_init+0xc8/0xd8)
r7:80625028 r3:00000000
[<80492aa0>] (rest_init) from [<805f6c5c>] (start_kernel+0x39c/0x3a8)
r5:00000001 r4:806385d0
[<805f68c0>] (start_kernel) from [<10008074>] (0x10008074)
---[ end trace a7b7109ab2d04e11 ]---

2014-06-30 06:22:40

by Fredrik Noring

Subject: RE: Oops: 17 SMP ARM (v3.16-rc2)

Hi Russell,

> -----Original Message-----
> It's not that serious... I know that the FEC ethernet driver is horrendously
> racy (I have had a patch set for about the last six months which fixes some of
> its problems) but as I've had a lot of patches to deal with, and it's been
> pushed to the back of the queue...
>
> The races don't lead to data corruption though, merely timeouts and some
> lost packets.

The serial port (uart1) and Ethernet are essentially the only things we use.
No disks, no graphics, no USB, etc. If not the Ethernet driver, what else is
likely to crash NFS so badly?

Also, we are happy to change our config if that would simplify things:

http://lkml.iu.edu/hypermail/linux/kernel/1406.3/01488/config.gz

> Now because things have changed during the last merge window, I've got an
> even bigger problem sorting through that patch set and getting it back into a
> submittable state. I've just sent out v2 for it onto the
> [email protected] mailing list.
>
> The initial version (marked RFC) attracted very little interest from testers, or
> acks. I'd very much like to have some testing of it, so if you want to try it
> out, I can provide you with a git URL, patches or a combined patch.

Sure! A combined gzip patch attachment is fine. Git over HTTP probably works
too.

All the best,
Fredrik

2014-06-30 12:44:41

by Fredrik Noring

Subject: RE: Oops: 17 SMP ARM (v3.16-rc2)

Hi Russell,

It seems to be a compiler issue, where (GCC) 4.8.2 does not produce a properly
working kernel. Happily, (Fedora 2013.11.24-2.fc19) 4.8.1 appears to do a lot
better. No crashes so far with v3.16-rc2!

All the best,
Fredrik

> -----Original Message-----
> Hi Fredrik,
>
> On Fri, Jun 27, 2014 at 04:16:57PM +0000, Fredrik Noring wrote:
> > Please find below a trace that appeared once with 3.16-rc2. Perhaps it
> > is of some interest?
>
> It's not that serious... I know that the FEC ethernet driver is horrendously
> racy (I have had a patch set for about the last six months which fixes some of
> its problems) but as I've had a lot of patches to deal with, and it's been
> pushed to the back of the queue...
>
> The races don't lead to data corruption though, merely timeouts and some
> lost packets.
>
> Now because things have changed during the last merge window, I've got an
> even bigger problem sorting through that patch set and getting it back into a
> submittable state. I've just sent out v2 for it onto the
> [email protected] mailing list.
>
> The initial version (marked RFC) attracted very little interest from testers, or
> acks. I'd very much like to have some testing of it, so if you want to try it
> out, I can provide you with a git URL, patches or a combined patch.
>
> --
> FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
> improving, and getting towards what was expected from it.

2014-06-30 13:00:46

by Lynch, Nathan

Subject: Re: Oops: 17 SMP ARM (v3.16-rc2)

On 06/30/2014 07:30 AM, Fredrik Noring wrote:
>>
>> On Fri, Jun 27, 2014 at 04:16:57PM +0000, Fredrik Noring wrote:
>>> Please find below a trace that appeared once with 3.16-rc2. Perhaps it
>>> is of some interest?
>>
>> It's not that serious... I know that the FEC ethernet driver is horrendously
>> racy (I have had a patch set for about the last six months which fixes some of
>> its problems) but as I've had a lot of patches to deal with, and it's been
>> pushed to the back of the queue...
>>
>> The races don't lead to data corruption though, merely timeouts and some
>> lost packets.

> It seems to be a compiler issue, where (GCC) 4.8.2 does not produce a
> properly working kernel. Happily, (Fedora 2013.11.24-2.fc19) 4.8.1 appears
> to do a lot better. No crashes so far with v3.16-rc2!
>

Did you narrow it down to a particular GCC bug? The symptoms you
reported remind me of:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58854

Sadly, unpatched GCC 4.8.1 and 4.8.2 are unsuitable for building ARM
kernels.
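
A build script could flag the releases named above before compiling. The
following is only a sketch: the function names are hypothetical, and the
version list comes solely from this thread. A version string cannot show
whether a distribution has patched its compiler, so a match is a prompt
to check further, not proof of a miscompile.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the version string names one of the GCC
# releases this thread reports as unsuitable when unpatched.
is_suspect_gcc() {
    case "$1" in
        4.8.1|4.8.2) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: query the (cross) compiler and print a warning on a
# match. CROSS_COMPILE is the usual kernel build prefix, for example
# "arm-linux-gnueabihf-"; it may be empty for a native build.
check_kernel_compiler() {
    cc="${CROSS_COMPILE:-}gcc"
    ver=$("$cc" -dumpversion) || return 2
    if is_suspect_gcc "$ver"; then
        echo "warning: $cc $ver may miscompile ARM kernels unless patched"
        return 1
    fi
    echo "$cc $ver is not on the suspect list"
}

# Example usage:
#   CROSS_COMPILE=arm-linux-gnueabihf- check_kernel_compiler
```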

2014-07-02 06:03:04

by Fredrik Noring

Subject: RE: Oops: 17 SMP ARM (v3.16-rc2)

Hi Russell,

> -----Original Message-----
> > The initial version (marked RFC) attracted very little interest from
> > testers, or acks. I'd very much like to have some testing of it, so
> > if you want to try it out, I can provide you with a git URL, patches
> > or a combined patch.
>
> Sure! A combined gzip patch attachment is fine. Git over HTTP probably
> works too.

We are still interested in trying out your patches to improve network
performance. We can do some testing this week and in August.

Best regards,
Fredrik

2014-12-16 15:16:42

by Mattis Lorentzon

Subject: RE: Oops: 17 SMP ARM (v3.16-rc2)

Hi Russell,

> Now because things have changed during the last merge window, I've got
> an even bigger problem sorting through that patch set and getting it
> back into a submittable state. I've just sent out v2 for it onto the
> [email protected] mailing list.
>
> The initial version (marked RFC) attracted very little interest from
> testers, or acks. I'd very much like to have some testing of it, so
> if you want to try it out, I can provide you with a git URL, patches or a
> combined patch.

We have run v3.16 for about three months now, and many millions of ssh
connections on eight separate systems, both without and with your network
patches. Our conclusion is that the patches clearly reduce the number of
network timeouts, and this is a great improvement. However, after a month
or so of uptime, the number of timeouts began to increase again, forcing us
to reboot the cards.

Best regards,
Mattis Lorentzon