2010-04-01 13:22:03

by Daniel Mack

Subject: Memory corruption with 2.6.32.10, but not with 2.6.34-rc3

Hi,

we observed repeated occurrences of memory corruption (Oopses somewhere
deep down in the memory management code) on ARM PXA300 based boards.

The systems we see this on (arch/arm/mach-pxa/raumfeld.c) feature a
libertas chipset for WiFi, an ethernet controller (smsc9220), a USB
fullspeed host, and NAND flash which is used as UBIFS storage.

Currently, these boards run a 2.6.32.10 kernel. After collecting
evidence for a week or so about when, how and why the memory
corruptions happen, I tried a 2.6.34-rc3 today and the issue seems fixed
there. So apparently some important fix since 2.6.32 was never
backported to the stable branch.

The bug is rather hard to trigger. What I currently do is: after the
system booted from NAND (UBIFS root partition), I wait for the WPA2
secured WiFi link to get active and then download a file (~8MB) over
WiFi to local storage. This download is done in an endless loop. Once in
a while this crashes the 2.6.32.10 kernel instantly, sometimes it takes
up to ~5hrs to happen.
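
For reference, the download loop described above could look roughly like
the following sketch. The URL, the target path and the FETCH override are
placeholders chosen for illustration; none of them come from the report.

```shell
# Placeholder values (assumptions, not taken from the report).
URL="${URL:-http://192.168.0.1/testfile.bin}"
OUT="${OUT:-/tmp/testfile.bin}"
# Command used to fetch; overridable so the loop can be dry-run.
FETCH="${FETCH:-wget -q -O}"

# download_loop N: fetch the file N times; N=0 means loop forever.
download_loop() {
    max="$1"
    i=0
    while [ "$max" -eq 0 ] || [ "$i" -lt "$max" ]; do
        i=$((i + 1))
        # Word splitting of $FETCH is intentional (command plus options).
        $FETCH "$OUT" "$URL" || { echo "download $i failed"; return 1; }
        echo "iteration $i done"
    done
}
```

Calling "download_loop 0" keeps downloading until a transfer fails or the
board crashes.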

Some findings I collected over the last weeks:

- when calling wget with '-O /dev/null' to not write any file
-> does NOT crash

- downloading via Ethernet instead of WiFi
-> does NOT crash

- writing the file to either a tmpfs partition or a fatfs (on USB
connected external media)
-> DOES still crash (so it is most likely not a UBIFS issue)

- passing --limit-rate=50000 to wget (to limit the traffic
throughput to 50 kB/s) _in_creases the probability of the crash

- running userspace applications which heavily allocate and
deallocate memory doesn't seem to make the bug more or less
likely

So my current summary is that this is related to WiFi, but OTOH it still
only happens when file system traffic is issued.

We would like to have a fix for this annoying bug in the stable series
(especially 2.6.32.x) as well, but I don't have many ideas about where
to search for it. Hence, I would appreciate it if maintainers could think
about any possible commits in the described time window which haven't
reached stable. Does the description ring a bell for anyone?

I can cherry-pick things if anyone pinpoints something and run
long-running tests again. Any pointer appreciated.

Thanks,
Daniel



2010-04-01 17:00:09

by Daniel Mack

Subject: Re: Memory corruption with 2.6.32.10, but not with 2.6.34-rc3

On Thu, Apr 01, 2010 at 09:50:56AM -0700, Greg KH wrote:
> On Thu, Apr 01, 2010 at 03:21:56PM +0200, Daniel Mack wrote:
> > So my current summary is that this is related to WiFi, but OTOH it still
> > only happens when file system traffic is issued.
> >
> > We would like to have a fix for this annoying bug in the stable series
> > (especially 2.6.32.x) as well, but I don't have many ideas about where
> > to search for it. Hence, I would appreciate it if maintainers could think
> > about any possible commits in the described time window which haven't
> > reached stable. Does the description ring a bell for anyone?
>
> I can't think of any USB specific patches that would be related to this,
> sorry.

Yes, I'd rule out USB anyway. It crashes without any USB function as
well. I just copied you as the maintainer of the stable tree :)

Daniel


2010-04-01 17:58:04

by Daniel Mack

Subject: Re: Memory corruption with 2.6.32.10, but not with 2.6.34-rc3

On Thu, Apr 01, 2010 at 07:29:59PM +0200, Anders Grafström wrote:
> Daniel Mack wrote:
> > We would like to have a fix for this annoying bug in the stable series
> > (especially 2.6.32.x) as well, but I don't have many ideas about where
> > to search for it. Hence, I would appreciate it if maintainers could think
> > about any possible commits in the described time window which haven't
> > reached stable. Does the description ring a bell for anyone?
> >
> > I can cherry-pick things if anyone pinpoints something and run
> > long-running tests again. Any pointer appreciated.
>
> You could try this one:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8e4971f2fb2380ce66196136e113d04196b80fcd

Thanks. Unfortunately, that didn't fix it (this time, it crashed after a
few minutes only). But shouldn't we have that in -stable anyway?

Daniel


2010-04-01 16:52:30

by Greg KH

Subject: Re: Memory corruption with 2.6.32.10, but not with 2.6.34-rc3

On Thu, Apr 01, 2010 at 03:21:56PM +0200, Daniel Mack wrote:
>
> I can cherry-pick things if anyone pinpoints something and run
> long-running tests again. Any pointer appreciated.

Oh, how about running 'git bisect' to try to find the solution? Just
remember to reverse 'good' and 'bad' for when you tell git bisect what
the results are.
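
A minimal sketch of that reversed labelling, assuming plain v2.6.32
shows the crash as well (only 2.6.32.10 was actually tested in this
thread). The helper name reversed_verdict is hypothetical.

```shell
# Hunting for a fix, so the labels are swapped relative to a normal bisect:
#   "bad"  = commit already contains the fix (no crash observed)
#   "good" = commit still shows the crash
#
# Start with the fixed kernel as "bad" and the crashing one as "good":
#   git bisect start v2.6.34-rc3 v2.6.32

# Hypothetical helper: map a test outcome ("crash" or "ok") to the label
# that git bisect expects in this reversed setup.
reversed_verdict() {
    case "$1" in
        crash) echo good ;;
        ok)    echo bad ;;
        *)     echo "usage: reversed_verdict crash|ok" >&2; return 2 ;;
    esac
}

# After each test run, for example:
#   git bisect "$(reversed_verdict crash)"
```

Later Git releases added custom terms (git bisect start --term-old=broken
--term-new=fixed), which avoid the mental swap; that option did not
exist when this thread was written.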

thanks,

greg k-h

2010-04-01 16:59:04

by Daniel Mack

Subject: Re: Memory corruption with 2.6.32.10, but not with 2.6.34-rc3

On Thu, Apr 01, 2010 at 09:51:44AM -0700, Greg KH wrote:
> On Thu, Apr 01, 2010 at 03:21:56PM +0200, Daniel Mack wrote:
> >
> > I can cherry-pick things if anyone pinpoints something and run
> > long-running tests again. Any pointer appreciated.
>
> Oh, how about running 'git bisect' to try to find the solution? Just
> remember to reverse 'good' and 'bad' for when you tell git bisect what
> the results are.

Yep, I thought about that of course. But unfortunately, the platform
got merged into mainline in the middle of that time window, which makes
bisecting tricky. And worse than that: every test run takes around half
a day at least :(

Daniel

2010-04-01 16:52:26

by Greg KH

Subject: Re: Memory corruption with 2.6.32.10, but not with 2.6.34-rc3

On Thu, Apr 01, 2010 at 03:21:56PM +0200, Daniel Mack wrote:
> Hi,
>
> we observed repeated occurrences of memory corruption (Oopses somewhere
> deep down in the memory management code) on ARM PXA300 based boards.
>
> The systems we see this on (arch/arm/mach-pxa/raumfeld.c) feature a
> libertas chipset for WiFi, an ethernet controller (smsc9220), a USB
> fullspeed host, and NAND flash which is used as UBIFS storage.
>
> Currently, these boards run a 2.6.32.10 kernel. After collecting
> evidence for a week or so about when, how and why the memory
> corruptions happen, I tried a 2.6.34-rc3 today and the issue seems fixed
> there. So apparently some important fix since 2.6.32 was never
> backported to the stable branch.
>
> The bug is rather hard to trigger. What I currently do is: after the
> system booted from NAND (UBIFS root partition), I wait for the WPA2
> secured WiFi link to get active and then download a file (~8MB) over
> WiFi to local storage. This download is done in an endless loop. Once in
> a while this crashes the 2.6.32.10 kernel instantly, sometimes it takes
> up to ~5hrs to happen.
>
> Some findings I collected over the last weeks:
>
> - when calling wget with '-O /dev/null' to not write any file
> -> does NOT crash
>
> - downloading via Ethernet instead of WiFi
> -> does NOT crash
>
> - writing the file to either a tmpfs partition or a fatfs (on USB
> connected external media)
> -> DOES still crash (so it is most likely not a UBIFS issue)
>
> - passing --limit-rate=50000 to wget (to limit the traffic
> throughput to 50 kB/s) _in_creases the probability of the crash
>
> - running userspace applications which heavily allocate and
> deallocate memory doesn't seem to make the bug more or less
> likely
>
> So my current summary is that this is related to WiFi, but OTOH it still
> only happens when file system traffic is issued.
>
> We would like to have a fix for this annoying bug in the stable series
> (especially 2.6.32.x) as well, but I don't have many ideas about where
> to search for it. Hence, I would appreciate it if maintainers could think
> about any possible commits in the described time window which haven't
> reached stable. Does the description ring a bell for anyone?

I can't think of any USB specific patches that would be related to this,
sorry.

good luck,

greg k-h

2010-04-05 11:00:00

by Uwe Kleine-König

Subject: Re: Memory corruption with 2.6.32.10, but not with 2.6.34-rc3

On Thu, Apr 01, 2010 at 06:58:57PM +0200, Daniel Mack wrote:
> On Thu, Apr 01, 2010 at 09:51:44AM -0700, Greg KH wrote:
> > On Thu, Apr 01, 2010 at 03:21:56PM +0200, Daniel Mack wrote:
> > >
> > > > I can cherry-pick things if anyone pinpoints something and run
> > > > long-running tests again. Any pointer appreciated.
> >
> > Oh, how about running 'git bisect' to try to find the solution? Just
> > remember to reverse 'good' and 'bad' for when you tell git bisect what
> > the results are.
>
> Yep, I thought about that of course. But unfortunately, the platform
> got merged into mainline in the middle of that time window, which makes
> bisecting tricky. And worse than that: every test run takes around half
> a day at least :(

What you can do is backport the platform support on top of the revision
initially marked good (in a branch named, say, foo) and, when asked to
test, do:

git merge --no-commit foo
<test>
git reset --hard
git bisect {good|bad}

Assuming the platform support went in in one go, the merge should always
work just fine. (Commits in the middle of that series shouldn't be
tested; you can simply skip them.)
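
The steps above could be wrapped roughly as follows. This is only a
sketch: the function names and the GIT override (useful for a dry run
with GIT=echo) are additions for illustration and not part of the
procedure itself.

```shell
# Git command to run; overridable (e.g. GIT=echo) to print instead of act.
GIT="${GIT:-git}"

# bisect_prepare BRANCH: merge the backported platform support into the
# commit that bisect checked out, without creating a commit.
bisect_prepare() {
    $GIT merge --no-commit "$1"
}

# bisect_report VERDICT: throw away the temporary merge, then tell
# bisect the result of the test run ("good", "bad" or "skip").
bisect_report() {
    case "$1" in
        good|bad|skip) ;;
        *) echo "verdict must be good, bad or skip" >&2; return 2 ;;
    esac
    $GIT reset --hard || return 1
    $GIT bisect "$1"
}
```

A round then becomes: "bisect_prepare foo", build and test the kernel,
and finally "bisect_report good" or "bisect_report bad".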

Best regards
Uwe

--
Pengutronix e.K. | Uwe Kleine-König |
Industrial Linux Solutions | http://www.pengutronix.de/ |

2010-04-01 17:30:08

by Anders Grafström

Subject: Re: Memory corruption with 2.6.32.10, but not with 2.6.34-rc3

Daniel Mack wrote:
> We would like to have a fix for this annoying bug in the stable series
> (especially 2.6.32.x) as well, but I don't have many ideas about where
> to search for it. Hence, I would appreciate it if maintainers could think
> about any possible commits in the described time window which haven't
> reached stable. Does the description ring a bell for anyone?
>
> I can cherry-pick things if anyone pinpoints something and run
> long-running tests again. Any pointer appreciated.

You could try this one:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8e4971f2fb2380ce66196136e113d04196b80fcd