2011-10-29 17:46:15

by Tomas Janousek

Subject: iwlagn: memory corruption with WPA enterprise

Hello,

ever since I got this ThinkPad T420, I've been having strange issues, which I
figured are caused by the iwlwifi driver. What happens is that when I'm
connected to a WPA enterprise (not PSK) access point, sooner or later things
start crashing -- sometimes the web browser goes down, and when I run some
heavy task like kernel compilation, I can be almost sure it will fail.

Some more information is provided by other people in this launchpad ticket:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/815148

I'm not an Ubuntu user, though; I'm running Debian with upstream kernels I
compile myself, and the issue still persists. I was able to reproduce it with
3.0.0 from the Debian package, with upstream 3.0.4, with 3.1 and even with
today's davem/net-next merged into 3.1 (I didn't try current Linus' tree, but
I guess that might be unnecessary).

Also, I tried two versions of uCode: 17.168.5.1 from the Debian package and
17.168.5.3 from the intellinuxwireless webpage. Neither helps.

My reproduction steps are a bit different from what's in that launchpad
ticket, so I'll spell them out here:

1. Connect to a WPA-EAP (enterprise) network.
2. Do some networking (torrent is a safe bet).
3. Run make -j4 on the kernel tree.
4. Wait until the compiler reports an internal error, segfaults, or does some
other weird thing it shouldn't do.
5. When this happens, things start to really go crazy -- if you for example
run 10 gccs in parallel again and again, and continue to do some
networking, it will fail every few seconds. Disconnecting from the network
and connecting again brings you to step 4 again, having to wait until
something somewhere goes wrong and things start crashing.

Step 4 is a bit tricky and might sometimes take more than an hour (causing you
to repeat step 3). I've had great success accelerating it by telling my
friends that I just found a configuration which seems to work okay. This,
however, might not really help you reproduce the issue, but it did save me a
couple of ten-minute waits, I am sure.

Is there anything I can do to track this down? Perhaps try some experimental
uCode or something?

Thanks,
--
Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/


2011-10-31 15:03:22

by Stanislaw Gruszka

Subject: Re: iwlagn: memory corruption with WPA enterprise

On Sat, Oct 29, 2011 at 07:15:54PM +0200, Tomáš Janoušek wrote:
> Is there anything I can do to track this down? Perhaps try some experimental
> uCode or something?

You may try debugging patches I posted a while ago:
http://marc.info/?l=linux-mm&m=131914560820378&w=2
http://marc.info/?l=linux-mm&m=131914560820293&w=2
http://marc.info/?l=linux-mm&m=131914560820317&w=2

With a bit of luck, the kernel should panic and dump a call trace when
bad code starts to write to memory addresses it is not supposed
to.

You have to compile the kernel with CONFIG_DEBUG_PAGEALLOC and add
corrupt_dbg=1 to the kernel command line to catch memory corruption.
However, that may not work if you have a small amount of memory.

It would also be good to enable other debug options:

CONFIG_DEBUG_OBJECTS=y
CONFIG_DEBUG_OBJECTS_FREE=y
CONFIG_DEBUG_OBJECTS_TIMERS=y
CONFIG_DEBUG_OBJECTS_WORK=y
CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT=1

CONFIG_DEBUG_SG=y

CONFIG_DEBUG_LIST=y
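
The options above can be checked against a .config before building. What follows is a minimal sketch, not part of the posted patches: check_debug_config is a hypothetical helper name, and it assumes each option appears in the .config exactly as written in this mail.

```shell
# check_debug_config FILE
# Hypothetical helper: report which of the debug options from this mail
# are not set in a kernel .config (FILE defaults to ./.config).
check_debug_config() {
    config_file="${1:-.config}"
    missing=0
    for opt in \
        CONFIG_DEBUG_PAGEALLOC=y \
        CONFIG_DEBUG_OBJECTS=y \
        CONFIG_DEBUG_OBJECTS_FREE=y \
        CONFIG_DEBUG_OBJECTS_TIMERS=y \
        CONFIG_DEBUG_OBJECTS_WORK=y \
        CONFIG_DEBUG_OBJECTS_RCU_HEAD=y \
        CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT=1 \
        CONFIG_DEBUG_SG=y \
        CONFIG_DEBUG_LIST=y
    do
        # -x: match the whole line, -F: treat the option as a fixed string
        if ! grep -qxF -- "$opt" "$config_file"; then
            echo "missing: $opt"
            missing=1
        fi
    done
    # 0 when every option is present, 1 otherwise
    return "$missing"
}
```

Called as `check_debug_config /path/to/linux/.config`, it prints one "missing:" line per absent option and returns non-zero if any option is absent.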

Stanislaw

2011-11-21 13:09:20

by Tomas Janousek

Subject: Re: iwlagn: memory corruption with WPA enterprise

Hello,

On Mon, Nov 21, 2011 at 02:05:28PM +0100, Stanislaw Gruszka wrote:
> The farther we get the problem is more and more strange.
>
> Device that write to wrong address, would generate:
>
> DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
> DMAR:[fault reason 05] PTE Write access is not set

And that's exactly what happens if I don't disable firewire-ohci, because of
that stupid Ricoh multifunction blah blah issue.

> But only if
>
> PCI-DMA: Using DMAR IOMMU
>
> was printed in dmesg before. DMAR can be disabled by graphics
> driver, or maybe by other drivers too. Then above print will
> be missed in dmesg and protection would not work.
>
> Try "dmesg | grep DMAR" to see if DMA remapping is really is
> in use.

Well, this message is not printed, but as I said, loading firewire-ohci
triggers DMAR faults, so it should be in use anyway.

--
Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/

2011-11-10 16:06:15

by Stanislaw Gruszka

Subject: Re: iwlagn: memory corruption with WPA enterprise

On Thu, Nov 10, 2011 at 01:53:47PM +0100, Tomáš Janoušek wrote:
> > If "dmesg | grep corrupt" will show "Setting corrupt debug order to 1"
> > patches are in use. Anyway I need to test the patches locally, to see
> > if they work as expected, perhaps exception is generated but call-trace
> > is not printed.
>
> It does say that, yes.

I tested the patches. They generate a call trace and make the kernel panic
when I wrote to a random address from user address space. However, to make
the kernel panic we should keep as much memory free as possible; otherwise
the bad code corrupts unprotected data. In other words, when you run a
memory-intensive application, the corruption may hit valid data. So to catch
the bug, you should just use the network, and perhaps load the CPU, e.g. with
this bash command:

while true; do : ; done

Please also set CONFIG_DEBUG_SET_MODULE_RONX=y; it protects module
text/read-only memory against corruption.

> > Is this happen only with "Intel Corporation Centrino Advanced-N 6205" or
> > with some other adapters?
>
> I don't have any other iwlwifi adapters, so I wouldn't know. The people in
> that Ubuntu bugreport have exactly that card as well, but in another notebook.
> And they claim it works in newer Ubuntu, but I am running latest kernels with
> latest uCode, so I'm out of ideas what else could be wrong.

That's a good hint for the Intel folks. It would be ideal if a developer
could reproduce that. I do not have this exact adapter model.

> > > Perhaps it would be cheaper to just get another card in that case.
> > > :-)
> >
> > That will left issue unresolved :-(
>
> Yeah, but considering how few people report this, I'm starting to feel that it
> might in fact be a hardware issue.

It's possible, but I don't think so. In my experience, the majority of
corruption problems were caused by software. All the true hardware
corruptions I have met were on development boards, many months before they
went into production.

> (We've got a lot of Lenovos here, mostly T520 and T420s, most of them running
> Fedora, and nobody has reported memory corruption problems.

Are there any others with a 6205? If not, that would confirm the issue is
related to that model.

> Perhaps I should try to connect to this WPA Enterprise using Windows and see
> if anything goes wrong. However, I have no clue as to what shall I do to
> reproduce the issue in Windows.)

You may first try some older kernel as Wey suggested, e.g. 2.6.38.

Thanks
Stanislaw

2011-11-09 16:05:41

by Wey-Yi Guy

Subject: Re: iwlagn: memory corruption with WPA enterprise

On Wed, 2011-11-09 at 07:54 -0800, Tomáš Janoušek wrote:
> Hello Stanislaw,
>
> On Mon, Oct 31, 2011 at 05:03:43PM +0100, Stanislaw Gruszka wrote:
> > You may try debugging patches I posted a while ago:
> > http://marc.info/?l=linux-mm&m=131914560820378&w=2
> > http://marc.info/?l=linux-mm&m=131914560820293&w=2
> > http://marc.info/?l=linux-mm&m=131914560820317&w=2
> >
> > With a bit of luck, kernel should panic and dump call-trace when
> > bad code start to write at memory addresses where is not suppose
> > to.
>
> Thanks for your suggestions. I did as you told me, applied those 3 patches on
> top of 3.1 + net-next (the one from 29 Oct 2011), enabled all those things in
> config and passed corrupt_dbg=1 on cmdline, but the problem happens without
> anything being written to dmesg.
>
> Am I just lacking a bit of luck, or could it mean something (like that the
> error is in hardware, the microcode, or something like that)?
>
I am not sure it is uCode related. Is there any chance you can bisect
the problem?

We did a major code restructure around the .38/.39 time frame, and I am not
sure whether it caused some unexpected problem.

In the meantime, we will follow your instructions and see if we can
reproduce the problem you are seeing.

Thanks
Wey



2011-11-21 13:38:36

by Stanislaw Gruszka

Subject: Re: iwlagn: memory corruption with WPA enterprise

On Mon, Nov 21, 2011 at 02:09:16PM +0100, Tomáš Janoušek wrote:
> On Mon, Nov 21, 2011 at 02:05:28PM +0100, Stanislaw Gruszka wrote:
> > The farther we get the problem is more and more strange.
> >
> > Device that write to wrong address, would generate:
> >
> > DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
> > DMAR:[fault reason 05] PTE Write access is not set
>
> And that's exactly what happens if I don't disable firewire-ohci, because of
> that stupid Ricoh multifunction blah blah issue.

So maybe the problem is caused by the Ricoh, not by iwlagn. But if so,
why does blacklisting iwlagn help? Weird. Did you disable firewire in
the BIOS, or just blacklist the module?

> > Try "dmesg | grep DMAR" to see if DMA remapping is really is
> > in use.
>
> Well, this message is not printed, but as I said, loading firewire-ohci
> triggers DMAR faults, so it should be in use anyway.

So the IOMMU is in use and it does not prevent the corruption, crap.

OK, maybe let's try to find a better reproducer first.
I wrote a simple program that fills memory with a pattern and
then checks every second whether the pattern is still there.
It can be used like:
./checkmem 100M 30M
where the first argument is the size of the memory it will allocate and
check, and the second specifies the number of internal loop iterations used
to keep the CPU busy (a bigger value will cause more CPU power consumption).
Many instances of the program can be run at once.

Tomáš, please try to reproduce with that program; I'm attaching
it. When corruption is detected, checkmem will print the invalid
values, so it may be possible to find out what content is
written to memory.

Stanislaw


Attachments:
checkmem.c (2.28 kB)

2011-11-10 09:18:23

by Tomas Janousek

Subject: Re: iwlagn: memory corruption with WPA enterprise

Hello,

On Wed, Nov 09, 2011 at 05:51:59PM +0100, Stanislaw Gruszka wrote:
> I just discovered that CONFIG_DEBUG_PAGEALLOC does not work as expected.
> It leave most of free pages unprotected, hence unintentional write to
> them is not discovered. I'm attaching additional patch, which should
> make detection actually work.
>
> If kernel will does not boot with corrupt_dbg=1, you may try to catch
> corruption without that option. Attached patch should make it possible,
> however having corrupt_dbg=1 increase probability of the catch.

Okay, I applied this additional patch, and by the increased memory usage (as
shown by free) I concluded that it indeed works. However, I was still able to
reproduce the issue without a single error being written to dmesg. :-(

I will try some really old kernels as Wey suggested to see whether it makes
any sense to bisect it, but if it does, it might take more time than I can
make free. Perhaps it would be cheaper to just get another card in that case.
:-)

Kind regards,
--
Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/

2011-11-11 05:46:38

by Stanislaw Gruszka

Subject: Re: iwlagn: memory corruption with WPA enterprise

On Thu, Nov 10, 2011 at 05:30:51PM +0100, Tomáš Janoušek wrote:
> I think all those T520 and T420s models have exactly the same wireless card. I
> don't see why they wouldn't have. :-)
Well, laptop models have common mainboards, but peripheral components
might not be identical. The wireless card in particular might vary from unit
to unit. For example:

T520 with iwl6300:
http://smolt.fedoraproject.org/client/show/pub_0cbc8731-47d3-4ecf-8add-aa832b34edcd
T520 with iwl1000:
http://smolt.fedoraproject.org/client/show/pub_02f601d6-7a93-4c0e-bc01-a4a95db36086
T520 with iwl6205:
http://smolt.fedoraproject.org/client/show/pub_0232252f-75a9-4050-a69b-c8e95e00eaf2

Stanislaw

2011-11-10 16:18:57

by Wey-Yi Guy

Subject: Re: iwlagn: memory corruption with WPA enterprise

On Thu, 2011-11-10 at 08:07 -0800, Stanislaw Gruszka wrote:
> On Thu, Nov 10, 2011 at 01:53:47PM +0100, Tomáš Janoušek wrote:
> > > If "dmesg | grep corrupt" will show "Setting corrupt debug order to 1"
> > > patches are in use. Anyway I need to test the patches locally, to see
> > > if they work as expected, perhaps exception is generated but call-trace
> > > is not printed.
> >
> > It does say that, yes.
>
> I tested patches. They generate call-trace and make kernel panic when I
> wrote at random address from user address space. However to make kernel
> panic, we should keep as much as possible free memory, otherwise bad code
> corrupt not-protected data. In other words, when you run memory intensive
> application, corruption may happen on valid data. So to catch the bug,
> you should just use network, and perhaps stress up cpu i.e: by this bash
> command:
>
> while true; do : ; done
>
> Please also configure CONFIG_DEBUG_SET_MODULE_RONX=y, it protect modules
> text/read-only memory against corruption.
>
> > > Is this happen only with "Intel Corporation Centrino Advanced-N 6205" or
> > > with some other adapters?
> >
> > I don't have any other iwlwifi adapters, so I wouldn't know. The people in
> > that Ubuntu bugreport have exactly that card as well, but in another notebook.
> > And they claim it works in newer Ubuntu, but I am running latest kernels with
> > latest uCode, so I'm out of ideas what else could be wrong.
>
> That's good hint for Intel folks. Would be ideal if any developer could
> reproduce that. I do not have this exact adapter model.
>
> > > > Perhaps it would be cheaper to just get another card in that case.
> > > > :-)
> > >
> > > That will left issue unresolved :-(
> >
> > Yeah, but considering how few people report this, I'm starting to feel that it
> > might in fact be a hardware issue.
>
> It's possible, but I don think so. In my practice, majority of corruption
> problems was caused by software. All true hardware corruptions I meet, was on
> development boards, many months before they went into production.
>
> > (We've got a lot of Lenovos here, mostly T520 and T420s, most of them running
> > Fedora, and nobody has reported memory corruption problems.
>
> Are there any others with 6205? If not that would confirm issue is
> related with that model.

We tried very hard on the 6205 but cannot reproduce this issue. I agree with
Stanislaw that the memory corruption is most likely a software problem.
1. Please try an older kernel, and if possible bisect the kernel.
2. Could you provide your system information (model, CPU, memory,
graphics, ...), as well as the OS/kernel version and .config file? I am not
sure we have a similar system available, but I would like to see if
anything stands out.

Thanks
Wey

>
> > Perhaps I should try to connect to this WPA Enterprise using Windows and see
> > if anything goes wrong. However, I have no clue as to what shall I do to
> > reproduce the issue in Windows.)
>
> You may first try some older kernel as Wey suggested, i.e. 2.6.38.
>




2011-11-10 16:42:33

by Tomas Janousek

Subject: Re: iwlagn: memory corruption with WPA enterprise

Hello,

On Thu, Nov 10, 2011 at 07:24:45AM -0800, Guy, Wey-Yi wrote:
> 2. could you provide your system information (model, CPU, memory,
> graphic, ...), also the OS/kernel version and .config file. not sure we
> have the similar system available, but I will like to see if there
> anything stand out.

I'm attaching outputs of dmesg, lspci, dmidecode, my .config (the one I used
with 3.1+net-next). The distro I use is a 32-bit Debian wheezy, with most
packages not being older than 4 months. Do you need anything else?

(Don't be scared by the vboxdrv messages. Those modules don't compile with
3.2-rc1, and the bug happens there as well.)

Regards,
--
Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/


Attachments:
dmesg (83.51 kB)
lspci (12.10 kB)
dmidecode (16.62 kB)
.config (95.49 kB)

2011-11-21 14:32:30

by Tomas Janousek

Subject: Re: iwlagn: memory corruption with WPA enterprise

Hello,

On Mon, Nov 21, 2011 at 02:40:57PM +0100, Stanislaw Gruszka wrote:
> On Mon, Nov 21, 2011 at 02:09:16PM +0100, Tomáš Janoušek wrote:
> > And that's exactly what happens if I don't disable firewire-ohci, because of
> > that stupid Ricoh multifunction blah blah issue.
>
> So maybe problem is caused by Ricoh, not by iwlagn. But if so
> why blacklisting iwlagn help? Wired. Did you disable firewire in
> BIOS, or just blacklist module?

I don't think so. The problem with Ricoh is that it has two PCI devices:
0d:00.0 and 0d:00.3, but both use 0d:00.0 in DMA requests, which triggers DMAR
faults. Of course, I can try disabling that in BIOS, but I doubt it will
change anything.

> Ok maybe let's try to find some better reproducer first.
> I wrote simple program that fill memory with some pattern, and
> then check every one second if pattern is still there.
> It can be used like:
> ./checkmem 100M 30M
> where first argument is size of memory it will alloc and check,
> second specify number of internal loops to make cpu busy (bigger
> value will cause more cpu power consumption). Many instances of
> the program can be running at once.
>
> Tomáš, please try to reproduce with that program, I'm attaching
> it. When corruption will be detected, checkmem will print invalid
> values, maybe would be possible to find out what contents is
> written to memory.

Okay, I'll try that later. Yesterday I tried to use memtester [1] to catch the
problem, but it never occurred in it. Kernel compilation was segfaulting and
that tool was running alongside being all happy. But maybe it has different
memory usage pattern or something. Anyway, I'll try your program and if it
doesn't work, I'll make it so. :-)

[1] http://pyropus.ca/software/memtester/

Regards,
--
Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/

2011-11-21 13:03:14

by Stanislaw Gruszka

Subject: Re: iwlagn: memory corruption with WPA enterprise

Hi Tomáš

On Sat, Nov 19, 2011 at 07:11:06PM +0100, Tomáš Janoušek wrote:
> On Mon, Nov 14, 2011 at 03:07:15PM +0100, Stanislaw Gruszka wrote:
> > On Fri, Nov 11, 2011 at 04:01:05PM +0100, Tomáš Janoušek wrote:
> > > Could you please elaborate on that thing with enabling IOMMU? The only thing I
> > > know about IOMMU is that it is somehow related to VT-d (passing whole PCI
> > > devices to virtual guests), and that I have to pass intel_iommu=off to kernel
> > > command line, otherwise the machine doesn't even boot. Is that a problem?
> > Yes. That mean iommu software or hardware is broken on your system.
> >
> > I have no other ideas how to track this down. I think now, this is
> > a firwmare issue. BTW, you suspected that from very beginning :-)
> > This could be also a driver issue, but AFAICT programing DMA do not
> > differ on 6205 from other devices, so bug in firmware is much more
> > probable reason of corruption.
>
> I have some news. I got IOMMU to work, because I identified the problem [1]
> and disabled firewire-ohci for the time being completely, but I'm not sure
> what do I need to do to make it catch the problem. I assumed that all I need
> is to intel_iommu=on and then all devices do DMA stuff in isolation, but I can
> still reproduce the issue without the smallest hint of an error in dmesg. Does
> it tell us anything, or shall I enable some more debugging stuff?
>
> [1] http://thread.gmane.org/gmane.linux.kernel.pci/8765/focus=1217800

The further we get, the stranger the problem becomes.

A device that writes to a wrong address would generate:

DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
DMAR:[fault reason 05] PTE Write access is not set

But only if

PCI-DMA: Using DMAR IOMMU

was printed in dmesg before. DMAR can be disabled by the graphics
driver, or maybe by other drivers too. In that case the above line will
be missing from dmesg and the protection will not work.

Try "dmesg | grep DMAR" to see if DMA remapping really is
in use.
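
The two checks described above can be combined into one pass over the kernel log. This is a minimal sketch, assuming the log lines look exactly like the ones quoted in this mail; dmar_status is a hypothetical helper name, and it reads the log text on stdin.

```shell
# dmar_status
# Hypothetical helper: read kernel log text on stdin and report
#  - whether the "PCI-DMA: Using DMAR IOMMU" banner is present
#  - how many "DMAR:[DMA ...]" fault request lines were logged
dmar_status() {
    log=$(cat)
    if printf '%s\n' "$log" | grep -qF 'PCI-DMA: Using DMAR IOMMU'; then
        echo "remapping: enabled"
    else
        echo "remapping: banner not found"
    fi
    # grep -c exits non-zero when the count is 0, hence the "|| true"
    faults=$(printf '%s\n' "$log" | grep -cF 'DMAR:[DMA ' || true)
    echo "faults: $faults"
}
```

Typical use would be `dmesg | dmar_status`. A missing banner together with zero faults is the case where a stray DMA write would go unnoticed.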

Stanislaw

2011-11-20 02:20:14

by Wey-Yi Guy

Subject: Re: iwlagn: memory corruption with WPA enterprise

On Sat, 2011-11-19 at 10:11 -0800, Tomáš Janoušek wrote:
> Hello,
>
> On Mon, Nov 14, 2011 at 03:07:15PM +0100, Stanislaw Gruszka wrote:
> > On Fri, Nov 11, 2011 at 04:01:05PM +0100, Tomáš Janoušek wrote:
> > > Could you please elaborate on that thing with enabling IOMMU? The only thing I
> > > know about IOMMU is that it is somehow related to VT-d (passing whole PCI
> > > devices to virtual guests), and that I have to pass intel_iommu=off to kernel
> > > command line, otherwise the machine doesn't even boot. Is that a problem?
> > Yes. That mean iommu software or hardware is broken on your system.
> >
> > I have no other ideas how to track this down. I think now, this is
> > a firwmare issue. BTW, you suspected that from very beginning :-)
> > This could be also a driver issue, but AFAICT programing DMA do not
> > differ on 6205 from other devices, so bug in firmware is much more
> > probable reason of corruption.
>
> I have some news. I got IOMMU to work, because I identified the problem [1]
> and disabled firewire-ohci for the time being completely, but I'm not sure
> what do I need to do to make it catch the problem. I assumed that all I need
> is to intel_iommu=on and then all devices do DMA stuff in isolation, but I can
> still reproduce the issue without the smallest hint of an error in dmesg. Does
> it tell us anything, or shall I enable some more debugging stuff?
>
> [1] http://thread.gmane.org/gmane.linux.kernel.pci/8765/focus=1217800
>
> Anyway, I didn't get to trying 2.6.38/39 yet, but I will do that soon. It is
> also safe to say now that x86_64 is completely unaffected, as I was running
> various 64bit kernels the whole week without a single failure.
>


Hmm, I don't have any system with IOMMU support; I guess it is time to
get one :-)

Wey




2011-11-20 20:40:11

by Tomas Janousek

Subject: Re: iwlagn: memory corruption with WPA enterprise

Hi,

On Sat, Nov 19, 2011 at 08:28:34PM -0800, wwguy wrote:
> On Sat, 2011-11-19 at 19:20 -0800, Tomáš Janoušek wrote:
> > Anyway, did you have time to try to reproduce it with the .config I sent?
> > (It may take hours, sometimes. :-/)
>
> Yes, I will try your configuration when I get back to the office Monday

Okay, thank you. I managed to test 2.6.38.8 and 2.6.39.4 today and the issue
is reproducible on both. Shall I test further in the past?

I also got a reply from the Ubuntu bugreport where someone claims he doesn't
experience the issue on current 32-bit Ubuntu 11.10, even though he did
experience it in older versions. I have no idea why that may be so, as they
have the same firmware as I have, and the kernel is 3.0 with almost no
modifications to the iwlagn driver.

I hope that you'll be able to reproduce the issue and fix it. Sadly, I still
haven't got a better reproducer. What I was doing today was running make clean
&& make -j4 in a while loop, having an active Skype videocall and browsing the
web while waiting for it to crash. Sometimes it took a few minutes, sometimes
more than an hour.
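
The build loop described above can be written so that it stops at the first failure and reports how many runs it took. This is a minimal sketch; repeat_until_fail is a hypothetical helper name, and it treats any non-zero exit status as a failure, which includes ordinary build errors as well as the compiler crashes seen here.

```shell
# repeat_until_fail MAX CMD...
# Hypothetical helper: run CMD up to MAX times (MAX=0 means no limit),
# stop at the first non-zero exit status and print how many runs succeeded.
repeat_until_fail() {
    max="$1"
    shift
    runs=0
    while [ "$max" -eq 0 ] || [ "$runs" -lt "$max" ]; do
        if ! "$@"; then
            echo "failed after $runs successful runs"
            return 1
        fi
        runs=$((runs + 1))
    done
    echo "no failure in $runs runs"
}
```

For the kernel tree this would be `repeat_until_fail 0 sh -c 'make clean && make -j4'`, run while the network is kept busy.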

Regards,
--
Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/

2011-11-20 04:34:55

by Wey-Yi Guy

Subject: Re: iwlagn: memory corruption with WPA enterprise

On Sat, 2011-11-19 at 19:20 -0800, Tomáš Janoušek wrote:
> Hi,
>
> On Sat, Nov 19, 2011 at 06:13:53PM -0800, wwguy wrote:
> > hmm, I don't have any IOMMU supported system, I guess it is the time to
> > get one :-)
>
> Well, the bug manifests itself regardless of whether IOMMU is used or not...
>
> Anyway, did you have time to try to reproduce it with the .config I sent?
> (It may take hours, sometimes. :-/)
>
Yes, I will try your configuration when I get back to the office Monday

Thanks

Wey



2011-11-11 05:43:30

by Stanislaw Gruszka

Subject: Re: iwlagn: memory corruption with WPA enterprise

On Thu, Nov 10, 2011 at 11:31:45AM -0800, Adrian Chadd wrote:
> .. are you sure it's a software use-after-free?
I'm quite sure now this is not the problem here ...

> What about "NIC DMA'ing stuff into completely incorrect space" after free? :-)
> (Or a firmware/NIC bug where it scribbles to random memory at times..)
That seems to be the cause of the corruption, since CONFIG_DEBUG_PAGEALLOC
does not catch it. I'm not sure how to debug such issues; maybe enabling the
IOMMU will help? Other than trying the IOMMU, it would be good to check
whether the problem also happens on 64-bit kernels (CONFIG_IA32_EMULATION
allows using a 64-bit kernel with 32-bit user space), and to enable
CONFIG_DMA_API_DEBUG to see if there are any mistakes in programming DMA.

Stanislaw

2011-11-09 16:51:19

by Stanislaw Gruszka

Subject: Re: iwlagn: memory corruption with WPA enterprise

Hi

On Wed, Nov 09, 2011 at 04:54:11PM +0100, Tomáš Janoušek wrote:
> On Mon, Oct 31, 2011 at 05:03:43PM +0100, Stanislaw Gruszka wrote:
> > You may try debugging patches I posted a while ago:
> > http://marc.info/?l=linux-mm&m=131914560820378&w=2
> > http://marc.info/?l=linux-mm&m=131914560820293&w=2
> > http://marc.info/?l=linux-mm&m=131914560820317&w=2
> >
> > With a bit of luck, kernel should panic and dump call-trace when
> > bad code start to write at memory addresses where is not suppose
> > to.
>
> Thanks for your suggestions. I did as you told me, applied those 3 patches on
> top of 3.1 + net-next (the one from 29 Oct 2011), enabled all those things in
> config and passed corrupt_dbg=1 on cmdline, but the problem happens without
> anything being written to dmesg.

I just discovered that CONFIG_DEBUG_PAGEALLOC does not work as expected.
It leaves most free pages unprotected, so unintentional writes to
them are not detected. I'm attaching an additional patch, which should
make detection actually work.

If the kernel does not boot with corrupt_dbg=1, you may try to catch the
corruption without that option. The attached patch should make that possible,
although having corrupt_dbg=1 increases the probability of a catch.

Thanks
Stanislaw


Attachments:
0001-mm-remove-debug_pagealloc_enabled.patch (3.15 kB)

2011-11-20 03:20:20

by Tomas Janousek

Subject: Re: iwlagn: memory corruption with WPA enterprise

Hi,

On Sat, Nov 19, 2011 at 06:13:53PM -0800, wwguy wrote:
> hmm, I don't have any IOMMU supported system, I guess it is the time to
> get one :-)

Well, the bug manifests itself regardless of whether IOMMU is used or not...

Anyway, did you have time to try to reproduce it with the .config I sent?
(It may take hours, sometimes. :-/)

Regards,
--
Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/

2011-11-10 17:30:47

by Larry Finger

Subject: Re: iwlagn: memory corruption with WPA enterprise

On 11/10/2011 10:42 AM, Tomáš Janoušek wrote:
> (Don't be scared by the vboxdrv messages. Those modules don't compile with
> 3.2-rc1, and the bug happens there as well.)

The following patch will fix the 3.2-rc1 build problem.

Index: vboxhost/vboxpci/linux/VBoxPci-linux.c
===================================================================
--- vboxhost.orig/vboxpci/linux/VBoxPci-linux.c
+++ vboxhost/vboxpci/linux/VBoxPci-linux.c
@@ -146,7 +146,11 @@ static int __init VBoxPciLinuxInit(void)
 #endif
 
 #ifdef VBOX_WITH_IOMMU
+# if LINUX_VERSION_CODE >= KERNEL_VERSION(3, 2, 0)
+    if (iommu_present(&pci_bus_type))
+#else
     if (iommu_found())
+#endif
         printk(KERN_INFO "vboxpci: IOMMU found\n");
     else
         printk(KERN_INFO "vboxpci: IOMMU not found (not registered)\n");
@@ -984,9 +988,15 @@ int vboxPciOsInitVm(PVBOXRAWPCIDRVVM pT
     printk(KERN_DEBUG "vboxPciOsInitVm: %p\n", pThis);
 #endif
 #ifdef VBOX_WITH_IOMMU
+# if LINUX_VERSION_CODE >= KERNEL_VERSION(3, 2, 0)
+    if (iommu_present(&pci_bus_type))
+    {
+        pThis->pIommuDomain = iommu_domain_alloc(&pci_bus_type);
+#else
     if (iommu_found())
     {
         pThis->pIommuDomain = iommu_domain_alloc();
+#endif
         if (!pThis->pIommuDomain)
         {
             printk(KERN_DEBUG "cannot allocate IOMMU domain\n");



2011-11-14 14:05:57

by Stanislaw Gruszka

Subject: Re: iwlagn: memory corruption with WPA enterprise

Hi,

On Fri, Nov 11, 2011 at 04:01:05PM +0100, Tomáš Janoušek wrote:
> Could you please elaborate on that thing with enabling IOMMU? The only thing I
> know about IOMMU is that it is somehow related to VT-d (passing whole PCI
> devices to virtual guests), and that I have to pass intel_iommu=off to kernel
> command line, otherwise the machine doesn't even boot. Is that a problem?
Yes. That means the IOMMU software or hardware is broken on your system.

I have no other ideas how to track this down. I now think this is
a firmware issue. BTW, you suspected that from the very beginning :-)
This could also be a driver issue, but AFAICT programming DMA does not
differ on the 6205 from other devices, so a bug in firmware is a much more
probable cause of the corruption.

Stanislaw

2011-11-19 18:11:08

by Tomas Janousek

Subject: Re: iwlagn: memory corruption with WPA enterprise

Hello,

On Mon, Nov 14, 2011 at 03:07:15PM +0100, Stanislaw Gruszka wrote:
> On Fri, Nov 11, 2011 at 04:01:05PM +0100, Tomáš Janoušek wrote:
> > Could you please elaborate on that thing with enabling IOMMU? The only thing I
> > know about IOMMU is that it is somehow related to VT-d (passing whole PCI
> > devices to virtual guests), and that I have to pass intel_iommu=off to kernel
> > command line, otherwise the machine doesn't even boot. Is that a problem?
> Yes. That mean iommu software or hardware is broken on your system.
>
> I have no other ideas how to track this down. I think now, this is
> a firwmare issue. BTW, you suspected that from very beginning :-)
> This could be also a driver issue, but AFAICT programing DMA do not
> differ on 6205 from other devices, so bug in firmware is much more
> probable reason of corruption.

I have some news. I got IOMMU to work, because I identified the problem [1]
and disabled firewire-ohci completely for the time being, but I'm not sure
what I need to do to make it catch the problem. I assumed that all I need
is to pass intel_iommu=on and then all devices do DMA stuff in isolation, but
I can still reproduce the issue without the smallest hint of an error in
dmesg. Does it tell us anything, or shall I enable some more debugging stuff?

[1] http://thread.gmane.org/gmane.linux.kernel.pci/8765/focus=1217800

Anyway, I didn't get to trying 2.6.38/39 yet, but I will do that soon. It is
also safe to say now that x86_64 is completely unaffected, as I was running
various 64bit kernels the whole week without a single failure.

Regards,
--
Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/

2011-11-10 16:30:55

by Tomas Janousek

Subject: Re: iwlagn: memory corruption with WPA enterprise

Hello,

On Thu, Nov 10, 2011 at 05:07:04PM +0100, Stanislaw Gruszka wrote:
> > (We've got a lot of Lenovos here, mostly T520 and T420s, most of them running
> > Fedora, and nobody has reported memory corruption problems.
>
> Are there any others with 6205? If not that would confirm issue is
> related with that model.

I think all those T520 and T420s models have exactly the same wireless card. I
don't see why they wouldn't have. :-)

> You may first try some older kernel as Wey suggested, i.e. 2.6.38.

Okay, I'll try that as soon as I get some more free time.

Thanks for the suggestions,
--
Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/

2011-11-11 15:01:09

by Tomas Janousek

Subject: Re: iwlagn: memory corruption with WPA enterprise

Hello,

On Fri, Nov 11, 2011 at 06:47:31AM +0100, Stanislaw Gruszka wrote:
> On Thu, Nov 10, 2011 at 05:30:51PM +0100, Tomáš Janoušek wrote:
> > I think all those T520 and T420s models have exactly the same wireless card. I
> > don't see why they wouldn't have. :-)
>
> Well, laptop models have common mainboards, but peripheral components
> might not be identical. The wireless card especially might vary from unit
> to unit. For example:
>
> T520 with iwl6300:
> http://smolt.fedoraproject.org/client/show/pub_0cbc8731-47d3-4ecf-8add-aa832b34edcd
> T520 with iwl1000:
> http://smolt.fedoraproject.org/client/show/pub_02f601d6-7a93-4c0e-bc01-a4a95db36086
> T520 with iwl6205:
> http://smolt.fedoraproject.org/client/show/pub_0232252f-75a9-4050-a69b-c8e95e00eaf2

These links don't work, but my colleagues indeed do have iwl6205. They are,
however, running 64-bit kernels, so I tried doing the same and I haven't been
able to reproduce the issue for the past few hours. So it seems like it
happens on 32-bit kernels only. This looks like a good workaround (better than
getting a different wireless card), so I can run 64-bit during working days
and experiment with 32-bit when I have time to debug this.

Could you please elaborate on that thing with enabling IOMMU? The only thing I
know about IOMMU is that it is somehow related to VT-d (passing whole PCI
devices to virtual guests), and that I have to pass intel_iommu=off on the
kernel command line, otherwise the machine doesn't even boot. Is that a problem?

Kind regards,
--
Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/

2011-11-10 12:53:51

by Tomas Janousek

Subject: Re: iwlagn: memory corruption with WPA enterprise

Hello Stanislaw,

On Thu, Nov 10, 2011 at 12:47:33PM +0100, Stanislaw Gruszka wrote:
> If "dmesg | grep corrupt" shows "Setting corrupt debug order to 1", the
> patches are in use. Anyway, I need to test the patches locally to see
> if they work as expected; perhaps an exception is generated but the
> call trace is not printed.

It does say that, yes.

> Does this happen only with "Intel Corporation Centrino Advanced-N 6205" or
> with some other adapters too?

I don't have any other iwlwifi adapters, so I wouldn't know. The people in
that Ubuntu bug report have exactly that card as well, but in another notebook.
And they claim it works in newer Ubuntu, but I am running the latest kernels
with the latest uCode, so I'm out of ideas as to what else could be wrong.

> > Perhaps it would be cheaper to just get another card in that case.
> > :-)
>
> That would leave the issue unresolved :-(

Yeah, but considering how few people report this, I'm starting to feel that it
might in fact be a hardware issue. It is unfortunate that the wifi adapter in
this Lenovo model isn't placed in that part of the notebook on the back where
it could be easily replaced. If that was the case, I could easily swap it with
my colleagues without voiding the warranty.

(We've got a lot of Lenovos here, mostly T520 and T420s, most of them running
Fedora, and nobody has reported memory corruption problems. Mine is T420.
Perhaps I should try to connect to this WPA Enterprise using Windows and see
if anything goes wrong. However, I have no clue as to what I should do to
reproduce the issue in Windows.)

Regards,
--
Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/

2011-11-10 11:46:52

by Stanislaw Gruszka

Subject: Re: iwlagn: memory corruption with WPA enterprise

Hello Tomáš

On Thu, Nov 10, 2011 at 10:18:16AM +0100, Tomáš Janoušek wrote:
> On Wed, Nov 09, 2011 at 05:51:59PM +0100, Stanislaw Gruszka wrote:
> > I just discovered that CONFIG_DEBUG_PAGEALLOC does not work as expected.
> > It leaves most free pages unprotected, hence unintentional writes to
> > them are not discovered. I'm attaching an additional patch, which should
> > make detection actually work.
> >
> > If the kernel does not boot with corrupt_dbg=1, you may try to catch the
> > corruption without that option. The attached patch should make it possible;
> > however, having corrupt_dbg=1 increases the probability of a catch.
>
> Okay, I applied this additional patch, and by the increased memory usage (as
> shown by free) I concluded that it indeed works. However, I was still able to
> reproduce the issue without a single error being written to dmesg. :-(
If "dmesg | grep corrupt" shows "Setting corrupt debug order to 1", the
patches are in use. Anyway, I need to test the patches locally to see
if they work as expected; perhaps an exception is generated but the
call trace is not printed.

Does this happen only with "Intel Corporation Centrino Advanced-N 6205" or
with some other adapters too?

> I will try some really old kernels as Wey suggested to see whether it makes
> any sense to bisect it, but if it does, it might take more time than I can
> make free.
Yes, bisection is very time-consuming, especially when the problem is
not easy to reproduce.

> Perhaps it would be cheaper to just get another card in that case.
> :-)
That would leave the issue unresolved :-(

Stanislaw

2011-11-10 19:31:46

by Adrian Chadd

Subject: Re: iwlagn: memory corruption with WPA enterprise

.. are you sure it's a software use-after-free?
What about "NIC DMA'ing stuff into completely incorrect space" after free? :-)
(Or a firmware/NIC bug where it scribbles to random memory at times..)


Adrian

2011-11-09 15:54:15

by Tomas Janousek

Subject: Re: iwlagn: memory corruption with WPA enterprise

Hello Stanislaw,

On Mon, Oct 31, 2011 at 05:03:43PM +0100, Stanislaw Gruszka wrote:
> You may try debugging patches I posted a while ago:
> http://marc.info/?l=linux-mm&m=131914560820378&w=2
> http://marc.info/?l=linux-mm&m=131914560820293&w=2
> http://marc.info/?l=linux-mm&m=131914560820317&w=2
>
> With a bit of luck, the kernel should panic and dump a call trace when
> bad code starts to write to memory addresses where it is not supposed
> to.

Thanks for your suggestions. I did as you told me, applied those 3 patches on
top of 3.1 + net-next (the one from 29 Oct 2011), enabled all those things in
config and passed corrupt_dbg=1 on cmdline, but the problem happens without
anything being written to dmesg.

Am I just lacking a bit of luck, or could it mean something (like that the
error is in hardware, the microcode, or something like that)?

Regards,
--
Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/

2012-02-13 13:29:41

by Tomas Janousek

Subject: Re: iwlagn: memory corruption with WPA enterprise

Hi,

On Mon, Feb 13, 2012 at 02:09:04PM +0100, Stanislaw Gruszka wrote:
> I also found this bug report
> https://bugzilla.kernel.org/show_bug.cgi?id=37742
> where one user reports iwlwifi corruption caught by the IOMMU.

I wasn't able to catch anything using IOMMU, and I also wasn't able to
reproduce the issue using any userspace memory checking tool. Hence I tend to
believe we're not dealing with a memory corruption at all, perhaps something
like certain CPU flags/registers not being correctly saved/restored during
wlan interrupts or something. When I have a sufficient amount of free time,
I'll try to check this hypothesis and perhaps pinpoint the instruction the
result of which is corrupted during wlan operation.
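
A crude userspace check of this hypothesis might look like the sketch below:
hash the same buffer over and over while the wlan is busy and count how often
the result differs from the first one. It assumes that Python's hashlib ends
up in the same kind of optimized code path that misbehaves, which depends on
how Python and OpenSSL were built; the function name and parameters are
illustrative only.

```python
import hashlib


def count_digest_mismatches(rounds=1000, size=1 << 16):
    """Hash a fixed buffer `rounds` times; count digests that differ from the first."""
    # Deterministic test pattern, `size` bytes long (size is a multiple of 256).
    data = bytes(range(256)) * (size // 256)
    # Caveat: the reference is computed on the same machine, so it could
    # itself be wrong; a mismatch only shows that results are inconsistent.
    reference = hashlib.sha256(data).hexdigest()
    mismatches = 0
    for _ in range(rounds):
        if hashlib.sha256(data).hexdigest() != reference:
            mismatches += 1
    return mismatches


print("mismatches:", count_digest_mismatches(rounds=200))
```

On a healthy machine the count should stay at zero; any other value would
point at computation results changing between runs rather than at data being
damaged on the wire.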

> Tomáš, I do not remember, do you have the same problems on
> older kernels, i.e. < 3.0?

Yeah. I was able to reproduce it with 2.6.38.8 and 2.6.39.4 at least.

Regards,
--
Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/

2012-02-14 09:20:59

by Stanislaw Gruszka

Subject: Re: iwlagn: memory corruption with WPA enterprise

On Fri, Feb 10, 2012 at 07:09:29PM +0100, Tomáš Janoušek wrote:
> For the last few months, I've happily used a 64-bit kernel and have had no
> problems whatsoever. About a week ago, I started using virtual machines in
> KVM. And today I found that I have exactly the same problem, but only _inside_
> the virtual machine. I can't reliably scp a file from the internet to my
> virtual machine. It works fine when I scp to the host, it works fine when I'm
> on a WPA-PSK network. And it happens even if I tell kvm to emulate e1000, not
> only with virtio-net. How strange is that?
>
> And while this is happening, the host is running just fine. The host has a
> 64-bit kernel with a 32-bit userspace, so if something was wrong with the
> 32-bit mode of my processor, it would've appeared on the host as well, no?
>
> It's also worth mentioning that if I build openssl with "no-asm 386", scp
> works just fine.
Good hint.

> So it doesn't look like a memory corruption after all. It
> seems as if certain CPU instructions didn't work properly if running on a
> 32-bit kernel with a WiFi adapter doing something. But how can it be
> that those same CPU instructions work on a 64-bit host with 32-bit userspace?
> At the same time! That's just completely insane, and I can't think of an
> explanation. Shall I get a new CPU perhaps? :-)
>
>
> Please, give me any ideas that you might have.

That makes sense! Your "CPU instructions break things" theory sounds crazy,
but I think it's logical. WPA enterprise differs from WPA-PSK (pre-shared
key) in that the key changes periodically, and SSL is used when keys are
changed (via wpa_supplicant). So it looks like 32-bit openssl generates object
code that triggers a bug in the CPU, which crashes other processes.

Please forward details about this issue to [email protected] and to the proper
vendor engineer in a non-public manner, as this hardware bug could possibly be
exploitable (a hardware bug cannot be fixed, but the kernel could disable the
affected functionality or use some other workaround).

Thanks
Stanislaw

2012-02-10 18:27:14

by Tomas Janousek

Subject: Re: iwlagn: memory corruption with WPA enterprise

Hi guys,

On Sun, Nov 20, 2011 at 09:40:07PM +0100, Tomáš Janoušek wrote:
> > Yes, I will try your configuration when I get back to the office Monday

Did you have any luck? I just found out something which is almost completely
insane.

For the last few months, I've happily used a 64-bit kernel and have had no
problems whatsoever. About a week ago, I started using virtual machines in
KVM. And today I found that I have exactly the same problem, but only _inside_
the virtual machine. I can't reliably scp a file from the internet to my
virtual machine. It works fine when I scp to the host, it works fine when I'm
on a WPA-PSK network. And it happens even if I tell kvm to emulate e1000, not
only with virtio-net. How strange is that?
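
One way to make "can't reliably scp" measurable is to compare checksums of
the source file and the copy, as in the sketch below. The helper names are
made up for this example, and the checksum of the original is assumed to be
computed on a machine that is known to be good.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original_path, copied_path):
    """True if both files have the same SHA-256 digest."""
    return sha256_of(original_path) == sha256_of(copied_path)
```

Calling copies_match() on the original and on the file fetched inside the
guest after every transfer gives a simple pass/fail count per network type.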

And while this is happening, the host is running just fine. The host has a
64-bit kernel with a 32-bit userspace, so if something was wrong with the
32-bit mode of my processor, it would've appeared on the host as well, no?

It's also worth mentioning that if I build openssl with "no-asm 386", scp
works just fine. So it doesn't look like a memory corruption after all. It
seems as if certain CPU instructions didn't work properly if running on a
32-bit kernel with a WiFi adapter doing something. But how can it be
that those same CPU instructions work on a 64-bit host with 32-bit userspace?
At the same time! That's just completely insane, and I can't think of an
explanation. Shall I get a new CPU perhaps? :-)

Please, give me any ideas that you might have.

Regards,
--
Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/

2012-02-13 13:09:11

by Stanislaw Gruszka

Subject: Re: iwlagn: memory corruption with WPA enterprise

On Mon, Feb 13, 2012 at 10:25:39AM +0100, Stanislaw Gruszka wrote:
> On Fri, Feb 10, 2012 at 07:09:29PM +0100, Tomáš Janoušek wrote:
> > On Sun, Nov 20, 2011 at 09:40:07PM +0100, Tomáš Janoušek wrote:
> > > > Yes, I will try your configuration when I get back to the office Monday
> >
> > Did you have any luck?
> I think I tried to reproduce that problem and failed, but honestly I do
> not remember right now ...
>
> > I just found out something which is almost completely
> > insane.
> >
> > For the last few months, I've happily used a 64-bit kernel and have had no
> > problems whatsoever. About a week ago, I started using virtual machines in
> > KVM. And today I found that I have exactly the same problem, but only _inside_
> > the virtual machine. I can't reliably scp a file from the internet to my
> > virtual machine. It works fine when I scp to the host, it works fine when I'm
> > on a WPA-PSK network. And it happens even if I tell kvm to emulate e1000, not
> > only with virtio-net. How strange is that?
> >
> > And while this is happening, the host is running just fine. The host has a
> > 64-bit kernel with a 32-bit userspace, so if something was wrong with the
> > 32-bit mode of my processor, it would've appeared on the host as well, no?
> >
> > It's also worth mentioning that if I build openssl with "no-asm 386", scp
> > works just fine. So it doesn't look like a memory corruption after all. It
> > seems as if certain CPU instructions didn't work properly if running on a
> > 32-bit kernel with a WiFi adapter doing something. But how can it be
> > that those same CPU instructions work on a 64-bit host with 32-bit userspace?
> > At the same time! That's just completely insane, and I can't think of an
> > explanation. Shall I get a new CPU perhaps? :-)
>
> There is currently a discussion about compiler problems that
> can result in corruption:
> http://lwn.net/Articles/478657/
> Perhaps this problem is something similar.
>
> Also, if you look at lspci -vt, does it show that the corruption happens
> only when a PCI bridge is used? (However, that would not explain why it
> only happens with WPA enterprise.)

I also found this bug report
https://bugzilla.kernel.org/show_bug.cgi?id=37742
where one user reports iwlwifi corruption caught by the IOMMU.

Tomáš, I do not remember, do you have the same problems on
older kernels, i.e. < 3.0?

Stanislaw

2012-02-13 09:25:50

by Stanislaw Gruszka

Subject: Re: iwlagn: memory corruption with WPA enterprise

On Fri, Feb 10, 2012 at 07:09:29PM +0100, Tomáš Janoušek wrote:
> On Sun, Nov 20, 2011 at 09:40:07PM +0100, Tomáš Janoušek wrote:
> > > Yes, I will try your configuration when I get back to the office Monday
>
> Did you have any luck?
I think I tried to reproduce that problem and failed, but honestly I do
not remember right now ...

> I just found out something which is almost completely
> insane.
>
> For the last few months, I've happily used a 64-bit kernel and have had no
> problems whatsoever. About a week ago, I started using virtual machines in
> KVM. And today I found that I have exactly the same problem, but only _inside_
> the virtual machine. I can't reliably scp a file from the internet to my
> virtual machine. It works fine when I scp to the host, it works fine when I'm
> on a WPA-PSK network. And it happens even if I tell kvm to emulate e1000, not
> only with virtio-net. How strange is that?
>
> And while this is happening, the host is running just fine. The host has a
> 64-bit kernel with a 32-bit userspace, so if something was wrong with the
> 32-bit mode of my processor, it would've appeared on the host as well, no?
>
> It's also worth mentioning that if I build openssl with "no-asm 386", scp
> works just fine. So it doesn't look like a memory corruption after all. It
> seems as if certain CPU instructions didn't work properly if running on a
> 32-bit kernel with a WiFi adapter doing something. But how can it be
> that those same CPU instructions work on a 64-bit host with 32-bit userspace?
> At the same time! That's just completely insane, and I can't think of an
> explanation. Shall I get a new CPU perhaps? :-)

There is currently a discussion about compiler problems that
can result in corruption:
http://lwn.net/Articles/478657/
Perhaps this problem is something similar.

Also, if you look at lspci -vt, does it show that the corruption happens
only when a PCI bridge is used? (However, that would not explain why it
only happens with WPA enterprise.)

Stanislaw

2012-03-05 15:12:01

by Stanislaw Gruszka

Subject: Re: iwlagn: memory corruption with WPA enterprise

Hello.

On Mon, Mar 05, 2012 at 04:00:31PM +0100, Tomáš Janoušek wrote:
> On Mon, Mar 05, 2012 at 03:57:01PM +0100, Stanislaw Gruszka wrote:
> > No problem. Can you remind me, is this reproducible on a 64-bit kernel
> > with 32-bit user space? I'm asking because I would like to know if we
> > need to backport those fixes to our kernel. We do not enable
> > CONFIG_CRYPTO_AES_NI_INTEL on the 32-bit kernel, only on 64-bit, but if this
> > problem happens with 32-bit userland on a 64-bit kernel, we will need to
> > do the backport.
>
> It happens in 32-bit KVM guests on a 64-bit host, so I guess you need it.

Perhaps the bug was triggered because the 32-bit KVM guest had
CONFIG_CRYPTO_AES_NI_INTEL enabled?

Stanislaw

2012-03-05 15:18:10

by Tomas Janousek

Subject: Re: iwlagn: memory corruption with WPA enterprise

Hello,

On Mon, Mar 05, 2012 at 04:11:41PM +0100, Stanislaw Gruszka wrote:
> Perhaps the bug was triggered because the 32-bit KVM guest had
> CONFIG_CRYPTO_AES_NI_INTEL enabled?

Nope. CONFIG_CRYPTO_AES_NI_INTEL is enabled only in the 64-bit host kernel.
Guest kernel version is 2.6.32-220.2.1.el6.i686 and there's no mention of
AES_NI_INTEL in its .config.
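
For comparing host and guest configurations, a small helper along these lines
could read an option out of the text of a kernel .config. It is only a
sketch: the function name and the sample text are hypothetical, and it treats
"# CONFIG_FOO is not set" the same as an option that is absent.

```python
def kernel_config_option(config_text, name):
    """Return the value of `name` (e.g. "y" or "m"), or None if unset or absent."""
    prefix = name + "="
    for line in config_text.splitlines():
        line = line.strip()
        # Lines such as "# CONFIG_FOO is not set" start with "#" and are skipped.
        if line.startswith(prefix):
            return line[len(prefix):]
    return None


# Hypothetical .config fragments, written by hand for this example.
host_config = "CONFIG_CRYPTO=y\nCONFIG_CRYPTO_AES_NI_INTEL=m\n"
guest_config = "CONFIG_CRYPTO=y\n"

print(kernel_config_option(host_config, "CONFIG_CRYPTO_AES_NI_INTEL"))
print(kernel_config_option(guest_config, "CONFIG_CRYPTO_AES_NI_INTEL"))
```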

(And I do run a 32-bit userspace, perhaps if I had 64-bit userspace, the guest
would be safe. Or maybe not. I have no idea and no time to get a 64-bit
userspace. :-))

--
Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/

2012-03-05 14:01:35

by Tomas Janousek

Subject: Re: iwlagn: memory corruption with WPA enterprise

Hi,

On Tue, Feb 14, 2012 at 10:20:21AM +0100, Stanislaw Gruszka wrote:
> > So it doesn't look like a memory corruption after all. It
> > seems as if certain CPU instructions didn't work properly if running on a
> > 32-bit kernel with a WiFi adapter doing something. But how can it be
> > that those same CPU instructions work on a 64-bit host with 32-bit userspace?
> > At the same time! That's just completely insane, and I can't think of an
> > explanation. Shall I get a new CPU perhaps? :-)
> >
> >
> > Please, give me any ideas that you might have.
>
> That makes sense! Your "CPU instructions break things" theory sounds crazy,
> but I think it's logical. WPA enterprise differs from WPA-PSK (pre-shared
> key) in that the key changes periodically, and SSL is used when keys are
> changed (via wpa_supplicant). So it looks like 32-bit openssl generates object
> code that triggers a bug in the CPU, which crashes other processes.

It seems that someone beat me to it. Since Linus fixed the FPU leaks in
3.3-rc4, I haven't experienced the problem. And I was this close! :-)

Anyway, thanks for assistance and sorry for being so slow to respond.

Regards,
--
Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/

2012-03-05 14:57:33

by Stanislaw Gruszka

Subject: Re: iwlagn: memory corruption with WPA enterprise

On Mon, Mar 05, 2012 at 03:01:30PM +0100, Tomáš Janoušek wrote:
> > That makes sense! Your "CPU instructions break things" theory sounds crazy,
> > but I think it's logical. WPA enterprise differs from WPA-PSK (pre-shared
> > key) in that the key changes periodically, and SSL is used when keys are
> > changed (via wpa_supplicant). So it looks like 32-bit openssl generates
> > object code that triggers a bug in the CPU, which crashes other processes.
>
> It seems that someone beat me to it. Since Linus fixed the FPU leaks in
> 3.3-rc4, I haven't experienced the problem. And I was this close! :-)

Yeah, that was a really nasty bug.

> Anyway, thanks for assistance and sorry for being so slow to respond.

No problem. Can you remind me, is this reproducible on a 64-bit kernel
with 32-bit user space? I'm asking because I would like to know if we
need to backport those fixes to our kernel. We do not enable
CONFIG_CRYPTO_AES_NI_INTEL on the 32-bit kernel, only on 64-bit, but if this
problem happens with 32-bit userland on a 64-bit kernel, we will need to
do the backport.

Thanks
Stanislaw

2012-03-05 15:00:34

by Tomas Janousek

Subject: Re: iwlagn: memory corruption with WPA enterprise

Hi,

On Mon, Mar 05, 2012 at 03:57:01PM +0100, Stanislaw Gruszka wrote:
> No problem. Can you remind me, is this reproducible on a 64-bit kernel
> with 32-bit user space? I'm asking because I would like to know if we
> need to backport those fixes to our kernel. We do not enable
> CONFIG_CRYPTO_AES_NI_INTEL on the 32-bit kernel, only on 64-bit, but if this
> problem happens with 32-bit userland on a 64-bit kernel, we will need to
> do the backport.

It happens in 32-bit KVM guests on a 64-bit host, so I guess you need it.

Regards,
--
Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/