2014-06-07 00:06:49

by Nikolay Amiantov

Subject: What can change in ways Linux handles memory when all memory >4G is disabled? (x86)

Hello all,

I'm trying to resolve a cryptic problem with the Lenovo T440p (and, as
it appears, the Dell XPS 15z) and nvidia in my spare time. You can
read more at [1]. Basically: when the user disables and then
re-enables the nvidia card (via ACPI, bbswitch or nouveau's dynpm) on
new BIOS versions, something goes really wrong. The user sees
filesystem, USB device and network controller faults of all kinds,
the system becomes unusable, and filesystem corruption can be
observed after reboot. The nvidia drivers (or nouveau, or i915) don't
even need to be loaded -- all that is needed to trigger the bug is to
call several ACPI methods to disable and re-enable the card (e.g.,
via the acpi_call module).
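
For reference, all that acpi_call does is expose ACPI method
invocation through /proc/acpi/call; a minimal sketch of how such a
call is issued (the method path below is only a placeholder -- the
real disable/enable sequences are machine-specific) looks like this:

/* Rough illustration of issuing an ACPI call through the acpi_call
 * module's /proc interface.  The method path is just a placeholder.
 */
#include <stdio.h>

int main(void)
{
    const char *method = "\\_SB.PCI0.PEG0.PEGP._OFF";  /* placeholder */
    char result[256] = "";
    FILE *f;

    /* Writing the method path to /proc/acpi/call executes it. */
    f = fopen("/proc/acpi/call", "w");
    if (!f) { perror("/proc/acpi/call"); return 1; }
    fputs(method, f);
    fclose(f);

    /* Reading the same file back returns the result of the last call. */
    f = fopen("/proc/acpi/call", "r");
    if (f && fgets(result, sizeof(result), f))
        printf("result: %s\n", result);
    if (f)
        fclose(f);
    return 0;
}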

I've attached a debugger to the Windows kernel to catch the ACPI calls
used for disabling and re-enabling the NVIDIA card -- they don't
really differ from what bbswitch and others use. Furthermore, the
differences between the ACPI DSDT tables in the 1.14 (last good) and
1.16 (first broken) BIOSes are minimal, and loading the 1.14 table
into a system running 1.16 does not help. But all the affected
devices use memory-mapped I/O, so my current theory is that memory is
somehow corrupted. There are also some changes in the lspci output
for nvidia [2].

I've played a bit with this theory in mind and found a very
interesting thing -- when I reserve all memory above 4G with the
"memmap" kernel option ("memmap=99G$0x100000000"), everything works!
Also, I've written a small utility that fills memory with zeros
through /dev/mem and then checks it. I've checked the reserved region
with it, and it appears that no memory in that region is corrupted at
all, which is even stranger. I suspect that when nvidia is enabled,
I/O-mapped memory regions somehow get corrupted, but only when the
upper memory is not reserved. Also, the memory map does not differ,
apart from the missing last big chunk of memory, with and without
"memmap", and it matches Windows, too. If I enable even a small chunk
of the "upper" memory (e.g., 0x270000000-0x280000000), the usual
crashes occur.
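
The checking utility is nothing fancy; a stripped-down sketch of the
idea (not the exact tool I used -- the window below is just an
example inside the reserved area) looks roughly like this:

/* Fill a physical memory window through /dev/mem with zeros, then
 * verify it.  Needs root and a kernel that allows /dev/mem access to
 * that range (i.e. no STRICT_DEVMEM in the way).
 */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const off_t  phys = 0x270000000ULL;    /* example window start */
    const size_t len  = 16 * 1024 * 1024;  /* 16M window */
    size_t bad = 0;

    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("/dev/mem"); return 1; }

    uint8_t *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                      fd, phys);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    memset(p, 0, len);                     /* fill with zeros... */

    /* ...toggle the card here, then check whether anything changed */
    for (size_t i = 0; i < len; i++)
        if (p[i] != 0)
            bad++;
    printf("%zu corrupted bytes in window at %#llx\n",
           bad, (unsigned long long)phys);

    munmap(p, len);
    close(fd);
    return 0;
}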

Long story short: how can the way Linux handles memory differ when
these "upper" memory regions are enabled?

P.S.: This is my first time posting to LKML, if I've done something
wrong, please tell!

[1]: https://github.com/Bumblebee-Project/bbswitch/issues/78
[2]: http://bpaste.net/show/350758/

--
Nikolay Amiantov.


2014-06-08 04:20:11

by Bjorn Helgaas

Subject: Re: What can change in ways Linux handles memory when all memory >4G is disabled? (x86)

[+cc linux-pci, linux-pm]

On Fri, Jun 6, 2014 at 6:06 PM, Nikolay Amiantov <[email protected]> wrote:
> Hello all,
>
> I'm trying to resolve a cryptic problem with the Lenovo T440p (and, as
> it appears, the Dell XPS 15z) and nvidia in my spare time. You can
> read more at [1]. Basically: when the user disables and then
> re-enables the nvidia card (via ACPI, bbswitch or nouveau's dynpm) on
> new BIOS versions, something goes really wrong. The user sees
> filesystem, USB device and network controller faults of all kinds,
> the system becomes unusable, and filesystem corruption can be
> observed after reboot. The nvidia drivers (or nouveau, or i915) don't
> even need to be loaded -- all that is needed to trigger the bug is to
> call several ACPI methods to disable and re-enable the card (e.g.,
> via the acpi_call module).

I don't know what ACPI methods you're calling, but (as I'm sure you
know) it's not guaranteed to be safe to call random methods because
they can make arbitrary changes to the system.

> I've attached a debugger to the Windows kernel to catch the ACPI calls
> used for disabling and re-enabling the NVIDIA card -- they don't
> really differ from what bbswitch and others use. Furthermore, the
> differences between the ACPI DSDT tables in the 1.14 (last good) and
> 1.16 (first broken) BIOSes are minimal, and loading the 1.14 table
> into a system running 1.16 does not help. But all the affected
> devices use memory-mapped I/O, so my current theory is that memory is
> somehow corrupted. There are also some changes in the lspci output
> for nvidia [2].

I skimmed through [1], but I'm not sure I understood everything.
Here's what I gleaned; please correct any mistaken impressions:

1) Suspend/resume is mentioned in [1], but the problem occurs even
without any suspend/resume.
2) The problem happens on a completely stock untainted upstream
kernel even with no nvidia, nouveau, or i915 drivers loaded.
3) Disabling the nvidia device (02:00.0) by executing an ACPI method
works fine, and the system works fine after the nvidia device is
disabled.
4) This ACPI method puts the nvidia device in D3cold state.
5) Problems start when enabling the nvidia device by executing
another ACPI method.

In the D3cold state, the PCI device is entirely powered off. After it
is re-enabled, e.g., by the ACPI method in 5) above, the device needs
to be completely re-initialized. Since you're executing the ACPI
method "by hand," outside the context of the Linux power management
system, there's nothing to re-initialize the device.

This by itself shouldn't be a problem; the device should power up with
its BARs zeroed out and disabled, bus mastering disabled, etc.

BUT the kernel doesn't know about these power changes you're making,
so some things will be broken. For example, while the nvidia device
is in D3cold, lspci will return garbage for that device. After it
returns to D0, lspci should work again, but now the state of the
device (BAR assignments, interrupts, etc.) is different from what
Linux thinks it is.

If a driver does anything with the device after it returns to D0, I
think things will break, because the PCI core already knows what
resources are assigned to the device, but the device forgot them when
it was powered off. So the PCI core would happily enable the device
but it will respond at the wrong addresses.
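
(Just to illustrate -- and this is only a sketch, assuming the device
is still visible as 0000:02:00.0 -- you can compare the resources the
kernel has cached with the BARs the device currently reports:)

/* Compare the kernel's cached view of a PCI device's resources
 * (sysfs "resource" file) with the BAR registers the device itself
 * reports in config space (sysfs "config" file).  Run as root.
 */
#include <stdint.h>
#include <stdio.h>

#define SYSFS "/sys/bus/pci/devices/0000:02:00.0"

int main(void)
{
    unsigned long long start, end, flags;
    uint32_t bar[6];
    int i;

    /* Kernel's view: one "start end flags" line per resource. */
    FILE *res = fopen(SYSFS "/resource", "r");
    if (!res) { perror("resource"); return 1; }
    for (i = 0; i < 6 && fscanf(res, "%llx %llx %llx",
                                &start, &end, &flags) == 3; i++)
        printf("kernel view, BAR%d: %#llx-%#llx\n", i, start, end);
    fclose(res);

    /* Device's view: the raw BAR registers at config offset 0x10. */
    FILE *cfg = fopen(SYSFS "/config", "rb");
    if (!cfg) { perror("config"); return 1; }
    fseek(cfg, 0x10, SEEK_SET);
    if (fread(bar, sizeof(bar[0]), 6, cfg) != 6) {
        fprintf(stderr, "short config read\n");
        return 1;
    }
    for (i = 0; i < 6; i++)
        printf("device view, BAR%d: %#x\n", i, bar[i]);
    fclose(cfg);
    return 0;
}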

But I think you said problems happen even without any driver for the
nvidia device, so there's probably more going on. This is a video
device, and I wouldn't be surprised if there's some legacy VGA
behavior that doesn't follow the usual PCI rules.

Can you:

1) Collect complete "lspci -vvxxx" output from the whole system, with
the nvidia card enabled.
2) Disable nvidia card.
3) Collect complete dmesg log.
4) Try "lspci -s02:00.0". I expect this to show garbage if the nvidia
card is powered off.
5) Enable nvidia card.
6) Try "lspci -vvxxx" again. You mentioned changes to devices other
than nvidia, which sounds suspicious.
7) Collect dmesg log again. I don't expect changes here, because the
kernel probably doesn't notice the power transition.

Bjorn

> I've played a bit with this theory in mind and found a very
> interesting thing -- when I reserve all memory above 4G with the
> "memmap" kernel option ("memmap=99G$0x100000000"), everything works!
> Also, I've written a small utility that fills memory with zeros
> through /dev/mem and then checks it. I've checked the reserved region
> with it, and it appears that no memory in that region is corrupted at
> all, which is even stranger. I suspect that when nvidia is enabled,
> I/O-mapped memory regions somehow get corrupted, but only when the
> upper memory is not reserved. Also, the memory map does not differ,
> apart from the missing last big chunk of memory, with and without
> "memmap", and it matches Windows, too. If I enable even a small chunk
> of the "upper" memory (e.g., 0x270000000-0x280000000), the usual
> crashes occur.
>
> Long story short: how can the way Linux handles memory differ when
> these "upper" memory regions are enabled?
>
> P.S.: This is my first time posting to LKML, if I've done something
> wrong, please tell!
>
> [1]: https://github.com/Bumblebee-Project/bbswitch/issues/78
> [2]: http://bpaste.net/show/350758/
>
> --
> Nikolay Amiantov.

2014-06-08 17:22:33

by Nikolay Amiantov

Subject: Re: What can change in ways Linux handles memory when all memory >4G is disabled? (x86)

On Sun, Jun 8, 2014 at 8:19 AM, Bjorn Helgaas <[email protected]> wrote:
> [+cc linux-pci, linux-pm]
>
>
> I don't know what ACPI methods you're calling, but (as I'm sure you
> know) it's not guaranteed to be safe to call random methods because
> they can make arbitrary changes to the system.

Yes, because of this I've tested this behaviour with bbswitch and
nouveau's runpm separately -- the problem persists either way,
without any changes.
>
>
> I skimmed through [1], but I'm not sure I understood everything.
> Here's what I gleaned; please correct any mistaken impressions:
>
> 1) Suspend/resume is mentioned in [1], but the problem occurs even
> without any suspend/resume.

Yes, that's correct -- suspend/resume was mentioned because a lot of
people observe this bug after the bbswitch module they are using
disables nvidia at boot and re-enables it on suspend (I can't
remember why it does this). When this happens, on resume the user
observes a black screen, broken filesystems and so on.

> 2) The problem happens on a completely stock untainted upstream
> kernel even with no nvidia, nouveau, or i915 drivers loaded.

It depends on what you call "stock" -- something in the kernel is
needed to trigger this behaviour, but I've tested it in a ramdisk
with only the acpi_call module loaded (which is non-stock, but only
allows arbitrary ACPI calls from userspace). The behaviour is the
same with nouveau+i915 (which can be called stock) and with bbswitch
(which can't be called so).
> 3) Disabling the nvidia device (02:00.0) by executing an ACPI method
> works fine, and the system works fine after the nvidia device is
> disabled.

Yes, the most popular "workaround" for this problem, given that you
don't care about nvidia and only want to lower power consumption, is
to use something like [1] (the commented lines are the calls as they
are made in Windows).

> 4) This ACPI method puts the nvidia device in D3cold state.

Right, as far as I understand it.

> 5) Problems start when enabling the nvidia device by executing
> another ACPI method.

Right again; you can see an example in [2].

>
> In the D3cold state, the PCI device is entirely powered off. After it
> is re-enabled, e.g., by the ACPI method in 5) above, the device needs
> to be completely re-initialized. Since you're executing the ACPI
> method "by hand," outside the context of the Linux power management
> system, there's nothing to re-initialize the device.
>
> This by itself shouldn't be a problem; the device should power up with
> its BARs zeroed out and disabled, bus mastering disabled, etc.
>
> BUT the kernel doesn't know about these power changes you're making,
> so some things will be broken. For example, while the nvidia device
> is in D3cold, lspci will return garbage for that device. After it
> returns to D0, lspci should work again, but now the state of the
> device (BAR assignments, interrupts, etc.) is different from what
> Linux thinks it is.
>
> If a driver does anything with the device after it returns to D0, I
> think things will break, because the PCI core already knows what
> resources are assigned to the device, but the device forgot them when
> it was powered off. So the PCI core would happily enable the device
> but it will respond at the wrong addresses.

Thanks for the explanations! I don't really know much about PCI or
the Linux PCI subsystem internals, only some general theory,
including memory I/O and power states. This doesn't, however, explain
why this bug is observable even with nouveau's proper dynpm or with
bbswitch. I've looked through the source of bbswitch [3], and, AFAIU,
it differs from the raw ACPI calls in these ways (a rough sketch in
kernel-API terms follows below the list):

1) It calls only the _DSM ACPI routine and then disables the device
by issuing the calls on lines 260-277 (it saves some state and puts
the device into D3, from what I can tell; maybe it will tell you
more).
2) It doesn't use ACPI at all for enabling the card; it only puts the
device back into D0, restores the state and sets something (lines
292-296).
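
In kernel-API terms, my (quite possibly wrong) reading of that
sequence is roughly the following -- only a sketch, not the actual
bbswitch code, and the ACPI _DSM side plus all error handling are
omitted:

#include <linux/pci.h>

/* Roughly what the disable path seems to amount to. */
static void gpu_disable(struct pci_dev *dev)
{
    pci_save_state(dev);            /* remember BARs, command reg, ... */
    pci_clear_master(dev);          /* stop bus mastering */
    pci_set_power_state(dev, PCI_D3hot);
    /* ...plus the ACPI calls that actually cut the power */
}

/* Roughly what the enable path seems to amount to: no ACPI at all,
 * just bring the device back to D0 and write the saved config back.
 */
static void gpu_enable(struct pci_dev *dev)
{
    pci_set_power_state(dev, PCI_D0);
    pci_restore_state(dev);
}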

>
> But I think you said problems happen even without any driver for the
> nvidia device, so there's probably more going on. This is a video
> device, and I wouldn't be surprised if there's some legacy VGA
> behavior that doesn't follow the usual PCI rules.
>
> Can you:
>
> 1) Collect complete "lspci -vvxxx" output from the whole system, with
> the nvidia card enabled.
> 2) Disable nvidia card.
> 3) Collect complete dmesg log.
> 4) Try "lspci -s02:00.0". I expect this to show garbage if the nvidia
> card is powered off.

From what I have understood, you wanted me to do this with raw ACPI
calls, not with the other methods, correct?

> 5) Enable nvidia card.
> 6) Try "lspci -vvxxx" again. You mentioned changes to devices other
> than nvidia, which sounds suspicious.
> 7) Collect dmesg log again. I don't expect changes here, because the
> kernel probably doesn't notice the power transition.

There are some problems with (5..7), because after nvidia is enabled
again the system goes berserk, with no way to capture any output
besides, maybe, taking a photo of the screen (which I've done). I can
do this with >4G of memory disabled, however, which (as I've said)
somehow puts everything in order -- so I have done it that way, too.
The dmesg log has no relevant changes.

Again, for clarity: testing has been done with a 3.14.5 kernel with
some patches from Arch (bugfixes not yet in stable), BFQ, the
acpi_call module [4] loaded, and the "memmap" option. I've used [1]
and [2] to disable and enable the card. This behaviour is
reproducible with the stock Arch kernel, linux-lts, and also with a
linux-next from a month ago (I don't have linux-next ready now, and I
need a bugfix for bcache -- otherwise dmesg is filled with
backtraces, which is why I haven't used other kernels for this).

Testing with >=4G of memory disabled:
1st lspci: http://bpaste.net/show/355530/
dmesg: http://bpaste.net/show/355531/
lspci -s: http://bpaste.net/show/355532/
2nd lspci: http://bpaste.net/show/355533/

Testing with >=4G of memory enabled:
1st lspci: http://bpaste.net/show/355613/
dmesg: http://bpaste.net/show/355619/
lspci -s: identical
2nd lspci: http://abbradar.net/abbradar/share/pub/nvidia-lspci/
The dmesg log is riddled with various subsystems' errors (mostly
iwlwifi, e1000e, ide and so on); I haven't taken photos because the
"less" binary became corrupted.

BTW: Thanks for the answer!

Nikolay Amiantov.

[1]: http://bpaste.net/show/355364/
[2]: http://bpaste.net/show/355365/
[3]: https://github.com/Bumblebee-Project/bbswitch/blob/master/bbswitch.c
[4]: https://github.com/mkottman/acpi_call/

>
> Bjorn
>

2014-06-08 17:53:21

by H. Peter Anvin

Subject: Re: What can change in ways Linux handles memory when all memory >4G is disabled? (x86)

On 06/06/2014 05:06 PM, Nikolay Amiantov wrote:
>
> I've played a bit with this theory in mind and found a very
> interesting thing -- when I reserve all memory above 4G with the
> "memmap" kernel option ("memmap=99G$0x100000000"), everything works!
> Also, I've written a small utility that fills memory with zeros
> through /dev/mem and then checks it. I've checked the reserved region
> with it, and it appears that no memory in that region is corrupted at
> all, which is even stranger. I suspect that when nvidia is enabled,
> I/O-mapped memory regions somehow get corrupted, but only when the
> upper memory is not reserved. Also, the memory map does not differ,
> apart from the missing last big chunk of memory, with and without
> "memmap", and it matches Windows, too. If I enable even a small chunk
> of the "upper" memory (e.g., 0x270000000-0x280000000), the usual
> crashes occur.
>
> Long story short: how can the way Linux handles memory differ when
> these "upper" memory regions are enabled?
>

This would point either to an IOMMU problem, or to a problem in the
driver where addresses somehow get truncated to 32 bits. Since this
is a graphics driver, it is extremely complex, and subtle problems
could be buried somewhere inside it. The fact that you can trigger it
without a driver would point to that kind of problem inside the
firmware.
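
As a trivial (and purely hypothetical) illustration of the truncation
case: if a bus address above 4G ever ends up in a 32-bit field, the
upper bits are silently dropped and the access lands in somebody
else's memory below 4G:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* An address above 4G, like the 0x270000000 range mentioned above. */
    uint64_t addr = 0x270000000ULL;

    /* Stored in a 32-bit field, the upper bits are lost... */
    uint32_t truncated = (uint32_t)addr;

    /* ...so DMA/MMIO aimed at addr would actually hit 0x70000000. */
    printf("original:  %#llx\n", (unsigned long long)addr);
    printf("truncated: %#x\n", truncated);
    return 0;
}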

-hpa

2014-06-08 18:06:57

by Nikolay Amiantov

[permalink] [raw]
Subject: Re: What can change in ways Linux handles memory when all memory >4G is disabled? (x86)

On Sun, Jun 8, 2014 at 9:53 PM, H. Peter Anvin <[email protected]> wrote:
>
> This would point either to an IOMMU problem, or to a problem in the
> driver where addresses somehow get truncated to 32 bits. Since this
> is a graphics driver, it is extremely complex, and subtle problems
> could be buried somewhere inside it. The fact that you can trigger it
> without a driver would point to that kind of problem inside the
> firmware.
>
> -hpa

My assumption is that in the new BIOSes (the old ones work for the
T440p, as I've said) there is some new way nvidia should be enabled,
and it's not related to ACPI (the calls are the same as in Windows).
There are no related changelog entries for the T440p [1] (between
1.14 and 1.16). Also, about address truncation: I thought so too, but
if I understand correctly, that would require some memory in the >=4G
region to be reserved for a driver's use, and this is not the case
(see [2] for /proc/iomem).
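
(For reference, a quick sketch of how one can list anything in
/proc/iomem that starts at or above 4G -- it needs root, otherwise
the addresses read back as zeros:)

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/iomem", "r");
    char line[256];

    if (!f) { perror("/proc/iomem"); return 1; }

    while (fgets(line, sizeof(line), f)) {
        unsigned long long start;

        /* lines look like "  start-end : description" */
        if (!strchr(line, '-') || sscanf(line, "%llx", &start) != 1)
            continue;
        if (start >= 0x100000000ULL)       /* 4G and above */
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}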

Nikolay Amiantov.

[1]: http://download.lenovo.com/ibmdl/pub/pc/pccbbs/mobiles/gluj13us.txt
[2]: http://bpaste.net/show/355712/