2011-02-23 23:30:26

by Jan Niehusmann

[permalink] [raw]
Subject: memory corruption when (un)plugging VGA cable

On a Thinkpad x61s, I noticed some memory corruption when
plugging/unplugging the external VGA connection.

Symptoms:
---------

4 bytes at the beginning of a page get overwritten by zeroes.
The address of the corruption varies when rebooting the machine, but
stays constant while it's running (so it's possible to repeatedly write
some data and then corrupt it again by plugging the cable).

Some example addresses I observed were:
0x998da000
0xb4e6a000
0xb4854000
0xb4843000
(locations in /dev/mem - this is physical memory, right?)

Environment:
------------

2.6.37.1, x86_64, Thinkpad x61s, Intel GM965 with integrated graphics,
6GB of RAM (this may be triggering the problem, as officially, only 4GB
are supported).

I first observed the issue in November 2010, running 2.6.36, and then
again in January, with 2.6.37. Obviously, it doesn't get triggered too
often in day-to-day use (or just isn't noticed), so I'm not sure since
when it actually happened.

How to trigger:
---------------

I first noticed the problem after suspend-to-ram. Later I noticed it's
also possible to trigger the issue by just plugging the VGA cable, so I
guess suspending/resuming only triggers it because it also dis/enables
the VGA output.

Today I spent some time to actively reproduce the issue, successfully.

To reliably detect the problem, I filled up a tmpfs mount with a big
file (3GB) with known content, which makes the corruption hit part of
that file very often. That way, I can reproduce & detect the corruption
within a few minutes (usually after 1-2 reboots).

When plugging the cable while displaying a text console, the backlight
of the internal display goes darker a few moments after actually plugging
the cable. That seems to be the moment when the corruption occurs.


I tried to find the cause by setting a dataw breakpoint with kdb.
The breakpoint works (verified by writing to the affected file), but
doesn't trigger when the corruption occurs. Not sure what that means.


Suggestions on what I could try to find the cause are very welcome!


2011-02-25 12:31:00

by Jan Niehusmann

[permalink] [raw]
Subject: [PATCH] intel-gtt: fix memory corruption with GM965 and >4GB RAM

On Thu, Feb 24, 2011 at 12:30:22AM +0100, Jan Niehusmann wrote to
[email protected]:
> On a Thinkpad x61s, I noticed some memory corruption when
> plugging/unplugging the external VGA connection.
>
> Symptoms:
> ---------
>
> 4 bytes at the beginning of a page get overwritten by zeroes.
> The address of the corruption varies when rebooting the machine, but
> stays constant while it's running (so it's possible to repeatedly write
> some data and then corrupt it again by plugging the cable).

Further investigation revealed that the corrupted address is
(dev_priv->status_page_dmah->busaddr & 0xffffffff), ie. the beginning of
the hardware status page of the i965 graphics card, cut to 32 bits.

So it seems that for some memory access, the hardware uses only 32 bit
addressing. If the hardware status page is located >4GB, this corrupts
unrelated memory.

The corruption was observed on a Thinkpad x61s, using the Mobile Intel
GM965 Express Chipset. The first four bytes of the wrong page are
overwritten with zeroes whenever the VGA cable gets plugged or unplugged.

This patch simply works around this issue by confining the dma memory
to 32 bits.

It's not known if other chipsets are affected as well.

---
drivers/char/agp/intel-gtt.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/char/agp/intel-gtt.c b/drivers/char/agp/intel-gtt.c
index 29ac6d4..f7977f2 100644
--- a/drivers/char/agp/intel-gtt.c
+++ b/drivers/char/agp/intel-gtt.c
@@ -1379,7 +1379,7 @@ static const struct intel_gtt_driver i965_gtt_driver = {
.setup = i9xx_setup,
.cleanup = i9xx_cleanup,
.write_entry = i965_write_entry,
- .dma_mask_size = 36,
+ .dma_mask_size = 32,
.check_flags = i830_check_flags,
.chipset_flush = i9xx_chipset_flush,
};
--
1.7.2.3

2011-02-25 20:23:12

by Chris Wilson

[permalink] [raw]
Subject: Re: [PATCH] intel-gtt: fix memory corruption with GM965 and >4GB RAM

On Fri, 25 Feb 2011 13:30:56 +0100, Jan Niehusmann <[email protected]> wrote:
> On Thu, Feb 24, 2011 at 12:30:22AM +0100, Jan Niehusmann wrote to
> [email protected]:
> > On a Thinkpad x61s, I noticed some memory corruption when
> > plugging/unplugging the external VGA connection.
> >
> > Symptoms:
> > ---------
> >
> > 4 bytes at the beginning of a page get overwritten by zeroes.
> > The address of the corruption varies when rebooting the machine, but
> > stays constant while it's running (so it's possible to repeatedly write
> > some data and then corrupt it again by plugging the cable).
>
> Further investigation revealed that the corrupted address is
> (dev_priv->status_page_dmah->busaddr & 0xffffffff), ie. the beginning of
> the hardware status page of the i965 graphics card, cut to 32 bits.

965GM explicitly supports 36bits of addressing in the PTE. The only
exception is that general state (part of the 3D engine) must be located in
the lower 4GiB.

Simply ignoring the upper 4bits is the wrong approach and means that the
PTE then point to random pages, and completely irrelevant to the physical
address used in the hardware status page address register.

I have been considering:

diff --git a/drivers/gpu/drm/i915/i915_dma.c
b/drivers/gpu/drm/i915/i915_dma.c
index ffa2196..268e448 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1896,6 +1896,8 @@ int i915_driver_load(struct drm_device *dev,
unsigned long
/* overlay on gen2 is broken and can't address above 1G */
if (IS_GEN2(dev))
dma_set_coherent_mask(&dev->pdev->dev, DMA_BIT_MASK(30));
+ if (IS_BRROADWATER(dev) || IS_CRESTLINE(dev))
+ dma_set_coherent_mask(&dev->pdev->dev, DMA_BIT_MASK(32));

mmio_bar = IS_GEN2(dev) ? 1 : 0;
dev_priv->regs = pci_iomap(dev->pdev, mmio_bar, 0);

to prevent hitting the erratum.

However your bug looks to be:

diff --git a/drivers/gpu/drm/i915/i915_dma.c
b/drivers/gpu/drm/i915/i915_dma.c
index ffa2196..3b80507 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -66,9 +66,9 @@ static int i915_init_phys_hws(struct drm_device *dev)

memset_io(ring->status_page.page_addr, 0, PAGE_SIZE);

- if (INTEL_INFO(dev)->gen >= 4)
- dev_priv->dma_status_page |= (dev_priv->dma_status_page >> 28) &
- 0xf0;
+ if (INTEL_INFO(dev)->gen >= 4) /* 36-bit addressing */
+ dev_priv->dma_status_page |=
+ (dev_priv->status_page_dmah->busaddr >> 28) & 0xf0;

I915_WRITE(HWS_PGA, dev_priv->dma_status_page);
DRM_DEBUG_DRIVER("Enabled hardware status page\n");

--
Chris Wilson, Intel Open Source Technology Centre

2011-02-25 21:16:50

by Jan Niehusmann

[permalink] [raw]
Subject: Re: [PATCH] intel-gtt: fix memory corruption with GM965 and >4GB RAM

Hi Chris,

On Fri, Feb 25, 2011 at 08:22:53PM +0000, Chris Wilson wrote:
> On Fri, 25 Feb 2011 13:30:56 +0100, Jan Niehusmann <[email protected]> wrote:
> > Further investigation revealed that the corrupted address is
> > (dev_priv->status_page_dmah->busaddr & 0xffffffff), ie. the beginning of
> > the hardware status page of the i965 graphics card, cut to 32 bits.
>
> 965GM explicitly supports 36bits of addressing in the PTE. The only
> exception is that general state (part of the 3D engine) must be located in
> the lower 4GiB.

I'm not claiming that 965GM doesn't do 36 bits. In fact I actually see
activity in /sys/kernel/debug/dri/64/i915_gem_hws, and everything seems
to be working well, when the status page is above 4GB - if one ignores
the tiny detail that the wrong memory location gets overwritten,
sometimes...

> Simply ignoring the upper 4bits is the wrong approach and means that the
> PTE then point to random pages, and completely irrelevant to the physical
> address used in the hardware status page address register.

Doesn't setting DMA_BIT_MASK(32) only change the region DMA memory is
allocated from? I made that change just to make sure one gets addresses
which are safe even if the chipset sometimes ignores address bit 32. The
only negative impact I could think of is the allocation may fail if no
appropriate memory is available. Am I wrong?

> I have been considering:

> + if (IS_BRROADWATER(dev) || IS_CRESTLINE(dev))
> + dma_set_coherent_mask(&dev->pdev->dev, DMA_BIT_MASK(32));

> to prevent hitting the erratum.

So is there a known erratum about these chips? I didn't find errata
documents online, but I only did a short google search and may have
missed them.

> However your bug looks to be:

> - if (INTEL_INFO(dev)->gen >= 4)
> - dev_priv->dma_status_page |= (dev_priv->dma_status_page >> 28) &
> - 0xf0;
> + if (INTEL_INFO(dev)->gen >= 4) /* 36-bit addressing */
> + dev_priv->dma_status_page |=
> + (dev_priv->status_page_dmah->busaddr >> 28) & 0xf0;

Don't think so. dev_priv->dma_status_page gets initialized to
dev_priv->status_page_dmah->busaddr a few lines above, and it's 64 bit,
so that diff doesn't change the result of the computation.

Jan

2011-02-25 22:18:39

by Chris Wilson

[permalink] [raw]
Subject: Re: [PATCH] intel-gtt: fix memory corruption with GM965 and >4GB RAM

On Fri, 25 Feb 2011 22:16:46 +0100, Jan Niehusmann <[email protected]> wrote:
> Hi Chris,
>
> On Fri, Feb 25, 2011 at 08:22:53PM +0000, Chris Wilson wrote:
> > On Fri, 25 Feb 2011 13:30:56 +0100, Jan Niehusmann <[email protected]> wrote:
> > > Further investigation revealed that the corrupted address is
> > > (dev_priv->status_page_dmah->busaddr & 0xffffffff), ie. the beginning of
> > > the hardware status page of the i965 graphics card, cut to 32 bits.
> >
> > 965GM explicitly supports 36bits of addressing in the PTE. The only
> > exception is that general state (part of the 3D engine) must be located in
> > the lower 4GiB.
>
> I'm not claiming that 965GM doesn't do 36 bits. In fact I actually see
> activity in /sys/kernel/debug/dri/64/i915_gem_hws, and everything seems
> to be working well, when the status page is above 4GB - if one ignores
> the tiny detail that the wrong memory location gets overwritten,
> sometimes...
>
> > Simply ignoring the upper 4bits is the wrong approach and means that the
> > PTE then point to random pages, and completely irrelevant to the physical
> > address used in the hardware status page address register.
>
> Doesn't setting DMA_BIT_MASK(32) only change the region DMA memory is
> allocated from? I made that change just to make sure one gets addresses
> which are safe even if the chipset sometimes ignores address bit 32. The
> only negative impact I could think of is the allocation may fail if no
> appropriate memory is available. Am I wrong?

I just thought Daniel had wired up the dma_mask_size differently and
didn't realise it also did pci_set_dma_mask on the same pci dev. So
our patches were morally equivalent. ;-)

>
> > I have been considering:
>
> > + if (IS_BRROADWATER(dev) || IS_CRESTLINE(dev))
> > + dma_set_coherent_mask(&dev->pdev->dev, DMA_BIT_MASK(32));
>
> > to prevent hitting the erratum.
>
> So is there a known erratum about these chips? I didn't find errata
> documents online, but I only did a short google search and may have
> missed them.

Hah. I wish our documentation were that organised. If you grab the PRM
for gen4 from intellinuxgraphics.org, you should find it mentioned in the
state descriptions for the 3D pipeline.

> > However your bug looks to be:
>
> > - if (INTEL_INFO(dev)->gen >= 4)
> > - dev_priv->dma_status_page |= (dev_priv->dma_status_page >> 28) &
> > - 0xf0;
> > + if (INTEL_INFO(dev)->gen >= 4) /* 36-bit addressing */
> > + dev_priv->dma_status_page |=
> > + (dev_priv->status_page_dmah->busaddr >> 28) & 0xf0;
>
> Don't think so. dev_priv->dma_status_page gets initialized to
> dev_priv->status_page_dmah->busaddr a few lines above, and it's 64 bit,
> so that diff doesn't change the result of the computation.

And here I was working on the assumption that the code to program a 32bit
register would indeed create a 32bit value.

So, I'm happy to use your patch to workaround the known erratum. I just
wish I had an explanation as to what is actually causing the corruption.
What I want to make sure is that we don't paper over a real bug by
thinking it is yet another silicon issue.
-Chris

--
Chris Wilson, Intel Open Source Technology Centre

2011-02-25 23:05:48

by Daniel Vetter

[permalink] [raw]
Subject: Re: [Intel-gfx] [PATCH] intel-gtt: fix memory corruption with GM965 and >4GB RAM

On Fri, Feb 25, 2011 at 10:18:16PM +0000, Chris Wilson wrote:
> So, I'm happy to use your patch to workaround the known erratum. I just
> wish I had an explanation as to what is actually causing the corruption.
> What I want to make sure is that we don't paper over a real bug by
> thinking it is yet another silicon issue.

Actually, on style points I prefer your patch: The hw status page is
allocated with drm_pci_alloc which calls dma_alloc_coherent, so setting
the coherent mask is sufficient. The dma mask set in the gtt is
essentially useless, because we call get_user_pages on everything anyway
(in gem - iirc agp uses it). I just think it's confusing to limit the
general dma mask and continue to happily map pages above 4G.

Cheers, Daniel
--
Daniel Vetter
Mail: [email protected]
Mobile: +41 (0)79 365 57 48

2011-02-28 06:46:39

by Zhenyu Wang

[permalink] [raw]
Subject: Re: [Intel-gfx] [PATCH] intel-gtt: fix memory corruption with GM965 and >4GB RAM

On 2011.02.26 00:05:27 +0100, Daniel Vetter wrote:
> On Fri, Feb 25, 2011 at 10:18:16PM +0000, Chris Wilson wrote:
> > So, I'm happy to use your patch to workaround the known erratum. I just
> > wish I had an explanation as to what is actually causing the corruption.
> > What I want to make sure is that we don't paper over a real bug by
> > thinking it is yet another silicon issue.
>
> Actually, on style points I prefer your patch: The hw status page is
> allocated with drm_pci_alloc which calls dma_alloc_coherent, so setting
> the coherent mask is sufficient. The dma mask set in the gtt is
> essentially useless, because we call get_user_pages on everything anyway
> (in gem - iirc agp uses it). I just think it's confusing to limit the
> general dma mask and continue to happily map pages above 4G.
>

Think about IOMMU engine, we need to set dma_mask properly for
returned dma mapping address be limited in max range that can be
handled in GTT entry.

--
Open Source Technology Center, Intel ltd.

$gpg --keyserver wwwkeys.pgp.net --recv-keys 4D781827


Attachments:
(No filename) (1.04 kB)
signature.asc (198.00 B)
Digital signature
Download all attachments

2011-02-28 19:58:12

by Daniel Vetter

[permalink] [raw]
Subject: Re: [Intel-gfx] [PATCH] intel-gtt: fix memory corruption with GM965 and >4GB RAM

On Mon, Feb 28, 2011 at 02:46:35PM +0800, Zhenyu Wang wrote:
> > Actually, on style points I prefer your patch: The hw status page is
> > allocated with drm_pci_alloc which calls dma_alloc_coherent, so setting
> > the coherent mask is sufficient. The dma mask set in the gtt is
> > essentially useless, because we call get_user_pages on everything anyway
> > (in gem - iirc agp uses it). I just think it's confusing to limit the
> > general dma mask and continue to happily map pages above 4G.
> >
>
> Think about IOMMU engine, we need to set dma_mask properly for
> returned dma mapping address be limited in max range that can be
> handled in GTT entry.

If I understand it correctly, this is just about broadwater/crestline,
i.e. the original i965 series. Only the later eaglelake/cantiga (gm45 in
our codebase) was shipped with an iommu attached. So no problem there.
Anyway, my comment was just style nitpick, essentially to keep in line
with the existing work-around for broken overlay reg files on gen2.
-Daniel
--
Daniel Vetter
Mail: [email protected]
Mobile: +41 (0)79 365 57 48

2011-03-01 22:24:22

by Jan Niehusmann

[permalink] [raw]
Subject: [PATCH] drm/i915: fix memory corruption with GM965 and >4GB RAM

On Sat, Feb 26, 2011 at 12:05:27AM +0100, Daniel Vetter wrote:
> Actually, on style points I prefer your patch: The hw status page is
> allocated with drm_pci_alloc which calls dma_alloc_coherent, so setting
> the coherent mask is sufficient. The dma mask set in the gtt is
> essentially useless, because we call get_user_pages on everything anyway
> (in gem - iirc agp uses it). I just think it's confusing to limit the
> general dma mask and continue to happily map pages above 4G.

Indeed, setting dma_alloc_coherent should be sufficient, AFAIKT. After
all, only the HWS seems to be affected, which is allocated with
drm_pci_alloc, which in turn uses dma_alloc_coherent.

I just tried the patch below, it also works for me (as expected). Added
the comment because otherwise it wouldn't be obvious why the mask gets
set to 32 bit.

What I don't know is if really only Broadwater and Crestline chips are
affected. The tests were done with a Crestline one. But I think it's a
fair guess that the bug would have been noticed earlier if more recent
chips were affected, as >4GB RAM have become much more common since then.

Signed-off-by: Jan Niehusmann <[email protected]>

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 17bd766..1961580 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1895,6 +1895,12 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
if (IS_GEN2(dev))
dma_set_coherent_mask(&dev->pdev->dev, DMA_BIT_MASK(30));

+ /* 965GM sometimes incorrectly writes to hardware status page (HWS)
+ * using 32bit addressing, overwriting memory if HWS is located
+ * above 4GB. */
+ if (IS_BROADWATER(dev) || IS_CRESTLINE(dev))
+ dma_set_coherent_mask(&dev->pdev->dev, DMA_BIT_MASK(32));
+
mmio_bar = IS_GEN2(dev) ? 1 : 0;
dev_priv->regs = pci_iomap(dev->pdev, mmio_bar, 0);
if (!dev_priv->regs) {

2011-03-01 22:32:28

by Chris Wilson

[permalink] [raw]
Subject: Re: [PATCH] drm/i915: fix memory corruption with GM965 and >4GB RAM

On Tue, 1 Mar 2011 23:24:16 +0100, Jan Niehusmann <[email protected]> wrote:
> Indeed, setting dma_alloc_coherent should be sufficient, AFAIKT. After
> all, only the HWS seems to be affected, which is allocated with
> drm_pci_alloc, which in turn uses dma_alloc_coherent.
>
> I just tried the patch below, it also works for me (as expected). Added
> the comment because otherwise it wouldn't be obvious why the mask gets
> set to 32 bit.
>
> What I don't know is if really only Broadwater and Crestline chips are
> affected. The tests were done with a Crestline one. But I think it's a
> fair guess that the bug would have been noticed earlier if more recent
> chips were affected, as >4GB RAM have become much more common since then.

The later chips do not use a physical address for the hardware status
page. The patch looks good, thanks.
-Chris

--
Chris Wilson, Intel Open Source Technology Centre