2002-01-21 10:56:16

by Reid Hekman

Subject: Athlon PSE/AGP Bug

Hi,

The folks at Gentoo Linux have published a news item about AGP related
lockups with PSE on AMD Athlons.

As I have a couple systems that may/may not be affected, I'm seeking
some clarification. Is this an effect of the errata published by AMD in
the Athlon models 4 & 6 revision guides as "INVLPG Instruction Does Not
Flush Entire Four-Megabyte Page Properly with Certain Linear Addresses"?
That errata lists all Athlon Thunderbirds as affected and all Athlon
Palominos except for stepping A5.

Regardless of specific errata listings, will future workarounds be
enabled based on cpuid or via a test for the bug itself?

Regards,
Reid


2002-01-21 13:39:40

by David Miller

Subject: Re: Athlon PSE/AGP Bug

From: Reid Hekman <[email protected]>
Date: 21 Jan 2002 04:53:39 -0600

As I have a couple systems that may/may not be affected, I'm seeking
some clarification. Is this an effect of the errata published by AMD in
the Athlon models 4 & 6 revision guides as "INVLPG Instruction Does Not
Flush Entire Four-Megabyte Page Properly with Certain Linear Addresses"?
That errata lists all Athlon Thunderbirds as affected and all Athlon
Palominos except for stepping A5.

Regardless of specific errata listings, will future workarounds be
enabled based on cpuid or via a test for the bug itself?

The funny part is, if this published errata is the problem, it cannot
be a problem under Linux since we never invalidate 4MB pages. We
create them at boot time and they never change after that.

2002-01-21 13:50:40

by Arjan van de Ven

Subject: Re: Athlon PSE/AGP Bug

"David S. Miller" wrote:

> The funny part is, if this published errata is the problem, it cannot
> be a problem under Linux since we never invalidate 4MB pages. We
> create them at boot time and they never change after that.

Well we don't know what nvidia's kernel module is doing.....

2002-01-21 16:53:32

by Andrea Arcangeli

Subject: Re: Athlon PSE/AGP Bug

On Mon, Jan 21, 2002 at 05:37:24AM -0800, David S. Miller wrote:
> From: Reid Hekman <[email protected]>
> Date: 21 Jan 2002 04:53:39 -0600
>
> As I have a couple systems that may/may not be affected, I'm seeking
> some clarification. Is this an effect of the errata published by AMD in
> the Athlon models 4 & 6 revision guides as "INVLPG Instruction Does Not
> Flush Entire Four-Megabyte Page Properly with Certain Linear Addresses"?
> That errata lists all Athlon Thunderbirds as affected and all Athlon
> Palominos except for stepping A5.
>
> Regardless of specific errata listings, will future workarounds be
> enabled based on cpuid or via a test for the bug itself?
>
> The funny part is, if this published errata is the problem, it cannot
> be a problem under Linux since we never invalidate 4MB pages. We
> create them at boot time and they never change after that.

correct, furthermore it cannot even trigger if you invlpg with an address
that is page aligned (4mbyte aligned in this case) like we would always do in
linux anyways, we never use invlpg on misaligned addresses, no matter if
the page is a 4M or a 4k page. And I guess with PAE enabled it cannot
even trigger in the first place (it speaks only about 4M pages, pae only
provides 2M pages instead).
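
The 4M versus 2M distinction follows from the page-table geometry of 32-bit x86. A minimal sketch in C, not taken from the kernel; the name large_page_bytes and the constant are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: these names are hypothetical, not kernel symbols. */
#define SMALL_PAGE_BYTES 4096u

/*
 * A large page maps everything that one full page table would map.
 * Without PAE a page table holds 1024 four-byte entries; with PAE it
 * holds 512 eight-byte entries.
 */
static uint32_t large_page_bytes(int pae_enabled)
{
    uint32_t entries_per_table = pae_enabled ? 512u : 1024u;
    uint32_t bytes = entries_per_table * SMALL_PAGE_BYTES;

    /* Sanity check: a page size is always a power of two. */
    assert((bytes & (bytes - 1u)) == 0u);
    return bytes;
}
```

large_page_bytes(0) gives 4 MB and large_page_bytes(1) gives 2 MB, which is the basis for the guess that an erratum worded in terms of 4M pages may not apply with PAE.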

I think this is a very very minor issue, I doubt anybody ever triggered
it in real life with linux.

And Gentoo is shipping a kernel with preempt and rmaps included, so it
can crash anytime anyways, no matter how good the cpu is, so if they
got crashes with such a kernel (maybe even with nvidia driver) that's
normal. I was speaking today with a trusted party doing vm benchmarking
and rmap crashes the kernel reproducibly under a straight calloc while
swapping heavily, so clearly the implementation is still broken. preempt
additionally will mess up all the locking into the nvidia driver as
well. so if the combination of the two runs for some time without any
lockup that's pure luck IMHO.

Andrea

2002-01-21 16:55:32

by Jeff Epler

Subject: Re: Athlon PSE/AGP Bug

On Mon, Jan 21, 2002 at 01:50:14PM +0000, Arjan van de Ven wrote:
> "David S. Miller" wrote:
>
> > The funny part is, if this published errata is the problem, it cannot
> > be a problem under Linux since we never invalidate 4MB pages. We
> > create them at boot time and they never change after that.
>
> Well we don't know what nvidia's kernel module is doing.....

.. which makes it not a kernel bug, right? Just some buggy module that
bangs hardware in a way documented to not work...

Jeff

2002-01-21 17:26:54

by Ed Sweetman

Subject: Re: Athlon PSE/AGP Bug

On Mon, 2002-01-21 at 11:58, [email protected] wrote:
> On Mon, Jan 21, 2002 at 01:50:14PM +0000, Arjan van de Ven wrote:
> > "David S. Miller" wrote:
> >
> > > The funny part is, if this published errata is the problem, it cannot
> > > be a problem under Linux since we never invalidate 4MB pages. We
> > > create them at boot time and they never change after that.
> >
> > Well we don't know what nvidia's kernel module is doing.....
>
> .. which makes it not a kernel bug, right? Just some buggy module that
> bangs hardware in a way documented to not work...
>

Would seem so since it's been one and a half years and nobody has
encountered this bug in linux.

Damn you gotta love slashdot. It's like the Internet's smut mag. If
their news is going to be so old it should be because they're actually
looking into the story they're posting with some kind of review
process.

2002-01-21 18:00:44

by Reid Hekman

Subject: Re: Athlon PSE/AGP Bug

On Mon, 2002-01-21 at 10:54, Andrea Arcangeli wrote:
> On Mon, Jan 21, 2002 at 05:37:24AM -0800, David S. Miller wrote:
> > That errata lists all Athlon Thunderbirds as affected and all Athlon
> > Palominos except for stepping A5.
...
> > The funny part is, if this published errata is the problem, it cannot
> > be a problem under Linux since we never invalidate 4MB pages. We
> > create them at boot time and they never change after that.
>
> correct, furthermore it cannot even trigger if you invlpg with an address
> page aligned (4mbyte aligned in this case) like we would always do in
> linux anyways, we never use invlpg on misaligned addresses, no matter if
> the page is a 4M or a 4k page. And I guess with PAE enabled it cannot
> even trigger in first place (it speaks only about 4M pages, pae only
> provides 2M pages instead).
>
> I think this is a very very minor issue, I doubt anybody ever triggered
> it in real life with linux.

Thanks for the clarification. I run a few systems with such CPUs, but
they don't exhibit the problem. I don't run Gentoo, just RH 7.(12) and
Debian Woody with recent 2.4 vanilla kernels, all of which run AGP, but
with a mix of ATI and Nvidia cards.

On Mon, 2002-01-21 at 12:26, Ed Sweetman wrote:
> Damn you gotta love slashdot. It's like the Internet's smut mag. If
> their news is going to be so old it should be because they're actually
> looking into the story they're posting with some kind of review
> process.

Well I saw this on LinuxToday before it hit slashdot (it was mostly
inaccessible after that). Gentoo's explanation made sense, they claimed
to have spoken with Terrence Ripperda at Nvidia, Andrew Morton, and Alan
Cox. They also claimed this was a generic CPU bug affecting Linux -- the
same bug that was resolved with a workaround a year ago in Windows.

Unfortunately, the Technical note describing the Windows fix AMD
published is incredibly vague and doesn't specify if it is in fact a CPU
bug or some voodoo specific to Windows 2000.

Certainly there are some questions regarding the true impact of this bug
if any -- that's why I asked here. Slashdot reporting it just blows
things out of proportion. I wouldn't take Slashdot's word for anything,
but nor would I dismiss reports of problems out of hand just because
Slashdot picks up on it.

Regards,
Reid

2002-01-21 18:24:08

by Andrew Morton

Subject: Re: Athlon PSE/AGP Bug

Andrea Arcangeli wrote:
>
> ...
>
> I think this is a very very minor issue, I doubt anybody ever triggered
> it in real life with linux.

It is said that the crashes cease when the `nopentium' option
is used, so it does appear that something is up.

It does seem that the nVidia driver is usually involved.

> And Gentoo is shipping a kernel with preempt and rmaps included, so it
> can crash anytime anyways, no matter how good the cpu is, so if they
> got crashes with such a kernel (maybe even with nvidia driver) that's
> normal. I was speaking today with a trusted party doing vm benchmarking
> and rmap crashes the kernel reproducibly under a straight calloc while
> swapping heavily, so clearly the implementation is still broken.

-rmap is still young. I did some heavy stress testing on it a couple
of days ago and it was rock-solid, and performed well.

> preempt additionally will mess up all the locking into the nvidia driver as
> well. so if the combination of the two runs for some time without any
> lockup that's pure luck IMHO.

Yup. But don't forget about the `nopentium' observations.

-

2002-01-21 19:12:02

by Harold Campbell

Subject: Re: Athlon PSE/AGP Bug

On Mon, 2002-01-21 at 12:17, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > ...
> >
> > I think this is a very very minor issue, I doubt anybody ever triggered
> > it in real life with linux.
>
> It is said that the crashes cease when the `nopentium' option
> is used, so it does appear that something is up.
>
> It does seem that the nVidia driver is usually involved.
>

If it makes any difference the only time my Athlon Thunderbird system
with Matrox G450 locks up is during quake3. No "nVidia inside(tm)".
Guess I'll toss in the nopentium option and see if it helps.

--
Some circumstantial evidence is very strong, as when you find a trout in
the milk.
-- Thoreau

2002-01-21 19:26:14

by Alan

Subject: Re: Athlon PSE/AGP Bug

> That errata lists all Athlon Thunderbirds as affected and all Athlon
> Palominos except for stepping A5.
>
> Regardless of specific errata listings, will future workarounds be
> enabled based on cpuid or via a test for the bug itself?

That problem shouldn't be hitting Linux x86. I don't know about the
Nvidia module but the base kernel shouldn't hit an invlpg on 4MB pages.

2002-01-21 19:35:54

by David Weinehall

Subject: Re: Athlon PSE/AGP Bug

On Mon, Jan 21, 2002 at 07:37:45PM +0000, Alan Cox wrote:
> > That errata lists all Athlon Thunderbirds as affected and all Athlon
> > Palominos except for stepping A5.
> >
> > Regardless of specific errata listings, will future workarounds be
> > enabled based on cpuid or via a test for the bug itself?
>
> That problem shouldn't be hitting Linux x86. I don't know about the
> Nvidia module but the base kernel shouldn't hit an invlpg on 4MB pages.

The reference to you in the /.-article is the usual /.-bullshit, I
gather?!


/David
_ _
// David Weinehall <[email protected]> /> Northern lights wander \\
// Maintainer of the v2.0 kernel // Dance across the winter sky //
\> http://www.acc.umu.se/~tao/ </ Full colour fire </

2002-01-21 19:51:06

by Sipos Ferenc

Subject: Re: Athlon PSE/AGP Bug

Hi!

It has nothing to do with nvidia, I think. I'm using redhat now, but
I once tried a network installation with suse, and yast crashed at
the beginning. I played with the boot parameters, and mem=nopentium
was the winner :). I bought my machine in December 1999, it's an
athlon 500, so the bug seems to exist in the earlier athlons too. For me
the nvidia module crashes with the stock kernel irongate agp driver, but
not with the nv agp driver, but that's offtopic.

Paco

On Mon, 2002-01-21, David Weinehall wrote:
> On Mon, Jan 21, 2002 at 07:37:45PM +0000, Alan Cox wrote:
> > > That errata lists all Athlon Thunderbirds as affected and all Athlon
> > > Palominos except for stepping A5.
> > >
> > > Regardless of specific errata listings, will future workarounds be
> > > enabled based on cpuid or via a test for the bug itself?
> >
> > That problem shouldn't be hitting Linux x86. I don't know about the
> > Nvidia module but the base kernel shouldn't hit an invlpg on 4MB pages.
>
> The reference to you in the /.-article is the usual /.-bullshit, I
> gather?!
>
>
> /David
> _ _
> // David Weinehall <[email protected]> /> Northern lights wander \\
> // Maintainer of the v2.0 kernel // Dance across the winter sky //
> \> http://www.acc.umu.se/~tao/ </ Full colour fire </
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2002-01-21 22:16:44

by David Miller

Subject: Re: Athlon PSE/AGP Bug

From: Arjan van de Ven <[email protected]>
Date: Mon, 21 Jan 2002 13:50:14 +0000

Well we don't know what nvidia's kernel module is doing.....

I know it isn't using large pages, that is for sure.

2002-01-21 22:21:54

by David Miller

Subject: Re: Athlon PSE/AGP Bug

From: Andrea Arcangeli <[email protected]>
Date: Mon, 21 Jan 2002 17:54:10 +0100

correct, furthermore it cannot even trigger if you invlpg with an address
page aligned (4mbyte aligned in this case) like we would always do in
linux anyways, we never use invlpg on misaligned addresses, no matter if
the page is a 4M or a 4k page.

That's not true, see the ptrace() helper code. Russell King pointed
this out to me last week and it's on my TODO list to fix it up.

2002-01-21 22:26:14

by David Miller

Subject: Re: Athlon PSE/AGP Bug

From: Andrew Morton <[email protected]>
Date: Mon, 21 Jan 2002 10:17:10 -0800

Andrea Arcangeli wrote:
> I think this is a very very minor issue, I doubt anybody ever triggered
> it in real life with linux.

It is said that the crashes cease when the `nopentium' option
is used, so it does appear that something is up.

It does seem that the nVidia driver is usually involved.

I think this is all "just so happens" personally, and all that
turning off the large pages really does is change the timings so that
whatever bug is really present simply becomes a heisenbug.

2002-01-22 00:26:58

by Stuart Young

Subject: Re: Athlon PSE/AGP Bug

At 02:23 PM 21/01/02 -0800, David S. Miller wrote:
> It does seem that the nVidia driver is usually involved.
>
>I think this is all "just so happens" personally, and all that
>turning off the large pages really does is change the timings so that
>whatever bug is really present simply becomes a heisenbug.

I'd definitely agree with that. My home system seems rock stable in regards
to games. Every game problem I've had I've somehow tracked down to a memory
leak in the game (or another app running in the background) that
(assumedly) blew out the VM after a while. (Athlon 1400, Asus A7M, Creative
SBLive!, Asus V8200 GF3DDR, decent power supply and cooling).

I would not be surprised if some of these "crashes" were power related.
It's quite possible that by slowing things down (even marginally) it will
reduce the current drain on the system. Some systems power supplies (and
associated m/board power circuitry) can be so touchy they become unstable,
and will eventually provide unclean power (usually an AC ripple on the DC).
From there, chaos.

Then you've got heat, which is the "next big killer" with the Athlons.
They produce a lot of it, and if it doesn't circulate properly, you can
never expect anything reliable. Of course, the video card generally is
quite close to the CPU, and it generates heat too, and can suffer from all
sorts of issues if they get too hot.

Almost all the Athlon problems I've looked at for friends (with lockups
specifically) have pretty much fallen into either; the above failure
categories, "broken hardware" or "kernel issues" (usually known issues, not
specific to Athlon).

I doubt it's as bad as everyone makes out. Unfortunately people buy cheap
hardware because it's cheap, not because it's reliable.


Stuart Young - [email protected]
(aka Cefiar) - [email protected]

[All opinions expressed in the above message are my]
[own and not necessarily the views of my employer..]

2002-01-22 00:36:50

by Steve Brueggeman

Subject: Re: Athlon PSE/AGP Bug

Actually, this one hit home this weekend.

I bought a new computer at a computer fair.
ECS K7S5A Motherboard
1.8GHz (1.5 actually) Athlon XP
3DForce2-MX
256MB DDR SDRAM PC-2100
AHA2940UW SCSI Controller
Compaq CDROM (reused from other upgraded system)

Spent all of Saturday trying to install Mandrake Linux 8.1 with random crashes,
segfaults, IDE-Timeouts. Figuring this to be a memory problem, I ran memtest86
for 4 hours without any errors. Was getting late, and said screw-it and went to
bed.

Sunday, set the memory and CPU both to 100MHz, still had problems, so I set
both back to 133MHz. Booted kernel 2.2.19 from 2nd CD in Mandrake set, and had
better luck. Got it installed after 3 restarts. Figuring this was somehow
related to APM or ACPI, I compiled a standard Marcello kernel 2.4.17, but could
not make it through a whole compile without segfaults. I'd just restart the
compile, letting make skip past the stuff that was already compiled. Got an
average of 3-4 segfaults on compile run, and I tried about 5 runs.

Boot to linux-2.4.17 with APM and ACPI disabled, and only stuff in my system
enabled, and no Frame Buffer, still get segfaults when compiling kernel.

Then by sheer luck, while doing my normal check of linuxtoday.com, the top
article mentioned this Athlon bug. I figure, "Hey, this sounds somewhat
familiar", so I reboot with mem=nopentium as they suggested.

I've compiled the linux-2.4.17 about 10 times now, without a single segfault.

So, add me to the "Yes I've got this problem" list, and Yes, it appears to be
related to Nvidia AGP boards.

I've been running a 1Ghz Thunderbird for about a year now, with 2 different ATI
boards without any problems. I'll try swapping the ATI and Nvidia display
adapters and see if it follows.

Steve Brueggeman

On Mon, 21 Jan 2002 10:17:10 -0800, you wrote:

>Andrea Arcangeli wrote:
>>
>> ...
>>
>> I think this is a very very minor issue, I doubt anybody ever triggered
>> it in real life with linux.
>
>It is said that the crashes cease when the `nopentium' option
>is used, so it does appear that something is up.
>
>It does seem that the nVidia driver is usually involved.
>
>> And Gentoo is shipping a kernel with preempt and rmaps included, so it
>> can crash anytime anyways, no matter how good the cpu is, so if they
>> got crashes with such a kernel (maybe even with nvidia driver) that's
>> normal. I was speaking today with a trusted party doing vm benchmarking
>> and rmap crashes the kernel reproducibly under a straight calloc while
>> swapping heavily, so clearly the implementation is still broken.
>
>-rmap is still young. I did some heavy stress testing on it a couple
>of days ago and it was rock-solid, and performed well.
>
>> preempt additionally will mess up all the locking into the nvidia driver as
>> well. so if the combination of the two runs for some time without any
>> lockup that's pure luck IMHO.
>
>Yup. But don't forget about the `nopentium' observations.
>
>-

2002-01-22 00:37:10

by Andrea Arcangeli

Subject: Re: Athlon PSE/AGP Bug

On Mon, Jan 21, 2002 at 02:19:31PM -0800, David S. Miller wrote:
> From: Andrea Arcangeli <[email protected]>
> Date: Mon, 21 Jan 2002 17:54:10 +0100
>
> correct, furthermore it cannot even trigger if you invlpg with an address
> page aligned (4mbyte aligned in this case) like we would always do in
> linux anyways, we never use invlpg on misaligned addresses, no matter if
> the page is a 4M or a 4k page.
>
> That's not true, see the ptrace() helper code. Russell King pointed
> this out to me last week and it's on my TODO list to fix it up.

Where? :) ptrace doesn't change pagetables, no need to flush any tlb in
ptrace.

Anyways if the problem is in the nvidia driver they may be really doing
an invlpg on a misaligned 4M page address for no good reason, this
sounds unlikely though. What's certain is that the stuff into the
mainline kernel shouldn't really be affected for the reason you also
said previously (we never invalidate 4M pages with invlpg). In the very
worst case nvidia guys just need to mask the lower (not significant)
bits before passing the address to invlpg, which is going to be a one
liner.
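
The one-liner mentioned above could look roughly like the following. This is a sketch that assumes non-PAE 4 MB pages; large_page_base and the constant names are hypothetical, and nothing here is taken from the nvidia driver or from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration; non-PAE 4 MB large pages assumed. */
#define LARGE_PAGE_SHIFT 22
#define LARGE_PAGE_SIZE  (UINT32_C(1) << LARGE_PAGE_SHIFT)
#define LARGE_PAGE_MASK  ((uint32_t)~(LARGE_PAGE_SIZE - UINT32_C(1)))

/* Clear the low 22 bits so the address points at the start of its 4 MB page. */
static uint32_t large_page_base(uint32_t linear_addr)
{
    uint32_t base = linear_addr & LARGE_PAGE_MASK;

    /* The result is always 4 MB aligned. */
    assert((base & (LARGE_PAGE_SIZE - UINT32_C(1))) == 0);
    return base;
}
```

A caller would then hand large_page_base(addr), rather than addr itself, to whatever wraps the invlpg instruction.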

Andrea

2002-01-22 00:38:50

by Andrea Arcangeli

Subject: Re: Athlon PSE/AGP Bug

On Mon, Jan 21, 2002 at 02:23:20PM -0800, David S. Miller wrote:
> From: Andrew Morton <[email protected]>
> Date: Mon, 21 Jan 2002 10:17:10 -0800
>
> Andrea Arcangeli wrote:
> > I think this is a very very minor issue, I doubt anybody ever triggered
> > it in real life with linux.
>
> It is said that the crashes cease when the `nopentium' option
> is used, so it does appear that something is up.
>
> It does seem that the nVidia driver is usually involved.
>
> I think this is all "just so happens" personally, and all that
> turning off the large pages really does is change the timings so that
> whatever bug is really present simply becomes a heisenbug.

I was wondering the same, however I wasn't sure how much the timing could
really change to make the kernel bugs trigger.

Andrea

2002-01-22 00:44:40

by Russell King

Subject: Re: Athlon PSE/AGP Bug

On Tue, Jan 22, 2002 at 01:37:43AM +0100, Andrea Arcangeli wrote:
> On Mon, Jan 21, 2002 at 02:19:31PM -0800, David S. Miller wrote:
> > That's not true, see the ptrace() helper code. Russell King pointed
> > this out to me last week and it's on my TODO list to fix it up.
>
> Where? :) ptrace doesn't change pagetables, no need to flush any tlb in
> ptrace.

See:

int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write)
{
	...
	flush_cache_page(vma, addr);
	...
}

flush_cache_page() is passed a non-page aligned address. AFAIK that is
the only instance where the flush_{cache,tlb}_* stuff is called with
non-page aligned addresses.
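
To make "non-page aligned" concrete: with 4k pages an address splits into a page base and an offset within the page. A sketch with locally defined constants (the kernel has its own PAGE_SHIFT and PAGE_MASK; the ex_ and EX_ prefixed names here are made up so that nothing clashes with them):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's 4k page constants. */
#define EX_PAGE_SHIFT 12
#define EX_PAGE_SIZE  (UINT32_C(1) << EX_PAGE_SHIFT)
#define EX_PAGE_MASK  ((uint32_t)~(EX_PAGE_SIZE - UINT32_C(1)))

/* Offset of the address inside its 4k page (the low 12 bits). */
static uint32_t ex_page_offset(uint32_t addr)
{
    return addr & (EX_PAGE_SIZE - UINT32_C(1));
}

/* Start of the 4k page that contains the address. */
static uint32_t ex_page_base(uint32_t addr)
{
    uint32_t base = addr & EX_PAGE_MASK;

    /* Base and offset together reconstruct the original address. */
    assert(base + ex_page_offset(addr) == addr);
    return base;
}

/* Non-zero when the address already sits on a page boundary. */
static int ex_is_page_aligned(uint32_t addr)
{
    return ex_page_offset(addr) == 0;
}
```

An arbitrary user address such as the one ptrace hands to access_process_vm() will usually have a non-zero offset, so a flush hook that expects a page base would need ex_page_base(addr) or its kernel equivalent.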

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2002-01-22 00:52:40

by Andrea Arcangeli

Subject: Re: Athlon PSE/AGP Bug

On Tue, Jan 22, 2002 at 12:43:59AM +0000, Russell King wrote:
> On Tue, Jan 22, 2002 at 01:37:43AM +0100, Andrea Arcangeli wrote:
> > On Mon, Jan 21, 2002 at 02:19:31PM -0800, David S. Miller wrote:
> > > That's not true, see the ptrace() helper code. Russell King pointed
> > > this out to me last week and it's on my TODO list to fix it up.
> >
> > Where? :) ptrace doesn't change pagetables, no need to flush any tlb in
> > ptrace.
>
> See:
>
> int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write)
> {
> ...
> flush_cache_page(vma, addr);
> ...
> }
>
> flush_cache_page() is passed a non-page aligned address. AFAIK that is
> the only instance where the flush_{cache,tlb}_* stuff is called with
> non-page aligned addresses.

flush_cache_page is by no means a _tlb_ flush. It is a virtual indexed
cache flush needed before you can access data at such address (noop on
x86).

I'm not even sure we should consider it incorrect if somebody did a tlb
flush on a non-aligned address, given that it works fine for the 4k
pages. I mainly wanted to point out that with tlb flushes it is pretty
natural to do them aligned in the code, and that's what linux does with
the 4k pages (we never invalidate 4M pages, as Dave pointed out, and it
sounds unlikely that nvidia flushes 4M pages with misaligned addresses).

Andrea

2002-01-22 00:56:10

by Russell King

Subject: Re: Athlon PSE/AGP Bug

On Tue, Jan 22, 2002 at 01:53:21AM +0100, Andrea Arcangeli wrote:
> On Tue, Jan 22, 2002 at 12:43:59AM +0000, Russell King wrote:
> > On Tue, Jan 22, 2002 at 01:37:43AM +0100, Andrea Arcangeli wrote:
> > > On Mon, Jan 21, 2002 at 02:19:31PM -0800, David S. Miller wrote:
> > > > That's not true, see the ptrace() helper code. Russell King pointed
> > > > this out to me last week and it's on my TODO list to fix it up.
> > >
> > > Where? :) ptrace doesn't change pagetables, no need to flush any tlb in
> > > ptrace.
> >
> > See:
> >
> > int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write)
> > {
> > ...
> > flush_cache_page(vma, addr);
> > ...
> > }
> >
> > flush_cache_page() is passed a non-page aligned address. AFAIK that is
> > the only instance where the flush_{cache,tlb}_* stuff is called with
> > non-page aligned addresses.
>
> flush_cache_page is by no means a _tlb_ flush. It is a virtual indexed
> cache flush needed before you can access data at such address (noop on
> x86).

Sigh, I never claimed it was a tlb flush function.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2002-01-22 01:03:10

by Steve Brueggeman

Subject: Re: Athlon PSE/AGP Bug

Forgot to mention, I got the segfaults compiling kernels while running
linux-2.4.17, I was in console, and did not have Frame Buffer, or drm drivers
loaded. I did have the SiS AGP compiled into the kernel though.

2002-01-22 01:10:10

by David Miller

Subject: Re: Athlon PSE/AGP Bug

From: Andrea Arcangeli <[email protected]>
Date: Tue, 22 Jan 2002 01:37:43 +0100

On Mon, Jan 21, 2002 at 02:19:31PM -0800, David S. Miller wrote:
> That's not true, see the ptrace() helper code. Russell King pointed
> this out to me last week and it's on my TODO list to fix it up.

Where? :) ptrace doesn't change pagetables, no need to flush any tlb in
ptrace.

egrep flush_*_page kernel/ptrace.c:access_process_vm()

2002-01-22 01:10:50

by David Miller

Subject: Re: Athlon PSE/AGP Bug

From: Andrea Arcangeli <[email protected]>
Date: Tue, 22 Jan 2002 01:39:09 +0100

On Mon, Jan 21, 2002 at 02:23:20PM -0800, David S. Miller wrote:
> I think this is all "just so happens" personally, and all that
> turning off the large pages really does is change the timings so that
> whatever bug is really present simply becomes a heisenbug.

I was wondering the same, however I wasn't sure how much the timing could
really change to make the kernel bugs trigger.

Not kernel bugs, things like AGP bugs under high load which would
go away if the machine spent more time taking kernel TLB misses.

2002-01-22 01:27:16

by Andrea Arcangeli

Subject: Re: Athlon PSE/AGP Bug

On Mon, Jan 21, 2002 at 05:07:45PM -0800, David S. Miller wrote:
> From: Andrea Arcangeli <[email protected]>
> Date: Tue, 22 Jan 2002 01:37:43 +0100
>
> On Mon, Jan 21, 2002 at 02:19:31PM -0800, David S. Miller wrote:
> > That's not true, see the ptrace() helper code. Russell King pointed
> > this out to me last week and it's on my TODO list to fix it up.
>
> Where? :) ptrace doesn't change pagetables, no need to flush any tlb in
> ptrace.
>
> egrep flush_*_page kernel/ptrace.c:access_process_vm()

that is not a tlb flush, in fact it's a noop on x86.

andrea@athlon:~/devel/kernel/2.4.18pre4aa1> egrep tlb kernel/ptrace.c
andrea@athlon:~/devel/kernel/2.4.18pre4aa1>

Andrea

2002-01-22 05:49:12

by Shaya Potter

Subject: Re: Athlon PSE/AGP Bug

athlon XP 1800 is a cpuid 622 (aka an A5)

doesn't have this bug, according to AMD tech docs.

at least my 2 XP 1800+s are 622, so I assume all are (could be wrong)
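
For reference, a three-digit signature like the one quoted above is usually read as family, model and stepping, one hex digit each. A sketch of that decoding follows; the struct and function names are made up, and the extended family and model fields of later CPUs are deliberately ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type; field names chosen for illustration. */
struct cpu_signature {
    unsigned family;
    unsigned model;
    unsigned stepping;
};

/*
 * Decode the base fields of a CPUID signature:
 * bits 11..8 family, bits 7..4 model, bits 3..0 stepping.
 * This sketch only accepts the three-digit form used in the thread.
 */
static struct cpu_signature decode_signature(uint32_t signature)
{
    struct cpu_signature sig;

    assert((signature & ~UINT32_C(0xFFF)) == 0);
    sig.family   = (signature >> 8) & 0xFu;
    sig.model    = (signature >> 4) & 0xFu;
    sig.stepping = signature & 0xFu;
    return sig;
}
```

Read this way, 0x622 decodes to family 6, model 2, stepping 2, while a model 6 part (the revision guides cited at the top of the thread cover models 4 and 6) would read 0x66x. The thread does not settle which value was actually reported, so a cpuid-based workaround would want to check the decoded fields rather than a remembered number.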

On Mon, 2002-01-21 at 19:36, Steve Brueggeman wrote:
> Actually, this one hit home this weekend.
>
> I bought a new computer at a computer fair.
> ECS K7S5A Motherboard
> 1.8GHz (1.5 actually) Athlon XP
> 3DForce2-MX
> 256MB DDR SDRAM PC-2100
> AHA2940UW SCSI Controller
> Compaq CDROM (reused from other upgraded system)
>
> Spent all of Saturday trying to install Mandrake Linux 8.1 with random crashes,
> segfaults, IDE-Timeouts. Figuring this to be a memory problem, I ran memtest86
> for 4 hours without any errors. Was getting late, and said screw-it and went to
> bed.
>
> Sunday, set the memory and CPU both to 100MHz, still had problems, so I set
> both back to 133MHz. Booted kernel 2.2.19 from 2nd CD in Mandrake set, and had
> better luck. Got it installed after 3 restarts. Figuring this was somehow
> related to APM or ACPI, I compiled a standard Marcello kernel 2.4.17, but could
> not make it through a whole compile without segfaults. I'd just restart the
> compile, letting make skip past the stuff that was already compiled. Got an
> average of 3-4 segfaults on compile run, and I tried about 5 runs.
>
> Boot to linux-2.4.17 with APM and ACPI disabled, and only stuff in my system
> enabled, and no Frame Buffer, still get segfaults when compiling kernel.
>
> Then by sheer luck, while doing my normal check of linuxtoday.com, the top
> article mentioned this Athlon bug. I figure, "Hey, this sounds somewhat
> familiar", so I reboot with mem=nopentium as they suggested.
>
> I've compiled the linux-2.4.17 about 10 times now, without a single segfault.
>
> So, add me to the "Yes I've got this problem" list, and Yes, it appears to be
> related to Nvidia AGP boards.
>
> I've been running a 1Ghz Thunderbird for about a year now, with 2 different ATI
> boards without any problems. I'll try swapping the ATI and Nvidia display
> adapters and see if it follows.
>
> Steve Brueggeman
>
> On Mon, 21 Jan 2002 10:17:10 -0800, you wrote:
>
> >Andrea Arcangeli wrote:
> >>
> >> ...
> >>
> >> I think this is a very very minor issue, I doubt anybody ever triggered
> >> it in real life with linux.
> >
> >It is said that the crashes cease when the `nopentium' option
> >is used, so it does appear that something is up.
> >
> >It does seem that the nVidia driver is usually involved.
> >
> >> And Gentoo is shipping a kernel with preempt and rmaps included, so it
> >> can crash anytime anyways, no matter how good the cpu is, so if they
> >> got crashes with such a kernel (maybe even with nvidia driver) that's
> >> normal. I was speaking today with a trusted party doing vm benchmarking
> >> and rmap crashes the kernel reproducibly under a straight calloc while
> >> swapping heavily, so clearly the implementation is still broken.
> >
> >-rmap is still young. I did some heavy stress testing on it a couple
> >of days ago and it was rock-solid, and performed well.
> >
> >> preempt additionally will mess up all the locking into the nvidia driver as
> >> well. so if the combination of the two runs for some time without any
> >> lockup that's pure luck IMHO.
> >
> >Yup. But don't forget about the `nopentium' observations.
> >
> >-


2002-01-22 06:38:41

by Paul G. Allen

Subject: Re: Athlon PSE/AGP Bug

Alan Cox wrote:
>
> > That errata lists all Athlon Thunderbirds as affected and all Athlon
> > Palominos except for stepping A5.
> >
> > Regardless of specific errata listings, will future workarounds be
> > enabled based on cpuid or via a test for the bug itself?
>
> That problem shouldn't be hitting Linux x86. I don't know about the
> Nvidia module but the base kernel shouldn't hit an invlpg on 4MB pages.

See my post on /. regarding the bug.

In summary, I have 2 Thunderbird systems - a dual 1.4GHz Thunderbird on
Tyan Thunder K7 and a single 1.4GHz Thunderbird on Asus A7V133 - with
NVIDIA cards and the latest 2313 NVIDIA driver. The single runs RH 7.2
and this one (the dual) an up2date RH 7.1 with kernel 2.4.17. I have no
problems unless I boot a system into Win98. There are many other issues,
as you all know (and many dorks on /. apparently do not), that can and
will cause a system to hang. I run AGP4x, SBA, FSAA, and Anisotropic
filtering on almost all games. I often compile many different things. The
ONLY times I have compile issues are when I compile some things (Torque
game engine and Quake II) with -march at anything over pentium, at which
point either the internal compiler bugs rear their ugly heads or I get
strange graphics in a game.

But since kernel 2.4.14, never a system lock.

PGA
--
Paul G. Allen
Owner, Sr. Engineer, Security Specialist
Random Logic/Dream Park
http://www.randomlogic.com

2002-01-22 07:05:55

by Ville Herva

Subject: Re: Athlon PSE/AGP Bug

David S. Miller said:
>
> The funny part is, if this published errata is the problem, it cannot be a
> problem under Linux since we never invalidate 4MB pages. We create them
> at boot time and they never change after that.

and:
> From: Arjan van de Ven <[email protected]>
> > Well we don't know what nvidia's kernel module is doing.....
>
> I know it isn't using large pages, that is for sure.

and:
> I think this is all "just so happens" personally, and all that
> turning off the large pages really does is change the timings so that
> whatever bug is really present simply becomes a heisenbug.

Andrea Arcangeli <[email protected]> said:
> My same wondering, however I wasn't sure how much the timing could
> really change to make the kernel bugs trigger.

Alan Cox said:
> That problem shouldnt be hitting Linux x86. I don't know about the Nvidia
> module but the base kernel shouldnt hit an invlpg on 4Mb pages


Here's what Ripperda of nVidia (I imagine this is the same "Terrence
Ripperda of NVIDIA" mentioned at http://www.gentoo.org/) said in #nvidia @
irc.openprojects.net:

*** ripperda ([email protected]) has joined channel #nvidia
<Primer> ripperda: my man!
<Primer> major props for reporting the athlon bug
<ripperda> hey primer
<ripperda> thanks, hopefully we can get athlons a lot more stable under the
drivers now
<ripperda> I feel bad I screwed the pooch and didn't get it figured out
quicker
<Thunderbird> who discovered the bug after all?
<Primer> Thunderbird: AMD, back in Sept. 2000
<Primer> :P
<ripperda> one of our main windows kernel developers here, over a year ago
<Primer> except they forgot to tell us
<Thunderbird> why did nobody publish it before then?
<ripperda> he mentioned it to me, but I was swamped with other things, tried
to see if it would affect us, but was still a little new to the kernel code
<Russ|werk> hey ripperda
<Russ|werk> ripperda: is the fix going to cause a release?
<ripperda> this athlon bug can't be fixed in our code, that's a kernel issue

So clearly either the nvidia driver uses large pages or there is
some great misunderstanding.

Also, drobbins at http://www.gentoo.org goes on to say:

"I informed kernel hacker Andrew Morton of the issue; he put me in touch
with Alan Cox. Alan is going to try to add some kind of Athlon/AGP CPU bug
detection code to the kernel so that it will be able to auto-downgrade to 4K
pages when necessary."

Another case of miscommunication?

I sincerely hope you guys can sort this out...


-- v --

[email protected]

2002-01-22 07:11:26

by David Miller

Subject: Re: Athlon PSE/AGP Bug

From: Ville Herva <[email protected]>
Date: Tue, 22 Jan 2002 09:05:18 +0200

Another case of miscommunication?

Yes, Gareth Hughes @ NVIDIA understands very well that this can still
be just a heisenbug.

There is still no hard proof that not using 4M pages really fixes
anything AMD states is wrong with their chips.

2002-01-22 08:05:02

by Daniel Robbins

Subject: Re: Athlon PSE/AGP Bug

On Tue, 2002-01-22 at 00:08, David S. Miller wrote:
> Yes, Gareth Hughes @ NVIDIA understands very well that this can still
> be just a heisenbug.
>
> There is still no hard proof that not using 4M pages really fixes
> anything AMD states is wrong with their chips.

Well, it's clear that either NVIDIA, AMD or the general opinion held by
the majority of Linux kernel guys is wrong. I'm eager to find out the
truth behind the matter so that the parties involved can work towards a
solution, whatever that may be.

It'd be a bummer if I find that the explanation that NVIDIA gave me
turns out to be false. But it seems that there may be a real issue
here. I have received quite a few reports (and read in quite a few
comments posted on sites) that mem=nopentium solved a variety of strange
stability-related issues related to PCI/AGP devices. It may turn out
that the Athlon does have a problem with ends of DMA push buffers
aligned to 4Mb page boundaries. mem=nopentium seems to have fixed audio
and other types of lock-ups as well. Note that AMD told me on the phone
this morning that the issue Terrence found (and that the AMD Windows 2000
patch was created to solve) did *not* correlate with the published AMD
erratum that everyone on LKML is talking about, but was in fact another
issue.
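
The boundary condition speculated about above can be made concrete with a small sketch. This is not code from AMD, NVIDIA, or the kernel; the function names are hypothetical, and the only fact it relies on is that a PSE large page is 4 MB on non-PAE x86. It tests whether a buffer ends exactly on a 4 MB boundary, or straddles one.

```python
# Illustrative only: helper names are hypothetical, not from any driver.
LARGE_PAGE_SIZE = 4 * 1024 * 1024  # one PSE large page on non-PAE x86


def ends_on_large_page_boundary(start: int, length: int) -> bool:
    """True if the buffer [start, start + length) ends exactly on a 4 MB boundary."""
    return (start + length) % LARGE_PAGE_SIZE == 0


def crosses_large_page_boundary(start: int, length: int) -> bool:
    """True if the buffer's first and last bytes lie in different 4 MB pages."""
    if length <= 0:
        return False
    first_page = start // LARGE_PAGE_SIZE
    last_page = (start + length - 1) // LARGE_PAGE_SIZE
    return first_page != last_page


if __name__ == "__main__":
    # A 4 KB buffer placed so that it ends right at the 4 MB mark.
    print(ends_on_large_page_boundary(0x3FF000, 0x1000))
    print(crosses_large_page_boundary(0x3FF000, 0x1000))
```

Whether either condition actually matters on affected hardware is exactly what the thread has not yet established.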

Thankfully, the guessing will (hopefully) soon be over. AMD will be
calling me tomorrow at 3PM MST. They've reached a conclusion as to
what's going on, and I'll post AMD's official word on gentoo.org as
soon as I get it.

Best Regards,

--
Daniel Robbins <[email protected]>
Chief Architect/President http://www.gentoo.org
Gentoo Technologies, Inc.

2002-01-22 12:59:07

by Dave Jones

Subject: Re: Athlon PSE/AGP Bug

On Tue, Jan 22, 2002 at 12:45:59AM -0500, Shaya Potter wrote:
> athlon XP 1800 is a cpuid 622 (aka an A5)
> at least my 2 XP 1800+s are 622, so I assume all are (could be wrong)

Unless you have /proc/cpuinfo output that says otherwise, this is
wrong. 622 is the olde Athlon (0.18um) Rev A2.

XP is 662 with cachesize >=256 with bit 19 of capflags==0

(Determining new Duron/Athlon XP/Athlon MP is quite messy,
see x86info source for gory details)
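
To make the 622 versus 662 distinction concrete, here is a minimal sketch that splits a three-digit signature into family, model and stepping, and then applies the rule of thumb quoted above literally. It is not taken from x86info; the function names are hypothetical, and `capflags` is assumed to be the capability-flags word the rule refers to.

```python
def decode_signature(sig: str):
    """Split a three-hex-digit CPU signature such as "662" into
    (family, model, stepping)."""
    if len(sig) != 3:
        raise ValueError("expected three hex digits, e.g. '662'")
    family, model, stepping = (int(ch, 16) for ch in sig)
    return family, model, stepping


def looks_like_athlon_xp(sig: str, cache_kb: int, capflags: int) -> bool:
    """Literal reading of the rule above: signature 662, cache size of at
    least 256 KB, and bit 19 of the capability flags clear."""
    bit19_clear = ((capflags >> 19) & 1) == 0
    return sig == "662" and cache_kb >= 256 and bit19_clear


if __name__ == "__main__":
    print(decode_signature("622"))  # family 6, model 2, stepping 2
    print(decode_signature("662"))  # family 6, model 6, stepping 2
```

As the post says, real detection is messier than this; the sketch only shows why a one-digit typo changes the model number.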

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-01-22 14:13:22

by Halpaap, Mark (CETA)

Subject: Re: Athlon PSE/AGP Bug

Hi,

after applying mem=nopentium as a boot parameter
I've been able to play tuxracer _for the first time_.

Prior to this any OpenGL application deepfroze
the system after 10-20 secs.

Tried some other Loki-Demos, they run just fine
now.

I do _not_ have an NVidia card, it's a Matrox G450.

I wasn't able to use OpenGL on both Athlon-systems
I used (was Athlon 600 w/ G400, is a Thunderbird 1333
w/ G450 now), been trying it ever since XFree86 4.0 and
with (almost) any kernel that was released since then
(It's 2.4.16-pre1 right now).

So whatever the deeper reason, there _is_ something
fishy that this workaround seems to fix and it seems
not to be tied to NVidia drivers.

Mark.

2002-01-22 14:51:36

by Ed Sweetman

Subject: Re: Athlon PSE/AGP Bug

On Tue, 2002-01-22 at 09:12, Halpaap, Mark (CETA) wrote:
> Hi,
>
> after applying mem=nopentium as a boot parameter
> I've been able to play tuxracer _for the first time_.
>
> Prior to this any OpenGL application deepfroze
> the system after 10-20 secs.
>
> Tried some other Loki-Demos, they run just fine
> now.
>
> I do _not_ have an NVidia card, it's a Matrox G450.
>
> I wasn't able to use OpenGL on both Athlon-systems
> I used (was Athlon 600 w/ G400, is a Thunderbird 1333
> w/ G450 now), been trying it ever since XFree86 4.0 and
> with (almost) any kernel that was released since then
> (It's 2.4.16-pre1 right now).
>
> So whatever the deeper reason, there _is_ something
> fishy that this workaround seems to fix and it seems
> not to be tied to NVidia drivers.
>
> Mark.

I've had two different kinds of Athlons (K7-2 and Tbird 1.33GHz) with
V3 AGP and Matrox G450 AGP cards, both times on Abit
motherboards, and I have never experienced these problems. I have
agpgart enabled and X set to use AGP 4x on my G450 and still no problems
at all with GL apps. I've used different kernels (mostly dev) and
X versions from 3.x to CVS on them. No GL problems ever related to a boot
flag.

I don't deny the existence of the "bug" in Linux, but it's just strange
how a CPU bug is turning up for some people and not others. Perhaps
only some chips from all the batches are affected? Whatever the case,
tuxracer works perfectly here.

2002-01-22 14:54:46

by Joao Seabra

Subject: Re: Athlon PSE/AGP Bug

I usually play tuxracer, tuxkart, quake3, gltron (hardware 3D), and other
OpenGL games and have never had any kind of problem.

I'm running 2.4.17 on an Asus A7V133 with an Athlon 1333MHz, an Asus V7100 Pro
(GeForce2 MX 400) with the latest nvidia drivers, and XFree 4.1.0.

The only problem I noticed was that, when using the framebuffer, switching from X to
the console froze the system. After deactivating the framebuffer the problem went away.

Best Regards,

João Seabra

On Tuesday 22 January 2002 14:12, Halpaap, Mark (CETA) wrote:
> Hi,
>
> after applying mem=nopentium as a boot parameter
> I've been able to play tuxracer _for the first time_.
>
> Prior to this any OpenGL application deepfroze
> the system after 10-20 secs.
>
> Tried some other Loki-Demos, they run just fine
> now.
>
> I do _not_ have an NVidia card, it's a Matrox G450.
>


2002-01-22 15:30:56

by Shaya Potter

Subject: Re: Athlon PSE/AGP Bug

you're right, it's a typo on my part. I hit 2 twice instead of 6.

woops

On Tue, 2002-01-22 at 07:58, Dave Jones wrote:
> On Tue, Jan 22, 2002 at 12:45:59AM -0500, Shaya Potter wrote:
> > athlon XP 1800 is a cpuid 622 (aka an A5)
> > at least my 2 XP 1800+s are 622, so I assume all are (could be wrong)
>
> Unless you have /proc/cpuinfo output that says otherwise, this is
> wrong. 622 is the olde Athlon (0.18um) Rev A2.
>
> XP is 662 with cachesize >=256 with bit 19 of capflags==0
>
> (Determining new Duron/Athlon XP/Athlon MP is quite messy,
> see x86info source for gory details)
>
> --
> | Dave Jones. http://www.codemonkey.org.uk
> | SuSE Labs
--
spotter@{cs.columbia.edu,yucs.org}
http://yucs.org/~spotter/

2002-01-22 16:58:20

by David Woodhouse

Subject: Re: Athlon PSE/AGP Bug


[email protected] said:
> that is not a tlb flush, it's a noop on x86 in fact.

If these functions weren't quite so stupidly named, this confusion wouldn't
arise.

--
dwmw2


2002-01-22 18:53:10

by Greg

Subject: Re: Athlon PSE/AGP Bug


Just my 2c,

but is it a fault in AMD's chip or a fault in a
motherboard/AGP version/chipset?

There seem to be a number of people who have no problems on various video
cards and a number of people who have problems with various video cards, etc,
etc. Which is kinda leading me to go 'huh' and look puzzled because it is not
constant. This is why I thought there must be something else common to the
h/w of the people who are having problems.

if nothing comes out from Daniel:
>Thankfully, the guessing will (hopefully) soon be over. AMD will be
>calling me tomorrow at 3PM MST. They've reached a conclusion as to
>what's going on, and I'll post AMD's official word on gentoo.org as
>soon as I get it.

maybe a look into these other areas is in order?

cheers
--
Greg
Wellington
New Zealand

# Even the most secure OS is
# useless in the hands of an
# incompetent admin.

Download public key: http://www.performancemagic.com/cvme/public_key.asc

2002-01-22 20:16:35

by Florian Weimer

Subject: Re: Athlon PSE/AGP Bug

Steve Brueggeman <[email protected]> writes:

> Forgot to mention, I got the segfaults compiling kernels while running
> linux-2.4.17, I was in console, and did not have Frame Buffer, or drm drivers
> loaded. I did have the SiS AGP compiled into the kernel though.

On my new system at home, I got similar segfaults. Running memtest86
revealed that one of the RAM modules had a problem--and if I swapped
them, the BIOS startup code wouldn't even expand the actual BIOS code
every other system boot. After removing the offending RAM module (and
later replacing it) the problems were completely gone and haven't
returned yet...

Fortunately, I didn't know of the PSE/AGP bug back then. This made
debugging much, much easier. ;-)

--
Florian Weimer [email protected]
University of Stuttgart http://CERT.Uni-Stuttgart.DE/people/fw/
RUS-CERT +49-711-685-5973/fax +49-711-685-5898

2002-01-22 22:09:42

by Rene Rebe

Subject: Re: Athlon PSE/AGP Bug

Hi.

From: Greg <[email protected]>
Subject: Re: Athlon PSE/AGP Bug
Date: Wed, 23 Jan 2002 07:52:49 +1300

> Just my 2c,
>
> but is it a fault in AMDs chip or a fault in a
> mother board/agp version/chipset ??
>
> There seems to be a number of people who have no problems on various video
> cards and a number of people who have problems with various video cards, etc,
> etc. Which is kinda leading me to go 'huh' and look puzzled because it is not
> constant. This is why I thought there must be something else common to the
> h/w of the people who are having problems.

A broken power supply (or RAM?) or maybe a neighbor using a
circular saw or a power drill in a house with far from perfect
wiring? ;-) I also had no crashes on an Athlon 600/XP1700+ on
Irongate/SiS735 using a Matrox MGA 450 ...

> if nothing comes out from Daniel:
> >Thankfully, the guessing will (hopefully) soon be over. AMD will be
> >calling me tomorrow at 3PM MST. They've reached a conclusion as to
> >what's going on, and I'll post AMD's official word on gentoo.org as
> >soon as I get it.
>
> maybe a look into these other areas is in order?
>
> cheers
> - --
> Greg
> Wellington
> New Zealand
>
> # Even the most secure OS is
> # useless in the hands of an
> # incompetent admin.

k33p h4ck1n6
René

--
René Rebe (Registered Linux user: #248718 <http://counter.li.org>)

eMail: [email protected]
[email protected]

Homepage: http://drocklinux.dyndns.org/rene/

Anyone sending unwanted advertising e-mail to this address will be
charged $25 for network traffic and computing time. By extracting my
address from this message or its header, you agree to these terms.



2002-01-22 22:15:01

by Ed Sweetman

Subject: Re: Athlon PSE/AGP Bug

On the same note, anyone trying to run their RAM faster than it should
go from the BIOS would eventually see these kinds of things happen. I
used to get errors from anything really memory intensive, games and such,
from having RAM set at CAS 2 instead of CAS 3 and removing certain
delays when I shouldn't. People should really make sure their tuned-up
systems aren't just overtuned before chalking up segfaults to the Athlon
bug that apparently all the kernel gurus have decided doesn't affect
Linux, just as it doesn't affect the BSD people.

It seems to me that the bug "could" be in your chip, but that doesn't mean
it's in every Athlon... otherwise we'd be seeing some commonalities, and
so far I've seen none.

Since all the people having problems in Linux with the Athlon bug are
heavy graphics/game users, I'd suspect overtuning as the problem
before anything else and make sure they run memtest86, even if
disabling pentium ops fixes things.

On Tue, 2002-01-22 at 15:13, Florian Weimer wrote:
> Steve Brueggeman <[email protected]> writes:
>
> > Forgot to mention, I got the segfaults compiling kernels while running
> > linux-2.4.17, I was in console, and did not have Frame Buffer, or drm drivers
> > loaded. I did have the SiS AGP compiled into the kernel though.
>
> On my new system at home, I got similar segfaults. Running memtest86
> revealed that one of the RAM modules had a problem--and if I swapped
> them, the BIOS startup code wouldn't even expand the actual BIOS code
> every other system boot. After removing the offending RAM module (and
> later replacing it) the problems were completely gone and haven't
> returned yet...
>
> Fortunately, I didn't know of the PSE/AGP bug back then. This made
> debugging much, much easier. ;-)
>
> --
> Florian Weimer [email protected]
> University of Stuttgart http://CERT.Uni-Stuttgart.DE/people/fw/
> RUS-CERT +49-711-685-5973/fax +49-711-685-5898


2002-01-22 22:32:31

by Steve Brueggeman

Subject: Re: Athlon PSE/AGP Bug

Yep, that was the first thing that crossed my mind.

I had a floppy with memtest86 installed on it, and it ran for 4 hours
without any errors. Since I've seen memtest86 find bad bits on memory
modules that normally worked for me, I figured that this must surely
indicate that my segfaults are not related to a bad memory subsystem.

Also, while I expect setting the command line option
mem=nopentium
would slow things down slightly, I don't think that they'd be so slow
as to hide the bad memory. Also, the segfaults happen VERY reliably
without the mem=nopentium option, and have not happened even once,
WITH the mem=nopentium option.

One more curious thing is, I've got a 64MB GeForce2 MX, and the largest
I can set my AGP aperture size to is 64MB. Maybe a boundary
condition thing?


On Tue, 22 Jan 2002 21:13:42 +0100, you wrote:

>Steve Brueggeman <[email protected]> writes:
>
>> Forgot to mention, I got the segfaults compiling kernels while running
>> linux-2.4.17, I was in console, and did not have Frame Buffer, or drm drivers
>> loaded. I did have the SiS AGP compiled into the kernel though.
>
>On my new system at home, I got similar segfaults. Running memtest86
>revealed that one of the RAM modules had a problem--and if I swapped
>them, the BIOS startup code wouldn't even expand the actual BIOS code
>every other system boot. After removing the offending RAM module (and
>later replacing it) the problems were completely gone and haven't
>returned yet...
>
>Fortunately, I didn't know of the PSE/AGP bug back then. This made
>debugging much, much easier. ;-)



2002-01-22 22:53:32

by Steve Brueggeman

Subject: Re: Athlon PSE/AGP Bug

OK, now my dander's beginning to get up.

I have been in the computer industry for 20+ years.

I don't particularly like graphic interfaces.

My original post said that I was running the kernel compile in console
mode, but I guess I didn't expressly state that there was "NO X
running".

I did state that there were no Frame Buffer modules, nor DRM modules
loaded, but I did have the AGPGART driver for the SiS AGP hardware
compiled into the kernel.

I am not overclocking my system in "ANY WAY, SHAPE OR FORM". I do not
believe one bit in overclocking, as I appreciate stability much more
than the 2% to 5% overclocking gets you.

I have PC2100 DDR SDRAM, which should be able to run with a 133 MHz
bus.

I have an Athlon 1800+ (1500 or so real clock) which should run with
a 133 MHz bus.

While I cannot rule out a poor power supply, experience has shown me
that a minor tweak, such as running the kernel with mem=nopentium, is
not enough of a load change to expose a bad power supply.

Again, I reiterate, I consistently got 3-4 Segmentation faults when
compiling a kernel without the mem=nopentium, with 5 attempts done.

I was unable to get any segfaults when compiling kernels WITH the
mem=nopentium option, with 10 attempts done.
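
For a rough sense of how strong that evidence is, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that each kernel build fails independently with some fixed probability; the report above does not give a per-build failure rate, so the rates tried below are assumptions, and the function name is hypothetical.

```python
def prob_all_clean(p_fail_per_build: float, builds: int) -> float:
    """Chance of seeing `builds` consecutive clean builds if each build
    independently fails with probability `p_fail_per_build`."""
    if not 0.0 <= p_fail_per_build <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return (1.0 - p_fail_per_build) ** builds


if __name__ == "__main__":
    # Assumed failure rates; none of these figures come from the report.
    for p in (0.2, 0.5, 0.8):
        print(f"p_fail={p}: P(10 clean builds) = {prob_all_clean(p, 10):.2e}")
```

Under the assumed 50% failure rate, ten clean builds in a row would happen by chance about once in a thousand tries (1/1024). That suggests the boot option changes something, though it says nothing about what.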

I AM NOT stating that this is necessarily the Athlon bug exposed by
gentoo, but it appears that there are enough people complaining about
unstable systems becoming stable by running with mem=nopentium.
It also appears that a significant number of them are also running
Nvidia AGP graphics adapters.

For all I know, this may be due to Nvidia imposing some border-spec
timing on the AGP bus when doing DMA, or maybe it could be Athlon
related, or maybe it could in fact be a kernel bug that's been blown
off by kernel developers, just because they're using Nvidia, and don't
bother to ask whether or not they were running X.

And of course, since this is a new and untested system, purchased
from a computer fair, it could indeed be bad hardware; it's just that
my current indications say it's not.

I would like to see some indication that someone is collecting data
related to "running stable with mem=nopentium on Athlon
architecture", and maybe we can see a pattern here. Heck, maybe we'll see
2 or 3 different patterns here.

OK, I'm done.

Steve Brueggeman



On 22 Jan 2002 17:14:27 -0500, you wrote:

>On the same note, anyone trying to run their RAM faster than it should
>go from the BIOS would eventually see these kinds of things happen. I
>used to get errors from anything really memory intensive, games and such,
>from having RAM set at CAS 2 instead of CAS 3 and removing certain
>delays when I shouldn't. People should really make sure their tuned-up
>systems aren't just overtuned before chalking up segfaults to the Athlon
>bug that apparently all the kernel gurus have decided doesn't affect
>Linux, just as it doesn't affect the BSD people.
>
>It seems to me that the bug "could" be in your chip, but that doesn't mean
>it's in every Athlon... otherwise we'd be seeing some commonalities, and
>so far I've seen none.
>
>Since all the people having problems in Linux with the Athlon bug are
>heavy graphics/game users, I'd suspect overtuning as the problem
>before anything else and make sure they run memtest86, even if
>disabling pentium ops fixes things.
>



2002-01-22 23:17:16

by Ian Molton

Subject: Re: Athlon PSE/AGP Bug

On a sunny 22 Jan 2002 09:51:12 -0500 Ed Sweetman gathered a sheaf of
electrons and etched in their motions the following immortal words:

> I've had two different kinds of Athlons (K7-2 and Tbird 1.33GHz) with
> V3 AGP and Matrox G450 AGP cards, both times on Abit
> motherboards

> I don't deny the existence of the "bug" in Linux, but it's just strange
> how a CPU bug is turning up for some people and not others. Perhaps
> only some chips from all the batches are affected? Whatever the case,
> tuxracer works perfectly here.

Same here. I have a Duron 800 (running at 1GHz) on an ASUS A7M with a
Radeon 64 DDR VIVO, and it's running SWEEEET at AGP 4x.

2002-01-22 23:50:44

by Rik van Riel

Subject: Re: Athlon PSE/AGP Bug

On Tue, 22 Jan 2002, Steve Brueggeman wrote:

> I AM NOT stating that this is necessarily the Athlon bug exposed by
> gentoo, but it appears that there are enough people complaining about
> unstable systems becoming stable by running with mem=nopentium.
> It also appears that a significant number of them are also running
> Nvidia AGP graphics adapters.

Daniel Robbins, William Lee Irwin and myself were on the
phone with people from AMD today.

One possible cause for this problem was already tracked
down a while ago; this problem isn't the fault of any
particular part of the system (CPU, OS, AGP or graphics
driver) but simply a consequence of how these things
work together. Of course we don't know if this particular
bug is the one hitting Linux systems with nvidia.

I won't post my poorly explained version of the story
here as the AMD guys are working on releasing their
well-written version of the story somewhere in the next
few days...

kind regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-23 00:37:22

by Stuart Young

Subject: Re: Athlon PSE/AGP Bug

At 04:52 PM 22/01/02 -0600, Steve Brueggeman wrote:
>I would like to see some indication that someone is collecting data
>related to "running stable with mem=nopentium on Athlon
>architecture", and maybe we can see a pattern here. Heck, maybe we'll see
>2 or 3 different patterns here.

Well, I'm quite willing to give all the system specs we have at work and the
ones I have at home (all up, this is about 12 Athlons that are running
Linux, all running fine so far with no issues) towards this process.

I've not seen your system specs, so I'm wondering what sort of m/board you
have? The mention of the SiS AGP support makes me wonder if you are running
an SiS chipset board. In the past, Linux kernel developers and the XFree86
team have had a huge amount of trouble (or in some cases, flat refusal) in
getting certain (usually up to date) specs out of SiS, and I'm wondering if
maybe this could be related somehow, as none of the systems I've got have
an SiS chipset in them (they are all AMD or VIA chipsets).

Now I'm not saying this is an SiS issue, but maybe it's more prevalent with
SiS chipsets? Until we get some hard data, who knows!


Stuart Young - [email protected]
(aka Cefiar) - [email protected]

[All opinions expressed in the above message are my]
[own and not necessarily the views of my employer..]

2002-01-23 01:22:03

by Rene Rebe

Subject: Re: Athlon PSE/AGP Bug

From: Stuart Young <[email protected]>
Subject: Re: Athlon PSE/AGP Bug
Date: Wed, 23 Jan 2002 11:36:56 +1100

> At 04:52 PM 22/01/02 -0600, Steve Brueggeman wrote:
> >I would like to see some indication that someone is collecting data
> >related to "running stable with mem=nopentium on Athlon
> >architecture", and maybe we can see a pattern here. Heck, maybe we'll see
> >2 or 3 different patterns here.
>
> Well I'm quite willing to give all the system specs we have at work and the
> ones I have at home (all up, this is about 12 Athlon's that are running
> Linux, all running fine so far with no issues) towards this process.
>
> I've not seen your system specs, so I'm wondering what sort of m/board you
> have? The mention of the SiS AGP support makes me wonder if you are running
> an SiS chipset board. In the past, Linux kernel developers and the XFree86
> team have had a huge amount of trouble (or in some cases, flat refusal) in
> getting certain (usually up to date) specs out of SiS, and I'm wondering if
> maybe this could be related somehow, as none of the systems I've got have
> an SiS chipset in them (they are all AMD or VIA chipsets).

Yes, the docs and driver for the graphics part of the sis630 suck (I
helped debugging/hacking it ...) - but the sis735 runs rock solid here!
Using a mga450 and an Athlon XP1700+.

Here I only see one Athlon system crashing all the time. This is a
700MHz Duron running in an Asus A7V. With a 2.4.16 kernel compiled with
Athlon optimization all applications crash all the time (sed, cc,
gcc, sawfish - all. Simply sig-11); with a 2.4.4 kernel (using the
same .config) it seems to run just fine. 4 passes of memtest86 showed
no errors, either ...

I see the broken VIA chips involved most of the time.

We will try an i386-only optimized kernel tomorrow.

> Now I'm not saying this is an SiS issue, but maybe it's more prevalent with
> SiS chipsets? Until we get some hard data, who knows!
>
>
> Stuart Young - [email protected]
> (aka Cefiar) - [email protected]
>
> [All opinions expressed in the above message are my]
> [own and not necessarily the views of my employer..]


k33p h4ck1n6 and good night
René

--
René Rebe (Registered Linux user: #248718 <http://counter.li.org>)

eMail: [email protected]
[email protected]

Homepage: http://drocklinux.dyndns.org/rene/

Anyone sending unwanted advertising e-mail to this address will be
charged $25 for network traffic and computing time. By extracting my
address from this message or its header, you agree to these terms.

2002-01-23 02:03:00

by Gustavo Zacarias

Subject: Re: Athlon PSE/AGP Bug

Rene Rebe wrote:

> Here I only see one Athlon system crashing all the time. This is a
> 700MHz Duron running in an Asus A7V. With a 2.4.16 kernel compiled with
> Athlon optimization all applications crash all the time (sed, cc,
> gcc, sawfish - all. Simply sig-11); with a 2.4.4 kernel (using the
> same .config) it seems to run just fine. 4 passes of memtest86 showed
> no errors, either ...
>
> I see the broken via chips involved most of the time.
>
> We will try a i386-only optimized kernel tomorrow.

Hmmm... I'm running a Thunderbird 800 on an A7V (not the A7V133) without
any major problems, with 2.4.17 and Athlon optimized.
Of course I have the latest BIOS from Asus (1009); with earlier ones I
did have some AGP-related instabilities with a GeForce2 GTS.
Of course I also flashed the GF2; just the combination of both things
solved my problems, though now my ASUS GF2 is a "generic nvidia" one.
I compiled in the same run a full XFree86 4.2.0 + GNOME 1.4 without
even one coredump / sig-11, and this is a FULLY compiled GNOME.
I have 2 out of 3 DIMM slots populated with 256+64 PC133 DIMMs,
el cheapo brand, and they pass memtest86 without a hitch.
HDD is an issue... I got en masse corruption once, but then it's no
wonder with the good record IBM's 75GXPs have... (somehow traced
to ext3 + unmask IRQ on). It corrupted beyond its limits, destroying
the Windows partition data also. Now I'm on ext3 but without unmasking,
and have had no corruption so far.
I'm using NVIDIA_kernel #2314 forced to AGP 4x w/ SBA & FW on,
though with no serious 3D load beyond xscreensaver eyecandy, with no
problems whatsoever.
Distro is (kinda) Red Hat 7.2, everything compiled with gcc 2.96, EXCEPT
XFree, which doesn't like it very much (and was compiled with 2.95.3).

2002-01-23 02:13:31

by Linux Geek

Subject: Re: Athlon PSE/AGP Bug

There is a known bug involving AMD processors and AGP - info was posted
on slashdot.org and http://www.gentoo.org.
In short, the suggested fix is to pass mem=nopentium to the kernel at boot time
(using GRUB or LILO).
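
After rebooting, it is worth confirming that the option actually reached the kernel. A minimal sketch, assuming the running kernel exposes its boot command line in /proc/cmdline; the helper names are hypothetical, and the check only proves the option was passed, not that it fixes anything.

```python
def has_boot_option(cmdline: str, option: str = "mem=nopentium") -> bool:
    """True if `option` appears as a whole word on the kernel command line."""
    return option in cmdline.split()


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Return the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    if has_boot_option(read_cmdline()):
        print("mem=nopentium is active for this boot")
    else:
        print("mem=nopentium was not passed to this kernel")
```

Matching on whole words avoids a false positive from a longer option that merely contains the same text.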

Gustavo Zacarias wrote:
>
> Rene Rebe wrote:
>
> > Here I only see one Athlon system crashing all the time. This is a
> > 700MHz Duron running in an Asus A7V. With a 2.4.16 kernel compiled with
> > Athlon optimization all applications crash all the time (sed, cc,
> > gcc, sawfish - all. Simply sig-11); with a 2.4.4 kernel (using the
> > same .config) it seems to run just fine. 4 passes of memtest86 showed
> > no error, either ...
> >
> > I see the broken via chips involved most of the time.
> >
> > We will try a i386-only optimized kernel tomorrow.
>
> Hmmm... i'm running a Thunderbird 800 on an A7V (not the A7V133) without
> any major problems, with 2.4.17 and athlon optimized.
> Of course i have the latest BIOS from Asus (1009), with earlier ones i
> did have some AGP-related instabilities, with a GeForce2 GTS.
> Of course i also flashed the GF2, just the combination of both things
> solved my problems, though now my ASUS GF2 is a "generic nvidia" one.
> I compiled on the same run a full XFree86 4.2.0 + GNOME 1.4 without
> even one coredump / sig-11, and this is a FULLY compiled gnome.
> I have 2 out of 3 dimm slots populated with 256+64 pc133 dimms,
> el cheapo brand, and pass memtest86 without a hitch.
> HDD is an issue... i got en masse corruption once, but then it's no
> wonder with the good record IBM's 75GXP's have... (somehow traced
> to ext3+unmask irq on). It corrupted beyond his limits, destroying
> the windos partition data also. Now i'm with ext3 but without unmasking,
> and got no corruption so far.
> I'm using NVIDIA_kernel #2314 forced to AGP 4x w/ SBA & FW on,
> though no serious 3D load beyond xscreensaver eyecandy, with no
> problems whatsoever.
> Distro is (kinda) redhat 7.2, everything compiled with gcc 2.96, EXCEPT
> xfree that doesn't like it very much (which was compiled with 2.95.3).

--

Tom Hornyak
[email protected]
http://www.hornyaksys.com

2002-01-23 11:52:07

by Marek Mentel

Subject: Re: Athlon PSE/AGP Bug

On Tue, 22 Jan 2002 15:12:48 +0100, Halpaap, Mark (CETA) wrote:

>Hi,
>
>after applying mem=nopentium as a boot parameter
>I've been able to play tuxracer _for the first time_.
>
>Prior to this any OpenGL application deepfroze
>the system after 10-20 secs.

Yes - I had the same problem in the Quake III demo. The system hangs
after 20-30 sec.


>I do _not_ have an NVidia card, it's a Matrox G450.

Matrox G200, Abit KT7E, Duron 800

>So whatever the deeper reason, there _is_ something
>fishy that this workaround seems to fix and it seems
>not to be tied to NVidia drivers.

Yes. But this fix doesn't work 100% - my system hung
after nearly an hour of playing. Of course this is a better
result than a hang 20 sec after start - but it looks like
this is not a full solution.

--------------------------------------------------------
Marek Mentel [email protected] 2:484/3.8
INSTITUTE FOR CHEMICAL PROCESSING OF COAL , Zabrze , POLAND
NOTE: my opinions are strictly my own and not those of my employer