2003-09-17 02:23:06

by Andi Kleen

Subject: [PATCH] Athlon/Opteron Prefetch Fix for 2.6.0test5 + numbers


Here is the workaround patch for the Athlon/Opteron prefetch issue again.

Athlon/Opteron CPUs can report false page faults on prefetches in rare
cases. This patch works around it for the kernel and user space.
It checks in the slow path of the fault handler whether the faulting instruction
was a prefetch, and ignores the fault if so.

In some cases the patch can cause a recursive page fault. It has been
carefully written to stop the recursion early. To do this cleanly I had
to remove the special case of delivering SIGBUS for *_user accesses done
inside the kernel. Instead they are now reported as a standard EFAULT
without a signal.

This is much more efficient than the previous workaround used in the kernel,
which checked for AMD CPUs in every prefetch(). This can be seen
in the size of the vmlinux:

Without patch:
text data bss dec hex filename
4020232 665956 169092 4855280 4a15f0 vmlinux
With patch:
4011578 665973 169092 4846643 49f433 vmlinux

The kernel with the fault handler fix is ~8K of text smaller.

The overhead is small. First, the signal delivery path for page faults
is not very hot (rarely used, because most programs don't generate
signals for page faults). Second, the prefetch check itself typically
decides quickly, because it stops scanning at the first byte that is not
a valid prefix. I tested it using LMbench2's lat_sig:

[note - take the results with a grain of salt. They vary widely even
between runs. All results are averaged over 10 runs, after another 10 runs
to warm everything up. The tests were run on a 2.4GHz P4 Xeon.]

With prefetch check: 3.7268 microseconds
Without prefetch check: 3.65945 microseconds

This means the overhead is ~0.07 microseconds. Actually I suspect the
difference is below lmbench's accuracy, although the no-prefetch results
were systematically slightly better than the prefetch ones (as expected).
Overall the overhead is small and not really worth caring about.

For the .text savings alone I would consider applying it.
Without it the code generated for list_for_each looks pretty bad.
It also helps user space programs, which the previous workaround didn't.

x86-64 has the same issue, but I will submit that separately.

Please consider applying,

-Andi


--- linux-2.6.0test5-work/arch/i386/mm/fault.c-o 2003-05-27 03:00:20.000000000 +0200
+++ linux-2.6.0test5-work/arch/i386/mm/fault.c 2003-09-10 23:58:52.000000000 +0200
@@ -55,6 +55,77 @@
console_loglevel = loglevel_save;
}

+/* Sometimes the CPU reports invalid exceptions on prefetch.
+ Check that here and ignore.
+ Opcode checker based on code by Richard Brunner */
+static int is_prefetch(struct pt_regs *regs, unsigned long addr)
+{
+ unsigned char *instr = (unsigned char *)(regs->eip);
+ int scan_more = 1;
+ int prefetch = 0;
+ unsigned char *max_instr = instr + 15;
+
+ /* Avoid recursive faults. This is just an optimization,
+ they must be handled correctly too */
+ if (regs->eip == addr)
+ return 0;
+
+ while (scan_more && instr < max_instr) {
+ unsigned char opcode;
+ unsigned char instr_hi;
+ unsigned char instr_lo;
+
+ if (__get_user(opcode, instr))
+ break;
+
+ instr_hi = opcode & 0xf0;
+ instr_lo = opcode & 0x0f;
+ instr++;
+
+ switch (instr_hi) {
+ case 0x20:
+ case 0x30:
+ /* Values 0x26,0x2E,0x36,0x3E are valid x86
+ prefixes. In long mode, the CPU will signal
+ invalid opcode if some of these prefixes are
+ present so we will never get here anyway */
+ scan_more = ((instr_lo & 7) == 0x6);
+ break;
+
+ case 0x40:
+ /* May be valid in long mode (REX prefixes) */
+ break;
+
+ case 0x60:
+ /* 0x64 thru 0x67 are valid prefixes in all modes. */
+ scan_more = (instr_lo & 0xC) == 0x4;
+ break;
+ case 0xF0:
+ /* 0xF0, 0xF2, and 0xF3 are valid prefixes in all modes. */
+ scan_more = !instr_lo || (instr_lo>>1) == 1;
+ break;
+ case 0x00:
+ /* Prefetch instruction is 0x0F0D or 0x0F18 */
+ scan_more = 0;
+ if (__get_user(opcode, instr))
+ break;
+ prefetch = (instr_lo == 0xF) &&
+ (opcode == 0x0D || opcode == 0x18);
+ break;
+ default:
+ scan_more = 0;
+ break;
+ }
+ }
+
+#if 0
+ if (prefetch)
+ printk("%s: prefetch caused page fault at %lx/%lx\n", current->comm,
+ regs->eip, addr);
+#endif
+ return prefetch;
+}
+
asmlinkage void do_invalid_op(struct pt_regs *, unsigned long);

/*
@@ -110,7 +181,7 @@
* atomic region then we must not take the fault..
*/
if (in_atomic() || !mm)
- goto no_context;
+ goto bad_area_nosemaphore;

down_read(&mm->mmap_sem);

@@ -198,8 +269,12 @@
bad_area:
up_read(&mm->mmap_sem);

+bad_area_nosemaphore:
/* User mode accesses just cause a SIGSEGV */
if (error_code & 4) {
+ if (is_prefetch(regs, address))
+ return;
+
tsk->thread.cr2 = address;
tsk->thread.error_code = error_code;
tsk->thread.trap_no = 14;
@@ -232,6 +307,9 @@
if (fixup_exception(regs))
return;

+ if (is_prefetch(regs, address))
+ return;
+
/*
* Oops. The kernel tried to access some bad page. We'll have to
* terminate things with extreme prejudice.
@@ -286,10 +364,13 @@
do_sigbus:
up_read(&mm->mmap_sem);

- /*
- * Send a sigbus, regardless of whether we were in kernel
- * or user mode.
- */
+ /* Kernel mode? Handle exceptions or die */
+ if (!(error_code & 4))
+ goto no_context;
+
+ if (is_prefetch(regs, address))
+ return;
+
tsk->thread.cr2 = address;
tsk->thread.error_code = error_code;
tsk->thread.trap_no = 14;
@@ -298,10 +379,6 @@
info.si_code = BUS_ADRERR;
info.si_addr = (void *)address;
force_sig_info(SIGBUS, &info, tsk);
-
- /* Kernel mode? Handle exceptions or die */
- if (!(error_code & 4))
- goto no_context;
return;

vmalloc_fault:
--- linux-2.6.0test5-work/include/asm-i386/processor.h-o 2003-09-09 20:55:46.000000000 +0200
+++ linux-2.6.0test5-work/include/asm-i386/processor.h 2003-09-11 03:22:16.000000000 +0200
@@ -578,8 +578,6 @@
#define ARCH_HAS_PREFETCH
extern inline void prefetch(const void *x)
{
- if (cpu_data[0].x86_vendor == X86_VENDOR_AMD)
- return; /* Some athlons fault if the address is bad */
alternative_input(ASM_NOP4,
"prefetchnta (%1)",
X86_FEATURE_XMM,




2003-09-17 02:44:21

by Andrew Morton

Subject: Re: [PATCH] Athlon/Opteron Prefetch Fix for 2.6.0test5 + numbers

Andi Kleen <[email protected]> wrote:
>
>
> This is much more efficient than the previous workaround used in the kernel,
> which checked for AMD CPUs in every prefetch(). This can be seen
> in the size of the vmlinux:

That is hardly a serious comparison: the workaround is just to stop the
oopses while this gets sorted out. It makes no pretense at either
efficiency or permanence.

> Without patch:
> text data bss dec hex filename
> 4020232 665956 169092 4855280 4a15f0 vmlinux
> With patch:
> 4011578 665973 169092 4846643 49f433

hrm. Why did data grow?

> With prefetch check: 3.7268 microseconds
> Without prefetch check: 3.65945 microseconds

We don't know how much of this difference is due to removing the branch and
how much is due to reenabling prefetch.

It would be interesting to see comparative benchmarking between prefetch
and no prefetch at all, see whether this feature is worth its icache
footprint.

2003-09-17 03:25:55

by Andi Kleen

Subject: Re: [PATCH] Athlon/Opteron Prefetch Fix for 2.6.0test5 + numbers

On Tue, 16 Sep 2003 19:44:46 -0700
Andrew Morton <[email protected]> wrote:

> Andi Kleen <[email protected]> wrote:
> >
> >
> > This is much more efficient than the previous workaround used in the kernel,
> > which checked for AMD CPUs in every prefetch(). This can be seen
> > in the size of the vmlinux:
>
> That is hardly a serious comparison: the workaround is just to stop the
> oopses while this gets sorted out. It makes no pretense at either
> efficiency or permanence.

It was mainly to show that the patch is a better solution than what is
currently in the kernel.

>
> > Without patch:
> > text data bss dec hex filename
> > 4020232 665956 169092 4855280 4a15f0 vmlinux
> > With patch:
> > 4011578 665973 169092 4846643 49f433
>
> hrm. Why did data grow?

Probably because of the two __get_user calls in is_prefetch. I suspect their
exception table entries get accounted to data.

>
> > With prefetch check: 3.7268 microseconds
> > Without prefetch check: 3.65945 microseconds
>
> We don't know how much of this difference is due to removing the branch and
> how much is due to reenabling prefetch.

None at all. The test did not measure prefetch performance, just
how much overhead is_prefetch adds to the page fault path.
AFAIK there is no prefetch in the do_page_fault -> fail to find vma -> signal delivery
path.

As for measuring prefetch, I'm not sure it makes sense. It depends a lot
on how fast your memory is, how thrashed your caches are, how many
CPUs you have (e.g. on Opteron memory latency varies with the
number of CPUs), etc. So even if it isn't a win on some systems
it can help on others.

> It would be interesting to see comparative benchmarking between prefetch
> and no prefetch at all, see whether this feature is worth its icache
> footprint.

Maybe. But that would really be a separate thing. Even with prefetch completely
removed from the kernel we would still need a workaround for user space.

-Andi

2003-09-17 04:54:10

by Nick Piggin

Subject: Re: [PATCH] Athlon/Opteron Prefetch Fix for 2.6.0test5 + numbers



Andrew Morton wrote:

>Andi Kleen <[email protected]> wrote:
>
>>
>>This is much more efficient than the previous workaround used in the kernel,
>>which checked for AMD CPUs in every prefetch(). This can be seen
>>in the size of the vmlinux:
>>
>
>That is hardly a serious comparison: the workaround is just to stop the
>oopses while this gets sorted out. It makes no pretense at either
>efficiency or permanence.
>
>
>>Without patch:
>> text data bss dec hex filename
>>4020232 665956 169092 4855280 4a15f0 vmlinux
>>With patch:
>>4011578 665973 169092 4846643 49f433
>>
>
>hrm. Why did data grow?
>
>
>>With prefetch check: 3.7268 microseconds
>>Without prefetch check: 3.65945 microseconds
>>
>
>We don't know how much of this difference is due to removing the branch and
>how much is due to reenabling prefetch.
>
>It would be interesting to see comparative benchmarking between prefetch
>and no prefetch at all, see whether this feature is worth its icache
>footprint.
>

The test was on a Pentium 4, so it's only measuring the removal of the extra code.

I think Andi's patch is required (especially because it fixes
userspace), and under the current cpu selection scheme, it is
implemented correctly (although I am now at a loss as to what the
generic thing is for).

The conditional compilation thing is a separate issue. This patch may
have just broken a few camels' backs.

What is intriguing to me is the "Its only a 2% slowdown of the page
fault for every cpu other than K[78] for this single workaround. There
is no point to conditional compilation" attitude some people have.
Of course, it's only 2% on a page fault, not anywhere near 2% of kernel
performance as a whole, so maybe that is justified.

Just repeating though, that is a separate issue and I think Andi's patch
is needed.


2003-09-17 05:09:49

by Andrew Morton

Subject: Re: [PATCH] Athlon/Opteron Prefetch Fix for 2.6.0test5 + numbers

Nick Piggin <[email protected]> wrote:
>
> What is intriguing to me is the "Its only a 2% slowdown of the page
> fault for every cpu other than K[78] for this single workaround. There
> is no point to conditional compilation" attitude some people have.
> Of course, its only 2% on a pagefault, not anywhere near 2% of kernel
> performance as a whole, so maybe that is justified.

Absolutely. But it's a bit of a pain finding a config option which says
"this CPU might need the fixup".

> Just repeating though, that is a seperate issue and I think Andi's patch
> is needed.

It is unquestionably needed - the kernel _has_ to perform the fixup for this
CPU erratum.


But I would like to see some evidence that prefetch ever provides any
performance gain in-kernel. I spent some time fiddling a while back and
was unable to demonstrate any difference.

2003-09-17 05:19:24

by Nick Piggin

Subject: Re: [PATCH] Athlon/Opteron Prefetch Fix for 2.6.0test5 + numbers

Andrew Morton wrote:

>Nick Piggin <[email protected]> wrote:
>
>>What is intriguing to me is the "Its only a 2% slowdown of the page
>> fault for every cpu other than K[78] for this single workaround. There
>> is no point to conditional compilation" attitude some people have.
>> Of course, its only 2% on a pagefault, not anywhere near 2% of kernel
>> performance as a whole, so maybe that is justified.
>>
>
>Absolutely. But it's a bit of a pain finding a config option which says
>"this CPU might need the fixup".
>

Right. It obviously can't be done using the current system.

>
>> Just repeating though, that is a seperate issue and I think Andi's patch
>> is needed.
>>
>
>It is unquestionably needed - the kernel _has_ to perform the fixup for this
>CPU erratum.
>
>
>But I would like to see some evidence that prefetch ever provides any
>performance gain in-kernel. I spent some time fiddling a while back and
>was unable to demonstrate any difference.
>
>
>

OK. I just liked this patch because apparently it fixes userspace as
well. Disabling prefetch for the kernel doesn't.


2003-09-17 05:26:53

by Andi Kleen

Subject: Re: [PATCH] Athlon/Opteron Prefetch Fix for 2.6.0test5 + numbers

On Wed, 17 Sep 2003 14:53:40 +1000
Nick Piggin <[email protected]> wrote:

>
> The conditional compilation thing is a seperate issue. This patch may
> have just broken a few camels' backs.
>
> What is intriguing to me is the "Its only a 2% slowdown of the page
> fault for every cpu other than K[78] for this single workaround. There

Erm. It helps to get your numbers right first when arguing. Why did I waste
the time benchmarking and collecting data if people afterwards still argue
with bogus numbers? :)

It is not 2%, but <1% [~0.6% to be exact]

(probably below the statistical error of the test; LMBench results vary more
than that between runs)

Then it is not for all page faults, but only for a very narrow special
case - a page fault that is not handled, but causes a signal. These are
not very common. Arguably there are some applications that use this
stuff (like generational garbage collectors), but these are not exactly
common.

In summary, it causes a 0.6% slowdown in a quite obscure use case.

That's small enough that it is best to just always enable it, because
the cost of handling the support requests when people forget to enable
it would be much greater.

-Andi

2003-09-17 05:44:22

by Nick Piggin

Subject: Re: [PATCH] Athlon/Opteron Prefetch Fix for 2.6.0test5 + numbers



Andi Kleen wrote:

>On Wed, 17 Sep 2003 14:53:40 +1000
>Nick Piggin <[email protected]> wrote:
>
>
>>The conditional compilation thing is a seperate issue. This patch may
>>have just broken a few camels' backs.
>>
>>What is intriguing to me is the "Its only a 2% slowdown of the page
>>fault for every cpu other than K[78] for this single workaround. There
>>
>
>Erm. It helps to get your numbers right first when arguing. Why did I waste the
>time on benchmarking and collecting data when people afterwards still argue with
>bogus numbers? @)
>

I was using your numbers.

>
>It is not 2%, but <1% [~0.6% to be exact]
>

100 * (3.7268 - 3.65945) / 3.65945 = 1.8 (am I wrong?)

>
>(probably lower the statistical error for the test, LMBench results vary more
>than that on multiple runs)
>
>Then it is not for all page faults, but only for a very narrow special
>case - a page fault that is not handled, but causes a signal. These are
>not very common. Arguably there are some applications that use this
>stuff (like generational garbage collectors), but these are not exactly
>common.
>

I missed that, sorry.

>
>In summary, it causes 0.6% slowdown in an quite obscure use case.
>
>That's small enough that it is best to just always enable it, because
>the cost of processing any support request when people forget to enable
>etc. is much greater.
>

I agree that with the current cpu selection arrangements it has to be
always enabled. I'm talking about Adrian's patch though.

I wasn't really arguing about it based on it causing a big slowdown. I
just don't agree with your opinion (which you're very entitled to) on
the matter.

I think that messing up your config options is a user error anyway. And
obviously a big reason for config options is so you don't have to
compile everything, are we really going to try catering for people who
make the wrong choices?

Aside, I think Adrian's patch makes cpu selection simpler for users, and
if it was combined with a warning for booting a non selected cpu I can't
see it being any more of a support problem.


2003-09-17 07:17:33

by Dave Hansen

Subject: Re: [PATCH] Athlon/Opteron Prefetch Fix for 2.6.0test5 + numbers

On Tue, 2003-09-16 at 22:08, Andrew Morton wrote:
> But I would like to see some evidence that prefetch ever provides any
> performance gain in-kernel. I spent some time fiddling a while back and
> was unable to demonstrate any difference.

For SDET on the NUMA-Q, it's pretty small. Looks like well less than
1%. Mostly lost in the noise, though.

DISCLAIMER: SPEC(tm) and the benchmark name SDET(tm) are registered
trademarks of the Standard Performance Evaluation Corporation. This
benchmarking was performed for research purposes only, and the run
results are non-compliant and not-comparable with any published results.

Prefetch on:
Scripts | Average Throughput | Standard Deviation
----------+---------------------+---------------------
16 | 15795.9100 | 711.3221
32 | 16450.5800 | 292.6028
64 | 15800.2400 | 126.2358

Prefetch off:
Scripts | Average Throughput | Standard Deviation
----------+---------------------+---------------------
16 | 15672.8600 | 438.5506
32 | 16376.4100 | 364.5610
64 | 15744.8100 | 208.9217


For those of you who care, I generated this with a neat little tool that
some interns cooked up for us this summer. You hand it a little file
like this and it throws a wealth of information back at you.
There's a central machine which has a database of other machines and
figures out the best fit for the job (more than one machine can be
requested). In this case it's easy because there are very few 16x
machines in the database. Oh, and like good interns, they stole a lot
of code from Steve Pratt's "autobench".

class host num_cpus=16
+$mytest sdet "1 4 16 32 64" -l /tmp/sdet

disuseprofiler sar

# build the kernel with prefetching off
build stock 2.6.0-test5 \
-p http://elm3b114.beaverton.ibm.com/patches/prefetch-off.patch \
-m -j8
boot
# sdet runs fast on ramfs
fs -mountramfs /tmp/sdet
run "sdet" $mytest
setup umount /tmp/sdet

build stock 2.6.0-test5 -m -j8
boot
fs -mountramfs /tmp/sdet
run "sdet" $mytest
setup umount /tmp/sdet

--
Dave Hansen
[email protected]


Attachments:
preload-off.patch (1.07 kB)

2003-09-17 19:54:34

by Linus Torvalds

Subject: Re: [PATCH] Athlon/Opteron Prefetch Fix for 2.6.0test5 + numbers


On Wed, 17 Sep 2003, Nick Piggin wrote:
>
> What is intriguing to me is the "Its only a 2% slowdown of the page
> fault for every cpu other than K[78] for this single workaround. There
> is no point to conditional compilation" attitude some people have.

I wouldn't worry about performance as much as correctness. I'm a lot more
worried about the notion of taking recursive pagefaults than about 2%.

Which means that I think the code should be disabled on non-buggy
hardware. Not because of the 2%, but because it just exposes the potential
for new bugs.

Linus

2003-09-17 20:21:04

by Andi Kleen

Subject: Re: [PATCH] Athlon/Opteron Prefetch Fix for 2.6.0test5 + numbers

On Wed, Sep 17, 2003 at 12:53:59PM -0700, Linus Torvalds wrote:
>
> On Wed, 17 Sep 2003, Nick Piggin wrote:
> >
> > What is intriguing to me is the "Its only a 2% slowdown of the page
> > fault for every cpu other than K[78] for this single workaround. There
> > is no point to conditional compilation" attitude some people have.
>
> I wouldn't worry about performance as much as correctness. I'm a lot more
> worried about the notion of taking recursive pagefaults than about 2%.

I carefully designed it to never recurse more than once. The original
version (before I posted it) had some corner cases that violated this, but
the latest one is IMHO bulletproof in this regard.

Logic is:

when the fault came from user space as seen in CS it is ok to fault again.

when the fault came from kernel space we must always check the exception
table first. The __get_user in is_prefetch has an exception table entry
and will be caught by this.
[This is why I changed the SIGBUS path slightly - it previously did
not follow this sequence]

Also when the fault address is equal to EIP we don't check. This avoids
a recursion when the kernel jumps to zero. When this is not the case the
instruction is guaranteed to be mapped, because an unmapped instruction
would always cause a page fault on EIP first.

About the only chance of doing multiple recursions would be another CPU
corrupting the kernel page table in parallel while the fault happens,
but I don't see any chance to handle this properly.

-Andi

2003-09-17 20:51:34

by Linus Torvalds

Subject: Re: [PATCH] Athlon/Opteron Prefetch Fix for 2.6.0test5 + numbers


On Wed, 17 Sep 2003, Andi Kleen wrote:
>
> Also when the fault address is equal EIP we don't check.

And this is a good example of something that can break.

The fault address is a linear address after segment translation. The EIP
is _before_ segment translation.

You don't translate the EIP with the CS base.

Which means that the two can match even if they have nothing to do with
each other. It will happen in vm86 mode and in things like wine. So that
check is broken.

Also, for the same reason, you won't fix up prefetches in wine.

Also, you do things like comparing pointers for less/greater than, and at
least some versions of gcc have done that wrong - using signed comparisons.
Which leaves you potentially open to denial-of-service attacks if somebody
generates a long list of prefixes around the 0x80000000 mark and the size
check doesn't catch them.

In short, this is harder than you seem to think. And right now you _do_ do
the wrong things for Wine, and I think that not only should that be fixed,
it should be made athlon-specific so that any other potential bugs won't
impact people that it shouldn't impact.

See?

Linus

2003-09-17 21:12:18

by Andi Kleen

Subject: Re: [PATCH] Athlon/Opteron Prefetch Fix for 2.6.0test5 + numbers

On Wed, Sep 17, 2003 at 01:50:59PM -0700, Linus Torvalds wrote:
>
> On Wed, 17 Sep 2003, Andi Kleen wrote:
> >
> > Also when the fault address is equal EIP we don't check.
>
> And this is a good example of something that can break.

>
> The fault address is a linear address after segment translation. The EIP
> is _before_ segment translation.

It doesn't matter. The EIP check only catches kernel crashes; for
user mode it is not relevant because the earlier checks catch this.

Even in kernel mode it is not critical, it just gives a nicer
oops when you do call *0.

I added a check for LDT entries now. This means prefetch exceptions
inside non-standard code segments will not be handled (I will fix
that in a follow-up patch).

> You don't translate the EIP with the CS base.
>
> Which means that the two can match even if they have nothing to do with
> each other. It will happen in vm86 mode and in things like wine. So that
> check is broken.
>
> Also, for the same reason, you won't fix up prefetches in wine.

Hmm. I didn't think wine normally used non zero segment bases.

But you're right. When the segment is in a LDT it should add in the base.

I don't think it's a big problem because it could only hit
16-bit Wine, and I doubt those programs use prefetch. But arguably
it should be handled, agreed.

But could you still consider merging the patch? I will fix this in
a follow-up patch - it seems like a quite obscure issue.

> Also, you do things like comparing pointers for less/greater than, and at
> least some versions of gcc has done that wrong - using signed comparisons.

Really? Is that any version we still support (2.95+) ?
It is certainly legal ISO-C. I changed it for now.

> In short, this is harder than you seem to think. And right now you _do_ do
> the wrong things for Wine, and I think that not only should that be fixed,
> it should be made athlon-specific so that any other potential bugs won't
> impact people that it shouldn't impact.

Ok, here is a new patch which also includes an Athlon/Opteron check,
doesn't do pointer comparisons, and ignores CS in the LDT for now.

Please consider applying.

-Andi


--- linux-2.6.0test5-work/arch/i386/mm/fault.c-o 2003-05-27 03:00:20.000000000 +0200
+++ linux-2.6.0test5-work/arch/i386/mm/fault.c 2003-09-17 23:07:03.000000000 +0200
@@ -55,6 +55,85 @@
console_loglevel = loglevel_save;
}

+/* Sometimes AMD K7/K8 reports invalid exceptions on prefetch.
+ Check that here and ignore.
+ Opcode checker based on code by Richard Brunner */
+static int __is_prefetch(struct pt_regs *regs, unsigned long addr)
+{
+ unsigned char *instr = (unsigned char *)(regs->eip);
+ int scan_more;
+ int prefetch = 0;
+ int c;
+
+ /* Avoid recursive faults. This is just an optimization,
+ they must be handled correctly too */
+ if (regs->eip == addr)
+ return 0;
+
+ /* Don't check for LDT code segments because they could have
+ non zero bases. Better would be to add in the base in this case. */
+ if (regs->xcs & (1<<2))
+ return 0;
+
+ scan_more = 1;
+ for (c = 0; c < 15 && scan_more; c++) {
+ unsigned char opcode;
+ unsigned char instr_hi;
+ unsigned char instr_lo;
+
+ if (__get_user(opcode, instr))
+ break;
+
+ instr_hi = opcode & 0xf0;
+ instr_lo = opcode & 0x0f;
+ instr++;
+
+ switch (instr_hi) {
+ case 0x20:
+ case 0x30:
+ /* Values 0x26,0x2E,0x36,0x3E are valid x86
+ prefixes. In long mode, the CPU will signal
+ invalid opcode if some of these prefixes are
+ present so we will never get here anyway */
+ scan_more = ((instr_lo & 7) == 0x6);
+ break;
+
+ case 0x40:
+ /* May be valid in long mode (REX prefixes) */
+ break;
+
+ case 0x60:
+ /* 0x64 thru 0x67 are valid prefixes in all modes. */
+ scan_more = (instr_lo & 0xC) == 0x4;
+ break;
+ case 0xF0:
+ /* 0xF0, 0xF2, and 0xF3 are valid prefixes in all modes. */
+ scan_more = !instr_lo || (instr_lo>>1) == 1;
+ break;
+ case 0x00:
+ /* Prefetch instruction is 0x0F0D or 0x0F18 */
+ scan_more = 0;
+ if (__get_user(opcode, instr))
+ break;
+ prefetch = (instr_lo == 0xF) &&
+ (opcode == 0x0D || opcode == 0x18);
+ break;
+ default:
+ scan_more = 0;
+ break;
+ }
+ }
+ return prefetch;
+}
+
+static inline int is_prefetch(struct pt_regs *regs, unsigned long addr)
+{
+ if (likely(boot_cpu_data.x86_vendor != X86_VENDOR_AMD ||
+ boot_cpu_data.x86 < 6))
+ return 0;
+ return __is_prefetch(regs, addr);
+}
+
asmlinkage void do_invalid_op(struct pt_regs *, unsigned long);

/*
@@ -110,7 +189,7 @@
* atomic region then we must not take the fault..
*/
if (in_atomic() || !mm)
- goto no_context;
+ goto bad_area_nosemaphore;

down_read(&mm->mmap_sem);

@@ -198,8 +277,12 @@
bad_area:
up_read(&mm->mmap_sem);

+bad_area_nosemaphore:
/* User mode accesses just cause a SIGSEGV */
if (error_code & 4) {
+ if (is_prefetch(regs, address))
+ return;
+
tsk->thread.cr2 = address;
tsk->thread.error_code = error_code;
tsk->thread.trap_no = 14;
@@ -232,6 +315,9 @@
if (fixup_exception(regs))
return;

+ if (is_prefetch(regs, address))
+ return;
+
/*
* Oops. The kernel tried to access some bad page. We'll have to
* terminate things with extreme prejudice.
@@ -286,10 +372,13 @@
do_sigbus:
up_read(&mm->mmap_sem);

- /*
- * Send a sigbus, regardless of whether we were in kernel
- * or user mode.
- */
+ /* Kernel mode? Handle exceptions or die */
+ if (!(error_code & 4))
+ goto no_context;
+
+ if (is_prefetch(regs, address))
+ return;
+
tsk->thread.cr2 = address;
tsk->thread.error_code = error_code;
tsk->thread.trap_no = 14;
@@ -298,10 +387,6 @@
info.si_code = BUS_ADRERR;
info.si_addr = (void *)address;
force_sig_info(SIGBUS, &info, tsk);
-
- /* Kernel mode? Handle exceptions or die */
- if (!(error_code & 4))
- goto no_context;
return;

vmalloc_fault:
--- linux-2.6.0test5-work/include/asm-i386/processor.h-o 2003-09-09 20:55:46.000000000 +0200
+++ linux-2.6.0test5-work/include/asm-i386/processor.h 2003-09-11 03:22:16.000000000 +0200
@@ -578,8 +578,6 @@
#define ARCH_HAS_PREFETCH
extern inline void prefetch(const void *x)
{
- if (cpu_data[0].x86_vendor == X86_VENDOR_AMD)
- return; /* Some athlons fault if the address is bad */
alternative_input(ASM_NOP4,
"prefetchnta (%1)",
X86_FEATURE_XMM,

2003-09-18 15:39:06

by Jamie Lokier

Subject: Re: [PATCH] Athlon/Opteron Prefetch Fix for 2.6.0test5 + numbers

Andi Kleen wrote:
> I added a check for LDT entries now. This means prefetch exceptions
> inside non standard code segments will be not handled (I will fix
> that in a follow up patch)

> + /* Don't check for LDT code segments because they could have
> + non zero bases. Better would be to add in the base in this case. */
> + if (regs->xcs & (1<<2))
> + return 0;

It is possible to have a non-standard code segment in the GDT, too.
Thus a better check is "unlikely((regs->xcs & 0xffff) != __USER_CS)".

One way to add in the correct base would be to use the real segment
register. Don't call __get_user: call your own special function which
uses %gs, and temporarily load regs->xcs into %gs.

Alternatively you can just calculate the base and add it.

-- Jamie

2003-09-18 16:05:19

by Jamie Lokier

Subject: Re: [PATCH] Athlon/Opteron Prefetch Fix for 2.6.0test5 + numbers

Jamie Lokier wrote:
> > + /* Don't check for LDT code segments because they could have
> > + non zero bases. Better would be to add in the base in this case. */
> > + if (regs->xcs & (1<<2))
> > + return 0;
>
> It is possible to have a non-standard code segment in the GDT, too.
> Thus a better to check is "unlikely((regs->xcs & 0xffff) != __USER_CS)".

Ignore me here, I'm being dense - forgetting about __KERNEL_CS :)

My point is simply that non-zero base GDT segments are possible in
userspace, and whatever code you add later to add the base should
be aware of that.

-- Jamie

2003-09-18 17:06:32

by Andi Kleen

Subject: Re: [PATCH] Athlon/Opteron Prefetch Fix for 2.6.0test5 + numbers

On Thu, Sep 18, 2003 at 05:04:56PM +0100, Jamie Lokier wrote:
> Jamie Lokier wrote:
> > > + /* Don't check for LDT code segments because they could have
> > > + non zero bases. Better would be to add in the base in this case. */
> > > + if (regs->xcs & (1<<2))
> > > + return 0;
> >
> > It is possible to have a non-standard code segment in the GDT, too.
> > Thus a better to check is "unlikely((regs->xcs & 0xffff) != __USER_CS)".
>
> Ignore me here, I'm being dense - forgetting about __KERNEL_CS :)
>
> My point is simply that non-zero base GDT segments are possible in
> userspace, and whatever code you add later to add the base should
> be aware of that.

I don't see how a non-standard GDT segment is possible in user space. The GDT
is only managed by the kernel. 2.6 offers to change it for NPTL, but
that only applies to data segments.

In vm86 mode the user can load a different base without an LDT, but that
should not matter here (although it may be better to check for vm86 mode
too).

-Andi

2003-09-18 19:48:29

by Jamie Lokier

Subject: Re: [PATCH] Athlon/Opteron Prefetch Fix for 2.6.0test5 + numbers

Andi Kleen wrote:
> > My point is simply that non-zero base GDT segments are possible in
> > userspace, and whatever code you add later to add the base should
> > be aware of that.
>
> I don't see how a non standard GDT is possible in user space. The GDT
> is only managed by the kernel. 2.6 offers to change it for NPTL, but
> that only applies to data segments.

I don't see anything in set_thread_area() which restricts them to data
segments. Btw, NPTL isn't the only userspace code which uses this
function.

Admittedly it would be quite odd to use set_thread_area() to create
and use a GDT code segment - but given that you can do it, it would be
_really_ odd if prefetch instructions gave spurious faults in these
segments yet were fixed up elsewhere.

> In vm86 mode the user can load a different base without LDT, but that
> should not matter here (although it may be better to check for VM86 mode
> too)

Shouldn't spurious faults in prefetch instructions in vm86 mode be
fixed up too?

-- Jamie

2003-09-19 07:13:12

by kaih

Subject: Re: [PATCH] Athlon/Opteron Prefetch Fix for 2.6.0test5 + numbers

[email protected] (Andi Kleen) wrote on 17.09.03 in <[email protected]>:

> On Wed, Sep 17, 2003 at 01:50:59PM -0700, Linus Torvalds wrote:

> > Also, you do things like comparing pointers for less/greater than, and at
> > least some versions of gcc has done that wrong - using signed comparisons.
>
> Really? Is that any version we still support (2.95+) ?
> It is certainly legal ISO-C. I changed it for now.

It certainly is *not* legal ISO C, and never was.

Pointer comparison is *only* legal in ISO C if both pointers point into the
same object - the same variable, or the same piece of malloc'd memory. Anything
else is undefined.

MfG Kai

2003-09-19 10:03:02

by Andreas Schwab

Subject: Re: [PATCH] Athlon/Opteron Prefetch Fix for 2.6.0test5 + numbers

[email protected] (Kai Henningsen) writes:

> [email protected] (Andi Kleen) wrote on 17.09.03 in <[email protected]>:
>
>> On Wed, Sep 17, 2003 at 01:50:59PM -0700, Linus Torvalds wrote:
>
>> > Also, you do things like comparing pointers for less/greater than, and at
>> > least some versions of gcc has done that wrong - using signed comparisons.
>>
>> Really? Is that any version we still support (2.95+) ?
>> It is certainly legal ISO-C. I changed it for now.
>
> It certainly is *not* legal ISO C, and never was.

Certainly the kernel isn't using ISO C, and never will.

IMHO you can just compare two pointers to char in GNU C and expect to get
a meaningful result. I think the bug Linus was thinking of was some
problem with counting loops, where gcc transformed the index variable into
a pointer but then got the terminating comparison wrong.

Andreas.

--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 Nürnberg
Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."