Hi.
Perhaps some of you have read my two older threads:
http://marc.theaimsgroup.com/?t=116312440000001&r=1&w=2 and the even
older http://marc.theaimsgroup.com/?t=116291314500001&r=1&w=2
The issue was basically the following:
I found a severe bug, mainly by luck, because it occurs very rarely.
My test looks like this: I have about 30GB of test data on my hard
disk. I repeatedly verify sha512 sums of these files and check whether
errors occur.
One test pass verifies the 30GB 50 times; about one to four
differences are found in each pass.
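The test procedure can be sketched roughly as follows. This is a minimal
illustration in Python, not the script actually used; the function names
and the chunk size are assumptions. Note that such a test only exercises
the disk path if the data set is larger than RAM (30GB versus 4GB here),
otherwise repeated reads may be served from the page cache.

```python
import hashlib
from pathlib import Path


def sha512_of(path, chunk_size=1 << 20):
    """Return the hex sha512 digest of one file, read in chunks."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def record_sums(root):
    """Compute reference digests once for every regular file below root."""
    return {p: sha512_of(p) for p in sorted(Path(root).rglob("*")) if p.is_file()}


def verify_passes(reference, passes=50):
    """Re-hash every file `passes` times; return (pass_number, path) per mismatch."""
    mismatches = []
    for n in range(1, passes + 1):
        for path, expected in reference.items():
            if sha512_of(path) != expected:
                mismatches.append((n, path))
    return mismatches
```

An empty list from verify_passes() means no difference was seen in any
pass; each entry otherwise names the pass and the file that differed.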
The corrupted data is not one single, completely wrong block of data.
If you look at the area of the file where differences are found, some
bytes are OK and some are wrong, seemingly at random.
Also, there seems to be no event that triggers the corruption; it
seems to be random, too.
It is definitely not a hardware issue (see my old threads, my emails
to Tyan/Hitachi and my "workaround" below). My system isn't
overclocked.
My System:
Mainboard: Tyan S2895
Chipsets: Nvidia nforce professional 2200 and 2050 and AMD 8131
CPU: 2x DualCore Opterons model 275
RAM: 4GB Kingston Registered/ECC
Diskdrives: IBM/Hitachi: 1 PATA, 2 SATA
The data corruption error occurs on all drives.
You might have a look at the emails between me, Tyan and Hitachi;
they probably contain lots of valuable information (especially about
my different tests).
Some days ago, an engineer at Tyan suggested that I boot the kernel
with mem=3072M.
When doing this, the issue did not occur (I don't want to say it was
solved. Why? See my last emails to Tyan!)
Then he suggested that I disable the memory hole mapping in the BIOS.
When doing so, the error doesn't occur either.
But I lose about 2GB of RAM, and, more importantly, I can't believe
that this is responsible for the whole issue. I don't consider it a
solution but rather a poor workaround which perhaps only by luck avoids
the issue (Why? See my last emails to Tyan ;) )
So I'd like to ask you if you perhaps could read the current information
in this and previous mails,.. and tell me your opinions.
It is very likely that a large number of users suffer from this error
(namely all Nvidia chipset users) but only a few identify it as an
error because it's so rare (there are some; I found most of them in
the Nvidia forums, and they have exactly the same issue).
Perhaps someone has an idea why disabling the memhole mapping solves
it. I've always thought that memhole mapping just moves some address
space to higher addresses to avoid the conflict between address space
for PCI devices and address space for physical memory.
But this should be just a simple addition and should not solve such an
obviously complex error.
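To make the "simple addition" concrete, here is a minimal sketch of that
mental model in Python. It is not taken from any AMD or Tyan
documentation: the 3 GiB hole base and 1 GiB hole size are assumptions
chosen for illustration (the real hole on this board may be larger, given
the roughly 2GB reported lost), and the function names are hypothetical.

```python
GIB = 1 << 30


def dram_to_physical(dram_offset, hole_base=3 * GIB, hole_size=1 * GIB):
    """Map an offset into installed DRAM to a CPU physical address.

    Assumed model: DRAM below the hole keeps its address; DRAM that would
    collide with the PCI hole is hoisted above it by adding hole_size.
    """
    if dram_offset < hole_base:
        return dram_offset
    return dram_offset + hole_size


def usable_without_remap(installed, hole_base=3 * GIB):
    """With remapping disabled, DRAM hidden behind the hole is simply lost."""
    return min(installed, hole_base)
```

Under these assumed numbers, dram_to_physical(3 * GIB) gives 4 * GIB, and
usable_without_remap(4 * GIB) gives 3 * GIB. Nothing in this model touches
individual bytes within a page, which is why it is hard to see how the
remapping alone could produce the observed corruption pattern.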
Lots of thanks in advance.
Best wishes,
Chris.
#########################################################################
### email #1 to Tyan/Hitachi ###
#########################################################################
(Sorry for reposting, but the AMD support system requires some keywords in
the subject, and I wanted to have the correct subject for all other parties
(Tyan and Hitachi) too, so that CC'ing would be possible for all.)
Hi.
I provide this information to:
- Tyan ([email protected]) - Mr. Rodger Dusatko
- Hitachi ([email protected] , please add the GST Support
Request #627-602-082-5 in the subject) Mr. Schledz)
- and with this email for the first time to AMD [email protected]
(for the AMD people: please have a look at the information at the very
end of this email first,... there you'll find links where you can read
the introduction and description about the whole issue).
It might be useful if you contact each other (and especially Nvidia,
which I wasn't able to contact myself), but please CC me in all your
communications.
Also, please forward my emails/information to your responsible technical
engineers and developers.
Please do not ignore this problem:
- it exists,
- several users are experiencing it (thus this is not a single failure
of my system),
- it can cause severe data corruption (which is even more grave, as
a user won't notice it through error messages) and
- it happens with different operating systems (at least Linux and Windows).
This is my current state of testing. For further information,.. please
do not hesitate to ask.
You'll find old information (included in my previous mails or found at
the linux-kernel mailinglist thread I've included in my mails) at the end.
- In the meantime I no longer use diff for my tests, simply
because it takes much longer than using sha512sums to verify
data integrity (but this has no effect on the testing or the issue
itself, it just proves that the error is not in the diff program).
- I always test 30GB instead of 15
- As before I'm still very sure, that the following components are fully
working and not damaged (see my old mails or lkml):
CPUs => due to extensive GIMPS/mprime torture tests
memory => due to extensive memtest86+ tests
harddisks => because I use three different disks (SATA-II and PATA) (but
all from IBM/Hitachi or Hitachi) and I did extensive badblock scans
temperature should be OK in my system => proper chassis (not one of the
cheap ones) with several fans, CPUs between 38°C and 45°C, system ambient
about 46°C, video card between 55°C and 88°C (when under full 3D use).
The chipsetS (!) don't have temperature monitoring and seem to be
quite hot, but according to Tyan this is normal.
OK, now my current state:
- I found (although it was difficult) a lot of resources on the internet
where users report the same or a very similar problem using the
same hardware components. Some of them:
http://forums.nvidia.com/index.php?showtopic=8171&st=0
http://forums.nvidia.com/index.php?showtopic=18923
http://lkml.org/lkml/2006/8/14/42 (see http://lkml.org/lkml/2006/8/15/109)
Note that I've opened a thread at the nvidia forums myself:
http://forums.nvidia.com/index.php?showtopic=21576
All of them have in common, that the issue is/was not a hardware failure
and it seems that none of them was able to reproduce the failure.
- As far as I understand the Tyan S2895 mainboard manual
ftp://ftp.tyan.com/manuals/m_s2895_101.pdf on page 9,... both the IDE
and SATA are connected to the Nvidia nforce professional 2200,.. so this
may be nvidia related
(If anyone of you has the ability to contact nvidia,.. please do so and
send them all my information (also the old one). It seems that it's not
easily possible to contact them for "end-users")
- I tried different cable routings in my chassis (as far as this was
possible) which did not solve the problem.
I also switched off all other devices in my rooms that might produce
electromagnetic interference,
so electromagnetic interference is an unlikely cause.
- I found the errata for the AMD 8131 (which is one of my chipsets):
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26310.pdf
Please have a look at it (all the issues) as some might be responsible
for the whole issue.
- I tried to use older BIOS versions (1.02 and 1.03) but all of them
gave me an OPROM error (at bus 12 device 06 function 1). The system
still booted despite that, but the problem still exists.
According to Linux's dmesg this is:
12:06.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X
Fusion-MPT Dual Ultra320 SCSI (rev 07)
12:06.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X
Fusion-MPT Dual Ultra320 SCSI (rev 07)
- I tried with 8131 Errata 56 PCLK (which actually disables the AMD
Errata 56 fix) but the issue still exists
- I activated my BIOS's Spread Spectrum Option (although it is not
described what it does).
Is this just for harddisks? And would I have to activate Spread Spectrum
in the Hitachi Feature Tools, too, for it to have an effect?
- I tried everything with the lowest possible BIOS settings,.. which
solved nothing.
- According to the information found in the threads at the nvidia boards
(see the links above), this may be a problem related to the combination
of nvidia chipsets with Hitachi, Maxtor (and even other manufacturers') HDDs.
Some even claimed that deactivation of Native Command Queueing (NCQ)
helped,.. BUT only for a limited time.
But as far as I know, Linux doesn't support NCQ at all (at the moment).
Thank you very much for now.
Best wishes,
Christoph Anton Mitterer.
----------------------------
Old information:
As a reminder my software/hardware data:
CPU: 2x DualCore Opteron 275
Mainboard: Tyan Thunder K8WE (S2895)
Chipsets: Nvidia nForce professional 2200, Nvidia nForce professional
2050, AMD 8131
Memory: Kingston ValueRAM 4x 1GB Registered ECC
Harddisks: 1x PATA IBM/Hitachi, 2x SATA IBM/Hitachi
Additional Devices/Drives: Plextor PX760A DVD/CD, TerraTec Aureon 7.1
Universe soundcard, Hauppauge Nova 500T DualDVB-T card.
Distribution: Debian sid
Kernel: self-compiled 2.6.18.2 (see below for .config) with applied EDAC
patches
The system should be cooled enough so I don't think that this comes from
overheating issues. Nothing is overclocked.
The issue was:
For an in depth description of the problem please have a look at the
linux-kernel mailing list.
The full, current thread:
http://marc.theaimsgroup.com/?t=116312440000001&r=1&w=2
(You'll find there my system specs, too.)
An older thread with the same problem, but where I thought the problem was
FAT32 related (anyway, might be interesting, too):
http://marc.theaimsgroup.com/?t=116291314500001&r=1&w=2
#########################################################################
### email #2 to Tyan/Hitachi ###
#########################################################################
Hi Mr. Dusatko, hi Mr. Ebner.
Rodger Dusatko - Tyan wrote:
> > Thanks so much for your e-mail.
>
Well I have to thank you for your support, too :-)
> > You seem to have tried many different tests.
> >
>
Dozens ^^
> > If I understand the problem correctly, when you use the on-board SATA, you
> > are receiving corrupted data.
> >
>
It happens with the onboard IDE, too !!!! This sounds reasonable as both
IDE and SATA are connected to the nforce 2200.
If you read the links to pages where other people report about the same
problem (especially at the Nvidia forums) you'll see that others think
that this is nforce related, too.
I established contact with some of them and most think that this may
(but of course there is no definite proof for this) be related to a
nforce/disk manufacturer combination. All of them report, for example,
that the error occurs with nforce/Hitachi.
Some of them think that it might be NCQ related (did you read the NCQ
related parts of my last email? As far as I can see NCQ will be added in
the kernel for sata_nv in 2.6.19).
Both of these sound somewhat strange to me...
From my understanding of computer technology I would assume that this
should be a general harddisk error, and not only Hitachi or e.g.
Maxtor related.
(I didn't test it with the onboard SCSI, as I don't have any SCSI drives.)
Not that
> > Sometimes we have solved this problem by simply readjusting bios settings.
> >
>
Does this mean that you were able to reproduce the problem?
> > Please try the following:
> >
> > in the Linux boot prompt, please try (mem=3072M). This will show whether it
> > might be a problem related to the memory hole.
> > or use only 2gb of memory.
> >
> >
>
I'm going to test this in a few minutes (although I think I already did a
similar test)...
Anyway from a theoretical point of view it sounds very unlikely to me,
that this is a memory related issue at all. Not only because of my
memtest86+ test,.. but also because of the way the linux kernel works in
that area.
> > If it is a memory hole problem, you should have (with Linux) the following
> > settings:
> >
>
My current memhole settings are these (the ones that I use under
"normal" production):
IOMMU -> enabled
IOMMU -> 64 MB
Memhole -> AUTO
mapping -> HARDWARE
Other memory settings
-> Node Memory Interleave -> enabled
-> Dram Bank Interleave -> enabled
-> MTTR Mapping -> discrete
-> Memory Hole
-> Memory Hole mapping -> enabled
-> Memory Config
-> Memory Clock DDR400
-> Swizzle memory banks enabled
> > CMOS reset (press CMOS Clear button for 20 seconds).
> > Go into Bios -> Set all values to default (F9)
> > Main -> Installed O/S -> Linux
> > Advanced -> Hammer Config
> > -> Node Memory Interleave -> disabled
> > -> Dram Bank Interleave -> disabled
> > -> MTTR Mapping -> discrete
> > -> Memory Hole
> > -> Memory Hole mapping -> Disabled
> > -> Memory Config
> > -> Memory Clock DDR333
> > -> Swizzle memory banks disabled
> >
>
I've already checked exactly this setting ;) except that I used
DDR400. Could that make any difference?
> > You might try SATA HDDs from another manufacturer.
> >
>
I'm already trying to do so but so far none of my friends has been able to
lend me any devices. I'm also going to check the issue with other
operating systems (at least if I find any that support the Nvidia
chipsets at all), maybe some *BSD or OpenSolaris.
> > Also, I have a newer beta bios version available.
> >
> > ftp://ftp.tech2.de/boards/28xx/2895/BIOS/ -> 2895_1047.zip you might want to
> > try.
> >
>
Please don't get me wrong: I would still like you to help and
investigate this issue, but currently I think (although I may be
wrong) that this could be harddisk firmware related.
So what _exactly_ did you change in that version? Or is it just a
crappy solution or workaround?
Any idea about that spread spectrum option?:
> > - I activated my BIOS's Spread Spectrum Option (although it is not
> > described what it does).
> > Is this just for harddisks? And would I have to activate SpredSpectrum
> > at the Hitachi Feature Tools, too, for having an effect?
> >
>
Thanks so far.
Chris.
#########################################################################
### email #3 to Tyan/Hitachi ###
#########################################################################
Rodger Dusatko - Tyan wrote:
> > Hello Christoph,
> >
> > another customer having data corruption problems said by entering the
> > command mem=3072M he no longer has data corruption problems.
> >
> > Please let me know as soon as possible, that I might know how to help
> > further.
> >
>
I just finished my test....
I used my "old" BIOS settings (not the ones included in your mail) but
set mem=3072M.
It seems (although I'm not yet fully convinced as I've already had cases
where an error occurred only after lots of sha512 passes) that with
mem=3072M _no error occurs_.
But of course I get only 2GB RAM (of my 4GB, which I wanted to upgrade to
even more memory in the next months).
So just using mem=3072M is not acceptable.
And I must admit that I have strong doubts that memhole
settings are a proper fix for this.
Of course I'd be glad if I could fix it, but from my own extensive
system programming experience I know that there are many cases where a
fix isn't really a fix for a problem, but merely masks it in
combination with other errors (that have not yet been found).
I'd be glad if you could give me a better explanation of the
memhole solution (and especially how to solve it without mem=3072M,
because I'd like to have my full memory), because I'd like to fully
understand the issue to make sure whether it is really fixed or not.
I'll test your beta BIOS tomorrow and report my results.
If you wish I could also call you by phone (just give me your phone number).
Thanks in advance,
Chris.
#########################################################################
### email #4 to Tyan/Hitachi ###
#########################################################################
One thing I forgot,...
Although I use it very, very rarely, there are some cases where I have
to use M$ Windows, and afaik you cannot tell Windows something
like mem=3072M.
So it wouldn't solve that for Windows.
Chris.
#########################################################################
### email #5 to Tyan/Hitachi ###
#########################################################################
Dear Mr. Dusatko, Mr. Ebner and dear sir at the Hitachi GST Support.
I'd like to give you my current status of the problem.
First of all, AMD hasn't even answered so far, and the same applies to my
request at Nvidia's knowledge base. That says something about these
companies, I think.
For the people at Hitachi: With the advice of Mr. Dusatko from Tyan I
was able to work around the problem:
Rodger Dusatko - Tyan wrote:
> > as I mentioned earlier, you can do some of these memory hole settings
> > : (for
> > Linux)
>
>>> >>> Go into Bios -> Set all values to default (F9)
>>> >>> Main -> Installed O/S -> Linux
>>> >>> Advanced -> Hammer Config
>>> >>> -> Node Memory Interleave -> disabled
>>> >>> -> Dram Bank Interleave -> disabled
>>> >>> -> MTTR Mapping -> discrete
>>> >>> -> Memory Hole
>>> >>> -> Memory Hole mapping -> Disabled
>>> >>> -> Memory Config
>>> >>> -> Memory Clock DDR333
>>> >>> -> Swizzle memory banks disabled
>>>
The above BIOS settings actually lead to a system that did not
produce any errors during one of my complete tests (that is, verifying
sha512sums 50 times on 30 GB of data).
Actually it seems to depend on only one of the above settings: Memory
hole mapping.
Currently I'm using the following:
Main -> Installed O/S -> Linux
Advanced -> Hammer Config
-> Node Memory Interleave -> Auto
-> Dram Bank Interleave -> Auto
-> MTTR Mapping -> discrete
-> Memory Hole
-> Memory Hole mapping -> Disabled
-> Memory Config
-> Memory Clock->DDR400
->Swizzle memory banks -> Enabled
And still no error occurs.
But as soon as I set Memory Hole mapping to one of the other values
(Auto, Hardware or Software), the error occurs.
(Especially for Tyan: Note that when using Software, Node Memory
Interleave is always automatically set to Disabled after reboot, while
when using Hardware, Auto works. Perhaps a bug?)
OK, now you might think the problem is solved, but it is definitely not:
1) Disabling Memory Hole mapping costs me 2GB of my 4GB RAM (which are
unusable because of the memory hole). This is not really acceptable.
The beta BIOS Mr. Dusatko from Tyan gave me might solve this, but I wasn't
able to test it yet.
2) But even if this solved the problem, I'm still very concerned and
encourage especially the people at Hitachi to try to find another reason.
Why? Because I cannot imagine how the memory hole leads to the whole issue:
- The memory hole is a quite simple process where the BIOS/hardware
remaps some portions of physical RAM to higher areas, to give the
lower areas to PCI devices that make use of memory-mapped I/O.
Even if there were an error, it would affect not only IDE/SATA
but also CD/DVD/SCSI drives and all other memory operations.
AND complete blocks would be corrupted, not only
several bytes (remember: I've reported that in a corrupted block some
bytes are OK, some are not, and so on).
- If you look at the board description
(ftp://ftp.tyan.com/manuals/m_s2895_101.pdf page 9) you see that both
IDE and SATA are connected to the nforce professional 2200, right?
Why should the memhole settings affect only the IDE/SATA drives? If
there was an error in the memory controller it would affect every memory
operation in the system (see above) because the memory controller is not
onboard, but integrated in the Opteron CPUs. (This is also the reason
why, if the memory controller had design errors, not only people
using nvidia chipsets would have this problem, yet apparently only they do.)
- Last but not least (as also noted above), the errors always look like
this: not a complete block is corrupted but just perhaps half
of all its bytes (in any order). Could this come from the simple memory
hole remapping? In my opinion, definitely not.
So I think "we" are not yet finished with our work.
- I ask the Hitachi people to continue their work (or start with it ;) )
by taking a special look at their firmware and how it interoperates with
nforce chipsets.
I found (really) lots of reports where people say that this issue has
been resolved by firmware upgrades from their vendor (especially for
Maxtor devices).
Nvidia itself suggests this:
http://nvidia.custhelp.com/cgi-bin/nvidia.cfg/php/enduser/std_adp.php?p_faqid=768&p_created=1138923863&p_sid=9qSJ8Yni&p_accessibility=0&p_redirect=&p_lva=&p_sp=cF9zcmNoPSZwX3NvcnRfYnk9JnBfZ3JpZHNvcnQ9JnBfcm93X2NudD00MzImcF9wcm9kcz0mcF9jYXRzPSZwX3B2PSZwX2N2PSZwX3NlYXJjaF90eXBlPWFuc3dlcnMuc2VhcmNoX2ZubCZwX3BhZ2U9MQ**&p_li=&p_topview=1
(although they think that the issue appears only on SATA which is
definitely not true)
Please have a detailed look at the NCQ of the drives:
This would be (according to how NCQ works) the most likely reason for
the error, and some people say that deactivating it under Windows
solved the issue. However, if NCQ were responsible for the error, it
would not appear on the IDE drives (but it does).
And I'm not even sure if Linux/libata (up to kernel 2.6.18.x) even uses
NCQ. I always thought it did not, but I might be wrong. See this part
of my dmesg:
sata_nv 0000:00:07.0: version 2.0
ACPI: PCI Interrupt Link [LTID] enabled at IRQ 22
GSI 18 sharing vector 0xD9 and IRQ 18
ACPI: PCI Interrupt 0000:00:07.0[A] -> Link [LTID] -> GSI 22 (level,
high) -> IRQ 217
PCI: Setting latency timer of device 0000:00:07.0 to 64
ata1: SATA max UDMA/133 cmd 0x1C40 ctl 0x1C36 bmdma 0x1C10 irq 217
ata2: SATA max UDMA/133 cmd 0x1C38 ctl 0x1C32 bmdma 0x1C18 irq 217
scsi0 : sata_nv
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-7, max UDMA/133, 488397168 sectors: LBA48 NCQ (depth 0/32)
ata1.00: ata1: dev 0 multi count 16
ata1.00: configured for UDMA/133
scsi1 : sata_nv
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: ATA-7, max UDMA/133, 488397168 sectors: LBA48 NCQ (depth 0/32)
ata2.00: ata2: dev 0 multi count 16
ata2.00: configured for UDMA/133
Vendor: ATA Model: HDT722525DLA380 Rev: V44O
Type: Direct-Access ANSI SCSI revision: 05
Vendor: ATA Model: HDT722525DLA380 Rev: V44O
Type: Direct-Access ANSI SCSI revision: 05
It says something about NCQ...
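The NCQ lines can be pulled out of such a log with a few lines of Python.
This is only a sketch: the function name is made up, and the meaning of
the two numbers in "depth 0/32" is not stated in the log itself. One
plausible reading (an assumption that would need checking against the
libata source) is "depth used by the host driver / depth advertised by
the drive", in which case 0/32 would mean NCQ is advertised but not used.

```python
import re

# Matches libata lines such as:
#   ata1.00: ATA-7, max UDMA/133, 488397168 sectors: LBA48 NCQ (depth 0/32)
NCQ_RE = re.compile(r"^(ata\d+\.\d+): .*\bNCQ \(depth (\d+)(?:/(\d+))?\)")


def ncq_depths(dmesg_text):
    """Return {device: (first_number, second_number)} for every NCQ line.

    If only one number is printed, it is reported for both positions.
    """
    result = {}
    for line in dmesg_text.splitlines():
        m = NCQ_RE.match(line.strip())
        if m:
            first = int(m.group(2))
            second = int(m.group(3)) if m.group(3) else first
            result[m.group(1)] = (first, second)
    return result
```

Feeding it the excerpt above yields an entry for ata1.00 and ata2.00, each
with the pair (0, 32).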
- It would also be great if Hitachi could inform me about their current
progress or how long it will take until their engineers start to have a
look at my issue.
Especially for Mr. Dusatko at Tyan:
> > Just because memtest86 works it doesn't mean that the memory you are using
> > is compatible memory. That is why we have a recommended list.
> >
> > Each of the modules on our recommended list have been thoroughly tested.
> > Most memories pass the memtest86 test, yet many of these do not past our
> > tests.
> >
> > my tel-nr.
>
My memory modules are actually on your compatible list (they're the
Kingston KVR400D8R3A/1G) so this cannot be the point.
I still was not able to test your beta BIOS but I'll do so as soon as
possible and report the results. And I'm going to call you this or next
week (I have to work at the Leibniz-Supercomputing Centre today and
tomorrow, so I don't know when I'll have enough time).
Thanks for now.
Best wishes,
Chris.
#########################################################################
### email #6 to Tyan/Hitachi ###
#########################################################################
Rodger Dusatko - Tyan wrote:
> > you mention:
> >
> >
>
>> >> My memory modules are actually on your compatible list (they're the
>> >> Kingston KVR400D8R3A/1G) so this cannot be the point.
>> >>
>>
> >
> > I have talked with so many customers about this very problem. Just because
> > the part-nr. of the Kingston modules is correct, this means absolutely
> > nothing.
> >
> > You need to also have the same chips as on our recommended website. The
> > chips being used are even more important than the kingston part-nr.
> >
> > The chips on the KVR400D8R3A/1G must be Micron, having chip part-nr.
> > MT46V64M8TG-5B D as shown on our recommended memory page.
> >
>
I'll check this in the next few days and inform you about the exact chips
on the DIMMs.
Anyway...
What do you say to the reasons why I don't think that the memhole stuff
is a real solution but rather a poor workaround (see my last email,
which is attached below)?
You didn't comment on my ideas in your last answer.
> > This is a grave problem with Kingston memory and why I would only recommend
> > Kingston memory when your supplier is willing to help you to get the exact
> > modules which we have tested.
> >
>
Well, are you absolutely sure that this is memory related? (See also my
comments in my last email)
Note that lots of users were able to solve this via disk drive firmware
upgrades and many of them didn't have Kingston RAM.
Also, all RAM "should" be usable as all "should" follow the SDRAM
standard...
If there were a Kingston error, that data corruption issue should
appear everywhere, shouldn't it? And not only on hard disk accesses.
With all due respect, and please believe me that I truly respect your
knowledge (because you surely know more about hardware, as my
computer science studies are more about theoretical stuff), I
cannot believe that this is the simple reason: "wrong RAM, wrong BIOS
settings and you cannot use your full RAM" (see my reasons in my last
email)...
I'd say that there is a real and perhaps grave error somewhere,
either on the board itself or in the nvidia chipset (which I suspect as
the evil here ;-) ).
And I think the error is severe enough that a considerable effort
should be made to solve it, or at least to locate exactly where the
error is, and why disabling the memhole solves it.
And remember, it may be the case that the data corruption doesn't
appear when UDMA (on PATA drives) is disabled, but this shouldn't have
anything to do with memory vendor or memhole settings. So why would
this solve the issue, too (if it actually does, which I cannot prove)?
I'm also going to start a test with the following BIOS settings changed:
SCSI Bus master from my current setting Enabled to Disabled
Disk Access Mode (don't recall the actual name) from Other to DOS.
I'm going to report the results to you next week, and I'll probably
call you again.
> > Wiith ATP or other vendors, they stick usually to the same chips as long as
> > the vendor part-nr is the same. In such a case, you probably would have been
> > right when the vendor part-nr matches your part-nr.
> >
> > The problems you are having, as I mentioned before, may disappear if you use
> > memory on our recommended memory list.
> >
>
Is it possible for Tyan to lend me such memory for testing? I live in
Munich and Tyan Germany is in Munich too, if I recall correctly.
Thanks in advance.
Best wishes,
Chris.
#########################################################################
### email #7 to Tyan/Hitachi ###
#########################################################################
Sorry I forgot one thing:
The beta BIOS you gave me did not change anything.
As soon as I activate memhole mapping (either to software, hardware or
auto),.. data corruption occurs.
Chris.
#########################################################################
### reply to #1 from Tyan ###
#########################################################################
Hello Chris,
there are often problems which are not really so easy to understand.
As I understand it, the hard disk uses DMA (Direct Memory Access), which is
supported by the chipset.
The processor uses the DMA access to the DIMMs through the chipset to write
to the disks.
Now, I really am not an expert on this, but normally the DMA is not used by
the processor when communicating with the memory, but rather the
hypertransport connection.
This may be an explanation of what is causing the problem. Because a driver
for HDDs also exists, there may be different links where the problem is
occurring.
The driver may be able to solve problems which can make it that even using
the hardware setting for memory hole causes no problems. However, there are
many different amd cpu steppings, all different in how they manage memory
(and in this case, the memory hole). If the drivers take all of these
considerations, they may be able to adjust according to the processor being
used. But I am not sure if the people who write these drivers get involved
with this.
Rodger
, the DMA s supported from the chipset uses the DMA access for
communicating with the processor, the memory
----- Original Message -----
...
...
...
Those were all the (important) emails so far.
On Sat Dec 02, 2006 at 01:56:06AM +0100, Christoph Anton Mitterer wrote:
> The issue was basically the following:
> I found a severe bug mainly by fortune because it occurs very rarely.
> My test looks like the following: I have about 30GB of testing data on
> my harddisk,... I repeat verifying sha512 sums on these files and check
> if errors occur.
> One test pass verifies the 30GB 50 times,... about one to four
> differences are found in each pass.
Doh! I have a Tyan S2895 in my system, and I've been pulling my
hair out trying to track down the cause of a similar somewhat
rare failure for the pre-computed sha1 of a block of data to
actually match the calculated sha1. I'd been hunting in vain the
past few days trying to find a cause -- looking for buffer
overflows, non thread safe code, or similar usual suspects.
It is a relief to see I am not alone!
-Erik
--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--
Erik Andersen wrote:
> Doh! I have a Tyan S2895 in my system, and I've been pulling my
> hair out trying to track down the cause of a similar somewhat
> rare failure for the pre-computer sha1 of a block of data to
> actually match the calculated sha1. I'd been hunting in vain the
> past few days trying to find a cause -- looking for buffer
> overflows, non thread safe code, or similar usual suspects.
>
> It is a relief to see I am not alone!
>
^^
You might read my email and all links in it, etc. thoroughly; then you
can try what I did.
Please inform me about all your results, and about your specific
hardware (i.e. CPU type (with stepping and exact model), which harddisks
and so on).
Best wishes,
Chris.
On Sat, Dec 02, 2006 at 01:56:06AM +0100, Christoph Anton Mitterer wrote:
> I found a severe bug mainly by fortune because it occurs very
> rarely. My test looks like the following: I have about 30GB of
> testing data on my harddisk,... I repeat verifying sha512 sums on
> these files and check if errors occur.
Heh, I see this also with a Tyan S2866 (nforce4 chipset). I've been
aware something is amiss for a while because if I transfer about 40GB
of data from one machine to another there are checksum mismatches and
some files have to be transferred again.
I've kept quiet about it so far because it's not been clear what the
cause is and because I can mostly ignore it now that I checksum all my
data and check after xfer that it's sane (so I have 2+ copies of all
this stuff everywhere).
> One test pass verifies the 30GB 50 times,... about one to four
> differences are found in each pass.
Sounds about the same occurrence rate I see: 30-40GB xfer, one or two
pages (4K) are wrong.
> The corrupted data is not one single completely wrong block of data
> or so,.. but if you look at the area of the file where differences
> are found,.. than some bytes are ok,.. some are wrong,.. and so on
> (seems to be randomly).
For me it seems that a single block in the file would be bad and the
rest OK --- we're talking about 2 random blocks in 30GB or so.
Hello Christoph!
On Sat, 2 Dec 2006, Christoph Anton Mitterer wrote:
> I found a severe bug mainly by fortune because it occurs very rarely.
> My test looks like the following: I have about 30GB of testing data on
This sounds very familiar! One of the Linux compute clusters I
administer at work is a 336 node system consisting of the
following components:
* 2x Dual-Core AMD Opteron 275
* Tyan S2891 mainboard
* Hitachi HDS728080PLA380 harddisk
* 4 GB RAM (some nodes have 8 GB) - intensively tested with
memtest86+
* SUSE 9.3 x86_64 (kernel 2.6.11.4-21.14-smp) - But I've also
e.g. tried the latest openSUSE 10.2 RC1+ kernel 2.6.18.2-33 which
makes no difference.
We are running LS-Dyna on these machines and discovered a
testcase which shows a similar data corruption. So I can
confirm that the problem is real and not a hardware defect
of a single machine!
Here's a diff of a corrupted and a good file written during our
testcase:
("-" == corrupted file, "+" == good file)
...
009f2ff0 67 2a 4c c4 6d 9d 34 44 ad e6 3c 45 05 9a 4d c4 |g*L.m.4D..<E..M.|
-009f3000 39 60 e6 44 20 ab 46 44 56 aa 46 44 c2 35 e6 44 |9`.D .FDV.FD.5.D|
-009f3010 45 e1 48 44 88 3d 47 44 f3 81 e6 44 93 0b 46 44 |E.HD.=GD...D..FD|
-009f3020 d4 eb 48 44 22 57 e6 44 3d 3d 48 44 ac 89 49 44 |..HD"W.D==HD..ID|
-009f3030 00 8c e9 44 39 af 2d 44 e7 1b 8d 44 a8 6e e9 44 |...D9.-D...D.n.D|
-009f3040 16 d4 2e 44 5e 12 8c 44 78 51 e9 44 c0 f5 2f 44 |...D^..DxQ.D../D|
...
-009f3fd0 22 ae 4e 44 81 b5 ee 43 0c 8a df 44 8d e2 6b 44 |".ND...C...D..kD|
-009f3fe0 6c a0 e8 43 b6 8f e9 44 22 ae 4e 44 55 e9 ed 43 |l..C...D".NDU..C|
-009f3ff0 a8 b2 e0 44 78 7c 69 44 56 6f e8 43 5e b2 e0 44 |...Dx|iDVo.C^..D|
+009f3000 1b 32 30 44 50 59 3d 45 a2 79 4e c4 66 6e 2f 44 |.20DPY=E.yN.fn/D|
+009f3010 40 91 3d 45 d1 b6 4e c4 a1 6c 31 44 1b cb 3d 45 |@.=E..N..l1D..=E|
+009f3020 0d f6 4e c4 57 7c 33 44 bf cb 3c 45 88 9a 4d c4 |..N.W|3D..<E..M.|
+009f3030 79 e9 29 44 3a 10 3d 45 d3 e1 4d c4 17 28 2c 44 |y.)D:.=E..M..(,D|
+009f3040 f6 50 3d 45 dc 25 4e c4 b6 50 2e 44 b3 4f 3c 45 |.P=E.%N..P.D.O<E|
...
+009f3fd0 9c 70 6c 45 04 be 9f c3 fe fc 8f 44 ce 65 6c 45 |.plE.......D.elE|
+009f3fe0 fc 56 9c c3 32 f7 90 44 e5 3c 6c 45 cd 79 9c c3 |.V..2..D.<lE.y..|
+009f3ff0 f3 55 92 44 c1 10 6c 45 5e 12 a0 c3 60 31 93 44 |.U.D..lE^...`1.D|
009f4000 88 cd 6b 45 c1 6d cd c3 00 a5 8b 44 f2 ac 6b 45 |..kE.m.....D..kE|
...
Please notice:
a) the corruption begins at a page boundary
b) the corrupted byte range is a single memory page and
c) almost every fourth byte is set to 0x44 in the corrupted case
(but the other bytes changed, too)
To me this looks as if a wrong memory page got written into the
file.
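The observations a) to c) can be checked mechanically for any corrupted
range. What follows is only a minimal sketch in C, assuming 4 KiB pages
(the usual page size on these x86-64 systems). The helper names
is_single_page and count_every_fourth are made up for illustration and do
not come from any existing tool.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u  /* assumption: 4 KiB memory pages */

/* Returns 1 if the inclusive file offset range [first, last] covers
 * exactly one page-aligned page, 0 otherwise. */
int is_single_page(uint64_t first, uint64_t last)
{
    return (first % PAGE_SIZE_ASSUMED) == 0 &&
           last == first + PAGE_SIZE_ASSUMED - 1;
}

/* Counts how many of the bytes at positions 3, 7, 11, ... (every
 * fourth byte) are equal to `value`. */
size_t count_every_fourth(const unsigned char *buf, size_t len,
                          unsigned char value)
{
    size_t hits = 0;
    for (size_t i = 3; i < len; i += 4)
        if (buf[i] == value)
            hits++;
    return hits;
}
```

For the diff above, the corrupted range runs from 0x009f3000 to
0x009f3fff, so is_single_page() returns 1, and in the first corrupted
row all four of the "every fourth" bytes are 0x44.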
From our testing I can also tell that the data corruption does
*not* appear at all when we are booting the nodes with mem=2G.
However, when we are using all the 4GB the data corruption
shows up - but not every time and thus not on all nodes.
Sometimes a node runs for hours without any problem. That's why
we are testing on 32 nodes in parallel most of the time. I have
the impression that it has something to do with physical memory
layout of the running processes.
Please also notice that this is a silent data corruption. I.e.
there are no error or warning messages in the kernel log or the
mce log at all.
Christoph, I will carefully re-read your entire posting and the
included links on Monday and will also try the memory hole
setting.
If somebody has an explanation for this problem I can offer
some of our compute nodes+time for testing because we really
want to get this fixed as soon as possible.
Best regards,
Karsten
--
Dipl.-Inf. Karsten Weiss - http://www.machineroom.de/knweiss
On Sat, 2 Dec 2006 12:00:36 +0100 (CET)
Karsten Weiss <[email protected]> wrote:
> Hello Christoph!
>
> On Sat, 2 Dec 2006, Christoph Anton Mitterer wrote:
>
> > I found a severe bug mainly by fortune because it occurs very rarely.
> > My test looks like the following: I have about 30GB of testing data on
>
> This sounds very familiar! One of the Linux compute clusters I
> administer at work is a 336 node system consisting of the
> following components:
See the thread http://lkml.org/lkml/2006/8/16/305
Alan wrote:
> See the thread http://lkml.org/lkml/2006/8/16/305
>
Hi Alan.
Thanks for your reply. I read this thread some weeks ago already,
but from my limited knowledge I understood that this was an issue
related to a SCSI adapter or so, and that as soon as he removed the
card everything was fine. Or did I understand this wrong?
I don't have any PCI SCSI cards,... (but I have an onboard LSI53C1030
controller).
The only cards I have are:
PCIe bus (two slots):
  Asus Nvidia 7800GTX based card
PCI bus (one slot):
  no card
PCI-X bus A (100MHz) (two slots):
  Hauppauge Nova T 500 Dual DVB-T card (which is actually a "normal" PCI
  card,.. but should be compatible with PCI-X)
  TerraTec Aureon 7.1 Universe Soundcard (which is actually a "normal" PCI
  card,.. but should be compatible with PCI-X)
PCI-X bus B (133MHz) (one slot):
  no card
Chris.
Chris Wedgwood wrote:
> Heh, I see this also with a Tyan S2866 (nforce4 chipset). I've been
> aware something is amiss for a while because if I transfer about 40GB
> of data from one machine to another there are checksum mismatches and
> some files have to be transferred again.
>
It seems that this may occur on _all_ nvidia chipsets (of course I'm
talking about mainboard chips ;) )
Which hard disk types do you use (vendor and interface type)?
> I've kept quiet about it so far because it's not been clear what the
> cause is and because I can mostly ignore it now that I checksum all my
> data and check after xfer that it's sane (so I have 2+ copies of all
> this stuff everywhere).
>
I assume that a large number of users actually experience this error,..
but as it's so rare only few correctly identify it.
Most of them might think that it's filesystem related or so.
>> The corrupted data is not one single completely wrong block of data
>> or so,.. but if you look at the area of the file where differences
>> are found,.. than some bytes are ok,.. some are wrong,.. and so on
>> (seems to be randomly).
>>
>
> For me it seems that a single block in the file would be bad and the
> rest OK --- I'm talking about 2 random blocks in 30GB or so.
>
Did you check this with a hex editor? I did, and while the errors were
restricted to one "region" of a file, that region was not completely
corrupted; only some single bytes in it were wrong.
Actually, in most cases just one bit was wrong,..
Chris.
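Whether a given difference is a single flipped bit or something larger
can be checked by XOR-ing the good and the bad byte and counting the set
bits. A minimal sketch in C; the function name flipped_bits is chosen
here for illustration only:

```c
#include <stdint.h>

/* Number of bit positions in which a good byte and a corrupted byte
 * differ. A result of 1 means a single-bit flip. */
int flipped_bits(uint8_t good, uint8_t bad)
{
    uint8_t diff = good ^ bad;   /* set bits mark differing positions */
    int count = 0;
    while (diff) {
        diff &= (uint8_t)(diff - 1);  /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

A byte pair such as 0x4d (good) and 0x4f (bad) gives 1, while a fully
inverted byte gives 8.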
On Sat, 2006-12-02 01:56:06, Christoph Anton Mitterer wrote:
> The issue was basically the following: I found a severe bug mainly by
> fortune because it occurs very rarely. My test looks like the following:
> I have about 30GB of testing data on my harddisk,... I repeat verifying
> sha512 sums on these files and check if errors occur. One test pass
> verifies the 30GB 50 times,... about one to four differences are found in
> each pass.
I'm also experiencing silent data corruption on writes to SATA disks
connected to a Nvidia controller (nForce 4 chipset). The problem is
100% reproducible. Details of my configuration (mainboard model, lspci,
etc.) are near the bottom of this message. What follows is a summation
of my findings.
I have confirmed the corruption is occurring on the writes and not the
reads. Furthermore, if I compare the original and copy while both are
still cached in memory no corruption is found. But as soon as I flush the
pagecache (by reading another file larger than memory) to force the copy
of the file to be read from disk the corruption is seen. The corruption
occurs with direct I/O and normal buffered filesystem I/O (ext3).
Booting with "mem=1g" (system has 4 GiB installed) makes no difference.
So it isn't due to remapping memory above the 4 GiB boundary. Booting to
single user and ensuring no unnecessary modules (video, etc.) are loaded
also makes no difference.
The problem affects both disks attached to the nVidia SATA controller but
not the two disks attached to the PATA side of the same controller. All
four disks are different models. The same SATA disks attached to
the Silicon Image 3114 SATA RAID controller (on the same mainboard)
experience the same corruption but at a lower probability. The same
disks attached to a Promise TX2 SATA controller (in the same system)
experience no corruption.
The system has run memtest86 for 24 hours with no errors.
The corruption occurs with a 32-bit kernel.org 2.6.12 kernel (from a
Knoppix CD), 64-bit kernel.org 2.6.18.1, and 64-bit kernel.org 2.6.19.
The only pattern I can discern is that the corruption only affects the
second four byte block of a sixteen byte aligned region. That is, the
offset of the corrupted bytes always falls in the range 0x...4 to 0x...7.
For example, here are a few representative offsets of corrupted bytes
from one test of many:
0x020f4554
0x020f4555
0x020f4556
0x020f4557
0x020f4555
0x1597f1d4
0x23034ee5
0x2dfd08d4
0x33690b14
0x33690b15
0x33690b16
0x33690b17
Approximately half the corruption involves all four bytes of the second
four byte word of the 16-byte aligned region. The remaining instances show
one, two or three bytes being corrupted. There is no pattern that I can
discern to the corruption. It is definitely not anything as simple as the
correct bytes being replaced with zeros or specific bits being forced to
a zero or one state or flipped. Copying a 2 GiB file typically results
in between five and thirty bytes being corrupted. Back to back tests
copying the same file results in different bytes being corrupted.
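The "second four byte word of a 16-byte aligned region" rule can be
expressed as a small check on the low four bits of an offset. This is
just an illustrative sketch in C; the function names are invented here:

```c
#include <stdint.h>

/* Index (0..3) of the 32-bit word that a byte offset falls into
 * within its 16-byte aligned region. */
int word_in_region(uint64_t offset)
{
    return (int)((offset & 0xF) >> 2);
}

/* Returns 1 if the offset lies in the second 32-bit word, i.e. its
 * last hex digit is 4, 5, 6 or 7. */
int in_second_word(uint64_t offset)
{
    return word_in_region(offset) == 1;
}
```

Every offset in the list above (0x020f4554, 0x23034ee5, 0x33690b17 and
so on) passes in_second_word(), while a neighbouring offset such as
0x020f4558 would not.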
The problem appears to be sensitive to the data pattern. Some files can be
copied repeatedly without corruption. But others will exhibit corruption
100% of the time when attached to the nVidia SATA controller and over 50%
of the time when connected to the Sil 3114 controller.
I have tried direct I/O with varying transfer sizes from 1 KiB to 128 KiB
(I have not tried single sector I/O). Corruption occurs with all block
sizes. If I execute an I/O loop like this the corruption does not appear
to occur:
    while direct read 64 KiB original
        direct write copy
        direct read + verify copy
    done
But that is tentative as I've only done two such tests at this time.
A coworker has hypothesized that this may be a consequence of bus
crosstalk. That it doesn't happen when using the Promise TX2 controller
suggests that if it is a HW bus problem it isn't a hypertransport or
memory bus problem.
Configuration Details
=====================
Mainboard: ASUS A8N-SLI Deluxe with nForce 4 chipset
BIOS: 1016 (original) and 1805 (upgraded to in attempt to resolve)
CPU: AMD Athlon 64 X2 Dual Core Processor 4400+ stepping 02
Memory: 4 x 1 GiB Kingston matched pairs lifetime warranty
Kernel: 2.6.19 kernel.org (as well as others)
Disks: Western Digital WD740GD-00FL (10000 rpm "Raptor" disk)
Western Digital WD1600JD-32H (7200 rpm)
00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
Subsystem: ASUSTeK Computer Inc. Unknown device 815a
Flags: bus master, 66MHz, fast devsel, latency 0
Capabilities: [44] HyperTransport: Slave or Primary Interface
Capabilities: [e0] HyperTransport: MSI Mapping
00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev f3)
Subsystem: ASUSTeK Computer Inc. K8N4-E Mainboard
Flags: bus master, 66MHz, fast devsel, latency 0
00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
Subsystem: ASUSTeK Computer Inc. K8N4-E Mainboard
Flags: 66MHz, fast devsel, IRQ 3
I/O ports at e000 [size=32]
I/O ports at 4c00 [size=64]
I/O ports at 4c40 [size=64]
Capabilities: [44] Power Management version 2
00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2) (prog-if 10 [OHCI])
Subsystem: ASUSTeK Computer Inc. K8N4-E Mainboard
Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 5
Memory at d4003000 (32-bit, non-prefetchable) [size=4K]
Capabilities: [44] Power Management version 2
00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3) (prog-if 20 [EHCI])
Subsystem: ASUSTeK Computer Inc. K8N4-E Mainboard
Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 7
Memory at feb00000 (32-bit, non-prefetchable) [size=256]
Capabilities: [44] Debug port
Capabilities: [80] Power Management version 2
00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97 Audio Controller (rev a2)
Subsystem: ASUSTeK Computer Inc. K8N4-E Mainboard
Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 11
I/O ports at d800 [size=256]
I/O ports at dc00 [size=256]
Memory at d4002000 (32-bit, non-prefetchable) [size=4K]
Capabilities: [44] Power Management version 2
00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2) (prog-if 8a [Master SecP PriP])
Subsystem: ASUSTeK Computer Inc. K8N4-E Mainboard
Flags: bus master, 66MHz, fast devsel, latency 0
I/O ports at f000 [size=16]
Capabilities: [44] Power Management version 2
00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3) (prog-if 85 [Master SecO PriO])
Subsystem: ASUSTeK Computer Inc. Unknown device 815a
Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 11
I/O ports at 09f0 [size=8]
I/O ports at 0bf0 [size=4]
I/O ports at 0970 [size=8]
I/O ports at 0b70 [size=4]
I/O ports at d400 [size=16]
Memory at d4001000 (32-bit, non-prefetchable) [size=4K]
Capabilities: [44] Power Management version 2
00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3) (prog-if 85 [Master SecO PriO])
Subsystem: ASUSTeK Computer Inc. K8N4-E Mainboard
Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 5
I/O ports at 09e0 [size=8]
I/O ports at 0be0 [size=4]
I/O ports at 0960 [size=8]
I/O ports at 0b60 [size=4]
I/O ports at c000 [size=16]
Memory at d4000000 (32-bit, non-prefetchable) [size=4K]
Capabilities: [44] Power Management version 2
00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev f2) (prog-if 01 [Subtractive decode])
Flags: bus master, 66MHz, fast devsel, latency 0
Bus: primary=00, secondary=05, subordinate=05, sec-latency=128
I/O behind bridge: 00007000-00009fff
Memory behind bridge: d2000000-d3ffffff
Prefetchable memory behind bridge: d4100000-d42fffff
00:0b.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev f3) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=04, subordinate=04, sec-latency=0
Capabilities: [40] Power Management version 2
Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable+
Capabilities: [58] HyperTransport: MSI Mapping
Capabilities: [80] Express Root Port (Slot+) IRQ 0
Capabilities: [100] Virtual Channel
00:0c.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev f3) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=03, subordinate=03, sec-latency=0
Capabilities: [40] Power Management version 2
Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable+
Capabilities: [58] HyperTransport: MSI Mapping
Capabilities: [80] Express Root Port (Slot+) IRQ 0
Capabilities: [100] Virtual Channel
00:0d.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev f3) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
Capabilities: [40] Power Management version 2
Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable+
Capabilities: [58] HyperTransport: MSI Mapping
Capabilities: [80] Express Root Port (Slot+) IRQ 0
Capabilities: [100] Virtual Channel
00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
I/O behind bridge: 0000a000-0000afff
Memory behind bridge: d0000000-d1ffffff
Prefetchable memory behind bridge: 00000000c0000000-00000000cff00000
Capabilities: [40] Power Management version 2
Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable+
Capabilities: [58] HyperTransport: MSI Mapping
Capabilities: [80] Express Root Port (Slot+) IRQ 0
Capabilities: [100] Virtual Channel
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
Flags: fast devsel
Capabilities: [80] HyperTransport: Host or Secondary Interface
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
Flags: fast devsel
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
Flags: fast devsel
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
Flags: fast devsel
01:00.0 VGA compatible controller: ATI Technologies Inc RV530 [Radeon X1600] (prog-if 00 [VGA])
Subsystem: VISIONTEK Unknown device 1890
Flags: bus master, fast devsel, latency 0, IRQ 3
Memory at c0000000 (64-bit, prefetchable) [size=256M]
Memory at d1000000 (64-bit, non-prefetchable) [size=64K]
I/O ports at a000 [size=256]
Expansion ROM at d0000000 [disabled] [size=128K]
Capabilities: [50] Power Management version 2
Capabilities: [58] Express Endpoint IRQ 0
Capabilities: [80] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
01:00.1 Display controller: ATI Technologies Inc RV530 [Radeon X1600] (Secondary)
Subsystem: VISIONTEK Unknown device 1891
Flags: fast devsel
Memory at d1010000 (64-bit, non-prefetchable) [disabled] [size=64K]
Capabilities: [50] Power Management version 2
Capabilities: [58] Express Endpoint IRQ 0
05:07.0 Mass storage controller: Promise Technology, Inc. PDC20375 (SATA150 TX2plus) (rev 02)
Subsystem: Promise Technology, Inc. PDC20375 (SATA150 TX2plus)
Flags: bus master, 66MHz, medium devsel, latency 96, IRQ 5
I/O ports at 7000 [size=64]
I/O ports at 7400 [size=16]
I/O ports at 7800 [size=128]
Memory at d3124000 (32-bit, non-prefetchable) [size=4K]
Memory at d3100000 (32-bit, non-prefetchable) [size=128K]
Expansion ROM at d4280000 [disabled] [size=16K]
Capabilities: [60] Power Management version 2
05:08.0 Ethernet controller: Intel Corporation 82557/8/9 [Ethernet Pro 100] (rev 08)
Subsystem: IBM Netfinity 10/100
Flags: bus master, medium devsel, latency 32, IRQ 3
Memory at d3127000 (32-bit, non-prefetchable) [size=4K]
I/O ports at 7c00 [size=64]
Memory at d3000000 (32-bit, non-prefetchable) [size=1M]
Expansion ROM at d4100000 [disabled] [size=1M]
Capabilities: [dc] Power Management version 2
05:0a.0 RAID bus controller: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)
Subsystem: ASUSTeK Computer Inc. Unknown device 8167
Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 7
I/O ports at 8000 [size=8]
I/O ports at 8400 [size=4]
I/O ports at 8800 [size=8]
I/O ports at 8c00 [size=4]
I/O ports at 9000 [size=16]
Memory at d3125000 (32-bit, non-prefetchable) [size=1K]
Expansion ROM at d4200000 [disabled] [size=512K]
Capabilities: [60] Power Management version 2
05:0b.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link) (prog-if 10 [OHCI])
Subsystem: ASUSTeK Computer Inc. K8N4-E Mainboard
Flags: bus master, medium devsel, latency 32, IRQ 3
Memory at d3126000 (32-bit, non-prefetchable) [size=2K]
Memory at d3120000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [44] Power Management version 2
--
Kurtis D. Rader, Linux level 3 support email: [email protected]
IBM Integrated Technology Services DID: +1 503-578-3714
15300 SW Koll Pkwy, MS RHE2-O2 service: 800-IBM-SERV
Beaverton, OR 97006-6063 http://www.ibm.com
On Sat, 2006-12-02 17:17:37, Kurtis D. Rader wrote:
> The same disks attached to a Promise TX2 SATA controller (in the same
> system) experience no corruption.
I spoke too soon. Corruption is occurring with the disks attached to the
Promise TX2 SATA controller but much less frequently. With the drives
attached to the nVidia controller copying certain 2 GiB files would
result in at least five bytes, and as many as thirty, being corrupted
every single time. On the Promise controller a given copy is likely to be
good. And when corruption does occur fewer bytes are being affected ---
as little as a single byte in a 2 GiB file. But still, some files never
show corruption while others do.
The Promise controller in a PCI slot is measurably slower than the nVidia
on the baseboard, so the speed of the transfers appears to be a factor, in
addition to the pattern of data. My hunch is this is an nVidia nForce 4
chipset design defect involving bus crosstalk or something similar, which
may be why I'm not seeing it when writing to my relatively slow PATA disks.
--
Kurtis D. Rader
Hi!
* On Sat, Dec 02, 2006 at 05:17 PM (-0800), Kurtis D. Rader wrote:
> On Sat, 2006-12-02 01:56:06, Christoph Anton Mitterer wrote:
> > The issue was basically the following: I found a severe bug mainly by
> > fortune because it occurs very rarely. My test looks like the following:
> > I have about 30GB of testing data on my harddisk,... I repeat verifying
> > sha512 sums on these files and check if errors occur. One test pass
> > verifies the 30GB 50 times,... about one to four differences are found in
> > each pass.
>
> I'm also experiencing silent data corruption on writes to SATA disks
> connected to a Nvidia controller (nForce 4 chipset). The problem is
> 100% reproducible. Details of my configuration (mainboard model, lspci,
> etc.) are near the bottom of this message. What follows is a summation
> of my findings.
>
> I have confirmed the corruption is occurring on the writes and not the
> reads. Furthermore, if I compare the original and copy while both are
> still cached in memory no corruption is found. But as soon as I flush the
> pagecache (by reading another file larger than memory) to force the copy
> of the file to be read from disk the corruption is seen. The corruption
> occurs with direct I/O and normal buffered filesystem I/O (ext3).
>
> Booting with "mem=1g" (system has 4 GiB installed) makes no difference.
> So it isn't due to remapping memory above the 4 GiB boundary. Booting to
> single user and ensuring no unnecessary modules (video, etc.) are loaded
> also makes no difference.
>
> The problem affects both disks attached to the nVidia SATA controller but
> not the two disks attached to the PATA side of the same controller. All
> four disks are different models. The same SATA disks attached to
> the Silicon Image 3114 SATA RAID controller (on the same mainboard)
> experience the same corruption but at a lower probability. The same
> disks attached to a Promise TX2 SATA controller (in the same system)
> experience no corruption.
>
> The system has run memtest86 for 24 hours with no errors.
Although your problem report seems rather clearly to be related to
the disk sub-system (e.g. as it only seems to appear on writes),
I would just like to point out that running "memtest86" for some
time without getting any errors does not necessarily prove that the
memory is faultless.
I recently had one case where a machine (Athlon XP 2200+) crashed
irregularly. "memtest86", running several days, didn't find anything,
but running the "stress test" of "Prime95" [1] for a few minutes
clearly showed that the machine just miscalculated (Prime95's stress
test stops in this case).
Just removing and reinserting the two memory modules (2 x 256 MB
DDR-RAM) fixed it. The machine is now stable and "Prime95" hasn't
stopped due to computational errors anymore since then. I suppose
that one of the modules wasn't seated in its socket correctly, but
I don't know why "memtest86" (and "memtest86+") didn't find it.
Bye,
Steffen
[1] http://www.mersenne.org/freesoft.htm
On Sat, 2006-12-02 17:17:37, Kurtis D. Rader wrote:
> I'm also experiencing silent data corruption on writes to SATA disks
> connected to a Nvidia controller (nForce 4 chipset). The problem is
> 100% reproducible. Details of my configuration (mainboard model, lspci,
> etc.) are near the bottom of this message. What follows is a summation
> of my findings.
I ran more tests today. This is definitely not due to faulty memory.
Also, clearly this is not a problem with the nVidia SATA controller that
is part of the nForce 4 chipset since the problem can be reproduced with
a Promise TX2 controller in a PCI slot.
The key question is whether this is a HW quirk of the nForce 4 chipset
that the kernel can and should be working around? What tests can I run that
will help narrow the field of investigation or provide more useful data?
I put one disk on my Promise TX2 SATA controller and the other on the
onboard nVidia controller. As reported before corruption occurs when
writing to either disk but at a lower probability when writing to the
disk on the Promise TX2. Also, if I use a PATA disk as the source of the
copy the probability of corruption is also greatly reduced (the PATA
disk has about 1/3 the throughput of the SATA disks).
I removed half the memory (two 1 GiB DIMMs from the second bank).
Corruption still occurs. I replaced the DIMMs in the first bank with the
two removed from the second bank (leaving the second bank unpopulated as
before). Corruption still occurs. I verified by inspection of the e820
map that all memory is mapped below the 4 GiB boundary. I've also been
running prime95 all day with the options "-B2 -t". No errors have been
reported. Coupled with previous clean runs of memtest86 and the symptoms
there seems no reason to believe that faulty memory is the cause of
the corruption.
I should stress that my system is not overclocked. The memory is top of the
line matched pairs of Corsair CMX1024-3200PT DDR2 400 MHz. The power supply
is an Antec True 380S (380 watts). According to the BIOS temperature
and voltage monitoring everything is well within operational limits. All
components are less than a year old. I buy the best components and am very
conservative when building a system that I depend upon for doing my job.
I also performed some additional copies of the problematic files using
a program whose core is a direct read, write, verify loop:
    ifd = open(source_path, O_RDONLY | O_DIRECT);
    ofd = open(dest_path, O_RDWR | O_DIRECT);
    while (1) {
        /* Stop at end of input (or on a short read). */
        if (read( ifd, buf1, blocksize ) != blocksize) exit(0);
    again:
        /* Write the block, seek back, and read it straight back. */
        write( ofd, buf1, blocksize );
        lseek( ofd, -blocksize, SEEK_CUR );
        read( ofd, buf2, blocksize );
        for (i = 0; i < blocksize; i++) {
            if (buf1[i] != buf2[i]) {
                fprintf( stderr, "blk %6d offset 0x%04x good %02x bad %02x\n",
                         blk, i, buf1[i], buf2[i] );
                /* Mismatch: seek back again and rewrite the same block. */
                lseek( ofd, -blocksize, SEEK_CUR );
                goto again;
            }
        }
    }
It reports corruption at a very low rate (a single block out of 15 GiB).
Rewriting the corrupted block always succeeds on the first try. Note that
the test involves seven 2 GiB and one 1 GiB file (VMware Windows XP guest
image split on 2 GiB boundaries). Of the eight files four have never been
corrupted. Those four are mostly free space (i.e., blocks of nulls). The
four files which consistently show corruption have few free blocks.
Which is further evidence that this involves some subtle HW design fault
that requires a specific pattern of data and bus transactions.
It's interesting to note that running prime95 at the same time as the
disk write test reduces the number of corrupted bytes.
The test loop computes a md5sum for each copied file and compares that to
the known correct md5sum. If any md5sums don't match it then performs a
"cmp -l" of the original file and the copy.
In the "cmp -l" output below the middle column is the good value and the
right-hand column is the bad value from the just created copy of the file.
As reported before all corruption involves the second 32-bit word of a
16-byte aligned region. Note that the offsets reported by cmp(1) start
at one so you need to subtract one to get a proper offset from the start
of the file. So subtract one then convert to hex.
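The conversion described above can be sketched as a small C helper. It
relies on the fact that "cmp -l" prints the byte number in decimal
(starting at one) and the two differing byte values in octal. The
function name parse_cmp_line is invented for this example; it is not
part of the test program used for the results below.

```c
#include <stdio.h>

/* Parses one line of "cmp -l" output: "<byte number> <byte1> <byte2>".
 * The byte number is 1-based decimal, the byte values are octal.
 * On success stores the 0-based offset and both byte values and
 * returns 0; returns -1 if the line does not have that shape
 * (e.g. a file name line). */
int parse_cmp_line(const char *line, unsigned long long *offset,
                   unsigned int *good, unsigned int *bad)
{
    unsigned long long one_based;

    if (sscanf(line, "%llu %o %o", &one_based, good, bad) != 3 ||
        one_based == 0)
        return -1;
    *offset = one_based - 1;   /* cmp counts from one */
    return 0;
}
```

Printing the resulting offset with "%llx" gives the hex offset from the
start of the file, and (offset & 0xF) shows where in the 16-byte aligned
region the byte lies.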
Below are the results of one test with 2 GiB of memory. I'll buy a beer
for the person who can find a pattern to the corruption. Results from
other test runs can be provided upon request. Most corruption involves
only a few bytes in a given 2 GiB file. But I've had a couple of runs
where hundreds, and in one case thousands, of bytes have been corrupted.
iteration 1
1c1
< 748b7ad615a62e41a88a6b5d47bb5581 Windows XP-f001.vmdk
---
> 8aff2d3a23e4f08d9a3145d011368e93 Windows XP-f001.vmdk
3,5c3,5
< 6bbe96c7da14487adab7e0c13b7e54f6 Windows XP-f003.vmdk
< bf7c45c7a6c24bda251a34e73b0cbe9c Windows XP-f004.vmdk
< 241c45aa023556d0bb0b864b3a83a800 Windows XP-f005.vmdk
---
> e08d9b8194a1aac41de611a3ba782e03 Windows XP-f003.vmdk
> 8aa16afaf6bae8dcdeb2bb5595cbc76f Windows XP-f004.vmdk
> e65f5de0bd69d76d40d23593f9221f36 Windows XP-f005.vmdk
Windows XP-f001.vmdk
327748952 115 117
657644309 327 100
657644310 105 0
657644311 221 0
657644312 127 0
778889238 145 45
778889240 350 312
Windows XP-f002.vmdk
Windows XP-f003.vmdk
1025312597 70 60
1622579125 164 174
1622579126 135 125
1622579128 170 140
Windows XP-f004.vmdk
1129237493 50 104
1129237494 1 15
1922382310 14 4
1922382312 321 333
1936442328 236 224
2004252949 37 215
2004252950 0 4
2004252951 200 26
2004252952 371 212
2004430229 200 270
2004430230 16 107
2004430231 0 340
2004430232 242 317
2056253589 164 160
2056253590 340 200
2056253592 151 101
Windows XP-f005.vmdk
113235864 71 73
536394981 1 24
536394982 203 13
536394983 310 376
536394984 2 161
764048760 2 0
Windows XP-f006.vmdk
Windows XP-f007.vmdk
Windows XP-f008.vmdk
--
Kurtis D. Rader
> The key question is whether this is a HW quirk of the nForce 4 chipset
> that the kernel can and should be working around? What tests can I run that
> will help narrow the field of investigation or provide more useful data?
Really it would need information from Nvidia on the problem, non-problem,
possible errata and/or chipset flaws. In the absence of that I don't see
a good way to debug it further than you have already.
On Sat, 2006-12-02 17:17:37, Kurtis D. Rader wrote:
> I'm also experiencing silent data corruption on writes to SATA disks
> connected to a Nvidia controller (nForce 4 chipset). The problem is
> 100% reproducible. Details of my configuration (mainboard model, lspci,
> etc.) are near the bottom of this message. What follows is a summation
> of my findings.
Various suggestions (e.g., booting with "acpi=off") have either not helped
or have resulted in a system which won't boot.
Today I replaced the ASUS A8N (nVidia nForce 4 chipset) mainboard and
AMD Athlon 64 CPU with an Intel DP965LT (Intel 965 chipset) and E6600
Core 2 Duo CPU. The SATA disks and cables are unchanged. The case, power
supply, and video card are also unchanged. Not one of the previous tests
now results in corruption.
If anyone (e.g., an nVidia employee) wants to pursue this and can provide
a meaningful action plan I'll be happy to install the problem components
in another case and attempt to gather additional diagnostic data.
--
Kurtis D. Rader
On Sunday, 3 December 2006 02:17, Kurtis D. Rader wrote:
> On Sat, 2006-12-02 01:56:06, Christoph Anton Mitterer wrote:
> > The issue was basically the following: I found a severe bug mainly by
> > fortune because it occurs very rarely. My test looks like the following:
> > I have about 30GB of testing data on my harddisk,... I repeat verifying
> > sha512 sums on these files and check if errors occur. One test pass
> > verifies the 30GB 50 times,... about one to four differences are found in
> > each pass.
>
> I'm also experiencing silent data corruption on writes to SATA disks
> connected to a Nvidia controller (nForce 4 chipset). The problem is
> 100% reproducible. Details of my configuration (mainboard model, lspci,
> etc.) are near the bottom of this message. What follows is a summation
> of my findings.
>
> I have confirmed the corruption is occurring on the writes and not the
> reads. Furthermore, if I compare the original and copy while both are
> still cached in memory no corruption is found. But as soon as I flush the
> pagecache (by reading another file larger than memory) to force the copy
> of the file to be read from disk the corruption is seen. The corruption
> occurs with direct I/O and normal buffered filesystem I/O (ext3).
>
> Booting with "mem=1g" (system has 4 GiB installed) makes no difference.
> So it isn't due to remapping memory above the 4 GiB boundary. Booting to
> single user and ensuring no unnecessary modules (video, etc.) are loaded
> also makes no difference.
>
> The problem affects both disks attached to the nVidia SATA controller but
> not the two disks attached to the PATA side of the same controller. All
> four disks are different models. The same SATA disks attached to
> the Silicon Image 3114 SATA RAID controller (on the same mainboard)
> experience the same corruption but at a lower probability. The same
> disks attached to a Promise TX2 SATA controller (in the same system)
> experience no corruption.
>
> The system has run memtest86 for 24 hours with no errors.
>
> The corruption occurs with a 32-bit kernel.org 2.6.12 kernel (from a
> Knoppix CD), 64-bit kernel.org 2.6.18.1, and 64-bit kernel.org 2.6.19.
>
> The only pattern I can discern is that the corruption only affects the
> second four byte block of a sixteen byte aligned region. That is, the
> offset of the corrupted bytes always falls in the range 0x...4 to 0x...7.
> For example, here are a few representative offsets of corrupted bytes
> from one test of many:
>
> 0x020f4554
> 0x020f4555
> 0x020f4556
> 0x020f4557
> 0x020f4555
> 0x1597f1d4
> 0x23034ee5
> 0x2dfd08d4
> 0x33690b14
> 0x33690b15
> 0x33690b16
> 0x33690b17
>
> Approximately half the corruption involves all four bytes of the second
> four byte word of the 16 byte aligned region. The remaining instances show
> one, two or three bytes being corrupted. There is no pattern that I can
> discern to the corruption. It is definitely not anything as simple as the
> correct bytes being replaced with zeros or specific bits being forced to
> a zero or one state or flipped. Copying a 2 GiB file typically results
> in between five and thirty bytes being corrupted. Back to back tests
> copying the same file results in different bytes being corrupted.
>
> The problem appears to be sensitive to the data pattern. Some files can be
> copied repeatedly without corruption. But others will exhibit corruption
> 100% of the time when attached to the nVidia SATA controller and over 50%
> of the time when connected to the Sil 3114 controller.
>
> I have tried direct I/O with varying transfer sizes from 1 KiB to 128 KiB
> (I have not tried single sector I/O). Corruption occurs with all block
> sizes. If I execute an I/O loop like this the corruption does not appear
> to occur:
>
> while direct read 64 KiB original
> direct write copy
> direct read + verify copy
> done
>
> But that is tentative as I've only done two such tests at this time.
>
> A coworker has hypothesized that this may be a consequence of bus
> crosstalk. That it doesn't happen when using the Promise TX2 controller
> suggests that if it is a HW bus problem it isn't a hypertransport or
> memory bus problem.
>
> Configuration Details
> =====================
> Mainboard: ASUS A8N-SLI Deluxe with nForce 4 chipset
> BIOS: 1016 (original) and 1805 (upgraded to in attempt to resolve)
> CPU: AMD Athlon 64 X2 Dual Core Processor 4400+ stepping 02
> Memory: 4 x 1 GiB Kingston matched pairs lifetime warranty
> Kernel: 2.6.19 kernel.org (as well as others)
> Disks: Western Digital WD740GD-00FL (10000 rpm "Raptor" disk)
> Western Digital WD1600JD-32H (7200 rpm)
>
> 00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev
> a3) Subsystem: ASUSTeK Computer Inc. Unknown device 815a
> Flags: bus master, 66MHz, fast devsel, latency 0
> Capabilities: [44] HyperTransport: Slave or Primary Interface
> Capabilities: [e0] HyperTransport: MSI Mapping
>
> 00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev f3)
> Subsystem: ASUSTeK Computer Inc. K8N4-E Mainboard
> Flags: bus master, 66MHz, fast devsel, latency 0
>
> 00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
> Subsystem: ASUSTeK Computer Inc. K8N4-E Mainboard
> Flags: 66MHz, fast devsel, IRQ 3
> I/O ports at e000 [size=32]
> I/O ports at 4c00 [size=64]
> I/O ports at 4c40 [size=64]
> Capabilities: [44] Power Management version 2
>
> 00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
> (prog-if 10 [OHCI]) Subsystem: ASUSTeK Computer Inc. K8N4-E Mainboard
> Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 5
> Memory at d4003000 (32-bit, non-prefetchable) [size=4K]
> Capabilities: [44] Power Management version 2
>
> 00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
> (prog-if 20 [EHCI]) Subsystem: ASUSTeK Computer Inc. K8N4-E Mainboard
> Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 7
> Memory at feb00000 (32-bit, non-prefetchable) [size=256]
> Capabilities: [44] Debug port
> Capabilities: [80] Power Management version 2
>
> 00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97 Audio
> Controller (rev a2) Subsystem: ASUSTeK Computer Inc. K8N4-E Mainboard
> Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 11
> I/O ports at d800 [size=256]
> I/O ports at dc00 [size=256]
> Memory at d4002000 (32-bit, non-prefetchable) [size=4K]
> Capabilities: [44] Power Management version 2
>
> 00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2) (prog-if 8a
> [Master SecP PriP]) Subsystem: ASUSTeK Computer Inc. K8N4-E Mainboard
> Flags: bus master, 66MHz, fast devsel, latency 0
> I/O ports at f000 [size=16]
> Capabilities: [44] Power Management version 2
>
> 00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev
> f3) (prog-if 85 [Master SecO PriO]) Subsystem: ASUSTeK Computer Inc.
> Unknown device 815a
> Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 11
> I/O ports at 09f0 [size=8]
> I/O ports at 0bf0 [size=4]
> I/O ports at 0970 [size=8]
> I/O ports at 0b70 [size=4]
> I/O ports at d400 [size=16]
> Memory at d4001000 (32-bit, non-prefetchable) [size=4K]
> Capabilities: [44] Power Management version 2
>
> 00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev
> f3) (prog-if 85 [Master SecO PriO]) Subsystem: ASUSTeK Computer Inc. K8N4-E
> Mainboard
> Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 5
> I/O ports at 09e0 [size=8]
> I/O ports at 0be0 [size=4]
> I/O ports at 0960 [size=8]
> I/O ports at 0b60 [size=4]
> I/O ports at c000 [size=16]
> Memory at d4000000 (32-bit, non-prefetchable) [size=4K]
> Capabilities: [44] Power Management version 2
>
> 00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev f2) (prog-if
> 01 [Subtractive decode]) Flags: bus master, 66MHz, fast devsel, latency 0
> Bus: primary=00, secondary=05, subordinate=05, sec-latency=128
> I/O behind bridge: 00007000-00009fff
> Memory behind bridge: d2000000-d3ffffff
> Prefetchable memory behind bridge: d4100000-d42fffff
>
> 00:0b.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev f3) (prog-if
> 00 [Normal decode]) Flags: bus master, fast devsel, latency 0
> Bus: primary=00, secondary=04, subordinate=04, sec-latency=0
> Capabilities: [40] Power Management version 2
> Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable+
> Capabilities: [58] HyperTransport: MSI Mapping
> Capabilities: [80] Express Root Port (Slot+) IRQ 0
> Capabilities: [100] Virtual Channel
>
> 00:0c.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev f3) (prog-if
> 00 [Normal decode]) Flags: bus master, fast devsel, latency 0
> Bus: primary=00, secondary=03, subordinate=03, sec-latency=0
> Capabilities: [40] Power Management version 2
> Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable+
> Capabilities: [58] HyperTransport: MSI Mapping
> Capabilities: [80] Express Root Port (Slot+) IRQ 0
> Capabilities: [100] Virtual Channel
>
> 00:0d.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev f3) (prog-if
> 00 [Normal decode]) Flags: bus master, fast devsel, latency 0
> Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
> Capabilities: [40] Power Management version 2
> Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable+
> Capabilities: [58] HyperTransport: MSI Mapping
> Capabilities: [80] Express Root Port (Slot+) IRQ 0
> Capabilities: [100] Virtual Channel
>
> 00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3) (prog-if
> 00 [Normal decode]) Flags: bus master, fast devsel, latency 0
> Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
> I/O behind bridge: 0000a000-0000afff
> Memory behind bridge: d0000000-d1ffffff
> Prefetchable memory behind bridge: 00000000c0000000-00000000cff00000
> Capabilities: [40] Power Management version 2
> Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable+
> Capabilities: [58] HyperTransport: MSI Mapping
> Capabilities: [80] Express Root Port (Slot+) IRQ 0
> Capabilities: [100] Virtual Channel
>
> 00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
> HyperTransport Technology Configuration Flags: fast devsel
> Capabilities: [80] HyperTransport: Host or Secondary Interface
>
> 00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
> Address Map Flags: fast devsel
>
> 00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
> DRAM Controller Flags: fast devsel
>
> 00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
> Miscellaneous Control Flags: fast devsel
>
> 01:00.0 VGA compatible controller: ATI Technologies Inc RV530 [Radeon
> X1600] (prog-if 00 [VGA]) Subsystem: VISIONTEK Unknown device 1890
> Flags: bus master, fast devsel, latency 0, IRQ 3
> Memory at c0000000 (64-bit, prefetchable) [size=256M]
> Memory at d1000000 (64-bit, non-prefetchable) [size=64K]
> I/O ports at a000 [size=256]
> Expansion ROM at d0000000 [disabled] [size=128K]
> Capabilities: [50] Power Management version 2
> Capabilities: [58] Express Endpoint IRQ 0
> Capabilities: [80] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
>
> 01:00.1 Display controller: ATI Technologies Inc RV530 [Radeon X1600]
> (Secondary) Subsystem: VISIONTEK Unknown device 1891
> Flags: fast devsel
> Memory at d1010000 (64-bit, non-prefetchable) [disabled] [size=64K]
> Capabilities: [50] Power Management version 2
> Capabilities: [58] Express Endpoint IRQ 0
>
> 05:07.0 Mass storage controller: Promise Technology, Inc. PDC20375 (SATA150
> TX2plus) (rev 02) Subsystem: Promise Technology, Inc. PDC20375 (SATA150
> TX2plus)
> Flags: bus master, 66MHz, medium devsel, latency 96, IRQ 5
> I/O ports at 7000 [size=64]
> I/O ports at 7400 [size=16]
> I/O ports at 7800 [size=128]
> Memory at d3124000 (32-bit, non-prefetchable) [size=4K]
> Memory at d3100000 (32-bit, non-prefetchable) [size=128K]
> Expansion ROM at d4280000 [disabled] [size=16K]
> Capabilities: [60] Power Management version 2
>
> 05:08.0 Ethernet controller: Intel Corporation 82557/8/9 [Ethernet Pro 100]
> (rev 08) Subsystem: IBM Netfinity 10/100
> Flags: bus master, medium devsel, latency 32, IRQ 3
> Memory at d3127000 (32-bit, non-prefetchable) [size=4K]
> I/O ports at 7c00 [size=64]
> Memory at d3000000 (32-bit, non-prefetchable) [size=1M]
> Expansion ROM at d4100000 [disabled] [size=1M]
> Capabilities: [dc] Power Management version 2
>
> 05:0a.0 RAID bus controller: Silicon Image, Inc. SiI 3114
> [SATALink/SATARaid] Serial ATA Controller (rev 02) Subsystem: ASUSTeK
> Computer Inc. Unknown device 8167
> Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 7
> I/O ports at 8000 [size=8]
> I/O ports at 8400 [size=4]
> I/O ports at 8800 [size=8]
> I/O ports at 8c00 [size=4]
> I/O ports at 9000 [size=16]
> Memory at d3125000 (32-bit, non-prefetchable) [size=1K]
> Expansion ROM at d4200000 [disabled] [size=512K]
> Capabilities: [60] Power Management version 2
>
> 05:0b.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000
> Controller (PHY/Link) (prog-if 10 [OHCI]) Subsystem: ASUSTeK Computer Inc.
> K8N4-E Mainboard
> Flags: bus master, medium devsel, latency 32, IRQ 3
> Memory at d3126000 (32-bit, non-prefetchable) [size=2K]
> Memory at d3120000 (32-bit, non-prefetchable) [size=16K]
> Capabilities: [44] Power Management version 2
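The offset pattern reported in the quoted message (corrupted bytes always at 0x...4 to 0x...7 of a 16 byte aligned region) is easy to check mechanically. What follows is only a minimal Python sketch under stated assumptions, not the tool used for the tests above: the names differing_offsets, in_second_dword and REPORTED are made up for illustration, and the 64 KiB chunk size is an arbitrary choice.

```python
CHUNK = 64 * 1024  # 64 KiB per read; an arbitrary choice for this sketch


def differing_offsets(path_a, path_b, chunk=CHUNK):
    """Yield every byte offset at which the two files differ.

    A length mismatch shows up as differences in the trailing bytes.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        base = 0
        while True:
            a = fa.read(chunk)
            b = fb.read(chunk)
            if not a and not b:
                break
            if a != b:
                for i in range(max(len(a), len(b))):
                    # Slicing avoids an IndexError when one file is shorter.
                    if a[i:i + 1] != b[i:i + 1]:
                        yield base + i
            base += max(len(a), len(b))


def in_second_dword(offset):
    """True if the offset falls in bytes 4..7 of its 16-byte aligned region."""
    return 4 <= (offset & 0xF) <= 7


# Offsets copied from the quoted report (hypothetical variable name).
REPORTED = [
    0x020F4554, 0x020F4555, 0x020F4556, 0x020F4557, 0x020F4555,
    0x1597F1D4, 0x23034EE5, 0x2DFD08D4,
    0x33690B14, 0x33690B15, 0x33690B16, 0x33690B17,
]
```

Calling differing_offsets(original, copy) lists where a copy differs from its original, and in_second_dword() can then be applied to each offset. Such a comparison only exercises the disk if the copy is no longer held in the pagecache, as the quoted report points out.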
Hello lkml,
I also have an nForce4 based mainboard and tried hard to reproduce your
testcase but I found no errors so far!
Asus A8N-SLI Deluxe (nForce4)
AMD Athlon 64x2 3800 (overclocked)
2x Samsung SpinPoint 250GB (SATA2)
1x Samsung SpinPoint 400GB (SATA2)
2GB DDR1-Ram
Linux 2.6.19-rc6-mm1 #2 SMP PREEMPT Thu Nov 23 15:34:58 CET 2006 x86_64
GNU/Linux
Filesystem is XFS on LVM on dmraid (nForce4 fakeraid)
I copied a massive amount of data (more than 500GB) several times between the
HDDs and ran md5sum each time, but it found no errors.
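A repeated checksum test of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the tests above: file_digest and verify_repeatedly are hypothetical helper names, and sha512 is the default here, although any algorithm known to hashlib (md5 included) can be passed instead.

```python
import hashlib


def file_digest(path, algorithm="sha512", chunk=1 << 20):
    """Return the hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def verify_repeatedly(path, expected, passes=50, algorithm="sha512"):
    """Re-hash the file `passes` times and return the failing pass numbers."""
    return [n for n in range(1, passes + 1)
            if file_digest(path, algorithm) != expected]
```

An empty list from verify_repeatedly() means every pass matched. Repeated reads of a file smaller than RAM may be served from the pagecache, so a test like this only reaches the disk when the data set is larger than memory or the cache is flushed between passes.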
Maybe it's a temperature problem? I found my board to be very vulnerable to
higher temperatures. If the nForce4 chipset reaches a temperature over 42°C, it
tends to malfunction. The first symptom I had was that USB ceased to work,
then came SATA hangs, and finally a kernel panic or a bluescreen on Windows.
lspci
00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97 Audio
Controller (rev a2)
00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
00:07.0 RAID bus controller: nVidia Corporation CK804 Serial ATA Controller
(rev f3)
00:08.0 RAID bus controller: nVidia Corporation CK804 Serial ATA Controller
(rev f3)
00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3)
00:0b.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0c.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0d.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM
Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Miscellaneous Control
01:00.0 VGA compatible controller: nVidia Corporation G70 [GeForce 7600 GT]
(rev a1)
05:0b.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000
Controller (PHY/Link)
05:0c.0 Ethernet controller: Marvell Technology Group Ltd. 88E8001 Gigabit
Ethernet Controller (rev 13)
-Christian
On Wed, Dec 06, 2006 at 12:11:38PM +0100, Christian wrote:
> I copied a massive amount of data (more than 500GB) several times
> between the HDDs and ran md5sum each time, but it found no errors.
It might be a known problem that your BIOS addresses already, or maybe
it's restricted to some revisions of the chip(s)?
Ville Herva wrote:
> I saw something very similar with Via KT133 years ago. Then the culprit was
> a botched PCI implementation that sometimes corrupted PCI transfers when there
> was heavy PCI I/O going on. Usually that meant running two disk transfers at
> the same time. Doing heavy network I/O at the same time made it more likely
> to happen.
Hm, I only run one test at a time,... and the network is not used very much
during the tests.
> I used this crude hack:
> http://v.iki.fi/~vherva/tmp/wrchk.c
>
I'll have a look at it :)
> If the problem in your case is that the PCI transfer gets corrupted when it
> happens to a certain memory area, I guess you could try to binary search for
> the bad spot with the kernel BadRam patches
> http://www.linuxjournal.com/article/4489 (I seem to recall it was possible
> to turn off memory areas with vanilla kernel boot params without a patch,
> but I can't find a reference.)
>
I know badram,.. but the thing is,.. it's highly unlikely that my
RAM is damaged. Many hours of memtest86+ runs did not show any errors
(not even ECC errors),...
And why would disabling memhole mapping solve the issue if the memory was
damaged? That could only be the case if the bad blocks were in the address
space used by the memhole....
Chris.
On Sat, 2 Dec 2006, Karsten Weiss wrote:
> On Sat, 2 Dec 2006, Christoph Anton Mitterer wrote:
>
> > I found a severe bug mainly by fortune because it occurs very rarely.
> > My test looks like the following: I have about 30GB of testing data on
>
> This sounds very familiar! One of the Linux compute clusters I
> administer at work is a 336 node system consisting of the
> following components:
>
> * 2x Dual-Core AMD Opteron 275
> * Tyan S2891 mainboard
> * Hitachi HDS728080PLA380 harddisk
> * 4 GB RAM (some nodes have 8 GB) - intensively tested with
> memtest86+
> * SUSE 9.3 x86_64 (kernel 2.6.11.4-21.14-smp) - But I've also
> e.g. tried the latest openSUSE 10.2 RC1+ kernel 2.6.18.2-33 which
> makes no difference.
>
> We are running LS-Dyna on these machines and discovered a
> testcase which shows a similar data corruption. So I can
> confirm that the problem is real and not a hardware defect
> of a single machine!
Last week we did some more testing with the following result:
We could not reproduce the data corruption anymore if we boot the machines
with the kernel parameter "iommu=soft" i.e. if we use software bounce
buffering instead of the hw-iommu. (As mentioned before, booting with
mem=2g works fine, too, because this disables the iommu altogether.)
I.e. on these systems the data corruption only happens if the hw-iommu
(PCI-GART) of the Opteron CPUs is in use.
Christoph, Erik, Chris: I would appreciate if you would test and hopefully
confirm this workaround, too.
Best regards,
Karsten
--
__________________________________________creating IT solutions
Dipl.-Inf. Karsten Weiss science + computing ag
phone: +49 7071 9457 452 Hagellocher Weg 73
teamline: +49 7071 9457 681 72070 Tuebingen, Germany
email: [email protected] http://www.science-computing.de
On Mon, Dec 11, 2006 at 10:24:02AM +0100, Karsten Weiss wrote:
> We could not reproduce the data corruption anymore if we boot the
> machines with the kernel parameter "iommu=soft" i.e. if we use
> software bounce buffering instead of the hw-iommu. (As mentioned
> before, booting with mem=2g works fine, too, because this disables
> the iommu altogether.)
I can confirm this also seems to be the case for me, though I'm still doing
more testing to be sure. But it would seem:
On nforce4, when transferring a large amount of data with 4GB+ of RAM, I get
some corruption. This is present on both the nv SATA and also Sil 3112
connected drives.
Using iommu=soft so far seems to be working without any corruption.
I still need to do more testing on other machines which have less
memory (so the IOMMU won't be in use there either) and see if there
are problems there.
Karsten Weiss wrote:
> Here's a diff of a corrupted and a good file written during our
> testcase:
>
> ("-" == corrupted file, "+" == good file)
> ...
> 009f2ff0 67 2a 4c c4 6d 9d 34 44 ad e6 3c 45 05 9a 4d c4 |g*L.m.4D..<E..M.|
> -009f3000 39 60 e6 44 20 ab 46 44 56 aa 46 44 c2 35 e6 44 |9`.D .FDV.FD.5.D|
> ....
> +009f3ff0 f3 55 92 44 c1 10 6c 45 5e 12 a0 c3 60 31 93 44 |.U.D..lE^...`1.D|
> 009f4000 88 cd 6b 45 c1 6d cd c3 00 a5 8b 44 f2 ac 6b 45 |..kE.m.....D..kE|
>
Well, as I said in my mails to the list, my experience was that not
all bytes of the corrupted area are invalid,.. but only some,.. while it
seems that in your diff ALL the bytes are wrong, right?
> Please notice:
>
> a) the corruption begins at a page boundary
> b) the corrupted byte range is a single memory page and
> c) almost every fourth byte is set to 0x44 in the corrupted case
> (but the other bytes changed, too)
>
> To me this looks as if a wrong memory page got written into the
> file.
>
Hmm, and do you have any idea what the reason for all this is? A defect in
the nforce chipset? Or even in the CPU (the Opterons do have integrated
memory controllers)?
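The page-boundary observation in the quoted diff can be checked with a little arithmetic. This is a minimal Python sketch, assuming a 4 KiB page size (the usual base page size on x86_64, not stated in the mail) and assuming that the "...." in the diff stands for the dump lines in between; all names are made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages
DUMP_LINE = 16    # bytes shown per hex dump line

first_bad_line = 0x009F3000  # address of the first replaced dump line
last_bad_line = 0x009F3FF0   # address of the last replaced dump line

start = first_bad_line
end = last_bad_line + DUMP_LINE  # exclusive end of the differing range


def is_single_page(start, end, page_size=PAGE_SIZE):
    """True if [start, end) is exactly one page-aligned page."""
    return start % page_size == 0 and end - start == page_size


# First corrupted dump line from the diff; take every fourth byte of it.
corrupted = bytes.fromhex("3960e64420ab464456aa4644c235e644")
every_fourth = corrupted[3::4]

print(hex(start), hex(end), is_single_page(start, end), every_fourth.hex())
```

Under these assumptions the differing range is 0x009f3000 up to (but not including) 0x009f4000, i.e. one aligned 4 KiB page, and every fourth byte of the quoted corrupted line is 0x44.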
> From our testing I can also tell that the data corruption does
> *not* appear at all when we are booting the nodes with mem=2G.
> However, when we are using all the 4GB the data corruption
> shows up - but not every time and thus not on all nodes.
> Sometimes a node runs for hours without any problem. That's why
> we are testing on 32 nodes in parallel most of the time. I have
> the impression that it has something to do with physical memory
> layout of the running processes.
>
Hmm maybe,.. but I have absolutely no idea ;)
> Please also notice that this is a silent data corruption. I.e.
> there are no error or warning messages in the kernel log or the
> mce log at all.
>
Yes I can confirm that.
> Christoph, I will carefully re-read your entire posting and the
> included links on Monday and will also try the memory hole
> setting.
>
And did you find out anything new?
Karsten Weiss wrote:
> Last week we did some more testing with the following result:
>
> We could not reproduce the data corruption anymore if we boot the machines
> with the kernel parameter "iommu=soft" i.e. if we use software bounce
> buffering instead of the hw-iommu. (As mentioned before, booting with
> mem=2g works fine, too, because this disables the iommu altogether.)
>
I can confirm this,...
booting with mem=2G => works fine,...
(all of the following tests were made with memory hole mapping=hardware
in the BIOS,.. so I could access my full ram):
booting with iommu=soft => works fine
booting with iommu=noagp => DOESN'T solve the error
booting with iommu=off => the system doesn't even boot and panics
When I set IOMMU to disabled in the BIOS, the error is not solved.
I tried to set a bigger space for the IOMMU in the BIOS (256MB instead of
64MB),.. but it does not solve the problem.
Any ideas why iommu=disabled in the BIOS does not solve the issue?
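Since the results above depend on which options the kernel was actually booted with, it can help to log them alongside each test run. This is a minimal Python sketch, assuming a Linux system where /proc/cmdline holds the boot command line; boot_options and current_boot_options are hypothetical helper names.

```python
def boot_options(cmdline, keys=("mem", "iommu")):
    """Extract selected key=value options from a kernel command line string."""
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in keys:
            found[key] = value
    return found


def current_boot_options(path="/proc/cmdline"):
    """Read the running kernel's command line and extract mem= and iommu=."""
    with open(path) as f:
        return boot_options(f.read())
```

For example, boot_options("root=/dev/sda1 ro iommu=soft") returns {"iommu": "soft"}. This only reports what was requested on the command line, not which DMA mapping implementation the kernel ended up using.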
> I.e. on these systems the data corruption only happens if the hw-iommu
> (PCI-GART) of the Opteron CPUs is in use.
>
1) And does this now mean that there's an error in the hardware (chipset
or CPU/memcontroller)?
> Christoph, Erik, Chris: I would appreciate if you would test and hopefully
> confirm this workaround, too.
>
Yes I can absolutely confirm this...
Do my additional tests help you?
Do you have any ideas why the issue doesn't occur (even with memhole
mapping=hardware in the bios and no iommu=soft at kernel command line)
when dma is disabled for the disks (or a slower dma mode is used)?
Chris.
Ah and I forgot,...
Did anyone run any tests under Windows? I cannot set iommu=soft there,
can I?
Chris.
Chris Wedgwood wrote:
>> Did anyone run any tests under Windows? I cannot set iommu=soft
>> there, can I?
>>
> Windows never uses the hardware iommu, so it's always doing the
> equivalent of iommu=soft
>
That would mean that I'm not able to reproduce the issue under Windows,
right?
Does that apply to all versions (up to and including Vista)?
Don't get me wrong,.. I don't use Windows (except for upgrading
my Plextor firmware and EAC ;) )... but I ask because the more
information we get (even if it's not Linux specific) the more steps we
can take ;)
Chris.
On Wed, Dec 13, 2006 at 08:18:21PM +0100, Christoph Anton Mitterer wrote:
> booting with iommu=soft => works fine
> booting with iommu=noagp => DOESN'T solve the error
> booting with iommu=off => the system doesn't even boot and panics
> When I set IOMMU to disabled in the BIOS the error is not solved.
> I tried to set bigger space for the IOMMU in the BIOS (256MB instead of
> 64MB),.. but it does not solve the problem.
> Any ideas why iommu=disabled in the bios does not solve the issue?
If it can, the kernel will still use the IOMMU even if the BIOS doesn't set
it up. Check your dmesg for IOMMU strings; there might be something
printed to this effect.
> 1) And does this now mean that there's an error in the hardware
> (chipset or CPU/memcontroller)?
My guess is it's a kernel bug, but I don't know for certain. Perhaps we
should start making a more comprehensive list of affected kernels &
CPUs?
On Wed, Dec 13, 2006 at 08:20:59PM +0100, Christoph Anton Mitterer wrote:
> Did anyone run any tests under Windows? I cannot set iommu=soft
> there, can I?
Windows never uses the hardware iommu, so it's always doing the
equivalent of iommu=soft
Karsten Weiss wrote:
> "Memory hole mapping" was set to "hardware". With "disabled" we only
> see 3 of our 4 GB memory.
>
That sounds reasonable,... I only see 2.5 GB,.. as my memhole takes
1536 MB (don't ask me which PCI device needs that much address space ;) )
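As a rough check of these numbers, here is a tiny Python sketch. It assumes the simplified model that RAM hidden behind the unmapped memory hole is simply lost; visible_mib is a made-up helper name.

```python
def visible_mib(installed_mib, hole_mib):
    """RAM still visible when the memory hole is not remapped (simplified)."""
    return installed_mib - hole_mib


# 4 GB installed with a 1536 MB hole, as described above.
print(visible_mib(4096, 1536) / 1024, "GB visible")
# 4 GB installed with a 1 GB hole, as on the cluster nodes.
print(visible_mib(4096, 1024) / 1024, "GB visible")
```

With a 1536 MB hole this gives 2560 MB, i.e. 2.5 GB, and with a 1 GB hole it gives the 3 GB reported for the cluster nodes.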
On Wed, 13 Dec 2006, Christoph Anton Mitterer wrote:
>> Christoph, I will carefully re-read your entire posting and the
>> included links on Monday and will also try the memory hole
>> setting.
>>
> And did you find out anything new?
As I already mentioned, the kernel parameter "iommu=soft" fixes
the data corruption for me. We saw no more data corruption
during a test on 48 machines over the last weekend. Chris
Wedgwood already confirmed that this setting fixed the data
corruption for him, too.
Of course, the big question "Why does the hardware iommu *not*
work on those machines?" still remains.
I have also tried setting "memory hole mapping" to "disabled"
instead of "hardware" on some of the machines and this *seems*
to be stable, too. However, I only tested it on about a
dozen machines because this BIOS setting costs us 1 GB of memory
(and iommu=soft does not).
BTW: Maybe I should also mention that other machine types
(e.g. the HP xw9300 dual opteron workstations) which also use a
NVIDIA chipset and Opterons never had this problem as far as I
know.
Best regards,
Karsten
--
Dipl.-Inf. Karsten Weiss - http://www.machineroom.de/knweiss
Karsten Weiss wrote:
> Of course, the big question "Why does the hardware iommu *not*
> work on those machines?" still remains.
>
I'm going to check AMD's errata docs in the next few days,.. perhaps I'll find
something related. But I'd ask you to do the same, as I don't
consider myself an expert in these issues ;-)
Chris Wedgwood said that the iommu isn't used under Windows at all,.. so I
think the following three explanations would be possible:
- error in the Opteron (memory controller)
- error in the Nvidia chipsets
- error in the kernel
> I have also tried setting "memory hole mapping" to "disabled"
> instead of "hardware" on some of the machines and this *seems*
> to work stable, too. However, I did only test it on about a
> dozen machines because this bios setting costs us 1 GB memory
> (and iommu=soft does not).
>
Yes... losing so much memory is a big drawback,.. anyway it would be
great if you could run some more extensive tests so that we'd be able to say
whether memholemapping=disabled in the BIOS really solves the issue, too, or
not.
Does anyone know how memhole mapping in the BIOS relates to the iommu stuff?
Is it likely or explainable that both would solve the issue?
> BTW: Maybe I should also mention that other machine types
> (e.g. the HP xw9300 dual opteron workstations) which also use a
> NVIDIA chipset and Opterons never had this problem as far as I
> know.
>
Uhm,.. that's really strange,... I would have thought that this would
affect all systems that use either the (maybe) buggy nforce chipset,..
or the (maybe) buggy Opteron.
Did those systems have exactly the same Nvidia chipset type? Same question for
the CPU (perhaps the issue only occurs with a specific stepping).
Again I have:
nforce professional 2200
nforce professional 2050
Opteron model 275 (stepping E6)
btw: I think that is already clear, but again:
Both "solutions" solve the problem for me:
Either
- memhole mapping=disabled in the BIOS (but you lose some memory)
- without any iommu= option for the kernel
or
- memhole mapping=hardware in the BIOS (I suppose it will work with
software too)
- with iommu=soft for the kernel
Best wishes,
Chris.
On Mon Dec 11, 2006 at 10:24:02AM +0100, Karsten Weiss wrote:
> Last week we did some more testing with the following result:
>
> We could not reproduce the data corruption anymore if we boot the machines
> with the kernel parameter "iommu=soft" i.e. if we use software bounce
> buffering instead of the hw-iommu. (As mentioned before, booting with
> mem=2g works fine, too, because this disables the iommu altogether.)
>
> I.e. on these systems the data corruption only happens if the hw-iommu
> (PCI-GART) of the Opteron CPUs is in use.
>
> Christoph, Erik, Chris: I would appreciate if you would test and hopefully
> confirm this workaround, too.
What did you set the BIOS to when testing this setting?
Memory Hole enabled? IOMMU enabled?
-Erik
--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--
On Wed, 13 Dec 2006, Erik Andersen wrote:
> On Mon Dec 11, 2006 at 10:24:02AM +0100, Karsten Weiss wrote:
> > Last week we did some more testing with the following result:
> >
> > We could not reproduce the data corruption anymore if we boot the machines
> > with the kernel parameter "iommu=soft" i.e. if we use software bounce
> > buffering instead of the hw-iommu. (As mentioned before, booting with
> > mem=2g works fine, too, because this disables the iommu altogether.)
> >
> > I.e. on these systems the data corruption only happens if the hw-iommu
> > (PCI-GART) of the Opteron CPUs is in use.
> >
> > Christoph, Erik, Chris: I would appreciate if you would test and hopefully
> > confirm this workaround, too.
>
> What did you set the BIOS to when testing this setting?
> Memory Hole enabled? IOMMU enabled?
"Memory hole mapping" was set to "hardware". With "disabled" we only
see 3 of our 4 GB memory.
Best regards,
Karsten
--
__________________________________________creating IT solutions
Dipl.-Inf. Karsten Weiss science + computing ag
phone: +49 7071 9457 452 Hagellocher Weg 73
teamline: +49 7071 9457 681 72070 Tuebingen, Germany
email: [email protected] http://www.science-computing.de
On Mon Dec 11, 2006 at 10:24:02AM +0100, Karsten Weiss wrote:
> We could not reproduce the data corruption anymore if we boot
> the machines with the kernel parameter "iommu=soft" i.e. if we
> use software bounce buffering instead of the hw-iommu.
I just realized that booting with "iommu=soft" makes my pcHDTV
HD5500 DVB cards not work. Time to go back to disabling the
memhole and losing 1 GB. :-(
-Erik
--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--
Erik Andersen wrote:
> I just realized that booting with "iommu=soft" makes my pcHDTV
> HD5500 DVB cards not work. Time to go back to disabling the
> memhole and losing 1 GB. :-(
Crazy,...
I have a Hauppauge Nova-T 500 DualDVB-T card,... I'll check later whether
I have the same problem and will let you know (please remind me if I
forget ;) )
Chris.
On Wed, 13 Dec 2006, Chris Wedgwood wrote:
> > Any ideas why iommu=disabled in the bios does not solve the issue?
>
> The kernel will still use the IOMMU if the BIOS doesn't set it up if
> it can, check your dmesg for IOMMU strings, there might be something
> printed to this effect.
FWIW: As far as I understand the Linux kernel code (I am no kernel
developer so please correct me if I am wrong) the PCI DMA mapping code is
abstracted by struct dma_mapping_ops. I.e. there are currently four
possible implementations for x86_64 (see linux-2.6/arch/x86_64/kernel/):
1. pci-nommu.c : no IOMMU at all (e.g. because you have < 4 GB memory)
Kernel boot message: "PCI-DMA: Disabling IOMMU."
2. pci-gart.c : (AMD) Hardware-IOMMU.
Kernel boot message: "PCI-DMA: using GART IOMMU" (this message
first appeared in 2.6.16)
3. pci-swiotlb.c : Software-IOMMU (used e.g. if there is no hw iommu)
Kernel boot message: "PCI-DMA: Using software bounce buffering
for IO (SWIOTLB)"
4. pci-calgary.c : Calgary HW-IOMMU from IBM; used in pSeries servers.
This HW-IOMMU supports dma address mapping with memory protection,
etc.
Kernel boot message: "PCI-DMA: Using Calgary IOMMU" (since 2.6.18!)
What all this means is that you can use "dmesg|grep ^PCI-DMA:" to see
which implementation your kernel is currently using.
As far as our problem machines are concerned the "PCI-DMA: using GART
IOMMU" case is broken (data corruption). But both "PCI-DMA: Disabling
IOMMU" (triggered with mem=2g) and "PCI-DMA: Using software bounce buffering
for IO (SWIOTLB)" (triggered with iommu=soft) are stable.
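A minimal sketch of the check described above: it maps each "PCI-DMA:" boot message from the list to the source file that prints it. The message strings are taken from this mail; the table and function names are made up for illustration, and the handling of a leading "[timestamp]" prefix is an assumption about how dmesg output may be formatted.

```python
# Boot message prefix -> implementation, copied from the list above.
# PCI_DMA_MESSAGES and classify_pci_dma are hypothetical names.
PCI_DMA_MESSAGES = {
    "PCI-DMA: Disabling IOMMU": "pci-nommu.c (no IOMMU)",
    "PCI-DMA: using GART IOMMU": "pci-gart.c (AMD GART hardware IOMMU)",
    "PCI-DMA: Using software bounce buffering": "pci-swiotlb.c (software IOMMU)",
    "PCI-DMA: Using Calgary IOMMU": "pci-calgary.c (IBM Calgary hardware IOMMU)",
}


def classify_pci_dma(dmesg_text):
    """Return the implementations named by PCI-DMA lines in dmesg output."""
    found = []
    for line in dmesg_text.splitlines():
        # Assumption: lines may start with a "[   12.345678]" timestamp.
        if line.startswith("[") and "]" in line:
            line = line.split("]", 1)[1]
        line = line.strip()
        for prefix, impl in PCI_DMA_MESSAGES.items():
            if line.startswith(prefix):
                found.append(impl)
    return found
```

Feeding it the saved output of "dmesg" should be enough; on the machines discussed here a result naming pci-gart.c would be the configuration reported as broken.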
BTW: It would be really great if this area of the kernel got some
more and better documentation. The information in
linux-2.6/Documentation/x86_64/boot-options.txt is very terse. I had to
read the code to get a *rough* idea what all the "iommu=" options
actually do and how they interact.
> > 1) And does this now mean that there's an error in the hardware
> > (chipset or CPU/memcontroller)?
>
> My guess is it's a kernel bug, I don't know for certain. Perhaps we
> should start making a more comprehensive list of affected kernels &
> CPUs?
BTW: Did someone already open an official bug at
http://bugzilla.kernel.org ?
Best regards,
Karsten
--
__________________________________________creating IT solutions
Dipl.-Inf. Karsten Weiss science + computing ag
phone: +49 7071 9457 452 Hagellocher Weg 73
teamline: +49 7071 9457 681 72070 Tuebingen, Germany
email: [email protected] http://www.science-computing.de
Lennart Sorensen wrote:
> I upgrade my plextor firmware using linux. pxupdate for most devices,
> and pxfw for new drives (like the PX760). Works perfectly for me. It
> is one of the reasons I buy plextors.
Yes, I know about it,.. although I never tested it,... anyway the main
reason for Windows is Exact Audio Copy (but Andre Wiehthoff is working
on a C port :-D )
Unfortunately my PX760 seems to be defective,.. I posted about the issue
to lkml but without success :-(
Best wishes,
Chris.
On Wed, Dec 13, 2006 at 08:57:23PM +0100, Christoph Anton Mitterer wrote:
> Don't get me wrong,.. I don't use Windows (except for upgrading
> my Plextor firmware and EAC ;) )... but I ask because the more
> information we get (even if it's not Linux specific) the more steps we
> can take ;)
I upgrade my plextor firmware using linux. pxupdate for most devices,
and pxfw for new drives (like the PX760). Works perfectly for me. It
is one of the reasons I buy plextors.
--
Len Sorensen
Hi.
I've just looked for some kernel config options that might relate to our
issue:
1)
Old style AMD Opteron NUMA detection (CONFIG_K8_NUMA)
Enable K8 NUMA node topology detection. You should say Y here if you
have a multi processor AMD K8 system. This uses an old method to read
the NUMA configuration directly from the builtin Northbridge of Opteron.
It is recommended to use X86_64_ACPI_NUMA instead, which also takes
priority if both are compiled in.
ACPI NUMA detection (CONFIG_X86_64_ACPI_NUMA)
Enable ACPI SRAT based node topology detection.
What should one select for the Opterons? And is it possible that this
has something to do with our data corruption error?
2)
The same two questions for the memory model (Discontiguous or Sparse)
3)
The same two questions for CONFIG_MIGRATION
4)
And does someone know if the nforce/opteron iommu requires IBM Calgary
IOMMU support?
This is unrelated to our issue,.. but it would be nice if some of you
could send me your .config,.. I'd like to compare them with my own and
see if there is something I could tweak.
(Of course only people with 2x DualCore Systems ;) )
Chris.
On Wed, Dec 13, 2006 at 09:34:16PM +0100, Karsten Weiss wrote:
> FWIW: As far as I understand the linux kernel code (I am no kernel
> developer so please correct me if I am wrong) the PCI dma mapping code is
> abstracted by struct dma_mapping_ops. I.e. there are currently four
> possible implementations for x86_64 (see
> linux-2.6/arch/x86_64/kernel/)
>
> 1. pci-nommu.c : no IOMMU at all (e.g. because you have < 4 GB memory)
> Kernel boot message: "PCI-DMA: Disabling IOMMU."
>
> 2. pci-gart.c : (AMD) Hardware-IOMMU.
> Kernel boot message: "PCI-DMA: using GART IOMMU" (this message
> first appeared in 2.6.16)
>
> 3. pci-swiotlb.c : Software-IOMMU (used e.g. if there is no hw iommu)
> Kernel boot message: "PCI-DMA: Using software bounce buffering
> for IO (SWIOTLB)"
Used if there's no HW IOMMU *and* it's needed (because you have >4GB
memory) or you told the kernel to use it (iommu=soft).
> 4. pci-calgary.c : Calgary HW-IOMMU from IBM; used in pSeries servers.
> This HW-IOMMU supports dma address mapping with memory protection,
> etc.
> Kernel boot message: "PCI-DMA: Using Calgary IOMMU" (since
> 2.6.18!)
Calgary is found in pSeries servers, but also in high-end xSeries
(Intel based) servers. It would be a little awkward if pSeries servers
(which are based on PowerPC processors) used code under arch/x86_64
:-)
> BTW: It would be really great if this area of the kernel would get some
> more and better documentation. The information at
> linux-2.6/Documentation/x86_64/boot_options.txt is very terse. I had to
> read the code to get a *rough* idea what all the "iommu=" options
> actually do and how they interact.
Patches happily accepted :-)
Cheers,
Muli
On Wed, Dec 13, 2006 at 01:29:25PM -0700, Erik Andersen wrote:
> On Mon Dec 11, 2006 at 10:24:02AM +0100, Karsten Weiss wrote:
> > We could not reproduce the data corruption anymore if we boot
> > the machines with the kernel parameter "iommu=soft" i.e. if we
> > use software bounce buffering instead of the hw-iommu.
>
> I just realized that booting with "iommu=soft" makes my pcHDTV
> HD5500 DVB cards not work. Time to go back to disabling the
> memhole and losing 1 GB. :-(
That points to a bug in the driver (likely) or swiotlb (unlikely), as
the IOMMU in use should be transparent to the driver. Which driver is
it?
Cheers,
Muli
On Thu, Dec 14, 2006 at 12:33:23AM +0100, Christoph Anton Mitterer wrote:
> 4)
> And does someone know if the nforce/opteron iommu requires IBM Calgary
> IOMMU support?
It doesn't; Calgary isn't found in machines with Opteron CPUs or NForce
chipsets (AFAIK). However, compiling Calgary in should make no
difference, as we detect at run time which IOMMU is present in the
machine.
Cheers,
Muli
On Wed, Dec 13, 2006 at 09:11:29PM +0100, Christoph Anton Mitterer wrote:
> - error in the Opteron (memory controller)
> - error in the Nvidia chipsets
> - error in the kernel
My guess without further information would be that some, but not all
BIOSes are doing some work to avoid this.
Does anyone have an amd64 with an nforce4 chipset and >4GB that does
NOT have this problem? If so it might be worth chasing the BIOS
vendors to see what errata they are dealing with.
On Thu Dec 14, 2006 at 11:23:11AM +0200, Muli Ben-Yehuda wrote:
> > I just realized that booting with "iommu=soft" makes my pcHDTV
> > HD5500 DVB cards not work. Time to go back to disabling the
> > memhole and losing 1 GB. :-(
>
> That points to a bug in the driver (likely) or swiotlb (unlikely), as
> the IOMMU in use should be transparent to the driver. Which driver is
> it?
presumably one of cx88xx, cx88_blackbird, cx8800, cx88_dvb,
cx8802, cx88_alsa, lgdt330x, tuner, cx2341x, btcx_risc,
video_buf, video_buf_dvb, tveeprom, or dvb_pll. It seems
to take an amazing number of drivers to make these devices
actually work...
-Erik
--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--
On Thu, Dec 14, 2006 at 02:52:35AM -0700, Erik Andersen wrote:
> On Thu Dec 14, 2006 at 11:23:11AM +0200, Muli Ben-Yehuda wrote:
> > > I just realized that booting with "iommu=soft" makes my pcHDTV
> > > HD5500 DVB cards not work. Time to go back to disabling the
> > > memhole and losing 1 GB. :-(
> >
> > That points to a bug in the driver (likely) or swiotlb (unlikely), as
> > the IOMMU in use should be transparent to the driver. Which driver is
> > it?
>
> presumably one of cx88xx, cx88_blackbird, cx8800, cx88_dvb,
> cx8802, cx88_alsa, lgdt330x, tuner, cx2341x, btcx_risc,
> video_buf, video_buf_dvb, tveeprom, or dvb_pll. It seems
> to take an amazing number of drivers to make these devices
> actually work...
Yikes! Do you know which one actually handles the DMA mappings? I
suspect a missing unmap or sync, which swiotlb requires to sync back
the bounce buffer with the driver's buffer.
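To illustrate why a missing sync matters with bounce buffering, here is a toy model, not the kernel API: the device writes into a separate bounce buffer, and the driver's own buffer only sees the data once it is copied back. The class and method names are hypothetical and chosen only for this sketch.

```python
class BounceMapping:
    """Toy model of one DMA mapping that goes through a bounce buffer."""

    def __init__(self, driver_buf):
        self.driver_buf = driver_buf              # memory the driver reads
        self.bounce = bytearray(len(driver_buf))  # memory the device writes

    def device_write(self, data):
        # The device can only reach the bounce buffer, never driver_buf.
        self.bounce[:len(data)] = data

    def sync_for_cpu(self):
        # Copy the bounce buffer back so the driver sees the device's data.
        self.driver_buf[:] = self.bounce


buf = bytearray(4)
mapping = BounceMapping(buf)
mapping.device_write(b"\xde\xad\xbe\xef")
stale = bytes(buf)   # read without a sync: still the old contents
mapping.sync_for_cpu()
fresh = bytes(buf)   # read after the sync: the device's data
```

With a hardware IOMMU or no IOMMU the device writes straight into the driver's memory, so a driver that forgets the sync can appear to work there and only fail once bounce buffers are in the path.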
Cheers,
Muli
On Thu, 14 Dec 2006, Muli Ben-Yehuda wrote:
> On Wed, Dec 13, 2006 at 09:34:16PM +0100, Karsten Weiss wrote:
>
> > BTW: It would be really great if this area of the kernel would get some
> > more and better documentation. The information at
> > linux-2.6/Documentation/x86_64/boot_options.txt is very terse. I had to
> > read the code to get a *rough* idea what all the "iommu=" options
> > actually do and how they interact.
>
> Patches happily accepted :-)
Well, you asked for it. :-) So here's my little contribution. Please
*double* *check*!
(BTW: I would like to know what "DAC" and "SAC" mean in this context)
===
From: Karsten Weiss <[email protected]>
Patch summary:
- Better explanation of some of the iommu kernel parameter options.
- "32MB<<order" instead of "32MB^order".
- Mention the default "order".
- SWIOTLB config help text
- removed the duplication of the iommu kernel parameter documentation.
- mention Documentation/x86_64/boot-options.txt in
Documentation/kernel-parameters.txt
- list the four existing PCI DMA mapping implementations of arch x86_64
Signed-off-by: Karsten Weiss <[email protected]>
---
--- linux-2.6.19/arch/x86_64/kernel/pci-dma.c.original 2006-12-14 11:15:38.348598021 +0100
+++ linux-2.6.19/arch/x86_64/kernel/pci-dma.c 2006-12-14 12:14:48.176967312 +0100
@@ -223,30 +223,10 @@
}
EXPORT_SYMBOL(dma_set_mask);
-/* iommu=[size][,noagp][,off][,force][,noforce][,leak][,memaper[=order]][,merge]
- [,forcesac][,fullflush][,nomerge][,biomerge]
- size set size of iommu (in bytes)
- noagp don't initialize the AGP driver and use full aperture.
- off don't use the IOMMU
- leak turn on simple iommu leak tracing (only when CONFIG_IOMMU_LEAK is on)
- memaper[=order] allocate an own aperture over RAM with size 32MB^order.
- noforce don't force IOMMU usage. Default.
- force Force IOMMU.
- merge Do lazy merging. This may improve performance on some block devices.
- Implies force (experimental)
- biomerge Do merging at the BIO layer. This is more efficient than merge,
- but should be only done with very big IOMMUs. Implies merge,force.
- nomerge Don't do SG merging.
- forcesac For SAC mode for masks <40bits (experimental)
- fullflush Flush IOMMU on each allocation (default)
- nofullflush Don't use IOMMU fullflush
- allowed overwrite iommu off workarounds for specific chipsets.
- soft Use software bounce buffering (default for Intel machines)
- noaperture Don't touch the aperture for AGP.
- allowdac Allow DMA >4GB
- nodac Forbid DMA >4GB
- panic Force panic when IOMMU overflows
-*/
+/*
+ * See <Documentation/x86_64/boot-options.txt> for the iommu kernel parameter
+ * documentation.
+ */
__init int iommu_setup(char *p)
{
iommu_merge = 1;
--- linux-2.6.19/arch/x86_64/Kconfig.original 2006-12-14 11:37:35.832142506 +0100
+++ linux-2.6.19/arch/x86_64/Kconfig 2006-12-14 11:47:24.346056710 +0100
@@ -431,8 +431,8 @@
on systems with more than 3GB. This is usually needed for USB,
sound, many IDE/SATA chipsets and some other devices.
Provides a driver for the AMD Athlon64/Opteron/Turion/Sempron GART
- based IOMMU and a software bounce buffer based IOMMU used on Intel
- systems and as fallback.
+ based hardware IOMMU and a software bounce buffer based IOMMU used
+ on Intel systems and as fallback.
The code is only active when needed (enough memory and limited
device) unless CONFIG_IOMMU_DEBUG or iommu=force is specified
too.
@@ -458,6 +458,11 @@
# need this always selected by IOMMU for the VIA workaround
config SWIOTLB
bool
+ help
+ Support for a software bounce buffer based IOMMU used on Intel
+ systems which don't have a hardware IOMMU. Using this code
+ PCI devices with 32bit memory access only are able to be
+ used on systems with more than 3 GB.
config X86_MCE
bool "Machine check support" if EMBEDDED
--- linux-2.6.19/Documentation/x86_64/boot-options.txt.original 2006-12-14 11:11:32.099300994 +0100
+++ linux-2.6.19/Documentation/x86_64/boot-options.txt 2006-12-14 12:10:24.028009890 +0100
@@ -180,35 +180,66 @@
pci=lastbus=NUMBER Scan upto NUMBER busses, no matter what the mptable says.
pci=noacpi Don't use ACPI to set up PCI interrupt routing.
-IOMMU
+IOMMU (input/output memory management unit)
+
+ Currently four x86_64 PCI DMA mapping implementations exist:
+
+ 1. <arch/x86_64/kernel/pci-nommu.c>: use no hardware/software IOMMU at all
+ (e.g. because you have < 3 GB memory).
+ Kernel boot message: "PCI-DMA: Disabling IOMMU"
+
+ 2. <arch/x86_64/kernel/pci-gart.c>: AMD GART based hardware IOMMU.
+ Kernel boot message: "PCI-DMA: using GART IOMMU"
+
+ 3. <arch/x86_64/kernel/pci-swiotlb.c> : Software IOMMU implementation. Used
+ e.g. if there is no hardware IOMMU in the system and it is needed because
+ you have >3GB memory or told the kernel to use it (iommu=soft)
+ Kernel boot message: "PCI-DMA: Using software bounce buffering
+ for IO (SWIOTLB)"
+
+ 4. <arch/x86_64/kernel/pci-calgary.c> : IBM Calgary hardware IOMMU. Used in IBM
+ pSeries and xSeries servers. This hardware IOMMU supports DMA address
+ mapping with memory protection, etc.
+ Kernel boot message: "PCI-DMA: Using Calgary IOMMU"
iommu=[size][,noagp][,off][,force][,noforce][,leak][,memaper[=order]][,merge]
[,forcesac][,fullflush][,nomerge][,noaperture]
- size set size of iommu (in bytes)
- noagp don't initialize the AGP driver and use full aperture.
- off don't use the IOMMU
- leak turn on simple iommu leak tracing (only when CONFIG_IOMMU_LEAK is on)
- memaper[=order] allocate an own aperture over RAM with size 32MB^order.
- noforce don't force IOMMU usage. Default.
- force Force IOMMU.
- merge Do SG merging. Implies force (experimental)
- nomerge Don't do SG merging.
- forcesac For SAC mode for masks <40bits (experimental)
- fullflush Flush IOMMU on each allocation (default)
- nofullflush Don't use IOMMU fullflush
- allowed overwrite iommu off workarounds for specific chipsets.
- soft Use software bounce buffering (default for Intel machines)
- noaperture Don't touch the aperture for AGP.
- allowdac Allow DMA >4GB
- When off all DMA over >4GB is forced through an IOMMU or bounce
- buffering.
- nodac Forbid DMA >4GB
- panic Always panic when IOMMU overflows
+ size set size of IOMMU (in bytes)
+ noagp don't initialize the AGP driver and use full aperture.
+ off don't initialize and use any kind of IOMMU.
+ leak turn on simple iommu leak tracing (only when
+ CONFIG_IOMMU_LEAK is on)
+ memaper[=order] allocate an own aperture over RAM with size 32MB<<order.
+ (default: order=1, i.e. 64MB)
+ noforce don't force hardware IOMMU usage when it is not needed.
+ (default).
+ force Force the use of the hardware IOMMU even when it is
+ not actually needed (e.g. because < 3 GB memory).
+ merge Do scatter-gather (SG) merging. Implies force (experimental)
+ nomerge Don't do scatter-gather (SG) merging.
+ forcesac For SAC mode for masks <40bits (experimental)
+ fullflush Flush AMD GART based hardware IOMMU on each allocation
+ (default)
+ nofullflush Don't use IOMMU fullflush
+ allowed overwrite iommu off workarounds for specific chipsets.
+ soft Use software bounce buffering (SWIOTLB) (default for Intel
+ machines). This can be used to prevent the usage
+ of an available hardware IOMMU.
+ noaperture Ask the AMD GART based hardware IOMMU driver not to
+ touch the aperture for AGP.
+ allowdac Allow DMA >4GB
+ When off all DMA over >4GB is forced through an IOMMU or
+ bounce buffering.
+ nodac Forbid DMA >4GB
+ panic Always panic when IOMMU overflows
swiotlb=pages[,force]
+ pages Prereserve that many 128K pages for the software IO bounce
+ buffering.
+ force Force all IO through the software TLB.
- pages Prereserve that many 128K pages for the software IO bounce buffering.
- force Force all IO through the software TLB.
+ Settings for the IBM Calgary hardware IOMMU currently found in IBM
+ pSeries and xSeries machines:
calgary=[64k,128k,256k,512k,1M,2M,4M,8M]
calgary=[translate_empty_slots]
--- linux-2.6.19/Documentation/kernel-parameters.txt.original 2006-12-14 11:03:46.584429749 +0100
+++ linux-2.6.19/Documentation/kernel-parameters.txt 2006-12-14 11:11:22.172025378 +0100
@@ -104,6 +104,9 @@
Do not modify the syntax of boot loader parameters without extreme
need or coordination with <Documentation/i386/boot.txt>.
+There are also arch-specific kernel-parameters not documented here.
+See for example <Documentation/x86_64/boot-options.txt>.
+
Note that ALL kernel parameters listed below are CASE SENSITIVE, and that
a trailing = on the name of any parameter states that that parameter will
be entered as an environment variable, whereas its absence indicates that
--
__________________________________________creating IT solutions
Dipl.-Inf. Karsten Weiss science + computing ag
phone: +49 7071 9457 452 Hagellocher Weg 73
teamline: +49 7071 9457 681 72070 Tuebingen, Germany
email: [email protected] http://www.science-computing.de
On Thu, Dec 14, 2006 at 12:38:08PM +0100, Karsten Weiss wrote:
> On Thu, 14 Dec 2006, Muli Ben-Yehuda wrote:
>
> > On Wed, Dec 13, 2006 at 09:34:16PM +0100, Karsten Weiss wrote:
> >
> > > BTW: It would be really great if this area of the kernel would get some
> > > more and better documentation. The information at
> > > linux-2.6/Documentation/x86_64/boot_options.txt is very terse. I had to
> > > read the code to get a *rough* idea what all the "iommu=" options
> > > actually do and how they interact.
> >
> > Patches happily accepted :-)
>
> Well, you asked for it. :-) So here's my little contribution. Please
> *double* *check*!
Looks good, some nits below.
> (BTW: I would like to know what "DAC" and "SAC" mean in this context)
Single / Double Address Cycle. DAC is used with 32-bit PCI to push a
64-bit address in two cycles.
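As a small illustration of the SAC/DAC distinction, the sketch below splits a bus address into the 32-bit words that would be sent, one per address cycle. The function name is hypothetical, and sending the low word first is my assumption about the ordering.

```python
def address_cycles(addr):
    """Split a bus address into the 32-bit words sent per address cycle."""
    if not 0 <= addr < 1 << 64:
        raise ValueError("address must fit in 64 bits")
    low = addr & 0xFFFFFFFF
    high = addr >> 32
    if high == 0:
        return [low]       # SAC: one cycle is enough below 4 GB
    return [low, high]     # DAC: two cycles for an address above 4 GB
```

For example, address_cycles(0x1000) needs a single cycle, while any address at or above 4 GB needs two.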
> @@ -458,6 +458,11 @@
> # need this always selected by IOMMU for the VIA workaround
> config SWIOTLB
> bool
> + help
> + Support for a software bounce buffer based IOMMU used on Intel
> + systems which don't have a hardware IOMMU. Using this code
> + PCI devices with 32bit memory access only are able to be
> + used on systems with more than 3 GB.
I would rephrase as follows: "Support for software bounce buffers used
on x86-64 systems which don't have a hardware IOMMU. Using this PCI
devices which can only access 32-bits of memory can be used on systems
with more than 3 GB of memory".
> + size set size of IOMMU (in bytes)
For historical reasons, some of these options are only valid for
GART. Perhaps mention for each option which IOMMUs it is valid for, or
split them on a per-IOMMU basis?
This one (size) is gart only.
> + noagp don't initialize the AGP driver and use full aperture.
gart only.
> + off don't initialize and use any kind of IOMMU.
all.
> + leak turn on simple iommu leak tracing (only when
> + CONFIG_IOMMU_LEAK is on)
gart only.
> + memaper[=order] allocate an own aperture over RAM with size 32MB<<order.
> + (default: order=1, i.e. 64MB)
gart only.
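A quick sketch of why the patch writes "32MB<<order" rather than "32MB^order": with a shift, order=1 gives the documented 64MB, whereas reading "^" as C's XOR operator gives a nonsensical size. The helper names are made up for this example.

```python
MB = 1024 * 1024


def aperture_bytes(order):
    """Aperture size for memaper=<order>, using the patch's 32MB<<order."""
    return (32 * MB) << order


def misread_as_xor(order):
    # What "32MB^order" would yield if '^' were taken as C's XOR operator.
    return (32 * MB) ^ order
```

aperture_bytes(1) is 64MB, matching the "default: order=1, i.e. 64MB" note, and each further order doubles it.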
> + noforce don't force hardware IOMMU usage when it is not needed.
> + (default).
all.
> + force Force the use of the hardware IOMMU even when it is
> + not actually needed (e.g. because < 3 GB memory).
all.
> + merge Do scatter-gather (SG) merging. Implies force (experimental)
gart only.
> + nomerge Don't do scatter-gather (SG) merging.
gart only.
> + forcesac For SAC mode for masks <40bits (experimental)
gart only.
> + fullflush Flush AMD GART based hardware IOMMU on each allocation
> + (default)
gart only.
> + nofullflush Don't use IOMMU fullflush
gart only.
> + allowed overwrite iommu off workarounds for specific chipsets.
gart only.
> + soft Use software bounce buffering (SWIOTLB) (default for Intel
> + machines). This can be used to prevent the usage
> + of an available hardware IOMMU.
all.
> + noaperture Ask the AMD GART based hardware IOMMU driver not to
> + touch the aperture for AGP.
gart only.
> + allowdac Allow DMA >4GB
> + When off all DMA over >4GB is forced through an IOMMU or
> + bounce buffering.
gart only.
> + nodac Forbid DMA >4GB
gart only.
> + panic Always panic when IOMMU overflows
gart and Calgary.
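The per-option comments above can be collected into a lookup table. This is only a sketch of that grouping: the table and function names are hypothetical, and treating a bare number as the size argument is an assumption based on the option list.

```python
# Scope of each iommu= keyword as given in the review comments above.
IOMMU_OPTION_SCOPE = {
    "off": "all", "noforce": "all", "force": "all", "soft": "all",
    "noagp": "gart", "leak": "gart", "memaper": "gart", "merge": "gart",
    "nomerge": "gart", "forcesac": "gart", "fullflush": "gart",
    "nofullflush": "gart", "allowed": "gart", "noaperture": "gart",
    "allowdac": "gart", "nodac": "gart",
    "panic": "gart, calgary",
}


def scope_of(iommu_arg):
    """Map each keyword of an iommu= value to the scope from the review."""
    result = {}
    for item in iommu_arg.split(","):
        key = item.split("=", 1)[0]
        if key.isdigit():
            # Assumption: a bare number is the size argument (gart only).
            result[item] = "gart"
        else:
            result[item] = IOMMU_OPTION_SCOPE.get(key, "unknown")
    return result
```

For instance, scope_of("soft") reports "all", while scope_of("memaper=2,panic") reports "gart" and "gart, calgary".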
The rest looks good. Please resend and I'll add my Acked-by.
Cheers,
Muli
On Thu, 14 Dec 2006, Muli Ben-Yehuda wrote:
> The rest looks good. Please resend and I'll add my Acked-by.
Thanks a lot for your comments and suggestions. Here's my 2nd try:
===
From: Karsten Weiss <[email protected]>
$ diffstat ~/iommu-patch_v2.patch
Documentation/kernel-parameters.txt | 3
Documentation/x86_64/boot-options.txt | 104 +++++++++++++++++++++++-----------
arch/x86_64/Kconfig | 10 ++-
arch/x86_64/kernel/pci-dma.c | 28 +--------
4 files changed, 87 insertions(+), 58 deletions(-)
Patch description:
- add SWIOTLB config help text
- mention Documentation/x86_64/boot-options.txt in
Documentation/kernel-parameters.txt
- remove the duplication of the iommu kernel parameter documentation.
- Better explanation of some of the iommu kernel parameter options.
- "32MB<<order" instead of "32MB^order".
- Mention the default "order" value.
- list the four existing PCI-DMA mapping implementations of arch x86_64
- group the iommu= option keywords by PCI-DMA mapping implementation.
- Distinguish iommu= option keywords from number arguments.
- Explain the meaning of DAC and SAC.
Signed-off-by: Karsten Weiss <[email protected]>
---
--- linux-2.6.19/arch/x86_64/kernel/pci-dma.c.original 2006-12-14 11:15:38.348598021 +0100
+++ linux-2.6.19/arch/x86_64/kernel/pci-dma.c 2006-12-14 12:14:48.176967312 +0100
@@ -223,30 +223,10 @@
}
EXPORT_SYMBOL(dma_set_mask);
-/* iommu=[size][,noagp][,off][,force][,noforce][,leak][,memaper[=order]][,merge]
- [,forcesac][,fullflush][,nomerge][,biomerge]
- size set size of iommu (in bytes)
- noagp don't initialize the AGP driver and use full aperture.
- off don't use the IOMMU
- leak turn on simple iommu leak tracing (only when CONFIG_IOMMU_LEAK is on)
- memaper[=order] allocate an own aperture over RAM with size 32MB^order.
- noforce don't force IOMMU usage. Default.
- force Force IOMMU.
- merge Do lazy merging. This may improve performance on some block devices.
- Implies force (experimental)
- biomerge Do merging at the BIO layer. This is more efficient than merge,
- but should be only done with very big IOMMUs. Implies merge,force.
- nomerge Don't do SG merging.
- forcesac For SAC mode for masks <40bits (experimental)
- fullflush Flush IOMMU on each allocation (default)
- nofullflush Don't use IOMMU fullflush
- allowed overwrite iommu off workarounds for specific chipsets.
- soft Use software bounce buffering (default for Intel machines)
- noaperture Don't touch the aperture for AGP.
- allowdac Allow DMA >4GB
- nodac Forbid DMA >4GB
- panic Force panic when IOMMU overflows
-*/
+/*
+ * See <Documentation/x86_64/boot-options.txt> for the iommu kernel parameter
+ * documentation.
+ */
__init int iommu_setup(char *p)
{
iommu_merge = 1;
--- linux-2.6.19/arch/x86_64/Kconfig.original 2006-12-14 11:37:35.832142506 +0100
+++ linux-2.6.19/arch/x86_64/Kconfig 2006-12-14 14:01:24.009193996 +0100
@@ -431,8 +431,8 @@
on systems with more than 3GB. This is usually needed for USB,
sound, many IDE/SATA chipsets and some other devices.
Provides a driver for the AMD Athlon64/Opteron/Turion/Sempron GART
- based IOMMU and a software bounce buffer based IOMMU used on Intel
- systems and as fallback.
+ based hardware IOMMU and a software bounce buffer based IOMMU used
+ on Intel systems and as fallback.
The code is only active when needed (enough memory and limited
device) unless CONFIG_IOMMU_DEBUG or iommu=force is specified
too.
@@ -458,6 +458,12 @@
# need this always selected by IOMMU for the VIA workaround
config SWIOTLB
bool
+ help
+ Support for software bounce buffers used on x86-64 systems
+ which don't have a hardware IOMMU (e.g. the current generation
+ of Intel's x86-64 CPUs). Using this PCI devices which can only
+ access 32-bits of memory can be used on systems with more than
+ 3 GB of memory. If unsure, say Y.
config X86_MCE
bool "Machine check support" if EMBEDDED
--- linux-2.6.19/Documentation/x86_64/boot-options.txt.original 2006-12-14 11:11:32.099300994 +0100
+++ linux-2.6.19/Documentation/x86_64/boot-options.txt 2006-12-14 14:14:55.869560532 +0100
@@ -180,39 +180,79 @@
pci=lastbus=NUMBER Scan upto NUMBER busses, no matter what the mptable says.
pci=noacpi Don't use ACPI to set up PCI interrupt routing.
-IOMMU
+IOMMU (input/output memory management unit)
- iommu=[size][,noagp][,off][,force][,noforce][,leak][,memaper[=order]][,merge]
- [,forcesac][,fullflush][,nomerge][,noaperture]
- size set size of iommu (in bytes)
- noagp don't initialize the AGP driver and use full aperture.
- off don't use the IOMMU
- leak turn on simple iommu leak tracing (only when CONFIG_IOMMU_LEAK is on)
- memaper[=order] allocate an own aperture over RAM with size 32MB^order.
- noforce don't force IOMMU usage. Default.
- force Force IOMMU.
- merge Do SG merging. Implies force (experimental)
- nomerge Don't do SG merging.
- forcesac For SAC mode for masks <40bits (experimental)
- fullflush Flush IOMMU on each allocation (default)
- nofullflush Don't use IOMMU fullflush
- allowed overwrite iommu off workarounds for specific chipsets.
- soft Use software bounce buffering (default for Intel machines)
- noaperture Don't touch the aperture for AGP.
- allowdac Allow DMA >4GB
- When off all DMA over >4GB is forced through an IOMMU or bounce
- buffering.
- nodac Forbid DMA >4GB
- panic Always panic when IOMMU overflows
-
- swiotlb=pages[,force]
-
- pages Prereserve that many 128K pages for the software IO bounce buffering.
- force Force all IO through the software TLB.
-
- calgary=[64k,128k,256k,512k,1M,2M,4M,8M]
- calgary=[translate_empty_slots]
- calgary=[disable=<PCI bus number>]
+ Currently four x86-64 PCI-DMA mapping implementations exist:
+
+ 1. <arch/x86_64/kernel/pci-nommu.c>: use no hardware/software IOMMU at all
+ (e.g. because you have < 3 GB memory).
+ Kernel boot message: "PCI-DMA: Disabling IOMMU"
+
+ 2. <arch/x86_64/kernel/pci-gart.c>: AMD GART based hardware IOMMU.
+ Kernel boot message: "PCI-DMA: using GART IOMMU"
+
+ 3. <arch/x86_64/kernel/pci-swiotlb.c> : Software IOMMU implementation. Used
+ e.g. if there is no hardware IOMMU in the system and it is needed because
+ you have >3GB memory or told the kernel to use it (iommu=soft)
+ Kernel boot message: "PCI-DMA: Using software bounce buffering
+ for IO (SWIOTLB)"
+
+ 4. <arch/x86_64/kernel/pci-calgary.c> : IBM Calgary hardware IOMMU. Used in IBM
+ pSeries and xSeries servers. This hardware IOMMU supports DMA address
+ mapping with memory protection, etc.
+ Kernel boot message: "PCI-DMA: Using Calgary IOMMU"
+
+ iommu=[<size>][,noagp][,off][,force][,noforce][,leak[=<nr_of_leak_pages>]
+ [,memaper[=<order>]][,merge][,forcesac][,fullflush][,nomerge][,noaperture]
+
+ General iommu options:
+ off Don't initialize and use any kind of IOMMU.
+ noforce Don't force hardware IOMMU usage when it is not needed.
+ (default).
+ force Force the use of the hardware IOMMU even when it is
+ not actually needed (e.g. because < 3 GB memory).
+ soft Use software bounce buffering (SWIOTLB) (default for
+ Intel machines). This can be used to prevent the usage
+ of an available hardware IOMMU.
+
+ iommu options only relevant to the AMD GART hardware IOMMU:
+ <size> Set the size of the remapping area in bytes.
+ allowed Overwrite iommu off workarounds for specific chipsets.
+ fullflush Flush IOMMU on each allocation (default).
+ nofullflush Don't use IOMMU fullflush.
+ leak Turn on simple iommu leak tracing (only when
+ CONFIG_IOMMU_LEAK is on). Default number of leak pages
+ is 20.
+ memaper[=<order>] Allocate an own aperture over RAM with size 32MB<<order.
+ (default: order=1, i.e. 64MB)
+ merge Do scatter-gather (SG) merging. Implies "force"
+ (experimental).
+ nomerge Don't do scatter-gather (SG) merging.
+ noaperture Ask the IOMMU not to touch the aperture for AGP.
+ forcesac Force single-address cycle (SAC) mode for masks <40bits
+ (experimental).
+ noagp Don't initialize the AGP driver and use full aperture.
+ allowdac Allow double-address cycle (DAC) mode, i.e. DMA >4GB.
+ DAC is used with 32-bit PCI to push a 64-bit address in
+ two cycles. When off all DMA over >4GB is forced through
+ an IOMMU or software bounce buffering.
+ nodac Forbid DAC mode, i.e. DMA >4GB.
+ panic Always panic when IOMMU overflows.
+
+ iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU
+ implementation:
+ swiotlb=<pages>[,force]
+ <pages> Prereserve that many 128K pages for the software IO
+ bounce buffering.
+ force Force all IO through the software TLB.
+
+ Settings for the IBM Calgary hardware IOMMU currently found in IBM
+ pSeries and xSeries machines:
+
+ calgary=[64k,128k,256k,512k,1M,2M,4M,8M]
+ calgary=[translate_empty_slots]
+ calgary=[disable=<PCI bus number>]
+ panic Always panic when IOMMU overflows
64k,...,8M - Set the size of each PCI slot's translation table
when using the Calgary IOMMU. This is the size of the translation
--- linux-2.6.19/Documentation/kernel-parameters.txt.original 2006-12-14 11:03:46.584429749 +0100
+++ linux-2.6.19/Documentation/kernel-parameters.txt 2006-12-14 11:11:22.172025378 +0100
@@ -104,6 +104,9 @@
Do not modify the syntax of boot loader parameters without extreme
need or coordination with <Documentation/i386/boot.txt>.
+There are also arch-specific kernel-parameters not documented here.
+See for example <Documentation/x86_64/boot-options.txt>.
+
Note that ALL kernel parameters listed below are CASE SENSITIVE, and that
a trailing = on the name of any parameter states that that parameter will
be entered as an environment variable, whereas its absence indicates that
--
__________________________________________creating IT solutions
Dipl.-Inf. Karsten Weiss science + computing ag
phone: +49 7071 9457 452 Hagellocher Weg 73
teamline: +49 7071 9457 681 72070 Tuebingen, Germany
email: [email protected] http://www.science-computing.de
On Thu, Dec 14, 2006 at 02:16:31PM +0100, Karsten Weiss wrote:
> On Thu, 14 Dec 2006, Muli Ben-Yehuda wrote:
>
> > The rest looks good. Please resend and I'll add my Acked-by.
>
> Thanks a lot for your comments and suggestions. Here's my 2nd try:
>
> ===
>
> From: Karsten Weiss <[email protected]>
>
> $ diffstat ~/iommu-patch_v2.patch
> Documentation/kernel-parameters.txt | 3
> Documentation/x86_64/boot-options.txt | 104 +++++++++++++++++++++++-----------
> arch/x86_64/Kconfig | 10 ++-
> arch/x86_64/kernel/pci-dma.c | 28 +--------
> 4 files changed, 87 insertions(+), 58 deletions(-)
>
> Patch description:
>
> - add SWIOTLB config help text
> - mention Documentation/x86_64/boot-options.txt in
> Documentation/kernel-parameters.txt
> - remove the duplication of the iommu kernel parameter documentation.
> - Better explanation of some of the iommu kernel parameter options.
> - "32MB<<order" instead of "32MB^order".
> - Mention the default "order" value.
> - list the four existing PCI-DMA mapping implementations of arch x86_64
> - group the iommu= option keywords by PCI-DMA mapping implementation.
> - Distinguish iommu= option keywords from number arguments.
> - Explain the meaning of DAC and SAC.
>
> Signed-off-by: Karsten Weiss <[email protected]>
Acked-by: Muli Ben-Yehuda <[email protected]>
Cheers,
Muli
Muli Ben-Yehuda wrote:
>> 4)
>> And does someone know if the nforce/opteron iommu requires IBM Calgary
>> IOMMU support?
>>
> It doesn't; Calgary isn't found in machines with Opteron CPUs or NForce
> chipsets (AFAIK). However, compiling Calgary in should make no
> difference, as we detect at run time which IOMMU is present in the
> machine.
Yes,.. I read the relevant section shortly after sending that email ;-)
btw & for everybody:
I work (as a student) at the LRZ (Leibniz Computing Centre) in Munich,
where we have a very large Linux cluster and lots of other kinds of
machines,...
I'm going to test for that error on most of the different types of
systems we have,.. and will inform you about my results (if they're
interesting).
Chris.
On Sat, 2006-12-02 at 01:56 +0100, Christoph Anton Mitterer wrote:
> Hi.
>
> Perhaps some of you have read my older two threads:
> http://marc.theaimsgroup.com/?t=116312440000001&r=1&w=2 and the even
> older http://marc.theaimsgroup.com/?t=116291314500001&r=1&w=2
>
> The issue was basically the following:
> I found a severe bug mainly by fortune because it occurs very rarely.
> My test looks like the following: I have about 30GB of testing data on
> my harddisk,... I repeat verifying sha512 sums on these files and check
> if errors occur.
> One test pass verifies the 30GB 50 times,... about one to four
> differences are found in each pass.
This sounds very similar to a corruption issue I was experiencing on my
nforce4 based system. After replacing most of my hardware to no avail, I
discovered that if I increased the voltage for my RAM chips the corruption
went away. Note that I was not overclocking at all.
Worth a try.
Dax Kelson
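The repeated checksum test quoted above (hash a large data set once, then re-verify it many times and count mismatches) could be sketched roughly as follows. This is not the script used in the thread; the function names and the 1MB read size are assumptions made for the example.

```python
import hashlib
import os


def sha512_of(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, read in 1MB chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_files(root):
    """Yield every file below root in a stable order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def verify_repeatedly(root, rounds=50):
    """Hash every file once as a baseline, then re-hash `rounds` times.

    Returns a list of (round_number, path) tuples, one per mismatch.
    """
    baseline = {path: sha512_of(path) for path in list_files(root)}
    mismatches = []
    for round_number in range(1, rounds + 1):
        for path, expected in baseline.items():
            if sha512_of(path) != expected:
                mismatches.append((round_number, path))
    return mismatches


if __name__ == "__main__":
    import sys
    for round_number, path in verify_repeatedly(sys.argv[1]):
        print("round %d: checksum differs: %s" % (round_number, path))
```

Two caveats for anyone trying this: the data set should be larger than RAM (as with the 30GB on a 4GB machine described above), otherwise re-reads are served from the page cache and never touch the disk path; and the baseline pass itself may already have read corrupted data, so a mismatch only says that two reads disagreed.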
[email protected] wrote:
>On Wed, Dec 13, 2006 at 09:11:29PM +0100, Christoph Anton Mitterer wrote:
>
>> - error in the Opteron (memory controller)
>> - error in the Nvidia chipsets
>> - error in the kernel
>
>My guess without further information would be that some, but not all
>BIOSes are doing some work to avoid this.
>
>Does anyone have an amd64 with an nforce4 chipset and >4GB that does
>NOT have this problem? If so it might be worth chasing the BIOS
>vendors to see what errata they are dealing with.
We have a number of Tyan S2891 systems at work, most with 8GB but all at
least 4GB (data corruption still occurs whether 4 or 8GB is installed;
didn't try less than 4GB...). All have 2 of the following CPUs:
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 248
stepping : 1
cpu MHz : 2210.208
cache size : 1024 KB
- the older models have no problem with data corruption,
but fail to boot 2.6.18 and up (exactly like
http://bugzilla.kernel.org/show_bug.cgi?id=7505 )
- the newer models had problems with data corruption (running md5sum
over a large number of files would show differences from run to run).
Sometimes the system would hang (no messages on the serial console,
no magic sysrq, nothing).
These have no problem booting 2.6.18 and up, however.
These were delivered with a 2.02 BIOS version.
On a whim I tried booting with "nosmp noapic", and running on one CPU
the systems seemed stable, no data corruption and no crashes.
- The older models flashed to the latest 2.02 BIOS from the Tyan website
still have no data corruption but still won't boot 2.6.18 and up.
- The newer models flashed (downgraded!) to the 2.01 BIOS available from the Tyan
website seem to work fine, no data corruption while running on both
CPUs and no crashes (although perhaps time is too short to tell for
sure, first one I did was 10 days ago).
- I have an idea that perhaps the 2.02 BIOS the newer systems were
delivered with is a subtly different version than the one on the
website. I may try flashing 2.02 again once the current 2.01 on these
systems has proven to be stable.
- Apparently there's something different on the motherboards from the
first batch and the second batch, otherwise I couldn't explain the
difference in ability to boot 2.6.18 and up. However, I haven't had an
opportunity to open two systems up to compare them visually.
Paul Slootman
Hi my friends....
Things have become a little quiet around this issue... any new ideas or results?
Karsten Weiss wrote:
> BTW: Did someone already open an official bug at
> http://bugzilla.kernel.org ?
Karsten, did you already file a bug?
I described the whole issue to the Debian people, who are about to release
etch, and suggested that they use iommu=soft by default.
This brings me to:
Chris Wedgwood wrote:
> Does anyone have an amd64 with an nforce4 chipset and >4GB that does
> NOT have this problem? If so it might be worth chasing the BIOS
> vendors to see what errata they are dealing with.
John Chaves replied and reported that he doesn't suffer from that
problem (I've CC'ed him on this post).
You can read his message at the bottom of this post.
@ John: Could you please tell us in detail how you've tested your system?
Muli gave us some information about the iommu options (when he
discussed Karsten's patch). Has anybody run tests with the other iommu
options?
Ok, and what does it all come down to? We still don't know the exact
reason...
Perhaps a kernel bug, an Opteron and/or chipset bug, and perhaps there
are even some BIOSes that solve the issue...
As for the kernel-bug possibility: who is the responsible developer for the
relevant code? Can we ask him to read our threads and perhaps review
the code?
Is anyone able (or willing to try) to inform AMD and/or Nvidia about the
issue (perhaps by pointing them to this thread)?
Someone might even try to contact some board vendors (some of us seem to
have Tyan boards). Although I'm in contact with the German support Team
of Tyan, I wasn't very successful with the US team... perhaps they have
other ideas.
Last but not least... if we don't find a solution, what should we do?
In my opinion at least the following:
1) Inform other OS communities (*BSD) and point them to our thread. Some
of you claimed that Windows wouldn't use the hwiommu at all so I think
we don't have to contact big evil.
2) Contact the major Linux distributions (I've already done so for
Debian), inform them about the potential issue and point them to
this thread (where one can find all the relevant information, I think).
3) Workaround for the kernel:
I know too little to say exactly what to do, but I remember there
are other fixes for mainboard flaws and buggy chipsets in the kernel
(e.g. the RZ1000 or something like this in the "old" IDE driver)...
Perhaps someone (who knows what to do ;-) ) could write some code that
automatically uses iommu=soft... but then we have the question: in
which cases? :-( I imagine that the AMD users who don't suffer from this
issue would like to continue using their hwiommus.
What I'm currently planning to do:
1) If no one else is willing to try contacting AMD/Nvidia, I'll try
again.
2) I told you that I'm going to test the whole issue at the Leibniz
Supercomputing Centre where I work as a student...
This is a little bit delayed (organisational problems :-) )
Anyway... I'm not only going to test it on our Linux cluster but also on
some Sun Fires (we have maaaannnnny of them ;-) ). According to my
boss they have nvidia chipsets... (He is probably going to contact Sun
about the issue).
So much for now.
Best wishes,
Chris.
John Chaves' message:
Here's another data point in case it helps.
The following system does *not* have the data corruption issue.
Motherboard: Iwill DK88 <http://www.iwill.net/product_2.asp?p_id=102>
Chipset: NVIDIA nForce4 Professional 2200
CPUs: Two Dual Core AMD Opteron(tm) Processor 280
Memory: 32GB
Disks: Four 500GB SATA in linux RAID1 over RAID0 setup
Kernel: 2.6.18
This system is a workhorse with extreme disk I/O of huge files,
and the nature of the work done would have revealed data
corruption pretty quickly.
FWIW,
John Chaves
His lspci:
0000:00:00.0 Memory controller: nVidia Corporation CK804 Memory
Controller (rev a3)
Flags: bus master, 66MHz, fast devsel, latency 0
Capabilities: [44] #08 [01e0]
Capabilities: [e0] #08 [a801]
0000:00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
Subsystem: nVidia Corporation: Unknown device cb84
Flags: bus master, 66MHz, fast devsel, latency 0
0000:00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
Subsystem: nVidia Corporation: Unknown device cb84
Flags: 66MHz, fast devsel, IRQ 9
I/O ports at d400 [size=32]
I/O ports at 4c00 [size=64]
I/O ports at 4c40 [size=64]
Capabilities: [44] Power Management version 2
0000:00:02.0 USB Controller: nVidia Corporation CK804 USB Controller
(rev a2) (prog-if 10 [OHCI])
Subsystem: nVidia Corporation: Unknown device cb84
Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 209
Memory at feafc000 (32-bit, non-prefetchable) [size=4K]
Capabilities: [44] Power Management version 2
0000:00:02.1 USB Controller: nVidia Corporation CK804 USB Controller
(rev a3) (prog-if 20 [EHCI])
Subsystem: nVidia Corporation: Unknown device cb84
Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 193
Memory at feafdc00 (32-bit, non-prefetchable) [size=256]
Capabilities: [44] #0a [2098]
Capabilities: [80] Power Management version 2
0000:00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev a2)
(prog-if 8a [Master SecP PriP])
Subsystem: nVidia Corporation: Unknown device cb84
Flags: bus master, 66MHz, fast devsel, latency 0
I/O ports at 3000 [size=16]
Capabilities: [44] Power Management version 2
0000:00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA
Controller (rev a3) (prog-if 85 [Master SecO PriO])
Subsystem: nVidia Corporation: Unknown device cb84
Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 193
I/O ports at e800 [size=8]
I/O ports at e400 [size=4]
I/O ports at e000 [size=8]
I/O ports at dc00 [size=4]
I/O ports at d800 [size=16]
Memory at feafe000 (32-bit, non-prefetchable) [size=4K]
Capabilities: [44] Power Management version 2
0000:00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA
Controller (rev a3) (prog-if 85 [Master SecO PriO])
Subsystem: nVidia Corporation: Unknown device cb84
Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 201
I/O ports at fc00 [size=8]
I/O ports at f800 [size=4]
I/O ports at f400 [size=8]
I/O ports at f000 [size=4]
I/O ports at ec00 [size=16]
Memory at feaff000 (32-bit, non-prefetchable) [size=4K]
Capabilities: [44] Power Management version 2
0000:00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
(prog-if 01 [Subtractive decode])
Flags: bus master, 66MHz, fast devsel, latency 0
Bus: primary=00, secondary=05, subordinate=05, sec-latency=128
I/O behind bridge: 0000b000-0000bfff
Memory behind bridge: fc900000-fe9fffff
Prefetchable memory behind bridge: e0000000-e00fffff
0000:00:0b.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
(prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=04, subordinate=04, sec-latency=0
Memory behind bridge: fc800000-fc8fffff
Capabilities: [40] Power Management version 2
Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable-
Capabilities: [58] #08 [a800]
Capabilities: [80] #10 [0141]
0000:00:0c.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
(prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=03, subordinate=03, sec-latency=0
Memory behind bridge: fc700000-fc7fffff
Capabilities: [40] Power Management version 2
Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable-
Capabilities: [58] #08 [a800]
Capabilities: [80] #10 [0141]
0000:00:0d.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
(prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
Capabilities: [40] Power Management version 2
Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable-
Capabilities: [58] #08 [a800]
Capabilities: [80] #10 [0141]
0000:00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
(prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
Capabilities: [40] Power Management version 2
Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable-
Capabilities: [58] #08 [a800]
Capabilities: [80] #10 [0141]
0000:00:18.0 Host bridge: Advanced Micro Devices [AMD] K8
[Athlon64/Opteron] HyperTransport Technology Configuration
Flags: fast devsel
Capabilities: [80] #08 [2101]
Capabilities: [a0] #08 [2101]
Capabilities: [c0] #08 [2101]
0000:00:18.1 Host bridge: Advanced Micro Devices [AMD] K8
[Athlon64/Opteron] Address Map
Flags: fast devsel
0000:00:18.2 Host bridge: Advanced Micro Devices [AMD] K8
[Athlon64/Opteron] DRAM Controller
Flags: fast devsel
0000:00:18.3 Host bridge: Advanced Micro Devices [AMD] K8
[Athlon64/Opteron] Miscellaneous Control
Flags: fast devsel
0000:00:19.0 Host bridge: Advanced Micro Devices [AMD] K8
[Athlon64/Opteron] HyperTransport Technology Configuration
Flags: fast devsel
Capabilities: [80] #08 [2101]
Capabilities: [a0] #08 [2101]
Capabilities: [c0] #08 [2101]
0000:00:19.1 Host bridge: Advanced Micro Devices [AMD] K8
[Athlon64/Opteron] Address Map
Flags: fast devsel
0000:00:19.2 Host bridge: Advanced Micro Devices [AMD] K8
[Athlon64/Opteron] DRAM Controller
Flags: fast devsel
0000:00:19.3 Host bridge: Advanced Micro Devices [AMD] K8
[Athlon64/Opteron] Miscellaneous Control
Flags: fast devsel
0000:03:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721
Gigabit Ethernet PCI Express (rev 11)
Subsystem: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI
Express
Flags: bus master, fast devsel, latency 0, IRQ 185
Memory at fc7f0000 (64-bit, non-prefetchable) [size=64K]
Capabilities: [48] Power Management version 2
Capabilities: [50] Vital Product Data
Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
Capabilities: [d0] #10 [0001]
0000:04:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721
Gigabit Ethernet PCI Express (rev 11)
Subsystem: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI
Express
Flags: bus master, fast devsel, latency 0, IRQ 177
Memory at fc8f0000 (64-bit, non-prefetchable) [size=64K]
Capabilities: [48] Power Management version 2
Capabilities: [50] Vital Product Data
Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
Capabilities: [d0] #10 [0001]
0000:05:07.0 VGA compatible controller: ATI Technologies Inc Rage XL
(rev 27) (prog-if 00 [VGA])
Subsystem: ATI Technologies Inc Rage XL
Flags: bus master, stepping, medium devsel, latency 64, IRQ 10
Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
I/O ports at b800 [size=256]
Memory at fe8ff000 (32-bit, non-prefetchable) [size=4K]
Expansion ROM at e0000000 [disabled] [size=128K]
Capabilities: [5c] Power Management version 2
0000:06:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8132 PCI-X
Bridge (rev 11) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 64
Bus: primary=06, secondary=08, subordinate=08, sec-latency=64
Capabilities: [60] Capabilities: [b8] #08 [8000]
Capabilities: [c0] #08 [0041]
Capabilities: [f4] #08 [a800]
0000:06:01.1 PIC: Advanced Micro Devices [AMD] AMD-8132 PCI-X IOAPIC
(rev 11) (prog-if 10 [IO-APIC])
Subsystem: Advanced Micro Devices [AMD] AMD-8132 PCI-X IOAPIC
Flags: bus master, medium devsel, latency 0
Memory at febfe000 (64-bit, non-prefetchable) [size=4K]
0000:06:02.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8132 PCI-X
Bridge (rev 11) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 64
Bus: primary=06, secondary=07, subordinate=07, sec-latency=64
Capabilities: [60] Capabilities: [b8] #08 [8000]
Capabilities: [c0] #08 [8840]
Capabilities: [f4] #08 [a800]
0000:06:02.1 PIC: Advanced Micro Devices [AMD] AMD-8132 PCI-X IOAPIC
(rev 11) (prog-if 10 [IO-APIC])
Subsystem: Advanced Micro Devices [AMD] AMD-8132 PCI-X IOAPIC
Flags: bus master, medium devsel, latency 0
Memory at febff000 (64-bit, non-prefetchable) [size=4K]
His /proc/cpuinfo:
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 33
model name : Dual Core AMD Opteron(tm) Processor 280
stepping : 2
cpu MHz : 2400.020
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm
3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips : 4802.02
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 33
model name : Dual Core AMD Opteron(tm) Processor 280
stepping : 2
cpu MHz : 2400.020
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm
3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips : 4799.29
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
processor : 2
vendor_id : AuthenticAMD
cpu family : 15
model : 33
model name : Dual Core AMD Opteron(tm) Processor 280
stepping : 2
cpu MHz : 2400.020
cache size : 1024 KB
physical id : 1
siblings : 2
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm
3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips : 4799.36
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
processor : 3
vendor_id : AuthenticAMD
cpu family : 15
model : 33
model name : Dual Core AMD Opteron(tm) Processor 280
stepping : 2
cpu MHz : 2400.020
cache size : 1024 KB
physical id : 1
siblings : 2
core id : 1
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm
3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips : 4799.37
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
On Friday 22 December 2006 20:04, Christoph Anton Mitterer wrote:
> This brings me to:
> Chris Wedgwood wrote:
> > Does anyone have an amd64 with an nforce4 chipset and >4GB that does
> > NOT have this problem? If so it might be worth chasing the BIOS
> > vendors to see what errata they are dealing with.
> John Chaves replied and claimed that he wouldn't suffer from that
> problem (I've CC'ed him to this post).
> You can read his message at the bottom of this post.
> @ John: Could you please tell us in detail how you've tested your system?
I didn't need to run a specific test for this. The normal workload of the
machine approximates a continuous selftest for almost the last year.
Large files (4-12GB is typical) are being continuously packed and unpacked
with gzip and bzip2. Statistical analysis of the datasets is followed by
verification of the data, sometimes using diff, or md5sum, or python
scripts using numarray to mmap 2GB chunks at a time. The machine
often goes for days with a load level of 20+ and 32GB RAM + another 32GB
swap in use. It would be very unlikely for data corruption to go unnoticed.
When I first got the machine I did have some problems with disks being
dropped from the RAID and occasional log messages implicating the IOMMU.
But that was with kernel 2.6.16.?, Kernels since 2.6.17 haven't had any
problem.
John
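A verification step like the one John describes (mapping large files and comparing them piece by piece) could look roughly like the sketch below. It uses the standard-library mmap module instead of numarray, and the function name and the 64MB comparison chunk are assumptions for illustration, not his actual scripts. It returns the byte ranges in which two copies of a file differ, which also helps to see whether corruption is one contiguous block or scattered bytes.

```python
import mmap
import os


def differing_ranges(path_a, path_b, chunk_size=64 * 1024 * 1024):
    """Compare two equally sized files via mmap.

    Returns a list of (start, end) byte offsets, end exclusive,
    covering every run of bytes in which the files differ.
    """
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        raise ValueError("files differ in size")
    ranges = []
    if size == 0:
        return ranges  # an empty file cannot be mapped
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        map_a = mmap.mmap(fa.fileno(), 0, access=mmap.ACCESS_READ)
        map_b = mmap.mmap(fb.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            run_start = None
            for offset in range(0, size, chunk_size):
                block_a = map_a[offset:offset + chunk_size]
                block_b = map_b[offset:offset + chunk_size]
                if block_a == block_b:
                    # Fast path: an identical chunk closes any open run.
                    if run_start is not None:
                        ranges.append((run_start, offset))
                        run_start = None
                    continue
                # Slow path: walk the chunk byte by byte.
                for index, (byte_a, byte_b) in enumerate(zip(block_a, block_b)):
                    position = offset + index
                    if byte_a != byte_b and run_start is None:
                        run_start = position
                    elif byte_a == byte_b and run_start is not None:
                        ranges.append((run_start, position))
                        run_start = None
            if run_start is not None:
                ranges.append((run_start, size))
        finally:
            map_a.close()
            map_b.close()
    return ranges
```

Mapping a whole multi-gigabyte file at once assumes a 64-bit Python; on a 32-bit system the files would have to be mapped in windows instead.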
John A Chaves wrote:
> I didn't need to run a specific test for this. The normal workload of the
> machine approximates a continuous selftest for almost the last year.
>
> Large files (4-12GB is typical) are being continuously packed and unpacked
> with gzip and bzip2. Statistical analysis of the datasets is followed by
> verification of the data, sometimes using diff, or md5sum, or python
> scripts using numarray to mmap 2GB chunks at a time. The machine
> often goes for days with a load level of 20+ and 32GB RAM + another 32GB
> swap in use. It would be very unlikely for data corruption to go unnoticed.
>
> When I first got the machine I did have some problems with disks being
> dropped from the RAID and occasional log messages implicating the IOMMU.
> But that was with kernel 2.6.16.?, Kernels since 2.6.17 haven't had any
> problem.
>
Ah, thanks for that info. As far as I can tell, this "testing
environment" should have found any corruption if there had been any.
So I think we can take this as our first working system where the
issue doesn't occur although we would expect it...
Chris.
Hi everybody.
After my last mails on this issue (btw: anything new in the meantime? I
received no replies...) I wrote again to Nvidia and AMD...
This time with some more success.
Below is the answer from Mr. Friedman to my mail. He says that he wasn't
able to reproduce the problem and asks for a testing system.
Unfortunately I cannot ship my system as this is my only home PC and I
need it for daily work. But perhaps someone else here has a system
(with the error) that he can send to Nvidia...
I cc'ed Mr. Friedman so he'll read your replies.
To Mr. Friedman: What system exactly did you use for your testing?
(Hardware configuration, BIOS settings and so on.) As we've seen before,
it might be possible that some BIOSes correct the problem.
Best wishes,
Chris.
Lonni J Friedman wrote:
> Christoph,
> Thanks for your email. I'm aware of the LKML threads, and have spent
> considerable time attempting to reproduce this problem on one of our
> reference motherboards without success. If you could ship a system
> which reliably reproduces the problem, I'd be happy to investigate further.
>
> Thanks,
> Lonni J Friedman
> NVIDIA Corporation
>
> Christoph Anton Mitterer wrote:
>
>> Hi.
>>
>> First of all: This is only a copy from a thread to nvnews.net
>> (http://www.nvnews.net/vbulletin/showthread.php?t=82909). You probably
>> should read the description there.
>>
>> Please note that this is also a very important issue. It is most likely
>> not only Linux related but a general nforce chipset design flaw, so
>> perhaps you should forward this mail to your engineers too. (Please CC me
>> in all mails).
>>
>> Also note: I'm not one of the normal "end users" with simple problems or
>> damaged hardware. I study computer science and work in one of Europe's
>> largest supercomputing centres (Leibniz Supercomputing Centre).
>> Believe me: I know what I'm talking about... and I have been investigating
>> this issue (with many others) for some weeks now.
>>
>> Please answer either to the specific lkml thread, to the nvnews.net post
>> or directly to me (via email).
>> And I'd be grateful if you could give me the email addresses of your
>> developers or engineers, or even better, forward this email to them and
>> CC me. Of course I'll keep their email addresses absolutely confidential
>> if you wish.
>>
>> Best wishes,
>> Christoph Anton Mitterer.
>> Munich University of Applied Sciences / Department of Mathematics and
>> Computer Science
>> Leibniz Supercomputing Centre / Department for High Performance
>> Computing and Compute Servers
>>
>>
>>
>>
>> Here is the copy:
>> Hi.
>>
>> I've already tried to "resolve" this via the nvidia knowledgebase but
>> either they don't want to know about that issue or there is no one who is
>> competent enough to give information/solutions about it.
>> They finally pointed me to this forum and told me that Linux support would
>> be handled here (they did not realise that this is probably a hardware
>> flaw and not OS related).
>>
>> I must admit that I'm a little bit fed up with Nvidia's policy in such
>> matters and thus I only describe the problem in brief.
>> If there is any competent chipset engineer who reads this, then he might
>> read the main discussion thread (and some spin-off threads) of the issue
>> which takes place at the linux-kernel mailing list (again this is
>> probably not Linux related).
>> You can find the archive here:
>> http://marc.theaimsgroup.com/?t=116502121800001&r=1&w=2
>>
>>
>> Now a short description:
>> -I (and many others) found a data corruption issue that happens on AMD
>> Opteron / Nvidia chipset systems.
>>
>> -What happens: If one reads/writes large amounts of data there are errors.
>> We test this the following way: Create some test data (huge amounts
>> of it), make md5sums of it (or use other hash algorithms), then verify
>> them over and over.
>> The test shows differences (refer to the lkml thread for more information
>> about this). Always in different files (!!!!). It may happen on read AND
>> write access.
>> Note that even for affected users the error occurs rarely (but this is
>> of course still far too often). My personal tests show about the following:
>> Test data: 30GB (of random data), I verify sha512sum 50 times (that is
>> what I call one complete test). So I verify 30*50GB. In one complete
>> test there are about 1-3 files with differences, with about 100
>> corrupted bytes (at least very small amounts of data, far below one MB).
>>
>> -It probably happens with all the nforce chipsets (see the lkml thread
>> where everybody lists his hardware)
>>
>> -The cause is not a single hardware defect (dozens of high quality
>> memory, CPU, PCI bus, HDD bad block scan, PCI parity, ECC, etc. tests
>> showed this, and even with different hardware components the issue
>> remained)
>>
>> -It is probably not an operating system related bug, although Windows
>> won't suffer from it. The reason for that is that Windows is (too
>> stupid) ... I mean unable to use the hardware iommu at all.
>>
>> -It happens with both PATA and SATA disks. To be exact: it may be that
>> this has nothing special to do with hard disks at all.
>> It is probably PCI-DMA related (see lkml for more information and reasons
>> for this thesis).
>>
>> -Only users with much main memory (I don't know the exact value by heart
>> and I'm too lazy to look it up)... say 4GB will suffer from this problem.
>> Why? Only users who need the memory hole mapping and the iommu will
>> suffer from the problem (this is why we think it is chipset related).
>>
>> -We found two "workarounds" but both have big problems:
>> Workaround 1: Disable Memory Hole Mapping in the system BIOS altogether.
>> The issue no longer occurs, BUT you lose a big part of your main memory
>> (depending on the size of the memhole, which itself depends on the PCI
>> devices). In my case I lose 1.5GB of my 4GB. Most users will probably
>> lose 1GB.
>> => unacceptable
>>
>> Workaround 2: As said, Windows won't suffer from the problem because it
>> always uses a software iommu. (btw: the same applies to Intel CPUs
>> with EM64T/Intel 64; these CPUs don't even have a hardware iommu).
>> Linux is able to use the hardware iommu (which of course accelerates the
>> whole system).
>> If you tell the kernel (Linux) to use a software iommu (with the kernel
>> parameter iommu=soft), the issue won't appear.
>> => this is better than workaround 1 but still not really acceptable.
>> Why? There are the following problems:
>>
>> The hardware iommu and systems with such large main memory are widely used
>> in computing centres. Those groups won't give up the hwiommu in
>> general simply because some Opteron (and perhaps Athlon) / Nvidia
>> combinations cause problems.
>> (I can tell this because I work at the Leibniz Supercomputing Centre,..
>> one of the largest in Europe)
>>
>> But as we don't know the exact reason for the issue, we cannot
>> selectively switch the iommu=soft for affected
>> mainboards/chipsets/cpu-steppings/and alike.
>>
>> We'd have to use a kernel wide iommu=soft as a catchall solution.
>> But it is highly unlikely that this is accepted by the Linux community
>> (not to talk about end users like the supercomputing centres) and I
>> don't want to talk about other OS'es.
>>
>>
>> So we (and of course all, and especially professional, customers) need
>> Nvidia's help.
>>
>> Perhaps this might be solvable via BIOS fixes, but of course not by the
>> stupid solution "disable hwiommu via the BIOS".
>> Perhaps the reason is a Linux kernel bug (although this is highly unlikely).
>> Last but not least, perhaps this is an AMD Opteron/Athlon issue (Note: These
>> CPUs have the memory controllers directly integrated) and/or an
>> Nvidia nforce chipset issue.
>>
>> Regards,
>> Chris.
>> *
>> btw: For answers from Nvidia engineers/developers or end-users who
>> suffer from that issue too,... please post it to the lkml thread (see
>> above for the link) and if not possible here.
>> You may even contact me via email ([email protected]) or personal
>> messages.*
>>
>> PS: Please post any other resources/links to threads about this or
>> similar problems.
>>
>
>
>
Christoph Anton Mitterer wrote:
> Hi.
>
> Perhaps some of you have read my older two threads:
> http://marc.theaimsgroup.com/?t=116312440000001&r=1&w=2 and the even
> older http://marc.theaimsgroup.com/?t=116291314500001&r=1&w=2
>
> The issue was basically the following:
> I found a severe bug mainly by fortune because it occurs very rarely.
> My test looks like the following: I have about 30GB of testing data on
> my harddisk,... I repeat verifying sha512 sums on these files and check
> if errors occur.
> One test pass verifies the 30GB 50 times,... about one to four
> differences are found in each pass.
>
> The corrupted data is not one single completely wrong block of data or
> so,.. but if you look at the area of the file where differences are
> found,.. than some bytes are ok,.. some are wrong,.. and so on (seems to
> be randomly).
>
> Also, there seems to be no event that triggers the corruption,.. it
> seems to be randomly, too.
>
> It is really definitely not a harware issue (see my old threads my
> emails to Tyan/Hitachi and my "workaround" below. My system isn't
> overclocked.
>
>
>
> My System:
> Mainboard: Tyan S2895
> Chipsets: Nvidia nforce professional 2200 and 2050 and AMD 8131
> CPU: 2x DualCore Opterons model 275
> RAM: 4GB Kingston Registered/ECC
> Diskdrives: IBM/Hitachi: 1 PATA, 2 SATA
>
>
> The data corruption error occurs on all drives.
>
>
> You might have a look at the emails between me and Tyan and Hitachi,..
> they contain probalby lots of valuable information (especially my
> different tests).
>
>
>
> Some days ago,.. an engineer of Tyan suggested me to boot the kernel
> with mem=3072M.
> When doing this,.. the issue did not occur (I don't want to say it was
> solved. Why? See my last emails to Tyan!)
> Then he suggested me to disable the memory hole mapping in the BIOS,...
> When doing so,.. the error doesn't occur, too.
> But I loose about 2GB RAM,.. and,.. more important,.. I cant believe
> that this is responsible for the whole issue. I don't consider it a
> solution but more a poor workaround which perhaps only by fortune solves
> the issue (Why? See my last eMails to Tyan ;) )
>
>
>
> So I'd like to ask you if you perhaps could read the current information
> in this and previous mails,.. and tell me your opinions.
> It is very likely that a large number of users suffer from this error
> (namely all Nvidia chipset users) but only few (there are some,.. I
> found most of them in the Nvidia forums,.. and they have exactly the
> same issue) identify this as an error because it's so rare.
>
> Perhaps someone has an idea why disabling the memhole mapping solves
> it. I've always thought that memhole mapping just moves some address
> space to higher addresses to avoid the conflict between address space for
> PCI devices and address space for physical memory.
> But this should be just a simple addition and not solve this obviously
> complex error.
If this is related to some problem with using the GART IOMMU with memory
hole remapping enabled, then 2.6.20-rc kernels may avoid this problem on
nForce4 CK804/MCP04 chipsets as far as transfers to/from the SATA
controller are concerned as the sata_nv driver now supports 64-bit DMA
on these chipsets and so no longer requires the IOMMU.
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/
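For readers following the memory hole discussion, here is a back-of-the-envelope sketch of the "simple addition" mentioned in the quoted question. It is only toy arithmetic: the helper name remapped_top and the 3 GiB hole start used in the example are assumptions for illustration, not values read from any BIOS.

```shell
#!/bin/bash
# Toy arithmetic only: where does RAM end up when the BIOS remaps the part
# hidden behind the PCI hole? All values are in MiB and are assumptions.

# remapped_top RAM_MIB HOLE_START_MIB
# Prints the highest physical address (in MiB) occupied by RAM after remapping.
remapped_top() {
    local ram=$1 hole_start=$2 four_gib=4096
    if (( ram <= hole_start )); then
        # All RAM fits below the hole; nothing needs to move.
        echo "$ram"
    else
        # RAM that would overlap the hole is relocated to start at 4 GiB.
        echo $(( four_gib + ram - hole_start ))
    fi
}
```

With 4096 MiB of RAM and an assumed hole starting at 3072 MiB, `remapped_top 4096 3072` prints 5120, i.e. part of the RAM lands above the 4 GiB mark. That is the region a device limited to 32-bit DMA cannot reach without the IOMMU.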
Hi.
Just for your information: I've put the issue into the kernel.org bugzilla.
http://bugzilla.kernel.org/show_bug.cgi?id=7768
Chris.
Hi.
Some days ago I received the following message from "Sunny Days". I
think he did not send it to lkml, so I am forwarding it now:
Sunny Days wrote:
> hello,
>
> i have done some extensive testing on this.
>
> various opterons, always single socket
> various dimms 1 and 2gb modules
> and hitachi+seagate disks with various firmwares and sizes
> but I am getting a different pattern in the corruption.
> My test file was 10 GB.
>
> I have mapped the earliest corruption as low as 10 MB into the written data.
> I have also monitored the address range used by the cp/md5sum process
> under /proc/$PID/maps to see if I could find a pattern, but I was
> unable to.
>
> I also tested ext2 and LVM with similar results, i.e. corruption.
> Later in the week I should get a PCI Promise controller and test on that one.
>
> Things I have not tested are the patch that Linus released 10 days ago
> and reiserfs3/4.
>
> my nvidia chipset was ck804 (a3)
>
> Hope somehow we get to the bottom of this.
>
> Hope this helps
>
>
> By the way, AMD errata that could possibly influence this are
>
> 115, 123 and 156, with the latter being fascinating as the workaround
> suggested is 0x0 page entry.
>
>
Does anyone have any opinions about this? Could you please read the
mentioned errata and tell me what you think?
Best wishes,
Chris.
@ Sunny Days: Thanks for your mail.
Hi everybody.
Sorry again for my late reply...
Robert gave us the following interesting information some days ago:
Robert Hancock wrote:
> If this is related to some problem with using the GART IOMMU with memory
> hole remapping enabled, then 2.6.20-rc kernels may avoid this problem on
> nForce4 CK804/MCP04 chipsets as far as transfers to/from the SATA
> controller are concerned as the sata_nv driver now supports 64-bit DMA
> on these chipsets and so no longer requires the IOMMU.
>
I've just tested it with my "normal" BIOS settings, that is memhole
mapping = hardware, IOMMU = enabled and 64MB, and _without_ (!)
iommu=soft as a kernel parameter.
I only had time for a small test (that is, 3 passes, each with 10
complete sha512sum cycles over about 30GB of data)... but so far, no
corruption has occurred.
It is surely far too early to say that our issue was solved by
2.6.20-rc-something... but I ask all of you who had systems that
suffered from the corruption to run _intensive_ tests with the most
recent rc of 2.6.20 (I've used 2.6.20-rc5) and report your results.
I'll do an extensive test tomorrow.
And of course (!!): Test without using iommu=soft and with memhole
mapping enabled (in the BIOS). (It won't make any sense to check whether
the new kernel solves our problem while still applying one of our two
workarounds.)
Please also note that there might be two completely different data
corruption problems: the one "solved" by iommu=soft and another reported
by Kurtis D. Rader.
I've asked him to clarify this in a post. :-)
Ok,... now if this (the new kernel) really does solve the issue... we
should try to find out what exactly was changed in the code, and whether
it sounds logical that this solved the problem or not.
The new kernel could just make the corruption even more rare.
Best wishes,
Chris.
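For anyone who wants to repeat this kind of test, the repeated-verification loop described above could look roughly like the following. This is a minimal sketch, not the exact script used in the thread: the function name verify_passes and the checksum file layout are assumptions. It only relies on `sha512sum -c` from coreutils.

```shell
#!/bin/bash
# verify_passes SUMFILE CYCLES
# Runs `sha512sum -c` CYCLES times against SUMFILE (created beforehand with
# `sha512sum FILES... > SUMFILE` from the same directory) and prints how many
# of the cycles reported at least one mismatch.
verify_passes() {
    local sumfile=$1 cycles=$2 failed=0 i
    for (( i = 1; i <= cycles; i++ )); do
        # sha512sum -c exits non-zero if any listed file fails verification.
        if ! sha512sum -c "$sumfile" >/dev/null 2>&1; then
            failed=$(( failed + 1 ))
        fi
    done
    echo "$failed"
}
```

A pass as described in the message would then be `verify_passes sums.sha512 10`, run from the directory holding the test data. Note that the data set should be considerably larger than RAM (as the 30GB set is here); otherwise repeated reads may be served from the page cache instead of going through the disk controller again.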
Sorry, as always I've forgotten some things... *g*
Robert Hancock wrote:
> If this is related to some problem with using the GART IOMMU with memory
> hole remapping enabled
What is that GART thing exactly? Is this the hardware IOMMU? I've always
thought GART was something graphics card related,.. but if so,.. how
could this solve our problem (that seems to occur mainly on harddisks)?
> then 2.6.20-rc kernels may avoid this problem on
> nForce4 CK804/MCP04 chipsets as far as transfers to/from the SATA
> controller are concerned
Does this mean that PATA is not related? The corruption appears on PATA
disks too, so why should it only solve the issue for SATA disks? Sounds a
bit strange to me.
> as the sata_nv driver now supports 64-bit DMA
> on these chipsets and so no longer requires the IOMMU.
>
Can you explain this a little bit more please? Is this a drawback (like
a performance decrease)? Like under Windows where they never use the
hardware iommu but always do it via software?
Best wishes,
Chris.
Christoph Anton Mitterer wrote:
> Sorry, as always I've forgot some things... *g*
>
>
> Robert Hancock wrote:
>
>> If this is related to some problem with using the GART IOMMU with memory
>> hole remapping enabled
> What is that GART thing exactly? Is this the hardware IOMMU? I've always
> thought GART was something graphics card related,.. but if so,.. how
> could this solve our problem (that seems to occur mainly on harddisks)?
The GART built into the Athlon 64/Opteron CPUs is normally used for
remapping graphics memory so that an AGP graphics card can see
physically non-contiguous memory as one contiguous region. However,
Linux can also use it as an IOMMU which allows devices which normally
can't access memory above 4GB to see a mapping of that memory that
resides below 4GB. In pre-2.6.20 kernels both the SATA and PATA
controllers on the nForce 4 chipsets can only access memory below 4GB so
transfers to memory above this mark have to go through the IOMMU. In
2.6.20 this limitation is lifted on the nForce4 SATA controllers.
>
>> then 2.6.20-rc kernels may avoid this problem on
>> nForce4 CK804/MCP04 chipsets as far as transfers to/from the SATA
>> controller are concerned
> Does this mean that PATA is no related? The corruption appears on PATA
> disks to, so why should it only solve the issue at SATA disks? Sounds a
> bit strange to me?
The PATA controller will still be using 32-bit DMA and so may also use
the IOMMU, so this problem would not be avoided.
>
>> as the sata_nv driver now supports 64-bit DMA
>> on these chipsets and so no longer requires the IOMMU.
>>
> Can you explain this a little bit more please? Is this a drawback (like
> a performance decrease)? Like under Windows where they never use the
> hardware iommu but always do it via software?
No, it shouldn't cause any performance loss. In previous kernels the
nForce4 SATA controller was controlled using an interface quite similar
to a PATA controller. In 2.6.20 kernels they use a more efficient
interface that NVidia calls ADMA, which in addition to supporting NCQ
also supports DMA without any 4GB limitations, so it can access all
memory directly without requiring IOMMU assistance.
Note that if this corruption problem is, as has been suggested, related
to memory hole remapping and the IOMMU, then this change only prevents
the SATA controller transfers from experiencing this problem. Transfers
on the PATA controller as well as any other devices with 32-bit DMA
limitations might still have problems. As such this really just avoids
the problem, not fixes it.
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/
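Robert's explanation above boils down to a simple address check: a buffer has to be remapped through the IOMMU when the device cannot address its last byte. A hedged sketch of that check follows. The function name needs_remap is made up for illustration, and the real kernel logic lives in the DMA mapping code and is more involved.

```shell
#!/bin/bash
# needs_remap PHYS_ADDR LENGTH [DMA_BITS]
# Succeeds (exit status 0) if the buffer's last byte lies beyond what a device
# with DMA_BITS address bits (default 32) can reach directly.
needs_remap() {
    local addr=$(( $1 )) len=$(( $2 )) bits=${3:-32}
    # Treat 63 bits and up as "can reach everything"; this also avoids
    # overflowing bash's signed 64-bit arithmetic in the shift below.
    (( bits >= 63 )) && return 1
    local limit=$(( (1 << bits) - 1 ))
    (( addr + len - 1 > limit ))
}
```

For example, `needs_remap 0x100000000 4096` succeeds (a page just above 4GB needs the IOMMU for a 32-bit device), while `needs_remap 0x100000000 4096 64` fails, which corresponds to the 64-bit DMA case described for sata_nv in 2.6.20.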
Robert Hancock wrote:
>> What is that GART thing exactly? Is this the hardware IOMMU? I've always
>> thought GART was something graphics card related,.. but if so,.. how
>> could this solve our problem (that seems to occur mainly on harddisks)?
>>
> The GART built into the Athlon 64/Opteron CPUs is normally used for
> remapping graphics memory so that an AGP graphics card can see
> physically non-contiguous memory as one contiguous region. However,
> Linux can also use it as an IOMMU which allows devices which normally
> can't access memory above 4GB to see a mapping of that memory that
> resides below 4GB. In pre-2.6.20 kernels both the SATA and PATA
> controllers on the nForce 4 chipsets can only access memory below 4GB so
> transfers to memory above this mark have to go through the IOMMU. In
> 2.6.20 this limitation is lifted on the nForce4 SATA controllers.
>
Ah, I see. Thanks for that introduction :-)
>> Does this mean that PATA is no related? The corruption appears on PATA
>> disks to, so why should it only solve the issue at SATA disks? Sounds a
>> bit strange to me?
>>
> The PATA controller will still be using 32-bit DMA and so may also use
> the IOMMU, so this problem would not be avoided.
>
>
>> Can you explain this a little bit more please? Is this a drawback (like
>> a performance decrease)? Like under Windows where they never use the
>> hardware iommu but always do it via software?
>>
>
> No, it shouldn't cause any performance loss. In previous kernels the
> nForce4 SATA controller was controlled using an interface quite similar
> to a PATA controller. In 2.6.20 kernels they use a more efficient
> interface that NVidia calls ADMA, which in addition to supporting NCQ
> also supports DMA without any 4GB limitations, so it can access all
> memory directly without requiring IOMMU assistance.
>
> Note that if this corruption problem is, as has been suggested, related
> to memory hole remapping and the IOMMU, then this change only prevents
> the SATA controller transfers from experiencing this problem. Transfers
> on the PATA controller as well as any other devices with 32-bit DMA
> limitations might still have problems. As such this really just avoids
> the problem, not fixes it.
>
Ok,.. that sounds reasonable,.. so the whole thing might (!) actually be
a hardware design error,... but we just don't use that hardware any
longer when accessing devices via sata_nv.
So this doesn't solve our problem with PATA drives or other devices
(although so far we have had no reports of errors with other devices) and
we have to stick with iommu=soft.
If one uses iommu=soft, sata_nv will continue to use the new code for
the ADMA, right?
Best wishes,
Chris.
Christoph Anton Mitterer wrote:
> Ok,.. that sounds reasonable,.. so the whole thing might (!) actually be
> a hardware design error,... but we just don't use that hardware any
> longer when accessing devices via sata_nv.
>
> So this doesn't solve our problem with PATA drives or other devices
> (although we had until now no reports of errors with other devices) and
> we have to stick with iommu=soft.
>
> If one use iommu=soft the sata_nv will continue to use the new code for
> the ADMA, right?
Right, that shouldn't affect it.
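Since several test reports in this thread depend on whether iommu=soft was actually in effect, a quick check of the kernel command line can help when collecting results. A small sketch, assuming the comma-separated option syntax of the x86_64 iommu= parameter; the function name has_iommu_soft is made up for illustration.

```shell
#!/bin/bash
# has_iommu_soft [CMDLINE]
# Succeeds if CMDLINE contains the option "soft" in an iommu= parameter.
# Without an argument, the running kernel's /proc/cmdline is inspected.
has_iommu_soft() {
    local cmdline=${1-$(cat /proc/cmdline)} word words
    read -ra words <<< "$cmdline"
    for word in "${words[@]}"; do
        case $word in
            iommu=*)
                # Surround the option list with commas so "soft" matches
                # only as a whole option, e.g. in iommu=noagp,soft.
                case ",${word#iommu=}," in
                    *,soft,*) return 0 ;;
                esac
                ;;
        esac
    done
    return 1
}
```

Usage on a live system would be `has_iommu_soft && echo "software IOMMU requested"`. This only tells you what was requested at boot, not which DMA path a particular driver ended up using.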
On Tue, Jan 16, 2007 at 08:26:05AM -0600, Robert Hancock wrote:
> >If one use iommu=soft the sata_nv will continue to use the new code
> >for the ADMA, right?
>
> Right, that shouldn't affect it.
right now i'm thinking if we can't figure out which cpu/bios
combinations are safe we might almost be better off doing iommu=soft
for *all* k8 stuff except for those that are whitelisted; though this
seems extremely drastic
it's not clear if this only affects nvidia based chipsets, the nature
of the corruption makes me think it's not an iommu software bug (we
see a few bytes not entire pages corrupted, it's not even clear if
it's entire cachelines trashed) --- perhaps other vendors have more
recent bios errata or maybe it's just that nvidia has sold a lot of
these so they are more visible? (i'm assuming at this point it might
be some kind of cpu errata that some bioses deal with because some
mainboards don't ever seem to see this whilst others do)
in some ways the problem is worse with recent kernels --- because the
ethernet and sata can address over 4GB and don't use the iommu anymore
the problem is going to be *much* harder to hit, but still here
lurking to cause problems for people. with ethernet you'll probably
end up getting the odd trashed tcp frame and dropping it, so those
will go mostly unnoticed, so this is why sata seems to be the easier
way to show it
Chris Wedgwood wrote:
> right now i'm thinking if we can't figure out which cpu/bios
> combinations are safe we might almost be better off doing iommu=soft
> for *all* k8 stuff except for those that are whitelisted; though this
> seems extremely drastic
>
I agree,... it seems drastic, but this is the only really safe solution.
But it seems that none of the responsible developers has read our thread or
the bug report and given an opinion on the issue.
> it's not clear if this only affect nvidia based chipsets, the nature
> of the corruption makes me think it's not an iommu software bug (we
> see a few bytes not entire pages corrupted, it's not even clear if
> it's entire cachelines trashed) --- perhaps other vendors have more
> recent bios errata or maybe it's just that nvidia has sold a lot of
> these so they are more visible? (i'm assuming at this point it might
> be some kind of cpu errata that some bioses deal with because some
> mainboards don't ever seem to see this whilst others do)
>
Well we can hope that Nvidia will find out more (though I'm not too
optimistic).
> in some ways the problem is worse with recent kernels --- because the
> ethernet and sata can address over 4GB and don't use the iommu anymore
> the problem is going to be *much* harder to hit, but still here
> lurking to cause problems for people.
Yes I agree,.. this is a dangerous situation...
But we should not forget about the issue, just because SATA is no
longer affected.
Chris.
On Tuesday 16 January 2007 19:01, Chris Wedgwood wrote:
> On Tue, Jan 16, 2007 at 08:26:05AM -0600, Robert Hancock wrote:
> > >If one use iommu=soft the sata_nv will continue to use the new code
> > >for the ADMA, right?
> >
> > Right, that shouldn't affect it.
>
> right now i'm thinking if we can't figure out which cpu/bios
> combinations are safe we might almost be better off doing iommu=soft
> for *all* k8 stuff except for those that are whitelisted; though this
> seems extremely drastic
>
> it's not clear if this only affect nvidia based chipsets, the nature
> of the corruption makes me think it's not an iommu software bug (we
> see a few bytes not entire pages corrupted, it's not even clear if
> it's entire cachelines trashed) --- perhaps other vendors have more
> recent bios errata or maybe it's just that nvidia has sold a lot of
> these so they are more visible? (i'm assuming at this point it might
> be some kind of cpu errata that some bioses deal with because some
> mainboards don't ever seem to see this whilst others do)
FYI it seems that I was also hit by this bug with qlogic fc card + adaptec
taro raid controller on Thunder K8SRE S2891 mainboard with nvidia chipset on
it.
http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/b8bdbde9721f7d35/45701994c95fe2cf?lnk=st&q=arkadiusz+fibre&rnum=8#45701994c95fe2cf
--
Arkadiusz Miśkiewicz PLD/Linux Team
arekm / maven.pl http://ftp.pld-linux.org/
Arkadiusz Miskiewicz wrote:
> FYI it seems that I was also hit by this bug with qlogic fc card + adaptec
> taro raid controller on Thunder K8SRE S2891 mainboard with nvidia chipset on
> it.
>
> http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/b8bdbde9721f7d35/45701994c95fe2cf?lnk=st&q=arkadiusz+fibre&rnum=8#45701994c95fe2cf
>
I'm aware of your old thread and at least I considered your postings
from it :-)
Anyway, thanks for your information. =)
Chris.
Chris Wedgwood writes:
> right now i'm thinking if we can't figure out which cpu/bios
> combinations are safe we might almost be better off doing iommu=soft
> for *all* k8 stuff except for those that are whitelisted; though this
> seems extremely drastic
Do you (someone) have (maintain) a list of affected systems,
including motherboard type and possibly version, BIOS version and
CPU type? A similar list of unaffected systems with 4GB+ RAM could
be useful, too.
I'm afraid with default iommu=soft it will be a mystery forever.
--
Krzysztof Halasa
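To make such a list easier to compile, the requested details (board, BIOS version, CPU type) can be gathered in one go. The following is a rough sketch that assumes a kernel exposing DMI data under /sys/class/dmi/id; on kernels without it the fields simply come out as "unknown". The function name system_report and its optional ROOT prefix (used only for testing against a fake tree) are hypothetical.

```shell
#!/bin/bash
# system_report [ROOT]
# Prints board, BIOS and CPU identification, one "key: value" per line.
# ROOT defaults to empty, i.e. the live system.
system_report() {
    local root=${1-} dmi field value
    dmi=$root/sys/class/dmi/id
    for field in board_vendor board_name board_version bios_version; do
        value=unknown
        [ -r "$dmi/$field" ] && value=$(< "$dmi/$field")
        printf '%s: %s\n' "$field" "$value"
    done
    # The first "model name" line is enough on a homogeneous SMP box.
    value=$(awk -F': *' '/^model name/ { print $2; exit }' "$root/proc/cpuinfo" 2>/dev/null)
    printf 'cpu: %s\n' "${value:-unknown}"
}
```

Posting the output of `system_report` together with the amount of RAM and whether corruption was observed would give the kind of affected/unaffected list asked for above.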
On Tue, Jan 16, 2007 at 08:52:32PM +0100, Christoph Anton Mitterer wrote:
> I agree,... it seems drastic, but this is the only really secure
> solution.
I'd like to hear from Andi how he feels about this? It seems like a
somewhat drastic solution in some ways given a lot of hardware doesn't
seem to be affected (or maybe in those cases it's just really hard to
hit, I don't know).
> Well we can hope that Nvidia will find out more (though I'm not too
> optimistic).
Ideally someone from AMD needs to look into this, if some mainboards
really never see this problem, then why is that? Is there errata that
some BIOS/mainboard vendors are dealing with that others are not?
> But we should not forget about the issue, just because SATA is not
> longer affected.
Right.
On Tue, Jan 16, 2007 at 09:31:31PM +0100, Krzysztof Halasa wrote:
> Do you (someone) have (maintain) a list of affected systems,
> including motherboard type and possibly version, BIOS version and
> CPU type? A similar list of unaffected systems with 4GB+ RAM could
> be useful, too.
All I know is that some systems hit this and some don't seem to. Why
is not clear.
> I'm afraid with default iommu=soft it will be a mystery forever.
Right, but given windows doesn't use the iommu at all and that a lot
of newer hardware/drivers doesn't need it it might be the safest
option since it clearly has been causing corruption for a number of
people for well over a year now.
On Wednesday 17 January 2007 07:31, Chris Wedgwood wrote:
> On Tue, Jan 16, 2007 at 08:52:32PM +0100, Christoph Anton Mitterer wrote:
> > I agree,... it seems drastic, but this is the only really secure
> > solution.
>
> I'd like to here from Andi how he feels about this? It seems like a
> somewhat drastic solution in some ways given a lot of hardware doesn't
> seem to be affected (or maybe in those cases it's just really hard to
> hit, I don't know).
AMD is looking at the issue. Only Nvidia chipsets seem to be affected,
although there were similar problems on VIA in the past too.
Unless a good workaround comes around soon I'll probably default
to iommu=soft on Nvidia.
-Andi
> I'd like to here from Andi how he feels about this? It seems like a
> somewhat drastic solution in some ways given a lot of hardware doesn't
> seem to be affected (or maybe in those cases it's just really hard to
> hit, I don't know).
>
> > Well we can hope that Nvidia will find out more (though I'm not too
> > optimistic).
>
> Ideally someone from AMD needs to look into this, if some mainboards
> really never see this problem, then why is that? Is there errata that
> some BIOS/mainboard vendors are dealing with that others are not?
NVIDIA and AMD are investigating this issue; we don't know what the
problem is yet.
Chris Wedgwood wrote:
> I'd like to here from Andi how he feels about this? It seems like a
> somewhat drastic solution in some ways given a lot of hardware doesn't
> seem to be affected (or maybe in those cases it's just really hard to
> hit, I don't know).
>
Yes this might be true,.. those who have reported working systems might
just have a configuration where the error happens even more rarely or where
some other event(s) work around it.
>> Well we can hope that Nvidia will find out more (though I'm not too
>> optimistic).
>>
> Ideally someone from AMD needs to look into this, if some mainboards
> really never see this problem, then why is that? Is there errata that
> some BIOS/mainboard vendors are dealing with that others are not?
>
Some time ago I asked here in a post if some of you could try to
contact AMD and/or Nvidia,.. as no one did,... I wrote to them again (to
all forums and email addresses I knew). (You can see the text here:
http://www.nvnews.net/vbulletin/showthread.php?t=82909).
Now Nvidia has replied and it seems (thanks to Mr. Friedman) that they're
actually trying to investigate the issue...
I received one reply from AMD (actually in German, which is strange as I
wrote to their US support)... where they told me they had forwarded
my mail to their Linux engineers... but no reply since then.
Perhaps some of you have some "contacts" and can use them...
Andi Kleen wrote:
> AMD is looking at the issue. Only Nvidia chipsets seem to be affected,
> although there were similar problems on VIA in the past too.
> Unless a good workaround comes around soon I'll probably default
> to iommu=soft on Nvidia.
I've just read the posts about AMD's and NVIDIA's efforts to find the
issue,... but in the meantime this would be the best solution.
And if "we" ever find a true solution,.. we could still deactivate the
iommu=soft setting.
Best wishes,
Chris.
On Wed, 17 Jan 2007, Andi Kleen wrote:
> On Wednesday 17 January 2007 07:31, Chris Wedgwood wrote:
>> On Tue, Jan 16, 2007 at 08:52:32PM +0100, Christoph Anton Mitterer wrote:
>>> I agree,... it seems drastic, but this is the only really secure
>>> solution.
>>
>> I'd like to here from Andi how he feels about this? It seems like a
>> somewhat drastic solution in some ways given a lot of hardware doesn't
>> seem to be affected (or maybe in those cases it's just really hard to
>> hit, I don't know).
>
> AMD is looking at the issue. Only Nvidia chipsets seem to be affected,
> although there were similar problems on VIA in the past too.
> Unless a good workaround comes around soon I'll probably default
> to iommu=soft on Nvidia.
>
> -Andi
We've just verified that configuring the graphics aperture to be
write-combining instead of write-back using an MTRR also solves the
problem. It appears to be a cache incoherency issue in the graphics
aperture.
This script does the trick:
[ -- cut here -- ]
#!/bin/bash

# Read the northbridge offset 0x90 to get the size of the aperture
size=0x$(lspci -xxx -s 0:18.3 | awk '/^90:/ { print $2 }')

# bit 0 indicates the aperture is enabled, bits 1 - 3 indicate the size
if [ $((size & 1)) -eq 0 ] ; then
    echo "GART disabled; exiting"
    exit 0
fi
shft=$(((size >> 1) & 7))
size=$((0x2000000 << shft))

# Read the northbridge offset 0x94 to get the base address of the aperture
# (field 6 of the "90:" row in the lspci dump is the byte at offset 0x94)
base=0x$(lspci -xxx -s 0:18.3 | awk '/^90:/ { print $6 }')
base=$((base << 25))
basehex=$(printf 0x%08x "$base")

printf "IOMMU aperture found at base=0x%08x size=0x%08x (%d KiB)\n" "$base" "$size" $((size/1024))

if grep -q "$basehex" /proc/mtrr ; then
    echo "MTRR already configured for IOMMU aperture; exiting"
    exit 0
fi

echo "Configuring write-combining MTRR for IOMMU aperture"
printf "base=0x%08x size=0x%08x type=write-combining\n" "$base" "$size" >/proc/mtrr
exit 0
[ -- cut here-- ]
Chip
--
Charles M. "Chip" Coldwell
Senior Software Engineer
Red Hat, Inc
978-392-2426
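The register decoding in the script above can be tried without touching any hardware. The following sketch repeats only the arithmetic (enable bit, size order, base in 32 MiB units) as a pure function. The name decode_aperture and the example byte values are assumptions, and like the script it looks only at the low byte of each register.

```shell
#!/bin/bash
# decode_aperture CTL_BYTE BASE_BYTE
# CTL_BYTE:  low byte of northbridge register 0x90
#            (bit 0 = aperture enabled, bits 1-3 = size order)
# BASE_BYTE: low byte of northbridge register 0x94
#            (aperture base address in units of 32 MiB)
# Both arguments are numbers as bash understands them, e.g. 0x03.
decode_aperture() {
    local ctl=$(( $1 )) base=$(( $2 ))
    if (( (ctl & 1) == 0 )); then
        echo "disabled"
        return 0
    fi
    local order=$(( (ctl >> 1) & 7 ))
    # Size is 32 MiB shifted left by the order; base is in 32 MiB (2^25) units.
    printf 'base=0x%08x size=%dMiB\n' $(( base << 25 )) $(( 32 << order ))
}
```

For instance, `decode_aperture 0x03 0x60` prints `base=0xc0000000 size=64MiB`, which would correspond to a 64MB aperture placed at 3 GiB. The example bytes are hypothetical, not taken from a real register dump.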
On Wed, 17 Jan 2007, Chip Coldwell wrote:
> On Wed, 17 Jan 2007, Andi Kleen wrote:
>
>> On Wednesday 17 January 2007 07:31, Chris Wedgwood wrote:
>>> On Tue, Jan 16, 2007 at 08:52:32PM +0100, Christoph Anton Mitterer wrote:
>>>> I agree,... it seems drastic, but this is the only really secure
>>>> solution.
>>>
>>> I'd like to here from Andi how he feels about this? It seems like a
>>> somewhat drastic solution in some ways given a lot of hardware doesn't
>>> seem to be affected (or maybe in those cases it's just really hard to
>>> hit, I don't know).
>>
>> AMD is looking at the issue. Only Nvidia chipsets seem to be affected,
>> although there were similar problems on VIA in the past too.
>> Unless a good workaround comes around soon I'll probably default
>> to iommu=soft on Nvidia.
>>
>
> We've just verified that configuring the graphics aperture to be
> write-combining instead of write-back using an MTRR also solves the
> problem. It appears to be a cache incoherency issue in the graphics
> aperture.
I take it back. Further testing has revealed that this does not solve
the problem.
Chip
> We've just verified that configuring the graphics aperture to be
> write-combining instead of write-back using an MTRR also solves the
> problem. It appears to be a cache incoherency issue in the graphics
> aperture.
Interesting.
Unfortunately it is also not correct. It was intentional to
mark the IOMMU half of the aperture write-back, as opposed
to uncached like the AGP half. Otherwise you get illegal cache attribute
conflicts with the memory that is being remapped, which can also cause
corruption.
The Northbridge guarantees coherency over the aperture, but
only if the caching attributes match.
You would need to change_page_attr() every kernel address that is mapped into
the IOMMU to use an uncached aperture. AGP does this, but the frequency of
mapping for the IOMMU is much higher and it would be prohibitively costly
unfortunately.
In the past we saw corruptions from such conflicts, so this is more
than just theory. I suspect you traded a more easy to trigger corruption with
a more subtle one.
-Andi
Andi Kleen wrote on 22:29 16/01/2007 +0100 :
> AMD is looking at the issue. Only Nvidia chipsets seem to be affected,
> although there were similar problems on VIA in the past too.
> Unless a good workaround comes around soon I'll probably default
> to iommu=soft on Nvidia.
>
> -Andi
Not only has it only been on Nvidia chipsets but we have only seen
reports on the Nvidia CK804 SATA controller. Please write in or add
yourself to the bugzilla entry [1] and tell us which hardware you have
if you get 4kB pagesize corruption and it goes away with "iommu=soft".
thanks
-joachim
[1] http://bugzilla.kernel.org/show_bug.cgi?id=7768
On Wed Jan 17, 2007 at 08:29:53AM +1100, Andi Kleen wrote:
> AMD is looking at the issue. Only Nvidia chipsets seem to be affected,
> although there were similar problems on VIA in the past too.
> Unless a good workaround comes around soon I'll probably default
> to iommu=soft on Nvidia.
I just tried again and while using iommu=soft does avoid the
corruption problem, as with previous kernels with 2.6.20-rc5
using iommu=soft still makes my pcHDTV HD5500 DVB cards not work.
I still have to disable memhole and lose 1 GB. :-(
-Erik
--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--
joachim wrote:
> Not only has it only been on Nvidia chipsets but we have only seen
> reports on the Nvidia CK804 SATA controller. Please write in or add
> yourself to the bugzilla entry [1] and tell us which hardware you have
> if you get 4kB pagesize corruption and it goes away with "iommu=soft".
How do I find out if I get a 4kB pagesize corruption (or is this the
same as "our corruption")?
Chris.
btw: Should we only post the controller, or other hardware details, too?
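One possible way to answer that question, assuming a known-good copy of a corrupted file is still available: list the differing bytes with `cmp -l` and group them by 4 KiB page. The function name diff_pages is made up, and this is only a sketch of the idea, not a tool anyone in the thread used.

```shell
#!/bin/bash
# diff_pages GOOD_FILE BAD_FILE
# Prints one line per 4 KiB page that contains differing bytes:
#   <page index> <number of differing bytes in that page>
diff_pages() {
    # cmp -l prints the 1-based offset of every differing byte; its exit
    # status is non-zero when the files differ, so ignore it here.
    { cmp -l "$1" "$2" 2>/dev/null || true; } |
        awk '{ pages[int(($1 - 1) / 4096)]++ }
             END { for (p in pages) print p, pages[p] }' |
        sort -n
}
```

If the counts are close to 4096 for the affected pages, whole pages were replaced; if only a handful of bytes per page differ, it looks more like the scattered pattern described at the start of the thread.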
Erik Andersen wrote:
> I just tried again and while using iommu=soft does avoid the
> corruption problem, as with previous kernels with 2.6.20-rc5
> using iommu=soft still makes my pcHDTV HD5500 DVB cards not work.
> I still have to disable memhole and lose 1 GB. :-(
Please add this to the bugreport
(http://bugzilla.kernel.org/show_bug.cgi?id=7768)
Chris.
On Thu, Jan 18, 2007 at 04:00:28AM -0700, Erik Andersen wrote:
> I just tried again and while using iommu=soft does avoid the
> corruption problem, as with previous kernels with 2.6.20-rc5 using
> iommu=soft still makes my pcHDTV HD5500 DVB cards not work.
i would file a separate bug about that, presumably it won't work in
intel based machines too if the driver has dma api bugs
On Thu, Jan 18, 2007 at 10:29:14AM +0100, joachim wrote:
> Not only has it only been on Nvidia chipsets but we have only seen
> reports on the Nvidia CK804 SATA controller.
People have reported problems with other controllers. I have one here
I can test given a day or so.
I don't think it's SATA related, it just happens that it shows up well
there, for networking you would end up with the odd corrupted packet
probably and end up just dropping those so it might not be noticeable.
On Thu, 18 Jan 2007, Andi Kleen wrote:
>
> The Northbridge guarantees coherency over the aperture, but
> only if the caching attributes match.
That's interesting. Makes sense, I suppose.
> You would need to change_page_attr() every kernel address that is mapped into
> the IOMMU to use an uncached aperture. AGP does this, but the frequency of
> mapping for the IOMMU is much higher and it would be prohibitively costly
> unfortunately.
But it still might be a reasonable thing to do to test the theory that
the problem is cache coherency across the graphics aperture, even if
it isn't a long-term solution for the problem.
> In the past we saw corruptions from such conflicts, so this is more
> than just theory. I suspect you traded a more easy to trigger
> corruption with a more subtle one.
Yup. That was the inspiration for the script.
Chip
On Friday 19 January 2007 08:57, Chip Coldwell wrote:
> But it still might be a reasonable thing to do to test the theory that
> the problem is cache coherency across the graphics aperture, even if
> it isn't a long-term solution for the problem.
I suspect it would disturb timing so badly that it might hide the original
problem. If that is true then adding udelays might hide it too.
Ok I guess you could test with a UP kernel. There change_page_attr
should be much cheaper because it doesn't need to IPI other CPUs. Also use
a 2.6.20-rc* kernel that uses CLFLUSH in there, not WBINVD, which is also
very costly.
Anyways I guess we can just wait for what the hardware people figure out.
-Andi
On Thursday 18 January 2007 22:00, Erik Andersen wrote:
> I just tried again and while using iommu=soft does avoid the
> corruption problem, as with previous kernels with 2.6.20-rc5
> using iommu=soft still makes my pcHDTV HD5500 DVB cards not work.
This must be some separate bug and needs to be fixed anyways.
-Andi
On Wed, 17 Jan 2007, Andi Kleen wrote:
> On Wednesday 17 January 2007 07:31, Chris Wedgwood wrote:
> > On Tue, Jan 16, 2007 at 08:52:32PM +0100, Christoph Anton Mitterer wrote:
> > > I agree,... it seems drastic, but this is the only really secure
> > > solution.
> >
> > I'd like to here from Andi how he feels about this? It seems like a
> > somewhat drastic solution in some ways given a lot of hardware doesn't
> > seem to be affected (or maybe in those cases it's just really hard to
> > hit, I don't know).
>
> AMD is looking at the issue. Only Nvidia chipsets seem to be affected,
> although there were similar problems on VIA in the past too.
> Unless a good workaround comes around soon I'll probably default
> to iommu=soft on Nvidia.
We (Sun, AMD, Nvidia and Red Hat) have been testing a patch that seems
to solve the problem. AMD and Nvidia analyzed an HDT trace that
seemed to indicate that CPU updates of the GATT were still in cache
when a subsequent table walk caused by a device load used a stale GATT
PTE. That analysis inspired this patch, submitted to this list as an
RFC. It is not obvious (to me, at least) why this problem has only
shown up on Nvidia SATA controllers.
We are continuing to investigate.
diff --git a/arch/x86_64/kernel/pci-gart.c b/arch/x86_64/kernel/pci-gart.c
index 030eb37..1dd461a 100644
--- a/arch/x86_64/kernel/pci-gart.c
+++ b/arch/x86_64/kernel/pci-gart.c
@@ -69,6 +69,8 @@ static u32 gart_unmapped_entry;
 #define AGPEXTERN
 #endif
 
+#define GATT_CLFLUSH(i) asm volatile ("clflush (%0)" :: "r" (iommu_gatt_base + (i)))
+
 /* backdoor interface to AGP driver */
 AGPEXTERN int agp_memory_reserved;
 AGPEXTERN __u32 *agp_gatt_table;
@@ -221,6 +223,7 @@ static dma_addr_t dma_map_area(struct device *dev, dma_addr_t phys_mem,
 	for (i = 0; i < npages; i++) {
 		iommu_gatt_base[iommu_page + i] = GPTE_ENCODE(phys_mem);
 		SET_LEAK(iommu_page + i);
+		GATT_CLFLUSH(iommu_page + i);
 		phys_mem += PAGE_SIZE;
 	}
 	return iommu_bus_base + iommu_page*PAGE_SIZE + (phys_mem & ~PAGE_MASK);
@@ -348,6 +351,7 @@ static int __dma_map_cont(struct scatterlist *sg, int start, int stopat,
 	while (pages--) {
 		iommu_gatt_base[iommu_page] = GPTE_ENCODE(addr);
 		SET_LEAK(iommu_page);
+		GATT_CLFLUSH(iommu_page);
 		addr += PAGE_SIZE;
 		iommu_page++;
 	}
Chip
--
Charles M. "Chip" Coldwell
Senior Software Engineer
Red Hat, Inc
978-392-2426
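To make the stale-PTE hypothesis from the message above easier to picture, here is a deliberately simplified toy model in shell. It is not kernel code and does not touch any hardware: "ram" stands for what a device's table walk would see, "cache" for CPU writes that have not been written back yet, and all names are invented for this illustration.

```shell
#!/bin/bash
# Toy model of the hypothesis only. Indexes are GATT slots, values are the
# page each slot points at.
ram=()
cache=()

# The CPU updates a GATT entry; in this model the write lands in the cache only.
cpu_write() {
    cache[$1]=$2
}

# Write a cached entry back to RAM, loosely analogous to the CLFLUSH in the patch.
flush() {
    if [ -n "${cache[$1]+set}" ]; then
        ram[$1]=${cache[$1]}
        unset "cache[$1]"
    fi
}

# The device's table walk reads RAM and never sees the CPU cache in this model.
device_read() {
    echo "${ram[$1]:-unmapped}"
}
```

In this model, after `ram[5]=old_page; cpu_write 5 new_page`, a `device_read 5` still returns old_page until `flush 5` is called, after which it returns new_page. Real hardware coherency rules are more subtle than this, so treat it purely as a picture of the ordering problem the patch is trying to close.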