2007-12-18 17:28:52

by Matthew Bloch

Subject: Testing RAM from userspace / question about memmap= arguments

Hi - I'm trying to come up with a way of thoroughly testing every byte
of RAM from within Linux on amd64 (so that it can be automated better
than using memtest86+), and came up with an idea which I'm not sure is
supported or practical.

The obvious problem with testing memory from user space is that you
can't mlock all of it, so the best you can do is about three quarters,
and hope that the rest of the memory is okay.

In order to test all of the memory, I'd like to run the user-space
memtester over two boots of the kernel.

Say we have a 1024MB machine, the first boot I'd not specify any
arguments and assume the kernel would start at the bottom of physical
memory and work its way up, so that the kernel & working userspace would
live at the bottom, and the rest would be testable from userspace.

On the second boot, could I then specify:

memmap=exact memmap=512M@512M memmap=512M@0

i.e. such that the kernel's idea of the usable memory started in the
middle of physical RAM, and that's where it would locate itself? That
way, on the second boot, the same test in userspace would definitely
grab the previously inaccessible RAM at the start for testing.

I can see a few potential problems, but since my understanding of the
low-level memory mapping is muddy at best, I won't speculate; I'd just
appreciate any more expert views on whether this does work, or could be
made to work.

Thanks,

--
Matthew
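
The two-boot plan can be sanity-checked on paper before rebooting. Below is
a minimal sketch in Python, assuming the documented memmap=nn[KMG]@ss[KMG]
region syntax (the documented spelling of the first flag is
memmap=exactmap). The helper names parse_memmap and hidden_ranges are made
up for illustration, and only decimal sizes with upper-case suffixes are
handled. It computes which physical ranges a command line declares usable
and which it leaves out; it says nothing about where the kernel actually
places itself, which is the open question here.

```python
import re

_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
# Simplified: decimal numbers and upper-case suffixes only (assumption).
_REGION = re.compile(r"^memmap=(\d+)([KMG]?)@(\d+)([KMG]?)$")


def parse_memmap(cmdline):
    """Return [(start, size), ...] in bytes for each memmap=size@start argument."""
    regions = []
    for arg in cmdline.split():
        m = _REGION.match(arg)
        if m:  # memmap=exactmap and unrelated arguments are skipped
            size = int(m.group(1)) * _UNITS[m.group(2)]
            start = int(m.group(3)) * _UNITS[m.group(4)]
            regions.append((start, size))
    return regions


def hidden_ranges(total, regions):
    """Return the gaps below `total` that no region covers.

    Assumes every region lies below `total`.
    """
    gaps, cursor = [], 0
    for start, size in sorted(regions):
        if start > cursor:
            gaps.append((cursor, start - cursor))
        cursor = max(cursor, start + size)
    if cursor < total:
        gaps.append((cursor, total - cursor))
    return gaps
```

With both 512M regions listed, hidden_ranges() returns an empty list, so
the kernel is told about all of RAM; listing only 512M@512M leaves the
bottom 512M outside the kernel's view.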


2007-12-20 12:34:27

by Jon Masters

Subject: Re: Testing RAM from userspace / question about memmap= arguments

On Tue, 2007-12-18 at 17:06 +0000, Matthew Bloch wrote:

> I can see a few potential problems, but since my understanding of the
> low-level memory mapping is muddy at best, I won't speculate; I'd just
> appreciate any more expert views on whether this does work, or could be
> made to work.

Yo,

I don't think your testing approach is thorough enough. Clearly (knowing
your line of business - as a virtual machine provider), you want to do
pre-production testing as part of your provisioning. Instead of using
mlock() from userspace, I would suggest simply writing a kernel module
that does this for every page of available memory.

You could script it via a minimal userland, containing only busybox,
some form of SSH implementation, whatever.

Jon.

P.S. With the above, you could also know which pages were faulty, and
consequently play with some of the bad RAM patches to exclude faulty
pages from the virtual machines running on a given host... ;-)

2007-12-20 14:38:16

by Matthew Bloch

Subject: Re: Testing RAM from userspace / question about memmap= arguments

Jon Masters wrote:
> On Tue, 2007-12-18 at 17:06 +0000, Matthew Bloch wrote:
>
>> I can see a few potential problems, but since my understanding of the
>> low-level memory mapping is muddy at best, I won't speculate; I'd just
>> appreciate any more expert views on whether this does work, or could be
>> made to work.
>
> Yo,
>
> I don't think your testing approach is thorough enough. Clearly (knowing
> your line of business - as a virtual machine provider), you want to do
> pre-production testing as part of your provisioning. Instead of using
> mlock() from userspace, I would suggest simply writing a kernel module
> that does this for every page of available memory.

Yes, this is to improve the efficiency of server burn-ins. I would
consider a kernel module, but I still wouldn't be able to test the
memory in which the kernel is sitting, which is my problem. I'm not
sure even a kernel module could reliably test the memory in which it is
residing (memtest86+ relocates itself to do this). Also I don't see how
userspace testing is any less thorough than doing it in the kernel; I
just need a creative way of accessing every single page of memory.

I may do some experiments with the memmap args, some bad RAM and
shuffling it between DIMM sockets when I have the time :)

--
Matthew

2007-12-21 12:58:29

by Pavel Machek

Subject: Re: Testing RAM from userspace / question about memmap= arguments

On Tue 2007-12-18 17:06:24, Matthew Bloch wrote:
> Hi - I'm trying to come up with a way of thoroughly testing every byte
> of RAM from within Linux on amd64 (so that it can be automated better
> than using memtest86+), and came up with an idea which I'm not sure is
> supported or practical.
>
> The obvious problem with testing memory from user space is that you
> can't mlock all of it, so the best you can do is about three quarters,
> and hope that the rest of the memory is okay.
>
> In order to test all of the memory, I'd like to run the user-space
> memtester over two boots of the kernel.
>
> Say we have a 1024MB machine, the first boot I'd not specify any
> arguments and assume the kernel would start at the bottom of physical
> memory and work its way up, so that the kernel & working userspace would
> live at the bottom, and the rest would be testable from userspace.
>
> On the second boot, could I then specify:
>
> memmap=exact memmap=512M@512M memmap=512M@0

Actually, with kexec, you can probably do it without a reboot.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
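
A rough sketch of how the kexec step might be scripted, in Python. It
assumes the kexec-tools `kexec` binary and its `-l`, `--initrd=`,
`--append=` and `-e` options. The kernel and initrd paths and the helper
name kexec_commands are placeholders, and the helper only builds the
argument lists rather than running anything.

```python
def kexec_commands(kernel, initrd, cmdline):
    """Build argv lists that load a second kernel and then jump into it."""
    load = ["kexec", "-l", kernel, "--initrd=" + initrd, "--append=" + cmdline]
    execute = ["kexec", "-e"]
    return [load, execute]


if __name__ == "__main__":
    # Placeholder paths; the command line is an example modelled on the
    # memmap= arguments discussed in this thread.
    for argv in kexec_commands("/boot/vmlinuz", "/boot/initrd.img",
                               "memmap=exactmap memmap=512M@512M"):
        print(" ".join(argv))
```

Running the lists for real would need root (for instance through
subprocess.run(argv, check=True)), and `kexec -e` jumps straight into the
loaded kernel without going through a normal shutdown.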

2007-12-22 08:13:11

by Richard D

Subject: RE: Testing RAM from userspace / question about memmap= arguments

Can't you modify the bootmem allocator to test with memtest patterns and then
use kexec (as Pavel suggested) to test the one where kernel was sitting
earlier?

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Pavel Machek
Sent: Friday, December 21, 2007 6:28 PM
To: Matthew Bloch
Cc: [email protected]
Subject: Re: Testing RAM from userspace / question about memmap= arguments

On Tue 2007-12-18 17:06:24, Matthew Bloch wrote:
> Hi - I'm trying to come up with a way of thoroughly testing every byte
> of RAM from within Linux on amd64 (so that it can be automated better
> than using memtest86+), and came up with an idea which I'm not sure is
> supported or practical.
>
> The obvious problem with testing memory from user space is that you
> can't mlock all of it, so the best you can do is about three quarters,
> and hope that the rest of the memory is okay.
>
> In order to test all of the memory, I'd like to run the user-space
> memtester over two boots of the kernel.
>
> Say we have a 1024MB machine, the first boot I'd not specify any
> arguments and assume the kernel would start at the bottom of physical
> memory and work its way up, so that the kernel & working userspace would
> live at the bottom, and the rest would be testable from userspace.
>
> On the second boot, could I then specify:
>
> memmap=exact memmap=512M@512M memmap=512M@0

Actually, with kexec, you can probably do it without a reboot.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures)
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-12-22 13:46:32

by Pavel Machek

Subject: Re: Testing RAM from userspace / question about memmap= arguments

On Sat 2007-12-22 13:42:47, Richard D wrote:
> Can't you modify the bootmem allocator to test with memtest patterns and then
> use kexec (as Pavel suggested) to test the one where kernel was sitting
> earlier?


I do not think you need to modify anything in kernel. Just use
/dev/mem to test areas that kernel doesn't see, then kexec into place
you already tested, and test the rest.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
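
A minimal sketch of the /dev/mem half of this idea, in Python. The names
map_physical and pattern_test and the pattern list are illustrative
assumptions, not anything taken from memtester or memtest86+.
map_physical() needs root, a page-aligned start address, and a kernel that
permits /dev/mem access to the range; pattern_test() works on any writable
buffer, so it can be tried on ordinary memory first. The sketch does
nothing about CPU caches.

```python
import mmap
import os

PATTERNS = (0x00, 0xFF, 0x55, 0xAA)


def map_physical(start, length):
    """Map `length` bytes of physical memory at `start` through /dev/mem.

    O_SYNC is a request for an uncached mapping; whether it is honoured
    depends on the architecture and kernel (assumption).
    """
    fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
    try:
        return mmap.mmap(fd, length, flags=mmap.MAP_SHARED,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE, offset=start)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed


def pattern_test(buf, chunk=4096):
    """Fill `buf` with each pattern, read it back, return offsets of bad chunks."""
    bad = []
    for value in PATTERNS:
        expected = bytes([value]) * chunk
        # Write the whole buffer first, then verify in a separate pass.
        for off in range(0, len(buf), chunk):
            n = min(chunk, len(buf) - off)
            buf[off:off + n] = expected[:n]
        for off in range(0, len(buf), chunk):
            n = min(chunk, len(buf) - off)
            if buf[off:off + n] != expected[:n]:
                bad.append(off)
    return sorted(set(bad))
```

On a machine booted so that the kernel ignores the upper 512M, a call
might look like pattern_test(map_physical(512 << 20, 512 << 20)); an empty
list means no chunk failed these four patterns.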

2007-12-22 15:37:11

by Richard D

Subject: RE: Testing RAM from userspace / question about memmap= arguments

I was thinking that by the time userspace is ready, the memory that can be
tested will be less.

-----Original Message-----
From: Pavel Machek [mailto:[email protected]]
Sent: Saturday, December 22, 2007 7:16 PM
To: Richard D
Cc: 'Matthew Bloch'; [email protected]
Subject: Re: Testing RAM from userspace / question about memmap= arguments

On Sat 2007-12-22 13:42:47, Richard D wrote:
> Can't you modify the bootmem allocator to test with memtest patterns and then
> use kexec (as Pavel suggested) to test the one where kernel was sitting
> earlier?


I do not think you need to modify anything in kernel. Just use
/dev/mem to test areas that kernel doesn't see, then kexec into place
you already tested, and test the rest.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures)
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-12-22 16:06:31

by David Newall

Subject: Re: Testing RAM from userspace / question about memmap= arguments

Pavel Machek wrote:
> On Sat 2007-12-22 13:42:47, Richard D wrote:
>
>> Can't you modify the bootmem allocator to test with memtest patterns and then
>> use kexec (as Pavel suggested) to test the one where kernel was sitting
>> earlier?
>>
>
>
> I do not think you need to modify anything in kernel. Just use
> /dev/mem to test areas that kernel doesn't see, then kexec into place
> you already tested, and test the rest.
>

That's still an insufficient test. One failure mode is writes at one
location corrupting cells at another.

The idea of wanting to do comprehensive and robust memory testing from
within the operating system seems dubious at best, to me. If there is
something wrong with memtest86, doing the tests from within Linux is not
the answer. The answer is to fix memtest86. If the problem is that you
need automation, e.g. switching a server from production to memory test mode
at midnight and back again at 6am, the answer is still to "fix"
memtest86. Writing something that grabs some physical RAM from Linux's
control, tests it, and then moves the kernel itself so that it can test
the rest, is adding a whole extra layer of complexity to an already
challenging (I assume, based on errors that dedicated software-based
testers miss) problem.

Give up on this misguided idea and build on the best tools that are
already available.
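
The failure mode mentioned above (a write at one location corrupting
another) is the kind of thing an "own address" pass is meant to catch. A
toy sketch in Python follows; AliasedMemory is a made-up model of RAM with
one stuck address line, not a claim about how real faults behave or how
memtest86 implements its tests.

```python
class AliasedMemory:
    """Toy model of RAM with one stuck-low address line (illustration only)."""

    def __init__(self, words, stuck_bit=None):
        self.cells = [0] * words
        # With a stuck bit, two different addresses select the same cell.
        self.mask = ~(1 << stuck_bit) if stuck_bit is not None else ~0

    def __len__(self):
        return len(self.cells)

    def __getitem__(self, addr):
        return self.cells[addr & self.mask]

    def __setitem__(self, addr, value):
        self.cells[addr & self.mask] = value


def own_address_test(mem):
    """Write each word's index into it, then verify every word in a second pass."""
    for addr in range(len(mem)):
        mem[addr] = addr
    return [addr for addr in range(len(mem)) if mem[addr] != addr]
```

A test that reads each word back immediately after writing it would pass
on the aliased model; separating the write pass from the verify pass is
what exposes the overwritten cells.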

2007-12-22 18:44:25

by Pavel Machek

Subject: Re: Testing RAM from userspace / question about memmap= arguments

On Sat 2007-12-22 21:00:11, Richard D wrote:
> I was thinking that by the time userspace is ready, the memory that can be
> tested will be less.

Which does not matter when you can test the rest using second
(kexec-ed) kernel, right?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-12-22 18:49:23

by Pavel Machek

Subject: Re: Testing RAM from userspace / question about memmap= arguments

On Sun 2007-12-23 02:36:14, David Newall wrote:
> Pavel Machek wrote:
>> On Sat 2007-12-22 13:42:47, Richard D wrote:
>>
>>> Can't you modify the bootmem allocator to test with memtest patterns and then
>>> use kexec (as Pavel suggested) to test the one where kernel was sitting
>>> earlier?
>>
>> I do not think you need to modify anything in kernel. Just use
>> /dev/mem to test areas that kernel doesn't see, then kexec into place
>> you already tested, and test the rest.
>
> That's still an insufficient test. One failure mode is writes at one
> location corrupting cells at another.
>
> The idea of wanting to do comprehensive and robust memory testing from
> within the operating system seems dubious at best, to me. If there is
> something wrong with memtest86, doing the tests from within Linux is not
> the answer. The answer is to fix memtest86. If the problem is that you
> need automation, e.g. switching a server from production to memory test mode at
> midnight and back again at 6am, the answer is still to "fix" memtest86.
> Writing something that grabs some physical RAM from Linux's control, tests
> it, and then moves the kernel itself so that it can test the rest, is
> adding a whole extra layer of complexity to an already challenging (I
> assume, based on errors that dedicated software-based testers miss)
> problem.

Well, we have kexec. We already have a way for the kernel to relocate itself.

> Give up on this misguided idea and build on the best tools that are already
> available.

Yes, the idea is "interesting". I do not think it quite cuts
"misguided" part.

memtest has following problems:

0) it is kind of hard to run memtest over ssh

1) if linux fixes some problem with PCI quirk or microcode
upload, memtest will not see the fix

2) if memory only fails while something else happens (DMA to
other piece of memory? Hard disk load glitching power
supply?), memtest will not see the problem.

(Of course, memtest-under-linux has some problems too. Like "if it
freezes, was it bad memory or kernel problem").
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-12-22 20:11:17

by Arjan van de Ven

Subject: Re: Testing RAM from userspace / question about memmap= arguments

On Tue, 18 Dec 2007 17:06:24 +0000
Matthew Bloch <[email protected]> wrote:

> Hi - I'm trying to come up with a way of thoroughly testing every byte
> of RAM from within Linux on amd64 (so that it can be automated better
> than using memtest86+), and came up with an idea which I'm not sure is
> supported or practical.
>
> The obvious problem with testing memory from user space is that you
> can't mlock all of it, so the best you can do is about three quarters,
> and hope that the rest of the memory is okay.

well... to be honest the more obvious problem will be that you won't be testing the RAM, you'll be testing the CPU's cache.. over and over again.

memtest86+ does various magic to basically bypass the caches (by disabling them ;-)...
Doing that in a live kernel situation, and from userspace to boot...... that's... an issue.

2007-12-22 20:37:32

by David Newall

Subject: Re: Testing RAM from userspace / question about memmap= arguments

Pavel Machek wrote:
> memtest has following problems:
>
> 0) it is kind of hard to run memtest over ssh
>

It's kind of hard to run anything over SSH if it has to be run before
userspace is up. But the kernel can collect results from a modified
memtest, after it chains back.

> 1) if linux fixes some problem with PCI quirk or microcode
> upload, memtest will not see the fix
>

What are you saying? Linux is going to fix faulty RAM? The point with
testing RAM is you *want* to see it fail; you don't want Linux to fix it.

> 2) if memory only fails while something else happens (DMA to
> other piece of memory? Hard disk load glitching power
> supply?), memtest will not see the problem.

These are not RAM faults. The very last thing you want is evidence that
you've got a faulty piece of RAM when the fault is actually a hard disk
glitch!

2007-12-22 20:49:01

by Matthew Bloch

Subject: Re: Testing RAM from userspace / question about memmap= arguments

David Newall wrote:
> Pavel Machek wrote:
>> On Sat 2007-12-22 13:42:47, Richard D wrote:
>>
>>> Can't you modify the bootmem allocator to test with memtest patterns and
>>> then
>>> use kexec (as Pavel suggested) to test the one where kernel was sitting
>>> earlier?
>>>
>>
>>
>> I do not think you need to modify anything in kernel. Just use
>> /dev/mem to test areas that kernel doesn't see, then kexec into place
>> you already tested, and test the rest.
>>
>
> That's still an insufficient test. One failure mode is writes at one
> location corrupting cells at another.
>
> The idea of wanting to do comprehensive and robust memory testing from
> within the operating system seems dubious at best, to me.

Well if we're trying to be thorough, either way is flawed - you can't
possibly test pathologically-misbehaving memory from code running from
inside of it, you'd want some kind of non-uniform memory arrangement to
do that properly. memtest86's value is that it at least *tries* to work
in this environment by dynamically relocating itself, but its memory
testing algorithms aren't the hard bit. Also I'm not necessarily
interested in *which* section of which DIMM is faulty, just a yes or no
is enough so I can send the faulty ones back to the shop.

I don't agree that adding a network stack to memtest86's bare kernel is
going to be easier than working out how to get Linux to do the same job,
with its luxurious programming environment. I can already automate
memtest via serial consoles, power cycling, network booting and so on
but it's ugly.

I will report back in the new year when I've had a chance to play with
our collection of dodgy hardware.

--
Matthew

2007-12-22 20:55:49

by Pavel Machek

Subject: Re: Testing RAM from userspace / question about memmap= arguments

On Sun 2007-12-23 07:06:58, David Newall wrote:
> Pavel Machek wrote:
>> memtest has following problems:
>>
>> 0) it is kind of hard to run memtest over ssh
>>
>
> It's kind of hard to run anything over SSH if it has to be run before
> userspace is up. But the kernel can collect results from a modified
> memtest, after it chains back.

memtest can be run from userspace, that's the point.

>> 1) if linux fixes some problem with PCI quirk or microcode
>> upload, memtest will not see the fix
>>
>
> What are you saying? Linux is going to fix faulty RAM?

Yes, that's what CPU microcode update is for. And I want to test my
RAM with up-to-date microcode.

>> 2) if memory only fails while something else happens (DMA to
>> other piece of memory? Hard disk load glitching power
>> supply?), memtest will not see the problem.
>
> These are not RAM faults. The very last thing you want is evidence that
> you've got a faulty piece of RAM when the fault is actually a hard disk
> glitch!

No, it may be power supply leading to RAM problems. Yes, I want to
detect that.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-12-23 07:36:16

by David Newall

Subject: Re: Testing RAM from userspace / question about memmap= arguments

Pavel Machek wrote:
> On Sun 2007-12-23 07:06:58, David Newall wrote:
>
>> It's kind of hard to run anything over SSH if it has to be run before
>> userspace is up. But the kernel can collect results from a modified
>> memtest, after it chains back.
>>
>
> memtest can be run from userspace, that's the point.
>

I'm not sure I believe that. You need to tinker with hardware tables
before you know what physical RAM is being used. Sequential virtual
pages might be mapped to sequential physical RAM, but it might also be
mapped pseudo-randomly, or even page-reverse-sequential! How can you do
a basic walking bit test when you could be accessing pages in random order?

>>> 1) if linux fixes some problem with PCI quirk or microcode
>>> upload, memtest will not see the fix
>>>
>>>
>> What are you saying? Linux is going to fix faulty RAM?
>>
>
> Yes, that's what CPU microcode update is for. And I want to test my
> RAM with up-to-date microcode.
>

Don't microcode updates fix CPU bugs? That's not fixing faulty RAM. If
base microcode is so faulty as to make RAM access unreliable, the CPU
probably won't even POST, let alone boot the kernel and start a whole
bunch of userspace stuff, before it can get around to checking to see if
there is new microcode for that CPU and download it.

I suppose a CPU retains microcode updates, once loaded, until power-down
or some hard reboot that you surely can avoid. If it does happen that
you have an update that works around something unrelated to the CPU, for
example maybe interaction with a bridge, then you can update the CPU
before running memtest. Once loaded it's there until power down.

>> These are not RAM faults. The very last thing you want is evidence that
>> you've got a faulty piece of RAM when the fault is actually a hard disk
>> glitch!
>>
>
> No, it may be power supply leading to RAM problems. Yes, I want to
> detect that.

I'm sure you don't mean that. I'm sure you don't want a faulty power
supply to look like faulty RAM. No amount of replacing pieces of memory
is going to solve a faulty power supply. At worst you'll hit on a
combination of pieces that pass the test ... and then the system will
fail, mysteriously, in production. I'm certain you don't want that.

Anyhow, good luck with your idea. I think it's crazy, and that you're
doomed to failure. Doomed! I tell you.

2007-12-23 11:19:18

by Pavel Machek

Subject: Re: Testing RAM from userspace / question about memmap= arguments

On Sun 2007-12-23 18:05:59, David Newall wrote:
> Pavel Machek wrote:
>> On Sun 2007-12-23 07:06:58, David Newall wrote:
>>
>>> It's kind of hard to run anything over SSH if it has to be run before
>>> userspace is up. But the kernel can collect results from a modified
>>> memtest, after it chains back.
>>>
>>
>> memtest can be run from userspace, that's the point.
>>
>
> I'm not sure I believe that. You need to tinker with hardware tables
> before you know what physical RAM is being used. Sequential virtual

No, I can just use /dev/mem. (After passing mem=XXX exactmap to kernel
so that I know what I may play with).

>> Yes, that's what CPU microcode update is for. And I want to test my
>> RAM with up-to-date microcode.
>>
>
> Don't microcode updates fix CPU bugs? That's not fixing faulty RAM.

L1/L2 cache is part of memory subsystem.

> I suppose a CPU retains microcode updates, once loaded, until power-down or
> some hard reboot that you surely can avoid. If it does happen that
> you

If CPU retains microcode after reset, then you are right. I'm not
sure.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-12-25 23:12:13

by Pavel Machek

Subject: Re: Testing RAM from userspace / question about memmap= arguments

On Sat 2007-12-22 12:09:59, Arjan van de Ven wrote:
> On Tue, 18 Dec 2007 17:06:24 +0000
> Matthew Bloch <[email protected]> wrote:
>
> > Hi - I'm trying to come up with a way of thoroughly testing every byte
> > of RAM from within Linux on amd64 (so that it can be automated better
> > than using memtest86+), and came up with an idea which I'm not sure is
> > supported or practical.
> >
> > The obvious problem with testing memory from user space is that you
> > can't mlock all of it, so the best you can do is about three quarters,
> > and hope that the rest of the memory is okay.
>
> well... to be honest the more obvious problem will be that you won't be testing the RAM, you'll be testing the CPU's cache.. over and over again.
>
> memtest86+ does various magic to basically bypass the caches (by disabling them ;-)...
> Doing that in a live kernel situation, and from userspace to boot...... that's... an issue.

Are you sure? I always assumed that memtest just used patterns bigger
than L1/L2 caches... ... and IIRC my celeron testing confirmed it, if
I disabled L2 cache in BIOS, memtest behaved differently.

Anyway, if you can do iopl(), we may as well let you disable caches,
but you are right, that will need a kernel patch.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-12-26 16:47:47

by Arjan van de Ven

Subject: Re: Testing RAM from userspace / question about memmap= arguments

On Wed, 26 Dec 2007 00:09:57 +0100
Pavel Machek <[email protected]> wrote:

> On Sat 2007-12-22 12:09:59, Arjan van de Ven wrote:
> > On Tue, 18 Dec 2007 17:06:24 +0000

> > memtest86+ does various magic to basically bypass the caches (by
> > disabling them ;-)... Doing that in a live kernel situation, and
> > from userspace to boot...... that's... an issue.
>
> Are you sure? I always assumed that memtest just used patterns bigger
> than L1/L2 caches...

that's... not nearly usable or enough. Caches are relatively smart
about things like use-once.... and they're huge. 12MB today. You'd need
patterns bigger than 100MB to get even close to being reasonably
confident that there's nothing left.

> ... and IIRC my celeron testing confirmed it, if
> I disabled L2 cache in BIOS, memtest behaved differently.
>
> Anyway, if you can do iopl(), we may as well let you disable caches,
> but you are right, that will need a kernel patch.

and a new syscall of some sorts I suspect; "flush all caches" is a ring
0 operation (and you probably need to do it in an ipi anyway on all
cpus)

--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2007-12-26 20:38:55

by Maxim Levitsky

Subject: Re: Testing RAM from userspace / question about memmap= arguments

On Wednesday 26 December 2007 12:17:56 Arjan van de Ven wrote:
> On Wed, 26 Dec 2007 00:09:57 +0100
> Pavel Machek <[email protected]> wrote:
>
> > On Sat 2007-12-22 12:09:59, Arjan van de Ven wrote:
> > > On Tue, 18 Dec 2007 17:06:24 +0000
>
> > > memtest86+ does various magic to basically bypass the caches (by
> > > disabling them ;-)... Doing that in a live kernel situation, and
> > > from userspace to boot...... that's... an issue.
> >
> > Are you sure? I always assumed that memtest just used patterns bigger
> > than L1/L2 caches...
>
> that's... not nearly usable or enough. Caches are relatively smart
> about things like use-once.... and they're huge. 12MB today. You'd need
> patterns bigger than 100MB to get even close to being reasonably
> confident that there's nothing left.
>
> > ... and IIRC my celeron testing confirmed it, if
> > I disabled L2 cache in BIOS, memtest behaved differently.
> >
> > Anyway, if you can do iopl(), we may as well let you disable caches,
> > but you are right, that will need a kernel patch.
>
> and a new syscall of some sorts I suspect; "flush all caches" is a ring
> 0 operation (and you probably need to do it in an ipi anyway on all
> cpus)
>

I think that PAT support will help a lot.
How about opening/mmapping /dev/mem and setting the uncacheable attribute there?
Actually it is even possible today with MTRRs.

Regards,
Maxim Levitsky
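
A small sketch of what the MTRR route might look like, in Python. It
assumes the text interface of /proc/mtrr, where a line of the form
"base=... size=... type=..." requests a new range and the type is spelled
"uncachable". The helper name mtrr_uncachable_line is hypothetical, and
the power-of-two and alignment checks reflect the usual variable-range
MTRR constraints rather than anything stated in this thread. The helper
only formats the request; it does not write it.

```python
def mtrr_uncachable_line(base, size):
    """Format a /proc/mtrr request marking [base, base + size) uncachable."""
    if size <= 0 or size & (size - 1):
        raise ValueError("MTRR size must be a power of two")
    if base % size:
        raise ValueError("MTRR base must be aligned to its size")
    return "base=0x%x size=0x%x type=uncachable\n" % (base, size)


if __name__ == "__main__":
    # Example: the upper 512M of a 1024M machine, as in the original question.
    print(mtrr_uncachable_line(512 << 20, 512 << 20), end="")
```

Writing the resulting line to /proc/mtrr as root would be the actual
request; whether it coexists with the ranges the firmware already set up
has to be checked on the machine in question.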