2003-02-08 20:08:58

by Corey Minyard

Subject: Re: Kexec, DMA, and SMP

Eric W. Biederman wrote:

>Corey Minyard <[email protected]> writes:
>
>
>>Eric,
>>
>>I saw that you are working on kexec. I'm using and have hacked on a similar
>>piece of software named bootimg (and I'll be glad when yours is done and ready
>>and we can jettison bootimg). From the looks of the code, it looks like you
>>have seen bootimg, too. I looked through your patch, and I noticed a few
>>things. Hopefully it's the newest version of the patch.
>>
>>First was that you don't turn off DMA bus masters. There seemed to be some
>>discussion of this on lkml, but I didn't see anything in the patch for it. We
>>are actually having problems with bootimg and DMA bus master devices, so the
>>problem is real. And turning off DMA bus mastering for everything on the PCI bus
>>didn't seem to work; Ken Sumrall tried it, and at least the device in question
>>(a bcm5700) seemed to ignore the bit. We are looking at adding an ioctl or a
>>notifier list that will allow devices to register non-blocking calls to shut off
>>DMA. Is anything like that under consideration, or are you thinking of using
>>the reboot notifier for this, or what?
>>
>>
>
>The reboot notifier + device->shutdown(), are called. As you have noted
>the problem is not as easy as clearing the bus master bit, so I leave it
>up to the device driver. The device driver is responsible for placing
>the device into a quiescent state.
>
>Generally that code is present in the driver somewhere already, as it
>is needed for the rmmod case.
>
>In addition going through the normal shutdown path, downing interfaces
>etc, usually helps as well.
>
>So for when kexec is not used in a panic case this is easy.
>
The panic case is actually the most interesting for us. We are using
bootimg with the MCL coredump to take a kernel core to memory and pick
it up on the next boot. You cannot call most shutdown() functions from
a panic, since they will block.
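
To make the notifier-list idea concrete, here is a minimal sketch, in Python purely for illustration. Every name in it (DmaQuiesceChain, register, run_on_panic) is hypothetical and comes from neither kexec nor bootimg; the only point is that each registered callback must return without sleeping, and that one failing driver must not stop the others from being called.

```python
# Hypothetical sketch of a "quiesce DMA" notifier list for the panic path.
# None of these names exist in kexec or bootimg; they are illustrative only.
from typing import Callable, List, Tuple

QuiesceFn = Callable[[], bool]  # must not block; returns True on success


class DmaQuiesceChain:
    def __init__(self) -> None:
        self._handlers: List[Tuple[int, str, QuiesceFn]] = []

    def register(self, name: str, fn: QuiesceFn, priority: int = 0) -> None:
        # Higher priority runs first; equal priorities keep registration order.
        self._handlers.append((priority, name, fn))
        self._handlers.sort(key=lambda h: -h[0])

    def run_on_panic(self) -> List[str]:
        """Call every handler and return the names of those that failed."""
        failed = []
        for _prio, name, fn in self._handlers:
            try:
                ok = fn()
            except Exception:
                ok = False  # a broken driver must not stop the remaining ones
            if not ok:
                failed.append(name)
        return failed
```

A driver would register its handler at probe time, and the panic path would call run_on_panic() once before jumping to the new kernel.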

>>The patch doesn't make sure it is running on processor zero for SMP machines.
>>You must do this on x86 machines, the kernel assumes it comes up on processor
>>zero. I assume this is true for other machines, too.
>>
>>
>
>I have a secondary patch, kexec-hwfixes, that does this. I need to review
>it a little closer and make certain the code is clean enough to go into
>the general purpose kernel. But I do have the code.
>
I have code that does this for bootimg, too, if you are interested, and
it has received extensive testing.

>>Hopefully I'm not looking at an old version of the patch, but these are
>>important things you need to handle.
>>
>>
>
>Yep. I am a little scatterbrained on the maintenance side but I am handling
>them all.
>
>If you are after the kexec on panic case that is much more
>interesting because it is quite possible we cannot afford to call some
>of those functions. But I am quite willing to discuss and work with
>people on what is really going on.
>
>For any more conversation though can we please cc linux-kernel?
>
>I like to keep things public so I don't have to answer the same
>question too many times.
>
No problem. As you have requested, lkml is copied.

Thanks,

-Corey


2003-02-09 18:30:12

by Eric W. Biederman

Subject: Re: Kexec, DMA, and SMP

Corey Minyard <[email protected]> writes:

> The panic case is actually the most interesting for us. We are using bootimg
> with the MCL coredump to take a kernel core to memory and pick it up on the next
> boot.

[snip]

With respect to DMA and SMP handling for kexec on panic that case is
much trickier. A lot of the normal methods simply don't apply because
by definition in a panic something is broken, and that something may
be the code we need to cleanly shut down the hardware. But I am not
ready to sacrifice a method that works well in a properly working
kernel just because the panic case can't use it.

In getting it working I suggest we start with the easy cases, where
DMA and SMP are not big issues. And then we can have a working
framework.

I am still digesting the crash dump code I have seen, but as far as I
can tell what it does is to compress the contents of memory, for
writing out later.

To handle the hard cases for kexec on panic I would recommend the
following.

- Place the recovery code in a reserved area of memory that the normal
kernel will not touch, and actually run the code there. This
trivially solves the DMA problem because the hardware is not DMA'ing there.

- Setup the kernel that does the recovery so that the pool of memory
it uses for dynamic allocations is also in the reserved area of
memory so that it is equally free of DMA dangers.

- Modify the kernel that does the recovery so it can be run at
different physical address from the standard kernel, so it will not
need to be moved out of the reserved area of memory.

- Modify the kernel that does the recovery to not care about
which cpu in a SMP system it comes up on first.

- Modify the kernel that does the recovery so that it is very robust
in reinitializing devices. So it can cope with devices in a random
state. Though most devices can be handled by simply ignoring them.

- Possibly preserve in the reserved area a separate copy of the tables
ACPI/MP/etc that the kernel needs for coming up. I actually don't
think this needs to happen as the kernel preserves those in place
already.

At that point I believe a full memory core dump can be achieved
without needing to do anything except to jump to the other kernel
on panic. All of the memory can be preserved because the kexec case
would not have touched it.

I find this very attractive because it can be done with a very low
impact on the primary kernel whose panic we want to capture, plus it
is an extremely robust solution.
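
As a rough illustration of why a reserved area sidesteps DMA from devices that were simply never shut down, here is a small Python model. The Region and PrimaryAllocator names are invented for this sketch. The assumption is that the primary kernel's allocator never hands out pages from the reserved range, so no correctly programmed DMA buffer can land there. It says nothing about a device that has been given a bad address.

```python
# Toy model: the primary kernel never allocates from the reserved region,
# so DMA buffers it hands to devices cannot overlap the recovery code.
# Region and PrimaryAllocator are illustrative names, not kernel interfaces.
from dataclasses import dataclass

PAGE_SIZE = 4096


@dataclass(frozen=True)
class Region:
    start: int  # physical byte address, inclusive
    end: int    # physical byte address, exclusive

    def overlaps(self, other: "Region") -> bool:
        return self.start < other.end and other.start < self.end


class PrimaryAllocator:
    """Page allocator of the primary kernel; it never sees reserved pages."""

    def __init__(self, ram: Region, reserved: Region) -> None:
        self.reserved = reserved
        self._free = [
            addr
            for addr in range(ram.start, ram.end, PAGE_SIZE)
            if not Region(addr, addr + PAGE_SIZE).overlaps(reserved)
        ]

    def pages_available(self) -> int:
        return len(self._free)

    def alloc_dma_buffer(self) -> Region:
        addr = self._free.pop()
        return Region(addr, addr + PAGE_SIZE)
```

The cost is visible in the same model: every page placed in the reserved range is a page the primary kernel can never use.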

The one piece I don't know about is how to prioritize which pieces of
memory are written out first. It is certainly a desirable feature
but do we need that, if we can preserve everything? Or is it easy
enough to get the prioritizing information that we don't care?

Eric

2003-02-10 10:59:03

by Suparna Bhattacharya

Subject: Re: Kexec on 2.5.59 problems ?

I am using the OSDL versions of the kexec patches for
2.5.59 (plm 1442 and 1444) for lkcd-kexec based crash dump
work. So far I had only been trying the cases where
machine_kexec was being invoked directly from (safe)
panics, which worked, i.e. it could successfully kexec and
save dumps generated via artificially induced panics on
a system that's not doing very much (not considering harder
cases for the moment).

Surprisingly though, when I tried just a simple
kexec -e today (having loaded the kernel earlier on),
I ran into the following Oops, consistently:

I'm using kexec-tools-1.8, and this has worked for me
earlier. The test system is a 4way SMP machine.

Has anyone seen this as well ? (I'd already issued init 1
and unmounted filesystems by this point)

sh-2.05a# /sbin/kexec -e
Synchronizing SCSI caches:
Shutting down devices
Starting new kernel
Unable to handle kernel paging request at virtual address 361ae000
printing eip:
c011470a
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0060:[<c011470a>] Not tainted
EFLAGS: 00010003
EIP is at machine_kexec+0x14a/0x190
eax: 00000097 ebx: f7742260 ecx: 00000025 edx: 361ac000
esi: c0114750 edi: 361ae000 ebp: f7365e94 esp: f7365e80
ds: 007b es: 007b ss: 0068

Process kexec (pid: 1685, threadinfo=f7364000 task=f6290060)
Stack: 361ae000 361ac000 f7742260 f7364000 00000000 f7365fbc
c0126903 f7742260 c02a71af c03a9aa8 00000001 00000000 f7fe1640
f7793ec0 c1b3b120 f7364000 00000001 f7365edc c014dbef f7fe1668
f7fe1668 00000286 f7ff51e0 f7365efc

Call Trace:
[<c0126903>] sys_reboot+0x363/0x400
[<c014dbef>] invalidate_inode_buffers+0xf/0x90
[<c01633b0>] clear_inode+0x10/0xb0
[<c0238276>] sock_destroy_inode+0x16/0x20
[<c016149e>] dput+0x1e/0x170 i
[<c014cb56>] __fput+0x116/0x140 i
[<c014b38f>] filp_close+0xcf/0xe0 i
[<c014b43e>] sys_close+0x9e/0xd0 i
[<c01091c7>] syscall_call+0x7/0xb i

Code: f3 a5 a8 02 74 02 66 a5 a8 01 74 01 a4 e8 84 fe ff ff 6a 00
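
For readers decoding the trace: on i386 the number printed after "Oops:" is the page-fault error code. Below is a minimal sketch of reading its low three bits, in Python purely for illustration. The function name is made up here and is not a kernel interface.

```python
# Decode the low three bits of an i386 page-fault error code, as printed
# after "Oops:" in the trace. The function name is illustrative only.
def decode_i386_page_fault(error_code: int) -> dict:
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "page_present": bool(error_code & 0x1),
        # bit 1: 0 = read access, 1 = write access
        "write": bool(error_code & 0x2),
        # bit 2: 0 = kernel mode, 1 = user mode
        "user_mode": bool(error_code & 0x4),
    }


oops_fields = decode_i386_page_fault(int("0002", 16))
```

Read this way, 0002 is a kernel-mode write to an address with no mapping. That is consistent with "*pde = 00000000" and with the faulting address 361ae000 being the value in edi, the destination register of the string copy (f3 a5) at the start of the Code: line.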

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Labs, India


2003-02-10 11:57:42

by Suparna Bhattacharya

Subject: Re: Kexec, DMA, and SMP

On Sun, Feb 09, 2003 at 11:39:27AM -0700, Eric W. Biederman wrote:
> Corey Minyard <[email protected]> writes:
>
> With respect to DMA and SMP handling for kexec on panic that case is
> much trickier. A lot of the normal methods simply don't apply because
> by definition in a panic something is broken, and that something may
> be the code we need to cleanly shut down the hardware. But I am not
> ready to sacrifice a method that works well in a properly working
> kernel just because the panic case can't use it.
>
> In getting it working I suggest we start with the easy cases, where
> DMA and SMP are not big issues. And then we can have a working
> framework.

I'd agree. That was also the idea behind the patch we'd just posted
for LKCD. With a basic working framework in hand that works for
simpler cases, we can now keep working on addressing more and harder
situations bit by bit.

>
> I am still digesting the crash dump code I have seen, but as far as I
> can tell what it does is to compress the contents of memory, for
> writing out later.

Yes. It actually saves a formatted compressed dump in memory,
and later writes it out to disk as is.

>
> To handle the hard cases for kexec on panic I would recommend the
> following.
>
> - Place the recovery code in a reserved area of memory that the normal
> kernel will not touch, and actually run the code there. This
> trivially solves the DMA problem because the hardware is not DMA'ing
>
> - Setup the kernel that does the recovery so that the pool of memory
> it uses for dynamic allocations is also in the reserved area of
> memory so that it is equally free of DMA dangers.
>
> - Modify the kernel that does the recovery so it can be run at
> different physical address from the standard kernel, so it will not
> need to be moved out of the reserved area of memory.

Are you trying to address the possibility that DMA is overwriting
memory we are using in the recovery code, due to a runaway driver
or other code passing a wrong memory address to a device (e.g. in
a corrupted command area) ? I'm wondering if just reserving
an area of memory would help. As long as the address is visible/
accessible by the device (i.e. unless we have the h/w support to
apply protection at that level), can we really be safe in those
weird or rare cases ? Disabling the bus-master sounds like a
more dependable option for that (via device shutdown or reboot
notifiers as suitable) if it can be done.

Placing the recovery code in a safe reserved area (that the
running kernel may not know about or may be protected),
may reduce the possibility of the panic/buggy kernel overwriting
it, but will it help the DMA case ?

>
> - Modify the kernel that does the recovery to not care about
> which cpu in a SMP system it comes up on first.
>
> - Modify the kernel that does the recovery so that it is very robust
> in reinitializing devices. So it can cope with devices in a random
> state. Though most devices can be handled by simply ignoring them.
>
> - Possibly preserve in the reserved area a separate copy of the tables
> ACPI/MP/etc that the kernel needs for coming up. I actually don't
> think this needs to happen as the kernel preserves those in place
> already.
>
> At that point I believe a full memory core dump can be achieved
> without needing to do anything except to jump to the other kernel
> on panic. All of the memory can be preserved because the kexec case
> would not have touched it.
>
> I find this very attractive because it can be done with a very low
> impact on the primary kernel whose panic we want to capture, plus it
> is an extremely robust solution.
>
> The one piece I don't know about is how to prioritize which pieces of
> memory are written out first. It is certainly a desirable feature
> but do we need that, if we can preserve everything? Or is it easy
> enough to get the prioritizing information that we don't care.

LKCD has support for doing that - it provides for specifying dump
levels, to dump just a header, kernel pages, all in-use pages and
full-memory. This can be tuned/extended for some more intermediate
levels (e.g. header + stack traces for all cpus).

There is also some work-in-progress code for more granular
dump customisation as a future item.

While the patch I'd posted has been designed so that ideally
it should be possible to preserve everything, I'm still not
certain if the compression we get is good enough for all cases
(e.g. a heavily loaded system with lots of non-redundant data)
-- we really need to play around with the implementation and
tune it. Secondly, for a large memory system, it could take a
bit of time to compress all pages, and we might just want to
dump potentially more relevant data (e.g. kernel pages) for
some kinds of problems. It was easy enough to do this with some
simple heuristics like dumping in-use pages that are not on the LRU.
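
The dump levels and the non-LRU heuristic can be sketched as follows, in Python for illustration only. The enum members and the Page fields are assumptions made for this example and are not LKCD's actual identifiers or data structures.

```python
# Sketch of dump-level page selection. DumpLevel and Page are hypothetical
# stand-ins; they do not reproduce LKCD's real constants or structures.
from dataclasses import dataclass
from enum import IntEnum


class DumpLevel(IntEnum):
    HEADER = 0  # dump header only
    KERNEL = 1  # header + kernel pages
    IN_USE = 2  # header + all in-use pages
    FULL = 3    # header + every page


@dataclass
class Page:
    pfn: int
    in_use: bool
    on_lru: bool  # treated here as "page cache or user page"


def should_dump(page: Page, level: DumpLevel) -> bool:
    if level == DumpLevel.HEADER:
        return False
    if level == DumpLevel.KERNEL:
        # Heuristic from the text: in-use pages that are not on the LRU.
        return page.in_use and not page.on_lru
    if level == DumpLevel.IN_USE:
        return page.in_use
    return True  # FULL: take everything
```

An intermediate level, such as header plus stack traces for all CPUs, would be one more enum member and one more branch.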

Regards
Suparna


--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Labs, India

2003-02-10 13:46:56

by Corey Minyard

Subject: Re: Kexec, DMA, and SMP

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Suparna Bhattacharya wrote:

|>I am still digesting the crash dump code I have seen, but as far as I
|>can tell what it does is to compress the contents of memory, for
|>writing out later.
|
|
|Yes. It actually saves a formatted compressed dump in memory,
|and later writes it out to disk as is.

MCL coredump does funny memory shuffling, too. It compresses
pages into a contiguous area of memory, and as it runs into output
pages that it has not yet compressed, it moves them into pages that
it has already compressed and keeps track of where everything is
located. That's a lot of the complexity of MCL coredump.

|
|>To handle the hard cases for kexec on panic I would recommend the
|>following.
|>
|>- Place the recovery code in a reserved area of memory that the normal
|> kernel will not touch, and actually run the code there. This
|> trivially solves the DMA problem because the hardware is not DMA'ing
|>
|>- Setup the kernel that does the recovery so that the pool of memory
|> it uses for dynamic allocations is also in the reserved area of
|> memory so that it is equally free of DMA dangers.
|>
|>- Modify the kernel that does the recovery so it can be run at
|> different physical address from the standard kernel, so it will not
|> need to be moved out of the reserved area of memory.
|
|
|Are you trying to address the possibility that DMA is overwriting
|memory we are using in the recovery code, due to a runaway driver
|or other code passing a wrong memory address to a device (e.g. in
|a corrupted command area) ? I'm wondering if just reserving
|an area of memory would help. As long as the address is visible/
|accessible by the device (i.e. unless we have the h/w support to
|apply protection at that level), can we really be safe in those
|weird or rare cases ? Disabling the bus-master sounds like a
|more dependable option for that (via device shutdown or reboot
|notifiers as suitable) if it can be done.
|
|Placing the recovery code in a safe reserved area (that the
|running kernel may not know about or may be protected),
|may reduce the possibility of the panic/buggy kernel overwriting
|it, but will it help the DMA case ?

Eric, I'd suggest you go with your previous advice and start simple
and go one step at a time. You will never be able to build a system
that can perfectly protect from anything that can go wrong. So start
with the simple case that covers 95% of the problems. Build a
system first that lets the driver quiesce the chip, then think about
moving those functions and their data into special protected memory.

I've actually never seen a core dump with the MCL core dump that
had a memory corruption so bad it couldn't take the dump.

|
|
|While the patch I'd posted has been designed so that ideally
|it should be possible to preserve everything, I'm still not
|certain if the compression we get is good enough for all cases
|(e.g a heavily loaded system with lots of non-redundant data)
|-- we really need to play around with the implementation and
|tune it. Secondly, for a large memory system, it could take a
|bit of time to compress all pages, and we might just want to
|dump potentially more relevant data (e.g kernel pages) for
|some kind of problems. It was easy enough to do this with some
|simple heuristics like dumping inuse pages which are nonlru.

From my experience, data in memory is very compressible
(more so than the average text file). Perhaps some pieces are
not very compressible, but on the whole they are. Plus you don't
have to have that much compression for this to work, just enough
to give you memory to boot the next kernel and save off a dump.
And speed is probably not a big issue here, since this should be a
very rare occurrence.

- -Corey
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE+R6+OmUvlb4BhfF4RAhW7AJ9ZUCbWk6TBLvbwYunyKMN0dAxf+QCff21/
WoOfzq4NrjYv3E0bOYhwSD8=
=T9Y9
-----END PGP SIGNATURE-----


2003-02-10 14:52:27

by Suparna Bhattacharya

Subject: Re: Kexec, DMA, and SMP

On Mon, Feb 10, 2003 at 07:56:35AM -0600, Corey Minyard wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Suparna Bhattacharya wrote:
>
> |Yes. It actually saves a formatted compressed dump in memory,
> |and later writes it out to disk as is.
>
> MCL coredump does funny memory shuffling, too. It compresses
> pages into a contiguous area of memory, and as it runs into output
> pages that it has not yet compressed, it moves them into pages that
> it has already compressed and keeps track of where everything is

AFAICR, the MCL coredump implementation I'd seen (and used as
a reference to model some of this code for lkcd) seemed to
save only a kernel dump (not user space pages), so it would
use the free and user pages as destination for compressed
dump. What you are describing sounds a little different and
closer to what we are doing. I'd be interested in taking a look
at the implementation you are working with if it actually
saves the whole memory by making use of pages it has already
compressed. Could you point me to the code ?

> located. That's a lot of the complexity of MCL coredump.
>
> |
> |While the patch I'd posted has been designed so that ideally
> |it should be possible to preserve everything, I'm still not
> |certain if the compression we get is good enough for all cases
> |(e.g a heavily loaded system with lots of non-redundant data)
> |-- we really need to play around with the implementation and
> |tune it. Secondly, for a large memory system, it could take a
> |bit of time to compress all pages, and we might just want to
> |dump potentially more relevant data (e.g kernel pages) for
> |some kind of problems. It was easy enough to do this with some
> |simple heuristics like dumping inuse pages which are nonlru.
>
> From my experience, data in memory is very compressible
> (moreso than the average text file). Perhaps some pieces are
> not very compressible, but in the whole they are. Plus you don't

Well, it may just be a matter of how our implementation is tuned.
MCL compresses a much larger buffer at a time than we do at the
moment (we did it a page at a time to simplify some of the tracking
in the dump format), so that could be one factor to consider and
maybe rethink. It's a little early to say, though; I need to
investigate further.
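
The effect of buffer size on compression ratio is easy to demonstrate. The sketch below uses Python's zlib purely as a stand-in (it is not claimed to be the compressor used by LKCD or MCL), and the test data is synthetic: the same incompressible page repeated, so the redundancy lives only across page boundaries.

```python
# Compare compressing one page at a time with compressing one large buffer.
# zlib is only a stand-in compressor; the memory image is synthetic.
import random
import zlib

PAGE_SIZE = 4096


def compressed_size_per_page(memory: bytes) -> int:
    """Total output size when every page is compressed independently."""
    total = 0
    for off in range(0, len(memory), PAGE_SIZE):
        total += len(zlib.compress(memory[off:off + PAGE_SIZE]))
    return total


def compressed_size_one_buffer(memory: bytes) -> int:
    """Output size when the whole image is compressed as a single stream."""
    return len(zlib.compress(memory))


def make_memory(pages: int = 16, seed: int = 0) -> bytes:
    """Synthetic image: one pseudo-random page repeated `pages` times."""
    rng = random.Random(seed)
    template = bytes(rng.randrange(256) for _ in range(PAGE_SIZE))
    return template * pages
```

On this input the per-page variant cannot see that the pages repeat, while the single-buffer variant can. Real memory images will fall somewhere in between, so this shows the direction of the effect rather than its size.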

> have to have that much compression for this to work, just enough
> to give you memory to boot the next kernel and save off a dump.

Also has to be enough to not overwrite the current kernel (at
least the parts that the dump saving code is using or relying on)

> And speed is probably not a big issue here, since this should be a
> very rare occurrence.

Speed is secondary of course, but just good to keep in mind
for very large memory systems.

Regards
Suparna

Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Labs, India

2003-02-10 15:13:00

by Corey Minyard

Subject: Re: Kexec, DMA, and SMP

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Suparna Bhattacharya wrote:

|On Mon, Feb 10, 2003 at 07:56:35AM -0600, Corey Minyard wrote:
|
|>-----BEGIN PGP SIGNED MESSAGE-----
|>Hash: SHA1
|>
|>Suparna Bhattacharya wrote:
|>
|>|Yes. It actually saves a formatted compressed dump in memory,
|>|and later writes it out to disk as is.
|>
|>MCL coredump does funny memory shuffling, too. It compresses
|>pages into a contiguous area of memory, and as it runs into output
|>pages that it has not yet compressed, it moves them into pages that
|>it has already compressed and keeps track of where everything is
|
|
|AFAICR, the MCL coredump implementation I'd seen (and used as
|a reference to model some of this code for lkcd) seemed to
|save only a kernel dump (not user space pages), so it would
|use the free and user pages as destination for compressed
|dump. What you are describing sounds a little different and
|closer to what we are doing. I'd be interested in taking a look
|at the implementation you are working with if it actually
|saves the whole memory by making use of pages it has already
|compressed. Could you point me to the code ?

I remembered incorrectly here. I was thinking of bootimg, which does do
some weird page shuffling. MCL coredump does not save in a contiguous
region; it keeps a free list of pages it has already compressed,
allocates destination pages from its free list, and stores those in a
map.
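
A toy model of that free-list scheme, in Python for illustration only. This is not the MCL coredump code: the function names, the use of zlib, and the packing of output into partially filled pages are all assumptions made for the sketch.

```python
# Toy model: compress pages, reuse already-compressed source pages as
# destinations, and keep a map of where each page's data ended up.
# Not the MCL implementation; names and the zlib choice are assumptions.
import zlib
from typing import Dict, List, Tuple

PAGE_SIZE = 4096
Extent = Tuple[int, int, int]  # (destination pfn, offset, length)


def compress_in_place(memory: Dict[int, bytes], free: List[int]):
    """Compress every page in `memory`, storing output only in free pages."""
    free = list(free)                       # pfns usable as destinations
    dest_pages: Dict[int, bytearray] = {}   # pfn -> compressed bytes held there
    page_map: Dict[int, List[Extent]] = {}  # source pfn -> where its data went
    cur = None                              # destination page being filled

    for src in sorted(memory):
        blob = zlib.compress(memory[src])
        free.append(src)  # source contents now live in `blob`, page is reusable
        extents: List[Extent] = []
        while blob:
            if cur is None or len(dest_pages[cur]) == PAGE_SIZE:
                cur = free.pop(0)  # raises IndexError if no page is left
                dest_pages[cur] = bytearray()
            room = PAGE_SIZE - len(dest_pages[cur])
            chunk, blob = blob[:room], blob[room:]
            extents.append((cur, len(dest_pages[cur]), len(chunk)))
            dest_pages[cur] += chunk
        page_map[src] = extents
    return dest_pages, page_map


def restore(dest_pages, page_map, src: int) -> bytes:
    """Reassemble and decompress one source page from its extents."""
    blob = b"".join(bytes(dest_pages[p][o:o + n]) for p, o, n in page_map[src])
    return zlib.decompress(blob)
```

The model needs a temporary buffer for one compressed page, and it fails if the data does not compress well enough for the free list to keep up.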

- -Corey
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE+R8NemUvlb4BhfF4RApfrAJ4tWv3mU8N4TDYXaymM4FBXJurJ3ACfef4r
qHRXTq8OS/+fb7KSFqWMKiw=
=h6qs
-----END PGP SIGNATURE-----


2003-02-10 16:59:45

by Andy Pfiffer

Subject: Re: [Fastboot] Re: Kexec on 2.5.59 problems ?

On Mon, 2003-02-10 at 03:14, Suparna Bhattacharya wrote:
> I am using the OSDL versions of the kexec patches for
> 2.5.59 (plm 1442 and 1444) for lkcd-kexec based crash dump
> work.

<snip>

>
> Surprisingly though, when I tried just a simple
> kexec -e today (having loaded the kernel earlier on),
> I ran into the following Oops, consistently:
>
> I'm using kexec-tools-1.8, and this has worked for me
> earlier. The test system is a 4way SMP machine.
>
> Has anyone seen this as well ? (I'd already issued init 1
> and unmounted filesystems by this point)
>
> sh-2.05a# /sbin/kexec -e
> Synchronizing SCSI caches:
> Shutting down devices
> Starting new kernel
> Unable to handle kernel paging request at virtual address 361ae000
> printing eip:
> c011470a
> *pde = 00000000
> Oops: 0002
> CPU: 0
> EIP: 0060:[<c011470a>] Not tainted
> EFLAGS: 00010003
> EIP is at machine_kexec+0x14a/0x190
> eax: 00000097 ebx: f7742260 ecx: 00000025 edx: 361ac000
> esi: c0114750 edi: 361ae000 ebp: f7365e94 esp: f7365e80
> ds: 007b es: 007b ss: 0068
>
> Process kexec (pid: 1685, threadinfo=f7364000 task=f6290060)
> Stack: 361ae000 361ac000 f7742260 f7364000 00000000 f7365fbc
> c0126903 f7742260 c02a71af c03a9aa8 00000001 00000000 f7fe1640
> f7793ec0 c1b3b120 f7364000 00000001 f7365edc c014dbef f7fe1668
> f7fe1668 00000286 f7ff51e0 f7365efc
>
> Call Trace:
> [<c0126903>] sys_reboot+0x363/0x400
> [<c014dbef>] invalidate_inode_buffers+0xf/0x90
> [<c01633b0>] clear_inode+0x10/0xb0
> [<c0238276>] sock_destroy_inode+0x16/0x20
> [<c016149e>] dput+0x1e/0x170 i
> [<c014cb56>] __fput+0x116/0x140 i
> [<c014b38f>] filp_close+0xcf/0xe0 i
> [<c014b43e>] sys_close+0x9e/0xd0 i
> [<c01091c7>] syscall_call+0x7/0xb i
>
> Code: f3 a5 a8 02 74 02 66 a5 a8 01 74 01 a4 e8 84 fe ff ff 6a 00
>
> Regards
> Suparna

Yes, I have seen that exact or similar oops when trying kexec for 2.5.59
on a 2-way Xeon system. The exact same software configuration does
*not* generate that oops on a 1-way P3-800 system.

I've had some difficulty with the serial console on that system, so I
don't yet have an exact traceback and cannot confirm 100% that yours and
mine are identical.

It sure *looks* the same.

Andy


2003-02-10 17:47:24

by Eric W. Biederman

Subject: Re: Kexec, DMA, and SMP

Suparna Bhattacharya <[email protected]> writes:

> On Sun, Feb 09, 2003 at 11:39:27AM -0700, Eric W. Biederman wrote:
> > Corey Minyard <[email protected]> writes:
> >
> > With respect to DMA and SMP handling for kexec on panic that case is
> > much trickier. A lot of the normal methods simply don't apply because
> > by definition in a panic something is broken, and that something may
> > be the code we need to cleanly shut down the hardware. But I am not
> > ready to sacrifice a method that works well in a properly working
> > kernel just because the panic case can't use it.
> >
> > In getting it working I suggest we start with the easy cases, where
> > DMA and SMP are not big issues. And then we can have a working
> > framework.
>
> I'd agree. That was also the idea behind the patch we'd just posted
> for LKCD. With a basic working framework in hand that works for
> simpler cases, we can now keep working on addressing more and harder
> situations bit by bit.

Agreed. I guess the primary question is can we trust the current
device shutdown + reboot notifier path or do we need to make some
large changes to avoid it.

> Are you trying to address the possibility that DMA is overwriting
> memory we are using in the recovery code, due to a runaway driver
> or other code passing a wrong memory address to a device (e.g. in
> a corrupted command area) ?

Not primarily. Instead I am trying to address the possibility that
DMA is overwriting the recovery code due to a device not being shutdown
properly. Though it would happen to cover many cases of the wrong
memory address being passed to a device.

> I'm wondering if just reserving
> an area of memory would help. As long as the address is visible/
> accessible by the device (i.e. unless we have the h/w support to
> apply protection at that level), can we really be safe in those
> weird or rare cases ? Disabling the bus-master sounds like a
> more dependable option for that (via device shutdown or reboot
> notifiers as suitable) if it can be done.

Basically using a reserved area of memory is an alternative to
device shutdown or calling the reboot notifiers. If the device shutdown
code is reliable enough we can go with that...

The other benefit of a reserved area of memory is that it
simplifies the other cases: you don't need to do anything
before the dump, because everything is preserved.

> Placing the recovery code in a safe reserved area (that the
> running kernel may not know about or may be protected),
> may reduce the possibility of the panic/buggy kernel overwriting
> it, but will it help the DMA case ?

Yes, for the same reasons. I am definitely not trying to address the
case of buggy hardware.

> > The one piece I don't know about is how to prioritize which pieces of
> > memory are written out first. It is certainly a desirable feature
> > but do we need that, if we can preserve everything? Or is it easy
> > enough to get the prioritizing information that we don't care.
>
> LKCD has support for doing that - it provides for specifying dump
> levels, to dump just a header, kernel pages, all in-use pages and
> full-memory. This can be tuned/extended for some more intermediate
> levels (e.g. header + stack traces for all cpus).

And that is why I thought of it. I need to review how that portion
of the code is done. The one downside of the simplifications that
come with a reserved area of memory is they make knowing the set of
kernel allocations a challenge.

However in most cases all in-use pages ~= full-memory.
And the kernel pages can be computed statically. For more
I guess it would be necessary to pass information regarding
the current kernel's data structures for tracking free memory
to the dumper. So the functionality does not need to be lost,
but providing it becomes a different problem.

> There is also some work-in-progress code for more granular
> dump customisation as a future item.
>
> While the patch I'd posted has been designed so that ideally
> it should be possible to preserve everything, I'm still not
> certain if the compression we get is good enough for all cases
> (e.g a heavily loaded system with lots of non-redundant data)
> -- we really need to play around with the implementation and
> tune it.

And I am certain that with a preserved memory area we can
preserve everything without compression.

> Secondly, for a large memory system, it could take a
> bit of time to compress all pages, and we might just want to
> dump potentially more relevant data (e.g kernel pages) for
> some kind of problems. It was easy enough to do this with some
> simple heuristics like dumping inuse pages which are nonlru.

I see. So you definitely have some interesting heuristics
to pick which pages to dump. I hate to break that but..

Eric

2003-02-10 17:57:42

by Eric W. Biederman

Subject: Re: [Fastboot] Re: Kexec on 2.5.59 problems ?

Andy Pfiffer <[email protected]> writes:

> On Mon, 2003-02-10 at 03:14, Suparna Bhattacharya wrote:
> > I am using the OSDL versions of the kexec patches for
> > 2.5.59 (plm 1442 and 1444) for lkcd-kexec based crash dump
> > work.
>
> <snip>
>
> >
> > Surprisingly though, when I tried just a simple
> > kexec -e today (having loaded the kernel earlier on),
> > I ran into the following Oops, consistently:
> >
> > I'm using kexec-tools-1.8, and this has worked for me
> > earlier. The test system is a 4way SMP machine.
> >
> > Has anyone seen this as well ? (I'd already issued init 1
> > and unmounted filesystems by this point)

Hmm. Would love to know which cpu this is on...

I think the primary candidate if this only occurs in smp is
the switch_mm. It may be that modifying the init_mm is not safe,
or it gets zapped somewhere else.

As soon as I get distractions in other directions under control
I will take a look.

Eric

2003-02-11 07:06:48

by Suparna Bhattacharya

Subject: Re: [Fastboot] Re: Kexec on 2.5.59 problems ?

On Mon, Feb 10, 2003 at 11:07:06AM -0700, Eric W. Biederman wrote:
> Andy Pfiffer <[email protected]> writes:
>
> > On Mon, 2003-02-10 at 03:14, Suparna Bhattacharya wrote:
> > > Surprisingly though, when I tried just a simple
> > > kexec -e today (having loaded the kernel earlier on),
> > > I ran into the following Oops, consistently:
> > >
> > > I'm using kexec-tools-1.8, and this has worked for me
> > > earlier. The test system is a 4way SMP machine.
> > >
> > > Has anyone seen this as well ? (I'd already issued init 1
> > > and unmounted filesystems by this point)
>
> Hmm. Would love to know which cpu this is on...
>
> I think the primary candidate if this only occurs in smp is
> the switch_mm. It may be that modifying the init_mm is not safe,
> or it gets zapped somewhere else.
>

The following patch from Anton Blanchard's WIP kexec tree
for ppc64 seems to fix this for me. It just does a use_mm()
(routine from fs/aio.c) instead of switch_mm().

Andy could you try this out and see if it helps ?

The other change in Anton's tree that we should probably
include uses a separate kexec_mm rather than init_mm
for the mapping.

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Labs, India


Attachments:
kexec-usemm.patch (828.00 B)

2003-02-11 12:40:03

by Suparna Bhattacharya

Subject: Re: Kexec, DMA, and SMP

On Mon, Feb 10, 2003 at 10:56:43AM -0700, Eric W. Biederman wrote:
> Suparna Bhattacharya <[email protected]> writes:
[snip]
>
> Not primarily. Instead I am trying to address the possibility that
> DMA is overwriting the recovery code due to a device not being shutdown
> properly. Though it would happen to cover many cases of the wrong
> memory address being passed to a device.
>
[snip]
>
> The other piece that a reserved area of memory is that you can
> simplify the other cases because you don't need to do anything
> before the dump because everything is preserved.
>
[snip]

> Yes, for the same reasons. I am definitely not trying to address the
> case of buggy hardware.
>
[snip]
>
> And I am certain that with a preserved memory area we can
> preserve everything without compression.
>

OK, I see where you are coming from. It is an interesting
possibility, if you know how to pull it off for various
architectures, and the working area that the new kernel needs
to operate to the extent of issuing the writeout is not
too big (i.e. doesn't take away too much memory from the
operational kernel). Perhaps we could hide this memory from
the normal kernel virtual address space most of the time, so
it's less susceptible to software corruption (i.e. besides
physical access via DMA).

At the same time we do want to quiesce / stop the DMA as
soon as possible to get a dump that reflects the
contents of memory at the concerned instant as closely as
possible. And in the buggy case we want to stop any
malfunctioning DMA (writes) immediately to minimize
damage.

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Labs, India

2003-02-11 13:25:10

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: Kexec, DMA, and SMP

On Tue, Feb 11, 2003 at 06:25:08PM +0530, Suparna Bhattacharya wrote:
> On Mon, Feb 10, 2003 at 10:56:43AM -0700, Eric W. Biederman wrote:
> > Suparna Bhattacharya <[email protected]> writes:
> [snip]
> >
> > Not primarily. Instead I am trying to address the possibility that
> > DMA is overwriting the recovery code due to a device not being shutdown
> > properly. Though it would happen to cover many cases of the wrong
> > memory address being passed to a device.
> >
>
> OK, I see where you are coming from. It is an interesting
> possibility, if you know how to pull it off for various
> architectures, and the working area that the new kernel needs
> to operate to the extent of issuing the writeout is not
> too big (i.e. doesn't take away too much memory from the
> operational kernel). Perhaps we could hide this memory from
> the normal kernel virtual address space most of the time, so
> it's less susceptible to software corruption (i.e. besides
> physical access via DMA).

For the sort of problems which Ken is seeing, maybe we can,
for a start, do without all the modifications to make the
new kernel run at a different address, if we can assume
that most i/o is likely to happen on dynamically allocated
buffers.

We could just reserve a memory area of reasonable size (how
much ?) which would be used by the new kernel for all its
allocations. We already have the infrastructure to tell the
new kernel which memory areas not to use, so it's simple
enough to ask it to exclude all but the reserved area.
By issuing the i/o as early as possible during bootup
(for lkcd all we need is the block device to be setup for
i/o requests), we can minimize the amount of memory to
reserve in this manner.
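
As a rough illustration of "exclude all but the reserved area": the sketch below clips a simplified memory map down to one reserved window, which is roughly what the new kernel would be told it owns. The struct and function names are made up for this example and are not kexec's actual interface; ranges are assumed to be half-open, [start, end).

```c
#include <stddef.h>
#include <stdint.h>

/* One usable RAM range, [start, end), in bytes. Hypothetical type for this sketch. */
struct mem_range {
    uint64_t start;
    uint64_t end;
};

/*
 * Copy into `out` only the parts of `map` that fall inside `window`.
 * `out` must have room for `n` entries. Returns the number of ranges written.
 */
static size_t clip_to_window(const struct mem_range *map, size_t n,
                             struct mem_range window,
                             struct mem_range *out)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        /* Intersection of map[i] with the window. */
        uint64_t lo = map[i].start > window.start ? map[i].start : window.start;
        uint64_t hi = map[i].end < window.end ? map[i].end : window.end;

        if (lo < hi) {          /* skip ranges that miss the window entirely */
            out[count].start = lo;
            out[count].end = hi;
            count++;
        }
    }
    return count;
}
```

With a map of {0x0-0x9f000, 0x100000-0x20000000} and a 4MB window at 16MB, the only range handed to the new kernel would be the window itself.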

That might address a large percentage of the regular cases,
i.e. except where statically allocated buffers could be
targets for DMA. If we are using in-use (user) pages
for saving the dump, then there is a possibility of a dump
getting corrupted by a DMA, but there may be a way to
minimize that when we choose destination pages to use.

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Labs, India

2003-02-11 13:57:03

by Corey Minyard

[permalink] [raw]
Subject: Re: Kexec, DMA, and SMP

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Suparna Bhattacharya wrote:

|
|For the sort of problems which Ken is seeing, maybe we can,
|for a start, do without all the modifications to make the
|new kernel run at a different address, if we can assume
|that most i/o is likely to happen on dynamically allocated
|buffers.
|
|We could just reserve a memory area of reasonable size (how
|much ?) which would be used by the new kernel for all its
|allocations. We already have the infrastructure to tell the
|new kernel which memory areas not to use, so it's simple
|enough to ask it to exclude all but the reserved area.
|By issuing the i/o as early as possible during bootup
|(for lkcd all we need is the block device to be setup for
|i/o requests), we can minimize the amount of memory to
|reserve in this manner.

DMA can occur almost anywhere. If you restrict the area of DMA, that
means you have to copy the contents to the final destination. I don't think
we want to do that in many cases.

|
|That might address a large percentage of the regular cases,
|i.e. except where statically allocated buffers could be
|targets for DMA. If we are using in-use (user) pages
|for saving the dump, then there is a possibility of a dump
|getting corrupted by a DMA, but there may be a way to
|minimize that when we choose destination pages to use.

Unless you have some way to mark pages as current DMA targets,
you won't be able to do this. And the problem Ken and I are seeing is
happening after the new kernel has booted. An old DMA operation is
occurring after the new kernel has booted. That means two kernels would
have to choose the same DMA target areas, and that's fairly unreasonable
to ask, IMHO.

The only reasonable way I can think of to do this is to quiesce the devices
before dumping memory or doing a kexec. It's not that hard to do, it's just
that a lot of DMA capable device drivers exist that don't do this.
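
A minimal sketch of what such a quiesce list could look like, written as plain user-space C rather than real kernel code. Everything here (the hook type, the fixed-size table, the fake NIC) is hypothetical and only meant to show the shape: drivers register a non-blocking callback, and the kexec or dump path runs all of them before touching memory.

```c
#include <stddef.h>

#define MAX_QUIESCE_HOOKS 16

/*
 * A hook must stop the device's DMA without sleeping, since it may be
 * called from a panic. Returns 0 on success.
 */
typedef int (*quiesce_fn)(void *dev);

struct quiesce_hook {
    quiesce_fn fn;
    void *dev;
};

static struct quiesce_hook hooks[MAX_QUIESCE_HOOKS];
static size_t nr_hooks;

/* Called by a driver when it sets up a device. Returns -1 if the table is full. */
static int register_quiesce_hook(quiesce_fn fn, void *dev)
{
    if (fn == NULL || nr_hooks == MAX_QUIESCE_HOOKS)
        return -1;
    hooks[nr_hooks].fn = fn;
    hooks[nr_hooks].dev = dev;
    nr_hooks++;
    return 0;
}

/* Run every hook, most recently registered first. Returns the failure count. */
static size_t quiesce_all_devices(void)
{
    size_t failures = 0;

    for (size_t i = nr_hooks; i-- > 0; ) {
        if (hooks[i].fn(hooks[i].dev) != 0)
            failures++;
    }
    return failures;
}

/*
 * Example hook for a made-up NIC: clearing a flag stands in for the
 * register write that would halt the device's DMA engines.
 */
struct fake_nic {
    int dma_running;
};

static int fake_nic_quiesce(void *dev)
{
    struct fake_nic *nic = dev;

    nic->dma_running = 0;
    return 0;
}
```

The hard part is not this list but what the thread already points out: every DMA-capable driver has to supply a hook that really is safe to call without blocking.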

- -Corey
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE+SQNzmUvlb4BhfF4RAnboAJ4rOL+Amh8F1EvahT9Uko/Y6tPXRwCfV2su
0g582Xllh4TGZ7wQ2YJSsDg=
=FaLb
-----END PGP SIGNATURE-----


2003-02-11 14:25:20

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: Kexec, DMA, and SMP

On Tue, Feb 11, 2003 at 08:06:44AM -0600, Corey Minyard wrote:
> |
> |We could just reserve a memory area of reasonable size (how
> |much ?) which would be used by the new kernel for all its
> |allocations. We already have the infrastructure to tell the
> |new kernel which memory areas not to use, so it's simple
> |enough to ask it to exclude all but the reserved area.
> |By issuing the i/o as early as possible during bootup
> |(for lkcd all we need is the block device to be setup for
> |i/o requests), we can minimize the amount of memory to
> |reserve in this manner.
>
> DMA can occur almost anywhere. If you restrict the area of DMA, that
> means you have to copy the contents to the final destination. I don't think
> we want to do that in many cases.

The scope here was just the case that Eric seemed to be
trying to address, the way I understood it, and hence a much
simpler subset of the problem at hand, since it is not really
tackling the rogue/buggy cases. There is no restriction on
where DMA can happen, just a block of memory area set aside
for the dormant kernel to use when it is instantiated.
So this is an area that the current kernel will not use or
touch and not specify as a DMA target during "regular"
operation.

> |
> |That might address a large percentage of the regular cases,
> |i.e. except where statically allocated buffers could be
> |targets for DMA. If we are using in-use (user) pages
> |for saving the dump, then there is a possibility of a dump
> |getting corrupted by a DMA, but there may be a way to
> |minimize that when we choose destination pages to use.
>
> Unless you have some way to mark pages as current DMA targets,
> you won't be able to do this. And the problem Ken and I are seeing is
> happening after the new kernel has booted. An old DMA operation is
> occurring after the new kernel has booted. That means two kernels would
> have to choose the same DMA target areas, and that's fairly unreasonable
> to ask, IMHO.

Not really, this isn't about matching DMA target areas. It's
about the new kernel ignoring memory that the old kernel was
using and only using the reserved area of memory which the
old kernel was expected to have left alone in normal operation.

This is not the entire spectrum of situations where any physical
address could be a potential DMA target, due to a buggy kernel
which could have passed any address to the device concerned.
For that case, of course, quiescing the devices seems like the
best way out so far.

So whether such reservation would solve the case you see
would depend on whether the old DMA operation is targeted at
a valid buffer in the old kernel, or if it is indeed a buggy
scenario where DMA is happening into an address it shouldn't
really be overwriting.

>
> The only reasonable way I can think of to do this is to quiesce the devices
> before dumping memory or doing a kexec. It's not that hard to do, it's just
> that a lot of DMA capable device drivers exist that don't do this.

Yes, this is indeed what we need eventually.
What would it take to get there ? The main difficulty is making
sure all device drivers do this ..

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Labs, India

2003-02-11 15:11:08

by Corey Minyard

[permalink] [raw]
Subject: Re: Kexec, DMA, and SMP

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Suparna Bhattacharya wrote:

|On Tue, Feb 11, 2003 at 08:06:44AM -0600, Corey Minyard wrote:
|
|>|
|>|We could just reserve a memory area of reasonable size (how
|>|much ?) which would be used by the new kernel for all its
|>|allocations. We already have the infrastructure to tell the
|>|new kernel which memory areas not to use, so it's simple
|>|enough to ask it to exclude all but the reserved area.
|>|By issuing the i/o as early as possible during bootup
|>|(for lkcd all we need is the block device to be setup for
|>|i/o requests), we can minimize the amount of memory to
|>|reserve in this manner.
|>
|>DMA can occur almost anywhere. If you restrict the area of DMA, that
|>means you have to copy the contents to the final destination. I don't think
|>we want to do that in many cases.
|
|
|The scope here was just the case that Eric seemed to be
|trying to address, the way I understood it, and hence a much
|simpler subset of the problem at hand, since it is not really
|tackling the rogue/buggy cases. There is no restriction on
|where DMA can happen, just a block of memory area set aside
|for the dormant kernel to use when it is instantiated.
|So this is an area that the current kernel will not use or
|touch and not specify as a DMA target during "regular"
|operation.

You don't understand. You don't *want* to set aside a block of memory
that's reserved for DMA. You want to be able to DMA directly into any
user address. Consider demand paging. The performance would suck if you
DMA into some fixed region then copied to the user address. Plus you then
have another resource you have to manage in the kernel. And you still have
to change all the drivers, buffer management, etc. to add a flag that says
"I'm going to use this for DMA" to allocations. You might as well add the
quiesce function, it's probably easier to do. And it doesn't help if you
DMA to static memory addresses.

I, too, would like a simpler solution. I just don't think this is it.

- -Corey
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE+SRTJmUvlb4BhfF4RAg2ZAJ9R52BdasmLGTMI6GmX+2j0CeLXPwCfQzfE
wQYjBHmyCThURH2hjZ83wfE=
=kZiP
-----END PGP SIGNATURE-----


2003-02-11 16:55:12

by Andy Pfiffer

[permalink] [raw]
Subject: Re: [Fastboot] Re: Kexec on 2.5.59 problems ?

On Mon, 2003-02-10 at 23:21, Suparna Bhattacharya wrote:
<snip>
> The following patch from Anton Blanchard's WIP kexec tree
> for ppc64 seems to fix this for me. It just does a use_mm()
> (routine from fs/aio.c) instead of switch_mm().
>
> Andy could you try this out and see if it helps ?
>
> The other change in Anton's tree that we should probably
> include uses a separate kexec_mm rather than init_mm
> for the mapping.
>
> Regards
> Suparna

Will do. --Andy

2003-02-11 23:37:13

by Andy Pfiffer

[permalink] [raw]
Subject: Re: [Fastboot] Re: Kexec on 2.5.59 problems ?

On Tue, 2003-02-11 at 09:04, Andy Pfiffer wrote:
> On Mon, 2003-02-10 at 23:21, Suparna Bhattacharya wrote:
> <snip>
> > The following patch from Anton Blanchard's WIP kexec tree
> > for ppc64 seems to fix this for me. It just does a use_mm()
> > (routine from fs/aio.c) instead of switch_mm().
> >
> > Andy could you try this out and see if it helps ?
> >
> > The other change in Anton's tree that we should probably
> > include uses a separate kexec_mm rather than init_mm
> > for the mapping.
> >
> > Regards
> > Suparna
>
> Will do. --Andy

Answer: hard lock-up after decompressing the kernel. I'll see if I can
get anything meaningful out of the system before it wedges.

Andy


2003-02-12 04:19:09

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Kexec, DMA, and SMP

Corey Minyard <[email protected]> writes:

> Suparna Bhattacharya wrote:
>
> |On Tue, Feb 11, 2003 at 08:06:44AM -0600, Corey Minyard wrote:
> |
> |>|
> |>|We could just reserve a memory area of reasonable size (how
> |>|much ?) which would be used by the new kernel for all its
> |>|allocations. We already have the infrastructure to tell the
> |>|new kernel which memory areas not to use, so it's simple
> |>|enough to ask it to exclude all but the reserved area.
> |>|By issuing the i/o as early as possible during bootup
> |>|(for lkcd all we need is the block device to be setup for
> |>|i/o requests), we can minimize the amount of memory to
> |>|reserve in this manner.
> |>
> |>DMA can occur almost anywhere. If you restrict the area of DMA, that
> |>means you have to copy the contents to the final destination. I don't think
> |>we want to do that in many cases.
> |
> |
> |The scope here was just the case that Eric seemed to be
> |trying to address, the way I understood it, and hence a much
> |simpler subset of the problem at hand, since it is not really
> |tackling the rogue/buggy cases. There is no restriction on
> |where DMA can happen, just a block of memory area set aside
> |for the dormant kernel to use when it is instantiated.
> |So this is an area that the current kernel will not use or
> |touch and not specify as a DMA target during "regular"
> |operation.
>
> You don't understand. You don't *want* to set aside a block of memory that's
> reserved for DMA. You want to be able to DMA directly into any user address.
> Consider demand paging. The performance would suck if you DMA into some
> fixed region then copied to the user address. Plus you then have another
> resource you have to manage in the kernel. And you still have to change all
> the drivers, buffer management, etc. to add a flag that says "I'm going to use
> this for DMA" to allocations. You might as well add the quiesce function, it's
> probably easier to do. And it doesn't help if you DMA to static memory
> addresses.
>
> I, too, would like a simpler solution. I just don't think this is it.

You have it backwards. It is not about reserving a block of memory
for DMA. It is about reserving a block of memory to not do DMA in.
Something like 4MB or so.

The idea is not to let the original kernel touch the reserved block at all.
We just put the kernel that kexec will start in that block of memory.

Eric

2003-02-12 04:20:39

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Fastboot] Re: Kexec on 2.5.59 problems ?

Andy Pfiffer <[email protected]> writes:

> On Tue, 2003-02-11 at 09:04, Andy Pfiffer wrote:
> > On Mon, 2003-02-10 at 23:21, Suparna Bhattacharya wrote:
> > <snip>
> > > The following patch from Anton Blanchard's WIP kexec tree
> > > for ppc64 seems to fix this for me. It just does a use_mm()
> > > (routine from fs/aio.c) instead of switch_mm().
> > >
> > > Andy could you try this out and see if it helps ?
> > >
> > > The other change in Anton's tree that we should probably
> > > include uses a separate kexec_mm rather than init_mm
> > > for the mapping.
> > >
> > > Regards
> > > Suparna
> >
> > Will do. --Andy
>
> Answer: hard lock-up after decompressing the kernel. I'll see if I can
> get anything meaningful out of the system before it wedges.

Which kernel is wedging? The kexec'd kernel, or the kernel with
the patch?

Eric

2003-02-12 04:32:08

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: Kexec, DMA, and SMP

On Tue, Feb 11, 2003 at 09:20:42AM -0600, Corey Minyard wrote:
> |The scope here was just the case that Eric seemed to be
> |trying to address, the way I understood it, and hence a much
> |simpler subset of the problem at hand, since it is not really
> |tackling the rogue/buggy cases. There is no restriction on
> |where DMA can happen, just a block of memory area set aside
> |for the dormant kernel to use when it is instantiated.
> |So this is an area that the current kernel will not use or
> |touch and not specify as a DMA target during "regular"
> |operation.
>
> You don't understand. You don't *want* to set aside a block of memory
> that's reserved for DMA. You want to be able to DMA directly into any
> user address. Consider demand paging. The performance would suck if you
> DMA into some fixed region then copied to the user address. Plus you then
> have another resource you have to manage in the kernel. And you still have
> to change all the drivers, buffer management, etc. to add a flag that says
> "I'm going to use this for DMA" to allocations.

That is not what I'm suggesting. I wish I knew a better way
to put myself in your shoes and explain this from your context
without repeating myself.

I'm not talking about DMA'ing into a fixed region and copying
into user address.

There is just this (say) 4MB area that the current kernel thinks
is reserved/already allocated. When the recovery kernel comes up
it thinks it's booted with just this 4MB of memory, quickly performs
the writeout of the dump to disk (which in the case of lkcd
happens straight from the kernel, _unlike_ MCL which needs
filesystems mounted and drives this from user space), and
then reboots itself the normal way (i.e. not via kexec).

So while the recovery kernel is running in a very constrained
way, it is not _meant_ to be dealing with user-space workloads
etc - just a disk writeout and a prompt reboot.

And there is no need to change drivers, buffer mgmt etc in any
of this. There is no explicit limitation on where to or not to
DMA from.

It is simply what Eric was suggesting, minus the need to build
the recovery kernel to be able to run from different physical
addresses.

That's about all !

Does this make things any clearer ?

> You might as well add the quiesce
> function, it's probably easier to do.

Yes if that can be done for all drivers, well and good.

> And it doesn't help if you DMA to static memory
> addresses.

Again I'm not suggesting we DMA into static memory addresses.
Quite the reverse actually.
The point was that this scheme wouldn't help in the cases where
DMA is happening to static memory addresses.

>
> I, too, would like a simpler solution. I just don't think
> this is it.

This wasn't even intended to be a full solution as Eric
has already observed.

I think we must quiesce the devices. Just that for the
cases that this isn't happening yet, we are a little
better off than having nothing at all.

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Labs, India

2003-02-12 14:09:19

by Corey Minyard

[permalink] [raw]
Subject: Re: Kexec, DMA, and SMP

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Eric W. Biederman wrote:

|Corey Minyard <[email protected]> writes:
|
|>
|>You don't understand. You don't *want* to set aside a block of memory that's
|>reserved for DMA. You want to be able to DMA directly into any user address.
|>Consider demand paging. The performance would suck if you DMA into some
|>fixed region then copied to the user address. Plus you then have another
|>resource you have to manage in the kernel. And you still have to change all
|>the drivers, buffer management, etc. to add a flag that says "I'm going to use
|>this for DMA" to allocations. You might as well add the quiesce function, it's
|>probably easier to do. And it doesn't help if you DMA to static memory
|>addresses.
|>
|>I, too, would like a simpler solution. I just don't think this is it.
|
|
|You have it backwards. It is not about reserving a block of memory
|for DMA. It is about reserving a block of memory to not do DMA in.
|Something like 4MB or so.
|
|The idea is not to let the original kernel touch the reserved block at all.
|We just put the kernel that kexec will start in that block of memory.
|
|Eric
|
Ah, it makes much more sense now. Thank you. I still don't think it's
as easy as you think, though, because there's no designation on most
memory allocations to give you this information. There's GFP_DMA, but
according to the docs that's just for x86 ISA DMA devices. You would
have to hunt down all the memory allocations, figure out if they are DMA
targets or not, and add a flag for that. I still say it's easier to just
add the function to the drivers.

- -Corey
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE+SleKmUvlb4BhfF4RAjbMAJ0RANUJ6OsH0yvKEtfPBty1TPP2dgCfSl48
zjLWwW5Vf7Y/igLXUSdcpNQ=
=MT+8
-----END PGP SIGNATURE-----


2003-02-12 14:42:22

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Kexec, DMA, and SMP

Corey Minyard <[email protected]> writes:

> Eric W. Biederman wrote:
>
> |Corey Minyard <[email protected]> writes:
> |
> |>
> |>You don't understand. You don't *want* to set aside a block of memory that's
> |>reserved for DMA. You want to be able to DMA directly into any user address.
> |>Consider demand paging. The performance would suck if you DMA into some
> |>fixed region then copied to the user address. Plus you then have another
> |>resource you have to manage in the kernel. And you still have to change all
> |>the drivers, buffer management, etc. to add a flag that says "I'm going to use
> |>this for DMA" to allocations. You might as well add the quiesce function, it's
> |>probably easier to do. And it doesn't help if you DMA to static memory
> |>addresses.
> |>
> |>I, too, would like a simpler solution. I just don't think this is it.
> |
> |
> |You have it backwards. It is not about reserving a block of memory
> |for DMA. It is about reserving a block of memory to not do DMA in.
> |Something like 4MB or so.
> |
> |The idea is not to let the original kernel touch the reserved block at all.
> |We just put the kernel that kexec will start in that block of memory.
> |
> |Eric
> |
> Ah, it makes much more sense now. Thank you. I still don't think it's as easy
> as you think, though, because there's no designation on most memory
> allocations to give you this information. There's GFP_DMA, but according to
> the docs that's just for x86 ISA DMA devices. You would have to hunt down all
> the memory allocations, figure out if they are DMA targets or not, and add a
> flag for that. I still say it's easier to just add the function to the drivers.

It is trivial if you don't let alloc_pages give the memory to anyone for
any purpose.
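
To make that concrete, here is a toy model, not the real alloc_pages: a bitmap allocator over 16MB of pretend physical memory with 4KB pages. The names and sizes are invented for the sketch. Marking the reserved block as "used" up front is all it takes; no caller needs a new flag, because the allocator simply never hands those pages out.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES 4096UL     /* 16MB of toy "physical" memory in 4KB pages */

static bool page_used[NR_PAGES];

/* Mark [start_pfn, start_pfn + count) as permanently in use. */
static void reserve_pages(size_t start_pfn, size_t count)
{
    for (size_t pfn = start_pfn; pfn < start_pfn + count && pfn < NR_PAGES; pfn++)
        page_used[pfn] = true;
}

/* Hand out the lowest free page frame number, or -1 when none is left. */
static long toy_alloc_page(void)
{
    for (size_t pfn = 0; pfn < NR_PAGES; pfn++) {
        if (!page_used[pfn]) {
            page_used[pfn] = true;
            return (long)pfn;
        }
    }
    return -1;
}
```

Calling reserve_pages(2048, 1024) once at startup withholds a 4MB block at the 8MB mark. Every later allocation, whether it ends up as a DMA buffer or not, lands outside it.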

Eric

2003-02-12 15:56:21

by Corey Minyard

[permalink] [raw]
Subject: Re: Kexec, DMA, and SMP

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Eric W. Biederman wrote:

|Corey Minyard <[email protected]> writes:
|
|>Eric W. Biederman wrote:
|>
|>|Corey Minyard <[email protected]> writes:
|>|
|>|>
|>|>You don't understand. You don't *want* to set aside a block of memory that's
|>|>reserved for DMA. You want to be able to DMA directly into any user address.
|>|>Consider demand paging. The performance would suck if you DMA into some
|>|>fixed region then copied to the user address. Plus you then have another
|>|>resource you have to manage in the kernel. And you still have to change all
|>|>the drivers, buffer management, etc. to add a flag that says "I'm going to use
|>|>this for DMA" to allocations. You might as well add the quiesce function, it's
|>|>probably easier to do. And it doesn't help if you DMA to static memory
|>|>addresses.
|>|>
|>|>I, too, would like a simpler solution. I just don't think this is it.
|>|
|>|
|>|You have it backwards. It is not about reserving a block of memory
|>|for DMA. It is about reserving a block of memory to not do DMA in.
|>|Something like 4MB or so.
|>|
|>|The idea is not to let the original kernel touch the reserved block at all.
|>|We just put the kernel that kexec will start in that block of memory.
|>|
|>|Eric
|>|
|>Ah, it makes much more sense now. Thank you. I still don't think it's as easy
|>as you think, though, because there's no designation on most memory
|>allocations to give you this information. There's GFP_DMA, but according to
|>the docs that's just for x86 ISA DMA devices. You would have to hunt down all
|>the memory allocations, figure out if they are DMA targets or not, and add a
|>flag for that. I still say it's easier to just add the function to the drivers.
|
|
|It is trivial if you don't let alloc_pages give the memory to anyone for
|any purpose.

Ok, agreed, if you reserve a section of physical memory just for kexec
to copy its kernel into, it will prevent DMA from clobbering something
from the time kexec copies the kernel there to the time decompressing
starts.

Another thought. If you add a delay with all other processors and
interrupts off, the disk devices will run out of things to do.

Once you add all the necessary quiesce functions, these can go away.

I do doubt these will make a big difference, though. The problem we
were seeing was with the shared control structures in memory. The new
kernel laid memory out a little differently and things like buffer
pointers were overwritten with new data. This, in turn, caused the
device to do random things. I would guess this is the most likely
scenario, since the time you are protecting against with the memory
layout is small compared to the time spent booting.

- -Corey
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE+SnDpmUvlb4BhfF4RAruzAJ9Wts5ovfYf4Ncl+trn755L6JCc6QCfZffG
xGMlv58qX1v3ue0iLwxMRaw=
=aH5T
-----END PGP SIGNATURE-----


2003-02-12 22:22:25

by Andy Pfiffer

[permalink] [raw]
Subject: Re: [Fastboot] Re: Kexec on 2.5.59 problems ?

On Tue, 2003-02-11 at 20:29, Eric W. Biederman wrote:
> Andy Pfiffer <[email protected]> writes:
>
> > On Tue, 2003-02-11 at 09:04, Andy Pfiffer wrote:
> > > On Mon, 2003-02-10 at 23:21, Suparna Bhattacharya wrote:
> > > <snip>
> > > > The following patch from Anton Blanchard's WIP kexec tree
> > > > for ppc64 seems to fix this for me. It just does a use_mm()
> > > > (routine from fs/aio.c) instead of switch_mm().
> > > >
> > > > Andy could you try this out and see if it helps ?
> > > >
<snip>
> > > > Regards
> > > > Suparna
> > >
> > > Will do. --Andy
> >
> > Answer: hard lock-up after decompressing the kernel. I'll see if I can
> > get anything meaningful out of the system before it wedges.
>
> Which kernel is wedging. The kexec'd kernel. Or the kernel with
> the patch?
>
> Eric

Correction: this patch is now working for me. While pruning my .config
to debug my serial console problem, kexec worked on a 2-way for me
several times in a row without failure. (I hadn't properly updated my
script that invokes kexec with my preferred command line arguments).

I'll add the patchlet to our PLM system, and try the entire package
again on 2.5.60 on a 2-way.

Andy





2003-02-13 09:35:21

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [Fastboot] Re: Kexec on 2.5.59 problems ?

On Wed, Feb 12, 2003 at 02:31:57PM -0800, Andy Pfiffer wrote:
> On Tue, 2003-02-11 at 20:29, Eric W. Biederman wrote:
> > Andy Pfiffer <[email protected]> writes:
> > > On Tue, 2003-02-11 at 09:04, Andy Pfiffer wrote:
> > > > On Mon, 2003-02-10 at 23:21, Suparna Bhattacharya wrote:
> > > > <snip>
> > > > > The following patch from Anton Blanchard's WIP kexec tree
> > > > > for ppc64 seems to fix this for me. It just does a use_mm()
> > > > > (routine from fs/aio.c) instead of switch_mm().
> > > > >
> > > > > Andy could you try this out and see if it helps ?
> > > > >
> <snip>
> > > > > Regards
> > > > > Suparna
> > > >
> > > > Will do. --Andy
> > >
> > > Answer: hard lock-up after decompressing the kernel. I'll see if I can
> > > get anything meaningful out of the system before it wedges.
> >
> > Which kernel is wedging. The kexec'd kernel. Or the kernel with
> > the patch?
> >
> > Eric
>
> Correction: this patch is now working for me. While pruning my .config
> to debug my serial console problem, kexec worked on a 2-way for me
> several times in a row without failure. (I hadn't properly updated my
> script that invokes kexec with my preferred command line arguments).

Great !
Eventually we should probably avoid init_mm altogether (on ppc64
at least, init_mm can't be used as Anton pointed out to me) and
setup a spare mm instead.

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Labs, India

2003-02-13 10:58:19

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: Kexec, DMA, and SMP

On Wed, Feb 12, 2003 at 10:06:02AM -0600, Corey Minyard wrote:
> |
> |It is trivial if you don't let alloc_pages give the memory to anyone for
> |any purpose.
>
> Ok, agreed, if you reserve a section of physical memory just for kexec
> to copy its kernel into, it will prevent DMA from clobbering something
> from the time kexec copies the kernel there to the time decompressing
> starts.
>
> Another thought. If you add a delay with all other processors and
> interrupts off, the disk devices will run out of things to do.
>
> Once you add all the necessary quiesce functions, these can go away.
>
> I do doubt these will make a big difference, though. The problem we
> were seeing was with the shared control structures in memory. The new
> kernel laid memory out a little differently and things like buffer
> pointers were overwritten with new data. This, in turn, caused the
> device to do random things. I would guess this is the most likely
> scenario, since

The trick is that the new kernel's allocations are also from the
reserved area. (Using the same technique that MCL relies on to
avoid allocating from and stomping over the pages containing
crash dump).

So the memory used by the old and new kernel are mutually
exclusive.
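
The "mutually exclusive" property can be stated as a simple range check. The sketch below is illustrative only, with invented names and half-open [start, end) physical ranges: the reserved region is safe from a well-behaved old kernel's leftover DMA exactly when none of that kernel's outstanding DMA buffers overlaps it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A physical address range, [start, end). Hypothetical type for this sketch. */
struct phys_range {
    uint64_t start;
    uint64_t end;
};

static bool ranges_overlap(struct phys_range a, struct phys_range b)
{
    return a.start < b.end && b.start < a.end;
}

/*
 * Returns true when none of the old kernel's outstanding DMA buffers
 * touches the region set aside for the new kernel.
 */
static bool region_is_safe(struct phys_range reserved,
                           const struct phys_range *dma_targets, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ranges_overlap(reserved, dma_targets[i]))
            return false;
    }
    return true;
}
```

This covers only DMA into buffers the old kernel legitimately allocated. A device handed a bad address can still hit the reserved region, which is the rogue/buggy case that quiescing is meant to address.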

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Labs, India

2003-02-13 15:01:24

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Fastboot] Re: Kexec on 2.5.59 problems ?

Suparna Bhattacharya <[email protected]> writes:

> Great !
> Eventually we should probably avoid init_mm altogether (on ppc64
> at least, init_mm can't be used as Anton pointed out to me) and
> setup a spare mm instead.

What is the problem with init_mm? Besides the fact that using it
is now failing?

Eric

2003-02-14 03:04:32

by Werner Almesberger

[permalink] [raw]
Subject: Re: Kexec, DMA, and SMP

Corey Minyard wrote:
> Another thought. If you add a delay with all other processors and
> interrupts off, the disk devices will run out of things to do.

But the network will be there, patiently waiting for its chance to
strike. Likewise, I guess: USB (e.g. move the mouse at the wrong
moment to crash the system).

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-02-14 14:10:20

by Corey Minyard

Subject: Re: Kexec, DMA, and SMP

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Werner Almesberger wrote:

|Corey Minyard wrote:
|
|>Another thought. If you add a delay with all other processors and
|>interrupts off, the disk devices will run out of things to do.
|
|
|But the network will be there, patiently waiting for its chance to
|strike. Likewise, I guess: USB (e.g. move the mouse at the wrong
|moment to crash the system).

Yes, we were talking about temporary stopgaps.

But, I had another idea. What about using power management? If you
suspended everything, would that be good enough? I looked at a few
drivers, and it seemed so.

- -Corey
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE+TPsOmUvlb4BhfF4RAic3AJ4qKgL0CHROXoyu30rWlhfzlBxOEgCfSzJ6
GeM4AJbZLaHv8GeD5N/uaHI=
=Dg5Q
-----END PGP SIGNATURE-----


2003-02-14 18:01:22

by Werner Almesberger

Subject: Re: Kexec, DMA, and SMP

Corey Minyard wrote:
> Yes, we were talking about temporary stopgaps.

Yes, that's what Suparna and Eric were discussing :-)

> But, I had another idea. What about using power management? If you
> suspended everything, would that be good enough? I looked at a few
> drivers, and it seemed so.

As long as you don't need any form of synchronization to power
down a device, and if it comes up silent (i.e. no "sleep" mode,
in which it still has enough power to remember DMA lists and
such), that would work.

I'd suspect that power management requires you to synchronize,
so we're back to square one.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-02-14 18:13:34

by Corey Minyard

Subject: Re: Kexec, DMA, and SMP

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Werner Almesberger wrote:

|Corey Minyard wrote:
|
|>Yes, we were talking about temporary stopgaps.
|
|
|Yes, that's what Suparna and Eric were discussing :-)
|
|>But, I had another idea. What about using power management? If you
|>suspended everything, would that be good enough? I looked at a few
|>drivers, and it seemed so.
|
|
|As long as you don't need any form of synchronization to power
|down a device, and if it comes up silent (i.e. no "sleep" mode,
|in which it still has enough power to remember DMA lists and
|such), that would work.
|
|I'd suspect that power management requires you to synchronize,
|so we're back to square one.

Yes, some do and some don't. You could define a new state for the
"suspend" call that says "just shut down with no locks". But the
infrastructure to do a suspend is already in the PCI code and others;
you could use that and take it out of all the CONFIG_PM ifdefs.

- -Corey
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE+TTQXmUvlb4BhfF4RArOXAJ4o5F7zgV6zZv9jAXBx5za2xvnZUgCfYj9A
byMvWFosxYMN6/0Ibmk7Ors=
=RLPc
-----END PGP SIGNATURE-----


2003-02-14 19:19:36

by Zwane Mwaikambo

Subject: Re: Kexec, DMA, and SMP

On Fri, 14 Feb 2003, Corey Minyard wrote:

> Yes, some do and some don't. You could define a new state for the
> "suspend" call that says "just shut down with no locks". But the
> infrastructure is already in the PCI code and others to do a suspend,
> you could use that and take it out of all the CONFIG_PM ifdefs.

I don't think suspending devices is safe at that stage since removing
devices and walking lists and freeing memory and disabling devices and...
kicks up quite a storm.

Zwane
--
function.linuxpower.ca

2003-02-14 19:35:54

by Werner Almesberger

Subject: Re: Kexec, DMA, and SMP

Zwane Mwaikambo wrote:
> I don't think suspending devices is safe at that stage since removing
> devices and walking lists and freeing memory and disabling devices and...
> kicks up quite a storm.

If you *really* don't want to stop devices, you can use the
"reserved, non-DMA memory" approach, kexec the kernel that
records the crash dump, and then do a system-wide reset, or
such.

But if you don't have that - possibly considerable - amount
of memory to spare, you don't have much of a choice other than
to stop devices. Of course, crash dumps don't need a neat and
clean shutdown, so you can avoid all the kfrees, and such.

(So adding a special mode to the power management code may
be too much overhead. Besides, sometimes, you can just pull
a reset line, and don't have to do anything even remotely
related to power management.)

Also, for each device you're using when dumping, you should
have some means to bring it into a defined state already.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-02-14 19:50:48

by Corey Minyard

Subject: Re: Kexec, DMA, and SMP

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Werner Almesberger wrote:

|Zwane Mwaikambo wrote:
|
|>I don't think suspending devices is safe at that stage since removing
|>devices and walking lists and freeing memory and disabling devices and...
|>kicks up quite a storm.
|
|
|If you *really* don't want to stop devices, you can use the
|"reserved, non-DMA memory" approach, kexec the kernel that
|records the crash dump, and then do a system-wide reset, or
|such.
|
|But if you don't have that - possibly considerable - amount
|of memory to spare, you don't have much of a choice than to
|stop devices. Of course, crash dumps don't need a neat and
|clean shutdown, so you can avoid all the kfrees, and such.
|
|(So adding a special mode to the power management code may
|be too much overhead. Besides, sometimes, you can just pull
|a reset line, and don't have to do anything even remotely
|related to power management.)

True, I didn't mean the high-level power management code directly. But the
PCI API defines a suspend operation that could take a special mode for this.
Or maybe a new field in the PCI structure (and equivalent for other things,
if there are any). But the suspend and resume operations should at least
give a good idea where it's needed and how to use it.

- -Corey
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE+TUrdmUvlb4BhfF4RAn2aAJ40Ktn3x6Zygs1RMnAs/HLp5YqtHwCaA3kD
lRNA6aXFagCkjbE87e+DZCw=
=9wf1
-----END PGP SIGNATURE-----


2003-02-15 05:53:54

by Eric W. Biederman

Subject: Re: Kexec, DMA, and SMP

Corey Minyard <[email protected]> writes:

> Werner Almesberger wrote:
>
> |Zwane Mwaikambo wrote:
> |
> |>I don't think suspending devices is safe at that stage since removing
> |>devices and walking lists and freeing memory and disabling devices and...
> |>kicks up quite a storm.
> |
> |
> |If you *really* don't want to stop devices, you can use the
> |"reserved, non-DMA memory" approach, kexec the kernel that
> |records the crash dump, and then do a system-wide reset, or
> |such.
> |
> |But if you don't have that - possibly considerable - amount
> |of memory to spare, you don't have much of a choice than to
> |stop devices. Of course, crash dumps don't need a neat and
> |clean shutdown, so you can avoid all the kfrees, and such.
> |
> |(So adding a special mode to the power management code may
> |be too much overhead. Besides, sometimes, you can just pull
> |a reset line, and don't have to do anything even remotely
> |related to power management.)
>
> True, I didn't mean the high-level power management code directly. But the
> PCI API defines a suspend operation that could take a special mode for this.

The generic device api has a shutdown method for this. And in the non panic
case we use it. Not a lot of devices have it implemented but it exists.

And except that it carries no restriction against blocking, it is pretty
much what you want.

> Or maybe a new field in the PCI structure (and equivalent for other things, if
> there are any). But the suspend and resume operations should at least give
> a good idea where its needed and how to use it.

The API is already done...

We just don't trust the dying kernel enough to use it during a panic.

Eric

2003-02-16 16:12:58

by Corey Minyard

Subject: Re: Kexec, DMA, and SMP

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Eric W. Biederman wrote:

|Corey Minyard <[email protected]> writes:
|
|>|
|>|(So adding a special mode to the power management code may
|>|be too much overhead. Besides, sometimes, you can just pull
|>|a reset line, and don't have to do anything even remotely
|>|related to power management.)
|>
|>True, I didn't mean the high-level power management code directly. But the
|>PCI API defines a suspend operation that could take a special mode for this.
|
|
|The generic device api has a shutdown method for this. And in the non panic
|case we use it. Not a lot of devices have it implemented but it exists.
|
|And except that it doesn't have a restriction that it can't block is pretty
|much what you want.

That's a pretty big restriction. Plus, you can't claim spinlocks.

The panic shutdown is different from an orderly shutdown. What the
current shutdown does is probably not what you want.

|
|>Or maybe a new field in the PCI structure (and equivalent for other things, if
|>there are any). But the suspend and resume operations should at least give
|>a good idea where its needed and how to use it.
|
|
|The API is already done...

The API is not done for panics. There's no call that has the proper
semantics.

|
|
|We just don't trust the dying kernel enough to use it during a panic.

I don't understand this. If you can't trust a dying kernel to properly
shut down devices, how can you trust it to boot a new kernel? And (much
worse) if you don't shut down the devices, how can you trust the new
kernel to execute properly? I know there are levels of trust here, but
I'd much rather have the kernel lockup during the reboot than have a
chance of a new kernel booting that could behave incorrectly. In
general, the chance of behaving incorrectly is MUCH worse than a sure
lockup, especially in systems that must be reliable.

- -Corey
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE+T7rOIXnXXONXERcRAksfAJ9kVRD2S9OK5siBqAPMkbfi2iS2fgCeM3hw
Fjp2LXiNEURU+HNrByOGVBQ=
=5sxh
-----END PGP SIGNATURE-----


2003-02-16 21:38:41

by Eric W. Biederman

Subject: Re: Kexec, DMA, and SMP

Corey Minyard <[email protected]> writes:

> Eric W. Biederman wrote:
>
> |Corey Minyard <[email protected]> writes:
> |
> |>|
> |>|(So adding a special mode to the power management code may
> |>|be too much overhead. Besides, sometimes, you can just pull
> |>|a reset line, and don't have to do anything even remotely
> |>|related to power management.)
> |>
> |>True, I didn't mean the high-level power management code directly. But the
> |>PCI API defines a suspend operation that could take a special mode for this.
> |
> |
> |The generic device api has a shutdown method for this. And in the non panic
> |case we use it. Not a lot of devices have it implemented but it exists.
> |
> |And except that it doesn't have a restriction that it can't block is pretty
> |much what you want.
>
> That's a pretty big restriction. Plus, you can't claim spinlocks.
>
> The panic shutdown is different from an orderly shutdown. What the current
> shutdown does is probably not what you want.

I do not see a large difference between the desired semantics of an
orderly shutdown, and the desired semantics of a panic shutdown.

> |>Or maybe a new field in the PCI structure (and equivalent for other things, if
> |>there are any). But the suspend and resume operations should at least give
> |>a good idea where its needed and how to use it.
> |
> |
> |The API is already done...
>
> The API is not done for panics. There's no call that has the proper semantics.

device->shutdown() is new enough and unimplemented enough that a
restriction against blocking is a reasonable addition, if that turns out
to be a reasonable thing to do.

> |
> |
> |We just don't trust the dying kernel enough to use it during a panic.
>
> I don't understand this. If you can't trust a dying kernel to properly shut
> down devices, how can you trust it to boot a new kernel?

The kernel started during panic has one purpose, to record the state of
the system for analysis. So it need not support a fully functioning
user space.

By definition, if a panic has happened, something bad has happened; we
assume it is a software problem.

> And (much worse) if
> you don't shut down the devices, how can you trust the new kernel to execute
> properly?

Because the kernel to handle the panic only initializes those devices
it can reliably initialize from any state. And it is living in an
area of memory the old kernel did not allow DMA to.

> I know there are levels of trust here, but I'd much rather have the
> kernel lockup during the reboot than have a chance of a new kernel booting that
> could behave incorrectly.

The kexec on panic thing is not to replace a reboot. It is to
reliably capture the system state when something nasty happens, which
you cannot do after a reboot.

If the system can be made robust enough to use for other purposes
great, but that is not the goal.

> In general, the chance of behaving incorrectly is
> MUCH worse than a sure lockup, especially in systems that must be
> reliable.

Basically the panic logic does not change:

    if (...) {
        machine_kexec();
    } else {
        machine_restart();
    }

After an event like that you may need to restart the machine to be
100% reliable. Or much more likely it was a hardware failure and
hardware needs to be replaced.

But if it is a software failure, kexec'ing a new kernel should make it
possible to capture the software state at the failure, so the problem
does not need to be reproduced for the developers. That allows the
software to be corrected more quickly, hopefully before the problem
would reoccur naturally.

Eric

2003-02-17 04:16:25

by Corey Minyard

Subject: Re: Kexec, DMA, and SMP

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Eric W. Biederman wrote:

|Corey Minyard <[email protected]> writes:
|
|
|
|>Eric W. Biederman wrote:
|>
|>|Corey Minyard <[email protected]> writes:
|>|
|>|>|
|>|>|(So adding a special mode to the power management code may
|>|>|be too much overhead. Besides, sometimes, you can just pull
|>|>|a reset line, and don't have to do anything even remotely
|>|>|related to power management.)
|>|>
|>|>True, I didn't mean the high-level power management code directly. But the
|>|>PCI API defines a suspend operation that could take a special mode for this.
|>|
|>|
|>|The generic device api has a shutdown method for this. And in the non panic
|>|case we use it. Not a lot of devices have it implemented but it exists.
|>|
|>|And except that it doesn't have a restriction that it can't block is pretty
|>|much what you want.
|>
|>That's a pretty big restriction. Plus, you can't claim spinlocks.
|>
|>The panic shutdown is different from an orderly shutdown. What the current
|>shutdown does is probably not what you want.
|>
|>
|
|I do not see a large difference between the desired semantics of an
|orderly shutdown, and the desired semantics of a panic shutdown.
|
An orderly shutdown will:

  * claim locks and block as necessary
  * free memory associated with the device
  * flush device queues
  * fully shut down the device

An orderly shutdown should make sure the system remains sane after it
finishes and the data on the device is correct.

A panic shutdown should only disable DMA with as little code as possible
without locking, blocking, etc. No effort should be taken to keep the
system sane (beyond clobbering memory), since it's not sane to begin
with :-).

You may want to say that this shutdown will be the panic shutdown and not be
an orderly shutdown. That's fine, although I would suggest a name change.
I couldn't find any documentation on what the shutdown call was supposed
to do.
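
To make the panic-shutdown idea concrete, here is a minimal user-space sketch in C of a list of registered non-blocking calls that shut off DMA. None of this is an existing kernel interface: `register_quiesce_hook`, `panic_quiesce_all`, `MAX_QUIESCE_HOOKS` and the `fake_nic` device are hypothetical names invented for the example. The point is only that the panic path walks a fixed array and takes no locks, allocates nothing and never sleeps.

```c
#include <stddef.h>

#define MAX_QUIESCE_HOOKS 16  /* hypothetical fixed limit, so no allocation */

typedef void (*quiesce_fn)(void *dev);

struct quiesce_hook {
    quiesce_fn fn;  /* must not block, lock or allocate */
    void *dev;      /* opaque per-device pointer passed back to fn */
};

static struct quiesce_hook hooks[MAX_QUIESCE_HOOKS];
static size_t nr_hooks;

/* Called at driver init time, in normal context. Returns 0 on success. */
int register_quiesce_hook(quiesce_fn fn, void *dev)
{
    if (fn == NULL || nr_hooks >= MAX_QUIESCE_HOOKS)
        return -1;
    hooks[nr_hooks].fn = fn;
    hooks[nr_hooks].dev = dev;
    nr_hooks++;
    return 0;
}

/* Called from the panic path: just walk the array and stop each device.
 * Returns the number of hooks that were called. */
size_t panic_quiesce_all(void)
{
    size_t i;

    for (i = 0; i < nr_hooks; i++)
        hooks[i].fn(hooks[i].dev);
    return i;
}

/* Example device: "stopping DMA" is a single store, nothing else. */
struct fake_nic {
    int dma_enabled;
};

void fake_nic_stop_dma(void *dev)
{
    struct fake_nic *nic = dev;

    nic->dma_enabled = 0;
}
```

A driver would call `register_quiesce_hook(fake_nic_stop_dma, &nic)` once at init, and the panic code would call `panic_quiesce_all()` before jumping to the new kernel.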

|
|
|
|Because the kernel to handle the panic only initializes those devices
|it can reliably initialize from any state. And it is living in an
|area of memory the old kernel did not allow DMA to.
|
Are you sure this will be ok? I'm not sure either way. How much memory does
a kernel take to boot up and operate for this? If it's a few meg, it's
probably livable. If it's a lot of memory, it's probably not going to be
acceptable.

Plus, perhaps you would want to protect the output of the kernel dump
somehow. That's going to be a lot more memory than you can reserve. And if
you can shut off DMA, none of this should matter anyway.

The rest of what you said, about the panic kernel only taking the core dump
and then rebooting, makes sense to me.

- -Corey
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE+UGRfIXnXXONXERcRArgvAJ96cqVaxZeA83KuR1kSXFKVRSnpIACfQ83W
gc5bibmlh4sPmmq6onPc5w0=
=bgpj
-----END PGP SIGNATURE-----


2003-02-17 07:08:33

by Eric W. Biederman

Subject: Re: Kexec, DMA, and SMP

Corey Minyard <[email protected]> writes:

> An orderly shutdown will:
>
>   * claim locks and block as necessary
>   * free memory associated with the device
>   * flush device queues
>   * fully shut down the device

Most of this is required before a module is removed, but quite
a bit of it does not belong in the shutdown method, as user space
and higher level code can down interfaces and remount hardware read
only.

So by the time it gets to the device driver for the final shutdown you
get:

- Locks are irrelevant because there are no other users (by
definition).

- Queues are irrelevant as the device driver is not responsible for
those. Someone needs to call sync, down the network interface,
or the equivalent to push the last of the data through the device.

- Freeing memory is only needed if the kernel will persist, on a
reboot or a halt that is not the case.

> An orderly shutdown should make sure the system remains sane after it
> finishes and the data on the device is correct.

But there are multiple layers to how that should be accomplished.

> A panic shutdown should only disable DMA with as little code as possible
> without locking, blocking, etc. No effort should be taken to keep the
> system sane (beyond clobbering memory), since it's not sane to begin with :-).
>
> You may want to say that this shutdown will be the panic shutdown and not be
> an orderly shutdown. That's fine, although I would suggest a name change.
> I couldn't find any documentation on what the shutdown call was supposed to do.

Currently the shutdown method is supposed to do the minimal amount needed to
place a device in a quiescent state before a reboot or a halt. And it
should optionally place the device in a state from which it can be
reinitialized by the driver later.

Essentially that is turning off DMA and setting a few registers
so that the device can be restarted later. It explicitly does
not include freeing resources.

There may be some blocking waiting for the device, but I do not see
blocking waiting for other users of the device as correct. By
definition all other users are gone, so any locks in that code protect
nothing.

So I do not see how that differs, in any significant way from what you
figure the panic handler should do. Though I do admit I have
questions if the code would be reliable in a panic situation.
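
A minimal sketch of a shutdown method with those semantics, written as user-space C against a mock device, might look like the following. The register layout, the bit names `CTRL_DMA_ENABLE` and `CTRL_RESTART_READY`, and the `mock_dev` structure are all hypothetical; a real driver would write to its hardware's actual registers instead of a plain integer.

```c
#include <stdint.h>

/* Hypothetical control-register bits for a mock device. */
#define CTRL_DMA_ENABLE    (1u << 0)  /* device may master the bus */
#define CTRL_RESTART_READY (1u << 1)  /* device left in a re-initializable state */

struct mock_dev {
    uint32_t ctrl;  /* stands in for a memory-mapped control register */
    void *rx_ring;  /* driver-owned buffer; shutdown does not free it */
};

/* Quiesce only: stop DMA and mark the device restartable.
 * No locks are taken, nothing is freed, nothing sleeps. */
void mock_shutdown(struct mock_dev *dev)
{
    dev->ctrl &= ~CTRL_DMA_ENABLE;
    dev->ctrl |= CTRL_RESTART_READY;
    /* dev->rx_ring is deliberately left alone: the kernel is going away. */
}
```

After `mock_shutdown()` the device can no longer write to memory, but every resource the driver held is still exactly where it was.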

> |Because the kernel to handle the panic only initializes those devices
> |it can reliably initialize from any state. And it is living in an
> |area of memory the old kernel did not allow DMA to.
> |

> Are you sure this will be ok? I'm not sure either way. How much memory does
> a kernel take to boot up and operate for this? If it's a few meg, it's probably
> livable.
> If it's a lot of memory, it's probably not going to be acceptable.

Somewhere from 2-8Meg I would guess is sufficient on x86. It
primarily depends on the size of the dump kernel, and the user space.
8Meg used to be enough to run X.

> Plus, perhaps you would want to protect the output of the kernel dump somehow.
> That's going to be a lot more memory than you can reserve. And if you can shut
> off DMA, none of this should matter anyway.

Protect it? It does not need to be generated until after the new
kernel takes over. So the dump never needs to live in ram.

As for turning off DMA on a panic. It has already been shown that DMA
cannot be reliably turned off in a device-independent way. Something
is definitely broken, and the odds are it is a driver. If there is
special panic code in a driver it is likely to be the least tested code
path, if it exists at all. And with so many random devices out there
I do not see how it can be shown that they all have correct panic code.

I do not see shutting down all DMA as feasible, which is why I am
attempting to avoid the issue. With a reserved area of memory,
the only thing I am left to worry about is the unlikely case of some
device doing DMA to an incorrect location that just happens to be
where the recovery kernel sits.

> The rest of what you said, about the panic kernel only taking the core dump and
> then rebooting, makes sense to me.

Good.

Eric

2003-02-17 17:23:09

by Corey Minyard

Subject: Re: Kexec, DMA, and SMP

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

So the semantics of the shutdown method are: "You are being called at
reboot or halt time, no other processes are running or will ever run,
quiesce the device, but do nothing else". Then obviously, it's exactly
what we need, if you can get the device driver writers to implement it.

It would be very nice to have documentation on this (and the rest of the
driver model, too). The docs in the kernel don't give a big picture.
In fact, just reading the docs gives you no idea what a driver model is.
Does some other source of documentation exist?

device_shutdown() claims a semaphore for some reason, though. I suspect
it's not necessary.

- -Corey
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE+URzCIXnXXONXERcRAhmgAKCIan40sZy389m9FS/ESkH96v3efACgrwx5
hsU4LWh+FigmWx9RlejSix8=
=AMdY
-----END PGP SIGNATURE-----


2003-02-18 10:45:51

by Suparna Bhattacharya

Subject: Re: [Fastboot] Re: Kexec on 2.5.59 problems ?

Here's the explanation from Anton about why using init_mm is
a problem on ppc64.

Regards
Suparna

----- Forwarded message from Anton Blanchard <[email protected]> -----

Date: Tue, 18 Feb 2003 20:56:23 +1100
From: Anton Blanchard <[email protected]>
To: Suparna Bhattacharya <[email protected]>
Subject: Re: Fw: Re: [Fastboot] Re: Kexec on 2.5.59 problems ?


On Thu, Feb 13, 2003 at 08:10:41AM -0700, Eric W. Biederman wrote:
> Suparna Bhattacharya <[email protected]> writes:
>
> > Great !
> > Eventually we should probably avoid init_mm altogether (on ppc64
> > at least, init_mm can't be used as Anton pointed out to me) and
> > setup a spare mm instead.
>
> What is the problem with init_mm? Besides the fact that using it
> is now failing?
>


Hi Suparna,

On ppc64 we have many 2^41B (2 TB) regions:

USER
KERNEL
VMALLOC
IO

Why 2TB? Well our three level linux pagetables can map 2TB. The kernel has
no pagetables, so we only need three sets of pagetables. As usual each
user task has its own set of pagetables. So that leaves vmalloc and IO.

For IO we create our own pgd, ioremap_pgd, and for vmalloc we use init_mm.
Why not? It's not being used anywhere else... except for kexec.

So init_mm covers the region of:

0xD000000000000000 to 0xD000000000000000+2^41

And what kexec wants is a page under 4GB :)

That's why we created another mm.
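
The arithmetic can be checked with a short C sketch. The 2^41 region size and the 0xD000000000000000 base come from the text above; the split into 4 KB pages with 9, 10 and 10 index bits per level is an assumption chosen for illustration because it adds up to 41 bits, and the helper names `in_vmalloc_region` and `below_4gb` are hypothetical.

```c
#include <stdint.h>

/* Assumed split: 4 KB pages plus three page-table levels = 41 bits. */
#define PAGE_SHIFT     12
#define PTE_INDEX_BITS  9
#define PMD_INDEX_BITS 10
#define PGD_INDEX_BITS 10

#define REGION_SHIFT \
    (PAGE_SHIFT + PTE_INDEX_BITS + PMD_INDEX_BITS + PGD_INDEX_BITS)
#define REGION_SIZE (UINT64_C(1) << REGION_SHIFT)  /* 2^41 bytes = 2 TB */

#define VMALLOC_REGION_START UINT64_C(0xD000000000000000)
#define FOUR_GB              (UINT64_C(1) << 32)

/* True when addr lies in the range that init_mm's pagetables cover. */
int in_vmalloc_region(uint64_t addr)
{
    return addr >= VMALLOC_REGION_START &&
           addr - VMALLOC_REGION_START < REGION_SIZE;
}

/* True when addr is the kind of low address kexec is asking for. */
int below_4gb(uint64_t addr)
{
    return addr < FOUR_GB;
}
```

No address can satisfy both predicates at once, which is the mismatch: pagetables that only describe the 0xD... region cannot map a page under 4GB.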


Could you please forward it on to the list?

Thanks!
Anton

----- End forwarded message -----

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Labs, India

2003-02-18 14:56:57

by Eric W. Biederman

Subject: Re: [Fastboot] Re: Kexec on 2.5.59 problems ?

Suparna Bhattacharya <[email protected]> writes:

> Here's the explanation from Anton about why using init_mm is
> a problem on ppc64.

Thanks.

> Hi Suparna,
>
> On ppc64 we have many 2^41B (2 TB) regions:
>
> USER
> KERNEL
> VMALLOC
> IO
>
> Why 2TB? Well our three level linux pagetables can map 2TB. The kernel has
> no pagetables, so we only need three sets of pagetables. As usual each
> user task has its own set of pagetables. So that leaves vmalloc and IO.
>
> For IO we create our own pgd, ioremap_pgd and for vmalloc we use init_mm.
> Why not? Its not being used anywhere else... except for kexec.
>
> So init_mm covers the region of:
>
> 0xD000000000000000 to 0xD000000000000000+2^41
>
> And what kexec wants is a page under 4GB :)

In this case it definitely wants something identity mapped, which would
mean in the first 2TB region. On x86 the limit is 4GB because I only have
32bit pointers. On a 64bit arch that limit should go away.

> Thats why we created another mm.

That makes sense. I guess it boils down to the fact that init_mm
is special cased in a number of places and using it is likely to get
me into trouble...

You would not happen to have code that creates a separate mm so I can
be lazy would you?

Eric