Hi Neil,
first some general words about the problem you discovered: The problem
is not caused by in-flight DMA. The problem is that the IOMMU hardware
has cached the old DTE entry for the device (including the old
page-table root pointer) and is using it still when the kdump kernel has
booted. We had this problem once and fixed it by flushing a DTE in the
IOMMU before it is used for the first time. This seems to be broken
now. Which kernel have you seen this on?
I am back in office next tuesday and will look into this problem too.
On Wed, Mar 31, 2010 at 04:27:45PM -0400, Neil Horman wrote:
> So I'm officially rescinding this patch.
Yeah, the right solution to this problem is to find out why every DTE is
not longer flushed before first use.
> It apparently just covered up the problem, rather than solved it
> outright. This is going to take some more thought on my part. I read
> the code a bit closer, and the amd iommu on boot up currently marks
> all its entries as valid and having a valid translation (because if
> they're marked as invalid they're passed through untranslated which
> strikes me as dangerous, since it means a dma address treated as a bus
> address could lead to memory corruption. The saving grace is that
> they are marked as non-readable and non-writeable, so any device doing
> a dma after the reinit should get logged (which it does), and then
> target aborted (which should effectively squash the translation)
Right. The default for all devices is to forbid DMA.
> I'm starting to wonder if:
>
> 1) some dmas are so long lived they start aliasing new dmas that get mapped in
> the kdump kernel leading to various erroneous behavior
At least not in this case. Even when this is true the DMA would target
memory of the crashed kernel and not the kdump area. This is not even
memory corruption because the device will write to memory the driver has
allocated for it.
> 2) a slew of target aborts to some hardware results in them being in an
> inconsistent state
Thats indeed true. I have seen that with ixgbe cards for example. They
seem to be really confused after an target abort.
> I'm going to try marking the dev table on shutdown such that all devices have no
> read/write permissions to see if that changes the situation. It should I think
> give me a pointer as to weather (1) or (2) is the more likely problem.
Probably not. You still need to flush the old entries out of the IOMMU.
Thanks,
Joerg
On Thu, Apr 01, 2010 at 04:29:02PM +0200, Joerg Roedel wrote:
> Hi Neil,
>
> first some general words about the problem you discovered: The problem
> is not caused by in-flight DMA. The problem is that the IOMMU hardware
> has cached the old DTE entry for the device (including the old
> page-table root pointer) and is using it still when the kdump kernel has
> booted. We had this problem once and fixed it by flushing a DTE in the
> IOMMU before it is used for the first time. This seems to be broken
> now. Which kernel have you seen this on?
>
First noted on 2.6.32 (the RHEL6 beta kernel), but I've reproduced with the
latest linus tree as well.
> I am back in office next tuesday and will look into this problem too.
>
Thank you.
> On Wed, Mar 31, 2010 at 04:27:45PM -0400, Neil Horman wrote:
> > So I'm officially rescinding this patch.
>
> Yeah, the right solution to this problem is to find out why every DTE is
> not longer flushed before first use.
>
Right, I've checked the commits that chris noted in his previous email and
they're in place, so I'm not sure how we're getting stale dte's
> > It apparently just covered up the problem, rather than solved it
> > outright. This is going to take some more thought on my part. I read
> > the code a bit closer, and the amd iommu on boot up currently marks
> > all its entries as valid and having a valid translation (because if
> > they're marked as invalid they're passed through untranslated which
> > strikes me as dangerous, since it means a dma address treated as a bus
> > address could lead to memory corruption. The saving grace is that
> > they are marked as non-readable and non-writeable, so any device doing
> > a dma after the reinit should get logged (which it does), and then
> > target aborted (which should effectively squash the translation)
>
> Right. The default for all devices is to forbid DMA.
>
Thanks, glad to know I read that right, took me a bit to understand it :)
> > I'm starting to wonder if:
> >
> > 1) some dmas are so long lived they start aliasing new dmas that get mapped in
> > the kdump kernel leading to various erroneous behavior
>
> At least not in this case. Even when this is true the DMA would target
> memory of the crashed kernel and not the kdump area. This is not even
> memory corruption because the device will write to memory the driver has
> allocated for it.
>
Yeah, I figured that old dma's going to old locations were ok, I was more
concerned that if an 'old' dma lived through our resetting of the iommu page
table, leading to us pointing an old dma address to a new physical address
within the new kernel memory space. Although, given the reset state of the
tables, for that to happen a dma would have to not attempt a memory transaction
until sometime later in the boot, which seems...unlikely to say the least, so I
agree this is almost certainly not happening.
> > 2) a slew of target aborts to some hardware results in them being in an
> > inconsistent state
>
> Thats indeed true. I have seen that with ixgbe cards for example. They
> seem to be really confused after an target abort.
>
Yeah, this part worries me, target aborts lead to various brain dead hardware
pieces. What are you thoughts on leaving the iommu on through a reboot to avoid
this issue (possibly resetting any pci device that encounters a target abort, as
noted in the error log on the iommu?
> > I'm going to try marking the dev table on shutdown such that all devices have no
> > read/write permissions to see if that changes the situation. It should I think
> > give me a pointer as to weather (1) or (2) is the more likely problem.
>
> Probably not. You still need to flush the old entries out of the IOMMU.
>
Yeah, after reading your explination above, I agree
Neil
>
On Thu, Apr 01, 2010 at 10:47:36AM -0400, Neil Horman wrote:
> On Thu, Apr 01, 2010 at 04:29:02PM +0200, Joerg Roedel wrote:
> > I am back in office next tuesday and will look into this problem too.
> >
> Thank you.
Just took a look and I think the problem is that the devices are
attached to domains before the IOMMU hardware is enabled. This happens
in the function prealloc_protection_domains(). The attach code issues
the dte-invalidate commands but they are not executed because the
hardware is off. I will verify this when I have access to hardware
again.
The possible fix will be to enable the hardware earlier in the
initialization path.
> > Right. The default for all devices is to forbid DMA.
> >
> Thanks, glad to know I read that right, took me a bit to understand it :)
I should probably add a comment :-)
> > Thats indeed true. I have seen that with ixgbe cards for example. They
> > seem to be really confused after an target abort.
> >
> Yeah, this part worries me, target aborts lead to various brain dead hardware
> pieces. What are you thoughts on leaving the iommu on through a reboot to avoid
> this issue (possibly resetting any pci device that encounters a target abort, as
> noted in the error log on the iommu?
This would only prevent possible data corruption. When the IOMMU is off
the devices will not get a target abort but will only write to different
physical memory locations. The window where a target abort can happen
starts when the kdump kernel re-enables the IOMMU and ends when the new
driver for that device attaches. This is a small window but there is not
a lot we can do to avoid this small time window.
Joerg
On Thu, Apr 01, 2010 at 05:56:43PM +0200, Joerg Roedel wrote:
> On Thu, Apr 01, 2010 at 10:47:36AM -0400, Neil Horman wrote:
> > On Thu, Apr 01, 2010 at 04:29:02PM +0200, Joerg Roedel wrote:
> > > I am back in office next tuesday and will look into this problem too.
> > >
> > Thank you.
>
> Just took a look and I think the problem is that the devices are
> attached to domains before the IOMMU hardware is enabled. This happens
> in the function prealloc_protection_domains(). The attach code issues
> the dte-invalidate commands but they are not executed because the
> hardware is off. I will verify this when I have access to hardware
> again.
> The possible fix will be to enable the hardware earlier in the
> initialization path.
>
That sounds like a reasonable theory, I'll try hack something together shortly.
> > > Right. The default for all devices is to forbid DMA.
> > >
> > Thanks, glad to know I read that right, took me a bit to understand it :)
>
> I should probably add a comment :-)
>
> > > Thats indeed true. I have seen that with ixgbe cards for example. They
> > > seem to be really confused after an target abort.
> > >
> > Yeah, this part worries me, target aborts lead to various brain dead hardware
> > pieces. What are you thoughts on leaving the iommu on through a reboot to avoid
> > this issue (possibly resetting any pci device that encounters a target abort, as
> > noted in the error log on the iommu?
>
> This would only prevent possible data corruption. When the IOMMU is off
> the devices will not get a target abort but will only write to different
> physical memory locations. The window where a target abort can happen
> starts when the kdump kernel re-enables the IOMMU and ends when the new
> driver for that device attaches. This is a small window but there is not
> a lot we can do to avoid this small time window.
>
Can you explain this a bit further please? From what I read, when the iommu is
disabled, AIUI it does no translations. That means that any dma addresses which
the driver mapped via the iommu prior to a crash that are stored in devices will
just get strobed on the bus without any translation. If those dma address do
not lay on top of any physical ram, won't that lead to bus errors, and
transaction aborts? Worse, if those dma addresses do lie on top of real
physical addresses, won't we get corruption in various places? Or am I missing
part of how that works?
Regards
Neil
> Joerg
>
>
> _______________________________________________
> kexec mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/kexec
On Thu, Apr 01, 2010 at 01:11:49PM -0400, Neil Horman wrote:
> On Thu, Apr 01, 2010 at 05:56:43PM +0200, Joerg Roedel wrote:
> > The possible fix will be to enable the hardware earlier in the
> > initialization path.
> >
> That sounds like a reasonable theory, I'll try hack something together
> shortly.
Great. So the problem might be already fixed when I am back in the
office ;-)
> > This would only prevent possible data corruption. When the IOMMU is off
> > the devices will not get a target abort but will only write to different
> > physical memory locations. The window where a target abort can happen
> > starts when the kdump kernel re-enables the IOMMU and ends when the new
> > driver for that device attaches. This is a small window but there is not
> > a lot we can do to avoid this small time window.
> >
> Can you explain this a bit further please? From what I read, when the iommu is
> disabled, AIUI it does no translations. That means that any dma addresses which
> the driver mapped via the iommu prior to a crash that are stored in devices will
> just get strobed on the bus without any translation. If those dma address do
> not lay on top of any physical ram, won't that lead to bus errors, and
> transaction aborts? Worse, if those dma addresses do lie on top of real
> physical addresses, won't we get corruption in various places? Or am I missing
> part of how that works?
Hm, the device address may not be a valid host physical address, thats
true. But the problem with the small time-window when the IOMMU hardware
is re-programmed from the kdump kernel still exists.
I need to think about other possible side-effects of leaving the IOMMU
enabled on shutdown^Wboot into a kdump kernel.
Joerg
On Thu, Apr 01, 2010 at 10:14:34PM +0200, Joerg Roedel wrote:
> On Thu, Apr 01, 2010 at 01:11:49PM -0400, Neil Horman wrote:
> > On Thu, Apr 01, 2010 at 05:56:43PM +0200, Joerg Roedel wrote:
>
> > > The possible fix will be to enable the hardware earlier in the
> > > initialization path.
> > >
> > That sounds like a reasonable theory, I'll try hack something together
> > shortly.
>
> Great. So the problem might be already fixed when I am back in the
> office ;-)
>
Don't hold your breath, but I'll try my best :)
> > > This would only prevent possible data corruption. When the IOMMU is off
> > > the devices will not get a target abort but will only write to different
> > > physical memory locations. The window where a target abort can happen
> > > starts when the kdump kernel re-enables the IOMMU and ends when the new
> > > driver for that device attaches. This is a small window but there is not
> > > a lot we can do to avoid this small time window.
> > >
> > Can you explain this a bit further please? From what I read, when the iommu is
> > disabled, AIUI it does no translations. That means that any dma addresses which
> > the driver mapped via the iommu prior to a crash that are stored in devices will
> > just get strobed on the bus without any translation. If those dma address do
> > not lay on top of any physical ram, won't that lead to bus errors, and
> > transaction aborts? Worse, if those dma addresses do lie on top of real
> > physical addresses, won't we get corruption in various places? Or am I missing
> > part of how that works?
>
> Hm, the device address may not be a valid host physical address, thats
> true. But the problem with the small time-window when the IOMMU hardware
> is re-programmed from the kdump kernel still exists.
> I need to think about other possible side-effects of leaving the IOMMU
> enabled on shutdown^Wboot into a kdump kernel.
>
I think its an interesting angle to consider. Thats why I was talking about
cloning the old tables in the new kdump kernel and using the error log to filter
out entries that we could safely assume were complete until enough of the iommu
page tables were free, so that we could continue to hobble along in the kdump
kernel until we got to a proper reboot. All just thought experiment of course.
I'll try tinkering with your idea above first.
Neil
> Joerg
>
>
* Neil Horman ([email protected]) wrote:
> On Thu, Apr 01, 2010 at 10:14:34PM +0200, Joerg Roedel wrote:
> > On Thu, Apr 01, 2010 at 01:11:49PM -0400, Neil Horman wrote:
> > > On Thu, Apr 01, 2010 at 05:56:43PM +0200, Joerg Roedel wrote:
> >
> > > > The possible fix will be to enable the hardware earlier in the
> > > > initialization path.
> > > >
> > > That sounds like a reasonable theory, I'll try hack something together
> > > shortly.
> >
> > Great. So the problem might be already fixed when I am back in the
> > office ;-)
> >
> Don't hold your breath, but I'll try my best :)
The problem we had before was a combo of not clearing PDE and not having
the command buffer set up properly. Now we have basically the same
problem. The attach is running too early, so the invalidate commands
are falling on deaf ears.
thanks,
-chris
Hit another kdump problem as reported by Neil Horman. When initializaing
the IOMMU, we attach devices to their domains before the IOMMU is
fully (re)initialized. Attaching a device will issue some important
invalidations. In the context of the newly kexec'd kdump kernel, the
IOMMU may have stale cached data from the original kernel. Because we
do the attach too early, the invalidation commands are placed in the new
command buffer before the IOMMU is updated w/ that buffer. This leaves
the stale entries in the kdump context and can renders device unusable.
Simply enable the IOMMU before we do the attach.
Cc: Neil Horman <[email protected]>
Cc: Vivek Goyal <[email protected]>
Signed-off-by: Chris Wright <[email protected]>
---
arch/x86/kernel/amd_iommu_init.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/arch/x86/kernel/amd_iommu_init.c
+++ b/arch/x86/kernel/amd_iommu_init.c
@@ -1288,6 +1288,8 @@ static int __init amd_iommu_init(void)
if (ret)
goto free;
+ enable_iommus();
+
if (iommu_pass_through)
ret = amd_iommu_init_passthrough();
else
@@ -1300,8 +1302,6 @@ static int __init amd_iommu_init(void)
amd_iommu_init_notifier();
- enable_iommus();
-
if (iommu_pass_through)
goto out;
To catch future potential issues we can add a warning whenever we issue
a command before the command buffer is fully initialized.
Signed-off-by: Chris Wright <[email protected]>
---
Or not...this is just if you think it's useful ;-)
arch/x86/include/asm/amd_iommu_types.h | 1 +
arch/x86/kernel/amd_iommu.c | 1 +
arch/x86/kernel/amd_iommu_init.c | 5 +++--
3 files changed, 5 insertions(+), 2 deletions(-)
--- a/arch/x86/include/asm/amd_iommu_types.h
+++ b/arch/x86/include/asm/amd_iommu_types.h
@@ -140,6 +140,7 @@
/* constants to configure the command buffer */
#define CMD_BUFFER_SIZE 8192
+#define CMD_BUFFER_UNINITIALIZED 1
#define CMD_BUFFER_ENTRIES 512
#define MMIO_CMD_SIZE_SHIFT 56
#define MMIO_CMD_SIZE_512 (0x9ULL << MMIO_CMD_SIZE_SHIFT)
--- a/arch/x86/kernel/amd_iommu.c
+++ b/arch/x86/kernel/amd_iommu.c
@@ -392,6 +392,7 @@ static int __iommu_queue_command(struct
u32 tail, head;
u8 *target;
+ WARN_ON(iommu->cmd_buf_size & CMD_BUFFER_UNINITIALIZED);;
tail = readl(iommu->mmio_base + MMIO_CMD_TAIL_OFFSET);
target = iommu->cmd_buf + tail;
memcpy_toio(target, cmd, sizeof(*cmd));
--- a/arch/x86/kernel/amd_iommu_init.c
+++ b/arch/x86/kernel/amd_iommu_init.c
@@ -436,7 +436,7 @@ static u8 * __init alloc_command_buffer(
if (cmd_buf == NULL)
return NULL;
- iommu->cmd_buf_size = CMD_BUFFER_SIZE;
+ iommu->cmd_buf_size = CMD_BUFFER_SIZE | CMD_BUFFER_UNINITIALIZED;
return cmd_buf;
}
@@ -453,6 +453,7 @@ void amd_iommu_reset_cmd_buffer(struct a
writel(0x00, iommu->mmio_base + MMIO_CMD_TAIL_OFFSET);
iommu_feature_enable(iommu, CONTROL_CMDBUF_EN);
+ iommu->cmd_buf_size &= ~(CMD_BUFFER_UNINITIALIZED);
}
/*
@@ -477,7 +478,7 @@ static void iommu_enable_command_buffer(
static void __init free_command_buffer(struct amd_iommu *iommu)
{
free_pages((unsigned long)iommu->cmd_buf,
- get_order(iommu->cmd_buf_size));
+ get_order(iommu->cmd_buf_size & ~(CMD_BUFFER_UNINITIALIZED)));
}
/* allocates the memory where the IOMMU will log its events to */
On Thu, Apr 01, 2010 at 06:23:53PM -0700, Chris Wright wrote:
> Hit another kdump problem as reported by Neil Horman. When initializaing
> the IOMMU, we attach devices to their domains before the IOMMU is
> fully (re)initialized. Attaching a device will issue some important
> invalidations. In the context of the newly kexec'd kdump kernel, the
> IOMMU may have stale cached data from the original kernel. Because we
> do the attach too early, the invalidation commands are placed in the new
> command buffer before the IOMMU is updated w/ that buffer. This leaves
> the stale entries in the kdump context and can renders device unusable.
> Simply enable the IOMMU before we do the attach.
>
> Cc: Neil Horman <[email protected]>
> Cc: Vivek Goyal <[email protected]>
> Signed-off-by: Chris Wright <[email protected]>
I'll test this out this weekend, thanks Chris!
Neil
> ---
> arch/x86/kernel/amd_iommu_init.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> --- a/arch/x86/kernel/amd_iommu_init.c
> +++ b/arch/x86/kernel/amd_iommu_init.c
> @@ -1288,6 +1288,8 @@ static int __init amd_iommu_init(void)
> if (ret)
> goto free;
>
> + enable_iommus();
> +
> if (iommu_pass_through)
> ret = amd_iommu_init_passthrough();
> else
> @@ -1300,8 +1302,6 @@ static int __init amd_iommu_init(void)
>
> amd_iommu_init_notifier();
>
> - enable_iommus();
> -
> if (iommu_pass_through)
> goto out;
>
* Neil Horman ([email protected]) wrote:
> On Thu, Apr 01, 2010 at 06:23:53PM -0700, Chris Wright wrote:
> > Hit another kdump problem as reported by Neil Horman. When initializaing
> > the IOMMU, we attach devices to their domains before the IOMMU is
> > fully (re)initialized. Attaching a device will issue some important
> > invalidations. In the context of the newly kexec'd kdump kernel, the
> > IOMMU may have stale cached data from the original kernel. Because we
> > do the attach too early, the invalidation commands are placed in the new
> > command buffer before the IOMMU is updated w/ that buffer. This leaves
> > the stale entries in the kdump context and can renders device unusable.
> > Simply enable the IOMMU before we do the attach.
> >
> > Cc: Neil Horman <[email protected]>
> > Cc: Vivek Goyal <[email protected]>
> > Signed-off-by: Chris Wright <[email protected]>
>
> I'll test this out this weekend, thanks Chris!
Great, thanks! I tested w/ both default and iommu=pt. Both worked,
didn't spot any regressions. But additional testing is very welcome.
thanks,
-chris
On Thu, Apr 01, 2010 at 06:23:53PM -0700, Chris Wright wrote:
> Hit another kdump problem as reported by Neil Horman. When initializaing
> the IOMMU, we attach devices to their domains before the IOMMU is
> fully (re)initialized. Attaching a device will issue some important
> invalidations. In the context of the newly kexec'd kdump kernel, the
> IOMMU may have stale cached data from the original kernel. Because we
> do the attach too early, the invalidation commands are placed in the new
> command buffer before the IOMMU is updated w/ that buffer. This leaves
> the stale entries in the kdump context and can renders device unusable.
> Simply enable the IOMMU before we do the attach.
>
> Cc: Neil Horman <[email protected]>
> Cc: Vivek Goyal <[email protected]>
> Signed-off-by: Chris Wright <[email protected]>
> ---
> arch/x86/kernel/amd_iommu_init.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> --- a/arch/x86/kernel/amd_iommu_init.c
> +++ b/arch/x86/kernel/amd_iommu_init.c
> @@ -1288,6 +1288,8 @@ static int __init amd_iommu_init(void)
> if (ret)
> goto free;
>
> + enable_iommus();
> +
> if (iommu_pass_through)
> ret = amd_iommu_init_passthrough();
> else
> @@ -1300,8 +1302,6 @@ static int __init amd_iommu_init(void)
>
> amd_iommu_init_notifier();
>
> - enable_iommus();
> -
> if (iommu_pass_through)
> goto out;
Ok, good to know this fixes the problem. One issue: If the
initialization of the domains fails the iommu hardware needs to be
disabled again in the free path.
Thanks,
Joerg
On Thu, Apr 01, 2010 at 06:23:53PM -0700, Chris Wright wrote:
> Hit another kdump problem as reported by Neil Horman. When initializaing
> the IOMMU, we attach devices to their domains before the IOMMU is
> fully (re)initialized. Attaching a device will issue some important
> invalidations. In the context of the newly kexec'd kdump kernel, the
> IOMMU may have stale cached data from the original kernel. Because we
> do the attach too early, the invalidation commands are placed in the new
> command buffer before the IOMMU is updated w/ that buffer. This leaves
> the stale entries in the kdump context and can renders device unusable.
> Simply enable the IOMMU before we do the attach.
>
> Cc: Neil Horman <[email protected]>
> Cc: Vivek Goyal <[email protected]>
> Signed-off-by: Chris Wright <[email protected]>
> ---
> arch/x86/kernel/amd_iommu_init.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> --- a/arch/x86/kernel/amd_iommu_init.c
> +++ b/arch/x86/kernel/amd_iommu_init.c
> @@ -1288,6 +1288,8 @@ static int __init amd_iommu_init(void)
> if (ret)
> goto free;
>
> + enable_iommus();
> +
> if (iommu_pass_through)
> ret = amd_iommu_init_passthrough();
> else
> @@ -1300,8 +1302,6 @@ static int __init amd_iommu_init(void)
>
> amd_iommu_init_notifier();
>
> - enable_iommus();
> -
Ok, so now we do enable_iommu() before we attach devices and flush DTE and
domain PDE, IO TLB. That makes sense.
Following is just for my education purposes. Trying to understand better
the impact of in-flight DMAs. So IIUC, in kudmp context following seems to be
sequence of events.
1. kernel crashes, we leave IOMMU enabled.
2. boot into capture kernel and we call enable_iommus(). This function
first disables iommu, sets up new device table and enables iommus
again.
a. So during this small window when iommu is disabled and we enable
it back, any inflight DMA will passthrough possibly to an
unintended physical address as translation is disabled and it
can corrupt the kdump kenrel.
b. Even after enabling the iommu, I guess we will continue to
use cached DTE, and translation information to handle any
in-flight DMA. The difference is that now iommus are enabled
so any in-flight DMA should go to the address as intended in
first kenrel and should not corrupt anything.
3. Once iommus are enabled again, we allocated and initilize protection
domains. We attach devices to domains. In the process we flush the
DTE, PDE and IO TLBs.
c. Looks like do_attach->set_dte_entry(), by default gives write
permission (IW) to all the devices. I am assuming that at
this point of time translation is enabled and possibly unity
mapped. If that's the case, any in-flight DMA will not be
allowed to happen at unity mapped address and this can possibly
corrupt the kernel?
I understand this patch should fix the case when in second kernel a
device is not doing DMA because of possibly cached DTE, and translation
information. But looks like in-flight DMA issues will still need a closer
look. But that is a separate issue and needs to be addressed in separate
set of patches.
Thanks
Vivek
* Vivek Goyal ([email protected]) wrote:
> Following is just for my education purposes. Trying to understand better
> the impact of in-flight DMAs. So IIUC, in kudmp context following seems to be
> sequence of events.
>
> 1. kernel crashes, we leave IOMMU enabled.
No, we actually disable the IOMMU
panic()
crash_exec()
machine_crash_shutdown()
native_machine_crash_shutdown()
x86_platform.iommu_shutdown() == amd_iommu_init.c::disable_iommus()
> 2. boot into capture kernel and we call enable_iommus(). This function
> first disables iommu, sets up new device table and enables iommus
> again.
> a. So during this small window when iommu is disabled and we enable
> it back, any inflight DMA will passthrough possibly to an
> unintended physical address as translation is disabled and it
> can corrupt the kdump kenrel.
Yes, should be possible.
> b. Even after enabling the iommu, I guess we will continue to
> use cached DTE, and translation information to handle any
> in-flight DMA. The difference is that now iommus are enabled
> so any in-flight DMA should go to the address as intended in
> first kenrel and should not corrupt anything.
Cached DTE only gets you domainID and pt root, so results depend upon
other caches too.
> 3. Once iommus are enabled again, we allocated and initilize protection
> domains. We attach devices to domains. In the process we flush the
> DTE, PDE and IO TLBs.
Yes, but this is immediately after 2b above. We can't send the invalidate
commands until the iommu is enabled.
> c. Looks like do_attach->set_dte_entry(), by default gives write
> permission (IW) to all the devices. I am assuming that at
> this point of time translation is enabled and possibly unity
> mapped.
Unlikely to have full unity map, it's typically just for a range,
and this is also typically for devices that have interaction w/ BIOS
(typical examples are usb/SMM for keyboard and integrated graphics).
And the mapping is to reserved memory areas (BIOS regions). On my test
box, for example, there are none of the BIOS specified ranges.
> If that's the case, any in-flight DMA will not be
> allowed to happen at unity mapped address and this can possibly
> corrupt the kernel?
I don't understand what you mean there.
> I understand this patch should fix the case when in second kernel a
> device is not doing DMA because of possibly cached DTE, and translation
> information. But looks like in-flight DMA issues will still need a closer
> look. But that is a separate issue and needs to be addressed in separate
> set of patches.
All of the in-flight DMA corrupts kdump kernel memory issues can exist
w/out an IOMMU. On the one hand, the IOMMU actually reduces the window
of opportunity, but on the other it adds possible translation issue.
About the only thing we can do here is modify the crashing kernel to
update the DTE to disable dma, invalidate, and leave IOMMU enabled.
This is all in the context of crashing, so any further errors here mean
we probably won't get to kdump kernel.
The kexec'd kernel will still need to defensively prepare things as it
always does since it can't trust anything that came before it.
thanks,
-chris
Chris Wright <[email protected]> writes:
> * Vivek Goyal ([email protected]) wrote:
>> Following is just for my education purposes. Trying to understand better
>> the impact of in-flight DMAs. So IIUC, in kudmp context following seems to be
>> sequence of events.
>>
>> 1. kernel crashes, we leave IOMMU enabled.
>
> No, we actually disable the IOMMU
>
> panic()
> crash_exec()
> machine_crash_shutdown()
> native_machine_crash_shutdown()
> x86_platform.iommu_shutdown() == amd_iommu_init.c::disable_iommus()
That call exists but is there any good reason for it?
Nothing should be on that can be reasonably done anywhere else.
This round of bug fixing and testing would be a good time to remove
that extraneous call.
>> 2. boot into capture kernel and we call enable_iommus(). This function
>> first disables iommu, sets up new device table and enables iommus
>> again.
>> a. So during this small window when iommu is disabled and we enable
>> it back, any inflight DMA will passthrough possibly to an
>> unintended physical address as translation is disabled and it
>> can corrupt the kdump kenrel.
>
> Yes, should be possible.
>
>> b. Even after enabling the iommu, I guess we will continue to
>> use cached DTE, and translation information to handle any
>> in-flight DMA. The difference is that now iommus are enabled
>> so any in-flight DMA should go to the address as intended in
>> first kenrel and should not corrupt anything.
>
> Cached DTE only gets you domainID and pt root, so results depend upon
> other caches too.
>
>> 3. Once iommus are enabled again, we allocated and initilize protection
>> domains. We attach devices to domains. In the process we flush the
>> DTE, PDE and IO TLBs.
>
> Yes, but this is immediately after 2b above. We can't send the invalidate
> commands until the iommu is enabled.
>
>> c. Looks like do_attach->set_dte_entry(), by default gives write
>> permission (IW) to all the devices. I am assuming that at
>> this point of time translation is enabled and possibly unity
>> mapped.
>
> Unlikely to have full unity map, it's typically just for a range,
> and this is also typically for devices that have interaction w/ BIOS
> (typical examples are usb/SMM for keyboard and integrated graphics).
> And the mapping is to reserved memory areas (BIOS regions). On my test
> box, for example, there are none of the BIOS specified ranges.
>
>> If that's the case, any in-flight DMA will not be
>> allowed to happen at unity mapped address and this can possibly
>> corrupt the kernel?
>
> I don't understand what you mean there.
>
>> I understand this patch should fix the case when in second kernel a
>> device is not doing DMA because of possibly cached DTE, and translation
>> information. But looks like in-flight DMA issues will still need a closer
>> look. But that is a separate issue and needs to be addressed in separate
>> set of patches.
>
> All of the in-flight DMA corrupts kdump kernel memory issues can exist
> w/out an IOMMU. On the one hand, the IOMMU actually reduces the window
> of opportunity, but on the other it adds possible translation issue.
?
If disabling the IOMMU changes the destination in RAM of in-flight DMA
the issues are not at all the same.
> About the only thing we can do here is modify the crashing kernel to
> update the DTE to disable dma, invalidate, and leave IOMMU enabled.
> This is all in the context of crashing, so any further errors here mean
> we probably won't get to kdump kernel.
Why even touch the IOMMU?
> The kexec'd kernel will still need to defensively prepare things as it
> always does since it can't trust anything that came before it.
Yes.
Eric
* Eric W. Biederman ([email protected]) wrote:
> Chris Wright <[email protected]> writes:
>
> > * Vivek Goyal ([email protected]) wrote:
> >> Following is just for my education purposes. Trying to understand better
> >> the impact of in-flight DMAs. So IIUC, in kudmp context following seems to be
> >> sequence of events.
> >>
> >> 1. kernel crashes, we leave IOMMU enabled.
> >
> > No, we actually disable the IOMMU
> >
> > panic()
> > crash_exec()
> > machine_crash_shutdown()
> > native_machine_crash_shutdown()
> > x86_platform.iommu_shutdown() == amd_iommu_init.c::disable_iommus()
>
> That call exists but is there any good reason for it?
It was originally added it for this very case, although I don't think it
should be there. I too am a proponent of removing it, because I think
all work should be done in the newly kexec'd kernel.
> Nothing should be on that can be reasonably done anywhere else.
> This round of bug fixing and testing would be a good time to remove
> that extraneous call.
>
> >> 2. boot into capture kernel and we call enable_iommus(). This function
> >> first disables iommu, sets up new device table and enables iommus
> >> again.
> >> a. So during this small window when iommu is disabled and we enable
> >> it back, any inflight DMA will passthrough possibly to an
> >> unintended physical address as translation is disabled and it
> >> can corrupt the kdump kenrel.
> >
> > Yes, should be possible.
> >
> >> b. Even after enabling the iommu, I guess we will continue to
> >> use cached DTE, and translation information to handle any
> >> in-flight DMA. The difference is that now iommus are enabled
> >> so any in-flight DMA should go to the address as intended in
> >> first kenrel and should not corrupt anything.
> >
> > Cached DTE only gets you domainID and pt root, so results depend upon
> > other caches too.
> >
> >> 3. Once iommus are enabled again, we allocated and initilize protection
> >> domains. We attach devices to domains. In the process we flush the
> >> DTE, PDE and IO TLBs.
> >
> > Yes, but this is immediately after 2b above. We can't send the invalidate
> > commands until the iommu is enabled.
> >
> >> c. Looks like do_attach->set_dte_entry(), by default gives write
> >> permission (IW) to all the devices. I am assuming that at
> >> this point of time translation is enabled and possibly unity
> >> mapped.
> >
> > Unlikely to have full unity map, it's typically just for a range,
> > and this is also typically for devices that have interaction w/ BIOS
> > (typical examples are usb/SMM for keyboard and integrated graphics).
> > And the mapping is to reserved memory areas (BIOS regions). On my test
> > box, for example, there are none of the BIOS specified ranges.
> >
> >> If that's the case, any in-flight DMA will not be
> >> allowed to happen at unity mapped address and this can possibly
> >> corrupt the kernel?
> >
> > I don't understand what you mean there.
> >
> >> I understand this patch should fix the case when in second kernel a
> >> device is not doing DMA because of possibly cached DTE, and translation
> >> information. But looks like in-flight DMA issues will still need a closer
> >> look. But that is a separate issue and needs to be addressed in separate
> >> set of patches.
> >
> > All of the in-flight DMA corrupts kdump kernel memory issues can exist
> > w/out an IOMMU. On the one hand, the IOMMU actually reduces the window
> > of opportunity, but on the other it adds possible translation issue.
>
> ?
> If disabling the IOMMU changes the destination in RAM of in-flight DMA
> the issues are not at all the same.
>
> > About the only thing we can do here is modify the crashing kernel to
> > update the DTE to disable dma, invalidate, and leave IOMMU enabled.
> > This is all in the context of crashing, so any further errors here mean
> > we probably won't get to kdump kernel.
>
> Why even touch the IOMMU?
I agree, and was trying to show that it's risky to do that while crashing.
Removing the IOMMU disable may still allow some DMA to succeed.
To actually close that off, you have to force all DMA transactions
to fail. Personally, I'm skeptical that the tradeoff is worth it.
thanks,
-chris
* Joerg Roedel ([email protected]) wrote:
> Ok, good to know this fixes the problem. One issue: If the
> initialization of the domains fails the iommu hardware needs to be
> disabled again in the free path.
Sure, thanks for catching. Will resend shortly
thanks,
-chris
On Fri, Apr 02, 2010 at 11:59:32AM -0400, Vivek Goyal wrote:
> 1. kernel crashes, we leave IOMMU enabled.
True for everything except gart and amd iommu.
> a. So during this small window when iommu is disabled and we enable
> it back, any inflight DMA will passthrough possibly to an
> unintended physical address as translation is disabled and it
> can corrupt the kdump kenrel.
Right.
> b. Even after enabling the iommu, I guess we will continue to
> use cached DTE, and translation information to handle any
> in-flight DMA. The difference is that now iommus are enabled
> so any in-flight DMA should go to the address as intended in
> first kenrel and should not corrupt anything.
Right.
>
> 3. Once iommus are enabled again, we allocated and initilize protection
> domains. We attach devices to domains. In the process we flush the
> DTE, PDE and IO TLBs.
>
> c. Looks like do_attach->set_dte_entry(), by default gives write
> permission (IW) to all the devices. I am assuming that at
> this point of time translation is enabled and possibly unity
> mapped.
No, The IW bit in the DTE must be set because all write permission bits
(DTE and page tabled) are ANDed to determine if a device can write to a
particular address. So as long as the paging mode is unequal to zero the
hardware will walk the page-table first to find out if the device has
write permission. With paging mode == 0 your statement about read-write
unity-mapping is true. This is used for a pass-through domain (iommu=pt)
btw.
Joerg
On Sat, Apr 03, 2010 at 07:38:36PM +0200, Joerg Roedel wrote:
> On Fri, Apr 02, 2010 at 11:59:32AM -0400, Vivek Goyal wrote:
> > 1. kernel crashes, we leave IOMMU enabled.
>
> True for everything except gart and amd iommu.
>
> > a. So during this small window when iommu is disabled and we enable
> > it back, any inflight DMA will passthrough possibly to an
> > unintended physical address as translation is disabled and it
> > can corrupt the kdump kenrel.
>
> Right.
>
> > b. Even after enabling the iommu, I guess we will continue to
> > use cached DTE, and translation information to handle any
> > in-flight DMA. The difference is that now iommus are enabled
> > so any in-flight DMA should go to the address as intended in
> > first kenrel and should not corrupt anything.
>
> Right.
>
> >
> > 3. Once iommus are enabled again, we allocated and initilize protection
> > domains. We attach devices to domains. In the process we flush the
> > DTE, PDE and IO TLBs.
> >
> > c. Looks like do_attach->set_dte_entry(), by default gives write
> > permission (IW) to all the devices. I am assuming that at
> > this point of time translation is enabled and possibly unity
> > mapped.
>
> No, The IW bit in the DTE must be set because all write permission bits
> (DTE and page tabled) are ANDed to determine if a device can write to a
> particular address. So as long as the paging mode is unequal to zero the
> hardware will walk the page-table first to find out if the device has
> write permission.
And by default valid PTEs are not present (except for some unity mappings
as specified by ACPI tables), so we will end the transaction with
IO_PAGE_FAULT? I am assuming that we will not set unity mappings for
kernel reserved area and so either an in-flight DMA will not be allowed
and IO_PAGE_FAULT will be logged or it will be allowed to some unity
mapping which is not mapped to kdump kernel area hence no corruption of
capture kernel?
> With paging mode == 0 your statement about read-write
> unity-mapping is true. This is used for a pass-through domain (iommu=pt)
> btw.
Ok, so in case of pass through, I think one just needs to make sure that
don't use iommu=pt in second kernel if one did not use iommu=pt in first kernel.
Otherwise you can redirect the the in-flight DMAs in second kernel to an
entirely unintended physical memory.
So following seems to be the summary.
- Don't disable AMD IOMMU after crash in machine_crash_shutdown(), because
disabling it can direct in-flight DMAs to unintended physical meory
areas and can corrupt other data structures.
- Once the iommu is enabled in second kernel, most likely in-flight DMAs
will end with IO_PAGE_FAULT (iommu!=pt). Only selective unity mapping
areas will be setup based on ACPI tables and these should be BIOS region
and should not overlap with kdump reserved memory. iommu=pt should also
be safe if iommu=pt was used in first kernel also.
- Only small window where in-flight DMA can corrupt things is when we
are initializing iommu in second kernel. (We first disable iommu and then
enable it back). During this small period translation will be disabled and
some IO can go to unintended address. And there does not seem to be any easy
way to plug this hole.
Have I got it right?
Thanks
Vivek
On Mon, Apr 05, 2010 at 10:17:50AM -0400, Vivek Goyal wrote:
> And by default valid PTEs are not present (except for some unity mappings
> as specified by ACPI tables), so we will end the transaction with
> IO_PAGE_FAULT? I am assuming that we will not set unity mappings for
> kernel reserved area and so either an in-flight DMA will not be allowed
> and IO_PAGE_FAULT will be logged or it will be allowed to some unity
> mapping which is not mapped to kdump kernel area hence no corruption of
> capture kernel?
Right. The unity-mappings are typically used for devices that are
controled by the BIOS and define memory regions owned by the BIOS. So
Linux will not use the unity mapped regions anyway, not in the first
kernel and not in the kdump kernel.
> > With paging mode == 0 your statement about read-write
> > unity-mapping is true. This is used for a pass-through domain (iommu=pt)
> > btw.
>
> Ok, so in case of pass through, I think one just needs to make sure that
> don't use iommu=pt in second kernel if one did not use iommu=pt in first kernel.
> Otherwise you can redirect the the in-flight DMAs in second kernel to an
> entirely unintended physical memory.
The kdump kernel should use the same setting as the plain kernel.
> So following seems to be the summary.
>
> - Don't disable AMD IOMMU after crash in machine_crash_shutdown(), because
> disabling it can direct in-flight DMAs to unintended physical meory
> areas and can corrupt other data structures.
Right, that really seems to be the best solution.
> - Once the iommu is enabled in second kernel, most likely in-flight DMAs
> will end with IO_PAGE_FAULT (iommu!=pt). Only selective unity mapping
> areas will be setup based on ACPI tables and these should be BIOS region
> and should not overlap with kdump reserved memory. iommu=pt should also
> be safe if iommu=pt was used in first kernel also.
Right. With Chris' patches the DTE entries of newly attached domains are
flushed at IOMMU initialization in the kdump kernel. So the new data
structures are in place and used by the hardware.
> - Only small window where in-flight DMA can corrupt things is when we
> are initializing iommu in second kernel. (We first disable iommu and then
> enable it back). During this small period translation will be disabled and
> some IO can go to unintended address. And there does not seem to be any easy
> way to plug this hole.
Right.
> Have I got it right?
Yes :-)
Joerg
On Thu, Apr 01, 2010 at 06:23:53PM -0700, Chris Wright wrote:
> Hit another kdump problem as reported by Neil Horman. When initializaing
> the IOMMU, we attach devices to their domains before the IOMMU is
> fully (re)initialized. Attaching a device will issue some important
> invalidations. In the context of the newly kexec'd kdump kernel, the
> IOMMU may have stale cached data from the original kernel. Because we
> do the attach too early, the invalidation commands are placed in the new
> command buffer before the IOMMU is updated w/ that buffer. This leaves
> the stale entries in the kdump context and can renders device unusable.
> Simply enable the IOMMU before we do the attach.
>
> Cc: Neil Horman <[email protected]>
> Cc: Vivek Goyal <[email protected]>
> Signed-off-by: Chris Wright <[email protected]>
> ---
> arch/x86/kernel/amd_iommu_init.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> --- a/arch/x86/kernel/amd_iommu_init.c
> +++ b/arch/x86/kernel/amd_iommu_init.c
> @@ -1288,6 +1288,8 @@ static int __init amd_iommu_init(void)
> if (ret)
> goto free;
>
> + enable_iommus();
> +
> if (iommu_pass_through)
> ret = amd_iommu_init_passthrough();
> else
> @@ -1300,8 +1302,6 @@ static int __init amd_iommu_init(void)
>
> amd_iommu_init_notifier();
>
> - enable_iommus();
> -
> if (iommu_pass_through)
> goto out;
>
>
FWIW, this patch series fixed the kdump issues I was having on my AMD system
here. Thanks Chris
Acked-by: Neil Horman <[email protected]>