From: Borislav Petkov <[email protected]>
Hi,
I've already sent you the bigger part last week, thinking that the merge
window is getting close but it looks like we'll have an rc8 and then
some. Nice.
Anyway, here's Joerg's stuff which has been running in our SLES kernels
without an issue for a while now.
Joerg Roedel (3):
swiotlb: Warn on allocation failure in swiotlb_alloc_coherent()
x86/swiotlb: Try coherent allocations with __GFP_NOWARN
x86/crash: Allocate enough low memory when crashkernel=high
arch/x86/kernel/pci-swiotlb.c | 7 +++++++
arch/x86/kernel/setup.c | 13 ++++++++-----
lib/swiotlb.c | 11 +++++++++--
3 files changed, 24 insertions(+), 7 deletions(-)
--
2.3.5
From: Joerg Roedel <[email protected]>
Print a warning when all allocation tries have been failed
and the function is about to return NULL. This prepares for
calling the function with __GFP_NOWARN to suppress
allocation failure warnings before all fall-backs have
failed.
Signed-off-by: Joerg Roedel <[email protected]>
Acked-by: Konrad Rzeszutek Wilk <[email protected]>
Acked-by: Baoquan He <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Dave Young <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jörg Rödel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: [email protected]
Cc: x86-ml <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Borislav Petkov <[email protected]>
---
lib/swiotlb.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 4abda074ea45..e0e921218f0b 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -655,7 +655,7 @@ swiotlb_alloc_coherent(struct device *hwdev, size_t size,
*/
phys_addr_t paddr = map_single(hwdev, 0, size, DMA_FROM_DEVICE);
if (paddr == SWIOTLB_MAP_ERROR)
- return NULL;
+ goto err_warn;
ret = phys_to_virt(paddr);
dev_addr = phys_to_dma(hwdev, paddr);
@@ -669,7 +669,7 @@ swiotlb_alloc_coherent(struct device *hwdev, size_t size,
/* DMA_TO_DEVICE to avoid memcpy in unmap_single */
swiotlb_tbl_unmap_single(hwdev, paddr,
size, DMA_TO_DEVICE);
- return NULL;
+ goto err_warn;
}
}
@@ -677,6 +677,13 @@ swiotlb_alloc_coherent(struct device *hwdev, size_t size,
memset(ret, 0, size);
return ret;
+
+err_warn:
+ pr_warn("swiotlb: coherent allocation failed for device %s size=%zu\n",
+ dev_name(hwdev), size);
+ dump_stack();
+
+ return NULL;
}
EXPORT_SYMBOL(swiotlb_alloc_coherent);
--
2.3.5
From: Joerg Roedel <[email protected]>
When we boot a kdump kernel in high memory, there is by
default only 72MB of low memory available. The swiotlb code
takes 64MB of it (by default) so that there are only 8MB
left to allocate from. On systems with many devices this
causes page allocator warnings from dma_generic_alloc_coherent():
systemd-udevd: page allocation failure: order:0, mode:0x280d4
CPU: 0 PID: 197 Comm: systemd-udevd Tainted: G W 3.12.28-4-default #1
Hardware name: HP ProLiant DL980 G7, BIOS P66 07/30/2012
ffff8800781335e0 ffffffff8150b1db 00000000000280d4 ffffffff8113af90
0000000000000000 0000000000000000 ffff88007efdbb00 0000000100000000
0000000000000000 0000000000000000 0000000000000000 0000000000000001
Call Trace:
dump_trace+0x7d/0x2d0
show_stack_log_lvl+0x94/0x170
show_stack+0x21/0x50
dump_stack+0x41/0x51
warn_alloc_failed+0xf0/0x160
__alloc_pages_slowpath+0x72f/0x796
__alloc_pages_nodemask+0x1ea/0x210
dma_generic_alloc_coherent+0x96/0x140
x86_swiotlb_alloc_coherent+0x1c/0x50
ttm_dma_pool_alloc_new_pages+0xab/0x320 [ttm]
ttm_dma_populate+0x3ce/0x640 [ttm]
ttm_tt_bind+0x36/0x60 [ttm]
ttm_bo_handle_move_mem+0x55f/0x5c0 [ttm]
ttm_bo_move_buffer+0x105/0x130 [ttm]
ttm_bo_validate+0xc1/0x130 [ttm]
ttm_bo_init+0x24b/0x400 [ttm]
radeon_bo_create+0x16c/0x200 [radeon]
radeon_ring_init+0x11e/0x2b0 [radeon]
r100_cp_init+0x123/0x5b0 [radeon]
r100_startup+0x194/0x230 [radeon]
r100_init+0x223/0x410 [radeon]
radeon_device_init+0x6af/0x830 [radeon]
radeon_driver_load_kms+0x89/0x180 [radeon]
drm_get_pci_dev+0x121/0x2f0 [drm]
local_pci_probe+0x39/0x60
pci_device_probe+0xa9/0x120
driver_probe_device+0x9d/0x3d0
__driver_attach+0x8b/0x90
bus_for_each_dev+0x5b/0x90
bus_add_driver+0x1f8/0x2c0
driver_register+0x5b/0xe0
do_one_initcall+0xf2/0x1a0
load_module+0x1207/0x1c70
SYSC_finit_module+0x75/0xa0
system_call_fastpath+0x16/0x1b
0x7fac533d2788
After these warnings the code enters a fallback path and
allocated directly from the swiotlb aperture in the end.
So remove these warnings as this is not a fatal error.
Signed-off-by: Joerg Roedel <[email protected]>
Acked-by: Baoquan He <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Dave Young <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jörg Rödel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: [email protected]
Cc: x86-ml <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Simplify, reflow comment. ]
Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/kernel/pci-swiotlb.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index 77dd0ad58be4..adf0392d549a 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -20,6 +20,13 @@ void *x86_swiotlb_alloc_coherent(struct device *hwdev, size_t size,
{
void *vaddr;
+ /*
+ * Don't print a warning when the first allocation attempt fails.
+ * swiotlb_alloc_coherent() will print a warning when the DMA
+ * memory allocation ultimately failed.
+ */
+ flags |= __GFP_NOWARN;
+
vaddr = dma_generic_alloc_coherent(hwdev, size, dma_handle, flags,
attrs);
if (vaddr)
--
2.3.5
From: Joerg Roedel <[email protected]>
When the crash kernel is loaded above 4GiB in memory, the
first kernel allocates only 72MiB of low-memory for the DMA
requirements of the second kernel. On systems with many
devices this is not enough and causes device driver
initialization errors and failed crash dumps. Testing by
SUSE and Redhat has shown that 256MiB is a good default
value for now and the discussion has lead to this value as
well. So set this default value to 256MiB to make sure there
is enough memory available for DMA.
Signed-off-by: Joerg Roedel <[email protected]>
Acked-by: Baoquan He <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Dave Young <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jörg Rödel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: [email protected]
Cc: x86-ml <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Reflow comment. ]
Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/kernel/setup.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d74ac33290ae..5a697e56634c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -531,12 +531,15 @@ static void __init reserve_crashkernel_low(void)
if (ret != 0) {
/*
* two parts from lib/swiotlb.c:
- * swiotlb size: user specified with swiotlb= or default.
- * swiotlb overflow buffer: now is hardcoded to 32k.
- * We round it to 8M for other buffers that
- * may need to stay low too.
+ * - swiotlb size: user-specified with swiotlb= or default.
+ *
+ * -swiotlb overflow buffer: now hardcoded to 32k.
+ *
+ * We round it to 8M for other buffers that may need to stay low
+ * too. Also make sure we allocate enough extra low memory so
+ * that we don't run out of DMA buffers for 32-bit devices.
*/
- low_size = swiotlb_size_or_default() + (8UL<<20);
+ low_size = max(swiotlb_size_or_default() + (8UL<<20), 256UL<<20);
auto_set = true;
} else {
/* passed with crashkernel=0,low ? */
--
2.3.5
On Mon, Jun 8, 2015 at 12:49 AM, Borislav Petkov <[email protected]> wrote:
> From: Joerg Roedel <[email protected]>
>
> When the crash kernel is loaded above 4GiB in memory, the
> first kernel allocates only 72MiB of low-memory for the DMA
> requirements of the second kernel. On systems with many
> devices this is not enough and causes device driver
> initialization errors and failed crash dumps. Testing by
> SUSE and Redhat has shown that 256MiB is a good default
> value for now and the discussion has lead to this value as
> well. So set this default value to 256MiB to make sure there
> is enough memory available for DMA.
Is the 256MiB too big?
BTW, the affected system does not support iommu?
on my intel test box with 6T and 16 pcie cards does not hit the 72MiB
limit.
So the affected system is 12T and more cards?
>
> Signed-off-by: Joerg Roedel <[email protected]>
> Acked-by: Baoquan He <[email protected]>
> Cc: Baoquan He <[email protected]>
> Cc: Dave Young <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Jörg Rödel <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Vivek Goyal <[email protected]>
> Cc: [email protected]
> Cc: x86-ml <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> [ Reflow comment. ]
> Signed-off-by: Borislav Petkov <[email protected]>
> ---
> arch/x86/kernel/setup.c | 13 ++++++++-----
> 1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index d74ac33290ae..5a697e56634c 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -531,12 +531,15 @@ static void __init reserve_crashkernel_low(void)
> if (ret != 0) {
> /*
> * two parts from lib/swiotlb.c:
> - * swiotlb size: user specified with swiotlb= or default.
> - * swiotlb overflow buffer: now is hardcoded to 32k.
> - * We round it to 8M for other buffers that
> - * may need to stay low too.
> + * - swiotlb size: user-specified with swiotlb= or default.
> + *
> + * -swiotlb overflow buffer: now hardcoded to 32k.
why do you need to treat them differently ?
"- swiotlb" and "-swiotlb"
> + *
Why do you need to put extra blank line here? that 8M round up
is for swiotlb overflow buffer.
> + * We round it to 8M for other buffers that may need to stay low
> + * too. Also make sure we allocate enough extra low memory so
> + * that we don't run out of DMA buffers for 32-bit devices.
> */
> - low_size = swiotlb_size_or_default() + (8UL<<20);
> + low_size = max(swiotlb_size_or_default() + (8UL<<20), 256UL<<20);
> auto_set = true;
> } else {
> /* passed with crashkernel=0,low ? */
> --
On 06/08/15 at 05:20pm, Yinghai Lu wrote:
> On Mon, Jun 8, 2015 at 12:49 AM, Borislav Petkov <[email protected]> wrote:
> > From: Joerg Roedel <[email protected]>
> >
> > When the crash kernel is loaded above 4GiB in memory, the
> > first kernel allocates only 72MiB of low-memory for the DMA
> > requirements of the second kernel. On systems with many
> > devices this is not enough and causes device driver
> > initialization errors and failed crash dumps. Testing by
> > SUSE and Redhat has shown that 256MiB is a good default
> > value for now and the discussion has lead to this value as
> > well. So set this default value to 256MiB to make sure there
> > is enough memory available for DMA.
Hi Yinghai,
>
> Is the 256MiB too big?
>
> BTW, the affected system does not support iommu?
I guess the affected system should not support hw iommu or turned off hw
iommu since hw iommu always screw up kdump.
>
> on my intel test box with 6T and 16 pcie cards does not hit the 72MiB
> limit.
In fact in this case, it doesn't matter how much memory the system has
since most of them is above 4G and only 72M is reserved for dma/swiotlb
in kdump kernel. And it doesn't matter much how many pci/pcie cards the
system has since it matters how greedy those devices expects dma
allocation.
Before Joerg posted this patchset v1, our customer complained about
this too. That means this is not individual case. I think it makes sense
to increase the default low-memory to make kdump succeed, it's
definitely better than letting user get a failed kdump and then know
they need specify more low-memory manually. If user clearly know their
system doesn't need so much low-memory, specifying crashkernel=xx,low
works for them.
>
> So the affected system is 12T and more cards?
In the case of my bug, it's a system with 12T and many cards.
Thanks
Baoquan
>
> >
> > Signed-off-by: Joerg Roedel <[email protected]>
> > Acked-by: Baoquan He <[email protected]>
> > Cc: Baoquan He <[email protected]>
> > Cc: Dave Young <[email protected]>
> > Cc: "H. Peter Anvin" <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: J?rg R?del <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Vivek Goyal <[email protected]>
> > Cc: [email protected]
> > Cc: x86-ml <[email protected]>
> > Link: http://lkml.kernel.org/r/[email protected]
> > [ Reflow comment. ]
> > Signed-off-by: Borislav Petkov <[email protected]>
> > ---
> > arch/x86/kernel/setup.c | 13 ++++++++-----
> > 1 file changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index d74ac33290ae..5a697e56634c 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -531,12 +531,15 @@ static void __init reserve_crashkernel_low(void)
> > if (ret != 0) {
> > /*
> > * two parts from lib/swiotlb.c:
> > - * swiotlb size: user specified with swiotlb= or default.
> > - * swiotlb overflow buffer: now is hardcoded to 32k.
> > - * We round it to 8M for other buffers that
> > - * may need to stay low too.
> > + * - swiotlb size: user-specified with swiotlb= or default.
> > + *
> > + * -swiotlb overflow buffer: now hardcoded to 32k.
>
> why do you need to treat them differently ?
> "- swiotlb" and "-swiotlb"
>
> > + *
>
> Why do you need to put extra blank line here? that 8M round up
> is for swiotlb overflow buffer.
>
> > + * We round it to 8M for other buffers that may need to stay low
> > + * too. Also make sure we allocate enough extra low memory so
> > + * that we don't run out of DMA buffers for 32-bit devices.
> > */
> > - low_size = swiotlb_size_or_default() + (8UL<<20);
> > + low_size = max(swiotlb_size_or_default() + (8UL<<20), 256UL<<20);
> > auto_set = true;
> > } else {
> > /* passed with crashkernel=0,low ? */
> > --
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
On Wed, Jun 10, 2015 at 12:04 AM, Baoquan He <[email protected]> wrote:
>
> In fact in this case, it doesn't matter how much memory the system has
> since most of them is above 4G and only 72M is reserved for dma/swiotlb
> in kdump kernel. And it doesn't matter much how many pci/pcie cards the
> system has since it matters how greedy those devices expects dma
> allocation.
>
> Before Joerg posted this patchset v1, our customer complained about
> this too. That means this is not individual case. I think it makes sense
> to increase the default low-memory to make kdump succeed, it's
> definitely better than letting user get a failed kdump and then know
> they need specify more low-memory manually. If user clearly know their
> system doesn't need so much low-memory, specifying crashkernel=xx,low
> works for them.
Now intel 4socket or 8socket system mostly have [2G,4G) for mmio low.
also I saw one system had [1.5G, 4G) for mmio.
So under 4G low ram is tight, with changing default to 256M is just lazy
and punishing the other system that does not need that.
those systems do not need to have "crashkernel=xx,low" before.
should have other smart way to detect and update the default value.
even dmi blacklist way is better than just changing default value.
Thanks
Yinghai
On 06/10/15 at 10:48am, Yinghai Lu wrote:
> On Wed, Jun 10, 2015 at 12:04 AM, Baoquan He <[email protected]> wrote:
> >
> > In fact in this case, it doesn't matter how much memory the system has
> > since most of them is above 4G and only 72M is reserved for dma/swiotlb
> > in kdump kernel. And it doesn't matter much how many pci/pcie cards the
> > system has since it matters how greedy those devices expects dma
> > allocation.
> >
> > Before Joerg posted this patchset v1, our customer complained about
> > this too. That means this is not individual case. I think it makes sense
> > to increase the default low-memory to make kdump succeed, it's
> > definitely better than letting user get a failed kdump and then know
> > they need specify more low-memory manually. If user clearly know their
> > system doesn't need so much low-memory, specifying crashkernel=xx,low
> > works for them.
>
> Now intel 4socket or 8socket system mostly have [2G,4G) for mmio low.
> also I saw one system had [1.5G, 4G) for mmio.
Really? How does it work supporting so many pcie devices but with
so little DMA32 space? Do they support dma above 4G, or hw iommu is
default? Surely for both tight low-memory is not a worry.
>
> So under 4G low ram is tight, with changing default to 256M is just lazy
> and punishing the other system that does not need that.
> those systems do not need to have "crashkernel=xx,low" before.
OK. Seems defaulting to whom is controversial. Defaulting to 72M kdump
on some systems will die. Defaulting to 256M some special systems will
be tight on dma/dma32 memory, does their kdump die? If not I vote for
256M. 256M is from Joerg's experimental data, I remember he said his
machine need around 200M, then why not 256M. Otherwise it need be
increased sooner or later.
>
> should have other smart way to detect and update the default value.
> even dmi blacklist way is better than just changing default value.
I am not sure about this. All systems expecting more low-memory need be
checked out?
>
> Thanks
>
> Yinghai
On Wed, Jun 10, 2015 at 10:48:14AM -0700, Yinghai Lu wrote:
> So under 4G low ram is tight, with changing default to 256M is just lazy
> and punishing the other system that does not need that.
> those systems do not need to have "crashkernel=xx,low" before.
The definition of 'default' is that it should be a good value to work
(ideally) everywhere. The previous value did not fulfill this
definition, as on systems without (an enabled) hardware IOMMU and some
32bit-DMA-only devices the crashkernel quickly ran out of low-mem with
it.
And the new default value is not just one that we guessed as being a
better one. Both Baoquan and me did testing on affected systems and
after a discussion we agreed on it.
> should have other smart way to detect and update the default value.
> even dmi blacklist way is better than just changing default value.
DMI information is not enough, it is about the availability of a
hardware IOMMU and the amount and kind of devices needing low-mem for
DMA. This means introducing heuristics, which is often the worst
solution, because heuristics are complex and if we fix it for one
machine we might break many others. So changing the default to a more
workable value is the best solution imho.
But if you have a better idea I am happy to review your patches.
Joerg