The patch 74d332c13b21 changes alloc_netdev_mqs to use vzalloc if kzalloc
fails (later patches change it to kvzalloc).
The problem with this is that if the vzalloc function is actually used,
virtio_net doesn't work (because it expects that the extra memory should
be accessible with DMA-API and memory allocated with vzalloc isn't).
This patch changes it back to kzalloc and adds a warning if the allocated
size is too large (the allocation is unreliable in this case).
Signed-off-by: Mikulas Patocka <[email protected]>
Fixes: 74d332c13b21 ("net: extend net_device allocation to vmalloc()")
---
net/core/dev.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
Index: linux-2.6/net/core/dev.c
===================================================================
--- linux-2.6.orig/net/core/dev.c 2018-04-16 21:08:36.000000000 +0200
+++ linux-2.6/net/core/dev.c 2018-04-18 16:24:43.000000000 +0200
@@ -8366,7 +8366,8 @@ struct net_device *alloc_netdev_mqs(int
/* ensure 32-byte alignment of whole construct */
alloc_size += NETDEV_ALIGN - 1;
- p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+ WARN_ON(alloc_size > PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER);
+ p = kzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
if (!p)
return NULL;
On 04/18/2018 07:34 AM, Mikulas Patocka wrote:
> The patch 74d332c13b21 changes alloc_netdev_mqs to use vzalloc if kzalloc
> fails (later patches change it to kvzalloc).
>
> The problem with this is that if the vzalloc function is actually used,
> virtio_net doesn't work (because it expects that the extra memory should
> be accessible with DMA-API and memory allocated with vzalloc isn't).
>
> This patch changes it back to kzalloc and adds a warning if the allocated
> size is too large (the allocation is unreliable in this case).
>
> Signed-off-by: Mikulas Patocka <[email protected]>
> Fixes: 74d332c13b21 ("net: extend net_device allocation to vmalloc()")
>
> ---
> net/core/dev.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/net/core/dev.c
> ===================================================================
> --- linux-2.6.orig/net/core/dev.c 2018-04-16 21:08:36.000000000 +0200
> +++ linux-2.6/net/core/dev.c 2018-04-18 16:24:43.000000000 +0200
> @@ -8366,7 +8366,8 @@ struct net_device *alloc_netdev_mqs(int
> /* ensure 32-byte alignment of whole construct */
> alloc_size += NETDEV_ALIGN - 1;
>
> - p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> + WARN_ON(alloc_size > PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER);
> + p = kzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> if (!p)
> return NULL;
>
>
Since when a net_device needs to be in DMA zone ???
I would rather fix virtio_net, this looks very suspect to me.
Each virtio_net should probably allocate the exact amount of DMA-memory it wants,
instead of expecting core networking stack to have a huge chunk of DMA-memory for everything.
On Wed, 18 Apr 2018, Eric Dumazet wrote:
>
>
> On 04/18/2018 07:34 AM, Mikulas Patocka wrote:
> > The patch 74d332c13b21 changes alloc_netdev_mqs to use vzalloc if kzalloc
> > fails (later patches change it to kvzalloc).
> >
> > The problem with this is that if the vzalloc function is actually used,
> > virtio_net doesn't work (because it expects that the extra memory should
> > be accessible with DMA-API and memory allocated with vzalloc isn't).
> >
> > This patch changes it back to kzalloc and adds a warning if the allocated
> > size is too large (the allocation is unreliable in this case).
> >
> > Signed-off-by: Mikulas Patocka <[email protected]>
> > Fixes: 74d332c13b21 ("net: extend net_device allocation to vmalloc()")
> >
> > ---
> > net/core/dev.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > Index: linux-2.6/net/core/dev.c
> > ===================================================================
> > --- linux-2.6.orig/net/core/dev.c 2018-04-16 21:08:36.000000000 +0200
> > +++ linux-2.6/net/core/dev.c 2018-04-18 16:24:43.000000000 +0200
> > @@ -8366,7 +8366,8 @@ struct net_device *alloc_netdev_mqs(int
> > /* ensure 32-byte alignment of whole construct */
> > alloc_size += NETDEV_ALIGN - 1;
> >
> > - p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> > + WARN_ON(alloc_size > PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER);
> > + p = kzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> > if (!p)
> > return NULL;
> >
> >
>
> Since when a net_device needs to be in DMA zone ???
>
> I would rather fix virtio_net, this looks very suspect to me.
>
> Each virtio_net should probably allocate the exact amount of DMA-memory it wants,
> instead of expecting core networking stack to have a huge chunk of DMA-memory for everything.
The structure net_device is followed by arbitrary driver-specific data
(accessible with the function netdev_priv). And for virtio-net, these
driver-specific data must be in DMA memory.
Mikulas
On 04/18/2018 09:44 AM, Mikulas Patocka wrote:
>
>
> On Wed, 18 Apr 2018, Eric Dumazet wrote:
>
>>
>>
>> On 04/18/2018 07:34 AM, Mikulas Patocka wrote:
>>> The patch 74d332c13b21 changes alloc_netdev_mqs to use vzalloc if kzalloc
>>> fails (later patches change it to kvzalloc).
>>>
>>> The problem with this is that if the vzalloc function is actually used,
>>> virtio_net doesn't work (because it expects that the extra memory should
>>> be accessible with DMA-API and memory allocated with vzalloc isn't).
>>>
>>> This patch changes it back to kzalloc and adds a warning if the allocated
>>> size is too large (the allocation is unreliable in this case).
>>>
>>> Signed-off-by: Mikulas Patocka <[email protected]>
>>> Fixes: 74d332c13b21 ("net: extend net_device allocation to vmalloc()")
>>>
>>> ---
>>> net/core/dev.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> Index: linux-2.6/net/core/dev.c
>>> ===================================================================
>>> --- linux-2.6.orig/net/core/dev.c 2018-04-16 21:08:36.000000000 +0200
>>> +++ linux-2.6/net/core/dev.c 2018-04-18 16:24:43.000000000 +0200
>>> @@ -8366,7 +8366,8 @@ struct net_device *alloc_netdev_mqs(int
>>> /* ensure 32-byte alignment of whole construct */
>>> alloc_size += NETDEV_ALIGN - 1;
>>>
>>> - p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
>>> + WARN_ON(alloc_size > PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER);
>>> + p = kzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
>>> if (!p)
>>> return NULL;
>>>
>>>
>>
>> Since when a net_device needs to be in DMA zone ???
>>
>> I would rather fix virtio_net, this looks very suspect to me.
>>
>> Each virtio_net should probably allocate the exact amount of DMA-memory it wants,
>> instead of expecting core networking stack to have a huge chunk of DMA-memory for everything.
>
> The structure net_device is followed by arbitrary driver-specific data
> (accessible with the function netdev_priv). And for virtio-net, these
> driver-specific data must be in DMA memory.
I get that, but how is the original xenvif problem will be solved ?
Your patch would add a bug in some other driver(s)
I suggest that virtio_net clearly identifies which part needs a specific allocation
and does its itself, instead of abusing the netdev_priv storage.
Ie use a pointer to a block of memory, allocated by virtio_net, for virtio_net.
From: Eric Dumazet <[email protected]>
Date: Wed, 18 Apr 2018 09:05:54 -0700
> Each virtio_net should probably allocate the exact amount of
> DMA-memory it wants, instead of expecting core networking stack to
> have a huge chunk of DMA-memory for everything.
Yes, if you need DMA'able memory, allocate DMA'able memory separately
and hang it off of the netdev instead of assuming you can just DMA
in/out of it.
From: Mikulas Patocka <[email protected]>
Date: Wed, 18 Apr 2018 12:44:25 -0400 (EDT)
> The structure net_device is followed by arbitrary driver-specific data
> (accessible with the function netdev_priv). And for virtio-net, these
> driver-specific data must be in DMA memory.
And we are saying that this assumption is wrong and needs to be
corrected.
From: Eric Dumazet <[email protected]>
Date: Wed, 18 Apr 2018 09:51:25 -0700
> I suggest that virtio_net clearly identifies which part needs a specific allocation
> and does its itself, instead of abusing the netdev_priv storage.
>
> Ie use a pointer to a block of memory, allocated by virtio_net, for virtio_net.
+1
On Wed, 18 Apr 2018, Eric Dumazet wrote:
>
>
> On 04/18/2018 09:44 AM, Mikulas Patocka wrote:
> >
> >
> > On Wed, 18 Apr 2018, Eric Dumazet wrote:
> >
> >>
> >>
> >> On 04/18/2018 07:34 AM, Mikulas Patocka wrote:
> >>> The patch 74d332c13b21 changes alloc_netdev_mqs to use vzalloc if kzalloc
> >>> fails (later patches change it to kvzalloc).
> >>>
> >>> The problem with this is that if the vzalloc function is actually used,
> >>> virtio_net doesn't work (because it expects that the extra memory should
> >>> be accessible with DMA-API and memory allocated with vzalloc isn't).
> >>>
> >>> This patch changes it back to kzalloc and adds a warning if the allocated
> >>> size is too large (the allocation is unreliable in this case).
> >>>
> >>> Signed-off-by: Mikulas Patocka <[email protected]>
> >>> Fixes: 74d332c13b21 ("net: extend net_device allocation to vmalloc()")
> >>>
> >>> ---
> >>> net/core/dev.c | 3 ++-
> >>> 1 file changed, 2 insertions(+), 1 deletion(-)
> >>>
> >>> Index: linux-2.6/net/core/dev.c
> >>> ===================================================================
> >>> --- linux-2.6.orig/net/core/dev.c 2018-04-16 21:08:36.000000000 +0200
> >>> +++ linux-2.6/net/core/dev.c 2018-04-18 16:24:43.000000000 +0200
> >>> @@ -8366,7 +8366,8 @@ struct net_device *alloc_netdev_mqs(int
> >>> /* ensure 32-byte alignment of whole construct */
> >>> alloc_size += NETDEV_ALIGN - 1;
> >>>
> >>> - p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> >>> + WARN_ON(alloc_size > PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER);
> >>> + p = kzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> >>> if (!p)
> >>> return NULL;
> >>>
> >>>
> >>
> >> Since when a net_device needs to be in DMA zone ???
> >>
> >> I would rather fix virtio_net, this looks very suspect to me.
> >>
> >> Each virtio_net should probably allocate the exact amount of DMA-memory it wants,
> >> instead of expecting core networking stack to have a huge chunk of DMA-memory for everything.
> >
> > The structure net_device is followed by arbitrary driver-specific data
> > (accessible with the function netdev_priv). And for virtio-net, these
> > driver-specific data must be in DMA memory.
>
> I get that, but how is the original xenvif problem will be solved ?
>
> Your patch would add a bug in some other driver(s)
>
> I suggest that virtio_net clearly identifies which part needs a specific allocation
> and does its itself, instead of abusing the netdev_priv storage.
>
> Ie use a pointer to a block of memory, allocated by virtio_net, for virtio_net.
There are drivers that need to do DMA to driver-specific area.
And there are drivers that need driver-specific area larger than kmalloc
limit.
These are conflicting requirements and one of those drivers must be
changed.
I suggest to change the drivers that need large driver-specific area.
That's why I added the WARN_ON, so that they can be identified.
Mikulas
On Wed, 18 Apr 2018, David Miller wrote:
> From: Mikulas Patocka <[email protected]>
> Date: Wed, 18 Apr 2018 12:44:25 -0400 (EDT)
>
> > The structure net_device is followed by arbitrary driver-specific data
> > (accessible with the function netdev_priv). And for virtio-net, these
> > driver-specific data must be in DMA memory.
>
> And we are saying that this assumption is wrong and needs to be
> corrected.
So, try to find all the networking drivers that to DMA to the private
area.
The problem here is that kvzalloc usually returns DMA-able area, but it
may return non-DMA area rarely, if the memory is too fragmented. So, we
are in a situation, where some networking drivers will randomly fail. Go
and find them.
Mikulas
On Wed, 18 Apr 2018, David Miller wrote:
> From: Eric Dumazet <[email protected]>
> Date: Wed, 18 Apr 2018 09:51:25 -0700
>
> > I suggest that virtio_net clearly identifies which part needs a specific allocation
> > and does its itself, instead of abusing the netdev_priv storage.
> >
> > Ie use a pointer to a block of memory, allocated by virtio_net, for virtio_net.
>
> +1
And what about the other networking drivers that may do the same thing?
Mikulas
On Wed, Apr 18, 2018 at 09:05:54AM -0700, Eric Dumazet wrote:
>
>
> On 04/18/2018 07:34 AM, Mikulas Patocka wrote:
> > The patch 74d332c13b21 changes alloc_netdev_mqs to use vzalloc if kzalloc
> > fails (later patches change it to kvzalloc).
> >
> > The problem with this is that if the vzalloc function is actually used,
> > virtio_net doesn't work (because it expects that the extra memory should
> > be accessible with DMA-API and memory allocated with vzalloc isn't).
> >
> > This patch changes it back to kzalloc and adds a warning if the allocated
> > size is too large (the allocation is unreliable in this case).
> >
> > Signed-off-by: Mikulas Patocka <[email protected]>
> > Fixes: 74d332c13b21 ("net: extend net_device allocation to vmalloc()")
> >
> > ---
> > net/core/dev.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > Index: linux-2.6/net/core/dev.c
> > ===================================================================
> > --- linux-2.6.orig/net/core/dev.c 2018-04-16 21:08:36.000000000 +0200
> > +++ linux-2.6/net/core/dev.c 2018-04-18 16:24:43.000000000 +0200
> > @@ -8366,7 +8366,8 @@ struct net_device *alloc_netdev_mqs(int
> > /* ensure 32-byte alignment of whole construct */
> > alloc_size += NETDEV_ALIGN - 1;
> >
> > - p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> > + WARN_ON(alloc_size > PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER);
> > + p = kzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> > if (!p)
> > return NULL;
> >
> >
>
> Since when a net_device needs to be in DMA zone ???
It's likely that we are not the only device like this.
It would be better to find a way to find devices like this.
Imagine you want to pass some data to card.
Natural thing is to just put it in a variable and start DMA.
However DMA API disallows stack access nowdays,
so it's natural to put this within struct device.
See e.g.
commit a725ee3e44e39dab1ec82cc745899a785d2a555e
Author: Andy Lutomirski <[email protected]>
Date: Mon Jul 18 15:34:49 2016 -0700
virtio-net: Remove more stack DMA
> I would rather fix virtio_net, this looks very suspect to me.
It's been done for years. I'm fine with changing virtio-net and
allocating DMA memory separately but I am not sure it's appropriate on
net.
And OTOH, shouldn't drivers avoid allocating such huge device structs?
Abusing vmalloc won't work well on 32 bit platforms.
> Each virtio_net should probably allocate the exact amount of DMA-memory it wants,
> instead of expecting core networking stack to have a huge chunk of DMA-memory for everything.
It's not a DMA memory at all (not a synchronous memory) and it is not
huge. It's a small chunk of regular memory that is mapped for DMA for a
short while, then unmapped.
--
MST
On Wed, Apr 18, 2018 at 01:47:21PM -0400, David Miller wrote:
> From: Eric Dumazet <[email protected]>
> Date: Wed, 18 Apr 2018 09:51:25 -0700
>
> > I suggest that virtio_net clearly identifies which part needs a specific allocation
> > and does its itself, instead of abusing the netdev_priv storage.
> >
> > Ie use a pointer to a block of memory, allocated by virtio_net, for virtio_net.
>
> +1
I can do this, but just FYI it's all of 16 bytes which is only mapped
for DMA temporarily - and not all of it - a byte here, a byte there.
The amount of hoops one has to jump through just to get 1 byte from
device nowdays is annoying.
--
MST
On 04/18/2018 10:55 AM, Michael S. Tsirkin wrote:
> Imagine you want to pass some data to card.
> Natural thing is to just put it in a variable and start DMA.
> However DMA API disallows stack access nowdays,
> so it's natural to put this within struct device.
>
> See e.g.
>
> commit a725ee3e44e39dab1ec82cc745899a785d2a555e
> Author: Andy Lutomirski <[email protected]>
> Date: Mon Jul 18 15:34:49 2016 -0700
>
> virtio-net: Remove more stack DMA
>
Andy just moved the problem to another one, since at that time we already
had vmalloc() fallback for at least 2 years.
Note that my original patch had :
p = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
if (!p)
p = vzalloc(alloc_size);
So really, normal (less than PAGE_SIZE) allocations would have almost-zero-chance to end up to vmalloc(one_page)
On Wed, Apr 18, 2018 at 01:38:43PM -0700, Eric Dumazet wrote:
>
>
> On 04/18/2018 10:55 AM, Michael S. Tsirkin wrote:
>
> > Imagine you want to pass some data to card.
> > Natural thing is to just put it in a variable and start DMA.
> > However DMA API disallows stack access nowdays,
> > so it's natural to put this within struct device.
> >
> > See e.g.
> >
> > commit a725ee3e44e39dab1ec82cc745899a785d2a555e
> > Author: Andy Lutomirski <[email protected]>
> > Date: Mon Jul 18 15:34:49 2016 -0700
> >
> > virtio-net: Remove more stack DMA
> >
>
> Andy just moved the problem to another one, since at that time we already
> had vmalloc() fallback for at least 2 years.
>
> Note that my original patch had :
>
> p = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
> if (!p)
> p = vzalloc(alloc_size);
>
> So really, normal (less than PAGE_SIZE) allocations would have almost-zero-chance to end up to vmalloc(one_page)
Thanks Eric, I'll fix virtio.
--
MST
On 04/19/2018 09:12 AM, Mikulas Patocka wrote:
>
>
> These bugs are hard to reproduce because vmalloc falls back to kmalloc
> only if memory is fragmented.
>
This sentence is wrong.
.... because kvmalloc() falls back to vmalloc() ...
On Thu, Apr 19, 2018 at 12:12:38PM -0400, Mikulas Patocka wrote:
>
>
> On Wed, 18 Apr 2018, Mikulas Patocka wrote:
>
> >
> >
> > On Wed, 18 Apr 2018, David Miller wrote:
> >
> > > From: Mikulas Patocka <[email protected]>
> > > Date: Wed, 18 Apr 2018 12:44:25 -0400 (EDT)
> > >
> > > > The structure net_device is followed by arbitrary driver-specific data
> > > > (accessible with the function netdev_priv). And for virtio-net, these
> > > > driver-specific data must be in DMA memory.
> > >
> > > And we are saying that this assumption is wrong and needs to be
> > > corrected.
> >
> > So, try to find all the networking drivers that to DMA to the private
> > area.
> >
> > The problem here is that kvzalloc usually returns DMA-able area, but it
> > may return non-DMA area rarely, if the memory is too fragmented. So, we
> > are in a situation, where some networking drivers will randomly fail. Go
> > and find them.
> >
> > Mikulas
>
> Her I submit a patch that makes kvmalloc always use vmalloc if
> CONFIG_DEBUG_VM is defined.
>
>
>
>
> From: Mikulas Patocka <[email protected]>
> Subject: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
>
> The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> kmalloc fails.
>
> Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> code.
>
> These bugs are hard to reproduce because vmalloc falls back to kmalloc
> only if memory is fragmented.
>
> In order to detect these bugs reliably I submit this patch that changes
> kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
>
> Signed-off-by: Mikulas Patocka <[email protected]>
Maybe make it conditional on CONFIG_DEBUG_SG too?
Otherwise I think you just trigger a hard to debug memory corruption.
> ---
> mm/util.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> Index: linux-2.6/mm/util.c
> ===================================================================
> --- linux-2.6.orig/mm/util.c 2018-04-18 15:46:23.000000000 +0200
> +++ linux-2.6/mm/util.c 2018-04-18 16:00:43.000000000 +0200
> @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
> */
> void *kvmalloc_node(size_t size, gfp_t flags, int node)
> {
> +#ifndef CONFIG_DEBUG_VM
> gfp_t kmalloc_flags = flags;
> void *ret;
>
> @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
> */
> if (ret || size <= PAGE_SIZE)
> return ret;
> +#endif
>
> return __vmalloc_node_flags_caller(size, node, flags,
> __builtin_return_address(0));
On Wed, 18 Apr 2018, Mikulas Patocka wrote:
>
>
> On Wed, 18 Apr 2018, David Miller wrote:
>
> > From: Mikulas Patocka <[email protected]>
> > Date: Wed, 18 Apr 2018 12:44:25 -0400 (EDT)
> >
> > > The structure net_device is followed by arbitrary driver-specific data
> > > (accessible with the function netdev_priv). And for virtio-net, these
> > > driver-specific data must be in DMA memory.
> >
> > And we are saying that this assumption is wrong and needs to be
> > corrected.
>
> So, try to find all the networking drivers that to DMA to the private
> area.
>
> The problem here is that kvzalloc usually returns DMA-able area, but it
> may return non-DMA area rarely, if the memory is too fragmented. So, we
> are in a situation, where some networking drivers will randomly fail. Go
> and find them.
>
> Mikulas
Her I submit a patch that makes kvmalloc always use vmalloc if
CONFIG_DEBUG_VM is defined.
From: Mikulas Patocka <[email protected]>
Subject: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
The kvmalloc function tries to use kmalloc and falls back to vmalloc if
kmalloc fails.
Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
uses DMA-API on the returned memory or frees it with kfree. Such bugs were
found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
code.
These bugs are hard to reproduce because vmalloc falls back to kmalloc
only if memory is fragmented.
In order to detect these bugs reliably I submit this patch that changes
kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
Signed-off-by: Mikulas Patocka <[email protected]>
---
mm/util.c | 2 ++
1 file changed, 2 insertions(+)
Index: linux-2.6/mm/util.c
===================================================================
--- linux-2.6.orig/mm/util.c 2018-04-18 15:46:23.000000000 +0200
+++ linux-2.6/mm/util.c 2018-04-18 16:00:43.000000000 +0200
@@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
*/
void *kvmalloc_node(size_t size, gfp_t flags, int node)
{
+#ifndef CONFIG_DEBUG_VM
gfp_t kmalloc_flags = flags;
void *ret;
@@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
*/
if (ret || size <= PAGE_SIZE)
return ret;
+#endif
return __vmalloc_node_flags_caller(size, node, flags,
__builtin_return_address(0));
On Thu, 19 Apr 2018, Eric Dumazet wrote:
>
>
> On 04/19/2018 09:12 AM, Mikulas Patocka wrote:
> >
> >
> > These bugs are hard to reproduce because vmalloc falls back to kmalloc
> > only if memory is fragmented.
> >
>
> This sentence is wrong.
>
> .... because kvmalloc() falls back to vmalloc() ...
Yes. There should be "falls back to vmalloc()".
Mikulas
On 04/19/2018 06:12 PM, Mikulas Patocka wrote:
> From: Mikulas Patocka <[email protected]>
> Subject: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
>
> The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> kmalloc fails.
>
> Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> code.
>
> These bugs are hard to reproduce because vmalloc falls back to kmalloc
> only if memory is fragmented.
>
> In order to detect these bugs reliably I submit this patch that changes
> kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
>
> Signed-off-by: Mikulas Patocka <[email protected]>
Hmm AFAIK Fedora uses CONFIG_DEBUG_VM in their kernels. Sure you want to
impose this on all users? Seems too much for DEBUG_VM to me. Maybe it
should be hidden under some error injection config?
> ---
> mm/util.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> Index: linux-2.6/mm/util.c
> ===================================================================
> --- linux-2.6.orig/mm/util.c 2018-04-18 15:46:23.000000000 +0200
> +++ linux-2.6/mm/util.c 2018-04-18 16:00:43.000000000 +0200
> @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
> */
> void *kvmalloc_node(size_t size, gfp_t flags, int node)
> {
> +#ifndef CONFIG_DEBUG_VM
> gfp_t kmalloc_flags = flags;
> void *ret;
>
> @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
> */
> if (ret || size <= PAGE_SIZE)
> return ret;
> +#endif
Did you verify that vmalloc does the right thing for sub-page sizes?
Shouldn't those be exempted?
> return __vmalloc_node_flags_caller(size, node, flags,
> __builtin_return_address(0));
>
On Thu, 19 Apr 2018 12:12:38 -0400 (EDT) Mikulas Patocka <[email protected]> wrote:
> The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> kmalloc fails.
>
> Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> code.
>
> These bugs are hard to reproduce because vmalloc falls back to kmalloc
> only if memory is fragmented.
Yes, that's nasty.
> In order to detect these bugs reliably I submit this patch that changes
> kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
>
> ...
>
> --- linux-2.6.orig/mm/util.c 2018-04-18 15:46:23.000000000 +0200
> +++ linux-2.6/mm/util.c 2018-04-18 16:00:43.000000000 +0200
> @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
> */
> void *kvmalloc_node(size_t size, gfp_t flags, int node)
> {
> +#ifndef CONFIG_DEBUG_VM
> gfp_t kmalloc_flags = flags;
> void *ret;
>
> @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
> */
> if (ret || size <= PAGE_SIZE)
> return ret;
> +#endif
>
> return __vmalloc_node_flags_caller(size, node, flags,
> __builtin_return_address(0));
Well, it doesn't have to be done at compile-time, does it? We could
add a knob (in debugfs, presumably) which enables this at runtime.
That's far more user-friendly.
On Thu, 19 Apr 2018, Andrew Morton wrote:
> On Thu, 19 Apr 2018 12:12:38 -0400 (EDT) Mikulas Patocka <[email protected]> wrote:
>
> > The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> > kmalloc fails.
> >
> > Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> > uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> > found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> > code.
> >
> > These bugs are hard to reproduce because vmalloc falls back to kmalloc
> > only if memory is fragmented.
>
> Yes, that's nasty.
>
> > In order to detect these bugs reliably I submit this patch that changes
> > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> >
> > ...
> >
> > --- linux-2.6.orig/mm/util.c 2018-04-18 15:46:23.000000000 +0200
> > +++ linux-2.6/mm/util.c 2018-04-18 16:00:43.000000000 +0200
> > @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
> > */
> > void *kvmalloc_node(size_t size, gfp_t flags, int node)
> > {
> > +#ifndef CONFIG_DEBUG_VM
> > gfp_t kmalloc_flags = flags;
> > void *ret;
> >
> > @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
> > */
> > if (ret || size <= PAGE_SIZE)
> > return ret;
> > +#endif
> >
> > return __vmalloc_node_flags_caller(size, node, flags,
> > __builtin_return_address(0));
>
> Well, it doesn't have to be done at compile-time, does it? We could
> add a knob (in debugfs, presumably) which enables this at runtime.
> That's far more user-friendly.
But who will turn it on in debugfs? It should be default for debugging
kernels, so that users using them would report the error.
Conditioning it on CONFIG_DEBUG_SG is better than CONFIG_DEBUG_VM, it will
print a stacktrace where the incorrect use happened.
Mikulas
On Thu, 19 Apr 2018, Michael S. Tsirkin wrote:
> Maybe make it conditional on CONFIG_DEBUG_SG too?
> Otherwise I think you just trigger a hard to debug memory corruption.
OK, here I resend the patch with CONFIG_DEBUG_SG. With CONFIG_DEBUG_SG,
the DMA API will print a stacktrace where the misuse happened, so it's
much easier to debug than with CONFIG_DEBUG_VM.
Fedora doesn't use CONFIG_DEBUG_SG in its default kernel (it only uses it
in the debugging kernel), so users won't be hurt by this.
From: Mikulas Patocka <[email protected]>
Subject: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
The kvmalloc function tries to use kmalloc and falls back to vmalloc if
kmalloc fails.
Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
uses DMA-API on the returned memory or frees it with kfree. Such bugs were
found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
code.
These bugs are hard to reproduce because vmalloc falls back to kmalloc
only if memory is fragmented.
In order to detect these bugs reliably I submit this patch that changes
kvmalloc to always use vmalloc if CONFIG_DEBUG_SG is turned on.
Signed-off-by: Mikulas Patocka <[email protected]>
---
mm/util.c | 2 ++
1 file changed, 2 insertions(+)
Index: linux-2.6/mm/util.c
===================================================================
--- linux-2.6.orig/mm/util.c 2018-04-18 15:46:23.000000000 +0200
+++ linux-2.6/mm/util.c 2018-04-19 23:14:14.000000000 +0200
@@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
*/
void *kvmalloc_node(size_t size, gfp_t flags, int node)
{
+#ifndef CONFIG_DEBUG_SG
gfp_t kmalloc_flags = flags;
void *ret;
@@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
*/
if (ret || size <= PAGE_SIZE)
return ret;
+#endif
return __vmalloc_node_flags_caller(size, node, flags,
__builtin_return_address(0));
On Thu, 19 Apr 2018 17:19:20 -0400 (EDT) Mikulas Patocka <[email protected]> wrote:
> > > In order to detect these bugs reliably I submit this patch that changes
> > > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> > >
> > > ...
> > >
> > > --- linux-2.6.orig/mm/util.c 2018-04-18 15:46:23.000000000 +0200
> > > +++ linux-2.6/mm/util.c 2018-04-18 16:00:43.000000000 +0200
> > > @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
> > > */
> > > void *kvmalloc_node(size_t size, gfp_t flags, int node)
> > > {
> > > +#ifndef CONFIG_DEBUG_VM
> > > gfp_t kmalloc_flags = flags;
> > > void *ret;
> > >
> > > @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
> > > */
> > > if (ret || size <= PAGE_SIZE)
> > > return ret;
> > > +#endif
> > >
> > > return __vmalloc_node_flags_caller(size, node, flags,
> > > __builtin_return_address(0));
> >
> > Well, it doesn't have to be done at compile-time, does it? We could
> > add a knob (in debugfs, presumably) which enables this at runtime.
> > That's far more user-friendly.
>
> But who will turn it on in debugfs?
But who will turn it on in Kconfig? Just a handful of developers. We
could add SONFIG_DEBUG_SG to the list in
Documentation/process/submit-checklist.rst, but nobody reads that.
Also, a whole bunch of defconfigs set CONFIG_DEBUG_SG=y and some
googling indicates that they aren't the only ones...
> It should be default for debugging
> kernels, so that users using them would report the error.
Well. This isn't the first time we've wanted to enable expensive (or
noisy) debugging things in debug kernels, by any means.
So how could we define a debug kernel in which it's OK to enable such
things?
- Could be "it's an -rc kernel". But then we'd be enabling a bunch of
untested code when Linus cuts a release.
- Could be "it's an -rc kernel with SUBLEVEL <= 5". But then we risk
unexpected things happening when Linux cuts -rc6, which still isn't
good.
- How about "it's an -rc kernel with odd-numbered SUBLEVEL and
SUBLEVEL <= 5". That way everybody who runs -rc1, -rc3 and -rc5 will
have kvmalloc debugging enabled. That's potentially nasty because
vmalloc is much slower than kmalloc. But kvmalloc() is only used for
large and probably infrequent allocations, so it's probably OK.
I wonder how we get at SUBLEVEL from within .c.
On Thu, Apr 19, 2018 at 12:12:38PM -0400, Mikulas Patocka wrote:
> Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> code.
Maybe it's time to have the SG code handle vmalloced pages? This is
becoming more and more common with vmapped stacks (and some of our
workarounds are hideous -- allocate 4 bytes with kmalloc because we can't
DMA onto the stack any more?). We already have a few places which do
handle sgs of vmalloced addresses, such as the nx crypto driver:
if (is_vmalloc_addr(start_addr))
sg_addr = page_to_phys(vmalloc_to_page(start_addr))
+ offset_in_page(sg_addr);
else
sg_addr = __pa(sg_addr);
and videobuf:
pg = vmalloc_to_page(virt);
if (NULL == pg)
goto err;
BUG_ON(page_to_pfn(pg) >= (1 << (32 - PAGE_SHIFT)));
sg_set_page(&sglist[i], pg, PAGE_SIZE, 0);
Yes, there's the potential that we have to produce two SG entries for a
virtually contiguous region if it crosses a page boundary, and our APIs
aren't set up right to make it happen. But this is something we should
consider fixing ... otherwise we'll end up with dozens of driver hacks.
The videobuf implementation was already copy-and-pasted into the saa7146
driver, for example.
On Thu, 19 Apr 2018, Andrew Morton wrote:
> On Thu, 19 Apr 2018 17:19:20 -0400 (EDT) Mikulas Patocka <[email protected]> wrote:
>
> > > > In order to detect these bugs reliably I submit this patch that changes
> > > > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> > > >
> > > > ...
> > > >
> > > > --- linux-2.6.orig/mm/util.c 2018-04-18 15:46:23.000000000 +0200
> > > > +++ linux-2.6/mm/util.c 2018-04-18 16:00:43.000000000 +0200
> > > > @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
> > > > */
> > > > void *kvmalloc_node(size_t size, gfp_t flags, int node)
> > > > {
> > > > +#ifndef CONFIG_DEBUG_VM
> > > > gfp_t kmalloc_flags = flags;
> > > > void *ret;
> > > >
> > > > @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
> > > > */
> > > > if (ret || size <= PAGE_SIZE)
> > > > return ret;
> > > > +#endif
> > > >
> > > > return __vmalloc_node_flags_caller(size, node, flags,
> > > > __builtin_return_address(0));
> > >
> > > Well, it doesn't have to be done at compile-time, does it? We could
> > > add a knob (in debugfs, presumably) which enables this at runtime.
> > > That's far more user-friendly.
> >
> > But who will turn it on in debugfs?
>
> But who will turn it on in Kconfig? Just a handful of developers. We
So, it won't receive much testing.
I've never played with those debugfs files (because I didn't need it), and
most users also won't play with it. Having a debugfs option is like having
no option at all.
> could add SONFIG_DEBUG_SG to the list in
> Documentation/process/submit-checklist.rst, but nobody reads that.
>
> Also, a whole bunch of defconfigs set CONFIG_DEBUG_SG=y and some
> googling indicates that they aren't the only ones...
>
> > It should be default for debugging
> > kernels, so that users using them would report the error.
>
> Well. This isn't the first time we've wanted to enable expensive (or
> noisy) debugging things in debug kernels, by any means.
>
> So how could we define a debug kernel in which it's OK to enable such
> things?
Debug kernel is what distributions distribute as debug kernel - i.e. RHEL
or Fedora have debugging kernels. So it needs to be bound to an option
that is turned on in these kernels - so that any user who boots the
debugging kernel triggers the bug.
> - Could be "it's an -rc kernel". But then we'd be enabling a bunch of
> untested code when Linus cuts a release.
>
> - Could be "it's an -rc kernel with SUBLEVEL <= 5". But then we risk
> unexpected things happening when Linux cuts -rc6, which still isn't
> good.
>
> - How about "it's an -rc kernel with odd-numbered SUBLEVEL and
> SUBLEVEL <= 5". That way everybody who runs -rc1, -rc3 and -rc5 will
> have kvmalloc debugging enabled. That's potentially nasty because
> vmalloc is much slower than kmalloc. But kvmalloc() is only used for
> large and probably infrequent allocations, so it's probably OK.
>
> I wonder how we get at SUBLEVEL from within .c.
Don't bind it to rc level, bind it to some debugging configuration option.
Mikulas
On Fri, 20 Apr 2018, Matthew Wilcox wrote:
> On Thu, Apr 19, 2018 at 12:12:38PM -0400, Mikulas Patocka wrote:
> > Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> > uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> > found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> > code.
>
> Maybe it's time to have the SG code handle vmalloced pages? This is
> becoming more and more common with vmapped stacks (and some of our
> workarounds are hideous -- allocate 4 bytes with kmalloc because we can't
> DMA onto the stack any more?). We already have a few places which do
> handle sgs of vmalloced addresses, such as the nx crypto driver:
>
> if (is_vmalloc_addr(start_addr))
> sg_addr = page_to_phys(vmalloc_to_page(start_addr))
> + offset_in_page(sg_addr);
> else
> sg_addr = __pa(sg_addr);
>
> and videobuf:
>
> pg = vmalloc_to_page(virt);
> if (NULL == pg)
> goto err;
> BUG_ON(page_to_pfn(pg) >= (1 << (32 - PAGE_SHIFT)));
> sg_set_page(&sglist[i], pg, PAGE_SIZE, 0);
>
> Yes, there's the potential that we have to produce two SG entries for a
> virtually contiguous region if it crosses a page boundary, and our APIs
> aren't set up right to make it happen. But this is something we should
> consider fixing ... otherwise we'll end up with dozens of driver hacks.
> The videobuf implementation was already copy-and-pasted into the saa7146
> driver, for example.
What if the device requires physically contiguous area and the vmalloc
area crosses a page? Will you use a bounce buffer? Where do you allocate
the bounce buffer from? What if you run out of bounce buffers?
Mikulkas
On Thu 19-04-18 12:12:38, Mikulas Patocka wrote:
[...]
> From: Mikulas Patocka <[email protected]>
> Subject: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
>
> The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> kmalloc fails.
>
> Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> code.
>
> These bugs are hard to reproduce because vmalloc falls back to kmalloc
> only if memory is fragmented.
>
> In order to detect these bugs reliably I submit this patch that changes
> kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
No way. This is just wrong! First of all, you will explode most likely
on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
enabled quite often.
> Signed-off-by: Mikulas Patocka <[email protected]>
Nacked-by: Michal Hocko <[email protected]>
> ---
> mm/util.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> Index: linux-2.6/mm/util.c
> ===================================================================
> --- linux-2.6.orig/mm/util.c 2018-04-18 15:46:23.000000000 +0200
> +++ linux-2.6/mm/util.c 2018-04-18 16:00:43.000000000 +0200
> @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
> */
> void *kvmalloc_node(size_t size, gfp_t flags, int node)
> {
> +#ifndef CONFIG_DEBUG_VM
> gfp_t kmalloc_flags = flags;
> void *ret;
>
> @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
> */
> if (ret || size <= PAGE_SIZE)
> return ret;
> +#endif
>
> return __vmalloc_node_flags_caller(size, node, flags,
> __builtin_return_address(0));
--
Michal Hocko
SUSE Labs
On Fri, Apr 20, 2018 at 03:08:52PM +0200, Michal Hocko wrote:
> > In order to detect these bugs reliably I submit this patch that changes
> > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
>
> No way. This is just wrong! First of all, you will explode most likely
> on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> enabled quite often.
I think it'll still suit Mikulas' debugging needs if we always use
vmalloc for sizes above PAGE_SIZE?
On Fri 20-04-18 06:41:36, Matthew Wilcox wrote:
> On Fri, Apr 20, 2018 at 03:08:52PM +0200, Michal Hocko wrote:
> > > In order to detect these bugs reliably I submit this patch that changes
> > > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> >
> > No way. This is just wrong! First of all, you will explode most likely
> > on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> > enabled quite often.
>
> I think it'll still suit Mikulas' debugging needs if we always use
> vmalloc for sizes above PAGE_SIZE?
Even if that was the case then this doesn't sounds like CONFIG_DEBUG_VM
material. We do not want a completely different behavior when the config
is enabled. If we really need some better fallback testing coverage
then the fault injection, as suggested by Vlastimil, sounds much more
reasonable to me
--
Michal Hocko
SUSE Labs
On Fri, 20 Apr 2018, Michal Hocko wrote:
> On Thu 19-04-18 12:12:38, Mikulas Patocka wrote:
> [...]
> > From: Mikulas Patocka <[email protected]>
> > Subject: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
> >
> > The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> > kmalloc fails.
> >
> > Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> > uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> > found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> > code.
> >
> > These bugs are hard to reproduce because vmalloc falls back to kmalloc
> > only if memory is fragmented.
> >
> > In order to detect these bugs reliably I submit this patch that changes
> > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
>
> No way. This is just wrong! First of all, you will explode most likely
> on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> enabled quite often.
You're an evil person who doesn't want to fix bugs.
You refused to fix vmalloc(GFP_NOIO) misbehavior a year ago (did you make
some progress with it since that time?) and you refuse to fix kvmalloc
misuses.
I tried this patch on text-only virtual machine and /proc/vmallocinfo
shows 614kB more memory. I tried it on a desktop machine with the chrome
browser open and /proc/vmallocinfo space is increased by 7MB. So no - this
won't exhaust memory and kill the machine.
Arguing that this increases memory consumption is as bogus as arguing that
CONFIG_LOCKDEP increses memory consumption. No one is forcing you to
enable CONFIG_LOCKDEP and no one is forcing you to enable this kvmalloc
test too.
Mikulas
On Fri, 20 Apr 2018, Michal Hocko wrote:
> On Fri 20-04-18 06:41:36, Matthew Wilcox wrote:
> > On Fri, Apr 20, 2018 at 03:08:52PM +0200, Michal Hocko wrote:
> > > > In order to detect these bugs reliably I submit this patch that changes
> > > > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> > >
> > > No way. This is just wrong! First of all, you will explode most likely
> > > on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> > > enabled quite often.
> >
> > I think it'll still suit Mikulas' debugging needs if we always use
> > vmalloc for sizes above PAGE_SIZE?
>
> Even if that was the case then this doesn't sounds like CONFIG_DEBUG_VM
> material. We do not want a completely different behavior when the config
>
> --
> Michal Hocko
> SUSE Labs
I'm not arguing that it must be turned on exactly by CONFIG_DEBUG_VM. It
may be turned on some other option that is enabled in debug kernels
(CONFIG_DEBUG_SG may be a better option, because you'll get meaningful
stracktraces from DMA API then).
> is enabled. If we really need some better fallback testing coverage
> then the fault injection, as suggested by Vlastimil, sounds much more
> reasonable to me
People who test kernels will install the kernel-debug package, reboot to
the debug kernel and run their testsuites. They won't turn on magic
options in debugfs or use some hideous kernel commandline arguments. If
the kvmalloc test isn't in the debug kernel, then the testing crew won't
test it - that's it.
Mikulas
On Fri, Apr 20, 2018 at 04:54:53PM -0400, Mikulas Patocka wrote:
> On Fri, 20 Apr 2018, Michal Hocko wrote:
> > No way. This is just wrong! First of all, you will explode most likely
> > on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> > enabled quite often.
>
> You're an evil person who doesn't want to fix bugs.
Steady on. There's no need for that. Michal isn't evil. Please
apologise.
> You refused to fix vmalloc(GFP_NOIO) misbehavior a year ago (did you make
> some progress with it since that time?) and you refuse to fix kvmalloc
> misuses.
I understand you're frustrated, but this is not the way to get the problems
fixed.
> I tried this patch on text-only virtual machine and /proc/vmallocinfo
> shows 614kB more memory. I tried it on a desktop machine with the chrome
> browser open and /proc/vmallocinfo space is increased by 7MB. So no - this
> won't exhaust memory and kill the machine.
This is good data, thank you for providing it.
> Arguing that this increases memory consumption is as bogus as arguing that
> CONFIG_LOCKDEP increses memory consumption. No one is forcing you to
> enable CONFIG_LOCKDEP and no one is forcing you to enable this kvmalloc
> test too.
I think there's a real problem which is that CONFIG_DEBUG_VM is too broad.
It inserts code in a *lot* of places, some of which is quite expensive.
We would do better to split it into more granular pieces ... although
an explosion of configuration options isn't great either. Maybe just
CONFIG_DEBUG_VM and CONFIG_DEBUG_VM_EXPENSIVE.
Michal may be wrong, but he's not evil.
On Fri, 20 Apr 2018, Matthew Wilcox wrote:
> On Fri, Apr 20, 2018 at 04:54:53PM -0400, Mikulas Patocka wrote:
> > On Fri, 20 Apr 2018, Michal Hocko wrote:
> > > No way. This is just wrong! First of all, you will explode most likely
> > > on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> > > enabled quite often.
> >
> > You're an evil person who doesn't want to fix bugs.
>
> Steady on. There's no need for that. Michal isn't evil. Please
> apologise.
I see this attitude from Michal again and again.
He didn't want to fix vmalloc(GFP_NOIO), he didn't want to fix alloc_pages
sleeping when __GFP_NORETRY is used. So what should I say? Fix them and
you won't be evil :-)
(he could also fix the oom killer, so that it is triggered when
free_memory+cache+free_swap goes beyond a threshold and not when you loop
too long in the allocator)
> > You refused to fix vmalloc(GFP_NOIO) misbehavior a year ago (did you make
> > some progress with it since that time?) and you refuse to fix kvmalloc
> > misuses.
>
> I understand you're frustrated, but this is not the way to get the problems
> fixed.
>
> > I tried this patch on text-only virtual machine and /proc/vmallocinfo
> > shows 614kB more memory. I tried it on a desktop machine with the chrome
> > browser open and /proc/vmallocinfo space is increased by 7MB. So no - this
> > won't exhaust memory and kill the machine.
>
> This is good data, thank you for providing it.
>
> > Arguing that this increases memory consumption is as bogus as arguing that
> > CONFIG_LOCKDEP increses memory consumption. No one is forcing you to
> > enable CONFIG_LOCKDEP and no one is forcing you to enable this kvmalloc
> > test too.
>
> I think there's a real problem which is that CONFIG_DEBUG_VM is too broad.
> It inserts code in a *lot* of places, some of which is quite expensive.
> We would do better to split it into more granular pieces ... although
> an explosion of configuration options isn't great either. Maybe just
> CONFIG_DEBUG_VM and CONFIG_DEBUG_VM_EXPENSIVE.
>
> Michal may be wrong, but he's not evil.
I already said that we can change it from CONFIG_DEBUG_VM to
CONFIG_DEBUG_SG - or to whatever other option you may want, just to make
sure that it is enabled in distro debug kernels by default.
Mikulas
On Fri, Apr 20, 2018 at 05:21:26PM -0400, Mikulas Patocka wrote:
> On Fri, 20 Apr 2018, Matthew Wilcox wrote:
> > On Fri, Apr 20, 2018 at 04:54:53PM -0400, Mikulas Patocka wrote:
> > > On Fri, 20 Apr 2018, Michal Hocko wrote:
> > > > No way. This is just wrong! First of all, you will explode most likely
> > > > on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> > > > enabled quite often.
> > >
> > > You're an evil person who doesn't want to fix bugs.
> >
> > Steady on. There's no need for that. Michal isn't evil. Please
> > apologise.
>
> I see this attitude from Michal again and again.
Fine; then *say that*. I also see Michal saying "No" a lot. Sometimes
I agree with him, sometimes I don't. I think he genuinely wants the best
code in the kernel, and saying "No" is part of it.
> He didn't want to fix vmalloc(GFP_NOIO)
I don't remember that conversation, so I don't know whether I agree with
his reasoning or not. But we are supposed to be moving away from GFP_NOIO
towards marking regions with memalloc_noio_save() / restore. If you do
that, you won't need vmalloc(GFP_NOIO).
> he didn't want to fix alloc_pages sleeping when __GFP_NORETRY is used.
The GFP flags are a mess, still.
> So what should I say? Fix them and
> you won't be evil :-)
No, you should reserve calling somebody evil for truly evil things.
> (he could also fix the oom killer, so that it is triggered when
> free_memory+cache+free_swap goes beyond a threshold and not when you loop
> too long in the allocator)
... that also doesn't make somebody evil.
> I already said that we can change it from CONFIG_DEBUG_VM to
> CONFIG_DEBUG_SG - or to whatever other option you may want, just to make
> sure that it is enabled in distro debug kernels by default.
Yes, and I think that's the right idea. So send a v2 and ignore the
replies that are clearly relating to an earlier version of the patch.
Not everybody reads every mail in the thread before responding to one they
find interesting. Yes, ideally, one would, but sometimes one doesn't.
On Sat 21-04-18 07:47:57, Matthew Wilcox wrote:
> On Fri, Apr 20, 2018 at 05:21:26PM -0400, Mikulas Patocka wrote:
> > On Fri, 20 Apr 2018, Matthew Wilcox wrote:
> > > On Fri, Apr 20, 2018 at 04:54:53PM -0400, Mikulas Patocka wrote:
> > > > On Fri, 20 Apr 2018, Michal Hocko wrote:
> > > > > No way. This is just wrong! First of all, you will explode most likely
> > > > > on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> > > > > enabled quite often.
> > > >
> > > > You're an evil person who doesn't want to fix bugs.
> > >
> > > Steady on. There's no need for that. Michal isn't evil. Please
> > > apologise.
> >
> > I see this attitude from Michal again and again.
>
> Fine; then *say that*. I also see Michal saying "No" a lot. Sometimes
> I agree with him, sometimes I don't. I think he genuinely wants the best
> code in the kernel, and saying "No" is part of it.
Exactly! We have a lot of legacy herritage and random semantic because
we were too eager to merge stuff. I think it is time to stop that and
think twice before merging someothing. If you call that evil then I am
fine to be evil.
> > He didn't want to fix vmalloc(GFP_NOIO)
>
> I don't remember that conversation, so I don't know whether I agree with
> his reasoning or not. But we are supposed to be moving away from GFP_NOIO
> towards marking regions with memalloc_noio_save() / restore. If you do
> that, you won't need vmalloc(GFP_NOIO).
It was basically to detect GFP_NOIO context _inside_ vmalloc and use the
scope API to enforce it there. Does it solve potential problems? Yes it
does. Does it solve any existing report, no I am not aware of any. Is
it a good fix longterm? Absolutely no, because the scope API should be
used _at the place_ where the scope starts rather than a random utility
function. If we are going the easier way now, we will never teach users
to use the API properly. And I am willing to risk to keep a broken
code which we have for years rather than allow a random hack that will
seemingly fix it.
Btw. I was pretty much explicit with this reasoning when rejecting the
patch. Do you still call that evil?
> > he didn't want to fix alloc_pages sleeping when __GFP_NORETRY is used.
>
> The GFP flags are a mess, still.
I do not remember that one but __GFP_NORETRY is _allowed_ to sleep. And
yes I do _agree_ gfp flags are a mess which is really hard to get fixed
because they are lacking a good design from the very beginning. Fixing
some of those issues today is a completely PITA.
> > So what should I say? Fix them and
> > you won't be evil :-)
>
> No, you should reserve calling somebody evil for truly evil things.
>
> > (he could also fix the oom killer, so that it is triggered when
> > free_memory+cache+free_swap goes beyond a threshold and not when you loop
> > too long in the allocator)
>
> ... that also doesn't make somebody evil.
And again, it is way much more easier to claim that something will get
fixed when the reality is much more complicated. I've tried to explain
those risks as well.
> > I already said that we can change it from CONFIG_DEBUG_VM to
> > CONFIG_DEBUG_SG - or to whatever other option you may want, just to make
> > sure that it is enabled in distro debug kernels by default.
>
> Yes, and I think that's the right idea. So send a v2 and ignore the
> replies that are clearly relating to an earlier version of the patch.
> Not everybody reads every mail in the thread before responding to one they
> find interesting. Yes, ideally, one would, but sometimes one doesn't.
And look here. This is yet another ad-hoc idea. We have many users of
kvmalloc which have no relation to SG, yet you are going to control
their behavior by CONFIG_DEBUG_SG? No way! (yeah evil again)
Really, we have a fault injection framework and this sounds like
something to hook in there.
--
Michal Hocko
SUSE Labs
On Sat, 21 Apr 2018, Matthew Wilcox wrote:
> On Fri, Apr 20, 2018 at 05:21:26PM -0400, Mikulas Patocka wrote:
> > On Fri, 20 Apr 2018, Matthew Wilcox wrote:
> > > On Fri, Apr 20, 2018 at 04:54:53PM -0400, Mikulas Patocka wrote:
> > > > On Fri, 20 Apr 2018, Michal Hocko wrote:
> > > > > No way. This is just wrong! First of all, you will explode most likely
> > > > > on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> > > > > enabled quite often.
> > > >
> > > > You're an evil person who doesn't want to fix bugs.
> > >
> > > Steady on. There's no need for that. Michal isn't evil. Please
> > > apologise.
> >
> > I see this attitude from Michal again and again.
>
> Fine; then *say that*. I also see Michal saying "No" a lot. Sometimes
> I agree with him, sometimes I don't. I think he genuinely wants the best
> code in the kernel, and saying "No" is part of it.
>
> > He didn't want to fix vmalloc(GFP_NOIO)
>
> I don't remember that conversation, so I don't know whether I agree with
> his reasoning or not. But we are supposed to be moving away from GFP_NOIO
> towards marking regions with memalloc_noio_save() / restore. If you do
> that, you won't need vmalloc(GFP_NOIO).
He said the same thing a year ago. And there was small progress. 6 out of
27 __vmalloc calls were converted to memalloc_noio_save in a year - 5 in
infiniband and 1 in btrfs. (the whole discussion is here
http://lkml.iu.edu/hypermail/linux/kernel/1706.3/04681.html )
He refuses 15-line patch to fix GFP_NOIO bug because he believes that in 4
years, the kernel will be refactored and GFP_NOIO will be eliminated. Why
does he have veto over this part of the code? I'd much rather argue with
people who have constructive comments about fixing bugs than with him.
(note that even if the refactoring eventually succeeds, it will not be
backported to stable branches. The small vmalloc patch could be
backported)
> > he didn't want to fix alloc_pages sleeping when __GFP_NORETRY is used.
>
> The GFP flags are a mess, still.
That's the problem - the flag doesn't have a clear contract and the
developers change behavior ad hoc according to bug reports.
> > So what should I say? Fix them and you won't be evil :-)
>
> No, you should reserve calling somebody evil for truly evil things.
How would you call it? Michal falsely believes that a 15-line patch would
prevent him from doing long-term refactoring work and so he refuses it.
> > I already said that we can change it from CONFIG_DEBUG_VM to
> > CONFIG_DEBUG_SG - or to whatever other option you may want, just to make
> > sure that it is enabled in distro debug kernels by default.
>
> Yes, and I think that's the right idea. So send a v2 and ignore the
> replies that are clearly relating to an earlier version of the patch.
> Not everybody reads every mail in the thread before responding to one they
> find interesting. Yes, ideally, one would, but sometimes one doesn't.
I sent the CONFIG_DEBUG_SG patch before (I wonder why he didn't repond to
it). I'll send a third version of the patch that actually randomly chooses
between kmalloc and vmalloc, because some abuses can only be detected with
kmalloc and we should test both.
For bisecting, it is better to always fallback to vmalloc, but for general
testing, it is better to test both branches.
Mikulas
On Sun, 22 Apr 2018, Michal Hocko wrote:
> On Sat 21-04-18 07:47:57, Matthew Wilcox wrote:
>
> > > He didn't want to fix vmalloc(GFP_NOIO)
> >
> > I don't remember that conversation, so I don't know whether I agree with
> > his reasoning or not. But we are supposed to be moving away from GFP_NOIO
> > towards marking regions with memalloc_noio_save() / restore. If you do
> > that, you won't need vmalloc(GFP_NOIO).
>
> It was basically to detect GFP_NOIO context _inside_ vmalloc and use the
> scope API to enforce it there. Does it solve potential problems? Yes it
> does. Does it solve any existing report, no I am not aware of any. Is
> it a good fix longterm? Absolutely no, because the scope API should be
> used _at the place_ where the scope starts rather than a random utility
> function. If we are going the easier way now, we will never teach users
> to use the API properly. And I am willing to risk to keep a broken
> code which we have for years rather than allow a random hack that will
> seemingly fix it.
>
> Btw. I was pretty much explicit with this reasoning when rejecting the
> patch. Do you still call that evil?
You are making nonsensical excuses.
That patch doesn't prevent you from refactoring the kernel and eliminating
GFP_NOIO in the long term. You can apply the patch and then continue
working on GFP_NOIO refactoring as well as before.
> > > he didn't want to fix alloc_pages sleeping when __GFP_NORETRY is used.
> >
> > The GFP flags are a mess, still.
>
> I do not remember that one but __GFP_NORETRY is _allowed_ to sleep. And
> yes I do _agree_ gfp flags are a mess which is really hard to get fixed
> because they are lacking a good design from the very beginning. Fixing
> some of those issues today is a completely PITA.
It may sleep, but if it sleeps regularly, it slows down swapping (because
the swapping code does mempool_alloc and mempool_alloc does __GFP_NORETRY
allocation). And there were two INTENTIONAL sleeps with schedule_timeout.
You removed one and left the other, claiming that __GFP_NORETRY allocation
should sleep.
> > > I already said that we can change it from CONFIG_DEBUG_VM to
> > > CONFIG_DEBUG_SG - or to whatever other option you may want, just to make
> > > sure that it is enabled in distro debug kernels by default.
> >
> > Yes, and I think that's the right idea. So send a v2 and ignore the
> > replies that are clearly relating to an earlier version of the patch.
> > Not everybody reads every mail in the thread before responding to one they
> > find interesting. Yes, ideally, one would, but sometimes one doesn't.
>
> And look here. This is yet another ad-hoc idea. We have many users of
> kvmalloc which have no relation to SG, yet you are going to control
> their behavior by CONFIG_DEBUG_SG? No way! (yeah evil again)
Why aren't you constructive and pick up pick up the CONFIG flag?
> Really, we have a fault injection framework and this sounds like
> something to hook in there.
The testing people won't set it up. They install the "kernel-debug"
package and run the tests in it.
If you introduce a hidden option that no one knows about, no one will use
it.
> --
> Michal Hocko
> SUSE Labs
Mikulas
On Mon 23-04-18 10:06:08, Mikulas Patocka wrote:
>
>
> On Sat, 21 Apr 2018, Matthew Wilcox wrote:
>
> > On Fri, Apr 20, 2018 at 05:21:26PM -0400, Mikulas Patocka wrote:
> > > On Fri, 20 Apr 2018, Matthew Wilcox wrote:
> > > > On Fri, Apr 20, 2018 at 04:54:53PM -0400, Mikulas Patocka wrote:
> > > > > On Fri, 20 Apr 2018, Michal Hocko wrote:
> > > > > > No way. This is just wrong! First of all, you will explode most likely
> > > > > > on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> > > > > > enabled quite often.
> > > > >
> > > > > You're an evil person who doesn't want to fix bugs.
> > > >
> > > > Steady on. There's no need for that. Michal isn't evil. Please
> > > > apologise.
> > >
> > > I see this attitude from Michal again and again.
> >
> > Fine; then *say that*. I also see Michal saying "No" a lot. Sometimes
> > I agree with him, sometimes I don't. I think he genuinely wants the best
> > code in the kernel, and saying "No" is part of it.
> >
> > > He didn't want to fix vmalloc(GFP_NOIO)
> >
> > I don't remember that conversation, so I don't know whether I agree with
> > his reasoning or not. But we are supposed to be moving away from GFP_NOIO
> > towards marking regions with memalloc_noio_save() / restore. If you do
> > that, you won't need vmalloc(GFP_NOIO).
>
> He said the same thing a year ago. And there was small progress. 6 out of
> 27 __vmalloc calls were converted to memalloc_noio_save in a year - 5 in
> infiniband and 1 in btrfs. (the whole discussion is here
> http://lkml.iu.edu/hypermail/linux/kernel/1706.3/04681.html )
Well this is not that easy. It requires a cooperation from maintainers.
I can only do as much. I've posted patches in the past and actively
bringing up this topic at LSFMM last two years...
> He refuses 15-line patch to fix GFP_NOIO bug because he believes that in 4
> years, the kernel will be refactored and GFP_NOIO will be eliminated. Why
> does he have veto over this part of the code? I'd much rather argue with
> people who have constructive comments about fixing bugs than with him.
I didn't NACK the patch AFAIR. I've said it is not a good idea longterm.
I would be much more willing to change my mind if you would back your
patch by a real bug report. Hacks are acceptable when we have a real
issue in hands. But if we want to fix potential issue then better make
it properly.
[...]
> I sent the CONFIG_DEBUG_SG patch before (I wonder why he didn't repond to
> it). I'll send a third version of the patch that actually randomly chooses
> between kmalloc and vmalloc, because some abuses can only be detected with
> kmalloc and we should test both.
>
> For bisecting, it is better to always fallback to vmalloc, but for general
> testing, it is better to test both branches.
Agreed!
--
Michal Hocko
SUSE Labs
On Fri, Apr 20, 2018 at 08:20:23AM -0400, Mikulas Patocka wrote:
>
>
> On Fri, 20 Apr 2018, Matthew Wilcox wrote:
>
> > On Thu, Apr 19, 2018 at 12:12:38PM -0400, Mikulas Patocka wrote:
> > > Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> > > uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> > > found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> > > code.
> >
> > Maybe it's time to have the SG code handle vmalloced pages? This is
> > becoming more and more common with vmapped stacks (and some of our
> > workarounds are hideous -- allocate 4 bytes with kmalloc because we can't
> > DMA onto the stack any more?). We already have a few places which do
> > handle sgs of vmalloced addresses, such as the nx crypto driver:
> >
> > if (is_vmalloc_addr(start_addr))
> > sg_addr = page_to_phys(vmalloc_to_page(start_addr))
> > + offset_in_page(sg_addr);
> > else
> > sg_addr = __pa(sg_addr);
> >
> > and videobuf:
> >
> > pg = vmalloc_to_page(virt);
> > if (NULL == pg)
> > goto err;
> > BUG_ON(page_to_pfn(pg) >= (1 << (32 - PAGE_SHIFT)));
> > sg_set_page(&sglist[i], pg, PAGE_SIZE, 0);
> >
> > Yes, there's the potential that we have to produce two SG entries for a
> > virtually contiguous region if it crosses a page boundary, and our APIs
> > aren't set up right to make it happen. But this is something we should
> > consider fixing ... otherwise we'll end up with dozens of driver hacks.
> > The videobuf implementation was already copy-and-pasted into the saa7146
> > driver, for example.
>
> What if the device requires physically contiguous area and the vmalloc
> area crosses a page? Will you use a bounce buffer? Where do you allocate
> the bounce buffer from? What if you run out of bounce buffers?
>
> Mikulkas
I agree with Matthew here.
4 byte variables are typically size aligned so won't cross a boundary.
That's enough for virtio at least. People using structs can force
alignment.
We could wrap access in a macro (sizeof(x) >= alignof(x)) to help
guarantee that.
--
MST
On Mon 23-04-18 10:24:02, Mikulas Patocka wrote:
[...]
I am not going to comment on your continuous accusations. We can discuss
patches but general rants make very limited sense.
> > > > I already said that we can change it from CONFIG_DEBUG_VM to
> > > > CONFIG_DEBUG_SG - or to whatever other option you may want, just to make
> > > > sure that it is enabled in distro debug kernels by default.
> > >
> > > Yes, and I think that's the right idea. So send a v2 and ignore the
> > > replies that are clearly relating to an earlier version of the patch.
> > > Not everybody reads every mail in the thread before responding to one they
> > > find interesting. Yes, ideally, one would, but sometimes one doesn't.
> >
> > And look here. This is yet another ad-hoc idea. We have many users of
> > kvmalloc which have no relation to SG, yet you are going to control
> > their behavior by CONFIG_DEBUG_SG? No way! (yeah evil again)
>
> Why aren't you constructive and pick up pick up the CONFIG flag?
Because config doesn't make that much of a sense. You do not want a
permanent vmalloc fallback unconditionally. Debugging option which
changes the behavior completely is not useful IMNHO. Besides that who is
going to enable it?
> > Really, we have a fault injection framework and this sounds like
> > something to hook in there.
>
> The testing people won't set it up. They install the "kernel-debug"
> package and run the tests in it.
>
> If you introduce a hidden option that no one knows about, no one will use
> it.
then make sure people know about it. Fuzzers already do test fault
injections.
--
Michal Hocko
SUSE Labs
On Mon, 23 Apr 2018, Michal Hocko wrote:
> On Mon 23-04-18 10:24:02, Mikulas Patocka wrote:
>
> > > Really, we have a fault injection framework and this sounds like
> > > something to hook in there.
> >
> > The testing people won't set it up. They install the "kernel-debug"
> > package and run the tests in it.
> >
> > If you introduce a hidden option that no one knows about, no one will use
> > it.
>
> then make sure people know about it. Fuzzers already do test fault
> injections.
I think that in the long term we can introduce a kernel parameter like
"debug_level=1", "debug_level=2", "debug_level=3" that will turn on
debugging features across all kernel subsystems and we can teach users to
use it to diagnose problems.
But it won't work if every subsystem has different debug parameters. There
are 192 distinct filenames in debugfs, if we add 193rd one, harly anyone
notices it.
Mikulas
The kvmalloc function tries to use kmalloc and falls back to vmalloc if
kmalloc fails.
Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
uses DMA-API on the returned memory or frees it with kfree. Such bugs were
found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
code.
These bugs are hard to reproduce because kvmalloc falls back to vmalloc
only if memory is fragmented.
In order to detect these bugs reliably I submit this patch that changes
kvmalloc to fall back to vmalloc with 1/2 probability if CONFIG_DEBUG_SG
is turned on. CONFIG_DEBUG_SG is used, because it makes the DMA API layer
verify the addresses passed to it, and so the user will get a reliable
stacktrace.
Some bugs (such as buffer overflows) are better detected
with kmalloc code, so we must test the kmalloc path too.
Signed-off-by: Mikulas Patocka <[email protected]>
---
mm/util.c | 10 ++++++++++
1 file changed, 10 insertions(+)
Index: linux-2.6/mm/util.c
===================================================================
--- linux-2.6.orig/mm/util.c 2018-04-23 00:12:05.000000000 +0200
+++ linux-2.6/mm/util.c 2018-04-23 17:57:02.000000000 +0200
@@ -14,6 +14,7 @@
#include <linux/hugetlb.h>
#include <linux/vmalloc.h>
#include <linux/userfaultfd_k.h>
+#include <linux/random.h>
#include <asm/sections.h>
#include <linux/uaccess.h>
@@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
*/
WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
+#ifdef CONFIG_DEBUG_SG
+ /* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
+ if (!(prandom_u32_max(2) & 1))
+ goto do_vmalloc;
+#endif
+
/*
* We want to attempt a large physically contiguous block first because
* it is less likely to fragment multiple larger blocks and therefore
@@ -427,6 +434,9 @@ void *kvmalloc_node(size_t size, gfp_t f
if (ret || size <= PAGE_SIZE)
return ret;
+#ifdef CONFIG_DEBUG_SG
+do_vmalloc:
+#endif
return __vmalloc_node_flags_caller(size, node, flags,
__builtin_return_address(0));
}
On Mon, 23 Apr 2018, Michal Hocko wrote:
> On Mon 23-04-18 10:06:08, Mikulas Patocka wrote:
>
> > > > He didn't want to fix vmalloc(GFP_NOIO)
> > >
> > > I don't remember that conversation, so I don't know whether I agree with
> > > his reasoning or not. But we are supposed to be moving away from GFP_NOIO
> > > towards marking regions with memalloc_noio_save() / restore. If you do
> > > that, you won't need vmalloc(GFP_NOIO).
> >
> > He said the same thing a year ago. And there was small progress. 6 out of
> > 27 __vmalloc calls were converted to memalloc_noio_save in a year - 5 in
> > infiniband and 1 in btrfs. (the whole discussion is here
> > http://lkml.iu.edu/hypermail/linux/kernel/1706.3/04681.html )
>
> Well this is not that easy. It requires a cooperation from maintainers.
> I can only do as much. I've posted patches in the past and actively
> bringing up this topic at LSFMM last two years...
You're right - but you have chosen the uneasy path. Fixing __vmalloc code
is easy and it doesn't require cooperation with maintainers.
> > He refuses 15-line patch to fix GFP_NOIO bug because he believes that in 4
> > years, the kernel will be refactored and GFP_NOIO will be eliminated. Why
> > does he have veto over this part of the code? I'd much rather argue with
> > people who have constructive comments about fixing bugs than with him.
>
> I didn't NACK the patch AFAIR. I've said it is not a good idea longterm.
> I would be much more willing to change my mind if you would back your
> patch by a real bug report. Hacks are acceptable when we have a real
> issue in hands. But if we want to fix potential issue then better make
> it properly.
Developers should fix bugs in advance, not to wait until a crash hapens,
is analyzed and reported.
What's the problem with 15-line hack? Is the problem that kernel
developers would feel depressed when looking the source code? Other than
harming developers' feelings, I don't see what kind of damange could that
piece of code do.
Mikulas
On Mon, 23 Apr 2018, Mikulas Patocka wrote:
> The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> kmalloc fails.
>
> Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> code.
>
> These bugs are hard to reproduce because kvmalloc falls back to vmalloc
> only if memory is fragmented.
>
> In order to detect these bugs reliably I submit this patch that changes
> kvmalloc to fall back to vmalloc with 1/2 probability if CONFIG_DEBUG_SG
> is turned on. CONFIG_DEBUG_SG is used, because it makes the DMA API layer
> verify the addresses passed to it, and so the user will get a reliable
> stacktrace.
>
Why not just do it unconditionally? Sounds better than "50% of the time
this will catch bugs".
> Some bugs (such as buffer overflows) are better detected
> with kmalloc code, so we must test the kmalloc path too.
>
> Signed-off-by: Mikulas Patocka <[email protected]>
>
> ---
> mm/util.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> Index: linux-2.6/mm/util.c
> ===================================================================
> --- linux-2.6.orig/mm/util.c 2018-04-23 00:12:05.000000000 +0200
> +++ linux-2.6/mm/util.c 2018-04-23 17:57:02.000000000 +0200
> @@ -14,6 +14,7 @@
> #include <linux/hugetlb.h>
> #include <linux/vmalloc.h>
> #include <linux/userfaultfd_k.h>
> +#include <linux/random.h>
>
> #include <asm/sections.h>
> #include <linux/uaccess.h>
> @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> */
> WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
>
> +#ifdef CONFIG_DEBUG_SG
> + /* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> + if (!(prandom_u32_max(2) & 1))
> + goto do_vmalloc;
> +#endif
> +
> /*
> * We want to attempt a large physically contiguous block first because
> * it is less likely to fragment multiple larger blocks and therefore
> @@ -427,6 +434,9 @@ void *kvmalloc_node(size_t size, gfp_t f
> if (ret || size <= PAGE_SIZE)
> return ret;
>
> +#ifdef CONFIG_DEBUG_SG
> +do_vmalloc:
> +#endif
You can just do
do_vmalloc: __maybe_unused
> return __vmalloc_node_flags_caller(size, node, flags,
> __builtin_return_address(0));
> }
>
>
On Mon, Apr 23, 2018 at 08:06:16PM -0400, Mikulas Patocka wrote:
> Some bugs (such as buffer overflows) are better detected
> with kmalloc code, so we must test the kmalloc path too.
Well now, this brings up another item for the collective TODO list --
implement redzone checks for vmalloc. Unless this is something already
taken care of by kasan or similar.
On Mon, 23 Apr 2018, David Rientjes wrote:
> On Mon, 23 Apr 2018, Mikulas Patocka wrote:
>
> > The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> > kmalloc fails.
> >
> > Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> > uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> > found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> > code.
> >
> > These bugs are hard to reproduce because kvmalloc falls back to vmalloc
> > only if memory is fragmented.
> >
> > In order to detect these bugs reliably I submit this patch that changes
> > kvmalloc to fall back to vmalloc with 1/2 probability if CONFIG_DEBUG_SG
> > is turned on. CONFIG_DEBUG_SG is used, because it makes the DMA API layer
> > verify the addresses passed to it, and so the user will get a reliable
> > stacktrace.
>
> Why not just do it unconditionally? Sounds better than "50% of the time
> this will catch bugs".
Because kmalloc (with slub_debug) detects buffer overflows better than
vmalloc. vmalloc detects buffer overflows only at a page boundary. This is
intended for debugging kernels and debugging kernels should detect as many
bugs as possible.
Mikulas
On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
[...]
> @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> */
> WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
>
> +#ifdef CONFIG_DEBUG_SG
> + /* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> + if (!(prandom_u32_max(2) & 1))
> + goto do_vmalloc;
> +#endif
I really do not think there is anything DEBUG_SG specific here. Why you
simply do not follow should_failslab path or even reuse the function?
> +
> /*
> * We want to attempt a large physically contiguous block first because
> * it is less likely to fragment multiple larger blocks and therefore
> @@ -427,6 +434,9 @@ void *kvmalloc_node(size_t size, gfp_t f
> if (ret || size <= PAGE_SIZE)
> return ret;
>
> +#ifdef CONFIG_DEBUG_SG
> +do_vmalloc:
> +#endif
> return __vmalloc_node_flags_caller(size, node, flags,
> __builtin_return_address(0));
> }
--
Michal Hocko
SUSE Labs
On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
>
>
> On Mon, 23 Apr 2018, Michal Hocko wrote:
>
> > On Mon 23-04-18 10:06:08, Mikulas Patocka wrote:
> >
> > > > > He didn't want to fix vmalloc(GFP_NOIO)
> > > >
> > > > I don't remember that conversation, so I don't know whether I agree with
> > > > his reasoning or not. But we are supposed to be moving away from GFP_NOIO
> > > > towards marking regions with memalloc_noio_save() / restore. If you do
> > > > that, you won't need vmalloc(GFP_NOIO).
> > >
> > > He said the same thing a year ago. And there was small progress. 6 out of
> > > 27 __vmalloc calls were converted to memalloc_noio_save in a year - 5 in
> > > infiniband and 1 in btrfs. (the whole discussion is here
> > > http://lkml.iu.edu/hypermail/linux/kernel/1706.3/04681.html )
> >
> > Well this is not that easy. It requires a cooperation from maintainers.
> > I can only do as much. I've posted patches in the past and actively
> > bringing up this topic at LSFMM last two years...
>
> You're right - but you have chosen the uneasy path.
Yes.
> Fixing __vmalloc code
> is easy and it doesn't require cooperation with maintainers.
But it is a hack against the intention of the scope api. It also alows
maintainers to not care about their broken code.
> > > He refuses 15-line patch to fix GFP_NOIO bug because he believes that in 4
> > > years, the kernel will be refactored and GFP_NOIO will be eliminated. Why
> > > does he have veto over this part of the code? I'd much rather argue with
> > > people who have constructive comments about fixing bugs than with him.
> >
> > I didn't NACK the patch AFAIR. I've said it is not a good idea longterm.
> > I would be much more willing to change my mind if you would back your
> > patch by a real bug report. Hacks are acceptable when we have a real
> > issue in hands. But if we want to fix potential issue then better make
> > it properly.
>
> Developers should fix bugs in advance, not to wait until a crash hapens,
> is analyzed and reported.
I agree. But are those existing users broken in the first place? I have
seen so many GFP_NOFS abuses that I would dare to guess that most of
those vmalloc NOFS abusers can be simply turned into GFP_KERNEL. Maybe
that is the reason we haven't heard any complains in years.
--
Michal Hocko
SUSE Labs
On Mon, 23 Apr 2018, Matthew Wilcox wrote:
> On Mon, Apr 23, 2018 at 08:06:16PM -0400, Mikulas Patocka wrote:
> > Some bugs (such as buffer overflows) are better detected
> > with kmalloc code, so we must test the kmalloc path too.
>
> Well now, this brings up another item for the collective TODO list --
> implement redzone checks for vmalloc. Unless this is something already
> taken care of by kasan or similar.
The kmalloc overflow testing is also not ideal - it rounds the size up to
the next slab size and detects buffer overflows only at this boundary.
Some times ago, I made a "kmalloc guard" patch that places a magic number
immediatelly after the requested size - so that it can detect overflows at
byte boundary
( https://www.redhat.com/archives/dm-devel/2014-September/msg00018.html )
That patch found a bug in crypto code:
( http://lkml.iu.edu/hypermail/linux/kernel/1409.1/02325.html )
Mikulas
On Tue, 24 Apr 2018, Michal Hocko wrote:
> On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
>
> > Fixing __vmalloc code
> > is easy and it doesn't require cooperation with maintainers.
>
> But it is a hack against the intention of the scope api.
It is not! You can fix __vmalloc now and you can convert the kernel to the
scope API in 4 years. It's not one way or the other.
> It also alows maintainers to not care about their broken code.
Most maintainers don't even know that it's broken. Out of 14 subsystems
using __vmalloc with GFP_NOIO/NOFS, only 2 realized that its
implementation is broken and implemented a workaround (me and the XFS
developers).
Misimplementing a function in a subtle and hard-to-notice way won't drive
developers away from using it.
> > > > He refuses 15-line patch to fix GFP_NOIO bug because he believes that in 4
> > > > years, the kernel will be refactored and GFP_NOIO will be eliminated. Why
> > > > does he have veto over this part of the code? I'd much rather argue with
> > > > people who have constructive comments about fixing bugs than with him.
> > >
> > > I didn't NACK the patch AFAIR. I've said it is not a good idea longterm.
> > > I would be much more willing to change my mind if you would back your
> > > patch by a real bug report. Hacks are acceptable when we have a real
> > > issue in hands. But if we want to fix potential issue then better make
> > > it properly.
> >
> > Developers should fix bugs in advance, not to wait until a crash hapens,
> > is analyzed and reported.
>
> I agree. But are those existing users broken in the first place? I have
> seen so many GFP_NOFS abuses that I would dare to guess that most of
> those vmalloc NOFS abusers can be simply turned into GFP_KERNEL. Maybe
> that is the reason we haven't heard any complains in years.
alloc_pages reclaims clean pages and most hard work is done by kswapd, so
GFP_KERNEL doesn't cause much issues with writeback. But cheating isn't
justified if you can get away with it. Incorrect GFP flags cause real
problems with shrinkers - because shrinkers are called from alloc_pages
and they do respond to GFP flags.
I had reported deadlock due to GFP issues (9d28eb12447). And the worst
thing about these bug reports is that they are totally unreproducible and
I get nothing, but a stacktrace in bugzilla. I had to guess what happened
and I couldn't even test if the patch fixed the bug.
I'm not really happy that you are deliberately leaving these issues behind
and making excuses.
Mikulas
On Tue, 24 Apr 2018, Michal Hocko wrote:
> On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> [...]
> > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> > */
> > WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> >
> > +#ifdef CONFIG_DEBUG_SG
> > + /* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> > + if (!(prandom_u32_max(2) & 1))
> > + goto do_vmalloc;
> > +#endif
>
> I really do not think there is anything DEBUG_SG specific here. Why you
> simply do not follow should_failslab path or even reuse the function?
CONFIG_DEBUG_SG is enabled by default in RHEL and Fedora debug kernel (if
you don't like CONFIG_DEBUG_SG, pick any other option that is enabled
there).
Fail-injection framework is if off by default and it must be explicitly
enabled and configured by the user - and most users won't enable it.
Mikulas
> > +
> > /*
> > * We want to attempt a large physically contiguous block first because
> > * it is less likely to fragment multiple larger blocks and therefore
> > @@ -427,6 +434,9 @@ void *kvmalloc_node(size_t size, gfp_t f
> > if (ret || size <= PAGE_SIZE)
> > return ret;
> >
> > +#ifdef CONFIG_DEBUG_SG
> > +do_vmalloc:
> > +#endif
> > return __vmalloc_node_flags_caller(size, node, flags,
> > __builtin_return_address(0));
> > }
>
> --
> Michal Hocko
> SUSE Labs
>
On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
>
>
> On Tue, 24 Apr 2018, Michal Hocko wrote:
>
> > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> >
> > > Fixing __vmalloc code
> > > is easy and it doesn't require cooperation with maintainers.
> >
> > But it is a hack against the intention of the scope api.
>
> It is not!
This discussion simply doesn't make much sense it seems. The scope API
is to document the scope of the reclaim recursion critical section. That
certainly is not a utility function like vmalloc.
--
Michal Hocko
SUSE Labs
On Tue 24-04-18 11:50:30, Mikulas Patocka wrote:
>
>
> On Tue, 24 Apr 2018, Michal Hocko wrote:
>
> > On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> > [...]
> > > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> > > */
> > > WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> > >
> > > +#ifdef CONFIG_DEBUG_SG
> > > + /* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> > > + if (!(prandom_u32_max(2) & 1))
> > > + goto do_vmalloc;
> > > +#endif
> >
> > I really do not think there is anything DEBUG_SG specific here. Why you
> > simply do not follow should_failslab path or even reuse the function?
>
> CONFIG_DEBUG_SG is enabled by default in RHEL and Fedora debug kernel (if
> you don't like CONFIG_DEBUG_SG, pick any other option that is enabled
> there).
Are you telling me that you are shaping a debugging functionality basing
on what RHEL has enabled? And you call me evil. This is just rediculous.
> Fail-injection framework is if off by default and it must be explicitly
> enabled and configured by the user - and most users won't enable it.
It can be enabled easily. And if you care enough for your debugging
kernel then just make it enabled unconditionally.
--
Michal Hocko
SUSE Labs
On Tue 24-04-18 10:12:42, Michal Hocko wrote:
> On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
> >
> >
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> >
> > > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> > >
> > > > Fixing __vmalloc code
> > > > is easy and it doesn't require cooperation with maintainers.
> > >
> > > But it is a hack against the intention of the scope api.
> >
> > It is not!
>
> This discussion simply doesn't make much sense it seems. The scope API
> is to document the scope of the reclaim recursion critical section. That
> certainly is not a utility function like vmalloc.
http://lkml.kernel.org/r/[email protected]
let's see how it rolls this time.
--
Michal Hocko
SUSE Labs
On Tue, 24 Apr 2018, Michal Hocko wrote:
> On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
> >
> >
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> >
> > > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> > >
> > > > Fixing __vmalloc code
> > > > is easy and it doesn't require cooperation with maintainers.
> > >
> > > But it is a hack against the intention of the scope api.
> >
> > It is not!
>
> This discussion simply doesn't make much sense it seems. The scope API
> is to document the scope of the reclaim recursion critical section. That
> certainly is not a utility function like vmalloc.
That 15-line __vmalloc bugfix doesn't prevent you (or any other kernel
developer) from converting the code to the scope API. You make nonsensical
excuses.
Mikulas
On Tue, 24 Apr 2018, Michal Hocko wrote:
> On Tue 24-04-18 11:50:30, Mikulas Patocka wrote:
> >
> >
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> >
> > > On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> > > [...]
> > > > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> > > > */
> > > > WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> > > >
> > > > +#ifdef CONFIG_DEBUG_SG
> > > > + /* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> > > > + if (!(prandom_u32_max(2) & 1))
> > > > + goto do_vmalloc;
> > > > +#endif
> > >
> > > I really do not think there is anything DEBUG_SG specific here. Why you
> > > simply do not follow should_failslab path or even reuse the function?
> >
> > CONFIG_DEBUG_SG is enabled by default in RHEL and Fedora debug kernel (if
> > you don't like CONFIG_DEBUG_SG, pick any other option that is enabled
> > there).
>
> Are you telling me that you are shaping a debugging functionality basing
> on what RHEL has enabled? And you call me evil. This is just rediculous.
>
> > Fail-injection framework is if off by default and it must be explicitly
> > enabled and configured by the user - and most users won't enable it.
>
> It can be enabled easily. And if you care enough for your debugging
> kernel then just make it enabled unconditionally.
So, should we add a new option CONFIG_KVMALLOC_FALLBACK_DEFAULT? I'm not
quite sure if 3 lines of debugging code need an extra option, but if you
don't want to reuse any existing debug option, it may be possible. Adding
it to the RHEL debug kernel would be trivial.
Mikulas
On Tue 24-04-18 13:00:11, Mikulas Patocka wrote:
>
>
> On Tue, 24 Apr 2018, Michal Hocko wrote:
>
> > On Tue 24-04-18 11:50:30, Mikulas Patocka wrote:
> > >
> > >
> > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > >
> > > > On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> > > > [...]
> > > > > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> > > > > */
> > > > > WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> > > > >
> > > > > +#ifdef CONFIG_DEBUG_SG
> > > > > + /* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> > > > > + if (!(prandom_u32_max(2) & 1))
> > > > > + goto do_vmalloc;
> > > > > +#endif
> > > >
> > > > I really do not think there is anything DEBUG_SG specific here. Why you
> > > > simply do not follow should_failslab path or even reuse the function?
> > >
> > > CONFIG_DEBUG_SG is enabled by default in RHEL and Fedora debug kernel (if
> > > you don't like CONFIG_DEBUG_SG, pick any other option that is enabled
> > > there).
> >
> > Are you telling me that you are shaping a debugging functionality basing
> > on what RHEL has enabled? And you call me evil. This is just rediculous.
> >
> > > Fail-injection framework is if off by default and it must be explicitly
> > > enabled and configured by the user - and most users won't enable it.
> >
> > It can be enabled easily. And if you care enough for your debugging
> > kernel then just make it enabled unconditionally.
>
> So, should we add a new option CONFIG_KVMALLOC_FALLBACK_DEFAULT? I'm not
> quite sure if 3 lines of debugging code need an extra option, but if you
> don't want to reuse any existing debug option, it may be possible. Adding
> it to the RHEL debug kernel would be trivial.
Wouldn't it be equally trivial to simply enable the fault injection? You
would get additional failure paths testing as a bonus.
--
Michal Hocko
SUSE Labs
On Tue, Apr 24, 2018 at 08:29:14AM -0400, Mikulas Patocka wrote:
>
>
> On Mon, 23 Apr 2018, Matthew Wilcox wrote:
>
> > On Mon, Apr 23, 2018 at 08:06:16PM -0400, Mikulas Patocka wrote:
> > > Some bugs (such as buffer overflows) are better detected
> > > with kmalloc code, so we must test the kmalloc path too.
> >
> > Well now, this brings up another item for the collective TODO list --
> > implement redzone checks for vmalloc. Unless this is something already
> > taken care of by kasan or similar.
>
> The kmalloc overflow testing is also not ideal - it rounds the size up to
> the next slab size and detects buffer overflows only at this boundary.
>
> Some times ago, I made a "kmalloc guard" patch that places a magic number
> immediatelly after the requested size - so that it can detect overflows at
> byte boundary
> ( https://www.redhat.com/archives/dm-devel/2014-September/msg00018.html )
>
> That patch found a bug in crypto code:
> ( http://lkml.iu.edu/hypermail/linux/kernel/1409.1/02325.html )
Is it still worth doing this, now we have kasan?
On Tue, 24 Apr 2018, Michal Hocko wrote:
> On Tue 24-04-18 13:00:11, Mikulas Patocka wrote:
> >
> >
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> >
> > > On Tue 24-04-18 11:50:30, Mikulas Patocka wrote:
> > > >
> > > >
> > > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > > >
> > > > > On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> > > > > [...]
> > > > > > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> > > > > > */
> > > > > > WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> > > > > >
> > > > > > +#ifdef CONFIG_DEBUG_SG
> > > > > > + /* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> > > > > > + if (!(prandom_u32_max(2) & 1))
> > > > > > + goto do_vmalloc;
> > > > > > +#endif
> > > > >
> > > > > I really do not think there is anything DEBUG_SG specific here. Why you
> > > > > simply do not follow should_failslab path or even reuse the function?
> > > >
> > > > CONFIG_DEBUG_SG is enabled by default in RHEL and Fedora debug kernel (if
> > > > you don't like CONFIG_DEBUG_SG, pick any other option that is enabled
> > > > there).
> > >
> > > Are you telling me that you are shaping a debugging functionality basing
> > > on what RHEL has enabled? And you call me evil. This is just rediculous.
> > >
> > > > Fail-injection framework is if off by default and it must be explicitly
> > > > enabled and configured by the user - and most users won't enable it.
> > >
> > > It can be enabled easily. And if you care enough for your debugging
> > > kernel then just make it enabled unconditionally.
> >
> > So, should we add a new option CONFIG_KVMALLOC_FALLBACK_DEFAULT? I'm not
> > quite sure if 3 lines of debugging code need an extra option, but if you
> > don't want to reuse any existing debug option, it may be possible. Adding
> > it to the RHEL debug kernel would be trivial.
>
> Wouldn't it be equally trivial to simply enable the fault injection? You
> would get additional failure paths testing as a bonus.
The RHEL and Fedora debugging kernels are compiled with fault injection.
But the fault-injection framework will do nothing unless it is enabled by
a kernel parameter or debugfs write.
Most users don't know about the fault injection kernel parameters or
debugfs files and won't enabled it. We need a CONFIG_ option to enable it
by default in the debugging kernels (and we could add a kernel parameter
to override the default, fine-tune the fallback probability etc.)
Mikulas
On Tue 24-04-18 13:28:49, Mikulas Patocka wrote:
>
>
> On Tue, 24 Apr 2018, Michal Hocko wrote:
>
> > On Tue 24-04-18 13:00:11, Mikulas Patocka wrote:
> > >
> > >
> > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > >
> > > > On Tue 24-04-18 11:50:30, Mikulas Patocka wrote:
> > > > >
> > > > >
> > > > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > > > >
> > > > > > On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> > > > > > [...]
> > > > > > > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> > > > > > > */
> > > > > > > WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> > > > > > >
> > > > > > > +#ifdef CONFIG_DEBUG_SG
> > > > > > > + /* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> > > > > > > + if (!(prandom_u32_max(2) & 1))
> > > > > > > + goto do_vmalloc;
> > > > > > > +#endif
> > > > > >
> > > > > > I really do not think there is anything DEBUG_SG specific here. Why you
> > > > > > simply do not follow should_failslab path or even reuse the function?
> > > > >
> > > > > CONFIG_DEBUG_SG is enabled by default in RHEL and Fedora debug kernel (if
> > > > > you don't like CONFIG_DEBUG_SG, pick any other option that is enabled
> > > > > there).
> > > >
> > > > Are you telling me that you are shaping a debugging functionality basing
> > > > on what RHEL has enabled? And you call me evil. This is just rediculous.
> > > >
> > > > > Fail-injection framework is if off by default and it must be explicitly
> > > > > enabled and configured by the user - and most users won't enable it.
> > > >
> > > > It can be enabled easily. And if you care enough for your debugging
> > > > kernel then just make it enabled unconditionally.
> > >
> > > So, should we add a new option CONFIG_KVMALLOC_FALLBACK_DEFAULT? I'm not
> > > quite sure if 3 lines of debugging code need an extra option, but if you
> > > don't want to reuse any existing debug option, it may be possible. Adding
> > > it to the RHEL debug kernel would be trivial.
> >
> > Wouldn't it be equally trivial to simply enable the fault injection? You
> > would get additional failure paths testing as a bonus.
>
> The RHEL and Fedora debugging kernels are compiled with fault injection.
> But the fault-injection framework will do nothing unless it is enabled by
> a kernel parameter or debugfs write.
>
> Most users don't know about the fault injection kernel parameters or
> debugfs files and won't enabled it. We need a CONFIG_ option to enable it
> by default in the debugging kernels (and we could add a kernel parameter
> to override the default, fine-tune the fallback probability etc.)
If it is a real issue to install the debugging kernel with the required
kernel parameter then I a config option for the default on makes sense
to me.
--
Michal Hocko
SUSE Labs
On Tue, 24 Apr 2018, Matthew Wilcox wrote:
> On Tue, Apr 24, 2018 at 08:29:14AM -0400, Mikulas Patocka wrote:
> >
> >
> > On Mon, 23 Apr 2018, Matthew Wilcox wrote:
> >
> > > On Mon, Apr 23, 2018 at 08:06:16PM -0400, Mikulas Patocka wrote:
> > > > Some bugs (such as buffer overflows) are better detected
> > > > with kmalloc code, so we must test the kmalloc path too.
> > >
> > > Well now, this brings up another item for the collective TODO list --
> > > implement redzone checks for vmalloc. Unless this is something already
> > > taken care of by kasan or similar.
> >
> > The kmalloc overflow testing is also not ideal - it rounds the size up to
> > the next slab size and detects buffer overflows only at this boundary.
> >
> > Some times ago, I made a "kmalloc guard" patch that places a magic number
> > immediatelly after the requested size - so that it can detect overflows at
> > byte boundary
> > ( https://www.redhat.com/archives/dm-devel/2014-September/msg00018.html )
> >
> > That patch found a bug in crypto code:
> > ( http://lkml.iu.edu/hypermail/linux/kernel/1409.1/02325.html )
>
> Is it still worth doing this, now we have kasan?
The kmalloc guard has much lower overhead than kasan.
(BTW. when I tried kasan, it oopsed with persistent memory)
Mikulas
This patch reorders Kconfig entries, so that menuconfig displays proper
indentation.
Signed-off-by: Mikulas Patocka <[email protected]>
---
lib/Kconfig.debug | 36 ++++++++++++++++++------------------
1 file changed, 18 insertions(+), 18 deletions(-)
Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug 2018-04-16 21:08:36.000000000 +0200
+++ linux-2.6/lib/Kconfig.debug 2018-04-25 15:56:16.000000000 +0200
@@ -1503,6 +1503,10 @@ config NETDEV_NOTIFIER_ERROR_INJECT
If unsure, say N.
+config FUNCTION_ERROR_INJECTION
+ def_bool y
+ depends on HAVE_FUNCTION_ERROR_INJECTION && KPROBES
+
config FAULT_INJECTION
bool "Fault-injection framework"
depends on DEBUG_KERNEL
@@ -1510,10 +1514,6 @@ config FAULT_INJECTION
Provide fault-injection framework.
For more details, see Documentation/fault-injection/.
-config FUNCTION_ERROR_INJECTION
- def_bool y
- depends on HAVE_FUNCTION_ERROR_INJECTION && KPROBES
-
config FAILSLAB
bool "Fault-injection capability for kmalloc"
depends on FAULT_INJECTION
@@ -1544,16 +1544,6 @@ config FAIL_IO_TIMEOUT
Only works with drivers that use the generic timeout handling,
for others it wont do anything.
-config FAIL_MMC_REQUEST
- bool "Fault-injection capability for MMC IO"
- depends on FAULT_INJECTION_DEBUG_FS && MMC
- help
- Provide fault-injection capability for MMC IO.
- This will make the mmc core return data errors. This is
- useful to test the error handling in the mmc block device
- and to test how the mmc host driver handles retries from
- the block device.
-
config FAIL_FUTEX
bool "Fault-injection capability for futexes"
select DEBUG_FS
@@ -1561,6 +1551,12 @@ config FAIL_FUTEX
help
Provide fault-injection capability for futexes.
+config FAULT_INJECTION_DEBUG_FS
+ bool "Debugfs entries for fault-injection capabilities"
+ depends on FAULT_INJECTION && SYSFS && DEBUG_FS
+ help
+ Enable configuration of fault-injection capabilities via debugfs.
+
config FAIL_FUNCTION
bool "Fault-injection capability for functions"
depends on FAULT_INJECTION_DEBUG_FS && FUNCTION_ERROR_INJECTION
@@ -1571,11 +1567,15 @@ config FAIL_FUNCTION
an error value and have to handle it. This is useful to test the
error handling in various subsystems.
-config FAULT_INJECTION_DEBUG_FS
- bool "Debugfs entries for fault-injection capabilities"
- depends on FAULT_INJECTION && SYSFS && DEBUG_FS
+config FAIL_MMC_REQUEST
+ bool "Fault-injection capability for MMC IO"
+ depends on FAULT_INJECTION_DEBUG_FS && MMC
help
- Enable configuration of fault-injection capabilities via debugfs.
+ Provide fault-injection capability for MMC IO.
+ This will make the mmc core return data errors. This is
+ useful to test the error handling in the mmc block device
+ and to test how the mmc host driver handles retries from
+ the block device.
config FAULT_INJECTION_STACKTRACE_FILTER
bool "stacktrace filter for fault-injection capabilities"
On Tue, 24 Apr 2018, Michal Hocko wrote:
> > > Wouldn't it be equally trivial to simply enable the fault injection? You
> > > would get additional failure paths testing as a bonus.
> >
> > The RHEL and Fedora debugging kernels are compiled with fault injection.
> > But the fault-injection framework will do nothing unless it is enabled by
> > a kernel parameter or debugfs write.
> >
> > Most users don't know about the fault injection kernel parameters or
> > debugfs files and won't enabled it. We need a CONFIG_ option to enable it
> > by default in the debugging kernels (and we could add a kernel parameter
> > to override the default, fine-tune the fallback probability etc.)
>
> If it is a real issue to install the debugging kernel with the required
> kernel parameter then I a config option for the default on makes sense
> to me.
Yes - the debug kernels use the same default kernel parameters as
non-debug kernels and it is expected that all debug features are enabled
by default.
Here I'm sending the patch using the fault-injection framework and the new
option CONFIG_FAIL_KVMALLOC_FALLBACK_PROBABILITY.
Mikulas
From: Mikulas Patocka <[email protected]>
Subject: [PATCH v4] fault-injection: introduce kvmalloc fallback options
This patch introduces a fault-injection option "kvmalloc_fallback". This
option makes kvmalloc randomly fall back to vmalloc.
Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
uses DMA-API on the returned memory or frees it with kfree. Such bugs were
found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
code. This options helps to test for these bugs.
The patch introduces a config option FAIL_KVMALLOC_FALLBACK_PROBABILITY.
It can be enabled in distribution debug kernels, so that kvmalloc abuse
can be tested by the users. The default can be overriden with
"kvmalloc_fallback" parameter or in /sys/kernel/debug/kvmalloc_fallback/.
Signed-off-by: Mikulas Patocka <[email protected]>
---
Documentation/fault-injection/fault-injection.txt | 7 +++++
include/linux/fault-inject.h | 9 +++---
kernel/futex.c | 2 -
lib/Kconfig.debug | 15 +++++++++++
mm/failslab.c | 2 -
mm/page_alloc.c | 2 -
mm/util.c | 30 ++++++++++++++++++++++
7 files changed, 60 insertions(+), 7 deletions(-)
Index: linux-2.6/Documentation/fault-injection/fault-injection.txt
===================================================================
--- linux-2.6.orig/Documentation/fault-injection/fault-injection.txt 2018-04-16 21:08:34.000000000 +0200
+++ linux-2.6/Documentation/fault-injection/fault-injection.txt 2018-04-25 21:36:36.000000000 +0200
@@ -15,6 +15,12 @@ o fail_page_alloc
injects page allocation failures. (alloc_pages(), get_free_pages(), ...)
+o kvmalloc_faillback
+
+ makes the function kvmalloc randonly fall back to vmalloc. This could be used
+ to detects bugs such as using DMA-API on the result of kvmalloc or freeing
+ the result of kvmalloc with free.
+
o fail_futex
injects futex deadlock and uaddr fault errors.
@@ -167,6 +173,7 @@ use the boot option:
failslab=
fail_page_alloc=
+ kvmalloc_faillback=
fail_make_request=
fail_futex=
mmc_core.fail_request=<interval>,<probability>,<space>,<times>
Index: linux-2.6/include/linux/fault-inject.h
===================================================================
--- linux-2.6.orig/include/linux/fault-inject.h 2018-04-16 21:08:36.000000000 +0200
+++ linux-2.6/include/linux/fault-inject.h 2018-04-25 21:38:22.000000000 +0200
@@ -31,17 +31,18 @@ struct fault_attr {
struct dentry *dname;
};
-#define FAULT_ATTR_INITIALIZER { \
+#define FAULT_ATTR_INITIALIZER(p) { \
+ .probability = (p), \
.interval = 1, \
- .times = ATOMIC_INIT(1), \
+ .times = ATOMIC_INIT((p) ? -1 : 1), \
+ .verbose = (p) ? 0 : 2, \
.require_end = ULONG_MAX, \
.stacktrace_depth = 32, \
.ratelimit_state = RATELIMIT_STATE_INIT_DISABLED, \
- .verbose = 2, \
.dname = NULL, \
}
-#define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER
+#define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER(0)
int setup_fault_attr(struct fault_attr *attr, char *str);
bool should_fail(struct fault_attr *attr, ssize_t size);
Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug 2018-04-25 15:56:16.000000000 +0200
+++ linux-2.6/lib/Kconfig.debug 2018-04-25 21:39:45.000000000 +0200
@@ -1527,6 +1527,21 @@ config FAIL_PAGE_ALLOC
help
Provide fault-injection capability for alloc_pages().
+config FAIL_KVMALLOC_FALLBACK_PROBABILITY
+ int "Default kvmalloc fallback probability"
+ depends on FAULT_INJECTION
+ range 0 100
+ default "0"
+ help
+ This option will make kvmalloc randomly fall back to vmalloc.
+ Normally, kvmalloc falls back to vmalloc only rarely, if memory
+ is fragmented.
+
+ This option helps to detect hard-to-reproduce driver bugs, for
+ example using DMA API on the result of kvmalloc.
+
+ The default may be overriden with the kvmalloc_faillback parameter.
+
config FAIL_MAKE_REQUEST
bool "Fault-injection capability for disk IO"
depends on FAULT_INJECTION && BLOCK
Index: linux-2.6/mm/util.c
===================================================================
--- linux-2.6.orig/mm/util.c 2018-04-25 15:48:39.000000000 +0200
+++ linux-2.6/mm/util.c 2018-04-25 21:43:31.000000000 +0200
@@ -14,6 +14,7 @@
#include <linux/hugetlb.h>
#include <linux/vmalloc.h>
#include <linux/userfaultfd_k.h>
+#include <linux/fault-inject.h>
#include <asm/sections.h>
#include <linux/uaccess.h>
@@ -377,6 +378,29 @@ unsigned long vm_mmap(struct file *file,
}
EXPORT_SYMBOL(vm_mmap);
+#ifdef CONFIG_FAULT_INJECTION
+
+static struct fault_attr kvmalloc_fallback =
+ FAULT_ATTR_INITIALIZER(CONFIG_FAIL_KVMALLOC_FALLBACK_PROBABILITY);
+
+static int __init setup_kvmalloc_fallback(char *str)
+{
+ return setup_fault_attr(&kvmalloc_fallback, str);
+}
+
+__setup("kvmalloc_fallback=", setup_kvmalloc_fallback);
+
+#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
+static int __init kvmalloc_fallback_debugfs_init(void)
+{
+ fault_create_debugfs_attr("kvmalloc_fallback", NULL, &kvmalloc_fallback);
+ return 0;
+}
+late_initcall(kvmalloc_fallback_debugfs_init);
+#endif
+
+#endif
+
/**
* kvmalloc_node - attempt to allocate physically contiguous memory, but upon
* failure, fall back to non-contiguous (vmalloc) allocation.
@@ -404,6 +428,11 @@ void *kvmalloc_node(size_t size, gfp_t f
*/
WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
+#ifdef CONFIG_FAULT_INJECTION
+ if (should_fail(&kvmalloc_fallback, size))
+ goto do_vmalloc;
+#endif
+
/*
* We want to attempt a large physically contiguous block first because
* it is less likely to fragment multiple larger blocks and therefore
@@ -427,6 +456,7 @@ void *kvmalloc_node(size_t size, gfp_t f
if (ret || size <= PAGE_SIZE)
return ret;
+do_vmalloc: __maybe_unused
return __vmalloc_node_flags_caller(size, node, flags,
__builtin_return_address(0));
}
Index: linux-2.6/kernel/futex.c
===================================================================
--- linux-2.6.orig/kernel/futex.c 2018-02-14 20:24:42.000000000 +0100
+++ linux-2.6/kernel/futex.c 2018-04-25 21:11:33.000000000 +0200
@@ -288,7 +288,7 @@ static struct {
bool ignore_private;
} fail_futex = {
- .attr = FAULT_ATTR_INITIALIZER,
+ .attr = FAULT_ATTR_INITIALIZER(0),
.ignore_private = false,
};
Index: linux-2.6/mm/failslab.c
===================================================================
--- linux-2.6.orig/mm/failslab.c 2018-04-16 21:08:36.000000000 +0200
+++ linux-2.6/mm/failslab.c 2018-04-25 21:11:40.000000000 +0200
@@ -9,7 +9,7 @@ static struct {
bool ignore_gfp_reclaim;
bool cache_filter;
} failslab = {
- .attr = FAULT_ATTR_INITIALIZER,
+ .attr = FAULT_ATTR_INITIALIZER(0),
.ignore_gfp_reclaim = true,
.cache_filter = false,
};
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2018-04-16 21:08:36.000000000 +0200
+++ linux-2.6/mm/page_alloc.c 2018-04-25 21:11:47.000000000 +0200
@@ -3055,7 +3055,7 @@ static struct {
bool ignore_gfp_reclaim;
u32 min_order;
} fail_page_alloc = {
- .attr = FAULT_ATTR_INITIALIZER,
+ .attr = FAULT_ATTR_INITIALIZER(0),
.ignore_gfp_reclaim = true,
.ignore_gfp_highmem = true,
.min_order = 1,
On 04/25/2018 01:02 PM, Mikulas Patocka wrote:
>
>
> From: Mikulas Patocka <[email protected]>
> Subject: [PATCH v4] fault-injection: introduce kvmalloc fallback options
>
> This patch introduces a fault-injection option "kvmalloc_fallback". This
> option makes kvmalloc randomly fall back to vmalloc.
>
> Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
Unfortunately,
> uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> code. This options helps to test for these bugs.
>
> The patch introduces a config option FAIL_KVMALLOC_FALLBACK_PROBABILITY.
> It can be enabled in distribution debug kernels, so that kvmalloc abuse
> can be tested by the users. The default can be overriden with
overridden
> "kvmalloc_fallback" parameter or in /sys/kernel/debug/kvmalloc_fallback/.
>
> Signed-off-by: Mikulas Patocka <[email protected]>
>
> ---
> Documentation/fault-injection/fault-injection.txt | 7 +++++
> include/linux/fault-inject.h | 9 +++---
> kernel/futex.c | 2 -
> lib/Kconfig.debug | 15 +++++++++++
> mm/failslab.c | 2 -
> mm/page_alloc.c | 2 -
> mm/util.c | 30 ++++++++++++++++++++++
> 7 files changed, 60 insertions(+), 7 deletions(-)
>
> Index: linux-2.6/Documentation/fault-injection/fault-injection.txt
> ===================================================================
> --- linux-2.6.orig/Documentation/fault-injection/fault-injection.txt 2018-04-16 21:08:34.000000000 +0200
> +++ linux-2.6/Documentation/fault-injection/fault-injection.txt 2018-04-25 21:36:36.000000000 +0200
> @@ -15,6 +15,12 @@ o fail_page_alloc
>
> injects page allocation failures. (alloc_pages(), get_free_pages(), ...)
>
> +o kvmalloc_faillback
kvmalloc_fallback
> +
> + makes the function kvmalloc randonly fall back to vmalloc. This could be used
randomly
> + to detects bugs such as using DMA-API on the result of kvmalloc or freeing
> + the result of kvmalloc with free.
> +
> o fail_futex
>
> injects futex deadlock and uaddr fault errors.
> @@ -167,6 +173,7 @@ use the boot option:
>
> failslab=
> fail_page_alloc=
> + kvmalloc_faillback=
kvmalloc_fallback=
> fail_make_request=
> fail_futex=
> mmc_core.fail_request=<interval>,<probability>,<space>,<times>
> Index: linux-2.6/lib/Kconfig.debug
> ===================================================================
> --- linux-2.6.orig/lib/Kconfig.debug 2018-04-25 15:56:16.000000000 +0200
> +++ linux-2.6/lib/Kconfig.debug 2018-04-25 21:39:45.000000000 +0200
> @@ -1527,6 +1527,21 @@ config FAIL_PAGE_ALLOC
> help
> Provide fault-injection capability for alloc_pages().
>
> +config FAIL_KVMALLOC_FALLBACK_PROBABILITY
> + int "Default kvmalloc fallback probability"
> + depends on FAULT_INJECTION
> + range 0 100
> + default "0"
> + help
> + This option will make kvmalloc randomly fall back to vmalloc.
> + Normally, kvmalloc falls back to vmalloc only rarely, if memory
> + is fragmented.
> +
> + This option helps to detect hard-to-reproduce driver bugs, for
> + example using DMA API on the result of kvmalloc.
> +
> + The default may be overriden with the kvmalloc_faillback parameter.
overridden kvmalloc_fallback
> +
> config FAIL_MAKE_REQUEST
> bool "Fault-injection capability for disk IO"
> depends on FAULT_INJECTION && BLOCK
--
~Randy
On Wed, 25 Apr 2018, Randy Dunlap wrote:
> On 04/25/2018 01:02 PM, Mikulas Patocka wrote:
> >
> >
> > From: Mikulas Patocka <[email protected]>
> > Subject: [PATCH v4] fault-injection: introduce kvmalloc fallback options
> >
> > This patch introduces a fault-injection option "kvmalloc_fallback". This
> > option makes kvmalloc randomly fall back to vmalloc.
> >
> > Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
>
> Unfortunately,
OK - here I fixed the typos:
From: Mikulas Patocka <[email protected]>
Subject: [PATCH] fault-injection: introduce kvmalloc fallback options
This patch introduces a fault-injection option "kvmalloc_fallback". This
option makes kvmalloc randomly fall back to vmalloc.
Unfortunately, some kernel code has bugs - it uses kvmalloc and then
uses DMA-API on the returned memory or frees it with kfree. Such bugs were
found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
code. This options helps to test for these bugs.
The patch introduces a config option FAIL_KVMALLOC_FALLBACK_PROBABILITY.
It can be enabled in distribution debug kernels, so that kvmalloc abuse
can be tested by the users. The default can be overridden with
"kvmalloc_fallback" parameter or in /sys/kernel/debug/kvmalloc_fallback/.
Signed-off-by: Mikulas Patocka <[email protected]>
---
Documentation/fault-injection/fault-injection.txt | 7 +++++
include/linux/fault-inject.h | 9 +++---
kernel/futex.c | 2 -
lib/Kconfig.debug | 15 +++++++++++
mm/failslab.c | 2 -
mm/page_alloc.c | 2 -
mm/util.c | 30 ++++++++++++++++++++++
7 files changed, 60 insertions(+), 7 deletions(-)
Index: linux-2.6/Documentation/fault-injection/fault-injection.txt
===================================================================
--- linux-2.6.orig/Documentation/fault-injection/fault-injection.txt 2018-04-16 21:08:34.000000000 +0200
+++ linux-2.6/Documentation/fault-injection/fault-injection.txt 2018-04-25 21:36:36.000000000 +0200
@@ -15,6 +15,12 @@ o fail_page_alloc
injects page allocation failures. (alloc_pages(), get_free_pages(), ...)
+o kvmalloc_fallback
+
+ makes the function kvmalloc randomly fall back to vmalloc. This could be used
+ to detects bugs such as using DMA-API on the result of kvmalloc or freeing
+ the result of kvmalloc with free.
+
o fail_futex
injects futex deadlock and uaddr fault errors.
@@ -167,6 +173,7 @@ use the boot option:
failslab=
fail_page_alloc=
+ kvmalloc_fallback=
fail_make_request=
fail_futex=
mmc_core.fail_request=<interval>,<probability>,<space>,<times>
Index: linux-2.6/include/linux/fault-inject.h
===================================================================
--- linux-2.6.orig/include/linux/fault-inject.h 2018-04-16 21:08:36.000000000 +0200
+++ linux-2.6/include/linux/fault-inject.h 2018-04-25 21:38:22.000000000 +0200
@@ -31,17 +31,18 @@ struct fault_attr {
struct dentry *dname;
};
-#define FAULT_ATTR_INITIALIZER { \
+#define FAULT_ATTR_INITIALIZER(p) { \
+ .probability = (p), \
.interval = 1, \
- .times = ATOMIC_INIT(1), \
+ .times = ATOMIC_INIT((p) ? -1 : 1), \
+ .verbose = (p) ? 0 : 2, \
.require_end = ULONG_MAX, \
.stacktrace_depth = 32, \
.ratelimit_state = RATELIMIT_STATE_INIT_DISABLED, \
- .verbose = 2, \
.dname = NULL, \
}
-#define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER
+#define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER(0)
int setup_fault_attr(struct fault_attr *attr, char *str);
bool should_fail(struct fault_attr *attr, ssize_t size);
Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug 2018-04-25 15:56:16.000000000 +0200
+++ linux-2.6/lib/Kconfig.debug 2018-04-25 21:39:45.000000000 +0200
@@ -1527,6 +1527,21 @@ config FAIL_PAGE_ALLOC
help
Provide fault-injection capability for alloc_pages().
+config FAIL_KVMALLOC_FALLBACK_PROBABILITY
+ int "Default kvmalloc fallback probability"
+ depends on FAULT_INJECTION
+ range 0 100
+ default "0"
+ help
+ This option will make kvmalloc randomly fall back to vmalloc.
+ Normally, kvmalloc falls back to vmalloc only rarely, if memory
+ is fragmented.
+
+ This option helps to detect hard-to-reproduce driver bugs, for
+ example using DMA API on the result of kvmalloc.
+
+ The default may be overridden with the kvmalloc_fallback parameter.
+
config FAIL_MAKE_REQUEST
bool "Fault-injection capability for disk IO"
depends on FAULT_INJECTION && BLOCK
Index: linux-2.6/mm/util.c
===================================================================
--- linux-2.6.orig/mm/util.c 2018-04-25 15:48:39.000000000 +0200
+++ linux-2.6/mm/util.c 2018-04-25 21:43:31.000000000 +0200
@@ -14,6 +14,7 @@
#include <linux/hugetlb.h>
#include <linux/vmalloc.h>
#include <linux/userfaultfd_k.h>
+#include <linux/fault-inject.h>
#include <asm/sections.h>
#include <linux/uaccess.h>
@@ -377,6 +378,29 @@ unsigned long vm_mmap(struct file *file,
}
EXPORT_SYMBOL(vm_mmap);
+#ifdef CONFIG_FAULT_INJECTION
+
+static struct fault_attr kvmalloc_fallback =
+ FAULT_ATTR_INITIALIZER(CONFIG_FAIL_KVMALLOC_FALLBACK_PROBABILITY);
+
+static int __init setup_kvmalloc_fallback(char *str)
+{
+ return setup_fault_attr(&kvmalloc_fallback, str);
+}
+
+__setup("kvmalloc_fallback=", setup_kvmalloc_fallback);
+
+#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
+static int __init kvmalloc_fallback_debugfs_init(void)
+{
+ fault_create_debugfs_attr("kvmalloc_fallback", NULL, &kvmalloc_fallback);
+ return 0;
+}
+late_initcall(kvmalloc_fallback_debugfs_init);
+#endif
+
+#endif
+
/**
* kvmalloc_node - attempt to allocate physically contiguous memory, but upon
* failure, fall back to non-contiguous (vmalloc) allocation.
@@ -404,6 +428,11 @@ void *kvmalloc_node(size_t size, gfp_t f
*/
WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
+#ifdef CONFIG_FAULT_INJECTION
+ if (should_fail(&kvmalloc_fallback, size))
+ goto do_vmalloc;
+#endif
+
/*
* We want to attempt a large physically contiguous block first because
* it is less likely to fragment multiple larger blocks and therefore
@@ -427,6 +456,7 @@ void *kvmalloc_node(size_t size, gfp_t f
if (ret || size <= PAGE_SIZE)
return ret;
+do_vmalloc: __maybe_unused
return __vmalloc_node_flags_caller(size, node, flags,
__builtin_return_address(0));
}
Index: linux-2.6/kernel/futex.c
===================================================================
--- linux-2.6.orig/kernel/futex.c 2018-02-14 20:24:42.000000000 +0100
+++ linux-2.6/kernel/futex.c 2018-04-25 21:11:33.000000000 +0200
@@ -288,7 +288,7 @@ static struct {
bool ignore_private;
} fail_futex = {
- .attr = FAULT_ATTR_INITIALIZER,
+ .attr = FAULT_ATTR_INITIALIZER(0),
.ignore_private = false,
};
Index: linux-2.6/mm/failslab.c
===================================================================
--- linux-2.6.orig/mm/failslab.c 2018-04-16 21:08:36.000000000 +0200
+++ linux-2.6/mm/failslab.c 2018-04-25 21:11:40.000000000 +0200
@@ -9,7 +9,7 @@ static struct {
bool ignore_gfp_reclaim;
bool cache_filter;
} failslab = {
- .attr = FAULT_ATTR_INITIALIZER,
+ .attr = FAULT_ATTR_INITIALIZER(0),
.ignore_gfp_reclaim = true,
.cache_filter = false,
};
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2018-04-16 21:08:36.000000000 +0200
+++ linux-2.6/mm/page_alloc.c 2018-04-25 21:11:47.000000000 +0200
@@ -3055,7 +3055,7 @@ static struct {
bool ignore_gfp_reclaim;
u32 min_order;
} fail_page_alloc = {
- .attr = FAULT_ATTR_INITIALIZER,
+ .attr = FAULT_ATTR_INITIALIZER(0),
.ignore_gfp_reclaim = true,
.ignore_gfp_highmem = true,
.min_order = 1,
On 04/25/2018 01:57 PM, Mikulas Patocka wrote:
>
>
> On Wed, 25 Apr 2018, Randy Dunlap wrote:
>
>> On 04/25/2018 01:02 PM, Mikulas Patocka wrote:
>>>
>>>
>>> From: Mikulas Patocka <[email protected]>
>>> Subject: [PATCH v4] fault-injection: introduce kvmalloc fallback options
>>>
>>> This patch introduces a fault-injection option "kvmalloc_fallback". This
>>> option makes kvmalloc randomly fall back to vmalloc.
>>>
>>> Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
>>
>> Unfortunately,
>
> OK - here I fixed the typos:
>
>
> From: Mikulas Patocka <[email protected]>
> Subject: [PATCH] fault-injection: introduce kvmalloc fallback options
>
> This patch introduces a fault-injection option "kvmalloc_fallback". This
> option makes kvmalloc randomly fall back to vmalloc.
>
> Unfortunately, some kernel code has bugs - it uses kvmalloc and then
> uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> code. This options helps to test for these bugs.
>
> The patch introduces a config option FAIL_KVMALLOC_FALLBACK_PROBABILITY.
> It can be enabled in distribution debug kernels, so that kvmalloc abuse
> can be tested by the users. The default can be overridden with
> "kvmalloc_fallback" parameter or in /sys/kernel/debug/kvmalloc_fallback/.
>
> Signed-off-by: Mikulas Patocka <[email protected]>
>
> ---
> Documentation/fault-injection/fault-injection.txt | 7 +++++
> include/linux/fault-inject.h | 9 +++---
> kernel/futex.c | 2 -
> lib/Kconfig.debug | 15 +++++++++++
> mm/failslab.c | 2 -
> mm/page_alloc.c | 2 -
> mm/util.c | 30 ++++++++++++++++++++++
> 7 files changed, 60 insertions(+), 7 deletions(-)
Acked-by: Randy Dunlap <[email protected]> # Documentation and Kconfig only
thanks.
--
~Randy
On Wed, 25 Apr 2018, Mikulas Patocka wrote:
> From: Mikulas Patocka <[email protected]>
> Subject: [PATCH] fault-injection: introduce kvmalloc fallback options
>
> This patch introduces a fault-injection option "kvmalloc_fallback". This
> option makes kvmalloc randomly fall back to vmalloc.
>
> Unfortunately, some kernel code has bugs - it uses kvmalloc and then
> uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> code. This options helps to test for these bugs.
>
> The patch introduces a config option FAIL_KVMALLOC_FALLBACK_PROBABILITY.
> It can be enabled in distribution debug kernels, so that kvmalloc abuse
> can be tested by the users. The default can be overridden with
> "kvmalloc_fallback" parameter or in /sys/kernel/debug/kvmalloc_fallback/.
>
Do we really need the new config option? This could just be manually
tunable via fault injection IIUC.
> Signed-off-by: Mikulas Patocka <[email protected]>
>
> ---
> Documentation/fault-injection/fault-injection.txt | 7 +++++
> include/linux/fault-inject.h | 9 +++---
> kernel/futex.c | 2 -
> lib/Kconfig.debug | 15 +++++++++++
> mm/failslab.c | 2 -
> mm/page_alloc.c | 2 -
> mm/util.c | 30 ++++++++++++++++++++++
> 7 files changed, 60 insertions(+), 7 deletions(-)
>
> Index: linux-2.6/Documentation/fault-injection/fault-injection.txt
> ===================================================================
> --- linux-2.6.orig/Documentation/fault-injection/fault-injection.txt 2018-04-16 21:08:34.000000000 +0200
> +++ linux-2.6/Documentation/fault-injection/fault-injection.txt 2018-04-25 21:36:36.000000000 +0200
> @@ -15,6 +15,12 @@ o fail_page_alloc
>
> injects page allocation failures. (alloc_pages(), get_free_pages(), ...)
>
> +o kvmalloc_fallback
> +
> + makes the function kvmalloc randomly fall back to vmalloc. This could be used
> + to detects bugs such as using DMA-API on the result of kvmalloc or freeing
> + the result of kvmalloc with free.
> +
> o fail_futex
>
> injects futex deadlock and uaddr fault errors.
> @@ -167,6 +173,7 @@ use the boot option:
>
> failslab=
> fail_page_alloc=
> + kvmalloc_fallback=
> fail_make_request=
> fail_futex=
> mmc_core.fail_request=<interval>,<probability>,<space>,<times>
> Index: linux-2.6/include/linux/fault-inject.h
> ===================================================================
> --- linux-2.6.orig/include/linux/fault-inject.h 2018-04-16 21:08:36.000000000 +0200
> +++ linux-2.6/include/linux/fault-inject.h 2018-04-25 21:38:22.000000000 +0200
> @@ -31,17 +31,18 @@ struct fault_attr {
> struct dentry *dname;
> };
>
> -#define FAULT_ATTR_INITIALIZER { \
> +#define FAULT_ATTR_INITIALIZER(p) { \
> + .probability = (p), \
> .interval = 1, \
> - .times = ATOMIC_INIT(1), \
> + .times = ATOMIC_INIT((p) ? -1 : 1), \
> + .verbose = (p) ? 0 : 2, \
> .require_end = ULONG_MAX, \
> .stacktrace_depth = 32, \
> .ratelimit_state = RATELIMIT_STATE_INIT_DISABLED, \
> - .verbose = 2, \
> .dname = NULL, \
> }
>
> -#define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER
> +#define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER(0)
> int setup_fault_attr(struct fault_attr *attr, char *str);
> bool should_fail(struct fault_attr *attr, ssize_t size);
>
> Index: linux-2.6/lib/Kconfig.debug
> ===================================================================
> --- linux-2.6.orig/lib/Kconfig.debug 2018-04-25 15:56:16.000000000 +0200
> +++ linux-2.6/lib/Kconfig.debug 2018-04-25 21:39:45.000000000 +0200
> @@ -1527,6 +1527,21 @@ config FAIL_PAGE_ALLOC
> help
> Provide fault-injection capability for alloc_pages().
>
> +config FAIL_KVMALLOC_FALLBACK_PROBABILITY
> + int "Default kvmalloc fallback probability"
> + depends on FAULT_INJECTION
> + range 0 100
> + default "0"
> + help
> + This option will make kvmalloc randomly fall back to vmalloc.
> + Normally, kvmalloc falls back to vmalloc only rarely, if memory
> + is fragmented.
> +
> + This option helps to detect hard-to-reproduce driver bugs, for
> + example using DMA API on the result of kvmalloc.
> +
> + The default may be overridden with the kvmalloc_fallback parameter.
> +
> config FAIL_MAKE_REQUEST
> bool "Fault-injection capability for disk IO"
> depends on FAULT_INJECTION && BLOCK
> Index: linux-2.6/mm/util.c
> ===================================================================
> --- linux-2.6.orig/mm/util.c 2018-04-25 15:48:39.000000000 +0200
> +++ linux-2.6/mm/util.c 2018-04-25 21:43:31.000000000 +0200
> @@ -14,6 +14,7 @@
> #include <linux/hugetlb.h>
> #include <linux/vmalloc.h>
> #include <linux/userfaultfd_k.h>
> +#include <linux/fault-inject.h>
>
> #include <asm/sections.h>
> #include <linux/uaccess.h>
> @@ -377,6 +378,29 @@ unsigned long vm_mmap(struct file *file,
> }
> EXPORT_SYMBOL(vm_mmap);
>
> +#ifdef CONFIG_FAULT_INJECTION
> +
> +static struct fault_attr kvmalloc_fallback =
> + FAULT_ATTR_INITIALIZER(CONFIG_FAIL_KVMALLOC_FALLBACK_PROBABILITY);
> +
> +static int __init setup_kvmalloc_fallback(char *str)
> +{
> + return setup_fault_attr(&kvmalloc_fallback, str);
> +}
> +
> +__setup("kvmalloc_fallback=", setup_kvmalloc_fallback);
> +
> +#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
> +static int __init kvmalloc_fallback_debugfs_init(void)
> +{
> + fault_create_debugfs_attr("kvmalloc_fallback", NULL, &kvmalloc_fallback);
> + return 0;
> +}
> +late_initcall(kvmalloc_fallback_debugfs_init);
> +#endif
> +
> +#endif
> +
> /**
> * kvmalloc_node - attempt to allocate physically contiguous memory, but upon
> * failure, fall back to non-contiguous (vmalloc) allocation.
> @@ -404,6 +428,11 @@ void *kvmalloc_node(size_t size, gfp_t f
> */
> WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
>
> +#ifdef CONFIG_FAULT_INJECTION
> + if (should_fail(&kvmalloc_fallback, size))
> + goto do_vmalloc;
> +#endif
> +
> /*
> * We want to attempt a large physically contiguous block first because
> * it is less likely to fragment multiple larger blocks and therefore
> @@ -427,6 +456,7 @@ void *kvmalloc_node(size_t size, gfp_t f
> if (ret || size <= PAGE_SIZE)
> return ret;
>
> +do_vmalloc: __maybe_unused
> return __vmalloc_node_flags_caller(size, node, flags,
> __builtin_return_address(0));
> }
> Index: linux-2.6/kernel/futex.c
> ===================================================================
> --- linux-2.6.orig/kernel/futex.c 2018-02-14 20:24:42.000000000 +0100
> +++ linux-2.6/kernel/futex.c 2018-04-25 21:11:33.000000000 +0200
> @@ -288,7 +288,7 @@ static struct {
>
> bool ignore_private;
> } fail_futex = {
> - .attr = FAULT_ATTR_INITIALIZER,
> + .attr = FAULT_ATTR_INITIALIZER(0),
> .ignore_private = false,
> };
>
> Index: linux-2.6/mm/failslab.c
> ===================================================================
> --- linux-2.6.orig/mm/failslab.c 2018-04-16 21:08:36.000000000 +0200
> +++ linux-2.6/mm/failslab.c 2018-04-25 21:11:40.000000000 +0200
> @@ -9,7 +9,7 @@ static struct {
> bool ignore_gfp_reclaim;
> bool cache_filter;
> } failslab = {
> - .attr = FAULT_ATTR_INITIALIZER,
> + .attr = FAULT_ATTR_INITIALIZER(0),
> .ignore_gfp_reclaim = true,
> .cache_filter = false,
> };
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c 2018-04-16 21:08:36.000000000 +0200
> +++ linux-2.6/mm/page_alloc.c 2018-04-25 21:11:47.000000000 +0200
> @@ -3055,7 +3055,7 @@ static struct {
> bool ignore_gfp_reclaim;
> u32 min_order;
> } fail_page_alloc = {
> - .attr = FAULT_ATTR_INITIALIZER,
> + .attr = FAULT_ATTR_INITIALIZER(0),
> .ignore_gfp_reclaim = true,
> .ignore_gfp_highmem = true,
> .min_order = 1,
>
>
On Wed, 25 Apr 2018, David Rientjes wrote:
> On Wed, 25 Apr 2018, Mikulas Patocka wrote:
>
> > From: Mikulas Patocka <[email protected]>
> > Subject: [PATCH] fault-injection: introduce kvmalloc fallback options
> >
> > This patch introduces a fault-injection option "kvmalloc_fallback". This
> > option makes kvmalloc randomly fall back to vmalloc.
> >
> > Unfortunately, some kernel code has bugs - it uses kvmalloc and then
> > uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> > found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> > code. This options helps to test for these bugs.
> >
> > The patch introduces a config option FAIL_KVMALLOC_FALLBACK_PROBABILITY.
> > It can be enabled in distribution debug kernels, so that kvmalloc abuse
> > can be tested by the users. The default can be overridden with
> > "kvmalloc_fallback" parameter or in /sys/kernel/debug/kvmalloc_fallback/.
> >
>
> Do we really need the new config option? This could just be manually
> tunable via fault injection IIUC.
We do, because we want to enable it in RHEL and Fedora debugging kernels,
so that it will be tested by the users.
The users won't use some extra magic kernel options or debugfs files.
Mikulas
> > Signed-off-by: Mikulas Patocka <[email protected]>
> >
> > ---
> > Documentation/fault-injection/fault-injection.txt | 7 +++++
> > include/linux/fault-inject.h | 9 +++---
> > kernel/futex.c | 2 -
> > lib/Kconfig.debug | 15 +++++++++++
> > mm/failslab.c | 2 -
> > mm/page_alloc.c | 2 -
> > mm/util.c | 30 ++++++++++++++++++++++
> > 7 files changed, 60 insertions(+), 7 deletions(-)
> >
> > Index: linux-2.6/Documentation/fault-injection/fault-injection.txt
> > ===================================================================
> > --- linux-2.6.orig/Documentation/fault-injection/fault-injection.txt 2018-04-16 21:08:34.000000000 +0200
> > +++ linux-2.6/Documentation/fault-injection/fault-injection.txt 2018-04-25 21:36:36.000000000 +0200
> > @@ -15,6 +15,12 @@ o fail_page_alloc
> >
> > injects page allocation failures. (alloc_pages(), get_free_pages(), ...)
> >
> > +o kvmalloc_fallback
> > +
> > + makes the function kvmalloc randomly fall back to vmalloc. This could be used
> > + to detects bugs such as using DMA-API on the result of kvmalloc or freeing
> > + the result of kvmalloc with free.
> > +
> > o fail_futex
> >
> > injects futex deadlock and uaddr fault errors.
> > @@ -167,6 +173,7 @@ use the boot option:
> >
> > failslab=
> > fail_page_alloc=
> > + kvmalloc_fallback=
> > fail_make_request=
> > fail_futex=
> > mmc_core.fail_request=<interval>,<probability>,<space>,<times>
> > Index: linux-2.6/include/linux/fault-inject.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/fault-inject.h 2018-04-16 21:08:36.000000000 +0200
> > +++ linux-2.6/include/linux/fault-inject.h 2018-04-25 21:38:22.000000000 +0200
> > @@ -31,17 +31,18 @@ struct fault_attr {
> > struct dentry *dname;
> > };
> >
> > -#define FAULT_ATTR_INITIALIZER { \
> > +#define FAULT_ATTR_INITIALIZER(p) { \
> > + .probability = (p), \
> > .interval = 1, \
> > - .times = ATOMIC_INIT(1), \
> > + .times = ATOMIC_INIT((p) ? -1 : 1), \
> > + .verbose = (p) ? 0 : 2, \
> > .require_end = ULONG_MAX, \
> > .stacktrace_depth = 32, \
> > .ratelimit_state = RATELIMIT_STATE_INIT_DISABLED, \
> > - .verbose = 2, \
> > .dname = NULL, \
> > }
> >
> > -#define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER
> > +#define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER(0)
> > int setup_fault_attr(struct fault_attr *attr, char *str);
> > bool should_fail(struct fault_attr *attr, ssize_t size);
> >
> > Index: linux-2.6/lib/Kconfig.debug
> > ===================================================================
> > --- linux-2.6.orig/lib/Kconfig.debug 2018-04-25 15:56:16.000000000 +0200
> > +++ linux-2.6/lib/Kconfig.debug 2018-04-25 21:39:45.000000000 +0200
> > @@ -1527,6 +1527,21 @@ config FAIL_PAGE_ALLOC
> > help
> > Provide fault-injection capability for alloc_pages().
> >
> > +config FAIL_KVMALLOC_FALLBACK_PROBABILITY
> > + int "Default kvmalloc fallback probability"
> > + depends on FAULT_INJECTION
> > + range 0 100
> > + default "0"
> > + help
> > + This option will make kvmalloc randomly fall back to vmalloc.
> > + Normally, kvmalloc falls back to vmalloc only rarely, if memory
> > + is fragmented.
> > +
> > + This option helps to detect hard-to-reproduce driver bugs, for
> > + example using DMA API on the result of kvmalloc.
> > +
> > + The default may be overridden with the kvmalloc_fallback parameter.
> > +
> > config FAIL_MAKE_REQUEST
> > bool "Fault-injection capability for disk IO"
> > depends on FAULT_INJECTION && BLOCK
> > Index: linux-2.6/mm/util.c
> > ===================================================================
> > --- linux-2.6.orig/mm/util.c 2018-04-25 15:48:39.000000000 +0200
> > +++ linux-2.6/mm/util.c 2018-04-25 21:43:31.000000000 +0200
> > @@ -14,6 +14,7 @@
> > #include <linux/hugetlb.h>
> > #include <linux/vmalloc.h>
> > #include <linux/userfaultfd_k.h>
> > +#include <linux/fault-inject.h>
> >
> > #include <asm/sections.h>
> > #include <linux/uaccess.h>
> > @@ -377,6 +378,29 @@ unsigned long vm_mmap(struct file *file,
> > }
> > EXPORT_SYMBOL(vm_mmap);
> >
> > +#ifdef CONFIG_FAULT_INJECTION
> > +
> > +static struct fault_attr kvmalloc_fallback =
> > + FAULT_ATTR_INITIALIZER(CONFIG_FAIL_KVMALLOC_FALLBACK_PROBABILITY);
> > +
> > +static int __init setup_kvmalloc_fallback(char *str)
> > +{
> > + return setup_fault_attr(&kvmalloc_fallback, str);
> > +}
> > +
> > +__setup("kvmalloc_fallback=", setup_kvmalloc_fallback);
> > +
> > +#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
> > +static int __init kvmalloc_fallback_debugfs_init(void)
> > +{
> > + fault_create_debugfs_attr("kvmalloc_fallback", NULL, &kvmalloc_fallback);
> > + return 0;
> > +}
> > +late_initcall(kvmalloc_fallback_debugfs_init);
> > +#endif
> > +
> > +#endif
> > +
> > /**
> > * kvmalloc_node - attempt to allocate physically contiguous memory, but upon
> > * failure, fall back to non-contiguous (vmalloc) allocation.
> > @@ -404,6 +428,11 @@ void *kvmalloc_node(size_t size, gfp_t f
> > */
> > WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> >
> > +#ifdef CONFIG_FAULT_INJECTION
> > + if (should_fail(&kvmalloc_fallback, size))
> > + goto do_vmalloc;
> > +#endif
> > +
> > /*
> > * We want to attempt a large physically contiguous block first because
> > * it is less likely to fragment multiple larger blocks and therefore
> > @@ -427,6 +456,7 @@ void *kvmalloc_node(size_t size, gfp_t f
> > if (ret || size <= PAGE_SIZE)
> > return ret;
> >
> > +do_vmalloc: __maybe_unused
> > return __vmalloc_node_flags_caller(size, node, flags,
> > __builtin_return_address(0));
> > }
> > Index: linux-2.6/kernel/futex.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/futex.c 2018-02-14 20:24:42.000000000 +0100
> > +++ linux-2.6/kernel/futex.c 2018-04-25 21:11:33.000000000 +0200
> > @@ -288,7 +288,7 @@ static struct {
> >
> > bool ignore_private;
> > } fail_futex = {
> > - .attr = FAULT_ATTR_INITIALIZER,
> > + .attr = FAULT_ATTR_INITIALIZER(0),
> > .ignore_private = false,
> > };
> >
> > Index: linux-2.6/mm/failslab.c
> > ===================================================================
> > --- linux-2.6.orig/mm/failslab.c 2018-04-16 21:08:36.000000000 +0200
> > +++ linux-2.6/mm/failslab.c 2018-04-25 21:11:40.000000000 +0200
> > @@ -9,7 +9,7 @@ static struct {
> > bool ignore_gfp_reclaim;
> > bool cache_filter;
> > } failslab = {
> > - .attr = FAULT_ATTR_INITIALIZER,
> > + .attr = FAULT_ATTR_INITIALIZER(0),
> > .ignore_gfp_reclaim = true,
> > .cache_filter = false,
> > };
> > Index: linux-2.6/mm/page_alloc.c
> > ===================================================================
> > --- linux-2.6.orig/mm/page_alloc.c 2018-04-16 21:08:36.000000000 +0200
> > +++ linux-2.6/mm/page_alloc.c 2018-04-25 21:11:47.000000000 +0200
> > @@ -3055,7 +3055,7 @@ static struct {
> > bool ignore_gfp_reclaim;
> > u32 min_order;
> > } fail_page_alloc = {
> > - .attr = FAULT_ATTR_INITIALIZER,
> > + .attr = FAULT_ATTR_INITIALIZER(0),
> > .ignore_gfp_reclaim = true,
> > .ignore_gfp_highmem = true,
> > .min_order = 1,
> >
> >
>
On Wed, 2018-04-25 at 17:22 -0400, Mikulas Patocka wrote:
>
> On Wed, 25 Apr 2018, David Rientjes wrote:
>
> > On Wed, 25 Apr 2018, Mikulas Patocka wrote:
> >
> > > From: Mikulas Patocka <[email protected]>
> > > Subject: [PATCH] fault-injection: introduce kvmalloc fallback
> > > options
> > >
> > > This patch introduces a fault-injection option
> > > "kvmalloc_fallback". This option makes kvmalloc randomly fall
> > > back to vmalloc.
> > >
> > > Unfortunately, some kernel code has bugs - it uses kvmalloc and
> > > then uses DMA-API on the returned memory or frees it with kfree.
> > > Such bugs were found in the virtio-net driver, dm-integrity or
> > > RHEL7 powerpc-specific code. This options helps to test for these
> > > bugs.
> > >
> > > The patch introduces a config option
> > > FAIL_KVMALLOC_FALLBACK_PROBABILITY.
> > > It can be enabled in distribution debug kernels, so that kvmalloc
> > > abuse can be tested by the users. The default can be overridden
> > > with "kvmalloc_fallback" parameter or in
> > > /sys/kernel/debug/kvmalloc_fallback/.
> > >
> >
> > Do we really need the new config option?  This could just be
> > manually tunable via fault injection IIUC.
>
> We do, because we want to enable it in RHEL and Fedora debugging
> kernels, so that it will be tested by the users.
>
> The users won't use some extra magic kernel options or debugfs files.
If it can be enabled via a tunable, then the distro can turn it on
without the user having to do anything. If you want to present the
user with a different boot option, you can (just have the tunable set
on the command line), but being tunable driven means that you don't
have to choose that option, you could automatically enable it under a
range of circumstances. I think most sane distributions would want
that flexibility.
Kconfig proliferation, conversely, is a bit of a nightmare from both
the user and the tester's point of view, so we're trying to avoid it
unless absolutely necessary.
James
On Wed, 25 Apr 2018, James Bottomley wrote:
> On Wed, 2018-04-25 at 17:22 -0400, Mikulas Patocka wrote:
> >
> > On Wed, 25 Apr 2018, David Rientjes wrote:
> > >
> > > Do we really need the new config option?  This could just be
> > > manually tunable via fault injection IIUC.
> >
> > We do, because we want to enable it in RHEL and Fedora debugging
> > kernels, so that it will be tested by the users.
> >
> > The users won't use some extra magic kernel options or debugfs files.
>
> If it can be enabled via a tunable, then the distro can turn it on
> without the user having to do anything.
You need to enable it on boot. Enabling it when the kernel starts to
execute userspace code is already too late (because you would miss
kvmalloc calls in the kernel boot path).
These are files in the kernel-debug rpm package. Where would you put the
extra kernel parameter to enable this feature? None of these files contain
kernel parameters.
kernel-debug /boot/.vmlinuz-3.10.0-693.21.1.el7.x86_64.debug.hmac
kernel-debug /boot/System.map-3.10.0-693.21.1.el7.x86_64.debug
kernel-debug /boot/config-3.10.0-693.21.1.el7.x86_64.debug
kernel-debug /boot/initramfs-3.10.0-693.21.1.el7.x86_64.debug.img
kernel-debug /boot/symvers-3.10.0-693.21.1.el7.x86_64.debug.gz
kernel-debug /boot/vmlinuz-3.10.0-693.21.1.el7.x86_64.debug
kernel-debug /etc/ld.so.conf.d/kernel-3.10.0-693.21.1.el7.x86_64.debug.conf
kernel-debug /lib/modules/3.10.0-693.21.1.el7.x86_64.debug
> If you want to present the user with a different boot option, you can
> (just have the tunable set on the command line), but being tunable
> driven means that you don't have to choose that option, you could
> automatically enable it under a range of circumstances. I think most
> sane distributions would want that flexibility.
>
> Kconfig proliferation, conversely, is a bit of a nightmare from both
> the user and the tester's point of view, so we're trying to avoid it
> unless absolutely necessary.
>
> James
I already offered that we don't need to introduce a new kernel option and
we can bind this feature to any other kernel option, that is enabled in
the debug kernel, for example CONFIG_DEBUG_SG. Michal said no and he said
that he wants a new kernel option instead.
Mikulas
On Wed, 25 Apr 2018, Mikulas Patocka wrote:
> You need to enable it on boot. Enabling it when the kernel starts to
> execute userspace code is already too late (because you would miss
> kvmalloc calls in the kernel boot path).
>
Is your motivation that since kvmalloc() never falls back to vmalloc() on
boot because fragmentation is not be an issue at boot that we should catch
bugs where it would matter if it had fallen back? If we are worrying
about falling back to vmalloc before even initscripts have run I think we
have bigger problems.
On Wed, 25 Apr 2018, David Rientjes wrote:
> On Wed, 25 Apr 2018, Mikulas Patocka wrote:
>
> > You need to enable it on boot. Enabling it when the kernel starts to
> > execute userspace code is already too late (because you would miss
> > kvmalloc calls in the kernel boot path).
>
> Is your motivation that since kvmalloc() never falls back to vmalloc() on
> boot because fragmentation is not be an issue at boot that we should catch
> bugs where it would matter if it had fallen back? If we are worrying
> about falling back to vmalloc before even initscripts have run I think we
> have bigger problems.
The same driver can be compiled directly into the kernel or be loaded as a
module. If the user (or the person preparing distro kernel) compiles the
driver directly into the kernel, kvmalloc should be tested on that driver,
because a different user or distribution can compile that driver as a
module.
Mikulas
On Wed, 25 Apr 2018, James Bottomley wrote:
> > > Do we really need the new config option?  This could just be
> > > manually tunable via fault injection IIUC.
> >
> > We do, because we want to enable it in RHEL and Fedora debugging
> > kernels, so that it will be tested by the users.
> >
> > The users won't use some extra magic kernel options or debugfs files.
>
> If it can be enabled via a tunable, then the distro can turn it on
> without the user having to do anything. If you want to present the
> user with a different boot option, you can (just have the tunable set
> on the command line), but being tunable driven means that you don't
> have to choose that option, you could automatically enable it under a
> range of circumstances. I think most sane distributions would want
> that flexibility.
>
> Kconfig proliferation, conversely, is a bit of a nightmare from both
> the user and the tester's point of view, so we're trying to avoid it
> unless absolutely necessary.
>
> James
BTW. even developers who compile their own kernel should have this enabled
by a CONFIG option - because if the developer sees the option when
browsing through menuconfig, he may enable it. If he doesn't see the
option, he won't even know that such an option exists.
Mikulas
On Wed, 2018-04-25 at 19:00 -0400, Mikulas Patocka wrote:
>
> On Wed, 25 Apr 2018, James Bottomley wrote:
>
> > > > Do we really need the new config option?  This could just be
> > > > manually tunable via fault injection IIUC.
> > >Â
> > > We do, because we want to enable it in RHEL and Fedora debugging
> > > kernels, so that it will be tested by the users.
> > >Â
> > > The users won't use some extra magic kernel options or debugfs
> files.
> >Â
> > If it can be enabled via a tunable, then the distro can turn it on
> > without the user having to do anything. If you want to present the
> > user with a different boot option, you can (just have the tunable
> set
> > on the command line), but being tunable driven means that you don't
> > have to choose that option, you could automatically enable it under
> a
> > range of circumstances. I think most sane distributions would want
> > that flexibility.
> >Â
> > Kconfig proliferation, conversely, is a bit of a nightmare from
> both
> > the user and the tester's point of view, so we're trying to avoid
> it
> > unless absolutely necessary.
> >Â
> > James
>
> BTW. even developers who compile their own kernel should have this
> enabled by a CONFIG option - because if the developer sees the option
> when browsing through menuconfig, he may enable it. If he doesn't see
> the option, he won't even know that such an option exists.
I may be an atypical developer but I'd rather have a root canal than
browse through menuconfig options. The way to get people to learn
about new debugging options is to blog about it (or write an lwn.net
article) which google will find the next time I ask it how I debug XXX.
Google (probably as a service to humanity) rarely turns up Kconfig
options in response to a query.
James
On 04/25/2018 01:02 PM, Mikulas Patocka wrote:
> This patch reorders Kconfig entries, so that menuconfig displays proper
> indentation.
>
> Signed-off-by: Mikulas Patocka <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Tested-by: Randy Dunlap <[email protected]>
Thanks.
> ---
> lib/Kconfig.debug | 36 ++++++++++++++++++------------------
> 1 file changed, 18 insertions(+), 18 deletions(-)
>
> Index: linux-2.6/lib/Kconfig.debug
> ===================================================================
> --- linux-2.6.orig/lib/Kconfig.debug 2018-04-16 21:08:36.000000000 +0200
> +++ linux-2.6/lib/Kconfig.debug 2018-04-25 15:56:16.000000000 +0200
> @@ -1503,6 +1503,10 @@ config NETDEV_NOTIFIER_ERROR_INJECT
>
> If unsure, say N.
>
> +config FUNCTION_ERROR_INJECTION
> + def_bool y
> + depends on HAVE_FUNCTION_ERROR_INJECTION && KPROBES
> +
> config FAULT_INJECTION
> bool "Fault-injection framework"
> depends on DEBUG_KERNEL
> @@ -1510,10 +1514,6 @@ config FAULT_INJECTION
> Provide fault-injection framework.
> For more details, see Documentation/fault-injection/.
>
> -config FUNCTION_ERROR_INJECTION
> - def_bool y
> - depends on HAVE_FUNCTION_ERROR_INJECTION && KPROBES
> -
> config FAILSLAB
> bool "Fault-injection capability for kmalloc"
> depends on FAULT_INJECTION
> @@ -1544,16 +1544,6 @@ config FAIL_IO_TIMEOUT
> Only works with drivers that use the generic timeout handling,
> for others it wont do anything.
>
> -config FAIL_MMC_REQUEST
> - bool "Fault-injection capability for MMC IO"
> - depends on FAULT_INJECTION_DEBUG_FS && MMC
> - help
> - Provide fault-injection capability for MMC IO.
> - This will make the mmc core return data errors. This is
> - useful to test the error handling in the mmc block device
> - and to test how the mmc host driver handles retries from
> - the block device.
> -
> config FAIL_FUTEX
> bool "Fault-injection capability for futexes"
> select DEBUG_FS
> @@ -1561,6 +1551,12 @@ config FAIL_FUTEX
> help
> Provide fault-injection capability for futexes.
>
> +config FAULT_INJECTION_DEBUG_FS
> + bool "Debugfs entries for fault-injection capabilities"
> + depends on FAULT_INJECTION && SYSFS && DEBUG_FS
> + help
> + Enable configuration of fault-injection capabilities via debugfs.
> +
> config FAIL_FUNCTION
> bool "Fault-injection capability for functions"
> depends on FAULT_INJECTION_DEBUG_FS && FUNCTION_ERROR_INJECTION
> @@ -1571,11 +1567,15 @@ config FAIL_FUNCTION
> an error value and have to handle it. This is useful to test the
> error handling in various subsystems.
>
> -config FAULT_INJECTION_DEBUG_FS
> - bool "Debugfs entries for fault-injection capabilities"
> - depends on FAULT_INJECTION && SYSFS && DEBUG_FS
> +config FAIL_MMC_REQUEST
> + bool "Fault-injection capability for MMC IO"
> + depends on FAULT_INJECTION_DEBUG_FS && MMC
> help
> - Enable configuration of fault-injection capabilities via debugfs.
> + Provide fault-injection capability for MMC IO.
> + This will make the mmc core return data errors. This is
> + useful to test the error handling in the mmc block device
> + and to test how the mmc host driver handles retries from
> + the block device.
>
> config FAULT_INJECTION_STACKTRACE_FILTER
> bool "stacktrace filter for fault-injection capabilities"
>
--
~Randy
On Wed 25-04-18 18:42:57, Mikulas Patocka wrote:
>
>
> On Wed, 25 Apr 2018, James Bottomley wrote:
[...]
> > Kconfig proliferation, conversely, is a bit of a nightmare from both
> > the user and the tester's point of view, so we're trying to avoid it
> > unless absolutely necessary.
> >
> > James
>
> I already offered that we don't need to introduce a new kernel option and
> we can bind this feature to any other kernel option, that is enabled in
> the debug kernel, for example CONFIG_DEBUG_SG. Michal said no and he said
> that he wants a new kernel option instead.
Just for the record. I didn't say I _want_ a config option. Do not
misinterpret my words. I've said that a config option would be
acceptable if there is no way to deliver the functionality via kernel
package automatically. You haven't provided any argument that would
explain why the kernel package cannot add a boot option. Maybe there are
some but I do not see them right now.
--
Michal Hocko
SUSE Labs
On Thu, 26 Apr 2018, Michal Hocko wrote:
> On Wed 25-04-18 18:42:57, Mikulas Patocka wrote:
> >
> >
> > On Wed, 25 Apr 2018, James Bottomley wrote:
> [...]
> > > Kconfig proliferation, conversely, is a bit of a nightmare from both
> > > the user and the tester's point of view, so we're trying to avoid it
> > > unless absolutely necessary.
> > >
> > > James
> >
> > I already offered that we don't need to introduce a new kernel option and
> > we can bind this feature to any other kernel option, that is enabled in
> > the debug kernel, for example CONFIG_DEBUG_SG. Michal said no and he said
> > that he wants a new kernel option instead.
>
> Just for the record. I didn't say I _want_ a config option. Do not
> misinterpret my words. I've said that a config option would be
> acceptable if there is no way to deliver the functionality via kernel
> package automatically. You haven't provided any argument that would
> explain why the kernel package cannot add a boot option. Maybe there are
> some but I do not see them right now.
AFAIK Grub doesn't load per-kernel options from a per-kernel file. Even if
we hacked grub scripts to add this option, other distributions won't.
Mikulas
On Thu, 2018-04-26 at 10:28 -0400, Mikulas Patocka wrote:
>
> On Thu, 26 Apr 2018, Michal Hocko wrote:
>
> > On Wed 25-04-18 18:42:57, Mikulas Patocka wrote:
> > >Â
> > >Â
> > > On Wed, 25 Apr 2018, James Bottomley wrote:
> > [...]
> > > > Kconfig proliferation, conversely, is a bit of a nightmare from
> both
> > > > the user and the tester's point of view, so we're trying to
> avoid it
> > > > unless absolutely necessary.
> > > >Â
> > > > James
> > >Â
> > > I already offered that we don't need to introduce a new kernel
> option andÂ
> > > we can bind this feature to any other kernel option, that is
> enabled inÂ
> > > the debug kernel, for example CONFIG_DEBUG_SG. Michal said no and
> he saidÂ
> > > that he wants a new kernel option instead.
> >Â
> > Just for the record. I didn't say I _want_ a config option. Do not
> > misinterpret my words. I've said that a config option would be
> > acceptable if there is no way to deliver the functionality via
> kernel
> > package automatically. You haven't provided any argument that would
> > explain why the kernel package cannot add a boot option. Maybe
> there are
> > some but I do not see them right now.
>
> AFAIK Grub doesn't load per-kernel options from a per-kernel file.
> Even if we hacked grub scripts to add this option, other
> distributions won't.
Perhaps find out beforehand instead of insisting on an approach without
knowing. On openSUSE the grub config is built from the files in
/etc/grub.d/ so any package can add a kernel option (and various
conditions around activating it) simply by adding a new file. The
config files are quite sophisticated, so you can add what looks to be a
new kernel, but is really an existing kernel with different options
this way.
James
On Wed, 25 Apr 2018, James Bottomley wrote:
> > BTW. even developers who compile their own kernel should have this
> > enabled?by a CONFIG option - because if the developer sees the option
> > when?browsing through menuconfig, he may enable it. If he doesn't see
> > the?option, he won't even know that such an option exists.
>
> I may be an atypical developer but I'd rather have a root canal than
> browse through menuconfig options. The way to get people to learn
> about new debugging options is to blog about it (or write an lwn.net
> article) which google will find the next time I ask it how I debug XXX.
> Google (probably as a service to humanity) rarely turns up Kconfig
> options in response to a query.
From my point of view, this feature should be as little disruptive to the
developer as possible. It should work automatically behind the scenes
without the developer or the tester even knowing that it is working. From
this point of view, binding it to CONFIG_DEBUG_SG (or any other commonly
used debugging option) would be ideal, because driver developers already
enable CONFIG_DEBUG_SG, so they'll get this kvmalloc test for free.
From your point of view, you should introduce a sysfs file and a kernel
parameter that no one knows about - and then start blogging about it - to
let people know. Why would you bother people with this knowledge? They'll
forget about it anyway and won't turn it on.
Mikulas
On Thu, 26 Apr 2018, James Bottomley wrote:
> On Thu, 2018-04-26 at 10:28 -0400, Mikulas Patocka wrote:
> >
> > On Thu, 26 Apr 2018, Michal Hocko wrote:
> >
> > > On Wed 25-04-18 18:42:57, Mikulas Patocka wrote:
> > > >Â
> > > >Â
> > > > On Wed, 25 Apr 2018, James Bottomley wrote:
> > > [...]
> > > > > Kconfig proliferation, conversely, is a bit of a nightmare from
> > both
> > > > > the user and the tester's point of view, so we're trying to
> > avoid it
> > > > > unless absolutely necessary.
> > > > >Â
> > > > > James
> > > >Â
> > > > I already offered that we don't need to introduce a new kernel
> > option andÂ
> > > > we can bind this feature to any other kernel option, that is
> > enabled inÂ
> > > > the debug kernel, for example CONFIG_DEBUG_SG. Michal said no and
> > he saidÂ
> > > > that he wants a new kernel option instead.
> > >Â
> > > Just for the record. I didn't say I _want_ a config option. Do not
> > > misinterpret my words. I've said that a config option would be
> > > acceptable if there is no way to deliver the functionality via
> > kernel
> > > package automatically. You haven't provided any argument that would
> > > explain why the kernel package cannot add a boot option. Maybe
> > there are
> > > some but I do not see them right now.
> >
> > AFAIK Grub doesn't load per-kernel options from a per-kernel file.
> > Even if we hacked grub scripts to add this option, other
> > distributions won't.
>
> Perhaps find out beforehand instead of insisting on an approach without
> knowing. On openSUSE the grub config is built from the files in
> /etc/grub.d/ so any package can add a kernel option (and various
> conditions around activating it) simply by adding a new file.
And then, different versions of the debug kernel will clash when
attempting to create the same file.
And what about other distributions? What about people who the RHEL kernel
from source with "make"?
The problem with this approach that you are trying to bother more and more
people with this little silly feature.
> The config files are quite sophisticated, so you can add what looks to
> be a new kernel, but is really an existing kernel with different options
> this way.
>
> James
Mikulas
On Thu, 26 Apr 2018, Mikulas Patocka wrote:
>
>
> On Wed, 25 Apr 2018, James Bottomley wrote:
>
> > > BTW. even developers who compile their own kernel should have this
> > > enabled?by a CONFIG option - because if the developer sees the option
> > > when?browsing through menuconfig, he may enable it. If he doesn't see
> > > the?option, he won't even know that such an option exists.
> >
> > I may be an atypical developer but I'd rather have a root canal than
> > browse through menuconfig options. The way to get people to learn
> > about new debugging options is to blog about it (or write an lwn.net
> > article) which google will find the next time I ask it how I debug XXX.
> > Google (probably as a service to humanity) rarely turns up Kconfig
> > options in response to a query.
>
> From my point of view, this feature should be as little disruptive to the
> developer as possible. It should work automatically behind the scenes
> without the developer or the tester even knowing that it is working. From
> this point of view, binding it to CONFIG_DEBUG_SG (or any other commonly
> used debugging option) would be ideal, because driver developers already
> enable CONFIG_DEBUG_SG, so they'll get this kvmalloc test for free.
>
> From your point of view, you should introduce a sysfs file and a kernel
> parameter that no one knows about - and then start blogging about it - to
> let people know. Why would you bother people with this knowledge? They'll
> forget about it anyway and won't turn it on.
BTW. try to think about - how many total lines of code should this feature
consume in the whole Linux ecosystem?
I made a 10-line patch. I got pushback.
I remade it to a 53-line patch. And you try to suggest that 53 lines is
not enough and we must also change kernel packaging scripts in distro
kernels, because the kernel just cannot enable this feature on its own.
If we hack kernel packaging scripts in most distros, how many lines of
code would that be? What's your target?
Mikulas
On Thu, 2018-04-26 at 11:05 -0400, Mikulas Patocka wrote:
>
> On Thu, 26 Apr 2018, James Bottomley wrote:
[...]
> > Perhaps find out beforehand instead of insisting on an approach
> without
> > knowing. On openSUSE the grub config is built from the files in
> > /etc/grub.d/ so any package can add a kernel option (and various
> > conditions around activating it) simply by adding a new file.
>
> And then, different versions of the debug kernel will clash whenÂ
> attempting to create the same file.
Don't be silly ... there are many ways of coping with that in rpm/dpkg.
However, I take it the fact you're now trying to get me to explain
them means you take the point that a kernel dynamic option can be
activated in a variety of easy ways in a distribution including through
the boot menu; so if you want this to appear in the boot menu you don't
need a Kconfig option to achieve it.
> And what about other distributions? What about people who the RHEL
> kernel from source with "make"?
Well, if you build your own kernel and we have a dynamic option, it
will "just work" without you having to muck about trying to re-Kconfig
it, so I'd see that as a win.
> The problem with this approach that you are trying to bother more and
> more people with this little silly feature.
So you're shifting your argument from "I have to do it as a Kconfig
option because the distros require it" to "distributions will build
separate kernel packages for this, but won't do enabling in a non
kernel package"? To be honest, I think the argument is nuts but I
don't really care. From my point of view it's usually me explaining to
people how to debug stuff and "you have to build your own kernel with
this Kconfig option" compared to "add this to the kernel command line
and reboot" is much more effort for the debugger.
James
On Thu, 26 Apr 2018, James Bottomley wrote:
> On Thu, 2018-04-26 at 11:05 -0400, Mikulas Patocka wrote:
> >
> > On Thu, 26 Apr 2018, James Bottomley wrote:
> [...]
> > > Perhaps find out beforehand instead of insisting on an approach
> > without
> > > knowing. On openSUSE the grub config is built from the files in
> > > /etc/grub.d/ so any package can add a kernel option (and various
> > > conditions around activating it) simply by adding a new file.
> >
> > And then, different versions of the debug kernel will clash whenÂ
> > attempting to create the same file.
>
> Don't be silly ... there are many ways of coping with that in rpm/dpkg.
I know you can deal with it - but how many lines of code will that
consume? Multiplied by the total number of rpm-based distros.
Mikulas
On Thu, 26 Apr 2018, James Bottomley wrote:
> So you're shifting your argument from "I have to do it as a Kconfig
> option because the distros require it" to "distributions will build
> separate kernel packages for this, but won't do enabling in a non
> kernel package"? To be honest, I think the argument is nuts but I
> don't really care. From my point of view it's usually me explaining to
> people how to debug stuff and "you have to build your own kernel with
> this Kconfig option" compared to "add this to the kernel command line
> and reboot" is much more effort for the debugger.
>
> James
If you have to explain to the user that he needs to turn it on, it is
already wrong.
In order to find the kvmalloc abuses, it should be tested by as many users
as possible. And it could be tested by as many users as possible, if it
can be enabled in a VISIBLE place (i.e. menuconfig) - or (in my opinion
even better) it should be bound to an CONFIG_ option that is already
enabled for debugging kernel - then you won't have to explain anything to
the user at all.
Hardly anyone - except for people who read this thread - will know about
the new commandline parameters or debugfs files.
I'm not arguing that the commandline parameter or debugfs files are wrong.
They are OK to overridde the default settings for advanced users. But they
are useless for common users because common users won't know about them.
Mikulas
On Thu, Apr 26, 2018 at 11:44:21AM -0400, Mikulas Patocka wrote:
>
>
> On Thu, 26 Apr 2018, James Bottomley wrote:
>
> > On Thu, 2018-04-26 at 11:05 -0400, Mikulas Patocka wrote:
> > >
> > > On Thu, 26 Apr 2018, James Bottomley wrote:
> > [...]
> > > > Perhaps find out beforehand instead of insisting on an approach
> > > without
> > > > knowing.? On openSUSE the grub config is built from the files in
> > > > /etc/grub.d/ so any package can add a kernel option (and various
> > > > conditions around activating it) simply by adding a new file.
> > >
> > > And then, different versions of the debug kernel will clash when?
> > > attempting to create the same file.
> >
> > Don't be silly ... there are many ways of coping with that in rpm/dpkg.
>
> I know you can deal with it - but how many lines of code will that
> consume? Multiplied by the total number of rpm-based distros.
>
> Mikulas
I don't think debug kernels should inject faults by default.
IIUC debug kernels mainly exist so people who experience e.g. memory
corruption can try and debug the failure. In this case, CONFIG_DEBUG_SG
will *already* catch a failure early. Nothing special needs to be done.
There is a much smaller class of people like QA who go actively looking
for trouble. That's the kind of thing fault injection is good for, and
IMO you do not want your QA team to test a separate kernel - otherwise
you are never quite sure whatever was tested will work in the field.
So a config option won't help them either.
How do you make sure QA tests a specific corner case? Add it to
the test plan :)
I don't speak for Red Hat, etc.
--
MST
On Thu, 26 Apr 2018, Michael S. Tsirkin wrote:
> On Thu, Apr 26, 2018 at 11:44:21AM -0400, Mikulas Patocka wrote:
> >
> >
> > On Thu, 26 Apr 2018, James Bottomley wrote:
> >
> > > On Thu, 2018-04-26 at 11:05 -0400, Mikulas Patocka wrote:
> > > >
> > > > On Thu, 26 Apr 2018, James Bottomley wrote:
> > > [...]
> > > > > Perhaps find out beforehand instead of insisting on an approach
> > > > without
> > > > > knowing.? On openSUSE the grub config is built from the files in
> > > > > /etc/grub.d/ so any package can add a kernel option (and various
> > > > > conditions around activating it) simply by adding a new file.
> > > >
> > > > And then, different versions of the debug kernel will clash when?
> > > > attempting to create the same file.
> > >
> > > Don't be silly ... there are many ways of coping with that in rpm/dpkg.
> >
> > I know you can deal with it - but how many lines of code will that
> > consume? Multiplied by the total number of rpm-based distros.
> >
> > Mikulas
>
> I don't think debug kernels should inject faults by default.
But it is not injecting any faults. It is using the fault-injection
framework because Michal said that it should use it. The original version
of the patch didn't use the fault-injection framework at all.
The kvmalloc function can take two branches, the kmalloc branch and
vmalloc branch. The vmalloc branch is taken rarely, so the kernel contains
bugs regarding this path - and the patch increases the probability that
the vmalloc patch is taken, to discover the bugs early and make them
reproducible.
> IIUC debug kernels mainly exist so people who experience e.g. memory
> corruption can try and debug the failure. In this case, CONFIG_DEBUG_SG
> will *already* catch a failure early. Nothing special needs to be done.
The patch helps people debug such memory coprruptions (such as using DMA
API on the result of kvmalloc).
> There is a much smaller class of people like QA who go actively looking
> for trouble. That's the kind of thing fault injection is good for, and
> IMO you do not want your QA team to test a separate kernel - otherwise
> you are never quite sure whatever was tested will work in the field.
> So a config option won't help them either.
>
> How do you make sure QA tests a specific corner case? Add it to
> the test plan :)
>
> I don't speak for Red Hat, etc.
>
> --
> MST
Mikulas
On Thu, Apr 26, 2018 at 12:07:25PM -0400, Mikulas Patocka wrote:
> > IIUC debug kernels mainly exist so people who experience e.g. memory
> > corruption can try and debug the failure. In this case, CONFIG_DEBUG_SG
> > will *already* catch a failure early. Nothing special needs to be done.
>
> The patch helps people debug such memory coprruptions (such as using DMA
> API on the result of kvmalloc).
That's my point. I don't think your patch helps debug any memory
corruptions. With CONFIG_DEBUG_SG using DMA API already causes a
BUG_ON, that's before any memory can get corrupted.
--
MST
On Thu, 26 Apr 2018, Michael S. Tsirkin wrote:
> On Thu, Apr 26, 2018 at 12:07:25PM -0400, Mikulas Patocka wrote:
> > > IIUC debug kernels mainly exist so people who experience e.g. memory
> > > corruption can try and debug the failure. In this case, CONFIG_DEBUG_SG
> > > will *already* catch a failure early. Nothing special needs to be done.
> >
> > The patch helps people debug such memory coprruptions (such as using DMA
> > API on the result of kvmalloc).
>
> That's my point. I don't think your patch helps debug any memory
> corruptions. With CONFIG_DEBUG_SG using DMA API already causes a
> BUG_ON, that's before any memory can get corrupted.
The patch turns a hard-to-reproduce bug into an easy-to-reproduce bug.
Obviously we don't want this in production kernels, but in the debug
kernels it should be done.
Mikulas
On Thu, 26 Apr 2018, Michael S. Tsirkin wrote:
> How do you make sure QA tests a specific corner case? Add it to
> the test plan :)
BTW. how many "lines of code" of corporate bureaucracy would that take? :-)
> I don't speak for Red Hat, etc.
>
> --
> MST
Mikulas
On Thu, Apr 26, 2018 at 02:58:08PM -0400, Mikulas Patocka wrote:
>
>
> On Thu, 26 Apr 2018, Michael S. Tsirkin wrote:
>
> > How do you make sure QA tests a specific corner case? Add it to
> > the test plan :)
>
> BTW. how many "lines of code" of corporate bureaucracy would that take? :-)
It's pretty easy at least here at Red Hat.
> > I don't speak for Red Hat, etc.
> >
> > --
> > MST
>
> Mikulas
>>>>> "James" == James Bottomley <[email protected]> writes:
James> On Wed, 2018-04-25 at 19:00 -0400, Mikulas Patocka wrote:
>>
>> On Wed, 25 Apr 2018, James Bottomley wrote:
>>
>> > > > Do we really need the new config option???This could just be
>> > > > manually? tunable via fault injection IIUC.
>> > >?
>> > > We do, because we want to enable it in RHEL and Fedora debugging
>> > > kernels,?so that it will be tested by the users.
>> > >?
>> > > The users won't use some extra magic kernel options or debugfs
>> files.
>> >?
>> > If it can be enabled via a tunable, then the distro can turn it on
>> > without the user having to do anything.? If you want to present the
>> > user with a different boot option, you can (just have the tunable
>> set
>> > on the command line), but being tunable driven means that you don't
>> > have to choose that option, you could automatically enable it under
>> a
>> > range of circumstances.? I think most sane distributions would want
>> > that flexibility.
>> >?
>> > Kconfig proliferation, conversely, is a bit of a nightmare from
>> both
>> > the user and the tester's point of view, so we're trying to avoid
>> it
>> > unless absolutely necessary.
>> >?
>> > James
>>
>> BTW. even developers who compile their own kernel should have this
>> enabled?by a CONFIG option - because if the developer sees the option
>> when?browsing through menuconfig, he may enable it. If he doesn't see
>> the?option, he won't even know that such an option exists.
James> I may be an atypical developer but I'd rather have a root canal
James> than browse through menuconfig options. The way to get people
James> to learn about new debugging options is to blog about it (or
James> write an lwn.net article) which google will find the next time
James> I ask it how I debug XXX. Google (probably as a service to
James> humanity) rarely turns up Kconfig options in response to a
James> query.
I agree with James here. Looking at the SLAB vs SLUB Kconfig entries
tells me *nothing* about why I should pick one or the other, as an
example.
John
On Thu, Apr 26, 2018 at 02:54:26PM -0400, Mikulas Patocka wrote:
>
>
> On Thu, 26 Apr 2018, Michael S. Tsirkin wrote:
>
> > On Thu, Apr 26, 2018 at 12:07:25PM -0400, Mikulas Patocka wrote:
> > > > IIUC debug kernels mainly exist so people who experience e.g. memory
> > > > corruption can try and debug the failure. In this case, CONFIG_DEBUG_SG
> > > > will *already* catch a failure early. Nothing special needs to be done.
> > >
> > > The patch helps people debug such memory coprruptions (such as using DMA
> > > API on the result of kvmalloc).
> >
> > That's my point. I don't think your patch helps debug any memory
> > corruptions. With CONFIG_DEBUG_SG using DMA API already causes a
> > BUG_ON, that's before any memory can get corrupted.
>
> The patch turns a hard-to-reproduce bug into an easy-to-reproduce bug.
It's still not a memory corruption. It's a BUG_ON the source of which -
should it trigger - can be typically found using grep.
> Obviously we don't want this in production kernels, but in the debug
> kernels it should be done.
>
> Mikulas
I'm not so sure. debug kernels should make debugging easier,
definitely.
Unfortunately they are already slower so some races don't trigger.
If they also start crashing more because we are injecting
memory allocation errors, people are even less likely to
be able to use them.
Just add a comment near the BUG_ON within DMA API telling people how
they can inject this error some more if the bug does not
reproduce, and leave it at that.
--
MST
On Thu, 26 Apr 2018, Michael S. Tsirkin wrote:
> On Thu, Apr 26, 2018 at 02:54:26PM -0400, Mikulas Patocka wrote:
> >
> >
> > On Thu, 26 Apr 2018, Michael S. Tsirkin wrote:
> >
> > > On Thu, Apr 26, 2018 at 12:07:25PM -0400, Mikulas Patocka wrote:
> > > > > IIUC debug kernels mainly exist so people who experience e.g. memory
> > > > > corruption can try and debug the failure. In this case, CONFIG_DEBUG_SG
> > > > > will *already* catch a failure early. Nothing special needs to be done.
> > > >
> > > > The patch helps people debug such memory coprruptions (such as using DMA
> > > > API on the result of kvmalloc).
> > >
> > > That's my point. I don't think your patch helps debug any memory
> > > corruptions. With CONFIG_DEBUG_SG using DMA API already causes a
> > > BUG_ON, that's before any memory can get corrupted.
> >
> > The patch turns a hard-to-reproduce bug into an easy-to-reproduce bug.
>
> It's still not a memory corruption. It's a BUG_ON the source of which -
> should it trigger - can be typically found using grep.
>
> > Obviously we don't want this in production kernels, but in the debug
> > kernels it should be done.
> >
> > Mikulas
>
> I'm not so sure. debug kernels should make debugging easier,
> definitely.
>
> Unfortunately they are already slower so some races don't trigger.
>
> If they also start crashing more because we are injecting
> memory allocation errors, people are even less likely to
> be able to use them.
I've actually already pushed this patch to RHEL-7 (just before 7.5 was
released) and it found out some powerpc issues. See the commit
ea376cc55bc3 in the RHEL-7 git. It was reverted just before RHEL-7.5 was
released with the intention that it will be reinstated just after RHEL-7.5
release, so that these issues could be found and eliminated in the
7.5->7.6 development cycle. Jeff Moyer asked me to put it upstream because
they want to follow upstream and they don't like RHEL-specific patches.
There's clear incentive to put this patch to RHEL-7, that's why I'm
posting it here.
> Just add a comment near the BUG_ON within DMA API telling people how
> they can inject this error some more if the bug does not
> reproduce, and leave it at that.
But the problem is that the powerpc bug only triggers with this patch. It
doesn't trigger without it. So, we have a potential random-crashing bug in
the codebase (and perhaps more others) and we want to eliminate them -
that's why we need the patch.
People on this list argue "this should be a kernel parameter". But the
testers won't enable the kernel parameter, the crashes won't happen
without the kernel parameter and the bugs will stay unreported and
uncorrected. That's why it needs to be the default.
Mikulas
On Thu, Apr 26, 2018 at 03:36:14PM -0400, Mikulas Patocka wrote:
> People on this list argue "this should be a kernel parameter".
How about making it a writeable attribute, so it's easy to turn on/off
after boot. Then you can keep it deterministic, userspace can play with
the attribute at random if it wants to.
--
MST
On Thu, 26 Apr 2018, Michael S. Tsirkin wrote:
> On Thu, Apr 26, 2018 at 03:36:14PM -0400, Mikulas Patocka wrote:
> > People on this list argue "this should be a kernel parameter".
>
> How about making it a writeable attribute, so it's easy to turn on/off
> after boot. Then you can keep it deterministic, userspace can play with
> the attribute at random if it wants to.
>
> --
> MST
It is already controllable by an attribute in debugfs.
Will you email all the testers about this attribute? How many of them will
remember to set it? How many of them will remember to set it a year after?
Will you write a userspace program that manages it and introduce it into
the distributon?
This is a little feature.
Mikulas
On Thu, 26 Apr 2018, John Stoffel wrote:
> >>>>> "James" == James Bottomley <[email protected]> writes:
>
> James> I may be an atypical developer but I'd rather have a root canal
> James> than browse through menuconfig options. The way to get people
> James> to learn about new debugging options is to blog about it (or
> James> write an lwn.net article) which google will find the next time
> James> I ask it how I debug XXX. Google (probably as a service to
> James> humanity) rarely turns up Kconfig options in response to a
> James> query.
>
> I agree with James here. Looking at the SLAB vs SLUB Kconfig entries
> tells me *nothing* about why I should pick one or the other, as an
> example.
>
> John
I see your point - and I think the misunderstanding is this.
This patch is not really helping people to debug existing crashes. It is
not like "you get a crash" - "you google for some keywords" - "you get a
page that suggests to turn this option on" - "you turn it on and solve the
crash".
What this patch really does is that - it makes the kernel deliberately
crash in a situation when the code violates the specification, but it
would not crash otherwise or it would crash very rarely. It helps to
detect specification violations.
If the kernel developer (or tester) doesn't use this option, his buggy
code won't crash - and if it won't crash, he won't fix the bug or report
it. How is the user or developer supposed to learn about this option, if
he gets no crash at all?
Mikulas
On Thu, Apr 26, 2018 at 05:50:20PM -0400, Mikulas Patocka wrote:
> How is the user or developer supposed to learn about this option, if
> he gets no crash at all?
Look in /sys/kernel/debug/fail* ? That actually lets you
filter by module, process etc.
I think this patch conflates two things:
1. Make kvmalloc use the vmalloc path.
This seems a bit narrow.
What is special about kvmalloc? IMHO nothing - it's yet another user
of __GFP_NORETRY or __GFP_RETRY_MAYFAIL. As any such
user, it either recovers correctly or not.
So IMHO it's just a case of
making __GFP_NORETRY, __GFP_RETRY_MAYFAIL, or both
fail once in a while.
Seems like a better extension to me than focusing on vmalloc.
I think you will find more bugs this way.
2. Ability to control this from a separate config
option.
It's still not that clear to me why is this such a
hard requirement. If a distro wants to force specific
boot time options, why isn't CONFIG_CMDLINE sufficient?
But assuming it's important to control this kind of
fault injection to be controlled from
a dedicated menuconfig option, why not the rest of
faults?
IMHO if you split 1/2 up, and generalize, the path upstream
will be much smoother.
Hope this helps.
--
MST
On Fri, 27 Apr 2018, Michael S. Tsirkin wrote:
> On Thu, Apr 26, 2018 at 05:50:20PM -0400, Mikulas Patocka wrote:
> > How is the user or developer supposed to learn about this option, if
> > he gets no crash at all?
>
> Look in /sys/kernel/debug/fail* ? That actually lets you
> filter by module, process etc.
>
> I think this patch conflates two things:
>
> 1. Make kvmalloc use the vmalloc path.
> This seems a bit narrow.
> What is special about kvmalloc? IMHO nothing - it's yet another user
> of __GFP_NORETRY or __GFP_RETRY_MAYFAIL. As any such
__GFP_RETRY_MAYFAIL makes the allocator retry the costly_order allocations
> user, it either recovers correctly or not.
> So IMHO it's just a case of
> making __GFP_NORETRY, __GFP_RETRY_MAYFAIL, or both
> fail once in a while.
> Seems like a better extension to me than focusing on vmalloc.
> I think you will find more bugs this way.
If the array is <= PAGE_SIZE, vmalloc will not use __GFP_NORETRY. So it
still hides some bugs - such as, if a structure grows above 4k, it would
start randomly crashing due to memory fragmentation.
> 2. Ability to control this from a separate config
> option.
>
> It's still not that clear to me why is this such a
> hard requirement. If a distro wants to force specific
> boot time options, why isn't CONFIG_CMDLINE sufficient?
There are 489 kernel options declared with the __setup keyword. Hardly any
kernel developer notices that a new one was added and selects it when
testing his code.
> But assuming it's important to control this kind of
> fault injection to be controlled from
> a dedicated menuconfig option, why not the rest of
> faults?
The injected faults cause damage to the user, so there's no point to
enable them by default. vmalloc fallback should not cause any damage
(assuming that the code is correctly written).
> IMHO if you split 1/2 up, and generalize, the path upstream
> will be much smoother.
This seems like a lost case. So, let's not care about code correctness and
let's solve crashes only after they are reported. If the upstream wants to
work this way, there's nothing that can be done about it.
I'm wondering if I can still push it to RHEL or not.
> Hope this helps.
>
> --
> MST
Mikulas
On Thu 26-04-18 18:52:05, Mikulas Patocka wrote:
>
>
> On Fri, 27 Apr 2018, Michael S. Tsirkin wrote:
[...]
> > But assuming it's important to control this kind of
> > fault injection to be controlled from
> > a dedicated menuconfig option, why not the rest of
> > faults?
>
> The injected faults cause damage to the user, so there's no point to
> enable them by default. vmalloc fallback should not cause any damage
> (assuming that the code is correctly written).
But you want to find those bugs which would BUG_ON easier, so there is a
risk of harm IIUC and this is not much different than other fault
injecting paths.
--
Michal Hocko
SUSE Labs
On Fri, 27 Apr 2018, Michal Hocko wrote:
> On Thu 26-04-18 18:52:05, Mikulas Patocka wrote:
> >
> >
> > On Fri, 27 Apr 2018, Michael S. Tsirkin wrote:
> [...]
> > > But assuming it's important to control this kind of
> > > fault injection to be controlled from
> > > a dedicated menuconfig option, why not the rest of
> > > faults?
> >
> > The injected faults cause damage to the user, so there's no point to
> > enable them by default. vmalloc fallback should not cause any damage
> > (assuming that the code is correctly written).
>
> But you want to find those bugs which would BUG_ON easier, so there is a
> risk of harm IIUC
Yes, I want to harm them, but I only want to harm the users using the
debugging kernel. Testers should be "harmed" by crashes - so that the
users of production kernels are harmed less.
If someone hits this, he should report it, use the kernel parameter to
turn it off and continue with the testing.
> and this is not much different than other fault injecting paths.
Fault injections causes misbehavior even on completely bug-free code (for
example, syscalls randomly returning -ENOMEM). This won't cause
misbehavior on bug-free code.
Mikulas
On Fri, 27 Apr 2018, Michael S. Tsirkin wrote:
> 2. Ability to control this from a separate config
> option.
>
> It's still not that clear to me why is this such a
> hard requirement. If a distro wants to force specific
> boot time options, why isn't CONFIG_CMDLINE sufficient?
So, try this and turn it on with CONFIG_CMDLINE. But I'm not a
blogger and I will not write a blog post about it as James Bottomley
suggests :-)
- so very few users will use it.
fault-injection: introduce kvmalloc fallback options
This patch introduces a fault-injection option "kvmalloc_fallback". This
option makes kvmalloc randomly fall back to vmalloc.
Unfortunately, some kernel code has bugs - it uses kvmalloc and then
uses DMA-API on the returned memory or frees it with kfree. Such bugs were
found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
code. This options helps to test for these bugs.
Signed-off-by: Mikulas Patocka <[email protected]>
---
Documentation/fault-injection/fault-injection.txt | 7 +++++
mm/util.c | 30 ++++++++++++++++++++++
2 files changed, 37 insertions(+)
Index: linux-2.6/Documentation/fault-injection/fault-injection.txt
===================================================================
--- linux-2.6.orig/Documentation/fault-injection/fault-injection.txt 2018-04-28 01:01:25.000000000 +0200
+++ linux-2.6/Documentation/fault-injection/fault-injection.txt 2018-04-28 01:01:25.000000000 +0200
@@ -15,6 +15,12 @@ o fail_page_alloc
injects page allocation failures. (alloc_pages(), get_free_pages(), ...)
+o kvmalloc_fallback
+
+ makes the function kvmalloc randomly fall back to vmalloc. This could be used
+ to detects bugs such as using DMA-API on the result of kvmalloc or freeing
+ the result of kvmalloc with free.
+
o fail_futex
injects futex deadlock and uaddr fault errors.
@@ -167,6 +173,7 @@ use the boot option:
failslab=
fail_page_alloc=
+ kvmalloc_fallback=
fail_make_request=
fail_futex=
mmc_core.fail_request=<interval>,<probability>,<space>,<times>
Index: linux-2.6/mm/util.c
===================================================================
--- linux-2.6.orig/mm/util.c 2018-04-28 01:01:25.000000000 +0200
+++ linux-2.6/mm/util.c 2018-04-28 01:03:25.000000000 +0200
@@ -14,6 +14,7 @@
#include <linux/hugetlb.h>
#include <linux/vmalloc.h>
#include <linux/userfaultfd_k.h>
+#include <linux/fault-inject.h>
#include <asm/sections.h>
#include <linux/uaccess.h>
@@ -377,6 +378,29 @@ unsigned long vm_mmap(struct file *file,
}
EXPORT_SYMBOL(vm_mmap);
+#ifdef CONFIG_FAULT_INJECTION
+
+static DECLARE_FAULT_ATTR(kvmalloc_fallback);
+
+static int __init setup_kvmalloc_fallback(char *str)
+{
+ kvmalloc_fallback.verbose = 0;
+ return setup_fault_attr(&kvmalloc_fallback, str);
+}
+
+__setup("kvmalloc_fallback=", setup_kvmalloc_fallback);
+
+#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
+static int __init kvmalloc_fallback_debugfs_init(void)
+{
+ fault_create_debugfs_attr("kvmalloc_fallback", NULL, &kvmalloc_fallback);
+ return 0;
+}
+late_initcall(kvmalloc_fallback_debugfs_init);
+#endif
+
+#endif
+
/**
* kvmalloc_node - attempt to allocate physically contiguous memory, but upon
* failure, fall back to non-contiguous (vmalloc) allocation.
@@ -404,6 +428,11 @@ void *kvmalloc_node(size_t size, gfp_t f
*/
WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
+#ifdef CONFIG_FAULT_INJECTION
+ if (should_fail(&kvmalloc_fallback, size))
+ goto do_vmalloc;
+#endif
+
/*
* We want to attempt a large physically contiguous block first because
* it is less likely to fragment multiple larger blocks and therefore
@@ -427,6 +456,7 @@ void *kvmalloc_node(size_t size, gfp_t f
if (ret || size <= PAGE_SIZE)
return ret;
+do_vmalloc: __maybe_unused
return __vmalloc_node_flags_caller(size, node, flags,
__builtin_return_address(0));
}
>>>>> "Mikulas" == Mikulas Patocka <[email protected]> writes:
Mikulas> On Thu, 26 Apr 2018, John Stoffel wrote:
>> >>>>> "James" == James Bottomley <[email protected]> writes:
>>
James> I may be an atypical developer but I'd rather have a root canal
James> than browse through menuconfig options. The way to get people
James> to learn about new debugging options is to blog about it (or
James> write an lwn.net article) which google will find the next time
James> I ask it how I debug XXX. Google (probably as a service to
James> humanity) rarely turns up Kconfig options in response to a
James> query.
>>
>> I agree with James here. Looking at the SLAB vs SLUB Kconfig entries
>> tells me *nothing* about why I should pick one or the other, as an
>> example.
>>
>> John
Mikulas> I see your point - and I think the misunderstanding is this.
Thanks.
Mikulas> This patch is not really helping people to debug existing crashes. It is
Mikulas> not like "you get a crash" - "you google for some keywords" - "you get a
Mikulas> page that suggests to turn this option on" - "you turn it on and solve the
Mikulas> crash".
Mikulas> What this patch really does is that - it makes the kernel deliberately
Mikulas> crash in a situation when the code violates the specification, but it
Mikulas> would not crash otherwise or it would crash very rarely. It helps to
Mikulas> detect specification violations.
Mikulas> If the kernel developer (or tester) doesn't use this option, his buggy
Mikulas> code won't crash - and if it won't crash, he won't fix the bug or report
Mikulas> it. How is the user or developer supposed to learn about this option, if
Mikulas> he gets no crash at all?
So why do we make this a KConfig option at all? Just turn it on and
let it rip. Now I also think that Linus has the right idea to not
just sprinkle BUG_ONs into the code, just dump and oops and keep going
if you can. If it's a filesystem or a device, turn it read only so
that people notice right away.
On Mon, 30 Apr 2018, John Stoffel wrote:
> >>>>> "Mikulas" == Mikulas Patocka <[email protected]> writes:
>
> Mikulas> On Thu, 26 Apr 2018, John Stoffel wrote:
>
> Mikulas> I see your point - and I think the misunderstanding is this.
>
> Thanks.
>
> Mikulas> This patch is not really helping people to debug existing crashes. It is
> Mikulas> not like "you get a crash" - "you google for some keywords" - "you get a
> Mikulas> page that suggests to turn this option on" - "you turn it on and solve the
> Mikulas> crash".
>
> Mikulas> What this patch really does is that - it makes the kernel deliberately
> Mikulas> crash in a situation when the code violates the specification, but it
> Mikulas> would not crash otherwise or it would crash very rarely. It helps to
> Mikulas> detect specification violations.
>
> Mikulas> If the kernel developer (or tester) doesn't use this option, his buggy
> Mikulas> code won't crash - and if it won't crash, he won't fix the bug or report
> Mikulas> it. How is the user or developer supposed to learn about this option, if
> Mikulas> he gets no crash at all?
>
> So why do we make this a KConfig option at all?
Because other people see the KConfig option (so, they may enable it) and
they don't see the kernel parameter (so, they won't enable it).
Close your eyes and say how many kernel parameters do you remember :-)
> Just turn it on and let it rip.
I can't test if all the networking drivers use kvmalloc properly, because
I don't have the hardware. You can't test it neither. No one has all the
hardware that is supported by Linux.
Driver issues can only be tested by a mass of users. And if the users
don't know about the debugging option, they won't enable it.
> >> I agree with James here. Looking at the SLAB vs SLUB Kconfig entries
> >> tells me *nothing* about why I should pick one or the other, as an
> >> example.
BTW. You can enable slub debugging either with CONFIG_SLUB_DEBUG_ON or
with the kernel parameter "slub_debug" - and most users who compile their
own kernel use CONFIG_SLUB_DEBUG_ON - just because it is visible.
> Now I also think that Linus has the right idea to not just sprinkle
> BUG_ONs into the code, just dump and oops and keep going if you can.
> If it's a filesystem or a device, turn it read only so that people
> notice right away.
This vmalloc fallback is similar to CONFIG_DEBUG_KOBJECT_RELEASE.
CONFIG_DEBUG_KOBJECT_RELEASE changes the behavior of kobject_put in order
to cause deliberate crashes (that wouldn't happen otherwise) in drivers
that misuse kobject_put. In the same sense, we want to cause deliberate
crashes (that wouldn't happen otherwise) in drivers that misuse kvmalloc.
The crashes will only happen in debugging kernels, not in production
kernels.
Mikulas
On Tue, 24 Apr 2018 12:33:01 -0400 (EDT) Mikulas Patocka <[email protected]> wrote:
>
>
> On Tue, 24 Apr 2018, Michal Hocko wrote:
>
> > On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
> > >
> > >
> > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > >
> > > > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> > > >
> > > > > Fixing __vmalloc code
> > > > > is easy and it doesn't require cooperation with maintainers.
> > > >
> > > > But it is a hack against the intention of the scope api.
> > >
> > > It is not!
> >
> > This discussion simply doesn't make much sense it seems. The scope API
> > is to document the scope of the reclaim recursion critical section. That
> > certainly is not a utility function like vmalloc.
>
> That 15-line __vmalloc bugfix doesn't prevent you (or any other kernel
> developer) from converting the code to the scope API. You make nonsensical
> excuses.
>
Fun thread!
Winding back to the original problem, I'd state it as
- Caller uses kvmalloc() but passes the address into vmalloc-naive
DMA API and
- Caller uses kvmalloc() but passes the address into kfree()
Yes?
If so, then...
Is there a way in which, in the kvmalloc-called-kmalloc path, we can
tag the slab-allocated memory with a "this memory was allocated with
kvmalloc()" flag? I *think* there's extra per-object storage available
with suitable slab/slub debugging options? Perhaps we could steal one
bit from the redzone, dunno.
If so then we can
a) set that flag in kvmalloc() if the kmalloc() call succeeded
b) check for that flag in the DMA code, WARN if it is set.
c) in kvfree(), clear that flag before calling kfree()
d) in kfree(), check for that flag and go WARN() if set.
So both potential bugs are detected all the time, dependent upon
CONFIG_SLUB_DEBUG (and perhaps other slub config options).
On Tue, May 01 2018 at 8:36pm -0400,
Andrew Morton <[email protected]> wrote:
> On Tue, 24 Apr 2018 12:33:01 -0400 (EDT) Mikulas Patocka <[email protected]> wrote:
>
> >
> >
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> >
> > > On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
> > > >
> > > >
> > > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > > >
> > > > > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> > > > >
> > > > > > Fixing __vmalloc code
> > > > > > is easy and it doesn't require cooperation with maintainers.
> > > > >
> > > > > But it is a hack against the intention of the scope api.
> > > >
> > > > It is not!
> > >
> > > This discussion simply doesn't make much sense it seems. The scope API
> > > is to document the scope of the reclaim recursion critical section. That
> > > certainly is not a utility function like vmalloc.
> >
> > That 15-line __vmalloc bugfix doesn't prevent you (or any other kernel
> > developer) from converting the code to the scope API. You make nonsensical
> > excuses.
> >
>
> Fun thread!
>
> Winding back to the original problem, I'd state it as
>
> - Caller uses kvmalloc() but passes the address into vmalloc-naive
> DMA API and
>
> - Caller uses kvmalloc() but passes the address into kfree()
>
> Yes?
I think so.
> If so, then...
>
> Is there a way in which, in the kvmalloc-called-kmalloc path, we can
> tag the slab-allocated memory with a "this memory was allocated with
> kvmalloc()" flag? I *think* there's extra per-object storage available
> with suitable slab/slub debugging options? Perhaps we could steal one
> bit from the redzone, dunno.
>
> If so then we can
>
> a) set that flag in kvmalloc() if the kmalloc() call succeeded
>
> b) check for that flag in the DMA code, WARN if it is set.
>
> c) in kvfree(), clear that flag before calling kfree()
>
> d) in kfree(), check for that flag and go WARN() if set.
>
> So both potential bugs are detected all the time, dependent upon
> CONFIG_SLUB_DEBUG (and perhaps other slub config options).
Thanks Andrew, definitely the most sane proposal I've seen to resolve
this.
>>>>> "Mikulas" == Mikulas Patocka <[email protected]> writes:
Mikulas> On Mon, 30 Apr 2018, John Stoffel wrote:
>> >>>>> "Mikulas" == Mikulas Patocka <[email protected]> writes:
>>
Mikulas> On Thu, 26 Apr 2018, John Stoffel wrote:
>>
Mikulas> I see your point - and I think the misunderstanding is this.
>>
>> Thanks.
>>
Mikulas> This patch is not really helping people to debug existing crashes. It is
Mikulas> not like "you get a crash" - "you google for some keywords" - "you get a
Mikulas> page that suggests to turn this option on" - "you turn it on and solve the
Mikulas> crash".
>>
Mikulas> What this patch really does is that - it makes the kernel deliberately
Mikulas> crash in a situation when the code violates the specification, but it
Mikulas> would not crash otherwise or it would crash very rarely. It helps to
Mikulas> detect specification violations.
>>
Mikulas> If the kernel developer (or tester) doesn't use this option, his buggy
Mikulas> code won't crash - and if it won't crash, he won't fix the bug or report
Mikulas> it. How is the user or developer supposed to learn about this option, if
Mikulas> he gets no crash at all?
>>
>> So why do we make this a KConfig option at all?
Mikulas> Because other people see the KConfig option (so, they may enable it) and
Mikulas> they don't see the kernel parameter (so, they won't enable it).
Mikulas> Close your eyes and say how many kernel parameters do you remember :-)
>> Just turn it on and let it rip.
Mikulas> I can't test if all the networking drivers use kvmalloc properly, because
Mikulas> I don't have the hardware. You can't test it neither. No one has all the
Mikulas> hardware that is supported by Linux.
Mikulas> Driver issues can only be tested by a mass of users. And if the users
Mikulas> don't know about the debugging option, they won't enable it.
>> >> I agree with James here. Looking at the SLAB vs SLUB Kconfig entries
>> >> tells me *nothing* about why I should pick one or the other, as an
>> >> example.
Mikulas> BTW. You can enable slub debugging either with CONFIG_SLUB_DEBUG_ON or
Mikulas> with the kernel parameter "slub_debug" - and most users who compile their
Mikulas> own kernel use CONFIG_SLUB_DEBUG_ON - just because it is visible.
You miss my point, which is that there's no explanation of what the
difference is between SLAB and SLUB and which I should choose. The
same goes here. If the KConfig option doesn't give useful info, it's
useless.
>> Now I also think that Linus has the right idea to not just sprinkle
>> BUG_ONs into the code, just dump and oops and keep going if you can.
>> If it's a filesystem or a device, turn it read only so that people
>> notice right away.
Mikulas> This vmalloc fallback is similar to
Mikulas> CONFIG_DEBUG_KOBJECT_RELEASE. CONFIG_DEBUG_KOBJECT_RELEASE
Mikulas> changes the behavior of kobject_put in order to cause
Mikulas> deliberate crashes (that wouldn't happen otherwise) in
Mikulas> drivers that misuse kobject_put. In the same sense, we want
Mikulas> to cause deliberate crashes (that wouldn't happen otherwise)
Mikulas> in drivers that misuse kvmalloc.
Mikulas> The crashes will only happen in debugging kernels, not in
Mikulas> production kernels.
Says you. What about people or distros that enable it
unconditionally? They're going to get all kinds of reports and then
turn it off again. Crashing the system isn't the answer here.
>>>>> "Mike" == Mike Snitzer <[email protected]> writes:
Mike> On Tue, May 01 2018 at 8:36pm -0400,
Mike> Andrew Morton <[email protected]> wrote:
>> On Tue, 24 Apr 2018 12:33:01 -0400 (EDT) Mikulas Patocka <[email protected]> wrote:
>>
>> >
>> >
>> > On Tue, 24 Apr 2018, Michal Hocko wrote:
>> >
>> > > On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
>> > > >
>> > > >
>> > > > On Tue, 24 Apr 2018, Michal Hocko wrote:
>> > > >
>> > > > > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
>> > > > >
>> > > > > > Fixing __vmalloc code
>> > > > > > is easy and it doesn't require cooperation with maintainers.
>> > > > >
>> > > > > But it is a hack against the intention of the scope api.
>> > > >
>> > > > It is not!
>> > >
>> > > This discussion simply doesn't make much sense it seems. The scope API
>> > > is to document the scope of the reclaim recursion critical section. That
>> > > certainly is not a utility function like vmalloc.
>> >
>> > That 15-line __vmalloc bugfix doesn't prevent you (or any other kernel
>> > developer) from converting the code to the scope API. You make nonsensical
>> > excuses.
>> >
>>
>> Fun thread!
>>
>> Winding back to the original problem, I'd state it as
>>
>> - Caller uses kvmalloc() but passes the address into vmalloc-naive
>> DMA API and
>>
>> - Caller uses kvmalloc() but passes the address into kfree()
>>
>> Yes?
Mike> I think so.
>> If so, then...
>>
>> Is there a way in which, in the kvmalloc-called-kmalloc path, we can
>> tag the slab-allocated memory with a "this memory was allocated with
>> kvmalloc()" flag? I *think* there's extra per-object storage available
>> with suitable slab/slub debugging options? Perhaps we could steal one
>> bit from the redzone, dunno.
>>
>> If so then we can
>>
>> a) set that flag in kvmalloc() if the kmalloc() call succeeded
>>
>> b) check for that flag in the DMA code, WARN if it is set.
>>
>> c) in kvfree(), clear that flag before calling kfree()
>>
>> d) in kfree(), check for that flag and go WARN() if set.
>>
>> So both potential bugs are detected all the time, dependent upon
>> CONFIG_SLUB_DEBUG (and perhaps other slub config options).
Mike> Thanks Andrew, definitely the most sane proposal I've seen to resolve
Mike> this.
Cuts to the heart of the issue I think, and seems pretty sane. Should
the WARN be rate limited as well?
John
On Tue, 1 May 2018, Andrew Morton wrote:
> On Tue, 24 Apr 2018 12:33:01 -0400 (EDT) Mikulas Patocka <[email protected]> wrote:
>
> >
> >
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> >
> > > On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
> > > >
> > > >
> > > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > > >
> > > > > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> > > > >
> > > > > > Fixing __vmalloc code
> > > > > > is easy and it doesn't require cooperation with maintainers.
> > > > >
> > > > > But it is a hack against the intention of the scope api.
> > > >
> > > > It is not!
> > >
> > > This discussion simply doesn't make much sense it seems. The scope API
> > > is to document the scope of the reclaim recursion critical section. That
> > > certainly is not a utility function like vmalloc.
> >
> > That 15-line __vmalloc bugfix doesn't prevent you (or any other kernel
> > developer) from converting the code to the scope API. You make nonsensical
> > excuses.
> >
>
> Fun thread!
>
> Winding back to the original problem, I'd state it as
>
> - Caller uses kvmalloc() but passes the address into vmalloc-naive
> DMA API and
>
> - Caller uses kvmalloc() but passes the address into kfree()
>
> Yes?
>
> If so, then...
>
> Is there a way in which, in the kvmalloc-called-kmalloc path, we can
> tag the slab-allocated memory with a "this memory was allocated with
> kvmalloc()" flag? I *think* there's extra per-object storage available
> with suitable slab/slub debugging options? Perhaps we could steal one
> bit from the redzone, dunno.
>
> If so then we can
>
> a) set that flag in kvmalloc() if the kmalloc() call succeeded
>
> b) check for that flag in the DMA code, WARN if it is set.
>
> c) in kvfree(), clear that flag before calling kfree()
>
> d) in kfree(), check for that flag and go WARN() if set.
>
> So both potential bugs are detected all the time, dependent upon
> CONFIG_SLUB_DEBUG (and perhaps other slub config options).
Yes, it would be good. You also need to check it in virt_to_phys(),
virt_to_pfn(), __pa() and maybe some others.
Mikulas
On Wed, 2 May 2018, John Stoffel wrote:
> You miss my point, which is that there's no explanation of what the
> difference is between SLAB and SLUB and which I should choose. The
> same goes here. If the KConfig option doesn't give useful info, it's
> useless.
So what, we could write explamantion of that option.
> >> Now I also think that Linus has the right idea to not just sprinkle
> >> BUG_ONs into the code, just dump and oops and keep going if you can.
> >> If it's a filesystem or a device, turn it read only so that people
> >> notice right away.
>
> Mikulas> This vmalloc fallback is similar to
> Mikulas> CONFIG_DEBUG_KOBJECT_RELEASE. CONFIG_DEBUG_KOBJECT_RELEASE
> Mikulas> changes the behavior of kobject_put in order to cause
> Mikulas> deliberate crashes (that wouldn't happen otherwise) in
> Mikulas> drivers that misuse kobject_put. In the same sense, we want
> Mikulas> to cause deliberate crashes (that wouldn't happen otherwise)
> Mikulas> in drivers that misuse kvmalloc.
>
> Mikulas> The crashes will only happen in debugging kernels, not in
> Mikulas> production kernels.
>
> Says you. What about people or distros that enable it
> unconditionally? They're going to get all kinds of reports and then
> turn it off again. Crashing the system isn't the answer here.
I've made that kvmalloc bug too (in the function
dm_integrity_free_journal_scatterlist). I'd much rather like if the kernel
crashed (because then - I would fix the bug). The kernel didn't crash and
the bug sneaked into the official linux tree, where may be causing random
crashes for other users.
Mikulas
Hello, Mikulas.
On Tue, Apr 24, 2018 at 02:41:47PM -0400, Mikulas Patocka wrote:
>
>
> On Tue, 24 Apr 2018, Matthew Wilcox wrote:
>
> > On Tue, Apr 24, 2018 at 08:29:14AM -0400, Mikulas Patocka wrote:
> > >
> > >
> > > On Mon, 23 Apr 2018, Matthew Wilcox wrote:
> > >
> > > > On Mon, Apr 23, 2018 at 08:06:16PM -0400, Mikulas Patocka wrote:
> > > > > Some bugs (such as buffer overflows) are better detected
> > > > > with kmalloc code, so we must test the kmalloc path too.
> > > >
> > > > Well now, this brings up another item for the collective TODO list --
> > > > implement redzone checks for vmalloc. Unless this is something already
> > > > taken care of by kasan or similar.
> > >
> > > The kmalloc overflow testing is also not ideal - it rounds the size up to
> > > the next slab size and detects buffer overflows only at this boundary.
> > >
> > > Some times ago, I made a "kmalloc guard" patch that places a magic number
> > > immediatelly after the requested size - so that it can detect overflows at
> > > byte boundary
> > > ( https://www.redhat.com/archives/dm-devel/2014-September/msg00018.html )
> > >
> > > That patch found a bug in crypto code:
> > > ( http://lkml.iu.edu/hypermail/linux/kernel/1409.1/02325.html )
> >
> > Is it still worth doing this, now we have kasan?
>
> The kmalloc guard has much lower overhead than kasan.
I skimm at your code and it requires rebuilding the kernel.
I think that if rebuilding is required as the same with the KASAN,
using the KASAN is better since it has far better coverage for
detection the bug.
However, I think that if the redzone can be setup tightly
without rebuild, it would be worth implementing. I have an idea to
implement it only for the SLUB. Could I try it? (I'm asking this
because I'm inspired from the above patch.) :)
Or do you wanna try it?
Thanks.