LinuxLists.cc - [PATCH] fix crash when using XFS on loopback

2014-01-04 17:46:06

Subject: [PATCH] fix crash when using XFS on loopback

The patch 8456a648cf44f14365f1f44de90a3da2526a4776 causes a crash in the
LVM2 testsuite on PA-RISC (the crashing test is fsadm.sh). The testsuite
doesn't crash on 3.12, but crashes on 3.13-rc1 and later.

Bad Address (null pointer deref?): Code=15 regs=000000413edd89a0 (Addr=000006202224647d)
CPU: 3 PID: 24008 Comm: loop0 Not tainted 3.13.0-rc6 #5
task: 00000001bf3c0048 ti: 000000413edd8000 task.ti: 000000413edd8000

YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00001000000001101111100100001110 Not tainted
r00-03 000000ff0806f90e 00000000405c8de0 000000004013e6c0 000000413edd83f0
r04-07 00000000405a95e0 0000000000000200 00000001414735f0 00000001bf349e40
r08-11 0000000010fe3d10 0000000000000001 00000040829c7778 000000413efd9000
r12-15 0000000000000000 000000004060d800 0000000010fe3000 0000000010fe3000
r16-19 000000413edd82a0 00000041078ddbc0 0000000000000010 0000000000000001
r20-23 0008f3d0d83a8000 0000000000000000 00000040829c7778 0000000000000080
r24-27 00000001bf349e40 00000001bf349e40 202d66202224640d 00000000405a95e0
r28-31 202d662022246465 000000413edd88f0 000000413edd89a0 0000000000000001
sr00-03 000000000532c000 0000000000000000 0000000000000000 000000000532c000
sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000

IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000401fe42c 00000000401fe430
IIR: 539c0030 ISR: 00000000202d6000 IOR: 000006202224647d
CPU: 3 CR30: 000000413edd8000 CR31: 0000000000000000
ORIG_R28: 00000000405a95e0
IAOQ[0]: vma_interval_tree_iter_first+0x14/0x48
IAOQ[1]: vma_interval_tree_iter_first+0x18/0x48
RP(r2): flush_dcache_page+0x128/0x388
Backtrace:
[<000000004013e6c0>] flush_dcache_page+0x128/0x388
[<0000000010fe6ca0>] lo_splice_actor+0x90/0x148 [loop]
[<00000000402579b0>] splice_from_pipe_feed+0xc0/0x1d0
[<00000000402580a4>] __splice_from_pipe+0xac/0xc0
[<0000000010fe6bbc>] lo_direct_splice_actor+0x1c/0x70 [loop]
[<000000004025854c>] splice_direct_to_actor+0xec/0x228
[<0000000010fe63ac>] lo_receive+0xe4/0x298 [loop]
[<0000000010fe69d8>] loop_thread+0x478/0x640 [loop]
[<000000004018975c>] kthread+0x134/0x168
[<000000004012c020>] end_fault_vector+0x20/0x28
[<00000000115e0098>] xfs_setsize_buftarg+0x0/0x90 [xfs]

Kernel panic - not syncing: Bad Address (null pointer deref?)

The patch 8456a648cf44f14365f1f44de90a3da2526a4776 changes the page
structure so that the slab subsystem reuses the page->mapping field.

The crash happens in the following way:
* XFS allocates some memory from slab and issues a bio to read data into
it.
* the bio is sent to the loopback device.
* lo_receive creates an actor and calls splice_direct_to_actor.
* lo_splice_actor copies data to the target page.
* lo_splice_actor calls flush_dcache_page because the page may be mapped
by userspace. In that case we need to flush the kernel cache.
* flush_dcache_page asks for the list of userspace mappings; however, the
page->mapping field is reused by the slab subsystem for a different
purpose. This causes the crash.

Note that other architectures without coherent caches (sparc, arm, mips)
also call page_mapping from flush_dcache_page, so they may crash in the
same way.

This patch fixes this bug by testing if the page is a slab page in
page_mapping and returning NULL if it is.
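
The effect of the fix can be sketched in userspace C. This is only an
illustrative model under stated assumptions: the struct layouts, the PG_SLAB
flag value, and the page_mapping_unguarded/page_mapping_guarded names are
hypothetical stand-ins, not the kernel's definitions.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct address_space { int id; };

#define PG_SLAB 0x1u	/* models the PageSlab() flag (assumed value) */

struct page {
	unsigned int flags;
	union {
		struct address_space *mapping;	/* page cache / anon users */
		void *slab_private;		/* slab reuses the same word */
	} u;
};

/* Unguarded lookup: hands back whatever bits sit in the shared word. */
static struct address_space *page_mapping_unguarded(const struct page *page)
{
	return page->u.mapping;
}

/* Guarded lookup, following the idea of the patch: slab pages have no mapping. */
static struct address_space *page_mapping_guarded(const struct page *page)
{
	if (page->flags & PG_SLAB)
		return NULL;
	return page->u.mapping;
}
```

In this model a caller that walks the returned mapping only does so when the
guarded lookup returns non-NULL, so slab-owned data is never interpreted as an
address_space.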

The patch also fixes the VM_BUG_ON(PageSlab(page)) that could trigger in
earlier kernels in the same scenario on architectures without cache
coherence when CONFIG_DEBUG_VM is enabled, so it should be backported to
stable kernels.

In the old kernels, the function page_mapping is placed in
include/linux/mm.h, so you should modify the patch accordingly when
backporting it.

Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]

---
mm/util.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

Index: linux-3.13-rc6/mm/util.c
===================================================================
--- linux-3.13-rc6.orig/mm/util.c 2014-01-04 00:06:07.000000000 +0100
+++ linux-3.13-rc6/mm/util.c 2014-01-04 00:24:42.000000000 +0100
@@ -390,7 +390,10 @@ struct address_space *page_mapping(struc
 {
 	struct address_space *mapping = page->mapping;

-	VM_BUG_ON(PageSlab(page));
+	/* This happens if someone calls flush_dcache_page on slab page */
+	if (unlikely(PageSlab(page)))
+		return NULL;
+
 	if (unlikely(PageSwapCache(page))) {
 		swp_entry_t entry;

2014-01-04 18:55:45

by John David Anglin

Subject: Re: [PATCH] fix crash when using XFS on loopback

On 4-Jan-14, at 12:45 PM, Mikulas Patocka wrote:

> * flush_dcache_page asks for the list of userspace mappings, however
> that
> page->mapping field is reused by the slab subsystem for a different
> purpose. This causes the crash.

I'd noticed the other day that the parisc implementation of
flush_dcache_page() should return if "!mapping || mapping != page->mapping"
is true. This would have avoided the crash.

Dave
--
John David Anglin [email protected]

2014-01-04 19:56:30

by Mikulas Patocka

Subject: Re: [PATCH] fix crash when using XFS on loopback

On Sat, 4 Jan 2014, John David Anglin wrote:

> On 4-Jan-14, at 12:45 PM, Mikulas Patocka wrote:
>
> > * flush_dcache_page asks for the list of userspace mappings, however that
> > page->mapping field is reused by the slab subsystem for a different
> > purpose. This causes the crash.
>
> I'd noticed the other day that the parisc implementation of
> flush_dcache_page()
> should return if "!mapping || mapping != page->mapping" is true. This would
> have avoided crash.
>
> Dave

I don't think so.

page_mapping returns NULL if the page has only anonymous mapping and it is
not placed in the swap cache. In this case, you need to flush the kernel
cache.

Maybe you could skip cache flush if the page is neither anonymous nor
file-backed, but I haven't seen this condition in other architectures'
flush_dcache_page.
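
For context, here is a minimal userspace sketch of why the lookup yields NULL
for such pages, assuming the kernel's convention of tagging the low bit of
page->mapping for anonymous pages (PAGE_MAPPING_ANON). The types and helper
names below are hypothetical, and the swap cache case is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define MAPPING_ANON_BIT 0x1u	/* models the kernel's PAGE_MAPPING_ANON tag */

/* Hypothetical stand-ins for the kernel types. */
struct address_space { int id; };
struct anon_vma { int id; };

/* Tag an anon_vma pointer the way anonymous pages store it in page->mapping. */
static uintptr_t encode_anon(struct anon_vma *av)
{
	return (uintptr_t)av | MAPPING_ANON_BIT;
}

/* True if the raw field carries the anonymous tag. */
static int raw_is_anon(uintptr_t raw)
{
	return (raw & MAPPING_ANON_BIT) != 0;
}

/* Simplified lookup: an anonymous page (not in swap cache) yields NULL. */
static struct address_space *mapping_from_raw(uintptr_t raw)
{
	if (raw_is_anon(raw))
		return NULL;
	return (struct address_space *)raw;
}
```

A NULL result therefore does not by itself mean that the page has no
userspace mappings.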

Mikulas

2014-01-04 20:31:33

by John David Anglin

Subject: Re: [PATCH] fix crash when using XFS on loopback

On 4-Jan-14, at 2:55 PM, Mikulas Patocka wrote:

> On Sat, 4 Jan 2014, John David Anglin wrote:
>
>> On 4-Jan-14, at 12:45 PM, Mikulas Patocka wrote:
>>
>>> * flush_dcache_page asks for the list of userspace mappings,
>>> however that
>>> page->mapping field is reused by the slab subsystem for a different
>>> purpose. This causes the crash.
>>
>> I'd noticed the other day that the parisc implementation of
>> flush_dcache_page()
>> should return if "!mapping || mapping != page->mapping" is true.
>> This would
>> have avoided crash.
>>
>> Dave
>
> I think no.
>
> page_mapping returns NULL if the page has only anonymous mapping and
> it is
> not placed in the swap cache. In this case, you need to flush the
> kernel
> cache.

The suggestion is to add the "mapping != page->mapping" test to the current
NULL check. It occurs after the kernel cache flush.

It doesn't seem right to flush the vma mappings associated with the swap
address space, and that appears to be happening with the current code.

Dave
--
John David Anglin [email protected]

2014-01-04 20:52:37

by Mikulas Patocka

Subject: Re: [PATCH] fix crash when using XFS on loopback

On Sat, 4 Jan 2014, John David Anglin wrote:

> On 4-Jan-14, at 2:55 PM, Mikulas Patocka wrote:
>
> > On Sat, 4 Jan 2014, John David Anglin wrote:
> >
> > > On 4-Jan-14, at 12:45 PM, Mikulas Patocka wrote:
> > >
> > > > * flush_dcache_page asks for the list of userspace mappings, however
> > > > that
> > > > page->mapping field is reused by the slab subsystem for a different
> > > > purpose. This causes the crash.
> > >
> > > I'd noticed the other day that the parisc implementation of
> > > flush_dcache_page()
> > > should return if "!mapping || mapping != page->mapping" is true. This
> > > would
> > > have avoided crash.
> > >
> > > Dave
> >
> > I think no.
> >
> > page_mapping returns NULL if the page has only anonymous mapping and it is
> > not placed in the swap cache. In this case, you need to flush the kernel
> > cache.
>
>
> The suggestion is to add the "mapping != page->mapping" to the current NULL
> check.
> It occurs after the kernel cache flush.

"if (!mapping || mapping != page->mapping) return;"
returns if the mapping is NULL (and that is wrong because the variable
mapping is NULL for anonymous pages).

You could probably return "if (!mapping && !PageAnon(page))", but the
other architectures aren't doing it.
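
The two early-return conditions quoted in this message can be compared side
by side in a small sketch. It assumes the caller has already computed the
three inputs; the struct flush_inputs type and both function names are
hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Inputs to the early-return test, as computed by a caller (hypothetical model). */
struct flush_inputs {
	const void *mapping;		/* result of page_mapping(page) */
	const void *raw_mapping;	/* raw page->mapping field */
	bool is_anon;			/* PageAnon(page) */
};

/* "if (!mapping || mapping != page->mapping) return;" */
static bool skip_proposed(const struct flush_inputs *in)
{
	return !in->mapping || in->mapping != in->raw_mapping;
}

/* "if (!mapping && !PageAnon(page)) return;" */
static bool skip_alternative(const struct flush_inputs *in)
{
	return !in->mapping && !in->is_anon;
}
```

For an anonymous page (mapping is NULL, is_anon is true) the first predicate
returns early and the second does not. For a file-backed page whose lookup
result equals the raw field, neither returns early.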

> It doesn't seem right to flush the vma mappings associated with swap address
> space
> and that appears to be happening with current code.
>
> Dave
> --
> John David Anglin [email protected]

I suppose that "vma_interval_tree_foreach" is an empty operation for the swap
address space. Or isn't it?

Mikulas

2014-01-06 07:35:44

by Joonsoo Kim

Subject: Re: [PATCH] fix crash when using XFS on loopback

On Sat, Jan 04, 2014 at 12:45:45PM -0500, Mikulas Patocka wrote:
> The patch 8456a648cf44f14365f1f44de90a3da2526a4776 causes crash in the
> LVM2 testsuite on PA-RISC (the crashing test is fsadm.sh). The testsuite
> doesn't crash on 3.12, crashes on 3.13-rc1 and later.
>
> Bad Address (null pointer deref?): Code=15 regs=000000413edd89a0 (Addr=000006202224647d)
> CPU: 3 PID: 24008 Comm: loop0 Not tainted 3.13.0-rc6 #5
> task: 00000001bf3c0048 ti: 000000413edd8000 task.ti: 000000413edd8000
>
> YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
> PSW: 00001000000001101111100100001110 Not tainted
> r00-03 000000ff0806f90e 00000000405c8de0 000000004013e6c0 000000413edd83f0
> r04-07 00000000405a95e0 0000000000000200 00000001414735f0 00000001bf349e40
> r08-11 0000000010fe3d10 0000000000000001 00000040829c7778 000000413efd9000
> r12-15 0000000000000000 000000004060d800 0000000010fe3000 0000000010fe3000
> r16-19 000000413edd82a0 00000041078ddbc0 0000000000000010 0000000000000001
> r20-23 0008f3d0d83a8000 0000000000000000 00000040829c7778 0000000000000080
> r24-27 00000001bf349e40 00000001bf349e40 202d66202224640d 00000000405a95e0
> r28-31 202d662022246465 000000413edd88f0 000000413edd89a0 0000000000000001
> sr00-03 000000000532c000 0000000000000000 0000000000000000 000000000532c000
> sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>
> IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000401fe42c 00000000401fe430
> IIR: 539c0030 ISR: 00000000202d6000 IOR: 000006202224647d
> CPU: 3 CR30: 000000413edd8000 CR31: 0000000000000000
> ORIG_R28: 00000000405a95e0
> IAOQ[0]: vma_interval_tree_iter_first+0x14/0x48
> IAOQ[1]: vma_interval_tree_iter_first+0x18/0x48
> RP(r2): flush_dcache_page+0x128/0x388
> Backtrace:
> [<000000004013e6c0>] flush_dcache_page+0x128/0x388
> [<0000000010fe6ca0>] lo_splice_actor+0x90/0x148 [loop]
> [<00000000402579b0>] splice_from_pipe_feed+0xc0/0x1d0
> [<00000000402580a4>] __splice_from_pipe+0xac/0xc0
> [<0000000010fe6bbc>] lo_direct_splice_actor+0x1c/0x70 [loop]
> [<000000004025854c>] splice_direct_to_actor+0xec/0x228
> [<0000000010fe63ac>] lo_receive+0xe4/0x298 [loop]
> [<0000000010fe69d8>] loop_thread+0x478/0x640 [loop]
> [<000000004018975c>] kthread+0x134/0x168
> [<000000004012c020>] end_fault_vector+0x20/0x28
> [<00000000115e0098>] xfs_setsize_buftarg+0x0/0x90 [xfs]
>
> Kernel panic - not syncing: Bad Address (null pointer deref?)
>
> The patch 8456a648cf44f14365f1f44de90a3da2526a4776 changes the page
> structure so that the slab subsystem reuses the page->mapping field.
>
> The crash happens in the following way:
> * XFS allocates some memory from slab and issues a bio to read data into
> it.
> * the bio is sent to the loopback device.
> * lo_receive creates an actor and calls splice_direct_to_actor.
> * lo_splice_actor copies data to the target page.
> * lo_splice_actor calls flush_dcache_page because the page may be mapped
> by userspace. In that case we need to flush the kernel cache.
> * flush_dcache_page asks for the list of userspace mappings, however that
> page->mapping field is reused by the slab subsystem for a different
> purpose. This causes the crash.
>
> Note that other architectures without coherent caches (sparc, arm, mips)
> also call page_mapping from flush_dcache_page, so they may crash in the
> same way.
>
> This patch fixes this bug by testing if the page is a slab page in
> page_mapping and returning NULL if it is.
>
>
> The patch also fixes VM_BUG_ON(PageSlab(page)) that could happen in
> earlier kernels in the same scenario on architectures without cache
> coherence when CONFIG_DEBUG_VM is enabled - so it should be backported to
> stable kernels.
>
>
> In the old kernels, the function page_mapping is placed in
> include/linux/mm.h, so you should modify the patch accordingly when
> backporting it.
>
>
> Signed-off-by: Mikulas Patocka <[email protected]>
> Cc: [email protected]
>
> ---
> mm/util.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> Index: linux-3.13-rc6/mm/util.c
> ===================================================================
> --- linux-3.13-rc6.orig/mm/util.c 2014-01-04 00:06:07.000000000 +0100
> +++ linux-3.13-rc6/mm/util.c 2014-01-04 00:24:42.000000000 +0100
> @@ -390,7 +390,10 @@ struct address_space *page_mapping(struc
> {
> struct address_space *mapping = page->mapping;
>
> - VM_BUG_ON(PageSlab(page));
> + /* This happens if someone calls flush_dcache_page on slab page */
> + if (unlikely(PageSlab(page)))
> + return NULL;
> +
> if (unlikely(PageSwapCache(page))) {
> swp_entry_t entry;
>
> --

Hello,

I'm surprised that this VM_BUG_ON() has not been triggered until now. It was
introduced in 2007 by commit (b5fab14). Maybe nobody tests with
CONFIG_DEBUG_VM.

There is one more bug report about the same problem:
* possible regression on 3.13 when calling flush_dcache_page
(lkml.org/lkml/2013/12/12/255)

As mentioned in the description of commit (b5fab14), a slab object may not be
properly aligned, and using page-oriented functions on such an object can be
dangerous. I searched the XFS code and found that it only allocates multiples
of 512 bytes, so there is no problem for now. But, IMHO, it is better not to
use slab objects for this purpose.

I also quickly checked every call site of page_mapping() and, IMHO, this patch
would work correctly. But reverting the original commit is possibly a better
solution.

Hello, Pekka and Christoph.
Could you advise which direction we should take?

Thanks.

2014-01-06 17:54:42

by Mikulas Patocka

Subject: Re: [PATCH] fix crash when using XFS on loopback

Hi

On Mon, 6 Jan 2014, Joonsoo Kim wrote:

> Hello,
>
> I'm surprised that this VM_BUG_ON() has not been triggered until now. It was
> introduced in 2007 by commit (b5fab14). Maybe there is no person who test
> with CONFIG_DEBUG_VM.

Last time I tried it, PA-RISC didn't work with CONFIG_DEBUG_VM at all.

> There is one more bug report same as this.
> * possible regression on 3.13 when calling flush_dcache_page
> (lkml.org/lkml/2013/12/12/255)

That link doesn't show anything.

> As mentioned in the description of commit (b5fab14), slab object may not be
> properly aligned and use of page oriented function to this object can be
> dangerous. I searched the XFS code and found that they only try to allocate
> multiple of 512 bytes, so there is no problem for now. But, IMHO, it is better
> not to use slab objects for this purpose.

If slab debugging is enabled, kmalloc memory is not aligned.

In XFS in xfs_buf_allocate_memory they test if the kmalloc memory crosses
page boundary - if it does, they free the kmalloc memory and allocate a
full page. Maybe this approach could still run into problems with some
bus-master adapters that assume alignment in hardware...
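
The kind of page-boundary test described above can be sketched as follows.
This is a userspace illustration under assumptions, not the XFS code: the
4096-byte page size, the SKETCH_ macro names, and the crosses_page_boundary
helper are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */
#define SKETCH_PAGE_MASK (~(uintptr_t)(SKETCH_PAGE_SIZE - 1))

/*
 * Returns true if the buffer [addr, addr + size) touches more than one page.
 * size must be greater than zero.
 */
static bool crosses_page_boundary(uintptr_t addr, size_t size)
{
	/* Compare the page of the first byte with the page of the last byte. */
	return ((addr + size - 1) & SKETCH_PAGE_MASK) != (addr & SKETCH_PAGE_MASK);
}
```

A 512-byte buffer at a 512-byte-aligned address never crosses a page, while
the same buffer at an unaligned address (as may happen with slab debugging)
can.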

dm-bufio also does I/O to slab-allocated buffers, but it allocates the
object from slab (not kmalloc) with proper alignment.

> And I rapidly searched every callsites of page_mapping() and, IMHO, this
> patch would work correctly. But possibly reverting original commit is
> better solution.

Reverting the original commit wouldn't fix that VM_BUG_ON.

> Hello, Pekka and Christoph.
> Could you teach me which direction we have to go?
>
> Thanks.

Mikulas

2014-01-07 01:41:50

by Joonsoo Kim

Subject: Re: [PATCH] fix crash when using XFS on loopback

On Mon, Jan 06, 2014 at 12:54:22PM -0500, Mikulas Patocka wrote:
> Hi
>
> On Mon, 6 Jan 2014, Joonsoo Kim wrote:
>
> > Hello,
> >
> > I'm surprised that this VM_BUG_ON() has not been triggered until now. It was
> > introduced in 2007 by commit (b5fab14). Maybe there is no person who test
> > with CONFIG_DEBUG_VM.
>
> Last time I tried it, PA-RISC didn't work with CONFIG_DEBUG_VM at all.
>
> > There is one more bug report same as this.
> > * possible regression on 3.13 when calling flush_dcache_page
> > (lkml.org/lkml/2013/12/12/255)
>
> That link doesn't show anything.
>
> > As mentioned in the description of commit (b5fab14), slab object may not be
> > properly aligned and use of page oriented function to this object can be
> > dangerous. I searched the XFS code and found that they only try to allocate
> > multiple of 512 bytes, so there is no problem for now. But, IMHO, it is better
> > not to use slab objects for this purpose.
>
> If slab debugging is enabled, kmalloc memory is not aligned.
>
> In XFS in xfs_buf_allocate_memory they test if the kmalloc memory crosses
> page boundary - if it does, they free the kmalloc memory and allocate a
> full page. Maybe this approach could still run into problems with some
> bus-master adapters that assume alignment in hardware...
>
>
> dm-bufio also does I/O to slab-allocated buffers, but it allocates the
> object from slab (not kmalloc) with proper alignment.

Hello,

Okay. I see.
Thanks for the good explanation.

>
> > And I rapidly searched every callsites of page_mapping() and, IMHO, this
> > patch would work correctly. But possibly reverting original commit is
> > better solution.
>
> Reverting the original commit wouldn't fix that VM_BUG_ON.

Initially, I thought that the VM_BUG_ON() wasn't wrong and that it was better
to remove the call sites that do I/O with slab-allocated buffers, because
doing I/O with slab-allocated buffers needs great care. So I didn't fully
agree with your patch and yesterday recommended reverting the original
commit. After the revert, I would have attempted to remove the call sites.

But I have now changed my mind because of your explanation. There are already
some users that do I/O with slab-allocated buffers, and they already do it
with some care, so I guess that allowing this usage is more beneficial than
forbidding it.

Reviewed-by: Joonsoo Kim <[email protected]>

Thanks.

2014-01-08 21:05:30

by Helge Deller

Subject: Re: [PATCH] fix crash when using XFS on loopback

On 01/07/2014 02:41 AM, Joonsoo Kim wrote:
> On Mon, Jan 06, 2014 at 12:54:22PM -0500, Mikulas Patocka wrote:
>> Hi
>>
>> On Mon, 6 Jan 2014, Joonsoo Kim wrote:
>>
>>> Hello,
>>>
>>> I'm surprised that this VM_BUG_ON() has not been triggered until now. It was
>>> introduced in 2007 by commit (b5fab14). Maybe there is no person who test
>>> with CONFIG_DEBUG_VM.
>> Last time I tried it, PA-RISC didn't work with CONFIG_DEBUG_VM at all.
>>
>>> There is one more bug report same as this.
>>> * possible regression on 3.13 when calling flush_dcache_page
>>> (lkml.org/lkml/2013/12/12/255)
>> That link doesn't show anything.
>>
>>> As mentioned in the description of commit (b5fab14), slab object may not be
>>> properly aligned and use of page oriented function to this object can be
>>> dangerous. I searched the XFS code and found that they only try to allocate
>>> multiple of 512 bytes, so there is no problem for now. But, IMHO, it is better
>>> not to use slab objects for this purpose.
>> If slab debugging is enabled, kmalloc memory is not aligned.
>>
>> In XFS in xfs_buf_allocate_memory they test if the kmalloc memory crosses
>> page boundary - if it does, they free the kmalloc memory and allocate a
>> full page. Maybe this approach could still run into problems with some
>> bus-master adapters that assume alignment in hardware...
>>
>>
>> dm-bufio also does I/O to slab-allocated buffers, but it allocates the
>> object from slab (not kmalloc) with proper alignment.
> Hello,
>
> Okay. I see.
> Thanks for good explanation.
>
>>> And I rapidly searched every callsites of page_mapping() and, IMHO, this
>>> patch would work correctly. But possibly reverting original commit is
>>> better solution.
>> Reverting the original commit wouldn't fix that VM_BUG_ON.
> Initially, I thought that VM_BUG_ON() isn't wrong and it was better to remove
> the callsites where do I/O with slab-allocated buffers, because doing I/O
> with slab-allocated buffers needs a great care. So I didn't fully agreed with
> your patch and recommended to revert original commit yesterday. After reverting
> that, I would attempt to remove the callsites.
>
> But, now, I change my thought, because of your explanation. There are already
> some users to do I/O with slab-allocated buffers and they already did it with
> some cares, so I guess that admitting this usage is more beneficial than
> forbidding it.
>
> Reviewed-by: Joonsoo Kim <[email protected]>

I can queue up this patch in my next pull-request for the parisc-tree, which
I plan to send tomorrow, unless people want this patch to go via the mm-tree
or similar...
Please let me know.

Helge

2014-01-08 21:37:54

by Pekka Enberg

Subject: Re: [PATCH] fix crash when using XFS on loopback

On Wed, Jan 8, 2014 at 11:05 PM, Helge Deller <[email protected]> wrote:
> On 01/07/2014 02:41 AM, Joonsoo Kim wrote:
>>
>> On Mon, Jan 06, 2014 at 12:54:22PM -0500, Mikulas Patocka wrote:
>>>
>>> Hi
>>>
>>> On Mon, 6 Jan 2014, Joonsoo Kim wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm surprised that this VM_BUG_ON() has not been triggered until now. It
>>>> was
>>>> introduced in 2007 by commit (b5fab14). Maybe there is no person who
>>>> test
>>>> with CONFIG_DEBUG_VM.
>>>
>>> Last time I tried it, PA-RISC didn't work with CONFIG_DEBUG_VM at all.
>>>
>>>> There is one more bug report same as this.
>>>> * possible regression on 3.13 when calling flush_dcache_page
>>>> (lkml.org/lkml/2013/12/12/255)
>>>
>>> That link doesn't show anything.
>>>
>>>> As mentioned in the description of commit (b5fab14), slab object may not
>>>> be
>>>> properly aligned and use of page oriented function to this object can be
>>>> dangerous. I searched the XFS code and found that they only try to
>>>> allocate
>>>> multiple of 512 bytes, so there is no problem for now. But, IMHO, it is
>>>> better
>>>> not to use slab objects for this purpose.
>>>
>>> If slab debugging is enabled, kmalloc memory is not aligned.
>>>
>>> In XFS in xfs_buf_allocate_memory they test if the kmalloc memory crosses
>>> page boundary - if it does, they free the kmalloc memory and allocate a
>>> full page. Maybe this approach could still run into problems with some
>>> bus-master adapters that assume alignment in hardware...
>>>
>>>
>>> dm-bufio also does I/O to slab-allocated buffers, but it allocates the
>>> object from slab (not kmalloc) with proper alignment.
>>
>> Hello,
>>
>> Okay. I see.
>> Thanks for good explanation.
>>
>>>> And I rapidly searched every callsites of page_mapping() and, IMHO, this
>>>> patch would work correctly. But possibly reverting original commit is
>>>> better solution.
>>>
>>> Reverting the original commit wouldn't fix that VM_BUG_ON.
>>
>> Initially, I thought that VM_BUG_ON() isn't wrong and it was better to
>> remove
>> the callsites where do I/O with slab-allocated buffers, because doing I/O
>> with slab-allocated buffers needs a great care. So I didn't fully agreed
>> with
>> your patch and recommended to revert original commit yesterday. After
>> reverting
>> that, I would attempt to remove the callsites.
>>
>> But, now, I change my thought, because of your explanation. There are
>> already
>> some users to do I/O with slab-allocated buffers and they already did it
>> with
>> some cares, so I guess that admitting this usage is more beneficial than
>> forbidding it.
>>
>> Reviewed-by: Joonsoo Kim <[email protected]>
>
>
> I can queue up this patch in my next pull-request for the parisc-tree which
> I plan to
> send tomorrow, unless people want this patch to go via mm-tree or
> similar...
> Please let me know.

The patch looks good to me but it probably should go through Andrew's tree.

Acked-by: Pekka Enberg <[email protected]>

2014-01-08 21:42:11

by Helge Deller

Subject: Re: [PATCH] fix crash when using XFS on loopback

On 01/08/2014 10:37 PM, Pekka Enberg wrote:
> On Wed, Jan 8, 2014 at 11:05 PM, Helge Deller <[email protected]> wrote:
>> On 01/07/2014 02:41 AM, Joonsoo Kim wrote:
>>> On Mon, Jan 06, 2014 at 12:54:22PM -0500, Mikulas Patocka wrote:
>>>> Hi
>>>>
>>>> On Mon, 6 Jan 2014, Joonsoo Kim wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm surprised that this VM_BUG_ON() has not been triggered until now. It
>>>>> was
>>>>> introduced in 2007 by commit (b5fab14). Maybe there is no person who
>>>>> test
>>>>> with CONFIG_DEBUG_VM.
>>>> Last time I tried it, PA-RISC didn't work with CONFIG_DEBUG_VM at all.
>>>>
>>>>> There is one more bug report same as this.
>>>>> * possible regression on 3.13 when calling flush_dcache_page
>>>>> (lkml.org/lkml/2013/12/12/255)
>>>> That link doesn't show anything.
>>>>
>>>>> As mentioned in the description of commit (b5fab14), slab object may not
>>>>> be
>>>>> properly aligned and use of page oriented function to this object can be
>>>>> dangerous. I searched the XFS code and found that they only try to
>>>>> allocate
>>>>> multiple of 512 bytes, so there is no problem for now. But, IMHO, it is
>>>>> better
>>>>> not to use slab objects for this purpose.
>>>> If slab debugging is enabled, kmalloc memory is not aligned.
>>>>
>>>> In XFS in xfs_buf_allocate_memory they test if the kmalloc memory crosses
>>>> page boundary - if it does, they free the kmalloc memory and allocate a
>>>> full page. Maybe this approach could still run into problems with some
>>>> bus-master adapters that assume alignment in hardware...
>>>>
>>>>
>>>> dm-bufio also does I/O to slab-allocated buffers, but it allocates the
>>>> object from slab (not kmalloc) with proper alignment.
>>> Hello,
>>>
>>> Okay. I see.
>>> Thanks for good explanation.
>>>
>>>>> And I rapidly searched every callsites of page_mapping() and, IMHO, this
>>>>> patch would work correctly. But possibly reverting original commit is
>>>>> better solution.
>>>> Reverting the original commit wouldn't fix that VM_BUG_ON.
>>> Initially, I thought that VM_BUG_ON() isn't wrong and it was better to
>>> remove
>>> the callsites where do I/O with slab-allocated buffers, because doing I/O
>>> with slab-allocated buffers needs a great care. So I didn't fully agreed
>>> with
>>> your patch and recommended to revert original commit yesterday. After
>>> reverting
>>> that, I would attempt to remove the callsites.
>>>
>>> But, now, I change my thought, because of your explanation. There are
>>> already
>>> some users to do I/O with slab-allocated buffers and they already did it
>>> with
>>> some cares, so I guess that admitting this usage is more beneficial than
>>> forbidding it.
>>>
>>> Reviewed-by: Joonsoo Kim <[email protected]>
>>
>> I can queue up this patch in my next pull-request for the parisc-tree which
>> I plan to
>> send tomorrow, unless people want this patch to go via mm-tree or
>> similar...
>> Please let me know.
> The patch looks good to me but it probably should go through Andrew's tree.
>
> Acked-by: Pekka Enberg <[email protected]>

Absolutely fine with me. Andrew, can you please pick it up for 3.13?
Thanks,
Helge

2014-01-08 21:59:35

by Andrew Morton

Subject: Re: [PATCH] fix crash when using XFS on loopback

On Wed, 8 Jan 2014 23:37:49 +0200 Pekka Enberg <[email protected]> wrote:

> The patch looks good to me but it probably should go through Andrew's tree.

yup.

page_mapping() will be called quite frequently, and adding a new
test-n-branch in there will be somewhat costly. We might end up with a
better kernel if we were to instead revert 8456a648cf44f. How useful
was that patch?

2014-01-09 00:13:18

by Joonsoo Kim

Subject: Re: [PATCH] fix crash when using XFS on loopback

On Wed, Jan 08, 2014 at 01:59:30PM -0800, Andrew Morton wrote:
> On Wed, 8 Jan 2014 23:37:49 +0200 Pekka Enberg <[email protected]> wrote:
>
> > The patch looks good to me but it probably should go through Andrew's tree.
>
> yup.
>
> page_mapping() will be called quite frequently, and adding a new
> test-n-branch in there will be somewhat costly. We might end up with a
> better kernel if we were to instead revert 8456a648cf44f. How useful
> was that patch?

Hello,

The performance effect of this patch was described in the cover letter, but
I forgot to include it in the patch description. Sorry about that.

In summary, this patch saves some memory and decreases cache-footprint
so that it increases performance.

Here is the description from the cover letter.

Below are some numbers from 'cat /proc/slabinfo'.

* Before *
# name        <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables [snip...]
kmalloc-512             527        600       512            8              1 : tunables  54 27 0 : slabdata  75  75 0
kmalloc-256             210        210       256           15              1 : tunables 120 60 0 : slabdata  14  14 0
kmalloc-192            1040       1040       192           20              1 : tunables 120 60 0 : slabdata  52  52 0
kmalloc-96              750        750       128           30              1 : tunables 120 60 0 : slabdata  25  25 0
kmalloc-64             2773       2773        64           59              1 : tunables 120 60 0 : slabdata  47  47 0
kmalloc-128             660        690       128           30              1 : tunables 120 60 0 : slabdata  23  23 0
kmalloc-32            11200      11200        32          112              1 : tunables 120 60 0 : slabdata 100 100 0
kmem_cache              197        200       192           20              1 : tunables 120 60 0 : slabdata  10  10 0

* After *
# name        <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables [snip...]
kmalloc-512             525        640       512            8              1 : tunables  54 27 0 : slabdata  80  80 0
kmalloc-256             210        210       256           15              1 : tunables 120 60 0 : slabdata  14  14 0
kmalloc-192            1016       1040       192           20              1 : tunables 120 60 0 : slabdata  52  52 0
kmalloc-96              560        620       128           31              1 : tunables 120 60 0 : slabdata  20  20 0
kmalloc-64             2148       2280        64           60              1 : tunables 120 60 0 : slabdata  38  38 0
kmalloc-128             647        682       128           31              1 : tunables 120 60 0 : slabdata  22  22 0
kmalloc-32            11360      11413        32          113              1 : tunables 120 60 0 : slabdata 101 101 0
kmem_cache              197        200       192           20              1 : tunables 120 60 0 : slabdata  10  10 0

kmem_caches whose objects are 128 bytes or smaller hold one more object per
slab. You can see it in the objperslab column.

Here are the performance results on my 4-CPU machine.

* Before *

Performance counter stats for 'perf bench sched messaging -g 50 -l 1000' (10 runs):

238,309,671 cache-misses ( +- 0.40% )

12.010172090 seconds time elapsed ( +- 0.21% )

* After *

Performance counter stats for 'perf bench sched messaging -g 50 -l 1000' (10 runs):

229,945,138 cache-misses ( +- 0.23% )

11.627897174 seconds time elapsed ( +- 0.14% )

Cache misses are reduced by roughly 5% with this patchset,
and elapsed time is also improved by 3.1% relative to the baseline.
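
The relative changes can be recomputed from the counters quoted above. The
helper below is a hypothetical illustration of that arithmetic, not part of
the patchset.

```c
/* Relative reduction from 'before' to 'after', in percent of 'before'. */
static double percent_reduction(double before, double after)
{
	return (before - after) / before * 100.0;
}
```

Applied to the quoted figures, percent_reduction(238309671, 229945138) gives
about 3.5 for cache-misses, and percent_reduction(12.010172090, 11.627897174)
gives about 3.2 for elapsed time.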

Thanks.

2014-01-09 00:19:41

by Andrew Morton

Subject: Re: [PATCH] fix crash when using XFS on loopback

On Thu, 9 Jan 2014 09:13:31 +0900 Joonsoo Kim <[email protected]> wrote:

> On Wed, Jan 08, 2014 at 01:59:30PM -0800, Andrew Morton wrote:
> > On Wed, 8 Jan 2014 23:37:49 +0200 Pekka Enberg <[email protected]> wrote:
> >
> > > The patch looks good to me but it probably should go through Andrew's tree.
> >
> > yup.
> >
> > page_mapping() will be called quite frequently, and adding a new
> > test-n-branch in there will be somewhat costly. We might end up with a
> > better kernel if we were to instead revert 8456a648cf44f. How useful
> > was that patch?
>
> Hello,
>
> Performance effect of this patch was described in the cover-letter, but
> I missed to attach it to patch description. Sorry about that.
>
> In summary, this patch saves some memory and decreases cache-footprint
> so that it increases performance.
>
> Here goes the description in cover-letter.
>
> ...
>
> This patchset reduces cache-misses by roughly 5%.
> Elapsed time also improves by 3.1% relative to the baseline.

ah, OK, thanks, useful. A few instructions added to page_mapping()
won't have effects like that!

2014-01-09 08:35:19

by Pekka Enberg

Subject: Re: [PATCH] fix crash when using XFS on loopback

On Thu, Jan 9, 2014 at 2:19 AM, Andrew Morton <[email protected]> wrote:
>> This patchset reduces cache-misses by roughly 5%.
>> Elapsed time also improves by 3.1% relative to the baseline.
>
> ah, OK, thanks, useful. A few instructions added to page_mapping()
> won't have effects like that!

Yup, I merged the series because the numbers were so impressive.

There's a link to the cover letter in merge commit 24f971a but it
would have been better to include the numbers in the changelog itself.

Pekka

2014-01-09 08:50:04

by Simon Baatz

Subject: Re: [PATCH] fix crash when using XFS on loopback

Hi Mikulas,

On Sat, Jan 04, 2014 at 12:45:45PM -0500, Mikulas Patocka wrote:
> The patch 8456a648cf44f14365f1f44de90a3da2526a4776 causes crash in the
> LVM2 testsuite on PA-RISC (the crashing test is fsadm.sh). The testsuite
> doesn't crash on 3.12, crashes on 3.13-rc1 and later.
>
> Bad Address (null pointer deref?): Code=15 regs=000000413edd89a0 (Addr=000006202224647d)
> CPU: 3 PID: 24008 Comm: loop0 Not tainted 3.13.0-rc6 #5
> task: 00000001bf3c0048 ti: 000000413edd8000 task.ti: 000000413edd8000
>
> YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
> PSW: 00001000000001101111100100001110 Not tainted
> r00-03 000000ff0806f90e 00000000405c8de0 000000004013e6c0 000000413edd83f0
> r04-07 00000000405a95e0 0000000000000200 00000001414735f0 00000001bf349e40
> r08-11 0000000010fe3d10 0000000000000001 00000040829c7778 000000413efd9000
> r12-15 0000000000000000 000000004060d800 0000000010fe3000 0000000010fe3000
> r16-19 000000413edd82a0 00000041078ddbc0 0000000000000010 0000000000000001
> r20-23 0008f3d0d83a8000 0000000000000000 00000040829c7778 0000000000000080
> r24-27 00000001bf349e40 00000001bf349e40 202d66202224640d 00000000405a95e0
> r28-31 202d662022246465 000000413edd88f0 000000413edd89a0 0000000000000001
> sr00-03 000000000532c000 0000000000000000 0000000000000000 000000000532c000
> sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>
> IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000401fe42c 00000000401fe430
> IIR: 539c0030 ISR: 00000000202d6000 IOR: 000006202224647d
> CPU: 3 CR30: 000000413edd8000 CR31: 0000000000000000
> ORIG_R28: 00000000405a95e0
> IAOQ[0]: vma_interval_tree_iter_first+0x14/0x48
> IAOQ[1]: vma_interval_tree_iter_first+0x18/0x48
> RP(r2): flush_dcache_page+0x128/0x388
> Backtrace:
> [<000000004013e6c0>] flush_dcache_page+0x128/0x388
> [<0000000010fe6ca0>] lo_splice_actor+0x90/0x148 [loop]
> [<00000000402579b0>] splice_from_pipe_feed+0xc0/0x1d0
> [<00000000402580a4>] __splice_from_pipe+0xac/0xc0
> [<0000000010fe6bbc>] lo_direct_splice_actor+0x1c/0x70 [loop]
> [<000000004025854c>] splice_direct_to_actor+0xec/0x228
> [<0000000010fe63ac>] lo_receive+0xe4/0x298 [loop]
> [<0000000010fe69d8>] loop_thread+0x478/0x640 [loop]
> [<000000004018975c>] kthread+0x134/0x168
> [<000000004012c020>] end_fault_vector+0x20/0x28
> [<00000000115e0098>] xfs_setsize_buftarg+0x0/0x90 [xfs]
>
> Kernel panic - not syncing: Bad Address (null pointer deref?)
>
> The patch 8456a648cf44f14365f1f44de90a3da2526a4776 changes the page
> structure so that the slab subsystem reuses the page->mapping field.
>
> The crash happens in the following way:
> * XFS allocates some memory from slab and issues a bio to read data into
> it.
> * the bio is sent to the loopback device.
> * lo_receive creates an actor and calls splice_direct_to_actor.
> * lo_splice_actor copies data to the target page.
> * lo_splice_actor calls flush_dcache_page because the page may be mapped
> by userspace. In that case we need to flush the kernel cache.
> * flush_dcache_page asks for the list of userspace mappings, however that
> page->mapping field is reused by the slab subsystem for a different
> purpose. This causes the crash.
>
> Note that other architectures without coherent caches (sparc, arm, mips)
> also call page_mapping from flush_dcache_page, so they may crash in the
> same way.
>
> This patch fixes this bug by testing if the page is a slab page in
> page_mapping and returning NULL if it is.
>
>
> The patch also fixes VM_BUG_ON(PageSlab(page)) that could happen in
> earlier kernels in the same scenario on architectures without cache
> coherence when CONFIG_DEBUG_VM is enabled - so it should be backported to
> stable kernels.
>
>
> In the old kernels, the function page_mapping is placed in
> include/linux/mm.h, so you should modify the patch accordingly when
> backporting it.
>
>
> Signed-off-by: Mikulas Patocka <[email protected]>
> Cc: [email protected]
>
> ---
> mm/util.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> Index: linux-3.13-rc6/mm/util.c
> ===================================================================
> --- linux-3.13-rc6.orig/mm/util.c 2014-01-04 00:06:07.000000000 +0100
> +++ linux-3.13-rc6/mm/util.c 2014-01-04 00:24:42.000000000 +0100
> @@ -390,7 +390,10 @@ struct address_space *page_mapping(struc
>  {
>  	struct address_space *mapping = page->mapping;
>
> -	VM_BUG_ON(PageSlab(page));
> +	/* This happens if someone calls flush_dcache_page on slab page */
> +	if (unlikely(PageSlab(page)))
> +		return NULL;
> +
>  	if (unlikely(PageSwapCache(page))) {
>  		swp_entry_t entry;

I don't think that this is the correct fix. According to cachetlb.txt
flush_(kernel_)dcache_page() is not supposed to be called with a slab
page in the first place. There is code in the kernel to avoid that
(see for example the discussion in [1] and [2]).

Also on ARM, page_mapping() == NULL results in
flush_(kernel_)dcache_page() assuming that the page is an anon page.
Consequently, it would flush the slab page, which makes no sense.
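
The effect can be illustrated with a small user-space model. This is
only a sketch under stated assumptions: struct page_model,
model_page_mapping() and model_flush_dcache_page() are hypothetical
stand-ins rather than kernel definitions, and the flush logic is a
simplified reading of the ARM behaviour described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; none of this is kernel code. */
struct address_space_model {
    int nr_user_mappings;
};

struct page_model {
    bool slab;      /* models PageSlab(page) */
    void *mapping;  /* models page->mapping, which slab reuses */
};

enum flush_action { FLUSH_ANON, FLUSH_FILE };

/* Models page_mapping() with the proposed check: slab pages report NULL. */
static struct address_space_model *
model_page_mapping(const struct page_model *page)
{
    assert(page != NULL);
    if (page->slab)
        return NULL;
    return page->mapping;
}

/* Models the assumption that "no mapping" means an anon page to flush. */
static enum flush_action
model_flush_dcache_page(const struct page_model *page)
{
    if (model_page_mapping(page) == NULL)
        return FLUSH_ANON;
    return FLUSH_FILE;
}
```

In this model a slab page no longer reaches the bogus mapping pointer,
but model_flush_dcache_page() still returns FLUSH_ANON for it, i.e. the
slab page is flushed as if it were an anon page.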

Thus, I think we either need to add the check to the original caller
of flush_dcache_page(), or allow flush_(kernel_)dcache_page() to be
called with slab pages and put the check there (this was proposed by
Russell King once [3], but would affect multiple architectures).
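
The two placements can be contrasted in a user-space sketch. All names
here (struct page_model, model_flush_unchecked(),
model_caller_with_check(), model_flush_checked()) are hypothetical and
only model the idea; they are not the loop driver's or any
architecture's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: `flushes` counts how often the page was flushed. */
struct page_model {
    bool slab;    /* models PageSlab(page) */
    int flushes;
};

/* Option 1: the flush routine assumes it is never given a slab page... */
static void model_flush_unchecked(struct page_model *page)
{
    assert(page != NULL);
    page->flushes++;
}

/* ...so the caller has to filter slab pages out before flushing. */
static void model_caller_with_check(struct page_model *page)
{
    if (!page->slab)
        model_flush_unchecked(page);
}

/* Option 2: the flush routine itself accepts slab pages and skips them. */
static void model_flush_checked(struct page_model *page)
{
    assert(page != NULL);
    if (page->slab)
        return;
    page->flushes++;
}
```

Both variants leave a slab page unflushed. The difference is where the
test lives: in every caller for option 1, or once per architecture's
flush routine for option 2.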

- Simon

[1] https://lkml.org/lkml/2013/10/24/414
[2] https://lkml.org/lkml/2013/10/28/432
[3] https://lkml.org/lkml/2013/10/27/89