2015-11-18 15:53:06

by Jeff Moyer

Subject: dax pmd fault handler never returns to userspace

Hi,

When running the nvml library's test suite against an ext4 file system
mounted with -o dax, I ran into an issue where many of the tests would
simply time out. The problem appears to be that the pmd fault handler
never returns to userspace (the application is doing a memcpy of 512
bytes into pmem). Here's the 'perf report -g' output:

- 88.30% 0.01% blk_non_zero.st libc-2.17.so [.] __memmove_ssse3_back
- 88.30% __memmove_ssse3_back
- 66.63% page_fault
- 66.47% do_page_fault
- 66.16% __do_page_fault
- 63.38% handle_mm_fault
- 61.15% ext4_dax_pmd_fault
- 45.04% __dax_pmd_fault
- 37.05% vmf_insert_pfn_pmd
- track_pfn_insert
- 35.58% lookup_memtype
- 33.80% pat_pagerange_is_ram
- 33.40% walk_system_ram_range
- 31.63% find_next_iomem_res
21.78% strcmp

And here's 'perf top':

Samples: 2M of event 'cycles:pp', Event count (approx.): 56080150519
Overhead Shared Object Symbol
22.55% [kernel] [k] strcmp
20.33% [unknown] [k] 0x00007f9f549ef3f3
10.01% [kernel] [k] native_irq_return_iret
9.54% [kernel] [k] find_next_iomem_res
3.00% [jbd2] [k] start_this_handle

This is easily reproduced by doing the following:

git clone https://github.com/pmem/nvml.git
cd nvml
make
make test
cd src/test/blk_non_zero
./blk_non_zero.static-nondebug 512 /path/to/ext4/dax/fs/testfile1 c 1073741824 w:0
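
For reference, the failing access boils down to a small memcpy into a DAX-backed
shared mapping; a minimal standalone sketch of that pattern (not the nvml test
itself, and the path and sizes are placeholders) looks roughly like this:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1024UL * 1024 * 1024;	/* 1 GiB file, as created above */
	char buf[512] = { 0 };
	char *dst;
	int fd;

	/* placeholder path on the ext4 -o dax mount */
	fd = open("/path/to/ext4/dax/fs/testfile1", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	dst = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (dst == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* the 512-byte store that never returns from the pmd fault handler */
	memcpy(dst, buf, sizeof(buf));

	munmap(dst, len);
	close(fd);
	return 0;
}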

I also ran the test suite against xfs, and the problem is not present
there. However, I did not verify that the xfs tests were getting pmd
faults.

I'm happy to help diagnose the problem further, if necessary.

Cheers,
Jeff


2015-11-18 15:56:46

by Zwisler, Ross

Subject: Re: dax pmd fault handler never returns to userspace

On Wed, 2015-11-18 at 10:53 -0500, Jeff Moyer wrote:
> Hi,
>
> When running the nvml library's test suite against an ext4 file system
> mounted with -o dax, I ran into an issue where many of the tests would
> simply timeout. The problem appears to be that the pmd fault handler
> never returns to userspace (the application is doing a memcpy of 512
> bytes into pmem). Here's the 'perf report -g' output:
>
> - 88.30% 0.01% blk_non_zero.st libc-2.17.so [.] __memmove_ssse3_back
> - 88.30% __memmove_ssse3_back
> - 66.63% page_fault
> - 66.47% do_page_fault
> - 66.16% __do_page_fault
> - 63.38% handle_mm_fault
> - 61.15% ext4_dax_pmd_fault
> - 45.04% __dax_pmd_fault
> - 37.05% vmf_insert_pfn_pmd
> - track_pfn_insert
> - 35.58% lookup_memtype
> - 33.80% pat_pagerange_is_ram
> - 33.40% walk_system_ram_range
> - 31.63% find_next_iomem_res
> 21.78% strcmp
>
> And here's 'perf top':
>
> Samples: 2M of event 'cycles:pp', Event count (approx.): 56080150519
> Overhead Shared Object Symbol
> 22.55% [kernel] [k] strcmp
> 20.33% [unknown] [k] 0x00007f9f549ef3f3
> 10.01% [kernel] [k] native_irq_return_iret
> 9.54% [kernel] [k] find_next_iomem_res
> 3.00% [jbd2] [k] start_this_handle
>
> This is easily reproduced by doing the following:
>
> git clone https://github.com/pmem/nvml.git
> cd nvml
> make
> make test
> cd src/test/blk_non_zero
> ./blk_non_zero.static-nondebug 512 /path/to/ext4/dax/fs/testfile1 c 1073741824 w:0
>
> I also ran the test suite against xfs, and the problem is not present
> there. However, I did not verify that the xfs tests were getting pmd
> faults.
>
> I'm happy to help diagnose the problem further, if necessary.

Thanks for the report, I'll take a look.

- Ross


2015-11-18 16:52:59

by Dan Williams

Subject: Re: dax pmd fault handler never returns to userspace

On Wed, Nov 18, 2015 at 7:53 AM, Jeff Moyer <[email protected]> wrote:
> Hi,
>
> When running the nvml library's test suite against an ext4 file system
> mounted with -o dax, I ran into an issue where many of the tests would
> simply timeout. The problem appears to be that the pmd fault handler
> never returns to userspace (the application is doing a memcpy of 512
> bytes into pmem). Here's the 'perf report -g' output:
>
> - 88.30% 0.01% blk_non_zero.st libc-2.17.so [.] __memmove_ssse3_back
> - 88.30% __memmove_ssse3_back
> - 66.63% page_fault
> - 66.47% do_page_fault
> - 66.16% __do_page_fault
> - 63.38% handle_mm_fault
> - 61.15% ext4_dax_pmd_fault
> - 45.04% __dax_pmd_fault
> - 37.05% vmf_insert_pfn_pmd
> - track_pfn_insert
> - 35.58% lookup_memtype
> - 33.80% pat_pagerange_is_ram
> - 33.40% walk_system_ram_range
> - 31.63% find_next_iomem_res
> 21.78% strcmp
>
> And here's 'perf top':
>
> Samples: 2M of event 'cycles:pp', Event count (approx.): 56080150519
> Overhead Shared Object Symbol
> 22.55% [kernel] [k] strcmp
> 20.33% [unknown] [k] 0x00007f9f549ef3f3
> 10.01% [kernel] [k] native_irq_return_iret
> 9.54% [kernel] [k] find_next_iomem_res
> 3.00% [jbd2] [k] start_this_handle
>
> This is easily reproduced by doing the following:
>
> git clone https://github.com/pmem/nvml.git
> cd nvml
> make
> make test
> cd src/test/blk_non_zero
> ./blk_non_zero.static-nondebug 512 /path/to/ext4/dax/fs/testfile1 c 1073741824 w:0
>
> I also ran the test suite against xfs, and the problem is not present
> there. However, I did not verify that the xfs tests were getting pmd
> faults.
>
> I'm happy to help diagnose the problem further, if necessary.

Sysrq-t or sysrq-w dump? Also do you have the locking fix from Yigal?

https://lists.01.org/pipermail/linux-nvdimm/2015-November/002842.html

2015-11-18 17:00:14

by Ross Zwisler

Subject: Re: dax pmd fault handler never returns to userspace

On Wed, Nov 18, 2015 at 08:52:59AM -0800, Dan Williams wrote:
> On Wed, Nov 18, 2015 at 7:53 AM, Jeff Moyer <[email protected]> wrote:
> > Hi,
> >
> > When running the nvml library's test suite against an ext4 file system
> > mounted with -o dax, I ran into an issue where many of the tests would
> > simply timeout. The problem appears to be that the pmd fault handler
> > never returns to userspace (the application is doing a memcpy of 512
> > bytes into pmem). Here's the 'perf report -g' output:
> >
> > - 88.30% 0.01% blk_non_zero.st libc-2.17.so [.] __memmove_ssse3_back
> > - 88.30% __memmove_ssse3_back
> > - 66.63% page_fault
> > - 66.47% do_page_fault
> > - 66.16% __do_page_fault
> > - 63.38% handle_mm_fault
> > - 61.15% ext4_dax_pmd_fault
> > - 45.04% __dax_pmd_fault
> > - 37.05% vmf_insert_pfn_pmd
> > - track_pfn_insert
> > - 35.58% lookup_memtype
> > - 33.80% pat_pagerange_is_ram
> > - 33.40% walk_system_ram_range
> > - 31.63% find_next_iomem_res
> > 21.78% strcmp
> >
> > And here's 'perf top':
> >
> > Samples: 2M of event 'cycles:pp', Event count (approx.): 56080150519
> > Overhead Shared Object Symbol
> > 22.55% [kernel] [k] strcmp
> > 20.33% [unknown] [k] 0x00007f9f549ef3f3
> > 10.01% [kernel] [k] native_irq_return_iret
> > 9.54% [kernel] [k] find_next_iomem_res
> > 3.00% [jbd2] [k] start_this_handle
> >
> > This is easily reproduced by doing the following:
> >
> > git clone https://github.com/pmem/nvml.git
> > cd nvml
> > make
> > make test
> > cd src/test/blk_non_zero
> > ./blk_non_zero.static-nondebug 512 /path/to/ext4/dax/fs/testfile1 c 1073741824 w:0
> >
> > I also ran the test suite against xfs, and the problem is not present
> > there. However, I did not verify that the xfs tests were getting pmd
> > faults.
> >
> > I'm happy to help diagnose the problem further, if necessary.
>
> Sysrq-t or sysrq-w dump? Also do you have the locking fix from Yigal?
>
> https://lists.01.org/pipermail/linux-nvdimm/2015-November/002842.html

I was able to reproduce the issue in my setup with v4.3, and the patch from
Yigal seems to solve it. Jeff, can you confirm?

2015-11-18 17:43:01

by Jeff Moyer

Subject: Re: dax pmd fault handler never returns to userspace

Ross Zwisler <[email protected]> writes:

> On Wed, Nov 18, 2015 at 08:52:59AM -0800, Dan Williams wrote:
>> Sysrq-t or sysrq-w dump? Also do you have the locking fix from Yigal?
>>
>> https://lists.01.org/pipermail/linux-nvdimm/2015-November/002842.html
>
> I was able to reproduce the issue in my setup with v4.3, and the patch from
> Yigal seems to solve it. Jeff, can you confirm?

I applied the patch from Yigal and the symptoms persist. Ross, what are
you testing on? I'm using an NVDIMM-N.

Dan, here's sysrq-l (which is what w used to look like, I think). Only
cpu 3 is interesting:

[ 825.339264] NMI backtrace for cpu 3
[ 825.356347] CPU: 3 PID: 13555 Comm: blk_non_zero.st Not tainted 4.4.0-rc1+ #17
[ 825.392056] Hardware name: HP ProLiant DL380 Gen9, BIOS P89 06/09/2015
[ 825.424472] task: ffff880465bf6a40 ti: ffff88046133c000 task.ti: ffff88046133c000
[ 825.461480] RIP: 0010:[<ffffffff81329856>] [<ffffffff81329856>] strcmp+0x6/0x30
[ 825.497916] RSP: 0000:ffff88046133fbc8 EFLAGS: 00000246
[ 825.524836] RAX: 0000000000000000 RBX: ffff880c7fffd7c0 RCX: 000000076c800000
[ 825.566847] RDX: 000000076c800fff RSI: ffffffff818ea1c8 RDI: ffffffff818ea1c8
[ 825.605265] RBP: ffff88046133fbc8 R08: 0000000000000001 R09: ffff8804652300c0
[ 825.643628] R10: 00007f1b4fe0b000 R11: ffff880465230228 R12: ffffffff818ea1bd
[ 825.681381] R13: 0000000000000001 R14: ffff88046133fc20 R15: 0000000080000200
[ 825.718607] FS: 00007f1b5102d880(0000) GS:ffff88046f8c0000(0000) knlGS:0000000000000000
[ 825.761663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 825.792213] CR2: 00007f1b4fe0b000 CR3: 000000046b225000 CR4: 00000000001406e0
[ 825.830906] Stack:
[ 825.841235] ffff88046133fc10 ffffffff81084610 000000076c800000 000000076c800fff
[ 825.879533] 000000076c800fff 00000000ffffffff ffff88046133fc90 ffffffff8106d1d0
[ 825.916774] 000000000000000c ffff88046133fc80 ffffffff81084f0d 000000076c800000
[ 825.953220] Call Trace:
[ 825.965386] [<ffffffff81084610>] find_next_iomem_res+0xd0/0x130
[ 825.996804] [<ffffffff8106d1d0>] ? pat_enabled+0x20/0x20
[ 826.024773] [<ffffffff81084f0d>] walk_system_ram_range+0x8d/0xf0
[ 826.055565] [<ffffffff8106d2d8>] pat_pagerange_is_ram+0x78/0xa0
[ 826.088971] [<ffffffff8106d475>] lookup_memtype+0x35/0xc0
[ 826.121385] [<ffffffff8106e33b>] track_pfn_insert+0x2b/0x60
[ 826.154600] [<ffffffff811e5523>] vmf_insert_pfn_pmd+0xb3/0x210
[ 826.187992] [<ffffffff8124acab>] __dax_pmd_fault+0x3cb/0x610
[ 826.221337] [<ffffffffa0769910>] ? ext4_dax_mkwrite+0x20/0x20 [ext4]
[ 826.259190] [<ffffffffa0769a4d>] ext4_dax_pmd_fault+0xcd/0x100 [ext4]
[ 826.293414] [<ffffffff811b0af7>] handle_mm_fault+0x3b7/0x510
[ 826.323763] [<ffffffff81068f98>] __do_page_fault+0x188/0x3f0
[ 826.358186] [<ffffffff81069230>] do_page_fault+0x30/0x80
[ 826.391212] [<ffffffff8169c148>] page_fault+0x28/0x30
[ 826.420752] Code: 89 e5 74 09 48 83 c2 01 80 3a 00 75 f7 48 83 c6 01 0f b6 4e ff 48 83
c2 01 84 c9 88 4a ff 75 ed 5d c3 0f 1f 00 55 48 89 e5 eb 04 <84> c0 74 18 48 83 c7 01 0f
b6 47 ff 48 83 c6 01 3a 46 ff 74 eb

The full output is large (48 cpus), so I'm going to be lazy and not
cut-n-paste it here.

Cheers,
Jeff

2015-11-18 18:10:45

by Dan Williams

Subject: Re: dax pmd fault handler never returns to userspace

On Wed, Nov 18, 2015 at 9:43 AM, Jeff Moyer <[email protected]> wrote:
> Ross Zwisler <[email protected]> writes:
>
>> On Wed, Nov 18, 2015 at 08:52:59AM -0800, Dan Williams wrote:
>>> Sysrq-t or sysrq-w dump? Also do you have the locking fix from Yigal?
>>>
>>> https://lists.01.org/pipermail/linux-nvdimm/2015-November/002842.html
>>
>> I was able to reproduce the issue in my setup with v4.3, and the patch from
>> Yigal seems to solve it. Jeff, can you confirm?
>
> I applied the patch from Yigal and the symptoms persist. Ross, what are
> you testing on? I'm using an NVDIMM-N.
>
> Dan, here's sysrq-l (which is what w used to look like, I think). Only
> cpu 3 is interesting:
>
> [ 825.339264] NMI backtrace for cpu 3
> [ 825.356347] CPU: 3 PID: 13555 Comm: blk_non_zero.st Not tainted 4.4.0-rc1+ #17
> [ 825.392056] Hardware name: HP ProLiant DL380 Gen9, BIOS P89 06/09/2015
> [ 825.424472] task: ffff880465bf6a40 ti: ffff88046133c000 task.ti: ffff88046133c000
> [ 825.461480] RIP: 0010:[<ffffffff81329856>] [<ffffffff81329856>] strcmp+0x6/0x30
> [ 825.497916] RSP: 0000:ffff88046133fbc8 EFLAGS: 00000246
> [ 825.524836] RAX: 0000000000000000 RBX: ffff880c7fffd7c0 RCX: 000000076c800000
> [ 825.566847] RDX: 000000076c800fff RSI: ffffffff818ea1c8 RDI: ffffffff818ea1c8
> [ 825.605265] RBP: ffff88046133fbc8 R08: 0000000000000001 R09: ffff8804652300c0
> [ 825.643628] R10: 00007f1b4fe0b000 R11: ffff880465230228 R12: ffffffff818ea1bd
> [ 825.681381] R13: 0000000000000001 R14: ffff88046133fc20 R15: 0000000080000200
> [ 825.718607] FS: 00007f1b5102d880(0000) GS:ffff88046f8c0000(0000) knlGS:00000000000000
> 00
> [ 825.761663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 825.792213] CR2: 00007f1b4fe0b000 CR3: 000000046b225000 CR4: 00000000001406e0
> [ 825.830906] Stack:
> [ 825.841235] ffff88046133fc10 ffffffff81084610 000000076c800000 000000076c800fff
> [ 825.879533] 000000076c800fff 00000000ffffffff ffff88046133fc90 ffffffff8106d1d0
> [ 825.916774] 000000000000000c ffff88046133fc80 ffffffff81084f0d 000000076c800000
> [ 825.953220] Call Trace:
> [ 825.965386] [<ffffffff81084610>] find_next_iomem_res+0xd0/0x130
> [ 825.996804] [<ffffffff8106d1d0>] ? pat_enabled+0x20/0x20
> [ 826.024773] [<ffffffff81084f0d>] walk_system_ram_range+0x8d/0xf0
> [ 826.055565] [<ffffffff8106d2d8>] pat_pagerange_is_ram+0x78/0xa0
> [ 826.088971] [<ffffffff8106d475>] lookup_memtype+0x35/0xc0
> [ 826.121385] [<ffffffff8106e33b>] track_pfn_insert+0x2b/0x60
> [ 826.154600] [<ffffffff811e5523>] vmf_insert_pfn_pmd+0xb3/0x210
> [ 826.187992] [<ffffffff8124acab>] __dax_pmd_fault+0x3cb/0x610
> [ 826.221337] [<ffffffffa0769910>] ? ext4_dax_mkwrite+0x20/0x20 [ext4]
> [ 826.259190] [<ffffffffa0769a4d>] ext4_dax_pmd_fault+0xcd/0x100 [ext4]
> [ 826.293414] [<ffffffff811b0af7>] handle_mm_fault+0x3b7/0x510
> [ 826.323763] [<ffffffff81068f98>] __do_page_fault+0x188/0x3f0
> [ 826.358186] [<ffffffff81069230>] do_page_fault+0x30/0x80
> [ 826.391212] [<ffffffff8169c148>] page_fault+0x28/0x30
> [ 826.420752] Code: 89 e5 74 09 48 83 c2 01 80 3a 00 75 f7 48 83 c6 01 0f b6 4e ff 48 83
> c2 01 84 c9 88 4a ff 75 ed 5d c3 0f 1f 00 55 48 89 e5 eb 04 <84> c0 74 18 48 83 c7 01 0f
> b6 47 ff 48 83 c6 01 3a 46 ff 74 eb

Hmm, a loop in the resource sibling list?

What does /proc/iomem say?

Not related to this bug, but lookup_memtype() looks broken for pmd
mappings as we only check for PAGE_SIZE instead of HPAGE_SIZE. Which
will cause problems if we're straddling the end of memory.

> The full output is large (48 cpus), so I'm going to be lazy and not
> cut-n-paste it here.

Thanks for that ;-)

2015-11-18 18:23:20

by Ross Zwisler

Subject: Re: dax pmd fault handler never returns to userspace

On Wed, Nov 18, 2015 at 10:10:45AM -0800, Dan Williams wrote:
> On Wed, Nov 18, 2015 at 9:43 AM, Jeff Moyer <[email protected]> wrote:
> > Ross Zwisler <[email protected]> writes:
> >
> >> On Wed, Nov 18, 2015 at 08:52:59AM -0800, Dan Williams wrote:
> >>> Sysrq-t or sysrq-w dump? Also do you have the locking fix from Yigal?
> >>>
> >>> https://lists.01.org/pipermail/linux-nvdimm/2015-November/002842.html
> >>
> >> I was able to reproduce the issue in my setup with v4.3, and the patch from
> >> Yigal seems to solve it. Jeff, can you confirm?
> >
> > I applied the patch from Yigal and the symptoms persist. Ross, what are
> > you testing on? I'm using an NVDIMM-N.
> >
> > Dan, here's sysrq-l (which is what w used to look like, I think). Only
> > cpu 3 is interesting:
> >
> > [ 825.339264] NMI backtrace for cpu 3
> > [ 825.356347] CPU: 3 PID: 13555 Comm: blk_non_zero.st Not tainted 4.4.0-rc1+ #17
> > [ 825.392056] Hardware name: HP ProLiant DL380 Gen9, BIOS P89 06/09/2015
> > [ 825.424472] task: ffff880465bf6a40 ti: ffff88046133c000 task.ti: ffff88046133c000
> > [ 825.461480] RIP: 0010:[<ffffffff81329856>] [<ffffffff81329856>] strcmp+0x6/0x30
> > [ 825.497916] RSP: 0000:ffff88046133fbc8 EFLAGS: 00000246
> > [ 825.524836] RAX: 0000000000000000 RBX: ffff880c7fffd7c0 RCX: 000000076c800000
> > [ 825.566847] RDX: 000000076c800fff RSI: ffffffff818ea1c8 RDI: ffffffff818ea1c8
> > [ 825.605265] RBP: ffff88046133fbc8 R08: 0000000000000001 R09: ffff8804652300c0
> > [ 825.643628] R10: 00007f1b4fe0b000 R11: ffff880465230228 R12: ffffffff818ea1bd
> > [ 825.681381] R13: 0000000000000001 R14: ffff88046133fc20 R15: 0000000080000200
> > [ 825.718607] FS: 00007f1b5102d880(0000) GS:ffff88046f8c0000(0000) knlGS:00000000000000
> > 00
> > [ 825.761663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 825.792213] CR2: 00007f1b4fe0b000 CR3: 000000046b225000 CR4: 00000000001406e0
> > [ 825.830906] Stack:
> > [ 825.841235] ffff88046133fc10 ffffffff81084610 000000076c800000 000000076c800fff
> > [ 825.879533] 000000076c800fff 00000000ffffffff ffff88046133fc90 ffffffff8106d1d0
> > [ 825.916774] 000000000000000c ffff88046133fc80 ffffffff81084f0d 000000076c800000
> > [ 825.953220] Call Trace:
> > [ 825.965386] [<ffffffff81084610>] find_next_iomem_res+0xd0/0x130
> > [ 825.996804] [<ffffffff8106d1d0>] ? pat_enabled+0x20/0x20
> > [ 826.024773] [<ffffffff81084f0d>] walk_system_ram_range+0x8d/0xf0
> > [ 826.055565] [<ffffffff8106d2d8>] pat_pagerange_is_ram+0x78/0xa0
> > [ 826.088971] [<ffffffff8106d475>] lookup_memtype+0x35/0xc0
> > [ 826.121385] [<ffffffff8106e33b>] track_pfn_insert+0x2b/0x60
> > [ 826.154600] [<ffffffff811e5523>] vmf_insert_pfn_pmd+0xb3/0x210
> > [ 826.187992] [<ffffffff8124acab>] __dax_pmd_fault+0x3cb/0x610
> > [ 826.221337] [<ffffffffa0769910>] ? ext4_dax_mkwrite+0x20/0x20 [ext4]
> > [ 826.259190] [<ffffffffa0769a4d>] ext4_dax_pmd_fault+0xcd/0x100 [ext4]
> > [ 826.293414] [<ffffffff811b0af7>] handle_mm_fault+0x3b7/0x510
> > [ 826.323763] [<ffffffff81068f98>] __do_page_fault+0x188/0x3f0
> > [ 826.358186] [<ffffffff81069230>] do_page_fault+0x30/0x80
> > [ 826.391212] [<ffffffff8169c148>] page_fault+0x28/0x30
> > [ 826.420752] Code: 89 e5 74 09 48 83 c2 01 80 3a 00 75 f7 48 83 c6 01 0f b6 4e ff 48 83
> > c2 01 84 c9 88 4a ff 75 ed 5d c3 0f 1f 00 55 48 89 e5 eb 04 <84> c0 74 18 48 83 c7 01 0f
> > b6 47 ff 48 83 c6 01 3a 46 ff 74 eb
>
> Hmm, a loop in the resource sibling list?
>
> What does /proc/iomem say?
>
> Not related to this bug, but lookup_memtype() looks broken for pmd
> mappings as we only check for PAGE_SIZE instead of HPAGE_SIZE. Which
> will cause problems if we're straddling the end of memory.
>
> > The full output is large (48 cpus), so I'm going to be lazy and not
> > cut-n-paste it here.
>
> Thanks for that ;-)

Yea, my first round of testing was broken, sorry about that.

It looks like this test causes the PMD fault handler to be called repeatedly
over and over until you kill the userspace process. This doesn't happen for
XFS because when using XFS this test doesn't hit PMD faults, only PTE faults.

So, looks like a livelock as far as I can tell.

Still debugging.

2015-11-18 18:30:57

by Jeff Moyer

Subject: Re: dax pmd fault handler never returns to userspace

Dan Williams <[email protected]> writes:

> Hmm, a loop in the resource sibling list?
>
> What does /proc/iomem say?

Inline, below.

-Jeff

# cat /proc/iomem
00000000-00000fff : reserved
00001000-00092fff : System RAM
00093000-00093fff : reserved
00094000-0009ffff : System RAM
000a0000-000bffff : PCI Bus 0000:00
000c4000-000cbfff : PCI Bus 0000:00
000f0000-000fffff : System ROM
00100000-6b4fcfff : System RAM
01000000-0169edea : Kernel code
0169edeb-01b3507f : Kernel data
01cf5000-0200ffff : Kernel bss
6b4fd000-6b97cfff : reserved
6b97d000-6b97dfff : System RAM
6b97e000-6b9fefff : reserved
6b9ff000-76d48017 : System RAM
76d48018-76d4d457 : System RAM
76d4d458-76d4e017 : System RAM
76d4e018-76d7dc57 : System RAM
76d7dc58-76d7e017 : System RAM
76d7e018-76dadc57 : System RAM
76dadc58-76dae017 : System RAM
76dae018-76dddc57 : System RAM
76dddc58-76dde017 : System RAM
76dde018-76e0dc57 : System RAM
76e0dc58-76e0e017 : System RAM
76e0e018-76e18057 : System RAM
76e18058-76e19017 : System RAM
76e19018-76e21057 : System RAM
76e21058-76e22017 : System RAM
76e22018-76e7aa57 : System RAM
76e7aa58-76e7b017 : System RAM
76e7b018-76ed3a57 : System RAM
76ed3a58-76ed4017 : System RAM
76ed4018-76ef0457 : System RAM
76ef0458-784fefff : System RAM
784ff000-791fefff : reserved
791c9004-791c902f : APEI ERST
791ca000-791d9fff : APEI ERST
791ff000-7b5fefff : ACPI Non-volatile Storage
7b5ff000-7b7fefff : ACPI Tables
7b7ff000-7b7fffff : System RAM
7b800000-7bffffff : RAM buffer
80000000-8fffffff : PCI MMCONFIG 0000 [bus 00-ff]
80000000-8fffffff : reserved
90000000-c7ffbfff : PCI Bus 0000:00
90000000-92afffff : PCI Bus 0000:01
90000000-9000ffff : 0000:01:00.2
91000000-91ffffff : 0000:01:00.1
91000000-91ffffff : mgadrmfb_vram
92000000-927fffff : 0000:01:00.1
92800000-928fffff : 0000:01:00.2
92800000-928fffff : hpilo
92900000-929fffff : 0000:01:00.2
92900000-929fffff : hpilo
92a00000-92a7ffff : 0000:01:00.2
92a00000-92a7ffff : hpilo
92a80000-92a87fff : 0000:01:00.2
92a80000-92a87fff : hpilo
92a88000-92a8bfff : 0000:01:00.1
92a88000-92a8bfff : mgadrmfb_mmio
92a8c000-92a8c0ff : 0000:01:00.2
92a8c000-92a8c0ff : hpilo
92a8d000-92a8d1ff : 0000:01:00.0
92b00000-92bfffff : PCI Bus 0000:02
92b00000-92b3ffff : 0000:02:00.0
92b40000-92b7ffff : 0000:02:00.1
92b80000-92bbffff : 0000:02:00.2
92bc0000-92bfffff : 0000:02:00.3
93000000-950fffff : PCI Bus 0000:04
93000000-937fffff : 0000:04:00.0
93000000-937fffff : bnx2x
93800000-93ffffff : 0000:04:00.0
93800000-93ffffff : bnx2x
94000000-947fffff : 0000:04:00.1
94000000-947fffff : bnx2x
94800000-94ffffff : 0000:04:00.1
94800000-94ffffff : bnx2x
95000000-9500ffff : 0000:04:00.0
95000000-9500ffff : bnx2x
95010000-9501ffff : 0000:04:00.1
95010000-9501ffff : bnx2x
95080000-950fffff : 0000:04:00.0
95100000-951fffff : PCI Bus 0000:02
95100000-9510ffff : 0000:02:00.3
95100000-9510ffff : tg3
95110000-9511ffff : 0000:02:00.3
95110000-9511ffff : tg3
95120000-9512ffff : 0000:02:00.3
95120000-9512ffff : tg3
95130000-9513ffff : 0000:02:00.2
95130000-9513ffff : tg3
95140000-9514ffff : 0000:02:00.2
95140000-9514ffff : tg3
95150000-9515ffff : 0000:02:00.2
95150000-9515ffff : tg3
95160000-9516ffff : 0000:02:00.1
95160000-9516ffff : tg3
95170000-9517ffff : 0000:02:00.1
95170000-9517ffff : tg3
95180000-9518ffff : 0000:02:00.1
95180000-9518ffff : tg3
95190000-9519ffff : 0000:02:00.0
95190000-9519ffff : tg3
951a0000-951affff : 0000:02:00.0
951a0000-951affff : tg3
951b0000-951bffff : 0000:02:00.0
951b0000-951bffff : tg3
95200000-953fffff : PCI Bus 0000:03
95200000-952fffff : 0000:03:00.0
95200000-952fffff : hpsa
95300000-953003ff : 0000:03:00.0
95300000-953003ff : hpsa
95380000-953fffff : 0000:03:00.0
95400000-954007ff : 0000:00:1f.2
95400000-954007ff : ahci
95401000-954013ff : 0000:00:1d.0
95401000-954013ff : ehci_hcd
95402000-954023ff : 0000:00:1a.0
95402000-954023ff : ehci_hcd
95404000-95404fff : 0000:00:05.4
c7ffc000-c7ffcfff : dmar1
c8000000-fbffbfff : PCI Bus 0000:80
c8000000-c8000fff : 0000:80:05.4
fbffc000-fbffcfff : dmar0
fec00000-fecfffff : PNP0003:00
fec00000-fec003ff : IOAPIC 0
fec01000-fec013ff : IOAPIC 1
fec40000-fec403ff : IOAPIC 2
fed00000-fed003ff : HPET 0
fed00000-fed003ff : PNP0103:00
fed12000-fed1200f : pnp 00:01
fed12010-fed1201f : pnp 00:01
fed1b000-fed1bfff : pnp 00:01
fed1c000-fed3ffff : pnp 00:01
fed1f410-fed1f414 : iTCO_wdt.0.auto
fed45000-fed8bfff : pnp 00:01
fee00000-feefffff : pnp 00:01
fee00000-fee00fff : Local APIC
ff000000-ffffffff : pnp 00:01
100000000-47fffffff : System RAM
480000000-87fffffff : Persistent Memory
480000000-67fffffff : NVDM002C:00
480000000-67fffffff : btt0.1
680000000-87fffffff : NVDM002C:01
680000000-87fffffff : namespace1.0
880000000-c7fffffff : System RAM
c80000000-107fffffff : Persistent Memory
c80000000-e7fffffff : NVDM002C:02
c80000000-e7fffffff : namespace2.0
e80000000-107fffffff : NVDM002C:03
e80000000-107fffffff : btt3.1
38000000000-39fffffffff : PCI Bus 0000:00
39fffd00000-39fffefffff : PCI Bus 0000:04
39fffd00000-39fffd7ffff : 0000:04:00.1
39fffd80000-39fffdfffff : 0000:04:00.0
39fffe00000-39fffe1ffff : 0000:04:00.1
39fffe20000-39fffe3ffff : 0000:04:00.0
39ffff00000-39ffff0ffff : 0000:00:14.0
39ffff00000-39ffff0ffff : xhci-hcd
39ffff10000-39ffff13fff : 0000:00:04.7
39ffff10000-39ffff13fff : ioatdma
39ffff14000-39ffff17fff : 0000:00:04.6
39ffff14000-39ffff17fff : ioatdma
39ffff18000-39ffff1bfff : 0000:00:04.5
39ffff18000-39ffff1bfff : ioatdma
39ffff1c000-39ffff1ffff : 0000:00:04.4
39ffff1c000-39ffff1ffff : ioatdma
39ffff20000-39ffff23fff : 0000:00:04.3
39ffff20000-39ffff23fff : ioatdma
39ffff24000-39ffff27fff : 0000:00:04.2
39ffff24000-39ffff27fff : ioatdma
39ffff28000-39ffff2bfff : 0000:00:04.1
39ffff28000-39ffff2bfff : ioatdma
39ffff2c000-39ffff2ffff : 0000:00:04.0
39ffff2c000-39ffff2ffff : ioatdma
39ffff31000-39ffff310ff : 0000:00:1f.3
3a000000000-3bfffffffff : PCI Bus 0000:80
3bffff00000-3bffff03fff : 0000:80:04.7
3bffff00000-3bffff03fff : ioatdma
3bffff04000-3bffff07fff : 0000:80:04.6
3bffff04000-3bffff07fff : ioatdma
3bffff08000-3bffff0bfff : 0000:80:04.5
3bffff08000-3bffff0bfff : ioatdma
3bffff0c000-3bffff0ffff : 0000:80:04.4
3bffff0c000-3bffff0ffff : ioatdma
3bffff10000-3bffff13fff : 0000:80:04.3
3bffff10000-3bffff13fff : ioatdma
3bffff14000-3bffff17fff : 0000:80:04.2
3bffff14000-3bffff17fff : ioatdma
3bffff18000-3bffff1bfff : 0000:80:04.1
3bffff18000-3bffff1bfff : ioatdma
3bffff1c000-3bffff1ffff : 0000:80:04.0
3bffff1c000-3bffff1ffff : ioatdma

2015-11-18 18:32:46

by Jeff Moyer

Subject: Re: dax pmd fault handler never returns to userspace

Ross Zwisler <[email protected]> writes:

> Yea, my first round of testing was broken, sorry about that.
>
> It looks like this test causes the PMD fault handler to be called repeatedly
> over and over until you kill the userspace process. This doesn't happen for
> XFS because when using XFS this test doesn't hit PMD faults, only PTE faults.

Hmm, I wonder why not? Sounds like that will need investigating as
well, right?

-Jeff

> So, looks like a livelock as far as I can tell.
>
> Still debugging.

Thanks!
Jeff

2015-11-18 18:53:26

by Ross Zwisler

Subject: Re: dax pmd fault handler never returns to userspace

On Wed, Nov 18, 2015 at 01:32:46PM -0500, Jeff Moyer wrote:
> Ross Zwisler <[email protected]> writes:
>
> > Yea, my first round of testing was broken, sorry about that.
> >
> > It looks like this test causes the PMD fault handler to be called repeatedly
> > over and over until you kill the userspace process. This doesn't happen for
> > XFS because when using XFS this test doesn't hit PMD faults, only PTE faults.
>
> Hmm, I wonder why not?

Well, whether or not you get PMDs is dependent on the block allocator for the
filesystem. We ask the FS how much space is contiguous via get_blocks(), and
if it's less than PMD_SIZE (2 MiB) we fall back to the regular 4k page fault
path. This code all lives in __dax_pmd_fault(). There are also a bunch of
other reasons why we'd fall back to 4k faults - the virtual address isn't 2
MiB aligned, etc. It's actually pretty hard to get everything right so you
actually get PMD faults.
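
To make that concrete, the checks boil down to roughly the following (a
simplified userspace sketch, not the actual fs/dax.c code):

#include <stdbool.h>
#include <stdint.h>

#define PMD_SIZE	(2ULL << 20)	/* 2 MiB */
#define PMD_MASK	(~(PMD_SIZE - 1))

/*
 * Roughly the tests __dax_pmd_fault() has to pass before it will try a
 * 2 MiB mapping; if any of them fail the handler returns VM_FAULT_FALLBACK
 * and the fault is retried as an ordinary 4k PTE fault.
 */
bool pmd_fault_possible(uint64_t fault_vaddr, uint64_t vma_start,
			uint64_t vma_end, uint64_t block_paddr,
			uint64_t block_len)
{
	uint64_t pmd_vaddr = fault_vaddr & PMD_MASK;

	if (pmd_vaddr < vma_start || pmd_vaddr + PMD_SIZE > vma_end)
		return false;	/* 2 MiB virtual range not covered by the vma */
	if (block_paddr & (PMD_SIZE - 1))
		return false;	/* block returned by the fs is not 2 MiB aligned */
	if (block_len < PMD_SIZE)
		return false;	/* get_blocks() gave back less than 2 MiB contiguous */
	return true;
}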

Anyway, my guess is that we're failing to meet one of our criteria in XFS, so
we just always fall back to PTEs for this test.

> Sounds like that will need investigating as well, right?

Yep, on it.

2015-11-18 18:58:29

by Dan Williams

Subject: Re: dax pmd fault handler never returns to userspace

On Wed, Nov 18, 2015 at 10:53 AM, Ross Zwisler
<[email protected]> wrote:
> On Wed, Nov 18, 2015 at 01:32:46PM -0500, Jeff Moyer wrote:
>> Ross Zwisler <[email protected]> writes:
>>
>> > Yea, my first round of testing was broken, sorry about that.
>> >
>> > It looks like this test causes the PMD fault handler to be called repeatedly
>> > over and over until you kill the userspace process. This doesn't happen for
>> > XFS because when using XFS this test doesn't hit PMD faults, only PTE faults.
>>
>> Hmm, I wonder why not?
>
> Well, whether or not you get PMDs is dependent on the block allocator for the
> filesystem. We ask the FS how much space is contiguous via get_blocks(), and
> if it's less than PMD_SIZE (2 MiB) we fall back to the regular 4k page fault
> path. This code all lives in __dax_pmd_fault(). There are also a bunch of
> other reasons why we'd fall back to 4k faults - the virtual address isn't 2
> MiB aligned, etc. It's actually pretty hard to get everything right so you
> actually get PMD faults.
>
> Anyway, my guess is that we're failing to meet one of our criteria in XFS, so
> we just always fall back to PTEs for this test.
>
>> Sounds like that will need investigating as well, right?
>
> Yep, on it.

XFS can do pmd faults just fine, you just need to use fiemap to find a
2MiB aligned physical offset. See the ndctl pmd test I posted.
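
For anyone checking by hand, something like the following reads the file's first
extent through the FIEMAP ioctl and reports whether its physical offset is 2 MiB
aligned (a quick standalone sketch, minimal error handling):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	struct fiemap *fm;
	unsigned long long phys;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* header plus room for a single extent record */
	fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
	fm->fm_length = ~0ULL;		/* map the whole file */
	fm->fm_extent_count = 1;	/* only the first extent matters here */

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}
	if (fm->fm_mapped_extents == 0) {
		fprintf(stderr, "no extents reported\n");
		return 1;
	}

	phys = fm->fm_extents[0].fe_physical;
	printf("first extent at %llu, 2MiB aligned: %s\n",
	       phys, (phys & ((2ULL << 20) - 1)) ? "no" : "yes");

	free(fm);
	close(fd);
	return 0;
}

If the physical offset reported there is not 2 MiB aligned, the fault path will
fall back to PTEs no matter how the mapping itself is set up.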

2015-11-18 21:33:09

by Kani, Toshimitsu

Subject: Re: dax pmd fault handler never returns to userspace

On Wed, 2015-11-18 at 11:23 -0700, Ross Zwisler wrote:
> On Wed, Nov 18, 2015 at 10:10:45AM -0800, Dan Williams wrote:
> > On Wed, Nov 18, 2015 at 9:43 AM, Jeff Moyer <[email protected]> wrote:
> > > Ross Zwisler <[email protected]> writes:
> > >
> > > > On Wed, Nov 18, 2015 at 08:52:59AM -0800, Dan Williams wrote:
> > > > > Sysrq-t or sysrq-w dump? Also do you have the locking fix from Yigal?
> > > > >
> > > > > https://lists.01.org/pipermail/linux-nvdimm/2015-November/002842.html
> > > >
> > > > I was able to reproduce the issue in my setup with v4.3, and the patch
> > > > from
> > > > Yigal seems to solve it. Jeff, can you confirm?
> > >
> > > I applied the patch from Yigal and the symptoms persist. Ross, what are
> > > you testing on? I'm using an NVDIMM-N.
> > >
> > > Dan, here's sysrq-l (which is what w used to look like, I think). Only
> > > cpu 3 is interesting:
> > >
> > > [ 825.339264] NMI backtrace for cpu 3
> > > [ 825.356347] CPU: 3 PID: 13555 Comm: blk_non_zero.st Not tainted 4.4.0
> > > -rc1+ #17
> > > [ 825.392056] Hardware name: HP ProLiant DL380 Gen9, BIOS P89 06/09/2015
> > > [ 825.424472] task: ffff880465bf6a40 ti: ffff88046133c000 task.ti:
> > > ffff88046133c000
> > > [ 825.461480] RIP: 0010:[<ffffffff81329856>] [<ffffffff81329856>]
> > > strcmp+0x6/0x30
> > > [ 825.497916] RSP: 0000:ffff88046133fbc8 EFLAGS: 00000246
> > > [ 825.524836] RAX: 0000000000000000 RBX: ffff880c7fffd7c0 RCX:
> > > 000000076c800000
> > > [ 825.566847] RDX: 000000076c800fff RSI: ffffffff818ea1c8 RDI:
> > > ffffffff818ea1c8
> > > [ 825.605265] RBP: ffff88046133fbc8 R08: 0000000000000001 R09:
> > > ffff8804652300c0
> > > [ 825.643628] R10: 00007f1b4fe0b000 R11: ffff880465230228 R12:
> > > ffffffff818ea1bd
> > > [ 825.681381] R13: 0000000000000001 R14: ffff88046133fc20 R15:
> > > 0000000080000200
> > > [ 825.718607] FS: 00007f1b5102d880(0000) GS:ffff88046f8c0000(0000)
> > > knlGS:00000000000000
> > > 00
> > > [ 825.761663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [ 825.792213] CR2: 00007f1b4fe0b000 CR3: 000000046b225000 CR4:
> > > 00000000001406e0
> > > [ 825.830906] Stack:
> > > [ 825.841235] ffff88046133fc10 ffffffff81084610 000000076c800000
> > > 000000076c800fff
> > > [ 825.879533] 000000076c800fff 00000000ffffffff ffff88046133fc90
> > > ffffffff8106d1d0
> > > [ 825.916774] 000000000000000c ffff88046133fc80 ffffffff81084f0d
> > > 000000076c800000
> > > [ 825.953220] Call Trace:
> > > [ 825.965386] [<ffffffff81084610>] find_next_iomem_res+0xd0/0x130
> > > [ 825.996804] [<ffffffff8106d1d0>] ? pat_enabled+0x20/0x20
> > > [ 826.024773] [<ffffffff81084f0d>] walk_system_ram_range+0x8d/0xf0
> > > [ 826.055565] [<ffffffff8106d2d8>] pat_pagerange_is_ram+0x78/0xa0
> > > [ 826.088971] [<ffffffff8106d475>] lookup_memtype+0x35/0xc0
> > > [ 826.121385] [<ffffffff8106e33b>] track_pfn_insert+0x2b/0x60
> > > [ 826.154600] [<ffffffff811e5523>] vmf_insert_pfn_pmd+0xb3/0x210
> > > [ 826.187992] [<ffffffff8124acab>] __dax_pmd_fault+0x3cb/0x610
> > > [ 826.221337] [<ffffffffa0769910>] ? ext4_dax_mkwrite+0x20/0x20 [ext4]
> > > [ 826.259190] [<ffffffffa0769a4d>] ext4_dax_pmd_fault+0xcd/0x100 [ext4]
> > > [ 826.293414] [<ffffffff811b0af7>] handle_mm_fault+0x3b7/0x510
> > > [ 826.323763] [<ffffffff81068f98>] __do_page_fault+0x188/0x3f0
> > > [ 826.358186] [<ffffffff81069230>] do_page_fault+0x30/0x80
> > > [ 826.391212] [<ffffffff8169c148>] page_fault+0x28/0x30
> > > [ 826.420752] Code: 89 e5 74 09 48 83 c2 01 80 3a 00 75 f7 48 83 c6 01 0f
> > > b6 4e ff 48 83
> > > c2 01 84 c9 88 4a ff 75 ed 5d c3 0f 1f 00 55 48 89 e5 eb 04 <84> c0 74 18
> > > 48 83 c7 01 0f
> > > b6 47 ff 48 83 c6 01 3a 46 ff 74 eb
> >
> > Hmm, a loop in the resource sibling list?
> >
> > What does /proc/iomem say?
> >
> > Not related to this bug, but lookup_memtype() looks broken for pmd
> > mappings as we only check for PAGE_SIZE instead of HPAGE_SIZE. Which
> > will cause problems if we're straddling the end of memory.
> >
> > > The full output is large (48 cpus), so I'm going to be lazy and not
> > > cut-n-paste it here.
> >
> > Thanks for that ;-)
>
> Yea, my first round of testing was broken, sorry about that.
>
> It looks like this test causes the PMD fault handler to be called repeatedly
> over and over until you kill the userspace process. This doesn't happen for
> XFS because when using XFS this test doesn't hit PMD faults, only PTE faults.
>
> So, looks like a livelock as far as I can tell.
>
> Still debugging.

I am seeing a similar/same problem in my test. I think the problem is that in
case of a WP fault, wp_huge_pmd() -> __dax_pmd_fault() -> vmf_insert_pfn_pmd(),
which is a no-op since the PMD is mapped already. We need WP handling for this
PMD map.
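
The dispatch in question looks approximately like this (paraphrased from the
4.4-era mm/memory.c; shown here only to make the loop explicit):

/*
 * Approximate shape of wp_huge_pmd(): for a file-backed vma a write-protect
 * fault on an existing huge PMD is simply handed back to the same
 * ->pmd_fault handler, now with FAULT_FLAG_WRITE set.  For DAX that ends up
 * in vmf_insert_pfn_pmd(), which bails out because the PMD is already
 * populated, so the read-only entry is never upgraded and the fault repeats.
 */
static int wp_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
		       unsigned long address, pmd_t *pmd, pmd_t orig_pmd,
		       unsigned int flags)
{
	if (vma_is_anonymous(vma))
		return do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd);
	if (vma->vm_ops->pmd_fault)
		return vma->vm_ops->pmd_fault(vma, address, pmd,
					      flags | FAULT_FLAG_WRITE);
	return VM_FAULT_FALLBACK;
}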

If it helps, I have attached change for follow_trans_huge_pmd(). I have not
tested much, though.

Thanks,
-Toshi


Attachments:
follow_pfn_pmd.patch (2.22 kB)

2015-11-18 21:57:24

by Dan Williams

Subject: Re: dax pmd fault handler never returns to userspace

On Wed, Nov 18, 2015 at 1:33 PM, Toshi Kani <[email protected]> wrote:
> I am seeing a similar/same problem in my test. I think the problem is that in
> case of a WP fault, wp_huge_pmd() -> __dax_pmd_fault() -> vmf_insert_pfn_pmd(),
> which is a no-op since the PMD is mapped already. We need WP handling for this
> PMD map.
>
> If it helps, I have attached change for follow_trans_huge_pmd(). I have not
> tested much, though.

Interesting, I didn't get this far because my tests were crashing the
kernel. I'll add this case to the pmd fault test in ndctl.

2015-11-18 22:04:41

by Kani, Toshimitsu

Subject: Re: dax pmd fault handler never returns to userspace

On Wed, 2015-11-18 at 13:57 -0800, Dan Williams wrote:
> On Wed, Nov 18, 2015 at 1:33 PM, Toshi Kani <[email protected]> wrote:
> > I am seeing a similar/same problem in my test. I think the problem is that
> > in
> > case of a WP fault, wp_huge_pmd() -> __dax_pmd_fault() ->
> > vmf_insert_pfn_pmd(),
> > which is a no-op since the PMD is mapped already. We need WP handling for
> > this
> > PMD map.
> >
> > If it helps, I have attached change for follow_trans_huge_pmd(). I have not
> > tested much, though.
>
> Interesting, I didn't get this far because my tests were crashing the
> kernel. I'll add this case to the pmd fault test in ndctl.

I hit this one with mmap(MAP_POPULATE). With this change, I then hit the WP
fault loop when writing to the range.
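
The sequence that trips it is essentially populate-then-write; a minimal sketch
of that pattern (placeholder path, error handling trimmed):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1024UL * 1024 * 1024;
	char *p;
	int fd;

	fd = open("/path/to/dax/file", O_RDWR);		/* placeholder path */
	if (fd < 0)
		return 1;

	/*
	 * MAP_POPULATE pre-faults the range; for a shared file mapping the
	 * populate pass is a read fault, so the huge PMD goes in read-only.
	 */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_POPULATE, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/* the first store then takes a write-protect fault on that PMD */
	memset(p, 0xff, 4096);

	munmap(p, len);
	close(fd);
	return 0;
}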

Thanks,
-Toshi


2015-11-19 00:36:24

by Ross Zwisler

Subject: Re: dax pmd fault handler never returns to userspace

On Wed, Nov 18, 2015 at 03:04:41PM -0700, Toshi Kani wrote:
> On Wed, 2015-11-18 at 13:57 -0800, Dan Williams wrote:
> > On Wed, Nov 18, 2015 at 1:33 PM, Toshi Kani <[email protected]> wrote:
> > > I am seeing a similar/same problem in my test. I think the problem is that
> > > in
> > > case of a WP fault, wp_huge_pmd() -> __dax_pmd_fault() ->
> > > vmf_insert_pfn_pmd(),
> > > which is a no-op since the PMD is mapped already. We need WP handling for
> > > this
> > > PMD map.
> > >
> > > If it helps, I have attached change for follow_trans_huge_pmd(). I have not
> > > tested much, though.
> >
> > Interesting, I didn't get this far because my tests were crashing the
> > kernel. I'll add this case to the pmd fault test in ndctl.
>
> I hit this one with mmap(MAP_POPULATE). With this change, I then hit the WP
> fault loop when writing to the range.

Here's a fix - please let me know if this seems incomplete or incorrect for
some reason.

-- >8 --
From 02aa9f37d7ec9c0c38413f7e304b2577eb9f974a Mon Sep 17 00:00:00 2001
From: Ross Zwisler <[email protected]>
Date: Wed, 18 Nov 2015 17:15:09 -0700
Subject: [PATCH] mm: Allow DAX PMD mappings to become writeable

Prior to this change DAX PMD mappings that were made read-only were never able
to be made writable again. This is because the code in insert_pfn_pmd() that
calls pmd_mkdirty() and pmd_mkwrite() would skip these calls if the PMD
already existed in the page table.

Instead, if we are doing a write always mark the PMD entry as dirty and
writeable. Without this code we can get into a condition where we mark the
PMD as read-only, and then on a subsequent write fault we get into an infinite
loop of PMD faults where we try unsuccessfully to make the PMD writeable.

Signed-off-by: Ross Zwisler <[email protected]>
---
mm/huge_memory.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bbac913..1b3df56 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -877,15 +877,13 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	spinlock_t *ptl;
 
 	ptl = pmd_lock(mm, pmd);
-	if (pmd_none(*pmd)) {
-		entry = pmd_mkhuge(pfn_pmd(pfn, prot));
-		if (write) {
-			entry = pmd_mkyoung(pmd_mkdirty(entry));
-			entry = maybe_pmd_mkwrite(entry, vma);
-		}
-		set_pmd_at(mm, addr, pmd, entry);
-		update_mmu_cache_pmd(vma, addr, pmd);
+	entry = pmd_mkhuge(pfn_pmd(pfn, prot));
+	if (write) {
+		entry = pmd_mkyoung(pmd_mkdirty(entry));
+		entry = maybe_pmd_mkwrite(entry, vma);
 	}
+	set_pmd_at(mm, addr, pmd, entry);
+	update_mmu_cache_pmd(vma, addr, pmd);
 	spin_unlock(ptl);
 }

--
2.6.3


2015-11-19 00:39:29

by Dan Williams

Subject: Re: dax pmd fault handler never returns to userspace

On Wed, Nov 18, 2015 at 4:36 PM, Ross Zwisler
<[email protected]> wrote:
> On Wed, Nov 18, 2015 at 03:04:41PM -0700, Toshi Kani wrote:
>> On Wed, 2015-11-18 at 13:57 -0800, Dan Williams wrote:
>> > On Wed, Nov 18, 2015 at 1:33 PM, Toshi Kani <[email protected]> wrote:
>> > > I am seeing a similar/same problem in my test. I think the problem is that
>> > > in
>> > > case of a WP fault, wp_huge_pmd() -> __dax_pmd_fault() ->
>> > > vmf_insert_pfn_pmd(),
>> > > which is a no-op since the PMD is mapped already. We need WP handling for
>> > > this
>> > > PMD map.
>> > >
>> > > If it helps, I have attached change for follow_trans_huge_pmd(). I have not
>> > > tested much, though.
>> >
>> > Interesting, I didn't get this far because my tests were crashing the
>> > kernel. I'll add this case to the pmd fault test in ndctl.
>>
>> I hit this one with mmap(MAP_POPULATE). With this change, I then hit the WP
>> fault loop when writing to the range.
>
> Here's a fix - please let me know if this seems incomplete or incorrect for
> some reason.
>
> -- >8 --
> From 02aa9f37d7ec9c0c38413f7e304b2577eb9f974a Mon Sep 17 00:00:00 2001
> From: Ross Zwisler <[email protected]>
> Date: Wed, 18 Nov 2015 17:15:09 -0700
> Subject: [PATCH] mm: Allow DAX PMD mappings to become writeable
>
> Prior to this change DAX PMD mappings that were made read-only were never able
> to be made writable again. This is because the code in insert_pfn_pmd() that
> calls pmd_mkdirty() and pmd_mkwrite() would skip these calls if the PMD
> already existed in the page table.
>
> Instead, if we are doing a write always mark the PMD entry as dirty and
> writeable. Without this code we can get into a condition where we mark the
> PMD as read-only, and then on a subsequent write fault we get into an infinite
> loop of PMD faults where we try unsuccessfully to make the PMD writeable.
>
> Signed-off-by: Ross Zwisler <[email protected]>
> ---
> mm/huge_memory.c | 14 ++++++--------
> 1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bbac913..1b3df56 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -877,15 +877,13 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>  	spinlock_t *ptl;
> 
>  	ptl = pmd_lock(mm, pmd);
> -	if (pmd_none(*pmd)) {
> -		entry = pmd_mkhuge(pfn_pmd(pfn, prot));
> -		if (write) {
> -			entry = pmd_mkyoung(pmd_mkdirty(entry));
> -			entry = maybe_pmd_mkwrite(entry, vma);
> -		}
> -		set_pmd_at(mm, addr, pmd, entry);
> -		update_mmu_cache_pmd(vma, addr, pmd);
> +	entry = pmd_mkhuge(pfn_pmd(pfn, prot));
> +	if (write) {
> +		entry = pmd_mkyoung(pmd_mkdirty(entry));
> +		entry = maybe_pmd_mkwrite(entry, vma);
>  	}
> +	set_pmd_at(mm, addr, pmd, entry);
> +	update_mmu_cache_pmd(vma, addr, pmd);
>  	spin_unlock(ptl);
>  }

Looks good to me.

2015-11-19 01:05:09

by Kani, Toshimitsu

Subject: Re: dax pmd fault handler never returns to userspace

On Wed, 2015-11-18 at 17:36 -0700, Ross Zwisler wrote:
> On Wed, Nov 18, 2015 at 03:04:41PM -0700, Toshi Kani wrote:
> > On Wed, 2015-11-18 at 13:57 -0800, Dan Williams wrote:
> > > On Wed, Nov 18, 2015 at 1:33 PM, Toshi Kani <[email protected]> wrote:
> > > > I am seeing a similar/same problem in my test. I think the problem is
> > > > that in case of a WP fault, wp_huge_pmd() -> __dax_pmd_fault() ->
> > > > vmf_insert_pfn_pmd(), which is a no-op since the PMD is mapped already.
> > > > We need WP handling for this PMD map.
> > > >
> > > > If it helps, I have attached change for follow_trans_huge_pmd(). I have
> > > > not tested much, though.
> > >
> > > Interesting, I didn't get this far because my tests were crashing the
> > > kernel. I'll add this case to the pmd fault test in ndctl.
> >
> > I hit this one with mmap(MAP_POPULATE). With this change, I then hit the WP
> > fault loop when writing to the range.
>
> Here's a fix - please let me know if this seems incomplete or incorrect for
> some reason.

My test looks to be working now. :-) I will do more testing and submit the gup patch
as well.

Thanks,
-Toshi

2015-11-19 01:19:58

by Dan Williams

Subject: Re: dax pmd fault handler never returns to userspace

On Wed, Nov 18, 2015 at 4:36 PM, Ross Zwisler
<[email protected]> wrote:
> On Wed, Nov 18, 2015 at 03:04:41PM -0700, Toshi Kani wrote:
>> On Wed, 2015-11-18 at 13:57 -0800, Dan Williams wrote:
>> > On Wed, Nov 18, 2015 at 1:33 PM, Toshi Kani <[email protected]> wrote:
>> > > I am seeing a similar/same problem in my test. I think the problem is that
>> > > in
>> > > case of a WP fault, wp_huge_pmd() -> __dax_pmd_fault() ->
>> > > vmf_insert_pfn_pmd(),
>> > > which is a no-op since the PMD is mapped already. We need WP handling for
>> > > this
>> > > PMD map.
>> > >
>> > > If it helps, I have attached change for follow_trans_huge_pmd(). I have not
>> > > tested much, though.
>> >
>> > Interesting, I didn't get this far because my tests were crashing the
>> > kernel. I'll add this case to the pmd fault test in ndctl.
>>
>> I hit this one with mmap(MAP_POPULATE). With this change, I then hit the WP
>> fault loop when writing to the range.
>
> Here's a fix - please let me know if this seems incomplete or incorrect for
> some reason.
>
> -- >8 --
> From 02aa9f37d7ec9c0c38413f7e304b2577eb9f974a Mon Sep 17 00:00:00 2001
> From: Ross Zwisler <[email protected]>
> Date: Wed, 18 Nov 2015 17:15:09 -0700
> Subject: [PATCH] mm: Allow DAX PMD mappings to become writeable
>
> Prior to this change DAX PMD mappings that were made read-only were never able
> to be made writable again. This is because the code in insert_pfn_pmd() that
> calls pmd_mkdirty() and pmd_mkwrite() would skip these calls if the PMD
> already existed in the page table.
>
> Instead, if we are doing a write always mark the PMD entry as dirty and
> writeable. Without this code we can get into a condition where we mark the
> PMD as read-only, and then on a subsequent write fault we get into an infinite
> loop of PMD faults where we try unsuccessfully to make the PMD writeable.
>
> Signed-off-by: Ross Zwisler <[email protected]>
> ---
> mm/huge_memory.c | 14 ++++++--------
> 1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bbac913..1b3df56 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -877,15 +877,13 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>  	spinlock_t *ptl;
> 
>  	ptl = pmd_lock(mm, pmd);
> -	if (pmd_none(*pmd)) {
> -		entry = pmd_mkhuge(pfn_pmd(pfn, prot));
> -		if (write) {
> -			entry = pmd_mkyoung(pmd_mkdirty(entry));
> -			entry = maybe_pmd_mkwrite(entry, vma);
> -		}
> -		set_pmd_at(mm, addr, pmd, entry);
> -		update_mmu_cache_pmd(vma, addr, pmd);
> +	entry = pmd_mkhuge(pfn_pmd(pfn, prot));
> +	if (write) {
> +		entry = pmd_mkyoung(pmd_mkdirty(entry));
> +		entry = maybe_pmd_mkwrite(entry, vma);
>  	}
> +	set_pmd_at(mm, addr, pmd, entry);
> +	update_mmu_cache_pmd(vma, addr, pmd);
>  	spin_unlock(ptl);

Hmm, other paths that do pmd_mkwrite are using pmdp_set_access_flags()
and it's not immediately clear to me why.
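
(The pattern referred to is roughly what the anonymous THP write-protect path
does; approximate excerpt, not verbatim, with haddr being the pmd-aligned
address:)

	/* Approximately what do_huge_pmd_wp_page() does when it can reuse
	 * the page: upgrade the existing entry in place and only touch the
	 * page table (and the MMU cache) if the entry actually changed. */
	pmd_t entry;

	entry = pmd_mkyoung(orig_pmd);
	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
	if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
		update_mmu_cache_pmd(vma, address, pmd);

Presumably the attraction is that pmdp_set_access_flags() only rewrites the
entry when it differs from what is already installed and leaves any flushing
decision to the architecture helper, rather than unconditionally doing a
set_pmd_at().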

2015-11-19 22:34:58

by Dave Chinner

Subject: Re: dax pmd fault handler never returns to userspace

On Wed, Nov 18, 2015 at 10:58:29AM -0800, Dan Williams wrote:
> On Wed, Nov 18, 2015 at 10:53 AM, Ross Zwisler
> <[email protected]> wrote:
> > On Wed, Nov 18, 2015 at 01:32:46PM -0500, Jeff Moyer wrote:
> >> Ross Zwisler <[email protected]> writes:
> >>
> >> > Yea, my first round of testing was broken, sorry about that.
> >> >
> >> > It looks like this test causes the PMD fault handler to be called repeatedly
> >> > over and over until you kill the userspace process. This doesn't happen for
> >> > XFS because when using XFS this test doesn't hit PMD faults, only PTE faults.
> >>
> >> Hmm, I wonder why not?
> >
> > Well, whether or not you get PMDs is dependent on the block allocator for the
> > filesystem. We ask the FS how much space is contiguous via get_blocks(), and
> > if it's less than PMD_SIZE (2 MiB) we fall back to the regular 4k page fault
> > path. This code all lives in __dax_pmd_fault(). There are also a bunch of
> > other reasons why we'd fall back to 4k faults - the virtual address isn't 2
> > MiB aligned, etc. It's actually pretty hard to get everything right so you
> > actually get PMD faults.
> >
> > Anyway, my guess is that we're failing to meet one of our criteria in XFS, so
> > we just always fall back to PTEs for this test.
> >
> >> Sounds like that will need investigating as well, right?
> >
> > Yep, on it.
>
> XFS can do pmd faults just fine, you just need to use fiemap to find a
> 2MiB aligned physical offset. See the ndctl pmd test I posted.

This comes under the topic of "XFS and Storage Alignment 101".
There's nothing new here and it's just like aligning your filesystem
to RAID5/6 geometries for optimal sequential IO patterns:

# mkfs.xfs -f -d su=2m,sw=1 /dev/pmem0
....
# mount /dev/pmem0 /mnt/xfs
# xfs_io -c "extsize 2m" /mnt/xfs

And now XFS will allocate stripe unit (2MB) aligned extents of 2MB
in all files created in that filesystem. Now all you have to care
about is correctly aligning the base address of /dev/pmem0 to 2MB so
that all the stripe units (and hence file extent allocations) are
correctly aligned to the page tables.

Cheers,

Dave.
--
Dave Chinner
[email protected]