2021-08-04 08:10:40

by Li Zhijian

[permalink] [raw]
Subject: Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d

convert to text and send again

2021/8/4 15:55, Li, Zhijian wrote:
>
> Hey all:
>
> Recently, i reported a issue to rpmahttps://github.com/pmem/rpma/issues/1142
> where we found that the native rpma + fsdax example failed in recent kernel.
>
> Below is the bisect log
>
> [lizhijian@yl linux]$ git bisect log
> git bisect start
> # good: [bbf5c979011a099af5dc76498918ed7df445635b] Linux 5.9
> git bisect good bbf5c979011a099af5dc76498918ed7df445635b
> # bad: [2c85ebc57b3e1817b6ce1a6b703928e113a90442] Linux 5.10
> git bisect bad 2c85ebc57b3e1817b6ce1a6b703928e113a90442
> # good: [4d0e9df5e43dba52d38b251e3b909df8fa1110be] lib, uaccess: add failure injection to usercopy functions
> git bisect good 4d0e9df5e43dba52d38b251e3b909df8fa1110be
> # bad: [6694875ef8045cdb1e6712ee9b68fe08763507d8] ext4: indicate that fast_commit is available via /sys/fs/ext4/feature/...
> git bisect bad 6694875ef8045cdb1e6712ee9b68fe08763507d8
> # good: [14c914fcb515c424177bb6848cc2858ebfe717a8] Merge tag 'wireless-drivers-next-2020-10-02' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
> git bisect good 14c914fcb515c424177bb6848cc2858ebfe717a8
> # good: [6f78b9acf04fbf9ede7f4265e7282f9fb39d2c8c] Merge tag 'mtd/for-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
> git bisect good 6f78b9acf04fbf9ede7f4265e7282f9fb39d2c8c
> # bad: [bbe85027ce8019c73ab99ad1c2603e2dcd1afa49] Merge tag 'xfs-5.10-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
> git bisect bad bbe85027ce8019c73ab99ad1c2603e2dcd1afa49
> # bad: [9d9af1007bc08971953ae915d88dc9bb21344b53] Merge tag 'perf-tools-for-v5.10-2020-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
> git bisect bad 9d9af1007bc08971953ae915d88dc9bb21344b53
> # good: [21c2fe94abb2abe894e6aabe6b4e84a255c8d339] RDMA/mthca: Combine special QP struct with mthca QP
> git bisect good 21c2fe94abb2abe894e6aabe6b4e84a255c8d339
> # good: [dbaa1b3d9afba3c050d365245a36616ae3f425a7] Merge branch 'perf/urgent' into perf/core
> git bisect good dbaa1b3d9afba3c050d365245a36616ae3f425a7
> # bad: [c7a198c700763ac89abbb166378f546aeb9afb33] RDMA/ucma: Fix use after free in destroy id flow
> git bisect bad c7a198c700763ac89abbb166378f546aeb9afb33
> # bad: [5ce2dced8e95e76ff7439863a118a053a7fc6f91] RDMA/ipoib: Set rtnl_link_ops for ipoib interfaces
> git bisect bad 5ce2dced8e95e76ff7439863a118a053a7fc6f91
> # bad: [a03bfc37d59de316436c46f5691c5a972ed57c82] RDMA/mlx5: Sync device with CPU pages upon ODP MR registration
> git bisect bad a03bfc37d59de316436c46f5691c5a972ed57c82
> # good: [a6f0b08dbaf289c3c57284e16ac8043140f2139b] RDMA/core: Remove ucontext->closing
> git bisect good a6f0b08dbaf289c3c57284e16ac8043140f2139b
> # bad: [36f30e486dce22345c2dd3a3ba439c12cd67f6ba] IB/core: Improve ODP to use hmm_range_fault()
> git bisect bad 36f30e486dce22345c2dd3a3ba439c12cd67f6ba
> # good: [2ee9bf346fbfd1dad0933b9eb3a4c2c0979b633e] RDMA/addr: Fix race with netevent_callback()/rdma_addr_cancel()
> git bisect good 2ee9bf346fbfd1dad0933b9eb3a4c2c0979b633e
> # first bad commit: [36f30e486dce22345c2dd3a3ba439c12cd67f6ba] IB/core: Improve ODP to use hmm_range_fault()
>
> Note: some commit have to apply a extra patch to avoid a kernel panic.
> > git cherry-pick d4c5da5 # dax: Fix stack overflow when mounting fsdax pmem device
>
>
> Thanks
> Li
>
>




2021-08-06 07:30:15

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d

On Wed, Aug 04, 2021 at 04:06:53PM +0800, Li, Zhijian/李 智坚 wrote:
> convert to text and send again
>
> 2021/8/4 15:55, Li, Zhijian wrote:
> >
> > Hey all:
> >
> > Recently, i reported a issue to rpmahttps://github.com/pmem/rpma/issues/1142
> > where we found that the native rpma + fsdax example failed in recent kernel.
> >
> > Below is the bisect log
> >
> > [lizhijian@yl linux]$ git bisect log
> > git bisect start
> > # good: [bbf5c979011a099af5dc76498918ed7df445635b] Linux 5.9
> > git bisect good bbf5c979011a099af5dc76498918ed7df445635b
> > # bad: [2c85ebc57b3e1817b6ce1a6b703928e113a90442] Linux 5.10
> > git bisect bad 2c85ebc57b3e1817b6ce1a6b703928e113a90442
> > # good: [4d0e9df5e43dba52d38b251e3b909df8fa1110be] lib, uaccess: add failure injection to usercopy functions
> > git bisect good 4d0e9df5e43dba52d38b251e3b909df8fa1110be
> > # bad: [6694875ef8045cdb1e6712ee9b68fe08763507d8] ext4: indicate that fast_commit is available via /sys/fs/ext4/feature/...
> > git bisect bad 6694875ef8045cdb1e6712ee9b68fe08763507d8
> > # good: [14c914fcb515c424177bb6848cc2858ebfe717a8] Merge tag 'wireless-drivers-next-2020-10-02' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
> > git bisect good 14c914fcb515c424177bb6848cc2858ebfe717a8
> > # good: [6f78b9acf04fbf9ede7f4265e7282f9fb39d2c8c] Merge tag 'mtd/for-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
> > git bisect good 6f78b9acf04fbf9ede7f4265e7282f9fb39d2c8c
> > # bad: [bbe85027ce8019c73ab99ad1c2603e2dcd1afa49] Merge tag 'xfs-5.10-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
> > git bisect bad bbe85027ce8019c73ab99ad1c2603e2dcd1afa49
> > # bad: [9d9af1007bc08971953ae915d88dc9bb21344b53] Merge tag 'perf-tools-for-v5.10-2020-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
> > git bisect bad 9d9af1007bc08971953ae915d88dc9bb21344b53
> > # good: [21c2fe94abb2abe894e6aabe6b4e84a255c8d339] RDMA/mthca: Combine special QP struct with mthca QP
> > git bisect good 21c2fe94abb2abe894e6aabe6b4e84a255c8d339
> > # good: [dbaa1b3d9afba3c050d365245a36616ae3f425a7] Merge branch 'perf/urgent' into perf/core
> > git bisect good dbaa1b3d9afba3c050d365245a36616ae3f425a7
> > # bad: [c7a198c700763ac89abbb166378f546aeb9afb33] RDMA/ucma: Fix use after free in destroy id flow
> > git bisect bad c7a198c700763ac89abbb166378f546aeb9afb33
> > # bad: [5ce2dced8e95e76ff7439863a118a053a7fc6f91] RDMA/ipoib: Set rtnl_link_ops for ipoib interfaces
> > git bisect bad 5ce2dced8e95e76ff7439863a118a053a7fc6f91
> > # bad: [a03bfc37d59de316436c46f5691c5a972ed57c82] RDMA/mlx5: Sync device with CPU pages upon ODP MR registration
> > git bisect bad a03bfc37d59de316436c46f5691c5a972ed57c82
> > # good: [a6f0b08dbaf289c3c57284e16ac8043140f2139b] RDMA/core: Remove ucontext->closing
> > git bisect good a6f0b08dbaf289c3c57284e16ac8043140f2139b
> > # bad: [36f30e486dce22345c2dd3a3ba439c12cd67f6ba] IB/core: Improve ODP to use hmm_range_fault()
> > git bisect bad 36f30e486dce22345c2dd3a3ba439c12cd67f6ba
> > # good: [2ee9bf346fbfd1dad0933b9eb3a4c2c0979b633e] RDMA/addr: Fix race with netevent_callback()/rdma_addr_cancel()
> > git bisect good 2ee9bf346fbfd1dad0933b9eb3a4c2c0979b633e
> > # first bad commit: [36f30e486dce22345c2dd3a3ba439c12cd67f6ba] IB/core: Improve ODP to use hmm_range_fault()

This is perhaps not so surprising, but I think you should report it to
the dax people that hmm_range_fault and dax don't work together..

Though I think it is supposed to, and I'm surprised it doesn't.

Jason

2021-08-06 07:38:19

by Li Zhijian

[permalink] [raw]
Subject: Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d

Hi Jason

thank you for your advice.

CCing nvdimm

both ext4 and xfs are impacted.

Thanks


at 2021/8/6 9:45, Jason Gunthorpe wrote:
> On Wed, Aug 04, 2021 at 04:06:53PM +0800, Li, Zhijian/李 智坚 wrote:
>> convert to text and send again
>>
>> 2021/8/4 15:55, Li, Zhijian wrote:
>>> Hey all:
>>>
>>> Recently, i reported a issue to rpmahttps://github.com/pmem/rpma/issues/1142
>>> where we found that the native rpma + fsdax example failed in recent kernel.
>>>
>>> Below is the bisect log
>>>
>>> [lizhijian@yl linux]$ git bisect log
>>> git bisect start
>>> # good: [bbf5c979011a099af5dc76498918ed7df445635b] Linux 5.9
>>> git bisect good bbf5c979011a099af5dc76498918ed7df445635b
>>> # bad: [2c85ebc57b3e1817b6ce1a6b703928e113a90442] Linux 5.10
>>> git bisect bad 2c85ebc57b3e1817b6ce1a6b703928e113a90442
>>> # good: [4d0e9df5e43dba52d38b251e3b909df8fa1110be] lib, uaccess: add failure injection to usercopy functions
>>> git bisect good 4d0e9df5e43dba52d38b251e3b909df8fa1110be
>>> # bad: [6694875ef8045cdb1e6712ee9b68fe08763507d8] ext4: indicate that fast_commit is available via /sys/fs/ext4/feature/...
>>> git bisect bad 6694875ef8045cdb1e6712ee9b68fe08763507d8
>>> # good: [14c914fcb515c424177bb6848cc2858ebfe717a8] Merge tag 'wireless-drivers-next-2020-10-02' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
>>> git bisect good 14c914fcb515c424177bb6848cc2858ebfe717a8
>>> # good: [6f78b9acf04fbf9ede7f4265e7282f9fb39d2c8c] Merge tag 'mtd/for-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
>>> git bisect good 6f78b9acf04fbf9ede7f4265e7282f9fb39d2c8c
>>> # bad: [bbe85027ce8019c73ab99ad1c2603e2dcd1afa49] Merge tag 'xfs-5.10-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
>>> git bisect bad bbe85027ce8019c73ab99ad1c2603e2dcd1afa49
>>> # bad: [9d9af1007bc08971953ae915d88dc9bb21344b53] Merge tag 'perf-tools-for-v5.10-2020-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
>>> git bisect bad 9d9af1007bc08971953ae915d88dc9bb21344b53
>>> # good: [21c2fe94abb2abe894e6aabe6b4e84a255c8d339] RDMA/mthca: Combine special QP struct with mthca QP
>>> git bisect good 21c2fe94abb2abe894e6aabe6b4e84a255c8d339
>>> # good: [dbaa1b3d9afba3c050d365245a36616ae3f425a7] Merge branch 'perf/urgent' into perf/core
>>> git bisect good dbaa1b3d9afba3c050d365245a36616ae3f425a7
>>> # bad: [c7a198c700763ac89abbb166378f546aeb9afb33] RDMA/ucma: Fix use after free in destroy id flow
>>> git bisect bad c7a198c700763ac89abbb166378f546aeb9afb33
>>> # bad: [5ce2dced8e95e76ff7439863a118a053a7fc6f91] RDMA/ipoib: Set rtnl_link_ops for ipoib interfaces
>>> git bisect bad 5ce2dced8e95e76ff7439863a118a053a7fc6f91
>>> # bad: [a03bfc37d59de316436c46f5691c5a972ed57c82] RDMA/mlx5: Sync device with CPU pages upon ODP MR registration
>>> git bisect bad a03bfc37d59de316436c46f5691c5a972ed57c82
>>> # good: [a6f0b08dbaf289c3c57284e16ac8043140f2139b] RDMA/core: Remove ucontext->closing
>>> git bisect good a6f0b08dbaf289c3c57284e16ac8043140f2139b
>>> # bad: [36f30e486dce22345c2dd3a3ba439c12cd67f6ba] IB/core: Improve ODP to use hmm_range_fault()
>>> git bisect bad 36f30e486dce22345c2dd3a3ba439c12cd67f6ba
>>> # good: [2ee9bf346fbfd1dad0933b9eb3a4c2c0979b633e] RDMA/addr: Fix race with netevent_callback()/rdma_addr_cancel()
>>> git bisect good 2ee9bf346fbfd1dad0933b9eb3a4c2c0979b633e
>>> # first bad commit: [36f30e486dce22345c2dd3a3ba439c12cd67f6ba] IB/core: Improve ODP to use hmm_range_fault()
> This is perhaps not so surprising, but I think you should report it to
> the dax people that hmm_range_fault and dax don't work together..
>
> Though I think it is supposed to, and I'm surprised it doesn't.
>
> Jason
>
>


2021-08-27 08:27:08

by Zhijian Li (Fujitsu)

[permalink] [raw]
Subject: Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d

Hi all

I have verified that below changes can solve this problem

diff --git a/mm/hmm.c b/mm/hmm.c
index 24f9ff95f3ae..2c9a3e3eefce 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -294,7 +294,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
         * Since each architecture defines a struct page for the zero page, just
         * fall through and treat it like a normal page.
         */
-       if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {
+       if (!pte_devmap(pte) && pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {
                if (hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0)) {
                        pte_unmap(ptep);
                        return -EFAULT;


i looked over the change-log of hmm_vma_handle_pte(), and found that before
4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling")

hmm_vma_handle_pte() will not check pte_special(pte) if pte_devmap(pte) is true.

when we reached
"if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
the pte have already presented and its pte's flag already fulfilled the request flags.


My question is that
Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
pte_devmap(pte) and pte_special(pte) could be both true in fsdax user case, right ?

if so, what's the expected code path for fsdax case ?


commit 4055062749229101365e5f9e87cb1c5a93e292f8
Author: Jason Gunthorpe <[email protected]>
Date:   Thu Mar 5 14:27:20 2020 -0400

    mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling

    Currently if a special PTE is encountered hmm_range_fault() immediately
    returns EFAULT and sets the HMM_PFN_SPECIAL error output (which nothing
    uses).

    EFAULT should only be returned after testing with hmm_pte_need_fault().

    Also pte_devmap() and pte_special() are exclusive, and there is no need to
    check IS_ENABLED, pte_special() is stubbed out to return false on
    unsupported architectures.

    Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem")
    Reviewed-by: Ralph Campbell <[email protected]>
    Reviewed-by: Christoph Hellwig <[email protected]>
    Signed-off-by: Jason Gunthorpe <[email protected]>

diff --git a/mm/hmm.c b/mm/hmm.c
index 3a03fcf..9c82ea9 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -339,16 +339,21 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
                        pte_unmap(ptep);
                        return -EBUSY;
                }
-       } else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
-               if (!is_zero_pfn(pte_pfn(pte))) {
+       }
+
+       /*
+        * Since each architecture defines a struct page for the zero page, just
+        * fall through and treat it like a normal page.
+        */
+       if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {
+               hmm_pte_need_fault(hmm_vma_walk, orig_pfn, 0, &fault,
+                                  &write_fault);
+               if (fault || write_fault) {
                        pte_unmap(ptep);
-                       *pfn = range->values[HMM_PFN_SPECIAL];
                        return -EFAULT;
                }
-               /*
-                * Since each architecture defines a struct page for the zero
-                * page, just fall through and treat it like a normal page.
-                */
+               *pfn = range->values[HMM_PFN_SPECIAL];
+               return 0;
        }

        *pfn = hmm_device_entry_from_pfn(range, pte_pfn(pte)) | cpu_flags;


Thanks
Zhijian



On 06/08/2021 11:20, Li, Zhijian wrote:
> Hi Jason
>
> thank you for your advice.
>
> CCing nvdimm
>
> both ext4 and xfs are impacted.
>
> Thanks
>
>
> at 2021/8/6 9:45, Jason Gunthorpe wrote:
>> On Wed, Aug 04, 2021 at 04:06:53PM +0800, Li, Zhijian/李 智坚 wrote:
>>> convert to text and send again
>>>
>>> 2021/8/4 15:55, Li, Zhijian wrote:
>>>> Hey all:
>>>>
>>>> Recently, i reported a issue to rpmahttps://github.com/pmem/rpma/issues/1142
>>>> where we found that the native rpma + fsdax example failed in recent kernel.
>>>>
>>>> Below is the bisect log
>>>>
>>>> [lizhijian@yl linux]$ git bisect log
>>>> git bisect start
>>>> # good: [bbf5c979011a099af5dc76498918ed7df445635b] Linux 5.9
>>>> git bisect good bbf5c979011a099af5dc76498918ed7df445635b
>>>> # bad: [2c85ebc57b3e1817b6ce1a6b703928e113a90442] Linux 5.10
>>>> git bisect bad 2c85ebc57b3e1817b6ce1a6b703928e113a90442
>>>> # good: [4d0e9df5e43dba52d38b251e3b909df8fa1110be] lib, uaccess: add failure injection to usercopy functions
>>>> git bisect good 4d0e9df5e43dba52d38b251e3b909df8fa1110be
>>>> # bad: [6694875ef8045cdb1e6712ee9b68fe08763507d8] ext4: indicate that fast_commit is available via /sys/fs/ext4/feature/...
>>>> git bisect bad 6694875ef8045cdb1e6712ee9b68fe08763507d8
>>>> # good: [14c914fcb515c424177bb6848cc2858ebfe717a8] Merge tag 'wireless-drivers-next-2020-10-02' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
>>>> git bisect good 14c914fcb515c424177bb6848cc2858ebfe717a8
>>>> # good: [6f78b9acf04fbf9ede7f4265e7282f9fb39d2c8c] Merge tag 'mtd/for-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
>>>> git bisect good 6f78b9acf04fbf9ede7f4265e7282f9fb39d2c8c
>>>> # bad: [bbe85027ce8019c73ab99ad1c2603e2dcd1afa49] Merge tag 'xfs-5.10-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
>>>> git bisect bad bbe85027ce8019c73ab99ad1c2603e2dcd1afa49
>>>> # bad: [9d9af1007bc08971953ae915d88dc9bb21344b53] Merge tag 'perf-tools-for-v5.10-2020-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
>>>> git bisect bad 9d9af1007bc08971953ae915d88dc9bb21344b53
>>>> # good: [21c2fe94abb2abe894e6aabe6b4e84a255c8d339] RDMA/mthca: Combine special QP struct with mthca QP
>>>> git bisect good 21c2fe94abb2abe894e6aabe6b4e84a255c8d339
>>>> # good: [dbaa1b3d9afba3c050d365245a36616ae3f425a7] Merge branch 'perf/urgent' into perf/core
>>>> git bisect good dbaa1b3d9afba3c050d365245a36616ae3f425a7
>>>> # bad: [c7a198c700763ac89abbb166378f546aeb9afb33] RDMA/ucma: Fix use after free in destroy id flow
>>>> git bisect bad c7a198c700763ac89abbb166378f546aeb9afb33
>>>> # bad: [5ce2dced8e95e76ff7439863a118a053a7fc6f91] RDMA/ipoib: Set rtnl_link_ops for ipoib interfaces
>>>> git bisect bad 5ce2dced8e95e76ff7439863a118a053a7fc6f91
>>>> # bad: [a03bfc37d59de316436c46f5691c5a972ed57c82] RDMA/mlx5: Sync device with CPU pages upon ODP MR registration
>>>> git bisect bad a03bfc37d59de316436c46f5691c5a972ed57c82
>>>> # good: [a6f0b08dbaf289c3c57284e16ac8043140f2139b] RDMA/core: Remove ucontext->closing
>>>> git bisect good a6f0b08dbaf289c3c57284e16ac8043140f2139b
>>>> # bad: [36f30e486dce22345c2dd3a3ba439c12cd67f6ba] IB/core: Improve ODP to use hmm_range_fault()
>>>> git bisect bad 36f30e486dce22345c2dd3a3ba439c12cd67f6ba
>>>> # good: [2ee9bf346fbfd1dad0933b9eb3a4c2c0979b633e] RDMA/addr: Fix race with netevent_callback()/rdma_addr_cancel()
>>>> git bisect good 2ee9bf346fbfd1dad0933b9eb3a4c2c0979b633e
>>>> # first bad commit: [36f30e486dce22345c2dd3a3ba439c12cd67f6ba] IB/core: Improve ODP to use hmm_range_fault()
>> This is perhaps not so surprising, but I think you should report it to
>> the dax people that hmm_range_fault and dax don't work together..
>>
>> Though I think it is supposed to, and I'm surprised it doesn't.
>>
>> Jason
>>
>>

2021-08-27 12:14:44

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d

On Fri, Aug 27, 2021 at 08:15:40AM +0000, [email protected] wrote:
> i looked over the change-log of hmm_vma_handle_pte(), and found that before
> 4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling")
>
> hmm_vma_handle_pte() will not check pte_special(pte) if pte_devmap(pte) is true.
>
> when we reached
> "if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
> the pte have already presented and its pte's flag already fulfilled the request flags.
>
>
> My question is that
> Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
> pte_devmap(pte) and pte_special(pte) could be both true in fsdax user case, right ?

How? what code creates that?

I see:

insert_pfn():
/* Ok, finally just insert the thing.. */
if (pfn_t_devmap(pfn))
entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
else
entry = pte_mkspecial(pfn_t_pte(pfn, prot));

So what code path ends up setting both bits?

Jason

2021-08-27 13:10:01

by Li Zhijian

[permalink] [raw]
Subject: Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d


on 2021/8/27 20:10, Jason Gunthorpe wrote:
> On Fri, Aug 27, 2021 at 08:15:40AM +0000, [email protected] wrote:
>> i looked over the change-log of hmm_vma_handle_pte(), and found that before
>> 4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling")
>>
>> hmm_vma_handle_pte() will not check pte_special(pte) if pte_devmap(pte) is true.
>>
>> when we reached
>> "if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
>> the pte have already presented and its pte's flag already fulfilled the request flags.
>>
>>
>> My question is that
>> Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
>> pte_devmap(pte) and pte_special(pte) could be both true in fsdax user case, right ?
> How? what code creates that?
>
> I see:
>
> insert_pfn():
> /* Ok, finally just insert the thing.. */
> if (pfn_t_devmap(pfn))
> entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
> else
> entry = pte_mkspecial(pfn_t_pte(pfn, prot));
>
> So what code path ends up setting both bits?

 pte_mkdevmap() will set both _PAGE_SPECIAL | PAGE_DEVMAP

 395 static inline pte_t pte_mkdevmap(pte_t pte)
 396 {
 397         return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
 398 }

below is a calltrace example

[  400.728559] Call Trace:
[  400.731595] dump_stack+0x6d/0x8b
[  400.735536] insert_pfn+0x16c/0x180
[  400.739596] __vm_insert_mixed+0x84/0xc0
[  400.744144] dax_iomap_pte_fault+0x845/0x870
[  400.749089] ext4_dax_huge_fault+0x171/0x1e0
[  400.754096] __do_fault+0x31/0xe0
[  400.758090]  ? pmd_devmap_trans_unstable+0x37/0x90
[  400.763541] handle_mm_fault+0x11b1/0x1680
[  400.768260] exc_page_fault+0x2f4/0x570
[  400.772788]  ? asm_exc_page_fault+0x8/0x30
[  400.777539]  asm_exc_page_fault+0x1e/0x30


So is my previous change reasonable ?

Thanks

Zhijian



2021-08-27 13:17:57

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d

On Fri, Aug 27, 2021 at 09:05:21PM +0800, Li, Zhijian wrote:
>
> on 2021/8/27 20:10, Jason Gunthorpe wrote:
> > On Fri, Aug 27, 2021 at 08:15:40AM +0000, [email protected] wrote:
> > > i looked over the change-log of hmm_vma_handle_pte(), and found that before
> > > 4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling")
> > >
> > > hmm_vma_handle_pte() will not check pte_special(pte) if pte_devmap(pte) is true.
> > >
> > > when we reached
> > > "if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
> > > the pte have already presented and its pte's flag already fulfilled the request flags.
> > >
> > >
> > > My question is that
> > > Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
> > > pte_devmap(pte) and pte_special(pte) could be both true in fsdax user case, right ?
> > How? what code creates that?
> >
> > I see:
> >
> > insert_pfn():
> > /* Ok, finally just insert the thing.. */
> > if (pfn_t_devmap(pfn))
> > entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
> > else
> > entry = pte_mkspecial(pfn_t_pte(pfn, prot));
> >
> > So what code path ends up setting both bits?
>
>  pte_mkdevmap() will set both _PAGE_SPECIAL | PAGE_DEVMAP
>
>  395 static inline pte_t pte_mkdevmap(pte_t pte)
>  396 {
>  397         return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
>  398 }
>
> below is a calltrace example
>
> [  400.728559] Call Trace:
> [  400.731595] dump_stack+0x6d/0x8b
> [  400.735536] insert_pfn+0x16c/0x180
> [  400.739596] __vm_insert_mixed+0x84/0xc0
> [  400.744144] dax_iomap_pte_fault+0x845/0x870
> [  400.749089] ext4_dax_huge_fault+0x171/0x1e0
> [  400.754096] __do_fault+0x31/0xe0
> [  400.758090]  ? pmd_devmap_trans_unstable+0x37/0x90
> [  400.763541] handle_mm_fault+0x11b1/0x1680
> [  400.768260] exc_page_fault+0x2f4/0x570
> [  400.772788]  ? asm_exc_page_fault+0x8/0x30
> [  400.777539]  asm_exc_page_fault+0x1e/0x30
>
>
> So is my previous change reasonable ?

Yes, can you send a proper patch and include the mm mailing list?

Jason

2021-08-27 13:41:14

by Li Zhijian

[permalink] [raw]
Subject: Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d


on 2021/8/27 21:16, Jason Gunthorpe wrote:
> On Fri, Aug 27, 2021 at 09:05:21PM +0800, Li, Zhijian wrote:
>
> Yes, can you send a proper patch and include the mm mailing list?

Of course, my pleasure

Thanks



2021-08-27 16:43:59

by Dan Williams

[permalink] [raw]
Subject: Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d

On Fri, Aug 27, 2021 at 6:05 AM Li, Zhijian <[email protected]> wrote:
>
>
> on 2021/8/27 20:10, Jason Gunthorpe wrote:
> > On Fri, Aug 27, 2021 at 08:15:40AM +0000, [email protected] wrote:
> >> i looked over the change-log of hmm_vma_handle_pte(), and found that before
> >> 4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling")
> >>
> >> hmm_vma_handle_pte() will not check pte_special(pte) if pte_devmap(pte) is true.
> >>
> >> when we reached
> >> "if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
> >> the pte have already presented and its pte's flag already fulfilled the request flags.
> >>
> >>
> >> My question is that
> >> Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
> >> pte_devmap(pte) and pte_special(pte) could be both true in fsdax user case, right ?
> > How? what code creates that?
> >
> > I see:
> >
> > insert_pfn():
> > /* Ok, finally just insert the thing.. */
> > if (pfn_t_devmap(pfn))
> > entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
> > else
> > entry = pte_mkspecial(pfn_t_pte(pfn, prot));
> >
> > So what code path ends up setting both bits?
>
> pte_mkdevmap() will set both _PAGE_SPECIAL | PAGE_DEVMAP
>
> 395 static inline pte_t pte_mkdevmap(pte_t pte)
> 396 {
> 397 return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
> 398 }

I can't recall why _PAGE_SPECIAL is there. I'll take a look, but I
think setting _PAGE_SPECIAL in pte_mkdevmap() is overkill.

2021-08-27 16:56:34

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d

On Fri, Aug 27, 2021 at 09:42:21AM -0700, Dan Williams wrote:
> On Fri, Aug 27, 2021 at 6:05 AM Li, Zhijian <[email protected]> wrote:
> >
> >
> > on 2021/8/27 20:10, Jason Gunthorpe wrote:
> > > On Fri, Aug 27, 2021 at 08:15:40AM +0000, [email protected] wrote:
> > >> i looked over the change-log of hmm_vma_handle_pte(), and found that before
> > >> 4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling")
> > >>
> > >> hmm_vma_handle_pte() will not check pte_special(pte) if pte_devmap(pte) is true.
> > >>
> > >> when we reached
> > >> "if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
> > >> the pte have already presented and its pte's flag already fulfilled the request flags.
> > >>
> > >>
> > >> My question is that
> > >> Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
> > >> pte_devmap(pte) and pte_special(pte) could be both true in fsdax user case, right ?
> > > How? what code creates that?
> > >
> > > I see:
> > >
> > > insert_pfn():
> > > /* Ok, finally just insert the thing.. */
> > > if (pfn_t_devmap(pfn))
> > > entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
> > > else
> > > entry = pte_mkspecial(pfn_t_pte(pfn, prot));
> > >
> > > So what code path ends up setting both bits?
> >
> > pte_mkdevmap() will set both _PAGE_SPECIAL | PAGE_DEVMAP
> >
> > 395 static inline pte_t pte_mkdevmap(pte_t pte)
> > 396 {
> > 397 return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
> > 398 }
>
> I can't recall why _PAGE_SPECIAL is there. I'll take a look, but I
> think setting _PAGE_SPECIAL in pte_mkdevmap() is overkill.

This is my feeling too, but every arch does it, so hmm should check
it, at least for now as a stable fix

devmap has a struct page so it should be refcounted inside the VMA and
that is the main thing that PAGE_SPECIAL disabled, AFAICR..

The only places where pte_special are used that I wonder if are OK for
devmap have to do with CPU cache maintenance

vm_normal_page(), hmm_vma_handle_pte(), gup_pte_range() all look OK to
drop the special bit

Jason