2016-03-11 21:13:17

by Matthew Wilcox

Subject: [PATCH 0/3] Make pfn_t suitable for placing in the radix tree

I did some experimenting with converting the DAX radix tree from storing
sector numbers to storing pfn_t. While we're not ready to do that
conversion yet, these pieces are worth reviewing now, and perhaps
merging upstream.

I think the first patch is worthwhile all by itself as a stepping stone to
making SG lists contain PFNs instead of pages.

Matthew Wilcox (3):
pfn_t: Change the encoding
pfn_t: Support for huge PFNs
pfn_t: New functions pfn_t_add and pfn_t_cmp

include/linux/pfn_t.h | 72 +++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 61 insertions(+), 11 deletions(-)

--
2.7.0


2016-03-11 21:13:10

by Matthew Wilcox

Subject: [PATCH 2/3] pfn_t: Support for huge PFNs

We need to put entries in the radix tree that represent PMDs and PUDs.
Use another bit to determine if this PFN entry represents a huge page
or not. If it does, we know the bottom few bits of the PFN are zero,
so we can reuse them to distinguish between PMDs and PUDs. Thanks to
Neil Brown for suggesting this more compact encoding.

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/pfn_t.h | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h
index 37c596c..0ff4c4e 100644
--- a/include/linux/pfn_t.h
+++ b/include/linux/pfn_t.h
@@ -8,14 +8,21 @@
* PFN_SG_LAST - pfn references a page and is the last scatterlist entry
* PFN_DEV - pfn is not covered by system memmap by default
* PFN_MAP - pfn has a dynamic page mapping established by a device driver
+ * PFN_HUGE - pfn represents a huge page
*
* Note that the bottom two bits in the pfn_t match the bottom two bits in the
* scatterlist so sg_is_chain() and sg_is_last() work. These bits are also
* used by the radix tree for its own purposes, but a PFN cannot be in both a
* radix tree and a scatterlist simultaneously. If a PFN is moved between the
* two usages, care should be taken to clear/set these bits appropriately.
+ *
+ * If PFN_HUGE is set, the bottom PMD_SHIFT-PAGE_SHIFT bits are known to be
+ * zero. We reuse them to indicate whether the PFN represents a PMD or a PUD.
+ *
+ * This scheme supports 27 bits on a 32-bit system (512GB of physical address
+ * space) and 59 bits on a 64-bit system (2048EB of physical address space).
*/
-#define PFN_FLAG_BITS 4
+#define PFN_FLAG_BITS 5
#define PFN_FLAGS_MASK ((1UL << PFN_FLAG_BITS) - 1)
#define __PFN_MAX ((1UL << (BITS_PER_LONG - PFN_FLAG_BITS)) - 1)
#define PFN_SG_CHAIN 0x01UL
@@ -23,6 +30,11 @@
#define PFN_SG_MASK (PFN_SG_CHAIN | PFN_SG_LAST)
#define PFN_DEV 0x04UL
#define PFN_MAP 0x08UL
+#define PFN_HUGE 0x10UL
+#define PFN_SIZE_MASK 0x30UL
+#define PFN_SIZE_PTE 0x00UL
+#define PFN_SIZE_PMD 0x10UL
+#define PFN_SIZE_PUD 0x30UL

#if 0
#define PFN_T_BUG_ON(x) BUG_ON(x)
@@ -35,7 +47,7 @@ static inline pfn_t __pfn_to_pfn_t(unsigned long pfn, u64 flags)
pfn_t pfn_t = { .val = (pfn << PFN_FLAG_BITS) | flags };

PFN_T_BUG_ON(pfn & ~__PFN_MAX);
- PFN_T_BUG_ON(flags & ~PFN_FLAGS_MASK);
+ PFN_T_BUG_ON(flags & ~(PFN_FLAGS_MASK | PFN_SIZE_MASK));

return pfn_t;
}
@@ -46,9 +58,12 @@ static inline pfn_t pfn_to_pfn_t(unsigned long pfn)
return __pfn_to_pfn_t(pfn, 0);
}

+/* The bottom bits of the PFN may be set if PFN_HUGE is set */
static inline unsigned long pfn_t_to_pfn(pfn_t pfn)
{
unsigned long v = pfn.val;
+ if (v & PFN_HUGE)
+ v &= ~PFN_SIZE_MASK;
return v >> PFN_FLAG_BITS;
}

--
2.7.0

2016-03-11 21:13:12

by Matthew Wilcox

Subject: [PATCH 1/3] pfn_t: Change the encoding

By moving the flag bits to the bottom, we encourage commonality
between SGs with pages and those using pfn_t. We can also then insert
a pfn_t into a radix tree, as it uses the same two bits for indirect &
exceptional indicators.

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/pfn_t.h | 41 ++++++++++++++++++++++++++++++-----------
1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h
index 9499481..37c596c 100644
--- a/include/linux/pfn_t.h
+++ b/include/linux/pfn_t.h
@@ -8,16 +8,34 @@
* PFN_SG_LAST - pfn references a page and is the last scatterlist entry
* PFN_DEV - pfn is not covered by system memmap by default
* PFN_MAP - pfn has a dynamic page mapping established by a device driver
+ *
+ * Note that the bottom two bits in the pfn_t match the bottom two bits in the
+ * scatterlist so sg_is_chain() and sg_is_last() work. These bits are also
+ * used by the radix tree for its own purposes, but a PFN cannot be in both a
+ * radix tree and a scatterlist simultaneously. If a PFN is moved between the
+ * two usages, care should be taken to clear/set these bits appropriately.
*/
-#define PFN_FLAGS_MASK (((u64) ~PAGE_MASK) << (BITS_PER_LONG_LONG - PAGE_SHIFT))
-#define PFN_SG_CHAIN (1ULL << (BITS_PER_LONG_LONG - 1))
-#define PFN_SG_LAST (1ULL << (BITS_PER_LONG_LONG - 2))
-#define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3))
-#define PFN_MAP (1ULL << (BITS_PER_LONG_LONG - 4))
+#define PFN_FLAG_BITS 4
+#define PFN_FLAGS_MASK ((1UL << PFN_FLAG_BITS) - 1)
+#define __PFN_MAX ((1UL << (BITS_PER_LONG - PFN_FLAG_BITS)) - 1)
+#define PFN_SG_CHAIN 0x01UL
+#define PFN_SG_LAST 0x02UL
+#define PFN_SG_MASK (PFN_SG_CHAIN | PFN_SG_LAST)
+#define PFN_DEV 0x04UL
+#define PFN_MAP 0x08UL
+
+#if 0
+#define PFN_T_BUG_ON(x) BUG_ON(x)
+#else
+#define PFN_T_BUG_ON(x) BUILD_BUG_ON_INVALID(x)
+#endif

static inline pfn_t __pfn_to_pfn_t(unsigned long pfn, u64 flags)
{
- pfn_t pfn_t = { .val = pfn | (flags & PFN_FLAGS_MASK), };
+ pfn_t pfn_t = { .val = (pfn << PFN_FLAG_BITS) | flags };
+
+ PFN_T_BUG_ON(pfn & ~__PFN_MAX);
+ PFN_T_BUG_ON(flags & ~PFN_FLAGS_MASK);

return pfn_t;
}
@@ -28,6 +46,12 @@ static inline pfn_t pfn_to_pfn_t(unsigned long pfn)
return __pfn_to_pfn_t(pfn, 0);
}

+static inline unsigned long pfn_t_to_pfn(pfn_t pfn)
+{
+ unsigned long v = pfn.val;
+ return v >> PFN_FLAG_BITS;
+}
+
extern pfn_t phys_to_pfn_t(phys_addr_t addr, u64 flags);

static inline bool pfn_t_has_page(pfn_t pfn)
@@ -35,11 +59,6 @@ static inline bool pfn_t_has_page(pfn_t pfn)
return (pfn.val & PFN_MAP) == PFN_MAP || (pfn.val & PFN_DEV) == 0;
}

-static inline unsigned long pfn_t_to_pfn(pfn_t pfn)
-{
- return pfn.val & ~PFN_FLAGS_MASK;
-}
-
static inline struct page *pfn_t_to_page(pfn_t pfn)
{
if (pfn_t_has_page(pfn))
--
2.7.0

2016-03-11 21:13:20

by Matthew Wilcox

Subject: [PATCH 3/3] pfn_t: New functions pfn_t_add and pfn_t_cmp

When we find a huge PFN in the radix tree, we need to add the low
bits of the index to it in order to find the PFN we are looking for.
Since we want the result to stay in pfn_t form, create pfn_t_add().

We also need to compare PFNs, for example to determine if the PFN
represents a zero page. At the moment, we only have use for comparing
equality, but a general compare operation is no more code and may prove
useful in the future.

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/pfn_t.h | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h
index 0ff4c4e..c3bb5fe 100644
--- a/include/linux/pfn_t.h
+++ b/include/linux/pfn_t.h
@@ -67,6 +67,22 @@ static inline unsigned long pfn_t_to_pfn(pfn_t pfn)
return v >> PFN_FLAG_BITS;
}

+static inline __must_check pfn_t pfn_t_add(const pfn_t pfn, int val)
+{
+ pfn_t tmp = pfn;
+ if (tmp.val & PFN_HUGE)
+ tmp.val &= ~PFN_SIZE_MASK;
+ tmp.val &= ~PFN_SG_MASK;
+ tmp.val += val << PFN_FLAG_BITS;
+ return tmp;
+}
+
+/* Like memcmp, returns <0 if a<b, 0 if a=b and >0 if a>b */
+static inline int pfn_t_cmp(pfn_t a, pfn_t b)
+{
+ return pfn_t_to_pfn(a) - pfn_t_to_pfn(b);
+}
+
extern pfn_t phys_to_pfn_t(phys_addr_t addr, u64 flags);

static inline bool pfn_t_has_page(pfn_t pfn)
--
2.7.0

2016-03-11 21:40:23

by Dan Williams

Subject: Re: [PATCH 1/3] pfn_t: Change the encoding

On Fri, Mar 11, 2016 at 1:13 PM, Matthew Wilcox
<[email protected]> wrote:
> By moving the flag bits to the bottom, we encourage commonality
> between SGs with pages and those using pfn_t. We can also then insert
> a pfn_t into a radix tree, as it uses the same two bits for indirect &
> exceptional indicators.

It's not immediately clear to me what we gain with SG entry
commonality. The down side is that we lose the property that
pfn_to_pfn_t() is a nop. This was Dave's suggestion so that the
nominal case did not change the binary layout of a typical pfn.

Can we just bit swizzle a pfn_t on insertion/retrieval from the radix?

2016-03-12 18:29:52

by Matthew Wilcox

Subject: Re: [PATCH 1/3] pfn_t: Change the encoding

On Fri, Mar 11, 2016 at 01:40:20PM -0800, Dan Williams wrote:
> On Fri, Mar 11, 2016 at 1:13 PM, Matthew Wilcox
> <[email protected]> wrote:
> > By moving the flag bits to the bottom, we encourage commonality
> > between SGs with pages and those using pfn_t. We can also then insert
> > a pfn_t into a radix tree, as it uses the same two bits for indirect &
> > exceptional indicators.
>
> It's not immediately clear to me what we gain with SG entry
> commonality. The down side is that we lose the property that
> pfn_to_pfn_t() is a nop. This was Dave's suggestion so that the
> nominal case did not change the binary layout of a typical pfn.

I understand that motivation!

> Can we just bit swizzle a pfn_t on insertion/retrieval from the radix?

Of course we *can*, but we end up doing more swizzling that way than we
do this way. In the Brave New Future where we're storing pfn_t in the
radix tree, on a page fault we find the pfn_t in the radix tree then
we want to insert it into the page tables. So DAX would first have to
convert the radix tree entry to a pfn_t, then the page table code has to
convert the pfn_t into a pte/pmd/pud (which we currently do by converting
a pfn_t to a pfn, then converting the pfn to a pte/pmd/pud, but I assume
that either the compiler optimises that into a single conversion, or we'll
add pfn_t_pte to each architecture in future if it's actually a problem).

Much easier to look up a pfn_t in the radix tree and pass it directly
to vm_insert_mixed().

If there's any part of the kernel that is doing a *lot* of conversion
between pfn_t and pfn, that surely indicates a place in the kernel where
we need to convert an interface from pfn to pfn_t.

(It occurs to me we can make the code simpler on architectures that
don't support PUDs. The PFN_HUGE bit is still available to distinguish
between PMDs and PTEs, but we won't need to clear the bottom bit of the
PFN if PFN_HUGE is set, since nobody can add a PUD pfn to the radix tree).

2016-03-13 23:09:46

by Dan Williams

Subject: Re: [PATCH 1/3] pfn_t: Change the encoding

On Sat, Mar 12, 2016 at 10:30 AM, Matthew Wilcox <[email protected]> wrote:
> On Fri, Mar 11, 2016 at 01:40:20PM -0800, Dan Williams wrote:
>> On Fri, Mar 11, 2016 at 1:13 PM, Matthew Wilcox
>> <[email protected]> wrote:
>> > By moving the flag bits to the bottom, we encourage commonality
>> > between SGs with pages and those using pfn_t. We can also then insert
>> > a pfn_t into a radix tree, as it uses the same two bits for indirect &
>> > exceptional indicators.
>>
>> It's not immediately clear to me what we gain with SG entry
>> commonality. The down side is that we lose the property that
>> pfn_to_pfn_t() is a nop. This was Dave's suggestion so that the
>> nominal case did not change the binary layout of a typical pfn.
>
> I understand that motivation!
>
>> Can we just bit swizzle a pfn_t on insertion/retrieval from the radix?
>
> Of course we *can*, but we end up doing more swizzling that way than we
> do this way. In the Brave New Future where we're storing pfn_t in the
> radix tree, on a page fault we find the pfn_t in the radix tree then
> we want to insert it into the page tables. So DAX would first have to
> convert the radix tree entry to a pfn_t, then the page table code has to
> convert the pfn_t into a pte/pmd/pud (which we currently do by converting
> a pfn_t to a pfn, then converting the pfn to a pte/pmd/pud, but I assume
> that either the compiler optimises that into a single conversion, or we'll
> add pfn_t_pte to each architecture in future if it's actually a problem).
>
> Much easier to look up a pfn_t in the radix tree and pass it directly
> to vm_insert_mixed().
>
> If there's any part of the kernel that is doing a *lot* of conversion
> between pfn_t and pfn, that surely indicates a place in the kernel where
> we need to convert an interface from pfn to pfn_t.

So this is dependent on where pfn_t gets pushed in the future. For
example, if we revive using a pfn_t in a bio then I think the
pfn_to_pfn_t() conversions will be more prevalent than the fs/dax.c
radix usages.

2016-03-14 15:00:08

by Matthew Wilcox

Subject: Re: [PATCH 1/3] pfn_t: Change the encoding

On Sun, Mar 13, 2016 at 04:09:38PM -0700, Dan Williams wrote:
> On Sat, Mar 12, 2016 at 10:30 AM, Matthew Wilcox <[email protected]> wrote:
> > On Fri, Mar 11, 2016 at 01:40:20PM -0800, Dan Williams wrote:
> >> Can we just bit swizzle a pfn_t on insertion/retrieval from the radix?
> >
> > Of course we *can*, but we end up doing more swizzling that way than we
> > do this way. In the Brave New Future where we're storing pfn_t in the
> > radix tree, on a page fault we find the pfn_t in the radix tree then
> > we want to insert it into the page tables. So DAX would first have to
> > convert the radix tree entry to a pfn_t, then the page table code has to
> > convert the pfn_t into a pte/pmd/pud (which we currently do by converting
> > a pfn_t to a pfn, then converting the pfn to a pte/pmd/pud, but I assume
> > that either the compiler optimises that into a single conversion, or we'll
> > add pfn_t_pte to each architecture in future if it's actually a problem).
> >
> > Much easier to look up a pfn_t in the radix tree and pass it directly
> > to vm_insert_mixed().
> >
> > If there's any part of the kernel that is doing a *lot* of conversion
> > between pfn_t and pfn, that surely indicates a place in the kernel where
> > we need to convert an interface from pfn to pfn_t.
>
> So this is dependent on where pfn_t gets pushed in the future. For
> example, if we revive using a pfn_t in a bio then I think the
> pfn_to_pfn_t() conversions will be more prevalent than the fs/dax.c
> radix usages.

Yes, we'll be converting to a pfn_t in more places than we are now
... but what do we do with that pfn_t once we've got it into a bio?
Except for some rare cases (brd, maybe pmem), it gets converted into an
sg list which then gets DMA mapped, then the DMA addresses are converted
into whatever format the hardware wants. As long as we convert the sg
list before we convert the bio, there aren't going to be any additional
conversions from pfn_t to pfn. So I don't see this showing up as an
additional per-I/O cost.