2021-01-28 00:28:57

by Pasha Tatashin

Subject: dax alignment problem on arm64 (and other architectures)

This is something that Dan Williams and I discussed off the mailing
list some time ago, but I want to have a broader discussion about this
problem so that I can send out a fix that would be acceptable.

We have a 2G pmem device that is carved out of regular memory that we
use to pass data across reboots. After the machine is rebooted we
hotplug that memory back, so we do not lose 2G of system memory
(machine is small, only 8G of RAM total).

In order to hotplug pmem memory it first must be converted to devdax.
Devdax has a 2M label that is placed at the beginning of the pmem
device memory, which is what causes the problem.
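
Roughly, the flow we use looks like this (the memmap= line is an x86
illustration; on our arm64 machines the carve-out is done by firmware, and
the device names are just examples):

# carve 2G of RAM out as pmem, e.g. via the kernel command line on x86:
#     memmap=2G!8G
# convert the resulting raw namespace to devdax
ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
# hand the devdax range to the kmem driver so it comes back as System RAM
echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id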

The section size is the memory hotplug granularity on Linux: whatever gets
hot-plugged or hot-removed must be section-size aligned. On x86 the
section size is 128M; on arm64 it is 1G (because arm64 supports 64K
pages, and 128M does not work with 64K pages). Because the first 2M
are subtracted from the pmem device to create the devdax, the actual
hot-pluggable memory is not 1G/128M aligned. The whole first section
is therefore skipped when the memory gets hot-plugged because of the
2M label, and we lose 126M on x86 or 1022M on arm64 of the memory that
is being hot-plugged.
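
To spell out the arithmetic behind those numbers (a trivial illustration):

# per-section loss caused by the 2M devdax label
for sec in 128 1024; do
    echo "section size ${sec}M: $((sec - 2))M of the first section is skipped"
done
# section size 128M: 126M of the first section is skipped
# section size 1024M: 1022M of the first section is skipped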

As a workaround, so that we do not lose 1022M out of 8G of memory on arm64,
we have reduced the section size to 128M using this patch [1].
This way we are losing only 126M (which I still hate!).

I would like to get rid of this workaround. First, because I would
like us to switch to 64K pages to gain performance, and second so we
do not depend on an unofficial patch, which has already given us some
headaches with kdump support.

Here are some solutions that I think we can do:

1. Instead of carving out the memory at a 1G-aligned address, do it at a
1G - 2M address; this way, when the devdax is created, it is perfectly 1G
aligned (a concrete x86 example of this offset trick is sketched after this
list). On arm64 this causes a panic because there is a 2M hole in
memory. Even if the panic is fixed, I do not think this is a proper fix.
It is simply a workaround for the underlying problem.

2. Dan Williams introduced subsections [2]. They, however, do not work
with devdax and hot-plugging in general. Those patches take care of
the __add_pages() side of things, not add_memory(). Also, it is
unclear what kind of user-interface changes need to be made in order
to enable the subsection feature for onlining/offlining pages.

3. Allow hot-plugging the devdax device together with the label, but teach
the kernel not to touch the label (i.e. not allocate its memory). IMO a kind
of ugly solution, because when the devdax is hot-plugged it is not even aware
of the label size. But, perhaps that can be changed.

4. Other ideas? (move dax label to the end? a special case without a
label? label outside of data?)
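
For option 1, a concrete x86 illustration of the offset trick (the numbers
assume the 2G-at-8G layout used later in this thread and the 2M label; they
are illustrative only):

# Reserve 2G+2M of RAM starting 2M below the 8G boundary, so that the
# devdax data area that follows the 2M label starts exactly at 8G:
#     memmap=2050M!8190M      (instead of memmap=2G!8G)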

Thank you,
Pasha

[1] https://lore.kernel.org/lkml/[email protected]
[2] https://lore.kernel.org/lkml/156092349300.979959.17603710711957735135.stgit@dwillia2-desk3.amr.corp.intel.com


2021-01-28 00:30:32

by David Hildenbrand

Subject: Re: dax alignment problem on arm64 (and other architectures)

On 27.01.21 21:43, Pavel Tatashin wrote:
> This is something that Dan Williams and I discussed off the mailing
> list sometime ago, but I want to have a broader discussion about this
> problem so I could send out a fix that would be acceptable.
>
> We have a 2G pmem device that is carved out of regular memory that we
> use to pass data across reboots. After the machine is rebooted we

Ordinary reboots or kexec-style reboots? I assume the latter, because
otherwise there is no guarantee about persistence, right?

I remember for kexec-style reboots there is a different approach (using
tmpfs) on the list.

> hotplug that memory back, so we do not lose 2G of system memory
> (machine is small, only 8G of RAM total).
>
> In order to hotplug pmem memory it first must be converted to devdax.
> Devdax has a label 2M in size that is placed at the beginning of the
> pmem device memory which brings the problem.
>
> The section size is a hotplugging unit on Linux. Whatever gets
> hot-plugged or hot-removed must be section size aligned. On x86
> section size is 128M on arm64 it is 1G (because arm64 supports 64K
> pages, and 128M does not work with 64K pages). Because the first 2M

Note that it's soon 128M with 4k and 16k base pages and 512MB with 64k.
The arm64 patch for that is already queued.

> are subtracted from the pmem device to create devdax, that actual
> hot-pluggable memory is not 1G/128M aligned, and instead we lose 126M
> on x86 or 1022M on arm64 of memory that is getting hot-plugged, the
> whole first section is skipped when memory gets hot plugged because of
> 2M label.
>
> As a workaround, so we do not lose 1022M out of 8G of memory on arm64
> we have section size reduced to 128M. We are using this patch [1].
> This way we are losing 126M (which I still hate!)
>
> I would like to get rid of this workaround. First, because I would
> like us to switch to 64K pages to gain performance, and second so we
> do not depend on an unofficial patch which already has given us some
> headache with kdump support.

I'd want to see 128M sections on arm64 with 64k base pages. "How?" you
might ask. One idea would be to switch from 512M THP to 2MB THP (using
cont pages), and instead implement 512MB gigantic pages. Then we can
reduce pageblock_order / MAX_ORDER - 1 and no longer have the section
limitations. Stuff for the future, though (if even ever).

>
> Here are some solutions that I think we can do:
>
> 1. Instead of carving the memory at 1G aligned address, do it at 1G -
> 2M address, this way when devdax is created it is perfectly 1G
> aligned. On ARM64 it causes a panic because there is a 2M hole in
> memory. Even if panic is fixed, I do not think this is a proper fix.
> This is simply a workaround to the underlying problem.

I remember arm64 already has to deal with all different kinds of memory
holes (including huge ones). I don't think this should be a fundamental
issue.

I think it might be a reasonable thing to do for such a special use
case. Does it work on x86-64?

>
> 2. Dan Williams introduced subsections [2]. They, however do not work
> with devdax, and hot-plugging in general. Those patches take care of
> __add_pages() side of things, and not add_memory(). Also, it is
> unclear what kind of user interface changes need to be made in order
> to enable subsection features to online/offline pages.

I am absolutely no fan of teaching add_memory() and friends in general
about sub-sections.

>
> 3. Allow to hot plug daxdev together with the label, but teach the
> kernel not to touch label (i.e. allocate its memory). IMO, kind of
> ugly solution, because when devdax is hot-plugged it is not even aware
> of label size. But, perhaps that can be changed.

I mean, we could teach add_memory() to "skip the first X pages" when
onlining/offlining, not exposing them to the buddy. Something similar we
already do with Oscar's vmemmap-on-memory series.

But I guess the issue is that the memmap for the label etc. is already
allocated? Is the label memremapped ZONE_DEVICE memory or what is it? Is
the label exposed in the resource tree?

In case "it's just untouched/unexposed memory", it's fairly simple. In
case the label is exposed as ZONE_DEVICE already, it's more of an issue
and might require further tweaks.

>
> 4. Other ideas? (move dax label to the end? a special case without a
> label? label outside of data?)

What does the label include in your example? Sorry, I have no idea about
devdax labels.

I read "ndctl-create-namespace" - "--no-autolabel: Manage labels for
legacy NVDIMMs that do not support labels". So I assume there is at
least some theoretical way to not have a label on the memory?

>
> Thank you,
> Pasha
>
> [1] https://lore.kernel.org/lkml/[email protected]
> [2] https://lore.kernel.org/lkml/156092349300.979959.17603710711957735135.stgit@dwillia2-desk3.amr.corp.intel.com
>


--
Thanks,

David / dhildenb

2021-01-28 00:39:42

by Pasha Tatashin

Subject: Re: dax alignment problem on arm64 (and other architectures)

On Wed, Jan 27, 2021 at 4:09 PM David Hildenbrand <[email protected]> wrote:
>
> On 27.01.21 21:43, Pavel Tatashin wrote:
> > This is something that Dan Williams and I discussed off the mailing
> > list sometime ago, but I want to have a broader discussion about this
> > problem so I could send out a fix that would be acceptable.
> >
> > We have a 2G pmem device that is carved out of regular memory that we
> > use to pass data across reboots. After the machine is rebooted we
>
> Ordinary reboots or kexec-style reboots? I assume the latter, because
> otherwise there is no guarantee about persistence, right?

Both, our firmware supports cold and warm reboot. When we do warm
reboot, memory content is not initialized. However, for performance
reasons, we mostly do kexec reboots.

>
> I remember for kexec-style reboots there is a different approach (using
> tmpfs) on the list.

Right, we are using a similar approach to that tmpfs, but that tmpfs
approach was never upstreamed.

>
> > hotplug that memory back, so we do not lose 2G of system memory
> > (machine is small, only 8G of RAM total).
> >
> > In order to hotplug pmem memory it first must be converted to devdax.
> > Devdax has a label 2M in size that is placed at the beginning of the
> > pmem device memory which brings the problem.
> >
> > The section size is a hotplugging unit on Linux. Whatever gets
> > hot-plugged or hot-removed must be section size aligned. On x86
> > section size is 128M on arm64 it is 1G (because arm64 supports 64K
> > pages, and 128M does not work with 64K pages). Because the first 2M
>
> Note that it's soon 128M with 4k and 16k base pages and 512MB with 64k.
> The arm64 patch for that is already queued.

This is great. Do you have a pointer to that series? It means we can
get rid of our special section size workaround patch, and use the 128M
section size for 4K pages. However, we still can't move to 64K because
losing 510M is too much.

>
> > are subtracted from the pmem device to create devdax, that actual
> > hot-pluggable memory is not 1G/128M aligned, and instead we lose 126M
> > on x86 or 1022M on arm64 of memory that is getting hot-plugged, the
> > whole first section is skipped when memory gets hot plugged because of
> > 2M label.
> >
> > As a workaround, so we do not lose 1022M out of 8G of memory on arm64
> > we have section size reduced to 128M. We are using this patch [1].
> > This way we are losing 126M (which I still hate!)
> >
> > I would like to get rid of this workaround. First, because I would
> > like us to switch to 64K pages to gain performance, and second so we
> > do not depend on an unofficial patch which already has given us some
> > headache with kdump support.
>
> I'd want to see 128M sections on arm64 with 64k base pages. "How?" you
> might ask. One idea would be to switch from 512M THP to 2MB THP (using
> cont pages), and instead implement 512MB gigantic pages. Then we can
> reduce pageblock_order / MAX_ORDER - 1 and no longer have the section
> limitations. Stuff for the future, though (if even ever).

Interesting, but this is not something that would address the
immediate issue, because even losing 126M is something I would like
to fix. However, what other benefits would reducing the section size
on arm64 bring? Do we have a requirement where reducing the section
size is actually needed?

>
> >
> > Here are some solutions that I think we can do:
> >
> > 1. Instead of carving the memory at 1G aligned address, do it at 1G -
> > 2M address, this way when devdax is created it is perfectly 1G
> > aligned. On ARM64 it causes a panic because there is a 2M hole in
> > memory. Even if panic is fixed, I do not think this is a proper fix.
> > This is simply a workaround to the underlying problem.
>
> I remember arm64 already has to deal with all different kinds of memory
> holes (including huge ones). I don't think this should be a fundamental
> issue.

Perhaps not. I can root-cause it and report here what actually happens.

>
> I think it might be a reasonable thing to do for such a special use
> case. Does it work on x86-64?

It does.

> > 2. Dan Williams introduced subsections [2]. They, however do not work
> > with devdax, and hot-plugging in general. Those patches take care of
> > __add_pages() side of things, and not add_memory(). Also, it is
> > unclear what kind of user interface changes need to be made in order
> > to enable subsection features to online/offline pages.
>
> I am absolutely no fan of teaching add_memory() and friends in general
> about sub-sections.
>
> >
> > 3. Allow to hot plug daxdev together with the label, but teach the
> > kernel not to touch label (i.e. allocate its memory). IMO, kind of
> > ugly solution, because when devdax is hot-plugged it is not even aware
> > of label size. But, perhaps that can be changed.
>
> I mean, we could teach add_memory() to "skip the first X pages" when
> onlining/offlining, not exposing them to the buddy. Something similar we
> already do with Oscars vmemmap-on-memory series.
>
> But I guess the issue is that the memmap for the label etc. is already
> allocated? Is the label memremapped ZONE_DEVICE memory or what is it? Is
> the label exposed in the resource tree?

It is exposed:

# ndctl create-namespace --mode raw -e namespace0.0 -f
{
  "dev":"namespace0.0",
  "mode":"raw",
  "size":"2.00 GiB (2.15 GB)",
  "sector_size":512,
  "blockdev":"pmem0"
}

The raw device is exactly 2G

# cat /proc/iomem | grep 'dax\|namespace'
980000000-9ffffffff : namespace0.0

namespace0.0 is 2G, and there is no dax0.0 yet.

Create devdax device:
# ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
{
  "dev":"namespace0.0",
  "mode":"devdax",
  "map":"mem",
  "size":"2046.00 MiB (2145.39 MB)",
  "uuid":"ed4d6a34-6a11-4ced-8a4f-b2487bddf5d7",
  "daxregion":{
    "id":0,
    "size":"2046.00 MiB (2145.39 MB)",
    "align":2097152,
    "devices":[
      {
        "chardev":"dax0.0",
        "size":"2046.00 MiB (2145.39 MB)",
        "mode":"devdax"
      }
    ]
  },
  "align":2097152
}

Now, the device is 2046M in size instead of 2G.

root@dplat-cp22:/# cat /proc/iomem | grep 'namespace\|dax'
980000000-9801fffff : namespace0.0
980200000-9ffffffff : dax0.0

We can see that namespace0.0 is now 2M, which is the label, and dax0.0 is 2046M.


>
> In case "it's just untouched/unexposed memory", it's fairly simple. In
> case the label is exposed as ZONE_DEVICE already, it's more of an issue
> and might require further tweaks.
>
> >
> > 4. Other ideas? (move dax label to the end? a special case without a
> > label? label outside of data?)
>
> What does the label include in your example? Sorry, I have no idea about
> devdax labels.
>
> I read "ndctl-create-namespace" - "--no-autolabel: Manage labels for
> legacy NVDIMMs that do not support labels". So I assume there is at
> least some theoretical way to not have a label on the memory?

Right, but I do not think it is possible to do for dax devices (as of
right now). I assume it contains information about what kind of
device it is: devdax, fsdax, sector, uuid, etc.
See the namespaces table at [1]. It contains a summary of pmem device
types and which of them have a label (all except for raw).

[1] https://nvdimm.wiki.kernel.org/
>
> >
> > Thank you,
> > Pasha
> >
> > [1] https://lore.kernel.org/lkml/[email protected]
> > [2] https://lore.kernel.org/lkml/156092349300.979959.17603710711957735135.stgit@dwillia2-desk3.amr.corp.intel.com
> >
>
>
> --
> Thanks,
>
> David / dhildenb
>

2021-01-28 00:44:14

by David Hildenbrand

Subject: Re: dax alignment problem on arm64 (and other architectures)

>> Ordinary reboots or kexec-style reboots? I assume the latter, because
>> otherwise there is no guarantee about persistence, right?
>
> Both, our firmware supports cold and warm reboot. When we do warm
> reboot, memory content is not initialized. However, for performance
> reasons, we mostly do kexec reboots.
>

One issue usually is that often firmware can allocate from available
system RAM and/or modify/initialize it. I assume you're running some
custom firmware :)

>>
>> I remember for kexec-style reboots there is a different approach (using
>> tmpfs) on the list.
>
> Right, we are using a similar approach to that tmpfs, but that tmpfs
> approach was never upstreamed.

I assume that people will follow up on that, because it's getting used
for fast hypervisor reboots by some companies IIRC.

>
>>
>>> hotplug that memory back, so we do not lose 2G of system memory
>>> (machine is small, only 8G of RAM total).
>>>
>>> In order to hotplug pmem memory it first must be converted to devdax.
>>> Devdax has a label 2M in size that is placed at the beginning of the
>>> pmem device memory which brings the problem.
>>>
>>> The section size is a hotplugging unit on Linux. Whatever gets
>>> hot-plugged or hot-removed must be section size aligned. On x86
>>> section size is 128M on arm64 it is 1G (because arm64 supports 64K
>>> pages, and 128M does not work with 64K pages). Because the first 2M
>>
>> Note that it's soon 128M with 4k and 16k base pages and 512MB with 64k.
>> The arm64 patch for that is already queued.
>
> This is great. Do you have a pointer to that series? It means we can
> get rid of our special section size workaround patch, and use the 128M
> section size for 4K pages. However, we still can't move to 64K because
> losing 510M is too much.
>

Sure

https://lkml.kernel.org/r/[email protected]

Personally, I think the future is 4k, especially for smaller machines.
(also, imagine right now how many 512MB THP you can actually use in your
8GB VM ..., simply not suitable for small machines).

>>
>>> are subtracted from the pmem device to create devdax, that actual
>>> hot-pluggable memory is not 1G/128M aligned, and instead we lose 126M
>>> on x86 or 1022M on arm64 of memory that is getting hot-plugged, the
>>> whole first section is skipped when memory gets hot plugged because of
>>> 2M label.
>>>
>>> As a workaround, so we do not lose 1022M out of 8G of memory on arm64
>>> we have section size reduced to 128M. We are using this patch [1].
>>> This way we are losing 126M (which I still hate!)
>>>
>>> I would like to get rid of this workaround. First, because I would
>>> like us to switch to 64K pages to gain performance, and second so we
>>> do not depend on an unofficial patch which already has given us some
>>> headache with kdump support.
>>
>> I'd want to see 128M sections on arm64 with 64k base pages. "How?" you
>> might ask. One idea would be to switch from 512M THP to 2MB THP (using
>> cont pages), and instead implement 512MB gigantic pages. Then we can
>> reduce pageblock_order / MAX_ORDER - 1 and no longer have the section
>> limitations. Stuff for the future, though (if even ever).
>
> Interesting, but this is not something that would address the
> immediate issue. Because, even losing 126M is something I would like
> to fix. However, what other benefits reducing section size on arm64
> would bring? Do we have requirement where reducing section size is
> actually needed?

E.g., Memory hot(un)plug granularity/flexibility (DIMMs, virtio-mem in
the future) and handling large memory holes in a better way (e.g.,
avoiding custom pfn_valid(), not wasting memmap for memory holes).

Reducing pageblock_order / MAX_ORDER - 1 will have other benefits as well.

>
>>
>>>
>>> Here are some solutions that I think we can do:
>>>
>>> 1. Instead of carving the memory at 1G aligned address, do it at 1G -
>>> 2M address, this way when devdax is created it is perfectly 1G
>>> aligned. On ARM64 it causes a panic because there is a 2M hole in
>>> memory. Even if panic is fixed, I do not think this is a proper fix.
>>> This is simply a workaround to the underlying problem.
>>
>> I remember arm64 already has to deal with all different kinds of memory
>> holes (including huge ones). I don't think this should be a fundamental
>> issue.
>
> Perhaps not. I can root cause, and report here what actually happens.
>

Might be related to the broken custom pfn_valid() implementation for
ZONE_DEVICE.

https://lkml.kernel.org/r/[email protected]

And essentially ignoring sub-section data in there for now as well (but
might not be that relevant yet). In addition, this might also be related to

https://lkml.kernel.org/r/161058499000.1840162.702316708443239771.stgit@dwillia2-desk3.amr.corp.intel.com

>>
>> I think it might be a reasonable thing to do for such a special use
>> case. Does it work on x86-64?
>
> It does.

So eventually related to custom pfn_valid() + pfn_to_online_page().

[...]

>>> 3. Allow to hot plug daxdev together with the label, but teach the
>>> kernel not to touch label (i.e. allocate its memory). IMO, kind of
>>> ugly solution, because when devdax is hot-plugged it is not even aware
>>> of label size. But, perhaps that can be changed.
>>
>> I mean, we could teach add_memory() to "skip the first X pages" when
>> onlining/offlining, not exposing them to the buddy. Something similar we
>> already do with Oscars vmemmap-on-memory series.
>>
>> But I guess the issue is that the memmap for the label etc. is already
>> allocated? Is the label memremapped ZONE_DEVICE memory or what is it? Is
>> the label exposed in the resource tree?
>
> It is exposed:
>
> # ndctl create-namespace --mode raw -e namespace0.0 -f
> {
> "dev":"namespace0.0",
> "mode":"raw",
> "size":"2.00 GiB (2.15 GB)",
> "sector_size":512,
> "blockdev":"pmem0"
> }
>
> The raw device is exactly 2G
>
> # cat /proc/iomem | grep 'dax\|namespace'
> 980000000-9ffffffff : namespace0.0
>
> namespace0.0 is 2G, and there is dax0.0.
>
> Create devdax device:
> # ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
> {
> "dev":"namespace0.0",
> "mode":"devdax",
> "map":"mem",
> "size":"2046.00 MiB (2145.39 MB)",
> "uuid":"ed4d6a34-6a11-4ced-8a4f-b2487bddf5d7",
> "daxregion":{
> "id":0,
> "size":"2046.00 MiB (2145.39 MB)",
> "align":2097152,
> "devices":[
> {
> "chardev":"dax0.0",
> "size":"2046.00 MiB (2145.39 MB)",
> "mode":"devdax"
> }
> ]
> },
> "align":2097152
> }
>
> Now, the device is 2046M in size instead of 2G.
>
> root@dplat-cp22:/# cat /proc/iomem | grep 'namespace\|dax'
> 980000000-9801fffff : namespace0.0
> 980200000-9ffffffff : dax0.0
>
> We can see the namespace0.0 is 2M, which is label, and dax0.0 is 2046M.

Thanks, now I recall seeing this when playing with dax/kmem :)

Okay, so add_memory()/remove_memory() would have to deal with starting
at an offset of sub-sections within a section --- whereby the
remaining part of the section is either ZONE_DEVICE memory or
non-existent (read: not system RAM). Then we can just create/remove the
memory block devices and everything will be fine. In addition,
online_pages()/offline_pages() would have to be tweaked to skip over the
first X pages.

Not impossible, but I'd like to avoid such hacks if there are better
alternatives (especially, the trick in 1. sounds appealing to me; but
also trying to avoid the label sounds interesting).

>>
>> In case "it's just untouched/unexposed memory", it's fairly simple. In
>> case the label is exposed as ZONE_DEVICE already, it's more of an issue
>> and might require further tweaks.
>>
>>>
>>> 4. Other ideas? (move dax label to the end? a special case without a
>>> label? label outside of data?)
>>
>> What does the label include in your example? Sorry, I have no idea about
>> devdax labels.
>>
>> I read "ndctl-create-namespace" - "--no-autolabel: Manage labels for
>> legacy NVDIMMs that do not support labels". So I assume there is at
>> least some theoretical way to not have a label on the memory?
>
> Right, but I do not think it is possible to do for dax devices (as of
> right now). I assume, it contains information about what kind of
> device it is: devdax, fsdax, sector, uuid etc.
> See [1] namespaces tabel. It contains summary of pmem devices types,
> and which of them have label (all except for raw).

Interesting, I wonder if the label is really required to get this
special use case running. I mean, all you want is to have dax/kmem
expose the whole thing as system RAM. You don't want to lose even 2MB if
it's just for the sake of unnecessary metadata - this is not a real
device, it's "fake" already.

--
Thanks,

David / dhildenb

2021-01-28 01:42:16

by Pasha Tatashin

Subject: Re: dax alignment problem on arm64 (and other architectures)

On Wed, Jan 27, 2021 at 5:18 PM David Hildenbrand <[email protected]> wrote:
>
> >> Ordinary reboots or kexec-style reboots? I assume the latter, because
> >> otherwise there is no guarantee about persistence, right?
> >
> > Both, our firmware supports cold and warm reboot. When we do warm
> > reboot, memory content is not initialized. However, for performance
> > reasons, we mostly do kexec reboots.
> >
>
> One issue usually is that often firmware can allocate from available
> system RAM and/or modify/initialize it. I assume you're running some
> custom firmware :)

We have a special firmware that does not touch the last 2G of physical
memory for its allocations :)

>
> >>
> >> I remember for kexec-style reboots there is a different approach (using
> >> tmpfs) on the list.
> >
> > Right, we are using a similar approach to that tmpfs, but that tmpfs
> > approach was never upstreamed.
>
> I assume that people will follow up on that, because it's getting used
> for fast hypervisor reboots by some companies IIRC.

Yes, I am using the same pmem memory for hypervisor updates: convert
the pmem to fsdax and start VMs with their memory represented as files on
that device. During a hypervisor update, the VMs are suspended, but the
memory content is skipped during suspend, as it is simply DAX files.
We then reboot the hypervisor to the new version via kexec, or do a warm
reboot through firmware, and resume the suspended VMs. The whole update
takes less than 5 seconds.
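
A rough sketch of that flow (paths, sizes, and the QEMU invocation are
illustrative, not taken from our setup):

# convert the pmem namespace to fsdax and mount it with DAX enabled
ndctl create-namespace --mode fsdax -e namespace0.0 -f
mkfs.xfs /dev/pmem0
mount -o dax /dev/pmem0 /mnt/vmstate
# back guest RAM with a file on that mount, so suspend does not have to copy it
qemu-system-aarch64 ... \
    -object memory-backend-file,id=mem0,share=on,mem-path=/mnt/vmstate/vm1,size=4G \
    -numa node,memdev=mem0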

I think the issue that is raised in this discussion is different. For
a hypervisor update the memory never needs to be hot-plugged.

>
> >
> >>
> >>> hotplug that memory back, so we do not lose 2G of system memory
> >>> (machine is small, only 8G of RAM total).
> >>>
> >>> In order to hotplug pmem memory it first must be converted to devdax.
> >>> Devdax has a label 2M in size that is placed at the beginning of the
> >>> pmem device memory which brings the problem.
> >>>
> >>> The section size is a hotplugging unit on Linux. Whatever gets
> >>> hot-plugged or hot-removed must be section size aligned. On x86
> >>> section size is 128M on arm64 it is 1G (because arm64 supports 64K
> >>> pages, and 128M does not work with 64K pages). Because the first 2M
> >>
> >> Note that it's soon 128M with 4k and 16k base pages and 512MB with 64k.
> >> The arm64 patch for that is already queued.
> >
> > This is great. Do you have a pointer to that series? It means we can
> > get rid of our special section size workaround patch, and use the 128M
> > section size for 4K pages. However, we still can't move to 64K because
> > losing 510M is too much.
> >
>
> Sure
>
> https://lkml.kernel.org/r/[email protected]

Excellent, thank you!

>
> Personally, I think the future is 4k, especially for smaller machines.
> (also, imagine right now how many 512MB THP you can actually use in your
> 8GB VM ..., simply not suitable for small machines).

Um, this is not really about 512M THP. Yes, this is a smaller machine, but
performance is very important to us. The boot budget for the kernel is
under half a second. With 64K we save 0.2s: 0.35s vs 0.55s. This is
because fewer struct pages need to be initialized. Also, fewer TLB
misses and 3-level page tables add up as performance benefits.

For larger servers 64K pages make total sense: less memory is wasted on metadata.

>
> >>
> >>> are subtracted from the pmem device to create devdax, that actual
> >>> hot-pluggable memory is not 1G/128M aligned, and instead we lose 126M
> >>> on x86 or 1022M on arm64 of memory that is getting hot-plugged, the
> >>> whole first section is skipped when memory gets hot plugged because of
> >>> 2M label.
> >>>
> >>> As a workaround, so we do not lose 1022M out of 8G of memory on arm64
> >>> we have section size reduced to 128M. We are using this patch [1].
> >>> This way we are losing 126M (which I still hate!)
> >>>
> >>> I would like to get rid of this workaround. First, because I would
> >>> like us to switch to 64K pages to gain performance, and second so we
> >>> do not depend on an unofficial patch which already has given us some
> >>> headache with kdump support.
> >>
> >> I'd want to see 128M sections on arm64 with 64k base pages. "How?" you
> >> might ask. One idea would be to switch from 512M THP to 2MB THP (using
> >> cont pages), and instead implement 512MB gigantic pages. Then we can
> >> reduce pageblock_order / MAX_ORDER - 1 and no longer have the section
> >> limitations. Stuff for the future, though (if even ever).
> >
> > Interesting, but this is not something that would address the
> > immediate issue. Because, even losing 126M is something I would like
> > to fix. However, what other benefits reducing section size on arm64
> > would bring? Do we have requirement where reducing section size is
> > actually needed?
>
> E.g., Memory hot(un)plug granularity/flexibility (DIMMs, virtio-mem in
> the future) and handling large memory holes in a better way (e.g.,
> avoiding custom pfn_valid(), not wasting memmap for memory holes).
>
> Reducing pageblock_order / MAX_ORDER - 1 will have other benefits as well.

I see, it makes sense.

>
> >
> >>
> >>>
> >>> Here are some solutions that I think we can do:
> >>>
> >>> 1. Instead of carving the memory at 1G aligned address, do it at 1G -
> >>> 2M address, this way when devdax is created it is perfectly 1G
> >>> aligned. On ARM64 it causes a panic because there is a 2M hole in
> >>> memory. Even if panic is fixed, I do not think this is a proper fix.
> >>> This is simply a workaround to the underlying problem.
> >>
> >> I remember arm64 already has to deal with all different kinds of memory
> >> holes (including huge ones). I don't think this should be a fundamental
> >> issue.
> >
> > Perhaps not. I can root cause, and report here what actually happens.
> >
>
> Might be related to the broken custom pfn_valid() implementation for
> ZONE_DEVICE.
>
> https://lkml.kernel.org/r/[email protected]
>
> And essentially ignoring sub-section data in there for now as well (but
> might not be that relevant yet). In addition, this might also be related to
>
> https://lkml.kernel.org/r/161058499000.1840162.702316708443239771.stgit@dwillia2-desk3.amr.corp.intel.com

I will check it, and see what I find. I saw that panic almost a year
ago, things might have changed since then.

>
> >>
> >> I think it might be a reasonable thing to do for such a special use
> >> case. Does it work on x86-64?
> >
> > It does.
>
> So eventually related to custom pfn_valid() + pfn_to_online_page().
>
> [...]
>
> >>> 3. Allow to hot plug daxdev together with the label, but teach the
> >>> kernel not to touch label (i.e. allocate its memory). IMO, kind of
> >>> ugly solution, because when devdax is hot-plugged it is not even aware
> >>> of label size. But, perhaps that can be changed.
> >>
> >> I mean, we could teach add_memory() to "skip the first X pages" when
> >> onlining/offlining, not exposing them to the buddy. Something similar we
> >> already do with Oscars vmemmap-on-memory series.
> >>
> >> But I guess the issue is that the memmap for the label etc. is already
> >> allocated? Is the label memremapped ZONE_DEVICE memory or what is it? Is
> >> the label exposed in the resource tree?
> >
> > It is exposed:
> >
> > # ndctl create-namespace --mode raw -e namespace0.0 -f
> > {
> > "dev":"namespace0.0",
> > "mode":"raw",
> > "size":"2.00 GiB (2.15 GB)",
> > "sector_size":512,
> > "blockdev":"pmem0"
> > }
> >
> > The raw device is exactly 2G
> >
> > # cat /proc/iomem | grep 'dax\|namespace'
> > 980000000-9ffffffff : namespace0.0
> >
> > namespace0.0 is 2G, and there is dax0.0.
> >
> > Create devdax device:
> > # ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
> > {
> > "dev":"namespace0.0",
> > "mode":"devdax",
> > "map":"mem",
> > "size":"2046.00 MiB (2145.39 MB)",
> > "uuid":"ed4d6a34-6a11-4ced-8a4f-b2487bddf5d7",
> > "daxregion":{
> > "id":0,
> > "size":"2046.00 MiB (2145.39 MB)",
> > "align":2097152,
> > "devices":[
> > {
> > "chardev":"dax0.0",
> > "size":"2046.00 MiB (2145.39 MB)",
> > "mode":"devdax"
> > }
> > ]
> > },
> > "align":2097152
> > }
> >
> > Now, the device is 2046M in size instead of 2G.
> >
> > root@dplat-cp22:/# cat /proc/iomem | grep 'namespace\|dax'
> > 980000000-9801fffff : namespace0.0
> > 980200000-9ffffffff : dax0.0
> >
> > We can see the namespace0.0 is 2M, which is label, and dax0.0 is 2046M.
>
> Thanks, now I recall seeing this when playing with dax/kmem :)
>
> Okay, so add_memory()/remove_memory() would have to deal with starting
> with an offset of sub-sections within a section --- whereby all
> remaining part of the section is either ZONE_DEVICE memory or not
> existent (reading: not system RAM). Then we can just create/remove the
> memory block devices and everything will be fine. In addition
> online_pages()/offline_pages() would have to be tweaked to skip over the
> first X pages.
>
> Not impossible, but I'd like to avoid such hacks if there are better
> alternatives (especially, the trick in 1. sounds appealing to me; but
> also trying to avoid the label sounds interesting).
>
> >>
> >> In case "it's just untouched/unexposed memory", it's fairly simple. In
> >> case the label is exposed as ZONE_DEVICE already, it's more of an issue
> >> and might require further tweaks.
> >>
> >>>
> >>> 4. Other ideas? (move dax label to the end? a special case without a
> >>> label? label outside of data?)
> >>
> >> What does the label include in your example? Sorry, I have no idea about
> >> devdax labels.
> >>
> >> I read "ndctl-create-namespace" - "--no-autolabel: Manage labels for
> >> legacy NVDIMMs that do not support labels". So I assume there is at
> >> least some theoretical way to not have a label on the memory?
> >
> > Right, but I do not think it is possible to do for dax devices (as of
> > right now). I assume, it contains information about what kind of
> > device it is: devdax, fsdax, sector, uuid etc.
> > See [1] namespaces tabel. It contains summary of pmem devices types,
> > and which of them have label (all except for raw).
>
> Interesting, I wonder if the label is really required to get this
> special use case running. I mean, all you want is to have dax/kmem
> expose the whole thing as system RAM. You don't want to lose even 2MB if
> it's just for the sake of unnecessary metadata - this is not a real
> device, it's "fake" already.

Hm, would it not essentially mean allowing memory hotplug for raw
pmem devices? Something like create the memmap, and hot-add the raw pmem?

>
> --
> Thanks,
>
> David / dhildenb
>

2021-01-28 15:11:31

by David Hildenbrand

Subject: Re: dax alignment problem on arm64 (and other architectures)

>> One issue usually is that often firmware can allocate from available
>> system RAM and/or modify/initialize it. I assume you're running some
>> custom firmware :)
>
> We have a special firmware that does not touch the last 2G of physical
> memory for its allocations :)
>

Fancy :)

[...]

>> Personally, I think the future is 4k, especially for smaller machines.
>> (also, imagine right now how many 512MB THP you can actually use in your
>> 8GB VM ..., simply not suitable for small machines).
>
> Um, this is not really about 512THP. Yes, this is smaller machine, but
> performance is very important to us. Boot budget for the kernel is
> under half a second. With 64K we save 0.2s 0.35s vs 0.55s. This is
> because fewer struct pages need to be initialized. Also, fewer TLB
> misses, and 3-level page tables add up as performance benefits.
>
> For larger servers 64K pages make total sense: Less memory is wasted as metdata.

Yes, indeed, for very large servers it might make sense in that regard.
However, once we can eventually free vmemmap of hugetlbfs things could
change; assuming user space will be consuming huge pages (which large
machines better be doing ... databases, hypervisors ... ).

Also, some hypervisors try allocating the memmap completely ... but I
consider that rather a special case.

Personally, I consider being able to use THP/huge pages more important
than having 64k base pages and saving some TLB space there. Also, with
64k you have other drawbacks: for example, each stack and each TLS area for
threads in applications suddenly consumes 16 times more memory as a minimum.

Optimizing boot time/memmap initialization further is certainly an
interesting topic.

Anyhow, you know your use case best, just sharing my thoughts :)

[...]

>>>
>>> Right, but I do not think it is possible to do for dax devices (as of
>>> right now). I assume, it contains information about what kind of
>>> device it is: devdax, fsdax, sector, uuid etc.
>>> See [1] namespaces tabel. It contains summary of pmem devices types,
>>> and which of them have label (all except for raw).
>>
>> Interesting, I wonder if the label is really required to get this
>> special use case running. I mean, all you want is to have dax/kmem
>> expose the whole thing as system RAM. You don't want to lose even 2MB if
>> it's just for the sake of unnecessary metadata - this is not a real
>> device, it's "fake" already.
>
> Hm, would not it essentially mean allowing memory hot-plug for raw
> pmem devices? Something like create mmap, and hot-add raw pmem?

Theoretically yes, but I have no idea if that would make sense for real
"raw pmem" as well. Hope some of the pmem/nvdimm experts can clarify
what's possible and what's not :)


--
Thanks,

David / dhildenb

2021-01-29 02:10:29

by Pasha Tatashin

Subject: Re: dax alignment problem on arm64 (and other architectures)

> > Might be related to the broken custom pfn_valid() implementation for
> > ZONE_DEVICE.
> >
> > https://lkml.kernel.org/r/[email protected]
> >
> > And essentially ignoring sub-section data in there for now as well (but
> > might not be that relevant yet). In addition, this might also be related to
> >
> > https://lkml.kernel.org/r/161058499000.1840162.702316708443239771.stgit@dwillia2-desk3.amr.corp.intel.com
>
> I will check it, and see what I find. I saw that panic almost a year
> ago, things might have changed since then.

Hi David,

There is no panic anymore, but I also can't offset by 2M anymore; the
minimum that works now is 16M, and if the alignment is less than 16M,
creating the devdax device fails.

So, I tried the new ARM64 patch that reduces section sizes, and two
alignments for pmem: regular 2G alignment, and 2G+16M alignment.
(subtracted 16M from the bottom)

***** 4K page, 6G RAM, 2G PRAM *****
BOOT:
40000000-1bfffffff : System RAM
1c0000000-23fffffff : namespace0.0
DEVDAX:
40000000-1bfffffff : System RAM
1c0000000-1c21fffff : namespace0.0
1c2200000-23fffffff : dax0.0
HOTPLUG:
40000000-1bfffffff : System RAM
1c0000000-1c21fffff : namespace0.0
1c8000000-23fffffff : dax0.0
1c8000000-23fffffff : System RAM (kmem) 128M Wasted (Expected)

***** 4K page, 6G-16M RAM, 2G+16M PRAM *****
BOOT:
40000000-1beffffff : System RAM
1bf000000-23fffffff : namespace0.0
DEVDAX:
40000000-1beffffff : System RAM
1bf000000-1c11fffff : namespace0.0
1c1200000-23fffffff : dax0.0
HOTPLUG:
40000000-1beffffff : System RAM
1bf000000-1c11fffff : namespace0.0
1c8000000-23fffffff : dax0.0
1c8000000-23fffffff : System RAM (kmem) 144M Wasted (????)

***** 64K page, 6G RAM, 2G PRAM *****
BOOT:
40000000-1bfffffff : System RAM
1c0000000-23fffffff : namespace0.0
DEVDAX:
40000000-1bfffffff : System RAM
1c0000000-1dfffffff : namespace0.0
1e0000000-23fffffff : dax0.0
HOTPLUG:
40000000-1bfffffff : System RAM
1c0000000-1dfffffff : namespace0.0
1e0000000-23fffffff : dax0.0
1e0000000-23fffffff : System RAM (kmem) 512M Wasted (Expected)

***** 64K page, 6G-16M RAM, 2G+16M PRAM *****
BOOT:
40000000-1beffffff : System RAM
1bf000000-23fffffff : namespace0.0
DEVDAX:
40000000-1beffffff : System RAM
1bf000000-1bf3fffff : namespace0.0
1bf400000-23fffffff : dax0.0
HOTPLUG:
40000000-1beffffff : System RAM
1bf000000-1bf3fffff : namespace0.0
1c0000000-23fffffff : dax0.0
1c0000000-23fffffff : System RAM (kmem) 16M Wasted (Optimal)

In all three cases only the System RAM, namespace0.0, and dax0.0 entries
were printed from /proc/iomem.
BOOT: content of iomem right after boot
DEVDAX: content of iomem after the devdax is created:
  ndctl create-namespace --mode devdax -e namespace0.0
HOTPLUG: content of iomem after dax0.0 is hot-plugged:
  echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
  echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
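
(For reference, a sufficiently new daxctl should be able to do the HOTPLUG
step in one command, though that is not what was used above:
  daxctl reconfigure-device --mode=system-ram dax0.0 )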


The most surprising part is why, with 4K pages and a 16M offset, 144M is
wasted. For whatever reason, when the devdax is created, 34M goes wasted to
the label? Something is wrong here. However, I am happy with the 64K
pages result, where only 16M is wasted; of course, optimally we
should not be wasting any memory here, but it is still much better than
what we have now.

Pasha

2021-01-29 02:57:02

by Dan Williams

Subject: Re: dax alignment problem on arm64 (and other architectures)

On Wed, Jan 27, 2021 at 1:50 PM Pavel Tatashin
<[email protected]> wrote:
>
> On Wed, Jan 27, 2021 at 4:09 PM David Hildenbrand <[email protected]> wrote:
> >
> > On 27.01.21 21:43, Pavel Tatashin wrote:
> > > This is something that Dan Williams and I discussed off the mailing
> > > list sometime ago, but I want to have a broader discussion about this
> > > problem so I could send out a fix that would be acceptable.
> > >
> > > We have a 2G pmem device that is carved out of regular memory that we
> > > use to pass data across reboots. After the machine is rebooted we
> >
> > Ordinary reboots or kexec-style reboots? I assume the latter, because
> > otherwise there is no guarantee about persistence, right?
>
> Both, our firmware supports cold and warm reboot. When we do warm
> reboot, memory content is not initialized. However, for performance
> reasons, we mostly do kexec reboots.
>
> >
> > I remember for kexec-style reboots there is a different approach (using
> > tmpfs) on the list.
>
> Right, we are using a similar approach to that tmpfs, but that tmpfs
> approach was never upstreamed.
>
> >
> > > hotplug that memory back, so we do not lose 2G of system memory
> > > (machine is small, only 8G of RAM total).
> > >
> > > In order to hotplug pmem memory it first must be converted to devdax.
> > > Devdax has a label 2M in size that is placed at the beginning of the
> > > pmem device memory which brings the problem.
> > >
> > > The section size is a hotplugging unit on Linux. Whatever gets
> > > hot-plugged or hot-removed must be section size aligned. On x86
> > > section size is 128M on arm64 it is 1G (because arm64 supports 64K
> > > pages, and 128M does not work with 64K pages). Because the first 2M
> >
> > Note that it's soon 128M with 4k and 16k base pages and 512MB with 64k.
> > The arm64 patch for that is already queued.
>
> This is great. Do you have a pointer to that series? It means we can
> get rid of our special section size workaround patch, and use the 128M
> section size for 4K pages. However, we still can't move to 64K because
> losing 510M is too much.
>
> >
> > > are subtracted from the pmem device to create devdax, that actual
> > > hot-pluggable memory is not 1G/128M aligned, and instead we lose 126M
> > > on x86 or 1022M on arm64 of memory that is getting hot-plugged, the
> > > whole first section is skipped when memory gets hot plugged because of
> > > 2M label.
> > >
> > > As a workaround, so we do not lose 1022M out of 8G of memory on arm64
> > > we have section size reduced to 128M. We are using this patch [1].
> > > This way we are losing 126M (which I still hate!)
> > >
> > > I would like to get rid of this workaround. First, because I would
> > > like us to switch to 64K pages to gain performance, and second so we
> > > do not depend on an unofficial patch which already has given us some
> > > headache with kdump support.
> >
> > I'd want to see 128M sections on arm64 with 64k base pages. "How?" you
> > might ask. One idea would be to switch from 512M THP to 2MB THP (using
> > cont pages), and instead implement 512MB gigantic pages. Then we can
> > reduce pageblock_order / MAX_ORDER - 1 and no longer have the section
> > limitations. Stuff for the future, though (if even ever).
>
> Interesting, but this is not something that would address the
> immediate issue. Because, even losing 126M is something I would like
> to fix. However, what other benefits reducing section size on arm64
> would bring? Do we have requirement where reducing section size is
> actually needed?
>
> >
> > >
> > > Here are some solutions that I think we can do:
> > >
> > > 1. Instead of carving the memory at 1G aligned address, do it at 1G -
> > > 2M address, this way when devdax is created it is perfectly 1G
> > > aligned. On ARM64 it causes a panic because there is a 2M hole in
> > > memory. Even if panic is fixed, I do not think this is a proper fix.
> > > This is simply a workaround to the underlying problem.
> >
> > I remember arm64 already has to deal with all different kinds of memory
> > holes (including huge ones). I don't think this should be a fundamental
> > issue.
>
> Perhaps not. I can root cause, and report here what actually happens.
>
> >
> > I think it might be a reasonable thing to do for such a special use
> > case. Does it work on x86-64?
>
> It does.
>
> > > 2. Dan Williams introduced subsections [2]. They, however do not work
> > > with devdax, and hot-plugging in general. Those patches take care of
> > > __add_pages() side of things, and not add_memory(). Also, it is
> > > unclear what kind of user interface changes need to be made in order
> > > to enable subsection features to online/offline pages.
> >
> > I am absolutely no fan of teaching add_memory() and friends in general
> > about sub-sections.
> >
> > >
> > > 3. Allow to hot plug daxdev together with the label, but teach the
> > > kernel not to touch label (i.e. allocate its memory). IMO, kind of
> > > ugly solution, because when devdax is hot-plugged it is not even aware
> > > of label size. But, perhaps that can be changed.
> >
> > I mean, we could teach add_memory() to "skip the first X pages" when
> > onlining/offlining, not exposing them to the buddy. Something similar we
> > already do with Oscars vmemmap-on-memory series.
> >
> > But I guess the issue is that the memmap for the label etc. is already
> > allocated? Is the label memremapped ZONE_DEVICE memory or what is it? Is
> > the label exposed in the resource tree?
>
> It is exposed:
>
> # ndctl create-namespace --mode raw -e namespace0.0 -f

Since we last talked about this, the enabling for EFI "Special Purpose"
/ Soft Reserved Memory has gone upstream and instantiates device-dax
instances for address ranges marked with the EFI_MEMORY_SP attribute.
Critically, this way of declaring device-dax removes the consideration
of it as persistent memory and, as such, there is no metadata reservation.
So, if you are willing to maintain the metadata external to the device
(which seems reasonable for your environment) and have your platform
firmware / kernel command line mark it as EFI_CONVENTIONAL_MEMORY +
EFI_MEMORY_SP, then these reservation-free dax devices will surface.

See efi_fake_mem for how to apply that attribute to existing
EFI_CONVENTIONAL_MEMORY ranges; it requires CONFIG_EFI_SOFT_RESERVE=y.

The daxctl utility has grown mechanisms to subdivide such ranges.

daxctl create-device

...starting with v71.
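
For example, something along these lines (addresses, sizes, and device names
are illustrative and assume the 2G-at-8G layout from earlier in this thread):

# tag the range as EFI_CONVENTIONAL_MEMORY + EFI_MEMORY_SP (0x40000)
#     efi_fake_mem=2G@0x200000000:0x40000
# then carve a dax device out of the resulting soft-reserved region and,
# if desired, hand it to kmem
daxctl create-device -r 0 -s 2G
daxctl reconfigure-device --mode=system-ram dax0.0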



> {
> "dev":"namespace0.0",
> "mode":"raw",
> "size":"2.00 GiB (2.15 GB)",
> "sector_size":512,
> "blockdev":"pmem0"
> }
>
> The raw device is exactly 2G
>
> # cat /proc/iomem | grep 'dax\|namespace'
> 980000000-9ffffffff : namespace0.0
>
> namespace0.0 is 2G, and there is dax0.0.
>
> Create devdax device:
> # ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
> {
> "dev":"namespace0.0",
> "mode":"devdax",
> "map":"mem",
> "size":"2046.00 MiB (2145.39 MB)",
> "uuid":"ed4d6a34-6a11-4ced-8a4f-b2487bddf5d7",
> "daxregion":{
> "id":0,
> "size":"2046.00 MiB (2145.39 MB)",
> "align":2097152,
> "devices":[
> {
> "chardev":"dax0.0",
> "size":"2046.00 MiB (2145.39 MB)",
> "mode":"devdax"
> }
> ]
> },
> "align":2097152
> }
>
> Now, the device is 2046M in size instead of 2G.
>
> root@dplat-cp22:/# cat /proc/iomem | grep 'namespace\|dax'
> 980000000-9801fffff : namespace0.0
> 980200000-9ffffffff : dax0.0
>
> We can see the namespace0.0 is 2M, which is label, and dax0.0 is 2046M.
>
>
> >
> > In case "it's just untouched/unexposed memory", it's fairly simple. In
> > case the label is exposed as ZONE_DEVICE already, it's more of an issue
> > and might require further tweaks.
> >
> > >
> > > 4. Other ideas? (move dax label to the end? a special case without a
> > > label? label outside of data?)
> >
> > What does the label include in your example? Sorry, I have no idea about
> > devdax labels.
> >
> > I read "ndctl-create-namespace" - "--no-autolabel: Manage labels for
> > legacy NVDIMMs that do not support labels". So I assume there is at
> > least some theoretical way to not have a label on the memory?
>
> Right, but I do not think it is possible to do for dax devices (as of
> right now). I assume, it contains information about what kind of
> device it is: devdax, fsdax, sector, uuid etc.
> See [1] namespaces tabel. It contains summary of pmem devices types,
> and which of them have label (all except for raw).
>
> [1] https://nvdimm.wiki.kernel.org/
> >
> > >
> > > Thank you,
> > > Pasha
> > >
> > > [1] https://lore.kernel.org/lkml/[email protected]
> > > [2] https://lore.kernel.org/lkml/156092349300.979959.17603710711957735135.stgit@dwillia2-desk3.amr.corp.intel.com
> > >
> >
> >
> > --
> > Thanks,
> >
> > David / dhildenb
> >

2021-01-29 16:36:06

by Pasha Tatashin

Subject: Re: dax alignment problem on arm64 (and other architectures)

On Fri, Jan 29, 2021 at 8:19 AM David Hildenbrand <[email protected]> wrote:
>
> On 29.01.21 03:06, Pavel Tatashin wrote:
> >>> Might be related to the broken custom pfn_valid() implementation for
> >>> ZONE_DEVICE.
> >>>
> >>> https://lkml.kernel.org/r/[email protected]
> >>>
> >>> And essentially ignoring sub-section data in there for now as well (but
> >>> might not be that relevant yet). In addition, this might also be related to
> >>>
> >>> https://lkml.kernel.org/r/161058499000.1840162.702316708443239771.stgit@dwillia2-desk3.amr.corp.intel.com
> >>
> >> I will check it, and see what I find. I saw that panic almost a year
> >> ago, things might have changed since then.
> >
> > Hi David,
> >
> > There is no panic anymore, but I also can't offset by 2M anymore, the
> > minimum that works now is 16M, and if alignment is less than 16M
> > creating devdax device fails.
>
> I wonder why we get such different namespace sizes? Where do the
> differences come from? This looks very weird.
>
> >
> > So, I tried the new ARM64 patch that reduces section sizes, and two
> > alignments for pmem: regular 2G alignment, and 2G+16M alignment.
> > (subtracted 16M from the bottom)
> >
> > ***** 4K page, 6G RAM, 2G PRAM *****
> > BOOT:
> > 40000000-1bfffffff : System RAM
> > 1c0000000-23fffffff : namespace0.0
> > DEVDAX:
> > 40000000-1bfffffff : System RAM
> > 1c0000000-1c21fffff : namespace0.0
> > 1c2200000-23fffffff : dax0.0
> > HOTPLUG:
> > 40000000-1bfffffff : System RAM
> > 1c0000000-1c21fffff : namespace0.0
> > 1c8000000-23fffffff : dax0.0
> > 1c8000000-23fffffff : System RAM (kmem) 128M Wasted (Expected)
>
> The namespace spans 34MB??
>
> >
> > ***** 4K page, 6G-16M RAM, 2G+16M PRAM *****
> > BOOT:
> > 40000000-1beffffff : System RAM
> > 1bf000000-23fffffff : namespace0.0
> > DEVDAX:
> > 40000000-1beffffff : System RAM
> > 1bf000000-1c11fffff : namespace0.0
> > 1c1200000-23fffffff : dax0.0
> > HOTPLUG:
> > 40000000-1beffffff : System RAM
> > 1bf000000-1c11fffff : namespace0.0
> > 1c8000000-23fffffff : dax0.0
> > 1c8000000-23fffffff : System RAM (kmem) 144M Wasted (????)
>
> The namespace spans 34MB??

Right, this seems like a bug

>
> >
> > ***** 64K page, 6G RAM, 2G PRAM *****
> > BOOT:
> > 40000000-1bfffffff : System RAM
> > 1c0000000-23fffffff : namespace0.0
> > DEVDAX:
> > 40000000-1bfffffff : System RAM
> > 1c0000000-1dfffffff : namespace0.0
> > 1e0000000-23fffffff : dax0.0
> > HOTPLUG:
> > 40000000-1bfffffff : System RAM
> > 1c0000000-1dfffffff : namespace0.0
>
> The namespace spans 512MB ?!? What?

This is because section size is 512M with 64K pages.

>
> > 1e0000000-23fffffff : dax0.0
> > 1e0000000-23fffffff : System RAM (kmem) 512M Wasted (Expected)
> >
> > ***** 64K page, 6G-16M RAM, 2G+16M PRAM *****
> > BOOT:
> > 40000000-1beffffff : System RAM
> > 1bf000000-23fffffff : namespace0.0
> > DEVDAX:
> > 40000000-1beffffff : System RAM
> > 1bf000000-1bf3fffff : namespace0.0
> > 1bf400000-23fffffff : dax0.0
> > HOTPLUG:
> > 40000000-1beffffff : System RAM
> > 1bf000000-1bf3fffff : namespace0.0
>
> The namespace now consumes 4MB ?!?
>
> > 1c0000000-23fffffff : dax0.0
> > 1c0000000-23fffffff : System RAM (kmem) 16M Wasted (Optimal)
>
> Good :) I guess more optimal would be 2MB/0MB :)

Agree, but for the 16M offset this is optimal, because 16M is smaller
than the section size.

>
> >
> > In all three cases only System RAM, namespace0.0, and dax0.0 were
> > printed from /proc/iomem.
> > BOOT content of iomem right after boot
> > DEVDAX content of iomem after devdax is created
> > ndctl create-namespace --mode devdax -e namespace0.0"
> > HOTPLUG content of imem after dax0.0 is hotplugged:
> > echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> > echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> >
> >
> > The most surprising part is why with 4K pages and 16M offset 144M is
> > wasted? For whatever reason, when devdax is created 34 goes wasted to
> > the label? Something is wrong here.. However, I am happy with 64K
> > pages result, and that only 16M is wasted, of course optimally, we
> > should be using any memory here, but it is still much better than what
> > we have now.
>
> Definitely, but we should try figuring out what's going on here. I
> assume on x86-64 it behaves differently?

Yes, we should root-cause it. I highly suspect that an alignment
miscalculation somewhere causes this memory waste with the 16M offset.
I am also not sure why the 2M label size was increased, and why 16M is
now an alignment requirement.

I tested on x86 and got pretty much the same results as on arm64: a 2M
offset is not allowed anymore (16M is the minimum), and even with a 16M
offset, 144M is wasted. Here is the full QEMU command if anyone wants to
reproduce it:


KERNEL_PARAM='console=ttyS0 ip=dhcp'
KERNEL_PARAM+=' memmap=2G!8G'
#KERNEL_PARAM+=' memmap=2064M!8176M'

qemu-system-x86_64 \
        -m 8G -smp 1 \
        -machine q35 \
        -nographic \
        -enable-kvm \
        -kernel pmem/native/arch/x86/boot/bzImage \
        -initrd ../poky/build/tmp/deploy/images/qemux86-64/core-image-minimal-qemux86-64.cpio.gz \
        -chardev stdio,id=console,signal=off,mux=on \
        -mon chardev=console \
        -serial chardev:console \
        -netdev user,hostfwd=tcp::5000-:22,id=netdev0 \
        -device virtio-net-pci,netdev=netdev0 \
        -append "$KERNEL_PARAM"

Also, I am using current master branch tip for ndctl command:
root@qemux86-64:~# ndctl --version
71.2.gea014c0

***** 4K page, 6G RAM, 2G PRAM: kernel parameter memmap=2G!8G *****
BOOT:
100000000-1ffffffff : System RAM
200000000-27fffffff : Persistent Memory (legacy)
200000000-27fffffff : namespace0.0

DEVDAX:
100000000-1ffffffff : System RAM
200000000-27fffffff : Persistent Memory (legacy)
200000000-2021fffff : namespace0.0
202200000-27fffffff : dax0.0

HOTPLUG:
100000000-1ffffffff : System RAM
200000000-27fffffff : Persistent Memory (legacy)
200000000-2021fffff : namespace0.0
208000000-27fffffff : dax0.0
208000000-27fffffff : System RAM (kmem) (128M Wasted)

***** 4K page, 6G-16M RAM, 2G+16M PRAM: kernel parameter
memmap=2064M!8176M *****
BOOT:
100000000-1feffffff : System RAM
1ff000000-27fffffff : Persistent Memory (legacy)
1ff000000-27fffffff : namespace0.0

DEVDAX:
100000000-1feffffff : System RAM
1ff000000-27fffffff : Persistent Memory (legacy)
1ff000000-2011fffff : namespace0.0
201200000-27fffffff : dax0.0

HOTPLUG:
100000000-1feffffff : System RAM
1ff000000-27fffffff : Persistent Memory (legacy)
1ff000000-2011fffff : namespace0.0
208000000-27fffffff : dax0.0
208000000-27fffffff : System RAM (kmem) (144M Wasted)

The least amount of wasted memory I can get on x86 with this
experiment is with an offset that is larger than 34M and is 16M aligned,
i.e. 48M: memmap=2096M!8144M

root@qemux86-64:~# cat /proc/iomem | grep 'dax\|namespace\|System\|Pers'
100000000-1fcffffff : System RAM
1fd000000-27fffffff : Persistent Memory (legacy)
1fd000000-1ff1fffff : namespace0.0
200000000-27fffffff : dax0.0
200000000-27fffffff : System RAM (kmem) (48M Wasted)
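
To make the arithmetic behind these "wasted" numbers explicit, here is a
quick back-of-the-envelope sketch (my own illustration, not driver code):
everything between the pmem base and the first section-aligned address at
or after the end of the namespace metadata cannot be hotplugged.

#include <stdio.h>

#define SECTION_SIZE	(128UL << 20)	/* x86-64 memory section size */
#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
	/* memmap=2064M!8176M case: pmem at 8176M, 34M of label + vmemmap */
	unsigned long pmem = 0x1ff000000UL, meta = 34UL << 20;

	printf("%luM\n", (ALIGN_UP(pmem + meta, SECTION_SIZE) - pmem) >> 20);

	/* memmap=2096M!8144M case: pmem at 8144M, same 34M of metadata */
	pmem = 0x1fd000000UL;
	printf("%luM\n", (ALIGN_UP(pmem + meta, SECTION_SIZE) - pmem) >> 20);

	return 0;
}

That prints 144M and 48M, matching the two listings above.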

Pasha


>
> Thanks
>
>
> --
> Thanks,
>
> David / dhildenb
>

2021-01-29 16:39:07

by Pasha Tatashin

[permalink] [raw]
Subject: Re: dax alignment problem on arm64 (and other achitectures)

On Fri, Jan 29, 2021 at 9:51 AM Joao Martins <[email protected]> wrote:
>
> Hey Pavel,
>
> On 1/29/21 1:50 PM, Pavel Tatashin wrote:
> >> Since we last talked about this the enabling for EFI "Special Purpose"
> >> / Soft Reserved Memory has gone upstream and instantiates device-dax
> >> instances for address ranges marked with EFI_MEMORY_SP attribute.
> >> Critically this way of declaring device-dax removes the consideration
> >> of it as persistent memory and as such no metadata reservation. So, if
> >> you are willing to maintain the metadata external to the device (which
> >> seems reasonable for your environment) and have your platform firmware
> >> / kernel command line mark it as EFI_CONVENTIONAL_MEMORY +
> >> EFI_MEMORY_SP, then these reserve-free dax-devices will surface.
> >
> > Hi Dan,
> >
> > This is cool. Does it allow conversion between devdax and fsdax so DAX
> > aware filesystem can be installed and data can be put there to be
> > preserved across the reboot?
> >
>
> fwiw wrt to the 'preserved across kexec' part, you are going to need
> something conceptually similar to snippet below the scissors mark.
> Alternatively, we could fix kexec userspace to add conventional memory
> ranges (without the SP attribute part) when it sees a Soft-Reserved region.
> But can't tell which one is the right thing to do.

Hi Joao,

Is it not just a matter of appending arguments to the kernel command line
during the kexec reboot, with the Soft-Reserved region specified, or am I
missing something? I understand that with the kexec file-load syscall we
might accidentally load segments onto the reserved region, but with the
original kexec syscall, where we can specify destinations for each
segment, that should not be a problem with today's kexec tools.
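
Something along these lines (an untested sketch; the kernel and initrd
paths are made up, and the memmap= value is just the one from the
earlier experiment, or whatever parameter ends up declaring the region):

kexec -l /boot/vmlinuz --initrd=/boot/initrd.img \
      --append="console=ttyS0 memmap=2G!8G"
kexec -e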

I agree that preserving it automatically, as you are proposing, would
make more sense than fiddling with kernel parameters and segment
destinations.

Thank you,
Pasha

>
> At the moment, HMAT ranges (or those defined with efi_fake_mem=) aren't
> preserved not because of anything special with HMAT, but simply because
> the EFI memmap conventional ram ranges are not preserved (only runtime
> services). And HMAT/efi_fake_mem expects these to based on EFI memmap.
>
> ---------------->8------------------
>
> From: Joao Martins <[email protected]>
> Subject: x86/efi: add Conventional Memory ranges to runtime-map
>
> Through EFI/HMAT certain ranges are marked with Specific Purpose
> EFI attribute (EFI_MEMORY_SP). These ranges are usually
> specified in a memory descriptor of type Conventional Memory.
>
> We only ever expose regions to the runtime-map that were marked
> with efi_mem_reserve(). Currently these comprise the Runtime
> Data/Code and Boot data. Everything else gets lost, so on a kexec
> boot, if we had an HMAT (or efi_fake_mem= marked regions), the second
> (kexec'ed) kernel will lose this information and expose this memory
> as regular RAM.
>
> To address that, let's add the Conventional Memory ranges from the
> firmware EFI memory map to the runtime map. kexec then picks these up
> on kexec load. Specifically, we save the firmware memmap first, and
> restore its conventional ranges when we enter EFI virtual mode, which
> on x86 is the latest point where we filter the EFI memmap to construct
> one with only runtime services.
>
> Signed-off-by: Joao Martins <[email protected]>
> ---
> diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
> index 8a26e705cb06..c244da8b185d 100644
> --- a/arch/x86/platform/efi/efi.c
> +++ b/arch/x86/platform/efi/efi.c
> @@ -663,6 +663,53 @@ static bool should_map_region(efi_memory_desc_t *md)
> 	return false;
> }
>
> +static void __init efi_fw_memmap_restore(void **map, int left,
> +					  int *count, int *pg_shift)
> +{
> +	struct efi_memory_map_data *data = &efi_fw_memmap;
> +	void *fw_memmap, *new_memmap = *map;
> +	unsigned long desc_size;
> +	int i, nr_map;
> +
> +	if (!data->phys_map)
> +		return;
> +
> +	/* create new EFI memmap */
> +	fw_memmap = early_memremap(data->phys_map, data->size);
> +	if (!fw_memmap) {
> +		return;
> +	}
> +
> +	desc_size = data->desc_size;
> +	nr_map = data->size / desc_size;
> +
> +	for (i = 0; i < nr_map; i++) {
> +		efi_memory_desc_t *md = efi_early_memdesc_ptr(fw_memmap,
> +							      desc_size, i);
> +
> +		if (md->type != EFI_CONVENTIONAL_MEMORY)
> +			continue;
> +
> +		if (left < desc_size) {
> +			new_memmap = realloc_pages(new_memmap, *pg_shift);
> +			if (!new_memmap) {
> +				early_memunmap(fw_memmap, data->size);
> +				return;
> +			}
> +
> +			left += PAGE_SIZE << *pg_shift;
> +			(*pg_shift)++;
> +		}
> +
> +		memcpy(new_memmap + (*count * desc_size), md, desc_size);
> +
> +		left -= desc_size;
> +		(*count)++;
> +	}
> +
> +	early_memunmap(fw_memmap, data->size);
> +}
> +
> /*
>  * Map the efi memory ranges of the runtime services and update new_mmap with
>  * virtual addresses.
>  */
> @@ -700,6 +747,8 @@ static void * __init efi_map_regions(int *count, int *pg_shift)
> 		(*count)++;
> 	}
>
> +	efi_fw_memmap_restore(&new_memmap, left, count, pg_shift);
> +
> 	return new_memmap;
>
> diff --git a/drivers/firmware/efi/fake_mem.c b/drivers/firmware/efi/fake_mem.c
> index 6e0f34a38171..5fd075503764 100644
> --- a/drivers/firmware/efi/fake_mem.c
> +++ b/drivers/firmware/efi/fake_mem.c
> @@ -19,9 +19,30 @@
> #include <linux/sort.h>
> #include "fake_mem.h"
>
> +struct efi_memory_map_data efi_fw_memmap;
> struct efi_mem_range efi_fake_mems[EFI_MAX_FAKEMEM];
> int nr_fake_mem;
>
> +static void __init efi_fw_memmap_save(void)
> +{
> +	struct efi_memory_map_data *data = &efi_fw_memmap;
> +	int new_nr_map = efi.memmap.nr_map;
> +	void *new_memmap;
> +
> +	if (efi_memmap_alloc(new_nr_map, data) != 0)
> +		return;
> +
> +	new_memmap = early_memremap(data->phys_map, data->size);
> +	if (!new_memmap) {
> +		__efi_memmap_free(data->phys_map, data->size, data->flags);
> +		return;
> +	}
> +
> +	efi_runtime_map_copy(new_memmap, data->size);
> +
> +	early_memunmap(new_memmap, data->size);
> +}
> +
> static int __init cmp_fake_mem(const void *x1, const void *x2)
> {
> 	const struct efi_mem_range *m1 = x1;
> @@ -68,7 +89,12 @@ void __init efi_fake_memmap(void)
> {
> 	int i;
>
> -	if (!efi_enabled(EFI_MEMMAP) || !nr_fake_mem)
> +	if (!efi_enabled(EFI_MEMMAP))
> +		return;
> +
> +	efi_fw_memmap_save();
> +
> +	if (!nr_fake_mem)
> 		return;
>
> 	for (i = 0; i < nr_fake_mem; i++)
> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index 8710f5710c1d..72803b1a7a39 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -1280,4 +1280,6 @@ static inline struct efi_mokvar_table_entry *efi_mokvar_entry_find(
> }
> #endif
>
> +extern struct efi_memory_map_data efi_fw_memmap;
> +
> #endif /* _LINUX_EFI_H */

2021-01-29 17:33:52

by Joao Martins

[permalink] [raw]
Subject: Re: dax alignment problem on arm64 (and other achitectures)



On 1/29/21 4:32 PM, Pavel Tatashin wrote:
> On Fri, Jan 29, 2021 at 9:51 AM Joao Martins <[email protected]> wrote:
>>
>> Hey Pavel,
>>
>> On 1/29/21 1:50 PM, Pavel Tatashin wrote:
>>>> Since we last talked about this the enabling for EFI "Special Purpose"
>>>> / Soft Reserved Memory has gone upstream and instantiates device-dax
>>>> instances for address ranges marked with EFI_MEMORY_SP attribute.
>>>> Critically this way of declaring device-dax removes the consideration
>>>> of it as persistent memory and as such no metadata reservation. So, if
>>>> you are willing to maintain the metadata external to the device (which
>>>> seems reasonable for your environment) and have your platform firmware
>>>> / kernel command line mark it as EFI_CONVENTIONAL_MEMORY +
>>>> EFI_MEMORY_SP, then these reserve-free dax-devices will surface.
>>>
>>> Hi Dan,
>>>
>>> This is cool. Does it allow conversion between devdax and fsdax so DAX
>>> aware filesystem can be installed and data can be put there to be
>>> preserved across the reboot?
>>>
>>
>> fwiw wrt to the 'preserved across kexec' part, you are going to need
>> something conceptually similar to snippet below the scissors mark.
>> Alternatively, we could fix kexec userspace to add conventional memory
>> ranges (without the SP attribute part) when it sees a Soft-Reserved region.
>> But can't tell which one is the right thing to do.
>
> Hi Joao,
>
> Is not it just a matter of appending arguments to the kernel parameter
> during kexec reboot with Soft-Reserved region specified, or am I
> missing something? I understand with fileload kexec syscall we might
> accidently load segments onto reserved region, but with the original
> kexec syscall, where we can specify destinations for each segment that
> should not be a problem with today's kexec tools.
>
efi_fake_mem only works with EFI memmap conventional memory ranges, so
not having an EFI memmap with RAM ranges means it's a nop for the
soft-reserved regions. Unless you are trying to suggest something like:

memmap=<start>%<size>+0xefffffff

... to mark soft-reserved on top of existing RAM? Sadly, I don't know if
there's an equivalent for ARM.
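
(For concreteness, with the 2G pmem carved out at 8G as earlier in this
thread, my untested reading of the e820 override syntax documented in
kernel-parameters.txt, memmap=<size>%<offset>-<oldtype>+<newtype>, would
be something like:

memmap=2G%8G+0xefffffff

where 0xefffffff is the soft-reserved e820 type.)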


> I agree that preserving it automatically as you are proposing, would
> make more sense, instead of fiddling with kernel parameters and
> segment destinations.
>
> Thank you,
> Pasha
>
>>
>> At the moment, HMAT ranges (or those defined with efi_fake_mem=) aren't
>> preserved not because of anything special with HMAT, but simply because
>> the EFI memmap conventional ram ranges are not preserved (only runtime
>> services). And HMAT/efi_fake_mem expects these to based on EFI memmap.
>>

[snip]

2021-01-29 19:12:53

by Pasha Tatashin

[permalink] [raw]
Subject: Re: dax alignment problem on arm64 (and other achitectures)

> > Definitely, but we should try figuring out what's going on here. I
> > assume on x86-64 it behaves differently?
>
> Yes, we should root cause. I highly suspect that there is somewhere
> alignment miscalculations happen that cause this memory waste with the
> offset 16M. I am also not sure why the 2M label size was increased,
> and why 16M is now an alignment requirement.

This appears to be because, even if we set the vmemmap to be outside of
the dax device, the alignment code calculates the maximum size of the
vmemmap for this device and subtracts it from the devdax size.
See [1]; line 795 is where this offset is calculated.

This also explains why the 16M offset worked with 64K pages: fewer
struct pages need to fit within 16M minus the label size.

[1] https://soleen.com/source/xref/linux/drivers/nvdimm/pfn_devs.c?r=b7b3c01b&mo=18459&fi=718#795
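
Back-of-the-envelope, that also matches the 34M namespace0.0 reservation
in the iomem output earlier (assuming a 64-byte struct page and 2M
alignment of the resulting offset; this is my own sketch, not the driver
code):

#include <stdio.h>

#define SZ_8K	(8UL << 10)
#define SZ_2M	(2UL << 20)
#define SZ_2G	(2UL << 30)
#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
	unsigned long npfns = SZ_2G / 4096;	/* 2G namespace, 4K pages */
	unsigned long memmap = npfns * 64;	/* 32M of struct pages */
	unsigned long offset = ALIGN_UP(SZ_8K + memmap, SZ_2M);

	printf("vmemmap %luM, offset %luM\n", memmap >> 20, offset >> 20);
	return 0;
}

which prints "vmemmap 32M, offset 34M".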

2021-01-29 19:18:11

by Pasha Tatashin

[permalink] [raw]
Subject: Re: dax alignment problem on arm64 (and other achitectures)

On Fri, Jan 29, 2021 at 2:06 PM Pavel Tatashin
<[email protected]> wrote:
>
> > > Definitely, but we should try figuring out what's going on here. I
> > > assume on x86-64 it behaves differently?
> >
> > Yes, we should root cause. I highly suspect that there is somewhere
> > alignment miscalculations happen that cause this memory waste with the
> > offset 16M. I am also not sure why the 2M label size was increased,
> > and why 16M is now an alignment requirement.
>
> This appears to be because even if we set vmemmap to be outside of the
> dax device, the alignment calculates the maximum size of vmemmap for
> this device, and subtracts it from the devdax size.
> See [1], line 795 is where this offset is calculated.
>
> This also explains why with 64K pages, the 16M offset worked: because
> fewer struct pages were able to fit within 16M - label size.
>
> [1] https://soleen.com/source/xref/linux/drivers/nvdimm/pfn_devs.c?r=b7b3c01b&mo=18459&fi=718#795

Actually, strike the previous e-mail. The extra space appears when we
reserve the vmemmap from the devdax device itself. If we reserve it from
regular memory instead, the extra space is not added. Now this alignment
makes total sense.
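
(In other words, the reservation shows up with --map=dev, where the
struct pages live in the pmem itself. Something like the following
should keep them in regular RAM instead, at the cost of roughly 16M of
RAM per 1G of namespace with 4K pages; a sketch only, please double-check
the ndctl options:)

ndctl create-namespace --mode devdax --map=mem -e namespace0.0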

Pasha

2021-01-29 19:46:22

by Pasha Tatashin

[permalink] [raw]
Subject: Re: dax alignment problem on arm64 (and other achitectures)

On Fri, Jan 29, 2021 at 2:12 PM Pavel Tatashin
<[email protected]> wrote:
>
> On Fri, Jan 29, 2021 at 2:06 PM Pavel Tatashin
> <[email protected]> wrote:
> >
> > > > Definitely, but we should try figuring out what's going on here. I
> > > > assume on x86-64 it behaves differently?
> > >
> > > Yes, we should root cause. I highly suspect that there is somewhere
> > > alignment miscalculations happen that cause this memory waste with the
> > > offset 16M. I am also not sure why the 2M label size was increased,
> > > and why 16M is now an alignment requirement.
> >
> > This appears to be because even if we set vmemmap to be outside of the
> > dax device, the alignment calculates the maximum size of vmemmap for
> > this device, and subtracts it from the devdax size.
> > See [1], line 795 is where this offset is calculated.
> >
> > This also explains why with 64K pages, the 16M offset worked: because
> > fewer struct pages were able to fit within 16M - label size.
> >
> > [1] https://soleen.com/source/xref/linux/drivers/nvdimm/pfn_devs.c?r=b7b3c01b&mo=18459&fi=718#795
>
> Actually, strike the previous e-mail. The extra space is when we
> reserve vmemmap from devdax. IFF we do it from mem, the extra space is
> not added. Now, this alignment makes total sense.

commit 2522afb86a8cceba0f67dbf05772d21b76d79f06
Author: Dan Williams <[email protected]>
Date: Thu Jan 30 12:06:23 2020 -0800

libnvdimm/region: Introduce an 'align' attribute


This is the patch that introduced the 16M alignment.

/*
 * PowerPC requires this alignment for memremap_pages(). All other archs
 * should be ok with SUBSECTION_SIZE (see memremap_compat_align()).
 */
#define MEMREMAP_COMPAT_ALIGN_MAX SZ_16M

static unsigned long default_align(struct nd_region *nd_region)
{
	unsigned long align;
	int i, mappings;
	u32 remainder;

	if (is_nd_blk(&nd_region->dev))
		align = PAGE_SIZE;
	else
		align = MEMREMAP_COMPAT_ALIGN_MAX;

Dan, is this logic correct? Why can't the is_nd_pmem() case use
SUBSECTION_SIZE alignment?
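
(For reference, the subsection constants I have in mind, as I read them
in include/linux/mmzone.h; paraphrasing from memory:

#define SUBSECTION_SHIFT 21
#define SUBSECTION_SIZE (1UL << SUBSECTION_SHIFT)	/* 2M */

so a SUBSECTION_SIZE default would bring the alignment back down to 2M.)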

Thank you,
Pasha

>
> Pasha

2021-01-29 20:30:17

by Dan Williams

[permalink] [raw]
Subject: Re: dax alignment problem on arm64 (and other achitectures)

On Fri, Jan 29, 2021 at 5:51 AM Pavel Tatashin
<[email protected]> wrote:
>
> > Since we last talked about this the enabling for EFI "Special Purpose"
> > / Soft Reserved Memory has gone upstream and instantiates device-dax
> > instances for address ranges marked with EFI_MEMORY_SP attribute.
> > Critically this way of declaring device-dax removes the consideration
> > of it as persistent memory and as such no metadata reservation. So, if
> > you are willing to maintain the metadata external to the device (which
> > seems reasonable for your environment) and have your platform firmware
> > / kernel command line mark it as EFI_CONVENTIONAL_MEMORY +
> > EFI_MEMORY_SP, then these reserve-free dax-devices will surface.
>
> Hi Dan,
>
> This is cool. Does it allow conversion between devdax and fsdax so DAX
> aware filesystem can be installed and data can be put there to be
> preserved across the reboot?
>

It does not because it's not "pmem" by this designation.

Instead, if you want fsdax, zero metadata on the device, and the
ability to switch from fsdax to devdax, I think that could be achieved
with a new sysfs attribute at the region-device level. Currently the
mode of a namespace with no metadata on it defaults to "raw" mode,
where "raw" treats the pmem as a persistent memory block device with
no DAX capability. There's no reason the default could not instead be
devdax with pages mapped.

Something like:
ndctl disable-region region0
echo 1 > /sys/bus/nd/devices/region0/pagemap
echo devdax > /sys/bus/nd/devices/region0/raw_default
ndctl enable-region region0

...where the new pagemap attribute does set_bit(ND_REGION_PAGEMAP,
&nd_region->flags), and raw_default arranges for the namespace to be
shunted over to devdax.