LinuxLists.cc - Re: [RFC PATCH 3/3] mm/migrate: Create move_phys

2023-09-19 18:02:53

Subject: Re: [RFC PATCH 3/3] mm/migrate: Create move_phys_pages syscall

On Tue, Sep 19, 2023, at 9:31 AM, Gregory Price wrote:
> On Mon, Sep 18, 2023 at 08:34:16PM -0700, Andy Lutomirski wrote:
>>
>>
>> On Sun, Sep 10, 2023, at 4:49 AM, Gregory Price wrote:
>> > On Sun, Sep 10, 2023 at 02:36:40PM -0600, Jonathan Corbet wrote:
>> >>
>> >> So this is probably a silly question, but just to be sure ... what is
>> >> the permission model for this system call? As far as I can tell, the
>> >> ability to move pages is entirely unrestricted, with the exception of
>> >> pages that would need MPOL_MF_MOVE_ALL. If so, that seems undesirable,
>> >> but probably I'm just missing something ... ?
>> >>
>> >> Thanks,
>> >>
>> >> jon
>> >
>> > Not silly. It looks like when I dropped the CAP_SYS_NICE check (no task to
>> > check against), I neglected to add a CAP_SYS_ADMIN check.
>>
>> Global, I presume?
>>
>> I have to admit that I don’t think this patch set makes sense at all.
>>
>> As I understand it, there are two kinds of physical memory resource in CXL: those that live on a device and those that live in host memory.
>>
>> Device memory doesn’t migrate as such: if a page is on an accelerator, it’s on that accelerator. (If someone makes an accelerator with *two* PCIe targets and connects each target to a different node, that’s a different story.)
>>
>> Host memory is host memory. CXL may access it, and the CXL access from a given device may be faster if that device is connected closer to the memory. And the device may or may not know the virtual address and PASID of the memory.
>>
>
> The CXL memory description here is a bit inaccurate. Memory on the CXL
> bus is not limited to host and accelerator; CXL memory devices may also
> present memory for use by the system as if it were just DRAM.
> The accessing mechanisms are the same (i.e. you can 'mov rax, [rbx]'
> and the result is a cacheline fetch that goes over the cxl bus rather
> than the DRAM memory controllers).
>
> Small CXL background for the sake of clarity:
>
> type-2 devices are "accelerators", and the memory relationships you
> describe here are roughly accurate. The intent of this interface is not
> really for the purpose of managing type-2/accelerator device memory.
>
> type-3 devices are "memory devices", whose intent is to provide the
> system additional memory resources that get mapped into one or more numa
> nodes. The intent of these devices is to present memory to the kernel
> *as-if* it were regular old DRAM just with different latency and
> bandwidth attributes. This is a simplification of the overall goal.
>
>
> So from the perspective of the kernel and a memory-tiering system, we
> have numa nodes which abstract physical memory, and that physical memory
> may actually live anywhere (DRAM, CXL, wherever). This memory is
> fungible, with the exception that CXL memory should be placed in
> ZONE_MOVABLE to ensure the hot-pluggability of those memory devices.
>
>
> The intent of this interface is to make page-migration decisions without
> the need to track individual processes or virtual address mappings.
>
> One example would be to utilize the idle page tracking mechanism from
> userland to make migration decisions.
>
> https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html
>
> This mechanism allows a user to determine which PFNs are idle. Combine
> this information with a move_phys_pages syscall and you can implement
> demotion/promotion in user-land without having to identify the virtual
> address mappings of those PFNs in user-land.
>
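As a rough sketch of the userland half of this idea, the snippet below decodes a chunk of the idle page bitmap into a list of idle PFNs. The bitmap layout (an array of 8-byte words, with PFN i at bit i % 64 of word i / 64, native byte order) follows the idle page tracking documentation linked above. The function names idle_pfns and read_idle_chunk are made up for illustration, and reading the real file needs root and a kernel built with idle page tracking.

```python
import sys

WORD_BYTES = 8       # the bitmap is an array of 8-byte integers
BITS_PER_WORD = 64   # each word covers 64 consecutive PFNs
BITMAP_PATH = "/sys/kernel/mm/page_idle/bitmap"


def idle_pfns(bitmap, first_pfn=0, byteorder=sys.byteorder):
    """Return the PFNs whose idle bit is set in a chunk of the bitmap.

    bitmap    -- raw bytes read from the bitmap file (multiple of 8 bytes)
    first_pfn -- PFN covered by bit 0 of the first word (multiple of 64)
    """
    if len(bitmap) % WORD_BYTES or first_pfn % BITS_PER_WORD:
        raise ValueError("chunk must be aligned to whole 8-byte words")
    pfns = []
    for offset in range(0, len(bitmap), WORD_BYTES):
        word = int.from_bytes(bitmap[offset:offset + WORD_BYTES], byteorder)
        base = first_pfn + (offset // WORD_BYTES) * BITS_PER_WORD
        while word:
            lowest = word & -word                 # isolate the lowest set bit
            pfns.append(base + lowest.bit_length() - 1)
            word ^= lowest                        # clear it and continue
    return pfns


def read_idle_chunk(first_pfn, n_words, path=BITMAP_PATH):
    """Read n_words 8-byte words starting at the word that holds first_pfn."""
    with open(path, "rb", buffering=0) as f:
        f.seek(first_pfn // BITS_PER_WORD * WORD_BYTES)
        return f.read(n_words * WORD_BYTES)
```

The resulting PFN list is what a policy daemon would hand to a physical-address migration call. That call is not shown here because its interface is still under review in this thread.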
>
>> I fully believe that there’s some use for migrating host memory to a node that's closer to a device. But I don't think this API is the right way. First, something needs to figure out that the host memory should be migrated. Doing this presumably involves identifying which (logical!) memory is being accessed and deciding to move it. Maybe new APIs are needed to enable this.
>>
>
> The intent is not to migrate memory to make it "closer to a device",
> assuming you mean the intent is to make that data closer to a device
> that is using it (i.e. an accelerator).
>
> The intent is to allow migration of memory based on a user-defined
> policy via the usage of physical addresses.
>
> Let's consider a bandwidth-expansion focused tiering policy. Each
> additional CXL Type-3 Memory device provides additional memory
> bandwidth to a processor via its pcie/cxl lanes.
>
> If all you care about is latency, moving/migrating pages closer to the
> processor is beneficial. However, if you care about maximizing
> bandwidth, distributing memory across all possible devices with some
> statistical distribution is a better goal.
>
> So sometimes you actually want hot data further away because it allows
> for more concurrent cacheline fetches to occur.
>
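To make the bandwidth-oriented goal concrete, here is a minimal sketch of one possible "statistical distribution": pages are assigned to nodes in proportion to a per-node weight, for example relative bandwidth. The function name weighted_interleave, the weights, and the node numbers are hypothetical; nothing in the patch set prescribes this particular policy.

```python
from bisect import bisect_right
from itertools import accumulate


def weighted_interleave(n_pages, weights):
    """Assign page indexes 0..n_pages-1 to nodes in proportion to weights.

    weights -- dict mapping node id to a positive integer weight
    Returns a list where entry i is the node chosen for page i.
    """
    if not weights or any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be a non-empty set of positive values")
    nodes = sorted(weights)
    # Cumulative upper bounds of each node's slice of one weight cycle.
    bounds = list(accumulate(weights[n] for n in nodes))
    total = bounds[-1]
    return [nodes[bisect_right(bounds, i % total)] for i in range(n_pages)]
```

With weights of 3 for a DRAM node and 1 for a CXL node, three out of every four pages land on DRAM and one on CXL, so both sets of lanes carry traffic instead of only the lowest-latency one.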
>
> The question as to whether getting the logical memory address is
> required, useful, or performant depends on what sources of information
> you can pull physical address information from.
>
> Above I explained idle page tracking, but another example would be the
> CXL device directly, which knows 2 pieces of information (generally):
>
> 1) The extent of the memory it is hosting (some size)
> 2) The physical-to-device address mapping for the system contacting it.
>
> The device talks (internally) in 0-based addressing (0x0 up to 0x...),
> but the host places the host physical address (HPA) on the bus
> (0x123450000). The device receives and converts 0x123450000 (HPA) into
> a 0-base address (device-physical-address, DPA).
>
> How does this relate to this interface?
>
> Consider a device which provides a "heat-map" for the memory it is
> hosting. If a user or system requests this heat-map, the device can
> only provide that information in terms of either HPA or DPA. If DPA,
> then the host can recover the HPA by simply looking at the mapping it
> programmed the device with. This reverse-transaction (DPA-to-HPA) is
> relatively inexpensive.
>
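The address translation described above can be sketched as follows, under a strong simplifying assumption: the device is mapped at a single contiguous host window with no interleaving, so HPA and DPA differ only by the window base. Real CXL decoders can interleave across devices, which this sketch does not model. The Window class, the base address 0x100000000, and the heat-map format (a dict of DPA to access count) are all hypothetical.

```python
PAGE_SHIFT = 12  # assumes 4 KiB pages


class Window:
    """One contiguous host-physical window backed by a single device."""

    def __init__(self, hpa_base, size):
        self.hpa_base = hpa_base
        self.size = size

    def hpa_to_dpa(self, hpa):
        """What the device does: turn a host address into a 0-based one."""
        if not self.hpa_base <= hpa < self.hpa_base + self.size:
            raise ValueError("HPA is outside this device's window")
        return hpa - self.hpa_base

    def dpa_to_hpa(self, dpa):
        """What the host does: recover the HPA from the mapping it programmed."""
        if not 0 <= dpa < self.size:
            raise ValueError("DPA is outside this device's extent")
        return self.hpa_base + dpa


def heatmap_to_pfns(window, dpa_heat):
    """Fold a DPA-keyed heat-map into per-PFN access counts."""
    pfn_heat = {}
    for dpa, count in dpa_heat.items():
        pfn = window.dpa_to_hpa(dpa) >> PAGE_SHIFT
        pfn_heat[pfn] = pfn_heat.get(pfn, 0) + count
    return pfn_heat
```

Each translation is one bounds check and one addition, which is why the DPA-to-HPA step is cheap compared with anything that has to consult page tables.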
> The idle-page tracking interface is actually a good example of this. It
> is functionally a heat-map for the entire system.
>
> However, things get extraordinarily expensive after this. HPA to host
> virtual address translation (HPA to HVA) requires inspecting every task
> that may map that HPA in its page tables. When the cacheline fetch hits
> the bus, you are well below the construct of a "task", and the device
> has no way of telling you what task is using memory on that device.
>
> This makes any kind of tiering operation based on this type of heat-map
> information somewhat of a non-starter. You steal so much performance
> just converting that information into task-specific information, that
> you may as well not bother doing it.
>
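A toy model of that cost, as seen from userland: to find who maps one PFN, every entry of every task's page mapping has to be inspected. The entry layout used here (bit 63 for "present", bits 0-54 for the PFN) matches /proc/pid/pagemap, but the tasks dict stands in for the real files, the 4 KiB page size is an assumption, and the kernel's own reverse-mapping structures are not modeled at all.

```python
PM_PRESENT = 1 << 63          # pagemap bit 63: page present in RAM
PM_PFN_MASK = (1 << 55) - 1   # pagemap bits 0-54: PFN when present
PAGE_SHIFT = 12               # assumes 4 KiB pages


def entry_pfn(entry):
    """Return the PFN stored in a pagemap-style entry, or None if not present."""
    return entry & PM_PFN_MASK if entry & PM_PRESENT else None


def find_mappings(target_pfn, tasks):
    """Scan every task for virtual addresses backed by target_pfn.

    tasks -- dict: pid -> {virtual page number: pagemap-style entry}
    Returns (list of (pid, virtual address), number of entries inspected).
    """
    hits = []
    inspected = 0
    for pid, table in tasks.items():
        for vpn, entry in table.items():
            inspected += 1
            if entry_pfn(entry) == target_pfn:
                hits.append((pid, vpn << PAGE_SHIFT))
    return hits, inspected
```

The inspected count grows with the total number of mapped pages on the system, not with the number of hot PFNs, which is the overhead a physical-address interface is meant to avoid.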
> Instead, this interface would allow for a tiering policy to operate on
> such heat-map information directly, and since all CXL memory is intended
> to be placed in ZONE_MOVABLE, that memory should always be migratable.
>
>> But this API is IMO rather silly. Just as a trivial observation, if you migrate a page you identify by physical address, *that physical address changes*. So the only way it possibly works is that whatever heuristic is using the API knows to invalidate itself after calling the API, but of course it also needs to invalidate itself if the kernel becomes intelligent enough to migrate the page on its own or the owner of the logical page triggers migration, etc.
>>
>> Put differently, the operation "migrate physical page 0xABCD000 to node 3" makes no sense. That physical address belongs to whatever node it's on, and without some magic hardware support that does not currently exist, it's not going anywhere at runtime.
>>
>> I just don't see this code working well, never mind the security issues.
>
> I think this is more of a terminology issue. I'm not married to the
> name, but to me move_phys_pages is intuitively easier to understand
> because move_pages exists and the only difference between the two
> interfaces is virtual vs physical addressing.
>
> move_pages doesn't "migrate a virtual page" either, it "migrates the
> data pointed to by this virtual address to another physical page located
> on the target numa node".
>
> Likewise this interface "migrates the data located at the physical address,
> assuming the physical address is mapped, to another page on the target numa
> node".

I'm not complaining about the name. I'm objecting to the semantics.

Apparently you have a system to collect usage statistics of physical addresses, but you have no idea what those pages map to (without crawling /proc or /sys, anyway). But that means you have no idea when the logical contents of those pages *change*. So you fundamentally have a nasty race: anything else that swaps or migrates those pages will mess up your statistics, and you'll start trying to migrate the wrong thing.

2023-09-19 20:23:10

by Gregory Price

Subject: Re: [RFC PATCH 3/3] mm/migrate: Create move_phys_pages syscall

On Tue, Sep 19, 2023 at 10:59:33AM -0700, Andy Lutomirski wrote:
>
> I'm not complaining about the name. I'm objecting to the semantics.
>
> Apparently you have a system to collect usage statistics of physical addresses, but you have no idea what those pages map to (without crawling /proc or /sys, anyway). But that means you have no idea when the logical contents of those pages *change*. So you fundamentally have a nasty race: anything else that swaps or migrates those pages will mess up your statistics, and you'll start trying to migrate the wrong thing.

How does this change if I use virtual address based migration?

I could do sampling based on virtual address (page faults, IBS/PEBS,
whatever), and by the time I make a decision, the kernel could have
migrated the data or even my task from Node A to Node B. The sample I
took is now stale, and I could make a poor migration decision.

If I do move_pages(pid, some_virt_addr, some_node) and it migrates the
page from NodeA to NodeB, then the device-side collection is likewise
no longer valid. This problem doesn't change because I used a virtual
address rather than a physical address.

But if I have a 512GB memory device, and I can see a wide swath of that
512GB is hot, while a good chunk of my local DRAM is not - then I
probably don't care *what* gets migrated up to DRAM, I just care that a
vast majority of that hot data does.

The goal here isn't 100% precision; you will never get there. The goal
here is broad-scope performance enhancement of the overall system
while minimizing the cost to compute the migration actions to be taken.
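
One way to picture such a coarse policy is the sketch below: given per-PFN heat for the far device and for local DRAM, it pairs the hottest far pages with the coldest DRAM pages and stops as soon as a swap would no longer help. The function name plan_swaps, the heat dicts, and the max_moves budget are assumptions for illustration; it knows nothing about which task owns any page, and it does not perform the migration itself.

```python
def plan_swaps(far_heat, dram_heat, max_moves):
    """Pair hot far-memory PFNs with cold DRAM PFNs.

    far_heat, dram_heat -- dicts mapping PFN to an access count
    max_moves           -- upper bound on the number of pairs returned
    Returns a list of (far_pfn_to_promote, dram_pfn_to_demote) tuples.
    """
    # Hottest far pages first; coldest DRAM pages first. PFN breaks ties.
    hot = sorted(far_heat.items(), key=lambda kv: (-kv[1], kv[0]))
    cold = sorted(dram_heat.items(), key=lambda kv: (kv[1], kv[0]))
    plan = []
    for (far_pfn, far_count), (dram_pfn, dram_count) in zip(hot, cold):
        if len(plan) >= max_moves or far_count <= dram_count:
            break  # budget spent, or the DRAM page is at least as hot
        plan.append((far_pfn, dram_pfn))
    return plan
```

Because the plan is built only from aggregate heat, a stale sample costs one suboptimal move rather than a correctness problem, which is the trade-off being argued for here.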

I don't think the contents of the page are always relevant. The entire
concept here is to enable migration without caring about what programs
are using the memory for - just so long as the memcgs and zoning are
respected.

~Gregory