2023-01-10 15:37:47

by Mike Rapoport

Subject: [PATCH v2 0/2] docs/mm: start filling out new structure

From: "Mike Rapoport (IBM)" <[email protected]>

Hi,

Last year at LSF/MM Matthew promptly created the new structure for MM
documentation, but there still were no patches with content.

I started working on it a while ago and wanted to send it out in a
more complete form, but I got distracted and didn't have time to work
on it.

With fast changes around struct page and the threat of Lorenzo's book,
I've decided to send out what I have so far in the hope that we can
really make this a collaborative effort, with people filling in a
paragraph here and there.

If somebody does not feel like sending formal patches, just send the
"raw" text my way and I'll deal with the rest.

The text is relatively heavily formatted because I believe the target
audience will prefer the HTML version.

v2:
* rephrase the paragraph introducing zones (Lorenzo)
* update formatting (Bagas)
* add section stubs (Bagas)
* small fixes here and there

v1: https://lore.kernel.org/all/[email protected]


Mike Rapoport (IBM) (2):
docs/mm: Page Reclaim: add page label to allow external references
docs/mm: Physical Memory: add structure, introduction and nodes
description

Documentation/mm/page_reclaim.rst | 2 +
Documentation/mm/physical_memory.rst | 340 +++++++++++++++++++++++++++
2 files changed, 342 insertions(+)

--
2.35.1


2023-01-10 15:38:53

by Mike Rapoport

Subject: [PATCH v2 2/2] docs/mm: Physical Memory: add structure, introduction and nodes description

From: "Mike Rapoport (IBM)" <[email protected]>

Add structure, introduction and Nodes section to Physical Memory
chapter.

Signed-off-by: Mike Rapoport (IBM) <[email protected]>
---
Documentation/mm/physical_memory.rst | 340 +++++++++++++++++++++++++++
1 file changed, 340 insertions(+)

diff --git a/Documentation/mm/physical_memory.rst b/Documentation/mm/physical_memory.rst
index 2ab7b8c1c863..9ad42ff22d88 100644
--- a/Documentation/mm/physical_memory.rst
+++ b/Documentation/mm/physical_memory.rst
@@ -3,3 +3,343 @@
===============
Physical Memory
===============
+
+Linux is available for a wide range of architectures so there is a need for an
+architecture-independent abstraction to represent the physical memory. This
+chapter describes the structures used to manage physical memory in a running
+system.
+
+The first principal concept prevalent in the memory management is
+`Non-Uniform Memory Access (NUMA)
+<https://en.wikipedia.org/wiki/Non-uniform_memory_access>`_.
+With multi-core and multi-socket machines, memory may be arranged into banks
+that incur a different cost to access depending on the “distance” from the
+processor. For example, there might be a bank of memory assigned to each CPU or
+a bank of memory very suitable for DMA near peripheral devices.
+
+Each bank is called a node and the concept is represented under Linux by a
+``struct pglist_data`` even if the architecture is UMA. This structure is
+always referenced by its typedef ``pg_data_t``. A ``pg_data_t`` structure
+for a particular node can be referenced by the ``NODE_DATA(nid)`` macro where
+``nid`` is the ID of that node.
+
+For NUMA architectures, the node structures are allocated by the architecture
+specific code early during boot. Usually, these structures are allocated
+locally on the memory bank they represent. For UMA architectures, only one
+static ``pg_data_t`` structure called ``contig_page_data`` is used. Nodes will
+be discussed further in Section :ref:`Nodes <nodes>`.
+
+The entire physical address space is partitioned into one or more blocks
+called zones which represent ranges within memory. These ranges are usually
+determined by architectural constraints for accessing the physical memory.
+The memory range within a node that corresponds to a particular zone is
+described by a ``struct zone``. Each zone has one of the types described
+below.
+
+* ``ZONE_DMA`` and ``ZONE_DMA32`` represent memory suitable for DMA by
+ peripheral devices that cannot access all of the addressable memory.
+ Depending on the architecture, either or even both of these zone types
+ can be disabled at build time using ``CONFIG_ZONE_DMA`` and
+ ``CONFIG_ZONE_DMA32`` configuration options. Some 64-bit platforms may need
+ both zones as they support peripherals with different DMA addressing
+ limitations.
+
+* ``ZONE_NORMAL`` is for normal memory that can be accessed by the kernel all
+ the time. DMA operations can be performed on pages in this zone if the DMA
+ devices support transfers to all addressable memory. ``ZONE_NORMAL`` is
+ always enabled.
+
+* ``ZONE_HIGHMEM`` is the part of the physical memory that is not covered by a
+ permanent mapping in the kernel page tables. The memory in this zone is only
+ accessible to the kernel using temporary mappings. This zone is available
+ only on some 32-bit architectures and is enabled with ``CONFIG_HIGHMEM``.
+
+* ``ZONE_MOVABLE`` is for normal accessible memory, just like ``ZONE_NORMAL``.
+ The difference is that most pages in ``ZONE_MOVABLE`` are movable. That means
+ that while virtual addresses of these pages do not change, their content may
+ move between different physical pages. ``ZONE_MOVABLE`` is only enabled when
+ one of ``kernelcore``, ``movablecore`` and ``movable_node`` parameters is
+ present in the kernel command line. See :ref:`Page migration
+ <page_migration>` for additional details.
+
+* ``ZONE_DEVICE`` represents memory residing on devices such as PMEM and GPU.
+ It has different characteristics than RAM zone types and it exists to provide
+ :ref:`struct page <Pages>` and memory map services for device driver
+ identified physical address ranges. ``ZONE_DEVICE`` is enabled with
+ configuration option ``CONFIG_ZONE_DEVICE``.
+
+It is important to note that many kernel operations can only take place using
+``ZONE_NORMAL`` so it is the most performance critical zone. Zones are
+discussed further in Section :ref:`Zones <zones>`.
+
+The relation between node and zone extents is determined by the physical memory
+map reported by the firmware, architectural constraints for memory addressing
+and certain parameters in the kernel command line.
+
+For example, with 32-bit kernel on an x86 UMA machine with 2 Gbytes of RAM the
+entire memory will be on node 0 and there will be three zones: ``ZONE_DMA``,
+``ZONE_NORMAL`` and ``ZONE_HIGHMEM``::
+
+  0                                                            2G
+  +-------------------------------------------------------------+
+  |                           node 0                            |
+  +-------------------------------------------------------------+
+
+  0         16M                    896M                        2G
+  +----------+-----------------------+--------------------------+
+  | ZONE_DMA |      ZONE_NORMAL      |       ZONE_HIGHMEM       |
+  +----------+-----------------------+--------------------------+
+
+
+With a kernel built with ``ZONE_DMA`` disabled and ``ZONE_DMA32`` enabled and
+booted with ``movablecore=80%`` parameter on an arm64 machine with 16 Gbytes of
+RAM equally split between two nodes, there will be ``ZONE_DMA32``,
+``ZONE_NORMAL`` and ``ZONE_MOVABLE`` on node 0, and ``ZONE_NORMAL`` and
+``ZONE_MOVABLE`` on node 1::
+
+
+  1G                                9G                        17G
+  +--------------------------------+ +--------------------------+
+  |             node 0             | |          node 1          |
+  +--------------------------------+ +--------------------------+
+
+  1G        4G       4200M          9G          9320M         17G
+  +---------+----------+-----------+ +------------+-------------+
+  |  DMA32  |  NORMAL  |  MOVABLE  | |   NORMAL   |   MOVABLE   |
+  +---------+----------+-----------+ +------------+-------------+
+
+.. _nodes:
+
+Nodes
+=====
+
+As we have mentioned, each node in memory is described by a ``pg_data_t`` which
+is a typedef for a ``struct pglist_data``. When allocating a page, by default
+Linux uses a node-local allocation policy to allocate memory from the node
+closest to the running CPU. As processes tend to run on the same CPU, it is
+likely the memory from the current node will be used. The allocation policy can
+be controlled by users as described in
+Documentation/admin-guide/mm/numa_memory_policy.rst.
+
+Most NUMA architectures maintain an array of pointers to the node
+structures. The actual structures are allocated early during boot when
+architecture specific code parses the physical memory map reported by the
+firmware. The bulk of the node initialization happens slightly later in the
+boot process in the free_area_init() function, described later in Section
+:ref:`Initialization <initialization>`.
+
+
+Along with the node structures, the kernel maintains an array of ``nodemask_t``
+bitmasks called ``node_states``. Each bitmask in this array represents a set of
+nodes with particular properties as defined by ``enum node_states``:
+
+``N_POSSIBLE``
+ The node could become online at some point.
+``N_ONLINE``
+ The node is online.
+``N_NORMAL_MEMORY``
+ The node has regular memory.
+``N_HIGH_MEMORY``
+ The node has regular or high memory. When ``CONFIG_HIGHMEM`` is disabled,
+ it is aliased to ``N_NORMAL_MEMORY``.
+``N_MEMORY``
+ The node has memory (regular, high, movable).
+``N_CPU``
+ The node has one or more CPUs.
+
+For each node that has a property described above, the bit corresponding to the
+node ID in the ``node_states[<property>]`` bitmask is set.
+
+For example, for node 2 with normal memory and CPUs, bit 2 will be set in ::
+
+ node_states[N_POSSIBLE]
+ node_states[N_ONLINE]
+ node_states[N_NORMAL_MEMORY]
+ node_states[N_MEMORY]
+ node_states[N_CPU]
+
+For various operations possible with nodemasks please refer to
+``include/linux/nodemask.h``.
+
+Among other things, nodemasks are used to provide macros for node traversal,
+namely ``for_each_node()`` and ``for_each_online_node()``.
+
+For instance, to call a function foo() for each online node::
+
+	for_each_online_node(nid) {
+		pg_data_t *pgdat = NODE_DATA(nid);
+
+		foo(pgdat);
+	}
+
+Node structure
+--------------
+
+The node structure ``struct pglist_data`` is declared in
+``include/linux/mmzone.h``. Here we briefly describe fields of this
+structure:
+
+General
+~~~~~~~
+
+``node_zones``
+ The zones for this node. Not all of the zones may be populated, but it is
+ the full list. It is referenced by this node's node_zonelists as well as
+ other nodes' node_zonelists.
+
+``node_zonelists``
+ The list of all zones in all nodes. This list defines the order of zones
+ that allocations are preferred from. The ``node_zonelists`` is set up by
+ ``build_zonelists()`` in ``mm/page_alloc.c`` during the initialization of
+ core memory management structures.
+
+``nr_zones``
+ Number of populated zones in this node.
+
+``node_mem_map``
+ For UMA systems that use the FLATMEM memory model, node 0's
+ ``node_mem_map`` is an array of struct pages representing each physical frame.
+
+``node_page_ext``
+ For UMA systems that use the FLATMEM memory model, node 0's
+ ``node_page_ext`` is an array of extensions of struct pages. Available only
+ in kernels built with ``CONFIG_PAGE_EXTENSION`` enabled.
+
+``node_start_pfn``
+ The page frame number of the starting page frame in this node.
+
+``node_present_pages``
+ Total number of physical pages present in this node.
+
+``node_spanned_pages``
+ Total size of physical page range, including holes.
+
+``node_size_lock``
+ A lock that protects the fields defining the node extents. Only defined when
+ at least one of the ``CONFIG_MEMORY_HOTPLUG`` and
+ ``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` configuration options is enabled.
+ ``pgdat_resize_lock()`` and ``pgdat_resize_unlock()`` are provided to
+ manipulate ``node_size_lock`` without checking for ``CONFIG_MEMORY_HOTPLUG``
+ or ``CONFIG_DEFERRED_STRUCT_PAGE_INIT``.
+
+``node_id``
+ The Node ID (NID) of the node, starts at 0.
+
+``totalreserve_pages``
+ This is a per-node reserve of pages that are not available to userspace
+ allocations.
+
+``first_deferred_pfn``
+ If memory initialization on large machines is deferred then this is the first
+ PFN that needs to be initialized. Defined only when
+ ``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` is enabled.
+
+``deferred_split_queue``
+ Per-node queue of huge pages whose split was deferred. Defined only when ``CONFIG_TRANSPARENT_HUGEPAGE`` is enabled.
+
+``__lruvec``
+ Per-node lruvec holding LRU lists and related parameters. Used only when
+ memory cgroups are disabled. It should not be accessed directly, use
+ ``mem_cgroup_lruvec()`` to look up lruvecs instead.
+
+Reclaim control
+~~~~~~~~~~~~~~~
+
+See also :ref:`Page Reclaim <page_reclaim>`.
+
+``kswapd``
+ Per-node instance of kswapd kernel thread.
+
+``kswapd_wait``, ``pfmemalloc_wait``, ``reclaim_wait``
+ Wait queues used to synchronize memory reclaim tasks.
+
+``nr_writeback_throttled``
+ Number of tasks that are throttled waiting on dirty pages to clean.
+
+``nr_reclaim_start``
+ Number of pages written while reclaim is throttled waiting for writeback.
+
+``kswapd_order``
+ Controls the order kswapd tries to reclaim.
+
+``kswapd_highest_zoneidx``
+ The highest zone index to be reclaimed by kswapd.
+
+``kswapd_failures``
+ Number of runs in which kswapd was unable to reclaim any pages.
+
+``min_unmapped_pages``
+ Minimal number of unmapped file backed pages that cannot be reclaimed.
+ Determined by ``vm.min_unmapped_ratio`` sysctl. Only defined when
+ ``CONFIG_NUMA`` is enabled.
+
+``min_slab_pages``
+ Minimal number of SLAB pages that cannot be reclaimed. Determined by
+ ``vm.min_slab_ratio`` sysctl. Only defined when ``CONFIG_NUMA`` is enabled.
+
+``flags``
+ Flags controlling reclaim behavior.
+
+Compaction control
+~~~~~~~~~~~~~~~~~~
+
+``kcompactd_max_order``
+ Page order that kcompactd should try to achieve.
+
+``kcompactd_highest_zoneidx``
+ The highest zone index to be compacted by kcompactd.
+
+``kcompactd_wait``
+ Wait queue used to synchronize memory compaction tasks.
+
+``kcompactd``
+ Per-node instance of kcompactd kernel thread.
+
+``proactive_compact_trigger``
+ Determines if proactive compaction is enabled. Controlled by
+ ``vm.compaction_proactiveness`` sysctl.
+
+Statistics
+~~~~~~~~~~
+
+``per_cpu_nodestats``
+ Per-CPU VM statistics for the node.
+
+``vm_stat``
+ VM statistics for the node.
+
+.. _zones:
+
+Zones
+=====
+
+.. admonition:: Stub
+
+ This section is incomplete. Please list and describe the appropriate fields.
+
+.. _pages:
+
+Pages
+=====
+
+.. admonition:: Stub
+
+ This section is incomplete. Please list and describe the appropriate fields.
+
+.. _folios:
+
+Folios
+======
+
+.. admonition:: Stub
+
+ This section is incomplete. Please list and describe the appropriate fields.
+
+.. _initialization:
+
+Initialization
+==============
+
+.. admonition:: Stub
+
+ This section is incomplete. Please list and describe the appropriate fields.
--
2.35.1

2023-01-10 17:08:34

by Michal Hocko

Subject: Re: [PATCH v2 2/2] docs/mm: Physical Memory: add structure, introduction and nodes description

On Tue 10-01-23 17:23:58, Mike Rapoport wrote:
[...]
> +* ``ZONE_DMA`` and ``ZONE_DMA32`` represent memory suitable for DMA by
> + peripheral devices that cannot access all of the addressable memory.

I think it would be better not to keep the historical DMA-based meaning
and teach that to future developers. You can say something like

ZONE_DMA and ZONE_DMA32 have historically been used for memory suitable
for DMA. For many years there have been better, more robust interfaces to
get memory with DMA specific requirements (Documentation/core-api/dma-api.rst).

> + Depending on the architecture, either or even both of these zone types
> + can be disabled at build time using ``CONFIG_ZONE_DMA`` and
> + ``CONFIG_ZONE_DMA32`` configuration options. Some 64-bit platforms may need
> + both zones as they support peripherals with different DMA addressing
> + limitations.
> +
> +* ``ZONE_NORMAL`` is for normal memory that can be accessed by the kernel all
> + the time. DMA operations can be performed on pages in this zone if the DMA
> + devices support transfers to all addressable memory. ``ZONE_NORMAL`` is
> + always enabled.
> +
> +* ``ZONE_HIGHMEM`` is the part of the physical memory that is not covered by a
> + permanent mapping in the kernel page tables. The memory in this zone is only
> + accessible to the kernel using temporary mappings. This zone is available
> + only on some 32-bit architectures and is enabled with ``CONFIG_HIGHMEM``.
> +
> +* ``ZONE_MOVABLE`` is for normal accessible memory, just like ``ZONE_NORMAL``.
> + The difference is that most pages in ``ZONE_MOVABLE`` are movable.

This is really confusing because those pages are not really movable. You
cannot move a page itself. I guess you meant to say something like

The difference is that there are means to migrate memory via the
migrate_pages interface. A typical example would be memory mapped to
userspace, where the kernel can relocate the underlying memory content and
update page tables so that userspace doesn't notice the physical data
placement has changed.

> That means
> + that while virtual addresses of these pages do not change, their content may
> + move between different physical pages. ``ZONE_MOVABLE`` is only enabled when
> + one of ``kernelcore``, ``movablecore`` and ``movable_node`` parameters is
> + present in the kernel command line. See :ref:`Page migration
> + <page_migration>` for additional details.

This is not really true. The movable zone can also be enabled by memory
hotplug. In fact it is one of the more common use cases for the zone
because memory hot remove largely depends on memory to be migrated for
offlining to succeed in most cases.

> +
> +* ``ZONE_DEVICE`` represents memory residing on devices such as PMEM and GPU.
> + It has different characteristics than RAM zone types and it exists to provide
> + :ref:`struct page <Pages>` and memory map services for device driver
> + identified physical address ranges. ``ZONE_DEVICE`` is enabled with
> + configuration option ``CONFIG_ZONE_DEVICE``.
> +
> +It is important to note that many kernel operations can only take place using
> +``ZONE_NORMAL`` so it is the most performance critical zone. Zones are
> +discussed further in Section :ref:`Zones <zones>`.
> +
> +The relation between node and zone extents is determined by the physical memory
> +map reported by the firmware, architectural constraints for memory addressing
> +and certain parameters in the kernel command line.
> +
> +For example, with 32-bit kernel on an x86 UMA machine with 2 Gbytes of RAM the
> +entire memory will be on node 0 and there will be three zones: ``ZONE_DMA``,
> +``ZONE_NORMAL`` and ``ZONE_HIGHMEM``::
> +
> + 0 2G
> + +-------------------------------------------------------------+
> + | node 0 |
> + +-------------------------------------------------------------+
> +
> + 0 16M 896M 2G
> + +----------+-----------------------+--------------------------+
> + | ZONE_DMA | ZONE_NORMAL | ZONE_HIGHMEM |
> + +----------+-----------------------+--------------------------+
> +
> +
> +With a kernel built with ``ZONE_DMA`` disabled and ``ZONE_DMA32`` enabled and
> +booted with ``movablecore=80%`` parameter on an arm64 machine with 16 Gbytes of
> +RAM equally split between two nodes, there will be ``ZONE_DMA32``,
> +``ZONE_NORMAL`` and ``ZONE_MOVABLE`` on node 0, and ``ZONE_NORMAL`` and
> +``ZONE_MOVABLE`` on node 1::
> +
> +
> + 1G 9G 17G
> + +--------------------------------+ +--------------------------+
> + | node 0 | | node 1 |
> + +--------------------------------+ +--------------------------+
> +
> + 1G 4G 4200M 9G 9320M 17G
> + +---------+----------+-----------+ +------------+-------------+
> + | DMA32 | NORMAL | MOVABLE | | NORMAL | MOVABLE |
> + +---------+----------+-----------+ +------------+-------------+

I think it is useful to note that nodes and zones can overlap in the
physical address range. It is not uncommon to interleave two nodes and
it is also possible that memory holes are memory hotplugged into the MOVABLE
zone arbitrarily in the physical address range.

Other than that looks good to me and thanks for taking care of filling
up these gaps! This is highly appreciated.
--
Michal Hocko
SUSE Labs

2023-01-11 12:54:51

by Mike Rapoport

Subject: Re: [PATCH v2 2/2] docs/mm: Physical Memory: add structure, introduction and nodes description

On Tue, Jan 10, 2023 at 05:54:10PM +0100, Michal Hocko wrote:
> On Tue 10-01-23 17:23:58, Mike Rapoport wrote:
> [...]
> > +* ``ZONE_DMA`` and ``ZONE_DMA32`` represent memory suitable for DMA by
> > + peripheral devices that cannot access all of the addressable memory.
>
> I think it would be better not to keep the historical DMA-based meaning
> and teach that to future developers. You can say something like
>
> ZONE_DMA and ZONE_DMA32 have historically been used for memory suitable
> for DMA. For many years there have been better, more robust interfaces to
> get memory with DMA specific requirements (Documentation/core-api/dma-api.rst).

But even today ZONE_DMA(32) means that the memory is suitable for DMA. This
is nicely encapsulated with dma APIs and there should be no new GFP_DMA
users, but still memory outside ZONE_DMA is not suitable for DMA.

> > + Depending on the architecture, either or even both of these zone types
> > + can be disabled at build time using ``CONFIG_ZONE_DMA`` and
> > + ``CONFIG_ZONE_DMA32`` configuration options. Some 64-bit platforms may need
> > + both zones as they support peripherals with different DMA addressing
> > + limitations.
> > +
> > +* ``ZONE_NORMAL`` is for normal memory that can be accessed by the kernel all
> > + the time. DMA operations can be performed on pages in this zone if the DMA
> > + devices support transfers to all addressable memory. ``ZONE_NORMAL`` is
> > + always enabled.
> > +
> > +* ``ZONE_HIGHMEM`` is the part of the physical memory that is not covered by a
> > + permanent mapping in the kernel page tables. The memory in this zone is only
> > + accessible to the kernel using temporary mappings. This zone is available
> > + only on some 32-bit architectures and is enabled with ``CONFIG_HIGHMEM``.
> > +
> > +* ``ZONE_MOVABLE`` is for normal accessible memory, just like ``ZONE_NORMAL``.
> > + The difference is that most pages in ``ZONE_MOVABLE`` are movable.
>
> This is really confusing because those pages are not really movable. You
> cannot move a page itself. I guess you meant to say something like
>
> The difference is that there are means to migrate memory via the
> migrate_pages interface. A typical example would be memory mapped to
> userspace, where the kernel can relocate the underlying memory content and
> update page tables so that userspace doesn't notice the physical data
> placement has changed.

I agree that this sentence is a bit confusing, but there's a clarification
below. Also, I'd like to keep this at a high level without going into the
details about how exactly the pages can be migrated.

> > That means
> > + that while virtual addresses of these pages do not change, their content may
> > + move between different physical pages. ``ZONE_MOVABLE`` is only enabled when
> > + one of ``kernelcore``, ``movablecore`` and ``movable_node`` parameters is
> > + present in the kernel command line. See :ref:`Page migration
> > + <page_migration>` for additional details.
>
> This is not really true. The movable zone can also be enabled by memory
> hotplug. In fact it is one of the more common use cases for the zone
> because memory hot remove largely depends on memory to be migrated for
> offlining to succeed in most cases.

Right. How about this version of ZONE_MOVABLE description:

* ``ZONE_MOVABLE`` is for normal accessible memory, just like ``ZONE_NORMAL``.
The difference is that the contents of most pages in ``ZONE_MOVABLE`` are
movable. That means that while virtual addresses of these pages do not
change, their content may move between different physical pages. Often
``ZONE_MOVABLE`` is populated during memory hotplug, but it may
also be populated on boot using one of ``kernelcore``, ``movablecore`` and
``movable_node`` kernel command line parameters. See :ref:`Page migration
<page_migration>` and :ref:`Memory Hot(Un)Plug <admin_guide_memory_hotplug>`
for additional details.

> > +* ``ZONE_DEVICE`` represents memory residing on devices such as PMEM and GPU.
> > + It has different characteristics than RAM zone types and it exists to provide
> > + :ref:`struct page <Pages>` and memory map services for device driver
> > + identified physical address ranges. ``ZONE_DEVICE`` is enabled with
> > + configuration option ``CONFIG_ZONE_DEVICE``.
> > +
> > +It is important to note that many kernel operations can only take place using
> > +``ZONE_NORMAL`` so it is the most performance critical zone. Zones are
> > +discussed further in Section :ref:`Zones <zones>`.
> > +
> > +The relation between node and zone extents is determined by the physical memory
> > +map reported by the firmware, architectural constraints for memory addressing
> > +and certain parameters in the kernel command line.
> > +
> > +For example, with 32-bit kernel on an x86 UMA machine with 2 Gbytes of RAM the
> > +entire memory will be on node 0 and there will be three zones: ``ZONE_DMA``,
> > +``ZONE_NORMAL`` and ``ZONE_HIGHMEM``::
> > +
> > + 0 2G
> > + +-------------------------------------------------------------+
> > + | node 0 |
> > + +-------------------------------------------------------------+
> > +
> > + 0 16M 896M 2G
> > + +----------+-----------------------+--------------------------+
> > + | ZONE_DMA | ZONE_NORMAL | ZONE_HIGHMEM |
> > + +----------+-----------------------+--------------------------+
> > +
> > +
> > +With a kernel built with ``ZONE_DMA`` disabled and ``ZONE_DMA32`` enabled and
> > +booted with ``movablecore=80%`` parameter on an arm64 machine with 16 Gbytes of
> > +RAM equally split between two nodes, there will be ``ZONE_DMA32``,
> > +``ZONE_NORMAL`` and ``ZONE_MOVABLE`` on node 0, and ``ZONE_NORMAL`` and
> > +``ZONE_MOVABLE`` on node 1::
> > +
> > +
> > + 1G 9G 17G
> > + +--------------------------------+ +--------------------------+
> > + | node 0 | | node 1 |
> > + +--------------------------------+ +--------------------------+
> > +
> > + 1G 4G 4200M 9G 9320M 17G
> > + +---------+----------+-----------+ +------------+-------------+
> > + | DMA32 | NORMAL | MOVABLE | | NORMAL | MOVABLE |
> > + +---------+----------+-----------+ +------------+-------------+
>
> I think it is useful to note that nodes and zones can overlap in the
> physical address range. It is not uncommon to interleave two nodes and
> it is also possible that memory holes are memory hotplugged into the MOVABLE
> zone arbitrarily in the physical address range.

Hmm, not sure I understand what you mean by "overlap".
For interleaved nodes you mean that node 0 may span, say [0x0, 0x2000) and
[0x4000, 0x6000) and node 1 spans [0x2000, 0x4000) and [0x6000, 0x8000)?

And as for MOVABLE zone, you mean that it can appear between ranges of
NORMAL zone?

> Other than that looks good to me and thanks for taking care of filling
> up these gaps! This is highly appreciated.

Thanks!

I'd appreciate more inputs ;-)

> --
> Michal Hocko
> SUSE Labs

--
Sincerely yours,
Mike.

2023-01-11 13:52:37

by Michal Hocko

Subject: Re: [PATCH v2 2/2] docs/mm: Physical Memory: add structure, introduction and nodes description

On Wed 11-01-23 14:24:43, Mike Rapoport wrote:
> On Tue, Jan 10, 2023 at 05:54:10PM +0100, Michal Hocko wrote:
> > On Tue 10-01-23 17:23:58, Mike Rapoport wrote:
> > [...]
> > > +* ``ZONE_DMA`` and ``ZONE_DMA32`` represent memory suitable for DMA by
> > > + peripheral devices that cannot access all of the addressable memory.
> >
> > I think it would be better not to keep the historical DMA-based meaning
> > and teach that to future developers. You can say something like
> >
> > ZONE_DMA and ZONE_DMA32 have historically been used for memory suitable
> > for DMA. For many years there have been better, more robust interfaces to
> > get memory with DMA specific requirements (Documentation/core-api/dma-api.rst).
>
> But even today ZONE_DMA(32) means that the memory is suitable for DMA. This
> is nicely encapsulated with dma APIs and there should be no new GFP_DMA
> users, but still memory outside ZONE_DMA is not suitable for DMA.

Well, the thing is that ZONE_DMA means different things for different
architectures. For x86 it is effectively about ISA attached HW - which
means almost nothing these days. There is a plethora of other HW with
different address range constraints for DMA transfers so binding the zone
with DMA is more likely to cause confusion than to help.

> > > + Depending on the architecture, either or even both of these zone types
> > > + can be disabled at build time using ``CONFIG_ZONE_DMA`` and
> > > + ``CONFIG_ZONE_DMA32`` configuration options. Some 64-bit platforms may need
> > > + both zones as they support peripherals with different DMA addressing
> > > + limitations.
> > > +
> > > +* ``ZONE_NORMAL`` is for normal memory that can be accessed by the kernel all
> > > + the time. DMA operations can be performed on pages in this zone if the DMA
> > > + devices support transfers to all addressable memory. ``ZONE_NORMAL`` is
> > > + always enabled.
> > > +
> > > +* ``ZONE_HIGHMEM`` is the part of the physical memory that is not covered by a
> > > + permanent mapping in the kernel page tables. The memory in this zone is only
> > > + accessible to the kernel using temporary mappings. This zone is available
> > > + only on some 32-bit architectures and is enabled with ``CONFIG_HIGHMEM``.
> > > +
> > > +* ``ZONE_MOVABLE`` is for normal accessible memory, just like ``ZONE_NORMAL``.
> > > + The difference is that most pages in ``ZONE_MOVABLE`` are movable.
> >
> > This is really confusing because those pages are not really movable. You
> > cannot move a page itself. I guess you meant to say something like
> >
> > The difference is that there are means to migrate memory via the
> > migrate_pages interface. A typical example would be memory mapped to
> > userspace, where the kernel can relocate the underlying memory content and
> > update page tables so that userspace doesn't notice the physical data
> > placement has changed.
>
> I agree that this sentence is a bit confusing, but there's a clarification
> below. Also, I'd like to keep this at a high level without going into the
> details about how exactly the pages can be migrated.

Yes, ZONE_MOVABLE is confusing as well. I do not think you have
to elaborate more than just state that the memory should be migratable.

> > > That means
> > > + that while virtual addresses of these pages do not change, their content may
> > > + move between different physical pages. ``ZONE_MOVABLE`` is only enabled when
> > > + one of ``kernelcore``, ``movablecore`` and ``movable_node`` parameters is
> > > + present in the kernel command line. See :ref:`Page migration
> > > + <page_migration>` for additional details.
> >
> > This is not really true. The movable zone can also be enabled by memory
> > hotplug. In fact it is one of the more common use cases for the zone
> > because memory hot remove largely depends on memory to be migrated for
> > offlining to succeed in most cases.
>
> Right. How about this version of ZONE_MOVABLE description:
>
> * ``ZONE_MOVABLE`` is for normal accessible memory, just like ``ZONE_NORMAL``.
> The difference is that the contents of most pages in ``ZONE_MOVABLE`` are
> movable. That means that while virtual addresses of these pages do not
> change, their content may move between different physical pages. Often
> ``ZONE_MOVABLE`` is populated during memory hotplug, but it may
> also be populated on boot using one of ``kernelcore``, ``movablecore`` and
> ``movable_node`` kernel command line parameters. See :ref:`Page migration
> <page_migration>` and :ref:`Memory Hot(Un)Plug <admin_guide_memory_hotplug>`
> for additional details.

Yes, sounds much better!

[...]
> > > + 1G 9G 17G
> > > + +--------------------------------+ +--------------------------+
> > > + | node 0 | | node 1 |
> > > + +--------------------------------+ +--------------------------+
> > > +
> > > + 1G 4G 4200M 9G 9320M 17G
> > > + +---------+----------+-----------+ +------------+-------------+
> > > + | DMA32 | NORMAL | MOVABLE | | NORMAL | MOVABLE |
> > > + +---------+----------+-----------+ +------------+-------------+
> >
> > I think it is useful to note that nodes and zones can overlap in the
> > physical address range. It is not uncommon to interleave two nodes and
> > it is also possible that memory holes are memory hotplugged into MOVABLE
> > zone arbitrarily in the physical address range.
>
> Hmm, not sure I understand what you mean by "overlap".
> For interleaved nodes you mean that node 0 may span, say [0x0, 0x2000) and
> [0x4000, 0x6000) and node 1 spans [0x2000, 0x4000) and [0x6000, 0x8000)?

Yes, that would be represented by
NODE_DATA(0)->node_start_pfn = 0
NODE_DATA(0)->node_spanned_pages = 0x6000
NODE_DATA(1)->node_start_pfn = 0x2000
NODE_DATA(1)->node_spanned_pages = 0x6000
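
For illustration only, here is a minimal userspace sketch of the interleaved
layout from the question above. It derives each node's start and spanned size
from its list of ranges and checks whether the two spans overlap. The names
`struct pfn_range`, `struct node_span`, `span_of()` and `spans_overlap()` are
made up for this example and are not kernel APIs, and treating the quoted
ranges as PFNs is an assumption:

```c
#include <stddef.h>

/* Hypothetical userspace model of a node's memory ranges; not kernel code. */
struct pfn_range {
    unsigned long start;    /* first PFN in the range */
    unsigned long end;      /* one past the last PFN in the range */
};

struct node_span {
    unsigned long start_pfn;        /* lowest PFN owned by the node */
    unsigned long spanned_pages;    /* size of the span, including holes */
    unsigned long present_pages;    /* pages that actually belong to the node */
};

/* Compute the span covering all ranges of one node (n must be >= 1). */
static struct node_span span_of(const struct pfn_range *r, size_t n)
{
    struct node_span s = { 0, 0, 0 };
    unsigned long lo = r[0].start, hi = r[0].end;
    size_t i;

    for (i = 0; i < n; i++) {
        if (r[i].start < lo)
            lo = r[i].start;
        if (r[i].end > hi)
            hi = r[i].end;
        s.present_pages += r[i].end - r[i].start;
    }
    s.start_pfn = lo;
    s.spanned_pages = hi - lo;
    return s;
}

/* Two spans overlap when each one starts before the other one ends. */
static int spans_overlap(struct node_span a, struct node_span b)
{
    return a.start_pfn < b.start_pfn + b.spanned_pages &&
           b.start_pfn < a.start_pfn + a.spanned_pages;
}

/* The interleaved ranges from the question above, taken as PFNs. */
static const struct pfn_range node0_ranges[] = {
    { 0x0, 0x2000 }, { 0x4000, 0x6000 },
};
static const struct pfn_range node1_ranges[] = {
    { 0x2000, 0x4000 }, { 0x6000, 0x8000 },
};
```

With these inputs, `span_of(node0_ranges, 2)` and `span_of(node1_ranges, 2)`
both report a span of 0x6000 pages while each node only has 0x4000 pages
present, and `spans_overlap()` is true for the pair. That is the kind of
overlap meant here: the spans intersect even though no page belongs to both
nodes.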


> And as for MOVABLE zone, you mean that it can appear between ranges of
> NORMAL zone?

Yes and also other zones as well but that is less likely as those tend
to be populated from the early boot. But theoretically it can be placed
in any physical range with page block granularity.

--
Michal Hocko
SUSE Labs

2023-01-11 16:12:43

by Mike Rapoport

Subject: Re: [PATCH v2 2/2] docs/mm: Physical Memory: add structure, introduction and nodes description

On Wed, Jan 11, 2023 at 02:36:16PM +0100, Michal Hocko wrote:
> On Wed 11-01-23 14:24:43, Mike Rapoport wrote:
> > On Tue, Jan 10, 2023 at 05:54:10PM +0100, Michal Hocko wrote:
> > > On Tue 10-01-23 17:23:58, Mike Rapoport wrote:
> > > [...]
> > > > +* ``ZONE_DMA`` and ``ZONE_DMA32`` represent memory suitable for DMA by
> > > > + peripheral devices that cannot access all of the addressable memory.
> > >
> > > I think it would be better to not keep the historical DMA based meaning
> > > and teach that to future developers. You can say something like
> > >
> > > ZONE_DMA and ZONE_DMA32 have historically been used for memory suitable
> > > for DMA. For many years there have been better, more robust interfaces to
> > > get memory with DMA specific requirements (Documentation/core-api/dma-api.rst).
> >
> > But even today ZONE_DMA(32) means that the memory is suitable for DMA. This
> > is nicely encapsulated with dma APIs and there should be no new GFP_DMA
> > users, but still memory outside ZONE_DMA is not suitable for DMA.
>
> Well, the thing is that ZONE_DMA means different things for different
> architectures. For x86 it is effectively about ISA attached HW - which
> means almost nothing these days. There is a plethora of other HW with
> different address range constraints for DMA transfers so binding the zone
> with DMA is more likely to cause confusion than to help.

Ok, how about

* ``ZONE_DMA`` and ``ZONE_DMA32`` historically represented memory suitable for
DMA by peripheral devices that cannot access all of the addressable
memory. For many years there have been better and more robust interfaces to
get memory with DMA specific requirements (:ref:`DMA API <_dma_api>`), but
``ZONE_DMA`` and ``ZONE_DMA32`` still represent memory ranges that have
restrictions on how they can be accessed.
Depending on the architecture, either of these zone types, or even both of
them, can be disabled at build time using ``CONFIG_ZONE_DMA`` and
``CONFIG_ZONE_DMA32`` configuration options. Some 64-bit platforms may need
both zones as they support peripherals with different DMA addressing
limitations.

> > > > + Depending on the architecture, either of these zone types or even they both
> > > > + can be disabled at build time using ``CONFIG_ZONE_DMA`` and
> > > > + ``CONFIG_ZONE_DMA32`` configuration options. Some 64-bit platforms may need
> > > > + both zones as they support peripherals with different DMA addressing
> > > > + limitations.
> > > > +
> > > > +* ``ZONE_NORMAL`` is for normal memory that can be accessed by the kernel all
> > > > + the time. DMA operations can be performed on pages in this zone if the DMA
> > > > + devices support transfers to all addressable memory. ``ZONE_NORMAL`` is
> > > > + always enabled.
> > > > +
> > > > +* ``ZONE_HIGHMEM`` is the part of the physical memory that is not covered by a
> > > > + permanent mapping in the kernel page tables. The memory in this zone is only
> > > > + accessible to the kernel using temporary mappings. This zone is available
> > > > + only on some 32-bit architectures and is enabled with ``CONFIG_HIGHMEM``.
> > > > +
> > > > +* ``ZONE_MOVABLE`` is for normal accessible memory, just like ``ZONE_NORMAL``.
> > > > + The difference is that most pages in ``ZONE_MOVABLE`` are movable.
> > >
> > > This is really confusing because those pages are not really movable. You
> > > cannot move a page itself. I guess you meant to say something like
> > >
> > > The difference is that there are means to migrate memory via
> > > migrate_pages interface. A typical example would be memory mapped to
> > > userspace, where the kernel can relocate the underlying memory content and
> > > update page tables so that userspace doesn't notice the physical data
> > > placement has changed.
> >
> > I agree that this sentence is a bit confusing, but there's a clarification
> > below. Also, I'd like to keep this at high level without going to the
> > details about how exactly the pages can be migrated.
>
> Yes, ZONE_MOVABLE is confusing as well. I do not think you have to
> elaborate more than just stating that the memory should be migratable.
>
> > > > That means
> > > > + that while virtual addresses of these pages do not change, their content may
> > > > + move between different physical pages. ``ZONE_MOVABLE`` is only enabled when
> > > > + one of ``kernelcore``, ``movablecore`` and ``movable_node`` parameters is
> > > > + present in the kernel command line. See :ref:`Page migration
> > > > + <page_migration>` for additional details.
> > >
> > > This is not really true. The movable zone can be also enabled by memory
> > > hotplug. In fact it is one of the more common use cases for the zone
> > > because memory hot remove largely depends on memory being migrated for
> > > offlining to succeed in most cases.
> >
> > Right. How about this version of ZONE_MOVABLE description:
> >
> > * ``ZONE_MOVABLE`` is for normal accessible memory, just like ``ZONE_NORMAL``.
> > The difference is that the contents of most pages in ``ZONE_MOVABLE`` are
> > movable. That means that while virtual addresses of these pages do not
> > change, their content may move between different physical pages. Often
> > ``ZONE_MOVABLE`` is populated during memory hotplug, but it may be
> > also populated on boot using one of ``kernelcore``, ``movablecore`` and
> > ``movable_node`` kernel command line parameters. See :ref:`Page migration
> > <page_migration>` and :ref:`Memory Hot(Un)Plug <_admin_guide_memory_hotplug>`
> > for additional details.
>
> Yes, sounds much better!
>
> [...]
> > > > + 1G 9G 17G
> > > > + +--------------------------------+ +--------------------------+
> > > > + | node 0 | | node 1 |
> > > > + +--------------------------------+ +--------------------------+
> > > > +
> > > > + 1G 4G 4200M 9G 9320M 17G
> > > > + +---------+----------+-----------+ +------------+-------------+
> > > > + | DMA32 | NORMAL | MOVABLE | | NORMAL | MOVABLE |
> > > > + +---------+----------+-----------+ +------------+-------------+
> > >
> > > I think it is useful to note that nodes and zones can overlap in the
> > > physical address range. It is not uncommon to interleave two nodes and
> > > it is also possible that memory holes are memory hotplugged into MOVABLE
> > > zone arbitrarily in the physical address range.
> >
> > Hmm, not sure I understand what you mean by "overlap".
> > For interleaved nodes you mean that node 0 may span, say [0x0, 0x2000) and
> > [0x4000, 0x6000) and node 1 spans [0x2000, 0x4000) and [0x6000, 0x8000)?
>
> Yes, that would be represented by
> NODE_DATA(0)->node_start_pfn = 0
> NODE_DATA(0)->node_spanned_pages = 0x6000
> NODE_DATA(1)->node_start_pfn = 0x2000
> NODE_DATA(1)->node_spanned_pages = 0x6000
>
>
> > And as for MOVABLE zone, you mean that it can appear between ranges of
> > NORMAL zone?
>
> Yes and also other zones as well but that is less likely as those tend
> to be populated from the early boot. But theoretically it can be placed
> in any physical range with page block granularity.

Hmm, these are not easy to explain, but I'll try to come up with something.
I'd prefer to have this as a followup patch, though.

> --
> Michal Hocko
> SUSE Labs

--
Sincerely yours,
Mike.

2023-01-11 17:09:51

by Michal Hocko

Subject: Re: [PATCH v2 2/2] docs/mm: Physical Memory: add structure, introduction and nodes description

On Wed 11-01-23 18:02:03, Mike Rapoport wrote:
> On Wed, Jan 11, 2023 at 02:36:16PM +0100, Michal Hocko wrote:
> > On Wed 11-01-23 14:24:43, Mike Rapoport wrote:
> > > On Tue, Jan 10, 2023 at 05:54:10PM +0100, Michal Hocko wrote:
> > > > On Tue 10-01-23 17:23:58, Mike Rapoport wrote:
> > > > [...]
> > > > > +* ``ZONE_DMA`` and ``ZONE_DMA32`` represent memory suitable for DMA by
> > > > > + peripheral devices that cannot access all of the addressable memory.
> > > >
> > > > I think it would be better to not keep the historical DMA based meaning
> > > > and teach that to future developers. You can say something like
> > > >
> > > > ZONE_DMA and ZONE_DMA32 have historically been used for memory suitable
> > > > for DMA. For many years there have been better, more robust interfaces to
> > > > get memory with DMA specific requirements (Documentation/core-api/dma-api.rst).
> > >
> > > But even today ZONE_DMA(32) means that the memory is suitable for DMA. This
> > > is nicely encapsulated with dma APIs and there should be no new GFP_DMA
> > > users, but still memory outside ZONE_DMA is not suitable for DMA.
> >
> > Well, the thing is that ZONE_DMA means different things for different
> > architectures. For x86 it is effectively about ISA attached HW - which
> > means almost nothing these days. There is a plethora of other HW with
> > different address range constraints for DMA transfers so binding the zone
> > with DMA is more likely to cause confusion than to help.
>
> Ok, how about
>
> * ``ZONE_DMA`` and ``ZONE_DMA32`` historically represented memory suitable for
> DMA by peripheral devices that cannot access all of the addressable
> memory. For many years there have been better and more robust interfaces to
> get memory with DMA specific requirements (:ref:`DMA API <_dma_api>`), but
> ``ZONE_DMA`` and ``ZONE_DMA32`` still represent memory ranges that have
> restrictions on how they can be accessed.
> Depending on the architecture, either of these zone types, or even both of
> them, can be disabled at build time using ``CONFIG_ZONE_DMA`` and
> ``CONFIG_ZONE_DMA32`` configuration options. Some 64-bit platforms may need
> both zones as they support peripherals with different DMA addressing
> limitations.

Sounds better to me. Thanks!
At least ZONE_DMA32 is somewhat better defined as it represents a 32-bit
address range constraint. ZONE_DMA can be really different on different
arches. Probably good to have it here. Ideally we would have a reference
to how that range is established, but architectures are not unified in that
respect.
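
To make the "address range constraint" concrete, here is a small userspace
sketch that checks whether a physical range is reachable by a device limited
to a given number of address bits. The helper names `addr_bit_mask()` and
`range_fits_mask()` are hypothetical and exist only for this example (the mask
computation is similar in spirit to the kernel's `DMA_BIT_MASK()` macro). How
the actual zone limits are chosen is architecture specific and not modelled
here:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: highest address reachable with `bits` address bits. */
static uint64_t addr_bit_mask(unsigned int bits)
{
    return bits >= 64 ? UINT64_MAX : (UINT64_C(1) << bits) - 1;
}

/*
 * Hypothetical helper: true if the physical range [phys, phys + size) lies
 * entirely below the limit implied by `bits` address bits.
 */
static bool range_fits_mask(uint64_t phys, uint64_t size, unsigned int bits)
{
    uint64_t last;

    if (size == 0)
        return false;           /* an empty range is treated as invalid */

    last = phys + size - 1;
    if (last < phys)
        return false;           /* the range wraps around the address space */

    return last <= addr_bit_mask(bits);
}
```

For example, `range_fits_mask(0xFFFFF000, 0x1000, 32)` is true because the
last byte is at 0xFFFFFFFF, while the same start with a size of 0x2000 crosses
the 4G boundary and fails. With 24 bits the limit is 16M, which matches the
ISA-era x86 case mentioned earlier in the thread.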


--
Michal Hocko
SUSE Labs

2023-01-12 10:24:04

by Bagas Sanjaya

Subject: Re: [PATCH v2 2/2] docs/mm: Physical Memory: add structure, introduction and nodes description

On Tue, Jan 10, 2023 at 05:23:58PM +0200, Mike Rapoport wrote:
> diff --git a/Documentation/mm/physical_memory.rst b/Documentation/mm/physical_memory.rst
> index 2ab7b8c1c863..9ad42ff22d88 100644
> --- a/Documentation/mm/physical_memory.rst
> +++ b/Documentation/mm/physical_memory.rst
> @@ -3,3 +3,343 @@
> ===============
> Physical Memory
> ===============
> +
> +Linux is available for a wide range of architectures so there is a need for an
> +architecture-independent abstraction to represent the physical memory. This
> +chapter describes the structures used to manage physical memory in a running
> +system.
> +
> +The first principal concept prevalent in the memory management is
> +`Non-Uniform Memory Access (NUMA)
> +<https://en.wikipedia.org/wiki/Non-uniform_memory_access>`_.
> +With multi-core and multi-socket machines, memory may be arranged into banks
> +that incur a different cost to access depending on the “distance” from the
> +processor. For example, there might be a bank of memory assigned to each CPU or
> +a bank of memory very suitable for DMA near peripheral devices.
> +
> +Each bank is called a node and the concept is represented under Linux by a
> +``struct pglist_data`` even if the architecture is UMA. This structure is
> +always referenced by its typedef ``pg_data_t``. A ``pg_data_t`` structure
> +for a particular node can be referenced by ``NODE_DATA(nid)`` macro where
> +``nid`` is the ID of that node.
> +
> +For NUMA architectures, the node structures are allocated by the architecture
> +specific code early during boot. Usually, these structures are allocated
> +locally on the memory bank they represent. For UMA architectures, only one
> +static ``pg_data_t`` structure called ``contig_page_data`` is used. Nodes will
> +be discussed further in Section :ref:`Nodes <nodes>`.
> +
> +The entire physical address space is partitioned into one or more blocks
> +called zones which represent ranges within memory. These ranges are usually
> +determined by architectural constraints for accessing the physical memory.
> +The memory range within a node that corresponds to a particular zone is
> +described by a ``struct zone``, typedeffed to ``zone_t``. Each zone has
> +one of the types described below.
> +
> +* ``ZONE_DMA`` and ``ZONE_DMA32`` represent memory suitable for DMA by
> + peripheral devices that cannot access all of the addressable memory.
> + Depending on the architecture, either of these zone types, or even both of them,
> + can be disabled at build time using ``CONFIG_ZONE_DMA`` and
> + ``CONFIG_ZONE_DMA32`` configuration options. Some 64-bit platforms may need
> + both zones as they support peripherals with different DMA addressing
> + limitations.
> +
> +* ``ZONE_NORMAL`` is for normal memory that can be accessed by the kernel all
> + the time. DMA operations can be performed on pages in this zone if the DMA
> + devices support transfers to all addressable memory. ``ZONE_NORMAL`` is
> + always enabled.
> +
> +* ``ZONE_HIGHMEM`` is the part of the physical memory that is not covered by a
> + permanent mapping in the kernel page tables. The memory in this zone is only
> + accessible to the kernel using temporary mappings. This zone is available
> + only on some 32-bit architectures and is enabled with ``CONFIG_HIGHMEM``.
> +
> +* ``ZONE_MOVABLE`` is for normal accessible memory, just like ``ZONE_NORMAL``.
> + The difference is that most pages in ``ZONE_MOVABLE`` are movable. That means
> + that while virtual addresses of these pages do not change, their content may
> + move between different physical pages. ``ZONE_MOVABLE`` is only enabled when
> + one of ``kernelcore``, ``movablecore`` and ``movable_node`` parameters is
> + present in the kernel command line. See :ref:`Page migration
> + <page_migration>` for additional details.
> +
> +* ``ZONE_DEVICE`` represents memory residing on devices such as PMEM and GPU.
> + It has different characteristics than RAM zone types and it exists to provide
> + :ref:`struct page <Pages>` and memory map services for device driver
> + identified physical address ranges. ``ZONE_DEVICE`` is enabled with
> + configuration option ``CONFIG_ZONE_DEVICE``.
> +
> +It is important to note that many kernel operations can only take place using
> +``ZONE_NORMAL`` so it is the most performance critical zone. Zones are
> +discussed further in Section :ref:`Zones <zones>`.
> +
> +The relation between node and zone extents is determined by the physical memory
> +map reported by the firmware, architectural constraints for memory addressing
> +and certain parameters in the kernel command line.
> +
> +For example, with 32-bit kernel on an x86 UMA machine with 2 Gbytes of RAM the
> +entire memory will be on node 0 and there will be three zones: ``ZONE_DMA``,
> +``ZONE_NORMAL`` and ``ZONE_HIGHMEM``::
> +
> + 0 2G
> + +-------------------------------------------------------------+
> + | node 0 |
> + +-------------------------------------------------------------+
> +
> + 0 16M 896M 2G
> + +----------+-----------------------+--------------------------+
> + | ZONE_DMA | ZONE_NORMAL | ZONE_HIGHMEM |
> + +----------+-----------------------+--------------------------+
> +
> +
> +With a kernel built with ``ZONE_DMA`` disabled and ``ZONE_DMA32`` enabled and
> +booted with ``movablecore=80%`` parameter on an arm64 machine with 16 Gbytes of
> +RAM equally split between two nodes, there will be ``ZONE_DMA32``,
> +``ZONE_NORMAL`` and ``ZONE_MOVABLE`` on node 0, and ``ZONE_NORMAL`` and
> +``ZONE_MOVABLE`` on node 1::
> +
> +
> + 1G 9G 17G
> + +--------------------------------+ +--------------------------+
> + | node 0 | | node 1 |
> + +--------------------------------+ +--------------------------+
> +
> + 1G 4G 4200M 9G 9320M 17G
> + +---------+----------+-----------+ +------------+-------------+
> + | DMA32 | NORMAL | MOVABLE | | NORMAL | MOVABLE |
> + +---------+----------+-----------+ +------------+-------------+
> +
> +.. _nodes:
> +
> +Nodes
> +=====
> +
> +As we have mentioned, each node in memory is described by a ``pg_data_t`` which
> +is a typedef for a ``struct pglist_data``. When allocating a page, by default
> +Linux uses a node-local allocation policy to allocate memory from the node
> +closest to the running CPU. As processes tend to run on the same CPU, it is
> +likely the memory from the current node will be used. The allocation policy can
> +be controlled by users as described in
> +Documentation/admin-guide/mm/numa_memory_policy.rst.
> +
> +Most NUMA architectures maintain an array of pointers to the node
> +structures. The actual structures are allocated early during boot when
> +architecture specific code parses the physical memory map reported by the
> +firmware. The bulk of the node initialization happens slightly later in the
> +boot process by free_area_init() function, described later in Section
> +:ref:`Initialization <initialization>`.
> +
> +
> +Along with the node structures, kernel maintains an array of ``nodemask_t``
> +bitmasks called ``node_states``. Each bitmask in this array represents a set of
> +nodes with particular properties as defined by ``enum node_states``:
> +
> +``N_POSSIBLE``
> + The node could become online at some point.
> +``N_ONLINE``
> + The node is online.
> +``N_NORMAL_MEMORY``
> + The node has regular memory.
> +``N_HIGH_MEMORY``
> + The node has regular or high memory. When ``CONFIG_HIGHMEM`` is disabled, it is
> + aliased to ``N_NORMAL_MEMORY``.
> +``N_MEMORY``
> + The node has memory (regular, high, movable).
> +``N_CPU``
> + The node has one or more CPUs.
> +
> +For each node that has a property described above, the bit corresponding to the
> +node ID in the ``node_states[<property>]`` bitmask is set.
> +
> +For example, for node 2 with normal memory and CPUs, bit 2 will be set in ::
> +
> + node_states[N_POSSIBLE]
> + node_states[N_ONLINE]
> + node_states[N_NORMAL_MEMORY]
> + node_states[N_MEMORY]
> + node_states[N_CPU]
> +
> +For various operations possible with nodemasks please refer to
> +``include/linux/nodemask.h``.
> +
> +Among other things, nodemasks are used to provide macros for node traversal,
> +namely ``for_each_node()`` and ``for_each_online_node()``.
> +
> +For instance, to call a function foo() for each online node::
> +
> + for_each_online_node(nid) {
> + pg_data_t *pgdat = NODE_DATA(nid);
> +
> + foo(pgdat);
> + }
> +
> +Node structure
> +--------------
> +
> +The node structure ``struct pglist_data`` is declared in
> +``include/linux/mmzone.h``. Here we briefly describe fields of this
> +structure:
> +
> +General
> +~~~~~~~
> +
> +``node_zones``
> + The zones for this node. Not all of the zones may be populated, but it is
> + the full list. It is referenced by this node's node_zonelists as well as
> + other node's node_zonelists.
> +
> +``node_zonelists``
> + The list of all zones in all nodes. This list defines the order of zones
> + that allocations are preferred from. The ``node_zonelists`` is set up by
> + ``build_zonelists()`` in ``mm/page_alloc.c`` during the initialization of
> + core memory management structures.
> +
> +``nr_zones``
> + Number of populated zones in this node.
> +
> +``node_mem_map``
> + For UMA systems that use the FLATMEM memory model, node 0's
> + ``node_mem_map`` is an array of struct pages representing each physical frame.
> +
> +``node_page_ext``
> + For UMA systems that use the FLATMEM memory model, node 0's
> + ``node_page_ext`` is an array of extensions of struct pages. Available only
> + in kernels built with ``CONFIG_PAGE_EXTENSION`` enabled.
> +
> +``node_start_pfn``
> + The page frame number of the starting page frame in this node.
> +
> +``node_present_pages``
> + Total number of physical pages present in this node.
> +
> +``node_spanned_pages``
> + Total size of physical page range, including holes.
> +
> +``node_size_lock``
> + A lock that protects the fields defining the node extents. Only defined when
> + at least one of ``CONFIG_MEMORY_HOTPLUG`` or
> + ``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` configuration options are enabled.
> + ``pgdat_resize_lock()`` and ``pgdat_resize_unlock()`` are provided to
> + manipulate ``node_size_lock`` without checking for ``CONFIG_MEMORY_HOTPLUG``
> + or ``CONFIG_DEFERRED_STRUCT_PAGE_INIT``.
> +
> +``node_id``
> + The Node ID (NID) of the node, starts at 0.
> +
> +``totalreserve_pages``
> + This is a per-node reserve of pages that are not available to userspace
> + allocations.
> +
> +``first_deferred_pfn``
> + If memory initialization on large machines is deferred then this is the first
> + PFN that needs to be initialized. Defined only when
> + ``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` is enabled.
> +
> +``deferred_split_queue``
> + Per-node queue of huge pages whose split was deferred. Defined only when ``CONFIG_TRANSPARENT_HUGEPAGE`` is enabled.
> +
> +``__lruvec``
> + Per-node lruvec holding LRU lists and related parameters. Used only when
> + memory cgroups are disabled. It should not be accessed directly, use
> + ``mem_cgroup_lruvec()`` to look up lruvecs instead.
> +
> +Reclaim control
> +~~~~~~~~~~~~~~~
> +
> +See also :ref:`Page Reclaim <page_reclaim>`.
> +
> +``kswapd``
> + Per-node instance of kswapd kernel thread.
> +
> +``kswapd_wait``, ``pfmemalloc_wait``, ``reclaim_wait``
> + Wait queues used to synchronize memory reclaim tasks.
> +
> +``nr_writeback_throttled``
> + Number of tasks that are throttled waiting on dirty pages to clean.
> +
> +``nr_reclaim_start``
> + Number of pages written while reclaim is throttled waiting for writeback.
> +
> +``kswapd_order``
> + Controls the order kswapd tries to reclaim
> +
> +``kswapd_highest_zoneidx``
> + The highest zone index to be reclaimed by kswapd
> +
> +``kswapd_failures``
> + Number of runs kswapd was unable to reclaim any pages
> +
> +``min_unmapped_pages``
> + Minimal number of unmapped file backed pages that cannot be reclaimed.
> + Determined by ``vm.min_unmapped_ratio`` sysctl. Only defined when
> + ``CONFIG_NUMA`` is enabled.
> +
> +``min_slab_pages``
> + Minimal number of SLAB pages that cannot be reclaimed. Determined by
> + ``vm.min_slab_ratio`` sysctl. Only defined when ``CONFIG_NUMA`` is enabled.
> +
> +``flags``
> + Flags controlling reclaim behavior.
> +
> +Compaction control
> +~~~~~~~~~~~~~~~~~~
> +
> +``kcompactd_max_order``
> + Page order that kcompactd should try to achieve.
> +
> +``kcompactd_highest_zoneidx``
> + The highest zone index to be compacted by kcompactd.
> +
> +``kcompactd_wait``
> + Wait queue used to synchronize memory compaction tasks.
> +
> +``kcompactd``
> + Per-node instance of kcompactd kernel thread.
> +
> +``proactive_compact_trigger``
> + Determines if proactive compaction is enabled. Controlled by
> + ``vm.compaction_proactiveness`` sysctl.
> +
> +Statistics
> +~~~~~~~~~~
> +
> +``per_cpu_nodestats``
> + Per-CPU VM statistics for the node
> +
> +``vm_stat``
> + VM statistics for the node.
> +
> +.. _zones:
> +
> +Zones
> +=====
> +
> +.. admonition:: Stub
> +
> + This section is incomplete. Please list and describe the appropriate fields.
> +
> +.. _pages:
> +
> +Pages
> +=====
> +
> +.. admonition:: Stub
> +
> + This section is incomplete. Please list and describe the appropriate fields.
> +
> +.. _folios:
> +
> +Folios
> +======
> +
> +.. admonition:: Stub
> +
> + This section is incomplete. Please list and describe the appropriate fields.
> +
> +.. _initialization:
> +
> +Initialization
> +==============
> +
> +.. admonition:: Stub
> +
> + This section is incomplete. Please list and describe the appropriate fields.

The doc LGTM, thanks. I leave the actual content review to mm people.

Reviewed-by: Bagas Sanjaya <[email protected]>

--
An old man doll... just what I always wanted! - Clara

