2022-04-07 14:33:39

by Yang Shi

Subject: [PATCH] mm: swap: determine swap device by using page nid

The swap devices are linked to per-node priority lists; the swap device
closer to a node has higher priority on that node's priority list.
This is supposed to improve I/O latency, particularly for some fast
devices. But the current code gets the nid by calling numa_node_id(),
which actually returns the nid of the node the reclaimer is running on
instead of the nid of the node that the page belongs to.

Pass the page's nid down to get_swap_pages() in order to pick up the
right swap device. But this doesn't work for the swap slots cache, which
is per cpu. We could skip the swap slots cache if the current node is
not the page's node, but that may be overkill. So keep using the current
node's swap slots cache. The issue was found by visual code inspection,
so it is not clear how much improvement could be achieved, due to the
lack of a suitable testing device. But in any case, the current code
does violate the design.

Cc: Huang Ying <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
include/linux/swap.h | 3 ++-
mm/swap_slots.c | 7 ++++---
mm/swapfile.c | 5 ++---
3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 27093b477c5f..e442cf6b61ea 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -497,7 +497,8 @@ extern void si_swapinfo(struct sysinfo *);
extern swp_entry_t get_swap_page(struct page *page);
extern void put_swap_page(struct page *page, swp_entry_t entry);
extern swp_entry_t get_swap_page_of_type(int);
-extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size);
+extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size,
+ int node);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
extern void swap_shmem_alloc(swp_entry_t);
extern int swap_duplicate(swp_entry_t);
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 2b5531840583..a1c5cf6a4302 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -264,7 +264,7 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
cache->cur = 0;
if (swap_slot_cache_active)
cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE,
- cache->slots, 1);
+ cache->slots, 1, numa_node_id());

return cache->nr;
}
@@ -305,12 +305,13 @@ swp_entry_t get_swap_page(struct page *page)
{
swp_entry_t entry;
struct swap_slots_cache *cache;
+ int nid = page_to_nid(page);

entry.val = 0;

if (PageTransHuge(page)) {
if (IS_ENABLED(CONFIG_THP_SWAP))
- get_swap_pages(1, &entry, HPAGE_PMD_NR);
+ get_swap_pages(1, &entry, HPAGE_PMD_NR, nid);
goto out;
}

@@ -342,7 +343,7 @@ swp_entry_t get_swap_page(struct page *page)
goto out;
}

- get_swap_pages(1, &entry, 1);
+ get_swap_pages(1, &entry, 1, nid);
out:
if (mem_cgroup_try_charge_swap(page, entry)) {
put_swap_page(page, entry);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 63c61f8b2611..151fffe0fd60 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1036,13 +1036,13 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
swap_range_free(si, offset, SWAPFILE_CLUSTER);
}

-int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
+int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size,
+ int node)
{
unsigned long size = swap_entry_size(entry_size);
struct swap_info_struct *si, *next;
long avail_pgs;
int n_ret = 0;
- int node;

/* Only single cluster request supported */
WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
@@ -1060,7 +1060,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
atomic_long_sub(n_goal * size, &nr_swap_pages);

start_over:
- node = numa_node_id();
plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
/* requeue si to after same-priority siblings */
plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
--
2.26.3


2022-04-07 15:26:19

by Aaron Lu

Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

On Wed, Apr 06, 2022 at 07:09:53PM -0700, Yang Shi wrote:
> The swap devices are linked to per node priority lists, the swap device
> closer to the node has higher priority on that node's priority list.
> This is supposed to improve I/O latency, particularly for some fast
> devices. But the current code gets nid by calling numa_node_id() which
> actually returns the nid that the reclaimer is running on instead of the
> nid that the page belongs to.
>

Right.

> Pass the page's nid dow to get_swap_pages() in order to pick up the
> right swap device. But it doesn't work for the swap slots cache which
> is per cpu. We could skip swap slots cache if the current node is not
> the page's node, but it may be overkilling. So keep using the current
> node's swap slots cache. The issue was found by visual code inspection
> so it is not sure how much improvement could be achieved due to lack of
> suitable testing device. But anyway the current code does violate the
> design.
>

I intentionally used the reclaimer's nid because I think when swapping
out to a device, it is faster when the device is on the same node as
the cpu.

Anyway, I think I can make a test case where the workload allocates all
its memory on the remote node and its workingset memory is larger than
the available memory, so that swap is triggered; then we can see which
way achieves better performance. Does that sound reasonable to you?
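
A minimal sketch of such a workload, assuming libnuma is available; the
node ids, buffer size and pass count below are illustrative only, not
the parameters of the actual test:

/*
 * Illustrative only: allocate the working set on a chosen (remote) node
 * while running on another node, and size it above the memory limit so
 * that rewriting it repeatedly forces reclaim and swap-out.
 * Build with: gcc workload.c -lnuma   (file name is hypothetical)
 */
#include <numa.h>
#include <stdio.h>

int main(void)
{
	const int run_node = 0;		/* node the process runs on */
	const int mem_node = 1;		/* node the memory is allocated on */
	const size_t size = 4UL << 30;	/* 4G per process, as in the test */

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA not available\n");
		return 1;
	}

	numa_run_on_node(run_node);	/* restrict CPU placement */

	char *buf = numa_alloc_onnode(size, mem_node);
	if (!buf) {
		fprintf(stderr, "numa_alloc_onnode failed\n");
		return 1;
	}

	/*
	 * Touch every page repeatedly; once the memory limit is exceeded,
	 * further writes force earlier pages out to the swap device.
	 */
	for (int pass = 0; pass < 4; pass++)
		for (size_t off = 0; off < size; off += 4096)
			buf[off] = (char)pass;

	numa_free(buf, size);
	return 0;
}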

> Cc: Huang Ying <[email protected]>
> Signed-off-by: Yang Shi <[email protected]>
> ---
> include/linux/swap.h | 3 ++-
> mm/swap_slots.c | 7 ++++---
> mm/swapfile.c | 5 ++---
> 3 files changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 27093b477c5f..e442cf6b61ea 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -497,7 +497,8 @@ extern void si_swapinfo(struct sysinfo *);
> extern swp_entry_t get_swap_page(struct page *page);
> extern void put_swap_page(struct page *page, swp_entry_t entry);
> extern swp_entry_t get_swap_page_of_type(int);
> -extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size);
> +extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size,
> + int node);
> extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> extern void swap_shmem_alloc(swp_entry_t);
> extern int swap_duplicate(swp_entry_t);
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 2b5531840583..a1c5cf6a4302 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -264,7 +264,7 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
> cache->cur = 0;
> if (swap_slot_cache_active)
> cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE,
> - cache->slots, 1);
> + cache->slots, 1, numa_node_id());
>
> return cache->nr;
> }
> @@ -305,12 +305,13 @@ swp_entry_t get_swap_page(struct page *page)
> {
> swp_entry_t entry;
> struct swap_slots_cache *cache;
> + int nid = page_to_nid(page);
>
> entry.val = 0;
>
> if (PageTransHuge(page)) {
> if (IS_ENABLED(CONFIG_THP_SWAP))
> - get_swap_pages(1, &entry, HPAGE_PMD_NR);
> + get_swap_pages(1, &entry, HPAGE_PMD_NR, nid);
> goto out;
> }
>
> @@ -342,7 +343,7 @@ swp_entry_t get_swap_page(struct page *page)
> goto out;
> }
>
> - get_swap_pages(1, &entry, 1);
> + get_swap_pages(1, &entry, 1, nid);
> out:
> if (mem_cgroup_try_charge_swap(page, entry)) {
> put_swap_page(page, entry);
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 63c61f8b2611..151fffe0fd60 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1036,13 +1036,13 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> swap_range_free(si, offset, SWAPFILE_CLUSTER);
> }
>
> -int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
> +int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size,
> + int node)
> {
> unsigned long size = swap_entry_size(entry_size);
> struct swap_info_struct *si, *next;
> long avail_pgs;
> int n_ret = 0;
> - int node;
>
> /* Only single cluster request supported */
> WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
> @@ -1060,7 +1060,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
> atomic_long_sub(n_goal * size, &nr_swap_pages);
>
> start_over:
> - node = numa_node_id();
> plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
> /* requeue si to after same-priority siblings */
> plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
> --
> 2.26.3
>
>

2022-04-07 16:01:34

by Michal Hocko

Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

[Cc Aaron who has introduced the per node swap changes]

On Wed 06-04-22 19:09:53, Yang Shi wrote:
> The swap devices are linked to per node priority lists, the swap device
> closer to the node has higher priority on that node's priority list.
> This is supposed to improve I/O latency, particularly for some fast
> devices. But the current code gets nid by calling numa_node_id() which
> actually returns the nid that the reclaimer is running on instead of the
> nid that the page belongs to.
>
> Pass the page's nid dow to get_swap_pages() in order to pick up the
> right swap device. But it doesn't work for the swap slots cache which
> is per cpu. We could skip swap slots cache if the current node is not
> the page's node, but it may be overkilling. So keep using the current
> node's swap slots cache. The issue was found by visual code inspection
> so it is not sure how much improvement could be achieved due to lack of
> suitable testing device. But anyway the current code does violate the
> design.

Do you have any perf numbers for this change?

> Cc: Huang Ying <[email protected]>
> Signed-off-by: Yang Shi <[email protected]>
> ---
> include/linux/swap.h | 3 ++-
> mm/swap_slots.c | 7 ++++---
> mm/swapfile.c | 5 ++---
> 3 files changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 27093b477c5f..e442cf6b61ea 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -497,7 +497,8 @@ extern void si_swapinfo(struct sysinfo *);
> extern swp_entry_t get_swap_page(struct page *page);
> extern void put_swap_page(struct page *page, swp_entry_t entry);
> extern swp_entry_t get_swap_page_of_type(int);
> -extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size);
> +extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size,
> + int node);
> extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> extern void swap_shmem_alloc(swp_entry_t);
> extern int swap_duplicate(swp_entry_t);
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 2b5531840583..a1c5cf6a4302 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -264,7 +264,7 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
> cache->cur = 0;
> if (swap_slot_cache_active)
> cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE,
> - cache->slots, 1);
> + cache->slots, 1, numa_node_id());
>
> return cache->nr;
> }
> @@ -305,12 +305,13 @@ swp_entry_t get_swap_page(struct page *page)
> {
> swp_entry_t entry;
> struct swap_slots_cache *cache;
> + int nid = page_to_nid(page);
>
> entry.val = 0;
>
> if (PageTransHuge(page)) {
> if (IS_ENABLED(CONFIG_THP_SWAP))
> - get_swap_pages(1, &entry, HPAGE_PMD_NR);
> + get_swap_pages(1, &entry, HPAGE_PMD_NR, nid);
> goto out;
> }
>
> @@ -342,7 +343,7 @@ swp_entry_t get_swap_page(struct page *page)
> goto out;
> }
>
> - get_swap_pages(1, &entry, 1);
> + get_swap_pages(1, &entry, 1, nid);
> out:
> if (mem_cgroup_try_charge_swap(page, entry)) {
> put_swap_page(page, entry);
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 63c61f8b2611..151fffe0fd60 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1036,13 +1036,13 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> swap_range_free(si, offset, SWAPFILE_CLUSTER);
> }
>
> -int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
> +int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size,
> + int node)
> {
> unsigned long size = swap_entry_size(entry_size);
> struct swap_info_struct *si, *next;
> long avail_pgs;
> int n_ret = 0;
> - int node;
>
> /* Only single cluster request supported */
> WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
> @@ -1060,7 +1060,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
> atomic_long_sub(n_goal * size, &nr_swap_pages);
>
> start_over:
> - node = numa_node_id();
> plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
> /* requeue si to after same-priority siblings */
> plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
> --
> 2.26.3

--
Michal Hocko
SUSE Labs

2022-04-07 20:48:23

by Yang Shi

Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

On Thu, Apr 7, 2022 at 12:52 AM Michal Hocko <[email protected]> wrote:
>
> [Cc Aaron who has introduced the per node swap changes]
>
> On Wed 06-04-22 19:09:53, Yang Shi wrote:
> > The swap devices are linked to per node priority lists, the swap device
> > closer to the node has higher priority on that node's priority list.
> > This is supposed to improve I/O latency, particularly for some fast
> > devices. But the current code gets nid by calling numa_node_id() which
> > actually returns the nid that the reclaimer is running on instead of the
> > nid that the page belongs to.
> >
> > Pass the page's nid dow to get_swap_pages() in order to pick up the
> > right swap device. But it doesn't work for the swap slots cache which
> > is per cpu. We could skip swap slots cache if the current node is not
> > the page's node, but it may be overkilling. So keep using the current
> > node's swap slots cache. The issue was found by visual code inspection
> > so it is not sure how much improvement could be achieved due to lack of
> > suitable testing device. But anyway the current code does violate the
> > design.
>
> Do you have any perf numbers for this change?

No, it was found by visual code inspection and offline discussion with
Huang Ying.

>
> > Cc: Huang Ying <[email protected]>
> > Signed-off-by: Yang Shi <[email protected]>
> > ---
> > include/linux/swap.h | 3 ++-
> > mm/swap_slots.c | 7 ++++---
> > mm/swapfile.c | 5 ++---
> > 3 files changed, 8 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 27093b477c5f..e442cf6b61ea 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -497,7 +497,8 @@ extern void si_swapinfo(struct sysinfo *);
> > extern swp_entry_t get_swap_page(struct page *page);
> > extern void put_swap_page(struct page *page, swp_entry_t entry);
> > extern swp_entry_t get_swap_page_of_type(int);
> > -extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size);
> > +extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size,
> > + int node);
> > extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> > extern void swap_shmem_alloc(swp_entry_t);
> > extern int swap_duplicate(swp_entry_t);
> > diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > index 2b5531840583..a1c5cf6a4302 100644
> > --- a/mm/swap_slots.c
> > +++ b/mm/swap_slots.c
> > @@ -264,7 +264,7 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
> > cache->cur = 0;
> > if (swap_slot_cache_active)
> > cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE,
> > - cache->slots, 1);
> > + cache->slots, 1, numa_node_id());
> >
> > return cache->nr;
> > }
> > @@ -305,12 +305,13 @@ swp_entry_t get_swap_page(struct page *page)
> > {
> > swp_entry_t entry;
> > struct swap_slots_cache *cache;
> > + int nid = page_to_nid(page);
> >
> > entry.val = 0;
> >
> > if (PageTransHuge(page)) {
> > if (IS_ENABLED(CONFIG_THP_SWAP))
> > - get_swap_pages(1, &entry, HPAGE_PMD_NR);
> > + get_swap_pages(1, &entry, HPAGE_PMD_NR, nid);
> > goto out;
> > }
> >
> > @@ -342,7 +343,7 @@ swp_entry_t get_swap_page(struct page *page)
> > goto out;
> > }
> >
> > - get_swap_pages(1, &entry, 1);
> > + get_swap_pages(1, &entry, 1, nid);
> > out:
> > if (mem_cgroup_try_charge_swap(page, entry)) {
> > put_swap_page(page, entry);
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 63c61f8b2611..151fffe0fd60 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1036,13 +1036,13 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> > swap_range_free(si, offset, SWAPFILE_CLUSTER);
> > }
> >
> > -int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
> > +int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size,
> > + int node)
> > {
> > unsigned long size = swap_entry_size(entry_size);
> > struct swap_info_struct *si, *next;
> > long avail_pgs;
> > int n_ret = 0;
> > - int node;
> >
> > /* Only single cluster request supported */
> > WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
> > @@ -1060,7 +1060,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
> > atomic_long_sub(n_goal * size, &nr_swap_pages);
> >
> > start_over:
> > - node = numa_node_id();
> > plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
> > /* requeue si to after same-priority siblings */
> > plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
> > --
> > 2.26.3
>
> --
> Michal Hocko
> SUSE Labs

2022-04-07 21:23:38

by Yang Shi

Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

On Thu, Apr 7, 2022 at 1:12 AM Aaron Lu <[email protected]> wrote:
>
> On Wed, Apr 06, 2022 at 07:09:53PM -0700, Yang Shi wrote:
> > The swap devices are linked to per node priority lists, the swap device
> > closer to the node has higher priority on that node's priority list.
> > This is supposed to improve I/O latency, particularly for some fast
> > devices. But the current code gets nid by calling numa_node_id() which
> > actually returns the nid that the reclaimer is running on instead of the
> > nid that the page belongs to.
> >
>
> Right.
>
> > Pass the page's nid dow to get_swap_pages() in order to pick up the
> > right swap device. But it doesn't work for the swap slots cache which
> > is per cpu. We could skip swap slots cache if the current node is not
> > the page's node, but it may be overkilling. So keep using the current
> > node's swap slots cache. The issue was found by visual code inspection
> > so it is not sure how much improvement could be achieved due to lack of
> > suitable testing device. But anyway the current code does violate the
> > design.
> >
>
> I intentionally used the reclaimer's nid because I think when swapping
> out to a device, it is faster when the device is on the same node as
> the cpu.

OK, the offline discussion with Huang Ying showed the design was to
use the page's nid in order to achieve better I/O performance (more
noticeable on faster devices), since the reclaimer may be running on a
different node from the reclaimed page.

>
> Anyway, I think I can make a test case where the workload allocates all
> its memory on the remote node and its workingset memory is larger then
> the available memory so swap is triggered, then we can see which way
> achieves better performance. Sounds reasonable to you?

Yeah, definitely, thank you so much. I don't have a fast enough device
on hand to show the difference right now. If you could get some data
it would be perfect.

BTW, this patch doesn't change the node used for the swap slots cache,
so a page may still end up on a swap device on a remote node if the
swap slots cache is used.
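
As a user-space toy model (not kernel code; the device names and their
per-node ordering are invented for illustration), the difference
between the two nid choices, and the cache caveat, looks roughly like
this:

/*
 * Toy model: each node has its own priority-ordered list of swap
 * devices, and the nid passed to the allocator decides which list is
 * walked.
 */
#include <stdio.h>

#define MAX_NODES 2
#define MAX_DEVS  2

static const char *avail_list[MAX_NODES][MAX_DEVS] = {
	/* node 0 prefers the device attached to node 0, and so on */
	{ "pmem0", "pmem1" },
	{ "pmem1", "pmem0" },
};

/* Stands in for get_swap_pages(..., node): take the highest-priority
 * device on that node's list (free-space checks ignored here). */
static const char *pick_swap_device(int nid)
{
	return avail_list[nid][0];
}

int main(void)
{
	int reclaimer_nid = 0;	/* node the reclaiming CPU runs on */
	int page_nid = 1;	/* node the reclaimed page lives on */

	printf("reclaimer nid (old behaviour): %s\n",
	       pick_swap_device(reclaimer_nid));
	printf("page nid (patched behaviour):  %s\n",
	       pick_swap_device(page_nid));

	/*
	 * Note: the per-cpu swap slots cache is still refilled with
	 * numa_node_id(), so entries served from the cache keep
	 * following the reclaimer's node.
	 */
	return 0;
}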

>
> > Cc: Huang Ying <[email protected]>
> > Signed-off-by: Yang Shi <[email protected]>
> > ---
> > include/linux/swap.h | 3 ++-
> > mm/swap_slots.c | 7 ++++---
> > mm/swapfile.c | 5 ++---
> > 3 files changed, 8 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 27093b477c5f..e442cf6b61ea 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -497,7 +497,8 @@ extern void si_swapinfo(struct sysinfo *);
> > extern swp_entry_t get_swap_page(struct page *page);
> > extern void put_swap_page(struct page *page, swp_entry_t entry);
> > extern swp_entry_t get_swap_page_of_type(int);
> > -extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size);
> > +extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size,
> > + int node);
> > extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> > extern void swap_shmem_alloc(swp_entry_t);
> > extern int swap_duplicate(swp_entry_t);
> > diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > index 2b5531840583..a1c5cf6a4302 100644
> > --- a/mm/swap_slots.c
> > +++ b/mm/swap_slots.c
> > @@ -264,7 +264,7 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
> > cache->cur = 0;
> > if (swap_slot_cache_active)
> > cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE,
> > - cache->slots, 1);
> > + cache->slots, 1, numa_node_id());
> >
> > return cache->nr;
> > }
> > @@ -305,12 +305,13 @@ swp_entry_t get_swap_page(struct page *page)
> > {
> > swp_entry_t entry;
> > struct swap_slots_cache *cache;
> > + int nid = page_to_nid(page);
> >
> > entry.val = 0;
> >
> > if (PageTransHuge(page)) {
> > if (IS_ENABLED(CONFIG_THP_SWAP))
> > - get_swap_pages(1, &entry, HPAGE_PMD_NR);
> > + get_swap_pages(1, &entry, HPAGE_PMD_NR, nid);
> > goto out;
> > }
> >
> > @@ -342,7 +343,7 @@ swp_entry_t get_swap_page(struct page *page)
> > goto out;
> > }
> >
> > - get_swap_pages(1, &entry, 1);
> > + get_swap_pages(1, &entry, 1, nid);
> > out:
> > if (mem_cgroup_try_charge_swap(page, entry)) {
> > put_swap_page(page, entry);
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 63c61f8b2611..151fffe0fd60 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1036,13 +1036,13 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> > swap_range_free(si, offset, SWAPFILE_CLUSTER);
> > }
> >
> > -int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
> > +int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size,
> > + int node)
> > {
> > unsigned long size = swap_entry_size(entry_size);
> > struct swap_info_struct *si, *next;
> > long avail_pgs;
> > int n_ret = 0;
> > - int node;
> >
> > /* Only single cluster request supported */
> > WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
> > @@ -1060,7 +1060,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
> > atomic_long_sub(n_goal * size, &nr_swap_pages);
> >
> > start_over:
> > - node = numa_node_id();
> > plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
> > /* requeue si to after same-priority siblings */
> > plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
> > --
> > 2.26.3
> >
> >

2022-04-20 15:11:44

by Aaron Lu

Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

On Thu, Apr 07, 2022 at 10:36:54AM -0700, Yang Shi wrote:
> On Thu, Apr 7, 2022 at 1:12 AM Aaron Lu <[email protected]> wrote:
> >
> > On Wed, Apr 06, 2022 at 07:09:53PM -0700, Yang Shi wrote:
> > > The swap devices are linked to per node priority lists, the swap device
> > > closer to the node has higher priority on that node's priority list.
> > > This is supposed to improve I/O latency, particularly for some fast
> > > devices. But the current code gets nid by calling numa_node_id() which
> > > actually returns the nid that the reclaimer is running on instead of the
> > > nid that the page belongs to.
> > >
> >
> > Right.
> >
> > > Pass the page's nid dow to get_swap_pages() in order to pick up the
> > > right swap device. But it doesn't work for the swap slots cache which
> > > is per cpu. We could skip swap slots cache if the current node is not
> > > the page's node, but it may be overkilling. So keep using the current
> > > node's swap slots cache. The issue was found by visual code inspection
> > > so it is not sure how much improvement could be achieved due to lack of
> > > suitable testing device. But anyway the current code does violate the
> > > design.
> > >
> >
> > I intentionally used the reclaimer's nid because I think when swapping
> > out to a device, it is faster when the device is on the same node as
> > the cpu.
>
> OK, the offline discussion with Huang Ying showed the design was to
> have page's nid in order to achieve better I/O performance (more
> noticeable on faster devices) since the reclaimer may be running on a
> different node from the reclaimed page.
>
> >
> > Anyway, I think I can make a test case where the workload allocates all
> > its memory on the remote node and its workingset memory is larger then
> > the available memory so swap is triggered, then we can see which way
> > achieves better performance. Sounds reasonable to you?
>
> Yeah, definitely, thank you so much. I don't have a fast enough device
> by hand to show the difference right now. If you could get some data
> it would be perfect.
>

Failed to find a test box that has two NVMe disks attached to different
nodes, and since Shanghai is locked down right now, we couldn't install
another NVMe on the box, so I figured it might be OK to test on a box
that has a single NVMe attached to node 0 like this:

1) restrict the test processes to run on node 0 and allocate on node 1;
2) restrict the test processes to run on node 1 and allocate on node 0.

In case 1), the reclaimer's node id is the same as the swap device's,
so it matches the current behaviour; in case 2), the page's node id is
the same as the swap device's, so it matches what your patch proposed.

The test I used is vm-scalability/case-swap-w-rand:
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-swap-w-seq
which spawns $nr_task processes, each of which mmaps $size and then
randomly writes to that area. I set nr_task=32 and $size=4G, so a total
of 128G of memory is needed, and I used memory.limit_in_bytes to
restrict the available memory to 64G, to make sure swap is triggered.

The reason a cgroup is used is to avoid waking up the per-node kswapd,
which could trigger swapping with the reclaimer, page and swap device
all having the same node id.

And I don't see a measurable difference in the result:
case1(using reclaimer's node id) vm-scalability.throughput: 10574 KB/s
case2(using page's node id) vm-scalability.throughput: 10567 KB/s

My interpretation of the result is that, when reclaiming a remote page,
it doesn't matter much which swap device is used if the swap device is
an IO device.

Later Ying reminded me that we have a test box with optane installed on
different nodes, so I also tested there: an Icelake 2-socket server
with 2 optane installed, one on each node. I did the test there like this:
1) restrict the test processes to run on node 0 and allocate on node 1
and only swapon pmem0, which is the optane backed swap device on node 0;
2) restrict the test processes to run on node 0 and allocate on node 1
and only swapon pmem1, which is the optane backed swap device on node 1.

So case 1) is current behaviour and case 2) is what your patch proposed.

With the same test and the same nr_task/size, the result is:
case1(using reclaimer's node id) vm-scalability.throughput: 71033 KB/s
case2(using page's node id) vm-scalability.throughput: 58753 KB/s

The result suggests that when using a memory-like device as the swap
device, it's better to use the reclaimer's node id when swapping.

What do you think of these tests and results?

Thanks,
Aaron

> BTW, this patch doesn't change the node for swap slots cache, so it
> may still use the swap device on a remote node if swap slots cache is
> used.
>
> >
> > > Cc: Huang Ying <[email protected]>
> > > Signed-off-by: Yang Shi <[email protected]>
> > > ---
> > > include/linux/swap.h | 3 ++-
> > > mm/swap_slots.c | 7 ++++---
> > > mm/swapfile.c | 5 ++---
> > > 3 files changed, 8 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index 27093b477c5f..e442cf6b61ea 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -497,7 +497,8 @@ extern void si_swapinfo(struct sysinfo *);
> > > extern swp_entry_t get_swap_page(struct page *page);
> > > extern void put_swap_page(struct page *page, swp_entry_t entry);
> > > extern swp_entry_t get_swap_page_of_type(int);
> > > -extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size);
> > > +extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size,
> > > + int node);
> > > extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> > > extern void swap_shmem_alloc(swp_entry_t);
> > > extern int swap_duplicate(swp_entry_t);
> > > diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > > index 2b5531840583..a1c5cf6a4302 100644
> > > --- a/mm/swap_slots.c
> > > +++ b/mm/swap_slots.c
> > > @@ -264,7 +264,7 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
> > > cache->cur = 0;
> > > if (swap_slot_cache_active)
> > > cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE,
> > > - cache->slots, 1);
> > > + cache->slots, 1, numa_node_id());
> > >
> > > return cache->nr;
> > > }
> > > @@ -305,12 +305,13 @@ swp_entry_t get_swap_page(struct page *page)
> > > {
> > > swp_entry_t entry;
> > > struct swap_slots_cache *cache;
> > > + int nid = page_to_nid(page);
> > >
> > > entry.val = 0;
> > >
> > > if (PageTransHuge(page)) {
> > > if (IS_ENABLED(CONFIG_THP_SWAP))
> > > - get_swap_pages(1, &entry, HPAGE_PMD_NR);
> > > + get_swap_pages(1, &entry, HPAGE_PMD_NR, nid);
> > > goto out;
> > > }
> > >
> > > @@ -342,7 +343,7 @@ swp_entry_t get_swap_page(struct page *page)
> > > goto out;
> > > }
> > >
> > > - get_swap_pages(1, &entry, 1);
> > > + get_swap_pages(1, &entry, 1, nid);
> > > out:
> > > if (mem_cgroup_try_charge_swap(page, entry)) {
> > > put_swap_page(page, entry);
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 63c61f8b2611..151fffe0fd60 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -1036,13 +1036,13 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> > > swap_range_free(si, offset, SWAPFILE_CLUSTER);
> > > }
> > >
> > > -int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
> > > +int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size,
> > > + int node)
> > > {
> > > unsigned long size = swap_entry_size(entry_size);
> > > struct swap_info_struct *si, *next;
> > > long avail_pgs;
> > > int n_ret = 0;
> > > - int node;
> > >
> > > /* Only single cluster request supported */
> > > WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
> > > @@ -1060,7 +1060,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
> > > atomic_long_sub(n_goal * size, &nr_swap_pages);
> > >
> > > start_over:
> > > - node = numa_node_id();
> > > plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
> > > /* requeue si to after same-priority siblings */
> > > plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
> > > --
> > > 2.26.3
> > >
> > >
>

2022-04-22 09:15:25

by Huang, Ying

Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

On Fri, 2022-04-22 at 14:24 +0800, Aaron Lu wrote:
> On Thu, Apr 21, 2022 at 04:34:09PM +0800, [email protected] wrote:
> > On Thu, 2022-04-21 at 16:17 +0800, Aaron Lu wrote:
> > > On Thu, Apr 21, 2022 at 03:49:21PM +0800, [email protected] wrote:
>
> ... ...
>
> > > > For swap-in latency, we can use pmbench, which can output latency
> > > > information.
> > > >
> > >
> > > OK, I'll give pmbench a run, thanks for the suggestion.
> >
> > Better to construct a senario with more swapin than swapout. For
> > example, start a memory eater, then kill it later.
>
> What about vm-scalability/case-swapin?
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-swapin
>
> I think you are pretty familiar with it but still:
> 1) it starts $nr_task processes and each mmaps $size/$nr_task area and
>    then consumes the memory, after this, it waits for a signal;
> 2) start another process to consume $size memory to push the memory in
>    step 1) to swap device;
> 3) kick processes in step 1) to start accessing their memory, thus
>    trigger swapins. The metric of this testcase is the swapin throughput.
>
> I plan to restrict the cgroup's limit to $size.
>
> Considering there is only one NVMe drive attached to node 0, I will run
> the test as described before:
> 1) bind processes to run on node 0, allocate on node 1 to test the
>    performance when reclaimer's node id is the same as swap device's.
> 2) bind processes to run on node 1, allocate on node 0 to test the
>    performance when page's node id is the same as swap device's.
>
> Ying and Yang,
>
> Let me know what you think about the case used and the way the test is
> conducted.

The test case looks good to me. And do you have a way to measure swap-in
latency? It would be better to also compare between enabling and
disabling per-node swap device support, to make sure per-node support
has a performance impact on this system.
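
For reference, a minimal user-space sketch of one way to measure
swap-in latency directly (not what was used in this thread); it assumes
a kernel with MADV_PAGEOUT (5.4+) and simply times the faults when the
pages are touched again:

/*
 * Illustration only: push a buffer out to swap with MADV_PAGEOUT, then
 * time how long each page takes to fault back in.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

#define PAGE_SZ 4096UL

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
	size_t size = 1UL << 30;	/* 1G; adjust to the setup at hand */
	size_t pages = size / PAGE_SZ;
	long long total = 0;

	char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(buf, 1, size);			/* populate the pages */
	if (madvise(buf, size, MADV_PAGEOUT))	/* ask for them to be swapped out */
		perror("madvise(MADV_PAGEOUT)");

	for (size_t i = 0; i < pages; i++) {
		long long t0 = now_ns();

		buf[i * PAGE_SZ] += 1;		/* major fault if the page was swapped out */
		total += now_ns() - t0;
	}

	printf("avg fault latency: %lld ns over %zu pages\n",
	       total / (long long)pages, pages);
	return 0;
}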

Best Regards,
Huang, Ying


2022-04-22 19:04:53

by Aaron Lu

Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

On Thu, Apr 21, 2022 at 04:17:20PM +0800, Aaron Lu wrote:

> The full vmstat is attached.

Now really attach it.


Attachments:
(No filename) (119.00 B)
vmstat (35.87 kB)

2022-04-22 19:16:19

by Huang, Ying

Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

On Fri, 2022-04-22 at 14:43 +0800, Aaron Lu wrote:
> On Fri, Apr 22, 2022 at 02:27:45PM +0800, [email protected] wrote:
> > On Fri, 2022-04-22 at 14:24 +0800, Aaron Lu wrote:
> > > On Thu, Apr 21, 2022 at 04:34:09PM +0800, [email protected] wrote:
> > > > On Thu, 2022-04-21 at 16:17 +0800, Aaron Lu wrote:
> > > > > On Thu, Apr 21, 2022 at 03:49:21PM +0800, [email protected] wrote:
> > >
> > > ... ...
> > >
> > > > > > For swap-in latency, we can use pmbench, which can output latency
> > > > > > information.
> > > > > >
> > > > >
> > > > > OK, I'll give pmbench a run, thanks for the suggestion.
> > > >
> > > > Better to construct a senario with more swapin than swapout. For
> > > > example, start a memory eater, then kill it later.
> > >
> > > What about vm-scalability/case-swapin?
> > > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-swapin
> > >
> > > I think you are pretty familiar with it but still:
> > > 1) it starts $nr_task processes and each mmaps $size/$nr_task area and
> > >    then consumes the memory, after this, it waits for a signal;
> > > 2) start another process to consume $size memory to push the memory in
> > >    step 1) to swap device;
> > > 3) kick processes in step 1) to start accessing their memory, thus
> > >    trigger swapins. The metric of this testcase is the swapin throughput.
> > >
> > > I plan to restrict the cgroup's limit to $size.
> > >
> > > Considering there is only one NVMe drive attached to node 0, I will run
> > > the test as described before:
> > > 1) bind processes to run on node 0, allocate on node 1 to test the
> > >    performance when reclaimer's node id is the same as swap device's.
> > > 2) bind processes to run on node 1, allocate on node 0 to test the
> > >    performance when page's node id is the same as swap device's.
> > >
> > > Ying and Yang,
> > >
> > > Let me know what you think about the case used and the way the test is
> > > conducted.
> >
> > The test case looks good to me. And, do you have a way to measure swap
> > in latency? Better to compare between enabling and disabling per-node
>
> By swap in latency, do you mean the time it takes for a fault that is
> served by swap in?
>
> Since the test is swap in only, I think throughput can tell us the
> average swap in latency?
>

Yes. Given the same parallel level, the average swap-in latency is
reflected in the throughput: each task waits on one fault at a time, so
throughput is roughly proportional to (parallel level) / (average
latency).

> > swap device support too to make sure per-node support has performance
> > impact on this system.
>
> I think we can tell by conducting two more tests:
> 1) bind processes to run on node 0, allocate on node 0;
> 2) bind processes to run on node 1, allocate on node 1.
> If case 1) is faster, we can say per-node support has performance impact
> on this system.

At least we can measure whether cross-node latency matters with this
test.

Best Regards,
Huang, Ying


2022-04-22 19:21:10

by Yang Shi

Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

On Thu, Apr 21, 2022 at 11:24 PM Aaron Lu <[email protected]> wrote:
>
> On Thu, Apr 21, 2022 at 04:34:09PM +0800, [email protected] wrote:
> > On Thu, 2022-04-21 at 16:17 +0800, Aaron Lu wrote:
> > > On Thu, Apr 21, 2022 at 03:49:21PM +0800, [email protected] wrote:
>
> ... ...
>
> > > > For swap-in latency, we can use pmbench, which can output latency
> > > > information.
> > > >
> > >
> > > OK, I'll give pmbench a run, thanks for the suggestion.
> >
> > Better to construct a senario with more swapin than swapout. For
> > example, start a memory eater, then kill it later.
>
> What about vm-scalability/case-swapin?
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-swapin
>
> I think you are pretty familiar with it but still:
> 1) it starts $nr_task processes and each mmaps $size/$nr_task area and
> then consumes the memory, after this, it waits for a signal;
> 2) start another process to consume $size memory to push the memory in
> step 1) to swap device;
> 3) kick processes in step 1) to start accessing their memory, thus
> trigger swapins. The metric of this testcase is the swapin throughput.
>
> I plan to restrict the cgroup's limit to $size.
>
> Considering there is only one NVMe drive attached to node 0, I will run
> the test as described before:
> 1) bind processes to run on node 0, allocate on node 1 to test the
> performance when reclaimer's node id is the same as swap device's.
> 2) bind processes to run on node 1, allocate on node 0 to test the
> performance when page's node id is the same as swap device's.
>
> Ying and Yang,
>
> Let me know what you think about the case used and the way the test is
> conducted.

Looks fine to me. To measure the latency, you could also try the below
bpftrace script:

#! /usr/bin/bpftrace

// Record the entry time of each swap_readpage() call per thread.
kprobe:swap_readpage
{
	@start[tid] = nsecs;
}

// On return, add the elapsed time (in us) to a per-command histogram.
kretprobe:swap_readpage
/@start[tid]/
{
	@us[comm] = hist((nsecs - @start[tid]) / 1000);
	delete(@start[tid]);
}

2022-04-22 19:37:19

by Huang, Ying

Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

On Wed, 2022-04-20 at 16:33 +0800, Aaron Lu wrote:
> On Thu, Apr 07, 2022 at 10:36:54AM -0700, Yang Shi wrote:
> > On Thu, Apr 7, 2022 at 1:12 AM Aaron Lu <[email protected]> wrote:
> > >
> > > On Wed, Apr 06, 2022 at 07:09:53PM -0700, Yang Shi wrote:
> > > > The swap devices are linked to per node priority lists, the swap device
> > > > closer to the node has higher priority on that node's priority list.
> > > > This is supposed to improve I/O latency, particularly for some fast
> > > > devices. But the current code gets nid by calling numa_node_id() which
> > > > actually returns the nid that the reclaimer is running on instead of the
> > > > nid that the page belongs to.
> > > >
> > >
> > > Right.
> > >
> > > > Pass the page's nid dow to get_swap_pages() in order to pick up the
> > > > right swap device. But it doesn't work for the swap slots cache which
> > > > is per cpu. We could skip swap slots cache if the current node is not
> > > > the page's node, but it may be overkilling. So keep using the current
> > > > node's swap slots cache. The issue was found by visual code inspection
> > > > so it is not sure how much improvement could be achieved due to lack of
> > > > suitable testing device. But anyway the current code does violate the
> > > > design.
> > > >
> > >
> > > I intentionally used the reclaimer's nid because I think when swapping
> > > out to a device, it is faster when the device is on the same node as
> > > the cpu.
> >
> > OK, the offline discussion with Huang Ying showed the design was to
> > have page's nid in order to achieve better I/O performance (more
> > noticeable on faster devices) since the reclaimer may be running on a
> > different node from the reclaimed page.
> >
> > >
> > > Anyway, I think I can make a test case where the workload allocates all
> > > its memory on the remote node and its workingset memory is larger then
> > > the available memory so swap is triggered, then we can see which way
> > > achieves better performance. Sounds reasonable to you?
> >
> > Yeah, definitely, thank you so much. I don't have a fast enough device
> > by hand to show the difference right now. If you could get some data
> > it would be perfect.
> >
>
> Failed to find a test box that has two NVMe disks attached to different
> nodes and since Shanghai is locked down right now, we couldn't install
> another NVMe on the box so I figured it might be OK to test on a box that
> has a single NVMe attached to node 0 like this:
>
> 1) restrict the test processes to run on node 0 and allocate on node 1;
> 2) restrict the test processes to run on node 1 and allocate on node 0.
>
> In case 1), the reclaimer's node id is the same as the swap device's so
> it's the same as current behaviour and in case 2), the page's node id is
> the same as the swap device's so it's what your patch proposed.
>
> The test I used is vm-scalability/case-swap-w-rand:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-swap-w-seq
> which spawns $nr_task processes and each will mmap $size and then
> randomly write to that area. I set nr_task=32 and $size=4G, so a total
> of 128G memory will be needed and I used memory.limit_in_bytes to
> restrict the available memory to 64G, to make sure swap is triggered.
>
> The reason why cgroup is used is to avoid waking up the per-node kswapd
> which can trigger swapping with reclaimer/page/swap device all having the
> same node id.
>
> And I don't see a measuable difference from the result:
> case1(using reclaimer's node id) vm-scalability.throughput: 10574 KB/s
> case2(using page's node id) vm-scalability.throughput: 10567 KB/s
>
> My interpretation of the result is, when reclaiming a remote page, it
> doesn't matter much which swap device to use if the swap device is a IO
> device.
>
> Later Ying reminded me we have test box that has optane installed on
> different nodes so I also tested there: Icelake 2 sockets server with 2
> optane installed on each node. I did the test there like this:
> 1) restrict the test processes to run on node 0 and allocate on node 1
>    and only swapon pmem0, which is the optane backed swap device on node 0;
> 2) restrict the test processes to run on node 0 and allocate on node 1
>    and only swapon pmem1, which is the optane backed swap device on node 1.
>
> So case 1) is current behaviour and case 2) is what your patch proposed.
>
> With the same test and the same nr_task/size, the result is:
> case1(using reclaimer's node id) vm-scalability.throughput: 71033 KB/s
> case2(using page's node id) vm-scalability.throughput: 58753 KB/s
>

The per-node swap device support is more about swap-in latency than
swap-out throughput. I suspect the test case is more about swap-out
throughput. perf profiling can show this.

For swap-in latency, we can use pmbench, which can output latency
information.

Best Regards,
Huang, Ying


[snip]

2022-04-22 19:49:31

by Aaron Lu

Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

On Thu, Apr 21, 2022 at 03:49:21PM +0800, [email protected] wrote:
> On Wed, 2022-04-20 at 16:33 +0800, Aaron Lu wrote:
> > On Thu, Apr 07, 2022 at 10:36:54AM -0700, Yang Shi wrote:
> > > On Thu, Apr 7, 2022 at 1:12 AM Aaron Lu <[email protected]> wrote:
> > > >
> > > > On Wed, Apr 06, 2022 at 07:09:53PM -0700, Yang Shi wrote:
> > > > > The swap devices are linked to per node priority lists, the swap device
> > > > > closer to the node has higher priority on that node's priority list.
> > > > > This is supposed to improve I/O latency, particularly for some fast
> > > > > devices. But the current code gets nid by calling numa_node_id() which
> > > > > actually returns the nid that the reclaimer is running on instead of the
> > > > > nid that the page belongs to.
> > > > >
> > > >
> > > > Right.
> > > >
> > > > > Pass the page's nid dow to get_swap_pages() in order to pick up the
> > > > > right swap device. But it doesn't work for the swap slots cache which
> > > > > is per cpu. We could skip swap slots cache if the current node is not
> > > > > the page's node, but it may be overkilling. So keep using the current
> > > > > node's swap slots cache. The issue was found by visual code inspection
> > > > > so it is not sure how much improvement could be achieved due to lack of
> > > > > suitable testing device. But anyway the current code does violate the
> > > > > design.
> > > > >
> > > >
> > > > I intentionally used the reclaimer's nid because I think when swapping
> > > > out to a device, it is faster when the device is on the same node as
> > > > the cpu.
> > >
> > > OK, the offline discussion with Huang Ying showed the design was to
> > > have page's nid in order to achieve better I/O performance (more
> > > noticeable on faster devices) since the reclaimer may be running on a
> > > different node from the reclaimed page.
> > >
> > > >
> > > > Anyway, I think I can make a test case where the workload allocates all
> > > > its memory on the remote node and its workingset memory is larger then
> > > > the available memory so swap is triggered, then we can see which way
> > > > achieves better performance. Sounds reasonable to you?
> > >
> > > Yeah, definitely, thank you so much. I don't have a fast enough device
> > > by hand to show the difference right now. If you could get some data
> > > it would be perfect.
> > >
> >
> > Failed to find a test box that has two NVMe disks attached to different
> > nodes and since Shanghai is locked down right now, we couldn't install
> > another NVMe on the box so I figured it might be OK to test on a box that
> > has a single NVMe attached to node 0 like this:
> >
> > 1) restrict the test processes to run on node 0 and allocate on node 1;
> > 2) restrict the test processes to run on node 1 and allocate on node 0.
> >
> > In case 1), the reclaimer's node id is the same as the swap device's so
> > it's the same as current behaviour and in case 2), the page's node id is
> > the same as the swap device's so it's what your patch proposed.
> >
> > The test I used is vm-scalability/case-swap-w-rand:
> > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-swap-w-seq
> > which spawns $nr_task processes and each will mmap $size and then
> > randomly write to that area. I set nr_task=32 and $size=4G, so a total
> > of 128G memory will be needed and I used memory.limit_in_bytes to
> > restrict the available memory to 64G, to make sure swap is triggered.
> >
> > The reason why cgroup is used is to avoid waking up the per-node kswapd
> > which can trigger swapping with reclaimer/page/swap device all having the
> > same node id.
> >
> > And I don't see a measuable difference from the result:
> > case1(using reclaimer's node id) vm-scalability.throughput: 10574 KB/s
> > case2(using page's node id) vm-scalability.throughput: 10567 KB/s
> >
> > My interpretation of the result is, when reclaiming a remote page, it
> > doesn't matter much which swap device to use if the swap device is a IO
> > device.
> >
> > Later Ying reminded me we have test box that has optane installed on
> > different nodes so I also tested there: Icelake 2 sockets server with 2
> > optane installed on each node. I did the test there like this:
> > 1) restrict the test processes to run on node 0 and allocate on node 1
> >    and only swapon pmem0, which is the optane backed swap device on node 0;
> > 2) restrict the test processes to run on node 0 and allocate on node 1
> >    and only swapon pmem1, which is the optane backed swap device on node 1.
> >
> > So case 1) is current behaviour and case 2) is what your patch proposed.
> >
> > With the same test and the same nr_task/size, the result is:
> > case1(using reclaimer's node id) vm-scalability.throughput: 71033 KB/s
> > case2(using page's node id) vm-scalability.throughput: 58753 KB/s
> >
>
> The per-node swap device support is more about swap-in latency than
> swap-out throughput. I suspect the test case is more about swap-out
> throughput. perf profiling can show this.
>

Can you elaborate on the "perf profiling" part, i.e. which perf metric
can show this?

I checked the vmstat output. Initially it's mostly swap-out, then
swap-in grows more and more, but swap-out always stays above swap-in.

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- -----timestamp-----
r b swpd free buff cache si so bi bo in cs us sy id wa st UTC
20 0 0 259772032 8 2428796 0 0 2 0 365 74 1 1 98 0 0 2022-04-19 11:44:37
4 0 0 259149824 8 2439084 0 0 0 0 148536 22298 1 3 96 0 0 2022-04-19 11:44:38
35 0 0 236926192 1052 2572000 0 0 2163 2048 392940 3690 1 24 75 0 0 2022-04-19 11:44:39
32 0 0 205170832 1052 2572280 0 0 0 0 193095 1280 2 33 65 0 0 2022-04-19 11:44:40
40 0 136448 192211008 1052 2439412 268 133760 0 0 532052 3571 0 40 59 0 0 2022-04-19 11:44:43
33 0 4224000 192182464 1052 2450936 130636 4211460 0 0 1145644 15600 1 33 65 0 0 2022-04-19 11:44:44
32 0 6908708 192179408 1052 2460836 239456 2924392 0 0 863325 10731 1 34 65 0 0 2022-04-19 11:44:45
33 0 9005036 192163936 1052 2469112 279200 2371924 0 0 739330 9635 1 34 65 0 0 2022-04-19 11:44:46
32 0 10769408 192154864 1052 2476172 305516 2071440 0 0 673619 9044 1 34 65 0 0 2022-04-19 11:44:47
32 0 12318720 192149344 1052 2482708 324312 1875224 0 0 627868 9404 1 35 65 0 0 2022-04-19 11:44:48
33 0 13735796 192146752 1052 2488520 340684 1756308 0 0 601933 11815 1 35 65 0 0 2022-04-19 11:44:49
... ...
32 0 47465984 192000400 1052 2636472 456228 652756 0 0 335570 4359 0 35 65 0 0 2022-04-19 11:46:11
33 0 47652352 191996784 1052 2637932 437368 622652 0 0 328532 3627 1 35 65 0 0 2022-04-19 11:46:12
33 0 47830272 191998800 1052 2639296 428224 607516 0 0 325651 3847 1 35 65 0 0 2022-04-19 11:46:13
34 0 48014492 192001504 1052 2640596 441592 625308 0 0 329752 4199 1 35 65 0 0 2022-04-19 11:46:14
32 0 48200960 191996240 1052 2642148 458724 645744 0 0 333806 3988 0 35 65 0 0 2022-04-19 11:46:15
45 0 48379904 191991904 1052 2643576 442948 621268 0 0 329070 4529 0 35 65 0 0 2022-04-19 11:46:16
33 0 48557504 191996960 1052 2644812 444420 621948 0 0 328916 7249 1 35 64 0 0 2022-04-19 11:46:17
33 0 48729564 191995744 1052 2646272 447396 619376 0 0 329126 4565 0 35 65 0 0 2022-04-19 11:46:18
32 0 48959268 191989024 1052 2647828 595888 825480 0 0 368714 8096 0 35 65 0 0 2022-04-19 11:46:19
32 0 49242368 191990304 1052 2650036 746212 1028524 0 0 411140 10949 0 34 65 0 0 2022-04-19 11:46:20
32 0 49520792 191984080 1052 2652372 758208 1037236 0 0 415505 10094 0 34 65 0 0 2022-04-19 11:46:21
32 0 49799168 191994240 1052 2654724 767236 1046964 0 0 418405 10726 0 35 65 0 0 2022-04-19 11:46:22
32 0 50067712 191989104 1052 2657092 759192 1028600 0 0 415356 10173 0 35 65 0 0 2022-04-19 11:46:23
33 0 50333440 191980320 1052 2659332 750764 1014732 0 0 412144 9197 0 34 65 0 0 2022-04-19 11:46:24
32 0 50584052 191973824 1052 2661576 737720 988964 0 0 406620 8752 0 35 65 0 0 2022-04-19 11:46:25
32 0 50816000 191976080 1052 2663660 689248 921108 0 0 391782 8517 0 34 65 0 0 2022-04-19 11:46:26
32 0 51036416 191970464 1052 2665612 668004 888220 0 0 385112 7310 1 34 65 0 0 2022-04-19 11:46:27
32 0 51256576 191962224 1052 2667536 678464 897872 0 0 388494 12547 0 35 65 0 0 2022-04-19 11:46:28
33 0 51464680 191966304 1052 2669472 654540 862720 0 0 380869 7069 1 34 65 0 0 2022-04-19 11:46:29
32 0 51597232 191971840 1052 2670848 419772 552324 0 0 314325 4029 1 35 65 0 0 2022-04-19 11:46:30
33 0 51722448 191969456 1052 2672072 409300 535892 0 0 310720 4014 1 35 65 0 0 2022-04-19 11:46:31
32 0 51850496 191963472 1052 2673236 413160 541076 0 0 311652 3583 1 35 65 0 0 2022-04-19 11:46:32
32 0 51978872 191968208 1052 2674452 415844 543464 0 0 312411 3579 1 35 65 0 0 2022-04-19 11:46:33
32 0 52105724 191974640 1052 2675616 418104 545728 0 0 312731 4183 1 35 65 0 0 2022-04-19 11:46:34
34 0 52232928 191964336 1052 2676964 426200 552956 0 0 314230 3834 1 35 64 0 0 2022-04-19 11:46:35

The full vmstat is attached.

> For swap-in latency, we can use pmbench, which can output latency
> information.
>

OK, I'll give pmbench a run, thanks for the suggestion.

2022-04-22 20:02:37

by Aaron Lu

Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

On Thu, Apr 21, 2022 at 03:49:21PM +0800, [email protected] wrote:
> On Wed, 2022-04-20 at 16:33 +0800, Aaron Lu wrote:
> > On Thu, Apr 07, 2022 at 10:36:54AM -0700, Yang Shi wrote:
> > > On Thu, Apr 7, 2022 at 1:12 AM Aaron Lu <[email protected]> wrote:
> > > >
> > > > On Wed, Apr 06, 2022 at 07:09:53PM -0700, Yang Shi wrote:
> > > > > The swap devices are linked to per node priority lists, the swap device
> > > > > closer to the node has higher priority on that node's priority list.
> > > > > This is supposed to improve I/O latency, particularly for some fast
> > > > > devices. But the current code gets nid by calling numa_node_id() which
> > > > > actually returns the nid that the reclaimer is running on instead of the
> > > > > nid that the page belongs to.
> > > > >
> > > >
> > > > Right.
> > > >
> > > > > Pass the page's nid dow to get_swap_pages() in order to pick up the
> > > > > right swap device. But it doesn't work for the swap slots cache which
> > > > > is per cpu. We could skip swap slots cache if the current node is not
> > > > > the page's node, but it may be overkilling. So keep using the current
> > > > > node's swap slots cache. The issue was found by visual code inspection
> > > > > so it is not sure how much improvement could be achieved due to lack of
> > > > > suitable testing device. But anyway the current code does violate the
> > > > > design.
> > > > >
> > > >
> > > > I intentionally used the reclaimer's nid because I think when swapping
> > > > out to a device, it is faster when the device is on the same node as
> > > > the cpu.
> > >
> > > OK, the offline discussion with Huang Ying showed the design was to
> > > have page's nid in order to achieve better I/O performance (more
> > > noticeable on faster devices) since the reclaimer may be running on a
> > > different node from the reclaimed page.
> > >
> > > >
> > > > Anyway, I think I can make a test case where the workload allocates all
> > > > its memory on the remote node and its workingset memory is larger then
> > > > the available memory so swap is triggered, then we can see which way
> > > > achieves better performance. Sounds reasonable to you?
> > >
> > > Yeah, definitely, thank you so much. I don't have a fast enough device
> > > by hand to show the difference right now. If you could get some data
> > > it would be perfect.
> > >
> >
> > Failed to find a test box that has two NVMe disks attached to different
> > nodes and since Shanghai is locked down right now, we couldn't install
> > another NVMe on the box so I figured it might be OK to test on a box that
> > has a single NVMe attached to node 0 like this:
> >
> > 1) restrict the test processes to run on node 0 and allocate on node 1;
> > 2) restrict the test processes to run on node 1 and allocate on node 0.
> >
> > In case 1), the reclaimer's node id is the same as the swap device's so
> > it's the same as current behaviour and in case 2), the page's node id is
> > the same as the swap device's so it's what your patch proposed.
> >
> > The test I used is vm-scalability/case-swap-w-rand:
> > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-swap-w-seq
> > which spawns $nr_task processes and each will mmap $size and then
> > randomly write to that area. I set nr_task=32 and $size=4G, so a total
> > of 128G memory will be needed and I used memory.limit_in_bytes to
> > restrict the available memory to 64G, to make sure swap is triggered.
> >
> > The reason why cgroup is used is to avoid waking up the per-node kswapd
> > which can trigger swapping with reclaimer/page/swap device all having the
> > same node id.
> >
> > And I don't see a measuable difference from the result:
> > case1(using reclaimer's node id) vm-scalability.throughput: 10574 KB/s
> > case2(using page's node id) vm-scalability.throughput: 10567 KB/s
> >
> > My interpretation of the result is, when reclaiming a remote page, it
> > doesn't matter much which swap device to use if the swap device is a IO
> > device.
> >
> > Later Ying reminded me we have test box that has optane installed on
> > different nodes so I also tested there: Icelake 2 sockets server with 2
> > optane installed on each node. I did the test there like this:
> > 1) restrict the test processes to run on node 0 and allocate on node 1
> >    and only swapon pmem0, which is the optane backed swap device on node 0;
> > 2) restrict the test processes to run on node 0 and allocate on node 1
> >    and only swapon pmem1, which is the optane backed swap device on node 1.
> >
> > So case 1) is current behaviour and case 2) is what your patch proposed.
> >
> > With the same test and the same nr_task/size, the result is:
> > case1(using reclaimer's node id) vm-scalability.throughput: 71033 KB/s
> > case2(using page's node id) vm-scalability.throughput: 58753 KB/s
> >
>
> The per-node swap device support is more about swap-in latency than
> swap-out throughput. I suspect the test case is more about swap-out
> throughput. perf profiling can show this.
>

On another thought, swap out can very well affect swap in latency:
since swap is involved, available memory is in short supply, so a swap
in may very likely need to reclaim a page first, and that reclaim can
itself involve a swap out. Swap out performance can therefore also
affect swap in latency.

> For swap-in latency, we can use pmbench, which can output latency
> information.
>
> Best Regards,
> Huang, Ying
>
>
> [snip]
>

2022-04-22 20:13:25

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

On Wed, Apr 20, 2022 at 1:33 AM Aaron Lu <[email protected]> wrote:
>
> On Thu, Apr 07, 2022 at 10:36:54AM -0700, Yang Shi wrote:
> > On Thu, Apr 7, 2022 at 1:12 AM Aaron Lu <[email protected]> wrote:
> > >
> > > On Wed, Apr 06, 2022 at 07:09:53PM -0700, Yang Shi wrote:
> > > > The swap devices are linked to per node priority lists, the swap device
> > > > closer to the node has higher priority on that node's priority list.
> > > > This is supposed to improve I/O latency, particularly for some fast
> > > > devices. But the current code gets nid by calling numa_node_id() which
> > > > actually returns the nid that the reclaimer is running on instead of the
> > > > nid that the page belongs to.
> > > >
> > >
> > > Right.
> > >
> > > > Pass the page's nid dow to get_swap_pages() in order to pick up the
> > > > right swap device. But it doesn't work for the swap slots cache which
> > > > is per cpu. We could skip swap slots cache if the current node is not
> > > > the page's node, but it may be overkilling. So keep using the current
> > > > node's swap slots cache. The issue was found by visual code inspection
> > > > so it is not sure how much improvement could be achieved due to lack of
> > > > suitable testing device. But anyway the current code does violate the
> > > > design.
> > > >
> > >
> > > I intentionally used the reclaimer's nid because I think when swapping
> > > out to a device, it is faster when the device is on the same node as
> > > the cpu.
> >
> > OK, the offline discussion with Huang Ying showed the design was to
> > have page's nid in order to achieve better I/O performance (more
> > noticeable on faster devices) since the reclaimer may be running on a
> > different node from the reclaimed page.
> >
> > >
> > > Anyway, I think I can make a test case where the workload allocates all
> > > its memory on the remote node and its workingset memory is larger then
> > > the available memory so swap is triggered, then we can see which way
> > > achieves better performance. Sounds reasonable to you?
> >
> > Yeah, definitely, thank you so much. I don't have a fast enough device
> > by hand to show the difference right now. If you could get some data
> > it would be perfect.
> >
>
> Failed to find a test box that has two NVMe disks attached to different
> nodes and since Shanghai is locked down right now, we couldn't install
> another NVMe on the box so I figured it might be OK to test on a box that
> has a single NVMe attached to node 0 like this:
>
> 1) restrict the test processes to run on node 0 and allocate on node 1;
> 2) restrict the test processes to run on node 1 and allocate on node 0.
>
> In case 1), the reclaimer's node id is the same as the swap device's so
> it's the same as current behaviour and in case 2), the page's node id is
> the same as the swap device's so it's what your patch proposed.
>
> The test I used is vm-scalability/case-swap-w-rand:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-swap-w-seq
> which spawns $nr_task processes and each will mmap $size and then
> randomly write to that area. I set nr_task=32 and $size=4G, so a total
> of 128G memory will be needed and I used memory.limit_in_bytes to
> restrict the available memory to 64G, to make sure swap is triggered.
>
> The reason why cgroup is used is to avoid waking up the per-node kswapd
> which can trigger swapping with reclaimer/page/swap device all having the
> same node id.
>
> And I don't see a measuable difference from the result:
> case1(using reclaimer's node id) vm-scalability.throughput: 10574 KB/s
> case2(using page's node id) vm-scalability.throughput: 10567 KB/s
>
> My interpretation of the result is, when reclaiming a remote page, it
> doesn't matter much which swap device to use if the swap device is a IO
> device.
>
> Later Ying reminded me we have test box that has optane installed on
> different nodes so I also tested there: Icelake 2 sockets server with 2
> optane installed on each node. I did the test there like this:
> 1) restrict the test processes to run on node 0 and allocate on node 1
> and only swapon pmem0, which is the optane backed swap device on node 0;
> 2) restrict the test processes to run on node 0 and allocate on node 1
> and only swapon pmem1, which is the optane backed swap device on node 1.
>
> So case 1) is current behaviour and case 2) is what your patch proposed.
>
> With the same test and the same nr_task/size, the result is:
> case1(using reclaimer's node id) vm-scalability.throughput: 71033 KB/s
> case2(using page's node id) vm-scalability.throughput: 58753 KB/s
>
> The result suggested when using a memory like device as swap device,
> it's better to use the reclaimer's node id when swapping.
>
> What do you think of these tests and results?

Thanks, Aaron. Since you are using PMEM, I suppose case #2
actually incurs one more cross-node memory copy than case #1. Swapping
doesn't incur block I/O (DMA) but a memory copy in this
configuration.

IIUC, case #1 would do node1 -> node0 -> pmem0, but case #2 would do
node1 -> node0 -> pmem1.

Actually I was thinking about Optane SSD instead of PMEM when
discussing this with Ying offline. I'm not sure whether an Optane SSD
would show any measurable difference. But anyway, if PMEM is used as
the swap device, the current implementation does make more sense.

>
> Thanks,
> Aaron
>
> > BTW, this patch doesn't change the node for swap slots cache, so it
> > may still use the swap device on a remote node if swap slots cache is
> > used.
> >
> > >
> > > > Cc: Huang Ying <[email protected]>
> > > > Signed-off-by: Yang Shi <[email protected]>
> > > > ---
> > > > include/linux/swap.h | 3 ++-
> > > > mm/swap_slots.c | 7 ++++---
> > > > mm/swapfile.c | 5 ++---
> > > > 3 files changed, 8 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > > index 27093b477c5f..e442cf6b61ea 100644
> > > > --- a/include/linux/swap.h
> > > > +++ b/include/linux/swap.h
> > > > @@ -497,7 +497,8 @@ extern void si_swapinfo(struct sysinfo *);
> > > > extern swp_entry_t get_swap_page(struct page *page);
> > > > extern void put_swap_page(struct page *page, swp_entry_t entry);
> > > > extern swp_entry_t get_swap_page_of_type(int);
> > > > -extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size);
> > > > +extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size,
> > > > + int node);
> > > > extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> > > > extern void swap_shmem_alloc(swp_entry_t);
> > > > extern int swap_duplicate(swp_entry_t);
> > > > diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > > > index 2b5531840583..a1c5cf6a4302 100644
> > > > --- a/mm/swap_slots.c
> > > > +++ b/mm/swap_slots.c
> > > > @@ -264,7 +264,7 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
> > > > cache->cur = 0;
> > > > if (swap_slot_cache_active)
> > > > cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE,
> > > > - cache->slots, 1);
> > > > + cache->slots, 1, numa_node_id());
> > > >
> > > > return cache->nr;
> > > > }
> > > > @@ -305,12 +305,13 @@ swp_entry_t get_swap_page(struct page *page)
> > > > {
> > > > swp_entry_t entry;
> > > > struct swap_slots_cache *cache;
> > > > + int nid = page_to_nid(page);
> > > >
> > > > entry.val = 0;
> > > >
> > > > if (PageTransHuge(page)) {
> > > > if (IS_ENABLED(CONFIG_THP_SWAP))
> > > > - get_swap_pages(1, &entry, HPAGE_PMD_NR);
> > > > + get_swap_pages(1, &entry, HPAGE_PMD_NR, nid);
> > > > goto out;
> > > > }
> > > >
> > > > @@ -342,7 +343,7 @@ swp_entry_t get_swap_page(struct page *page)
> > > > goto out;
> > > > }
> > > >
> > > > - get_swap_pages(1, &entry, 1);
> > > > + get_swap_pages(1, &entry, 1, nid);
> > > > out:
> > > > if (mem_cgroup_try_charge_swap(page, entry)) {
> > > > put_swap_page(page, entry);
> > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > index 63c61f8b2611..151fffe0fd60 100644
> > > > --- a/mm/swapfile.c
> > > > +++ b/mm/swapfile.c
> > > > @@ -1036,13 +1036,13 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> > > > swap_range_free(si, offset, SWAPFILE_CLUSTER);
> > > > }
> > > >
> > > > -int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
> > > > +int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size,
> > > > + int node)
> > > > {
> > > > unsigned long size = swap_entry_size(entry_size);
> > > > struct swap_info_struct *si, *next;
> > > > long avail_pgs;
> > > > int n_ret = 0;
> > > > - int node;
> > > >
> > > > /* Only single cluster request supported */
> > > > WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
> > > > @@ -1060,7 +1060,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
> > > > atomic_long_sub(n_goal * size, &nr_swap_pages);
> > > >
> > > > start_over:
> > > > - node = numa_node_id();
> > > > plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
> > > > /* requeue si to after same-priority siblings */
> > > > plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
> > > > --
> > > > 2.26.3
> > > >
> > > >
> >

2022-04-22 20:18:55

by Aaron Lu

[permalink] [raw]
Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

On Wed, Apr 20, 2022 at 03:21:25PM -0700, Yang Shi wrote:
> On Wed, Apr 20, 2022 at 1:33 AM Aaron Lu <[email protected]> wrote:
> >
> > On Thu, Apr 07, 2022 at 10:36:54AM -0700, Yang Shi wrote:
> > > On Thu, Apr 7, 2022 at 1:12 AM Aaron Lu <[email protected]> wrote:
> > > >
> > > > On Wed, Apr 06, 2022 at 07:09:53PM -0700, Yang Shi wrote:
> > > > > The swap devices are linked to per node priority lists, the swap device
> > > > > closer to the node has higher priority on that node's priority list.
> > > > > This is supposed to improve I/O latency, particularly for some fast
> > > > > devices. But the current code gets nid by calling numa_node_id() which
> > > > > actually returns the nid that the reclaimer is running on instead of the
> > > > > nid that the page belongs to.
> > > > >
> > > >
> > > > Right.
> > > >
> > > > > Pass the page's nid dow to get_swap_pages() in order to pick up the
> > > > > right swap device. But it doesn't work for the swap slots cache which
> > > > > is per cpu. We could skip swap slots cache if the current node is not
> > > > > the page's node, but it may be overkilling. So keep using the current
> > > > > node's swap slots cache. The issue was found by visual code inspection
> > > > > so it is not sure how much improvement could be achieved due to lack of
> > > > > suitable testing device. But anyway the current code does violate the
> > > > > design.
> > > > >
> > > >
> > > > I intentionally used the reclaimer's nid because I think when swapping
> > > > out to a device, it is faster when the device is on the same node as
> > > > the cpu.
> > >
> > > OK, the offline discussion with Huang Ying showed the design was to
> > > have page's nid in order to achieve better I/O performance (more
> > > noticeable on faster devices) since the reclaimer may be running on a
> > > different node from the reclaimed page.
> > >
> > > >
> > > > Anyway, I think I can make a test case where the workload allocates all
> > > > its memory on the remote node and its workingset memory is larger then
> > > > the available memory so swap is triggered, then we can see which way
> > > > achieves better performance. Sounds reasonable to you?
> > >
> > > Yeah, definitely, thank you so much. I don't have a fast enough device
> > > by hand to show the difference right now. If you could get some data
> > > it would be perfect.
> > >
> >
> > Failed to find a test box that has two NVMe disks attached to different
> > nodes and since Shanghai is locked down right now, we couldn't install
> > another NVMe on the box so I figured it might be OK to test on a box that
> > has a single NVMe attached to node 0 like this:
> >
> > 1) restrict the test processes to run on node 0 and allocate on node 1;
> > 2) restrict the test processes to run on node 1 and allocate on node 0.
> >
> > In case 1), the reclaimer's node id is the same as the swap device's so
> > it's the same as current behaviour and in case 2), the page's node id is
> > the same as the swap device's so it's what your patch proposed.
> >
> > The test I used is vm-scalability/case-swap-w-rand:
> > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-swap-w-seq
> > which spawns $nr_task processes and each will mmap $size and then
> > randomly write to that area. I set nr_task=32 and $size=4G, so a total
> > of 128G memory will be needed and I used memory.limit_in_bytes to
> > restrict the available memory to 64G, to make sure swap is triggered.
> >
> > The reason why cgroup is used is to avoid waking up the per-node kswapd
> > which can trigger swapping with reclaimer/page/swap device all having the
> > same node id.
> >
> > And I don't see a measuable difference from the result:
> > case1(using reclaimer's node id) vm-scalability.throughput: 10574 KB/s
> > case2(using page's node id) vm-scalability.throughput: 10567 KB/s
> >
> > My interpretation of the result is, when reclaiming a remote page, it
> > doesn't matter much which swap device to use if the swap device is a IO
> > device.
> >
> > Later Ying reminded me we have test box that has optane installed on
> > different nodes so I also tested there: Icelake 2 sockets server with 2
> > optane installed on each node. I did the test there like this:
> > 1) restrict the test processes to run on node 0 and allocate on node 1
> > and only swapon pmem0, which is the optane backed swap device on node 0;
> > 2) restrict the test processes to run on node 0 and allocate on node 1
> > and only swapon pmem1, which is the optane backed swap device on node 1.
> >
> > So case 1) is current behaviour and case 2) is what your patch proposed.
> >
> > With the same test and the same nr_task/size, the result is:
> > case1(using reclaimer's node id) vm-scalability.throughput: 71033 KB/s
> > case2(using page's node id) vm-scalability.throughput: 58753 KB/s
> >
> > The result suggested when using a memory like device as swap device,
> > it's better to use the reclaimer's node id when swapping.
> >
> > What do you think of these tests and results?
>
> Thanks, Aaron. Since you are using PMEM so I'm supposed case #2
> actually has 1 more cross node memory copy than case #1. Swapping
> doesn't incur the block I/O (DMA) but memory copy in this
> configuration.
>

Right.

> IIUC, case #1 would do node1 -> node0 -> pmem0, but case #2 would do
> node1 -> node0 -> pmem1.
>

I think so.

> Actually I was thinking about Optane SSD instead of PMEM when
> discussing this with Ying offline. Not sure if Optane SSD would have
> any measurable difference or not. But anyway if PMEM is used as swap
> device the current implementation does make more sense.
>

The LKP lab doesn't have any machine with an Optane SSD; I will let you
know if I manage to find one elsewhere and get some results.

In the meantime, I think testing on an Optane SSD makes more sense:
with the memory tiering support for Optane DIMMs, swap may not be used
on them, while the SSD has a higher chance of being used as a swap
device.

Thanks,
Aaron

> > > BTW, this patch doesn't change the node for swap slots cache, so it
> > > may still use the swap device on a remote node if swap slots cache is
> > > used.
> > >
> > > >
> > > > > Cc: Huang Ying <[email protected]>
> > > > > Signed-off-by: Yang Shi <[email protected]>
> > > > > ---
> > > > > include/linux/swap.h | 3 ++-
> > > > > mm/swap_slots.c | 7 ++++---
> > > > > mm/swapfile.c | 5 ++---
> > > > > 3 files changed, 8 insertions(+), 7 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > > > index 27093b477c5f..e442cf6b61ea 100644
> > > > > --- a/include/linux/swap.h
> > > > > +++ b/include/linux/swap.h
> > > > > @@ -497,7 +497,8 @@ extern void si_swapinfo(struct sysinfo *);
> > > > > extern swp_entry_t get_swap_page(struct page *page);
> > > > > extern void put_swap_page(struct page *page, swp_entry_t entry);
> > > > > extern swp_entry_t get_swap_page_of_type(int);
> > > > > -extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size);
> > > > > +extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size,
> > > > > + int node);
> > > > > extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> > > > > extern void swap_shmem_alloc(swp_entry_t);
> > > > > extern int swap_duplicate(swp_entry_t);
> > > > > diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > > > > index 2b5531840583..a1c5cf6a4302 100644
> > > > > --- a/mm/swap_slots.c
> > > > > +++ b/mm/swap_slots.c
> > > > > @@ -264,7 +264,7 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
> > > > > cache->cur = 0;
> > > > > if (swap_slot_cache_active)
> > > > > cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE,
> > > > > - cache->slots, 1);
> > > > > + cache->slots, 1, numa_node_id());
> > > > >
> > > > > return cache->nr;
> > > > > }
> > > > > @@ -305,12 +305,13 @@ swp_entry_t get_swap_page(struct page *page)
> > > > > {
> > > > > swp_entry_t entry;
> > > > > struct swap_slots_cache *cache;
> > > > > + int nid = page_to_nid(page);
> > > > >
> > > > > entry.val = 0;
> > > > >
> > > > > if (PageTransHuge(page)) {
> > > > > if (IS_ENABLED(CONFIG_THP_SWAP))
> > > > > - get_swap_pages(1, &entry, HPAGE_PMD_NR);
> > > > > + get_swap_pages(1, &entry, HPAGE_PMD_NR, nid);
> > > > > goto out;
> > > > > }
> > > > >
> > > > > @@ -342,7 +343,7 @@ swp_entry_t get_swap_page(struct page *page)
> > > > > goto out;
> > > > > }
> > > > >
> > > > > - get_swap_pages(1, &entry, 1);
> > > > > + get_swap_pages(1, &entry, 1, nid);
> > > > > out:
> > > > > if (mem_cgroup_try_charge_swap(page, entry)) {
> > > > > put_swap_page(page, entry);
> > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > > index 63c61f8b2611..151fffe0fd60 100644
> > > > > --- a/mm/swapfile.c
> > > > > +++ b/mm/swapfile.c
> > > > > @@ -1036,13 +1036,13 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> > > > > swap_range_free(si, offset, SWAPFILE_CLUSTER);
> > > > > }
> > > > >
> > > > > -int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
> > > > > +int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size,
> > > > > + int node)
> > > > > {
> > > > > unsigned long size = swap_entry_size(entry_size);
> > > > > struct swap_info_struct *si, *next;
> > > > > long avail_pgs;
> > > > > int n_ret = 0;
> > > > > - int node;
> > > > >
> > > > > /* Only single cluster request supported */
> > > > > WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
> > > > > @@ -1060,7 +1060,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
> > > > > atomic_long_sub(n_goal * size, &nr_swap_pages);
> > > > >
> > > > > start_over:
> > > > > - node = numa_node_id();
> > > > > plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
> > > > > /* requeue si to after same-priority siblings */
> > > > > plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
> > > > > --
> > > > > 2.26.3
> > > > >
> > > > >
> > >

2022-04-22 20:53:36

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

On Thu, Apr 21, 2022 at 7:12 AM Aaron Lu <[email protected]> wrote:
>
> On Thu, Apr 21, 2022 at 03:49:21PM +0800, [email protected] wrote:
> > On Wed, 2022-04-20 at 16:33 +0800, Aaron Lu wrote:
> > > On Thu, Apr 07, 2022 at 10:36:54AM -0700, Yang Shi wrote:
> > > > On Thu, Apr 7, 2022 at 1:12 AM Aaron Lu <[email protected]> wrote:
> > > > >
> > > > > On Wed, Apr 06, 2022 at 07:09:53PM -0700, Yang Shi wrote:
> > > > > > The swap devices are linked to per node priority lists, the swap device
> > > > > > closer to the node has higher priority on that node's priority list.
> > > > > > This is supposed to improve I/O latency, particularly for some fast
> > > > > > devices. But the current code gets nid by calling numa_node_id() which
> > > > > > actually returns the nid that the reclaimer is running on instead of the
> > > > > > nid that the page belongs to.
> > > > > >
> > > > >
> > > > > Right.
> > > > >
> > > > > > Pass the page's nid dow to get_swap_pages() in order to pick up the
> > > > > > right swap device. But it doesn't work for the swap slots cache which
> > > > > > is per cpu. We could skip swap slots cache if the current node is not
> > > > > > the page's node, but it may be overkilling. So keep using the current
> > > > > > node's swap slots cache. The issue was found by visual code inspection
> > > > > > so it is not sure how much improvement could be achieved due to lack of
> > > > > > suitable testing device. But anyway the current code does violate the
> > > > > > design.
> > > > > >
> > > > >
> > > > > I intentionally used the reclaimer's nid because I think when swapping
> > > > > out to a device, it is faster when the device is on the same node as
> > > > > the cpu.
> > > >
> > > > OK, the offline discussion with Huang Ying showed the design was to
> > > > have page's nid in order to achieve better I/O performance (more
> > > > noticeable on faster devices) since the reclaimer may be running on a
> > > > different node from the reclaimed page.
> > > >
> > > > >
> > > > > Anyway, I think I can make a test case where the workload allocates all
> > > > > its memory on the remote node and its workingset memory is larger then
> > > > > the available memory so swap is triggered, then we can see which way
> > > > > achieves better performance. Sounds reasonable to you?
> > > >
> > > > Yeah, definitely, thank you so much. I don't have a fast enough device
> > > > by hand to show the difference right now. If you could get some data
> > > > it would be perfect.
> > > >
> > >
> > > Failed to find a test box that has two NVMe disks attached to different
> > > nodes and since Shanghai is locked down right now, we couldn't install
> > > another NVMe on the box so I figured it might be OK to test on a box that
> > > has a single NVMe attached to node 0 like this:
> > >
> > > 1) restrict the test processes to run on node 0 and allocate on node 1;
> > > 2) restrict the test processes to run on node 1 and allocate on node 0.
> > >
> > > In case 1), the reclaimer's node id is the same as the swap device's so
> > > it's the same as current behaviour and in case 2), the page's node id is
> > > the same as the swap device's so it's what your patch proposed.
> > >
> > > The test I used is vm-scalability/case-swap-w-rand:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-swap-w-seq
> > > which spawns $nr_task processes and each will mmap $size and then
> > > randomly write to that area. I set nr_task=32 and $size=4G, so a total
> > > of 128G memory will be needed and I used memory.limit_in_bytes to
> > > restrict the available memory to 64G, to make sure swap is triggered.
> > >
> > > The reason why cgroup is used is to avoid waking up the per-node kswapd
> > > which can trigger swapping with reclaimer/page/swap device all having the
> > > same node id.
> > >
> > > And I don't see a measuable difference from the result:
> > > case1(using reclaimer's node id) vm-scalability.throughput: 10574 KB/s
> > > case2(using page's node id) vm-scalability.throughput: 10567 KB/s
> > >
> > > My interpretation of the result is, when reclaiming a remote page, it
> > > doesn't matter much which swap device to use if the swap device is a IO
> > > device.
> > >
> > > Later Ying reminded me we have test box that has optane installed on
> > > different nodes so I also tested there: Icelake 2 sockets server with 2
> > > optane installed on each node. I did the test there like this:
> > > 1) restrict the test processes to run on node 0 and allocate on node 1
> > > and only swapon pmem0, which is the optane backed swap device on node 0;
> > > 2) restrict the test processes to run on node 0 and allocate on node 1
> > > and only swapon pmem1, which is the optane backed swap device on node 1.
> > >
> > > So case 1) is current behaviour and case 2) is what your patch proposed.
> > >
> > > With the same test and the same nr_task/size, the result is:
> > > case1(using reclaimer's node id) vm-scalability.throughput: 71033 KB/s
> > > case2(using page's node id) vm-scalability.throughput: 58753 KB/s
> > >
> >
> > The per-node swap device support is more about swap-in latency than
> > swap-out throughput. I suspect the test case is more about swap-out
> > throughput. perf profiling can show this.
> >
>
> On another thought, swap out can very well affect swap in latency:
> since swap is involved, the available memory is in short supply and swap
> in may very likely need to reclaim a page and that reclaim can involve a
> swap out, so swap out performance can also affect swap in latency.

If you count in page allocation latency, yes. I think we could just
measure the I/O latency, for example of swap_readpage()? I suppose
the per-node swap device support is aimed at minimizing I/O latency.

>
> > For swap-in latency, we can use pmbench, which can output latency
> > information.
> >
> > Best Regards,
> > Huang, Ying
> >
> >
> > [snip]
> >
>

2022-04-22 20:57:46

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

On Thu, 2022-04-21 at 16:17 +0800, Aaron Lu wrote:
> On Thu, Apr 21, 2022 at 03:49:21PM +0800, [email protected] wrote:
> > On Wed, 2022-04-20 at 16:33 +0800, Aaron Lu wrote:
> > > On Thu, Apr 07, 2022 at 10:36:54AM -0700, Yang Shi wrote:
> > > > On Thu, Apr 7, 2022 at 1:12 AM Aaron Lu <[email protected]> wrote:
> > > > >
> > > > > On Wed, Apr 06, 2022 at 07:09:53PM -0700, Yang Shi wrote:
> > > > > > The swap devices are linked to per node priority lists, the swap device
> > > > > > closer to the node has higher priority on that node's priority list.
> > > > > > This is supposed to improve I/O latency, particularly for some fast
> > > > > > devices. But the current code gets nid by calling numa_node_id() which
> > > > > > actually returns the nid that the reclaimer is running on instead of the
> > > > > > nid that the page belongs to.
> > > > > >
> > > > >
> > > > > Right.
> > > > >
> > > > > > Pass the page's nid dow to get_swap_pages() in order to pick up the
> > > > > > right swap device. But it doesn't work for the swap slots cache which
> > > > > > is per cpu. We could skip swap slots cache if the current node is not
> > > > > > the page's node, but it may be overkilling. So keep using the current
> > > > > > node's swap slots cache. The issue was found by visual code inspection
> > > > > > so it is not sure how much improvement could be achieved due to lack of
> > > > > > suitable testing device. But anyway the current code does violate the
> > > > > > design.
> > > > > >
> > > > >
> > > > > I intentionally used the reclaimer's nid because I think when swapping
> > > > > out to a device, it is faster when the device is on the same node as
> > > > > the cpu.
> > > >
> > > > OK, the offline discussion with Huang Ying showed the design was to
> > > > have page's nid in order to achieve better I/O performance (more
> > > > noticeable on faster devices) since the reclaimer may be running on a
> > > > different node from the reclaimed page.
> > > >
> > > > >
> > > > > Anyway, I think I can make a test case where the workload allocates all
> > > > > its memory on the remote node and its workingset memory is larger then
> > > > > the available memory so swap is triggered, then we can see which way
> > > > > achieves better performance. Sounds reasonable to you?
> > > >
> > > > Yeah, definitely, thank you so much. I don't have a fast enough device
> > > > by hand to show the difference right now. If you could get some data
> > > > it would be perfect.
> > > >
> > >
> > > Failed to find a test box that has two NVMe disks attached to different
> > > nodes and since Shanghai is locked down right now, we couldn't install
> > > another NVMe on the box so I figured it might be OK to test on a box that
> > > has a single NVMe attached to node 0 like this:
> > >
> > > 1) restrict the test processes to run on node 0 and allocate on node 1;
> > > 2) restrict the test processes to run on node 1 and allocate on node 0.
> > >
> > > In case 1), the reclaimer's node id is the same as the swap device's so
> > > it's the same as current behaviour and in case 2), the page's node id is
> > > the same as the swap device's so it's what your patch proposed.
> > >
> > > The test I used is vm-scalability/case-swap-w-rand:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-swap-w-seq
> > > which spawns $nr_task processes and each will mmap $size and then
> > > randomly write to that area. I set nr_task=32 and $size=4G, so a total
> > > of 128G memory will be needed and I used memory.limit_in_bytes to
> > > restrict the available memory to 64G, to make sure swap is triggered.
> > >
> > > The reason why cgroup is used is to avoid waking up the per-node kswapd
> > > which can trigger swapping with reclaimer/page/swap device all having the
> > > same node id.
> > >
> > > And I don't see a measuable difference from the result:
> > > case1(using reclaimer's node id) vm-scalability.throughput: 10574 KB/s
> > > case2(using page's node id) vm-scalability.throughput: 10567 KB/s
> > >
> > > My interpretation of the result is, when reclaiming a remote page, it
> > > doesn't matter much which swap device to use if the swap device is a IO
> > > device.
> > >
> > > Later Ying reminded me we have test box that has optane installed on
> > > different nodes so I also tested there: Icelake 2 sockets server with 2
> > > optane installed on each node. I did the test there like this:
> > > 1) restrict the test processes to run on node 0 and allocate on node 1
> > >    and only swapon pmem0, which is the optane backed swap device on node 0;
> > > 2) restrict the test processes to run on node 0 and allocate on node 1
> > >    and only swapon pmem1, which is the optane backed swap device on node 1.
> > >
> > > So case 1) is current behaviour and case 2) is what your patch proposed.
> > >
> > > With the same test and the same nr_task/size, the result is:
> > > case1(using reclaimer's node id) vm-scalability.throughput: 71033 KB/s
> > > case2(using page's node id) vm-scalability.throughput: 58753 KB/s
> > >
> >
> > The per-node swap device support is more about swap-in latency than
> > swap-out throughput. I suspect the test case is more about swap-out
> > throughput. perf profiling can show this.
> >
>
> Can you elaborate the "perf profiling" part, like which perf metric can
> show this?
>

Just `perf record` and `perf report` to show where the CPU cycles are
spent, with call graph information.

Then you can see whether most CPU cycles go to swap out or swap in,
that is, where the bottleneck is.
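
For example, roughly like this (illustrative; exact options may need
adjusting for your perf version):

# record system-wide with call graphs while the test runs, then inspect
perf record -a -g -- sleep 30
perf report -g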

> I checked the vmstat output. Initially, it's mostly swap out, then swap
> in starts to be more and more, but swapout is always more than swapin.
>
> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- -----timestamp-----
>  r b swpd free buff cache si so bi bo in cs us sy id wa st UTC
> 20 0 0 259772032 8 2428796 0 0 2 0 365 74 1 1 98 0 0 2022-04-19 11:44:37
>  4 0 0 259149824 8 2439084 0 0 0 0 148536 22298 1 3 96 0 0 2022-04-19 11:44:38
> 35 0 0 236926192 1052 2572000 0 0 2163 2048 392940 3690 1 24 75 0 0 2022-04-19 11:44:39
> 32 0 0 205170832 1052 2572280 0 0 0 0 193095 1280 2 33 65 0 0 2022-04-19 11:44:40
> 40 0 136448 192211008 1052 2439412 268 133760 0 0 532052 3571 0 40 59 0 0 2022-04-19 11:44:43
> 33 0 4224000 192182464 1052 2450936 130636 4211460 0 0 1145644 15600 1 33 65 0 0 2022-04-19 11:44:44
> 32 0 6908708 192179408 1052 2460836 239456 2924392 0 0 863325 10731 1 34 65 0 0 2022-04-19 11:44:45
> 33 0 9005036 192163936 1052 2469112 279200 2371924 0 0 739330 9635 1 34 65 0 0 2022-04-19 11:44:46
> 32 0 10769408 192154864 1052 2476172 305516 2071440 0 0 673619 9044 1 34 65 0 0 2022-04-19 11:44:47
> 32 0 12318720 192149344 1052 2482708 324312 1875224 0 0 627868 9404 1 35 65 0 0 2022-04-19 11:44:48
> 33 0 13735796 192146752 1052 2488520 340684 1756308 0 0 601933 11815 1 35 65 0 0 2022-04-19 11:44:49
> ... ...
> 32 0 47465984 192000400 1052 2636472 456228 652756 0 0 335570 4359 0 35 65 0 0 2022-04-19 11:46:11
> 33 0 47652352 191996784 1052 2637932 437368 622652 0 0 328532 3627 1 35 65 0 0 2022-04-19 11:46:12
> 33 0 47830272 191998800 1052 2639296 428224 607516 0 0 325651 3847 1 35 65 0 0 2022-04-19 11:46:13
> 34 0 48014492 192001504 1052 2640596 441592 625308 0 0 329752 4199 1 35 65 0 0 2022-04-19 11:46:14
> 32 0 48200960 191996240 1052 2642148 458724 645744 0 0 333806 3988 0 35 65 0 0 2022-04-19 11:46:15
> 45 0 48379904 191991904 1052 2643576 442948 621268 0 0 329070 4529 0 35 65 0 0 2022-04-19 11:46:16
> 33 0 48557504 191996960 1052 2644812 444420 621948 0 0 328916 7249 1 35 64 0 0 2022-04-19 11:46:17
> 33 0 48729564 191995744 1052 2646272 447396 619376 0 0 329126 4565 0 35 65 0 0 2022-04-19 11:46:18
> 32 0 48959268 191989024 1052 2647828 595888 825480 0 0 368714 8096 0 35 65 0 0 2022-04-19 11:46:19
> 32 0 49242368 191990304 1052 2650036 746212 1028524 0 0 411140 10949 0 34 65 0 0 2022-04-19 11:46:20
> 32 0 49520792 191984080 1052 2652372 758208 1037236 0 0 415505 10094 0 34 65 0 0 2022-04-19 11:46:21
> 32 0 49799168 191994240 1052 2654724 767236 1046964 0 0 418405 10726 0 35 65 0 0 2022-04-19 11:46:22
> 32 0 50067712 191989104 1052 2657092 759192 1028600 0 0 415356 10173 0 35 65 0 0 2022-04-19 11:46:23
> 33 0 50333440 191980320 1052 2659332 750764 1014732 0 0 412144 9197 0 34 65 0 0 2022-04-19 11:46:24
> 32 0 50584052 191973824 1052 2661576 737720 988964 0 0 406620 8752 0 35 65 0 0 2022-04-19 11:46:25
> 32 0 50816000 191976080 1052 2663660 689248 921108 0 0 391782 8517 0 34 65 0 0 2022-04-19 11:46:26
> 32 0 51036416 191970464 1052 2665612 668004 888220 0 0 385112 7310 1 34 65 0 0 2022-04-19 11:46:27
> 32 0 51256576 191962224 1052 2667536 678464 897872 0 0 388494 12547 0 35 65 0 0 2022-04-19 11:46:28
> 33 0 51464680 191966304 1052 2669472 654540 862720 0 0 380869 7069 1 34 65 0 0 2022-04-19 11:46:29
> 32 0 51597232 191971840 1052 2670848 419772 552324 0 0 314325 4029 1 35 65 0 0 2022-04-19 11:46:30
> 33 0 51722448 191969456 1052 2672072 409300 535892 0 0 310720 4014 1 35 65 0 0 2022-04-19 11:46:31
> 32 0 51850496 191963472 1052 2673236 413160 541076 0 0 311652 3583 1 35 65 0 0 2022-04-19 11:46:32
> 32 0 51978872 191968208 1052 2674452 415844 543464 0 0 312411 3579 1 35 65 0 0 2022-04-19 11:46:33
> 32 0 52105724 191974640 1052 2675616 418104 545728 0 0 312731 4183 1 35 65 0 0 2022-04-19 11:46:34
> 34 0 52232928 191964336 1052 2676964 426200 552956 0 0 314230 3834 1 35 64 0 0 2022-04-19 11:46:35
>
> The full vmstat is attached.
>
> > For swap-in latency, we can use pmbench, which can output latency
> > information.
> >
>
> OK, I'll give pmbench a run, thanks for the suggestion.

Better to construct a scenario with more swapin than swapout. For
example, start a memory eater, then kill it later.
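
For example, something as simple as this would do (illustrative;
assuming the stress utility is available, and the size/time here are
just placeholders):

# fill memory to push the workload's pages out to swap, then kill the
# eater so that subsequent accesses are mostly swap-ins
stress --vm 1 --vm-bytes 64g --vm-keep &
eater=$!
sleep 60
kill $eater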

Best Regards,
Huang, Ying

2022-04-22 21:23:50

by Aaron Lu

[permalink] [raw]
Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

On Thu, Apr 21, 2022 at 04:34:09PM +0800, [email protected] wrote:
> On Thu, 2022-04-21 at 16:17 +0800, Aaron Lu wrote:
> > On Thu, Apr 21, 2022 at 03:49:21PM +0800, [email protected] wrote:

... ...

> > > For swap-in latency, we can use pmbench, which can output latency
> > > information.
> > >
> >
> > OK, I'll give pmbench a run, thanks for the suggestion.
>
> Better to construct a senario with more swapin than swapout. For
> example, start a memory eater, then kill it later.

What about vm-scalability/case-swapin?
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-swapin

I think you are pretty familiar with it but still:
1) it starts $nr_task processes and each mmaps $size/$nr_task area and
then consumes the memory, after this, it waits for a signal;
2) start another process to consume $size memory to push the memory in
step 1) to swap device;
3) kick processes in step 1) to start accessing their memory, thus
trigger swapins. The metric of this testcase is the swapin throughput.

I plan to restrict the cgroup's limit to $size.

Considering there is only one NVMe drive attached to node 0, I will run
the test as described before:
1) bind processes to run on node 0, allocate on node 1 to test the
performance when reclaimer's node id is the same as swap device's.
2) bind processes to run on node 1, allocate on node 0 to test the
performance when page's node id is the same as swap device's.
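
Roughly, the setup would look like this (an illustrative sketch only;
the cgroup name/paths assume cgroup v1, and I may end up doing the
binding via cgroup cpuset instead of numactl):

# cap available memory with the memory cgroup, as in the previous tests
mkdir -p /sys/fs/cgroup/memory/swapin_test
echo "$size" > /sys/fs/cgroup/memory/swapin_test/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/swapin_test/tasks
# 1) run on node 0, allocate on node 1 (reclaimer's node == swap device's)
numactl --cpunodebind=0 --membind=1 ./case-swapin
# 2) run on node 1, allocate on node 0 (page's node == swap device's)
numactl --cpunodebind=1 --membind=0 ./case-swapin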

Ying and Yang,

Let me know what you think about the case used and the way the test is
conducted.

2022-04-22 22:12:25

by Aaron Lu

[permalink] [raw]
Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

On Fri, Apr 22, 2022 at 02:27:45PM +0800, [email protected] wrote:
> On Fri, 2022-04-22 at 14:24 +0800, Aaron Lu wrote:
> > On Thu, Apr 21, 2022 at 04:34:09PM +0800, [email protected] wrote:
> > > On Thu, 2022-04-21 at 16:17 +0800, Aaron Lu wrote:
> > > > On Thu, Apr 21, 2022 at 03:49:21PM +0800, [email protected] wrote:
> >
> > ... ...
> >
> > > > > For swap-in latency, we can use pmbench, which can output latency
> > > > > information.
> > > > >
> > > >
> > > > OK, I'll give pmbench a run, thanks for the suggestion.
> > >
> > > Better to construct a senario with more swapin than swapout. For
> > > example, start a memory eater, then kill it later.
> >
> > What about vm-scalability/case-swapin?
> > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-swapin
> >
> > I think you are pretty familiar with it but still:
> > 1) it starts $nr_task processes and each mmaps $size/$nr_task area and
> >    then consumes the memory, after this, it waits for a signal;
> > 2) start another process to consume $size memory to push the memory in
> >    step 1) to swap device;
> > 3) kick processes in step 1) to start accessing their memory, thus
> >    trigger swapins. The metric of this testcase is the swapin throughput.
> >
> > I plan to restrict the cgroup's limit to $size.
> >
> > Considering there is only one NVMe drive attached to node 0, I will run
> > the test as described before:
> > 1) bind processes to run on node 0, allocate on node 1 to test the
> >    performance when reclaimer's node id is the same as swap device's.
> > 2) bind processes to run on node 1, allocate on node 0 to test the
> >    performance when page's node id is the same as swap device's.
> >
> > Ying and Yang,
> >
> > Let me know what you think about the case used and the way the test is
> > conducted.
>
> The test case looks good to me. And, do you have a way to measure swap
> in latency? Better to compare between enabling and disabling per-node

By swap in latency, do you mean the time it takes for a fault that is
served by swap in?

Since the test is swap in only, I think throughput can tell us the
average swap in latency?

> swap device support too to make sure per-node support has performance
> impact on this system.

I think we can tell by conducting two more tests:
1) bind processes to run on node 0, allocate on node 0;
2) bind processes to run on node 1, allocate on node 1.
If case 1) is faster, we can say the per-node support has a performance
impact on this system.
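
With the same illustrative binding sketch as before, these would simply be:

numactl --cpunodebind=0 --membind=0 ./case-swapin   # case 1)
numactl --cpunodebind=1 --membind=1 ./case-swapin   # case 2)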

2022-04-22 22:54:36

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

On Thu, 2022-04-21 at 22:11 +0800, Aaron Lu wrote:
> On Thu, Apr 21, 2022 at 03:49:21PM +0800, [email protected] wrote:
> > On Wed, 2022-04-20 at 16:33 +0800, Aaron Lu wrote:
> > > On Thu, Apr 07, 2022 at 10:36:54AM -0700, Yang Shi wrote:
> > > > On Thu, Apr 7, 2022 at 1:12 AM Aaron Lu <[email protected]> wrote:
> > > > >
> > > > > On Wed, Apr 06, 2022 at 07:09:53PM -0700, Yang Shi wrote:
> > > > > > The swap devices are linked to per node priority lists, the swap device
> > > > > > closer to the node has higher priority on that node's priority list.
> > > > > > This is supposed to improve I/O latency, particularly for some fast
> > > > > > devices. But the current code gets nid by calling numa_node_id() which
> > > > > > actually returns the nid that the reclaimer is running on instead of the
> > > > > > nid that the page belongs to.
> > > > > >
> > > > >
> > > > > Right.
> > > > >
> > > > > > Pass the page's nid dow to get_swap_pages() in order to pick up the
> > > > > > right swap device. But it doesn't work for the swap slots cache which
> > > > > > is per cpu. We could skip swap slots cache if the current node is not
> > > > > > the page's node, but it may be overkilling. So keep using the current
> > > > > > node's swap slots cache. The issue was found by visual code inspection
> > > > > > so it is not sure how much improvement could be achieved due to lack of
> > > > > > suitable testing device. But anyway the current code does violate the
> > > > > > design.
> > > > > >
> > > > >
> > > > > I intentionally used the reclaimer's nid because I think when swapping
> > > > > out to a device, it is faster when the device is on the same node as
> > > > > the cpu.
> > > >
> > > > OK, the offline discussion with Huang Ying showed the design was to
> > > > have page's nid in order to achieve better I/O performance (more
> > > > noticeable on faster devices) since the reclaimer may be running on a
> > > > different node from the reclaimed page.
> > > >
> > > > >
> > > > > Anyway, I think I can make a test case where the workload allocates all
> > > > > its memory on the remote node and its workingset memory is larger then
> > > > > the available memory so swap is triggered, then we can see which way
> > > > > achieves better performance. Sounds reasonable to you?
> > > >
> > > > Yeah, definitely, thank you so much. I don't have a fast enough device
> > > > by hand to show the difference right now. If you could get some data
> > > > it would be perfect.
> > > >
> > >
> > > Failed to find a test box that has two NVMe disks attached to different
> > > nodes and since Shanghai is locked down right now, we couldn't install
> > > another NVMe on the box so I figured it might be OK to test on a box that
> > > has a single NVMe attached to node 0 like this:
> > >
> > > 1) restrict the test processes to run on node 0 and allocate on node 1;
> > > 2) restrict the test processes to run on node 1 and allocate on node 0.
> > >
> > > In case 1), the reclaimer's node id is the same as the swap device's so
> > > it's the same as current behaviour and in case 2), the page's node id is
> > > the same as the swap device's so it's what your patch proposed.
> > >
> > > The test I used is vm-scalability/case-swap-w-rand:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-swap-w-seq
> > > which spawns $nr_task processes and each will mmap $size and then
> > > randomly write to that area. I set nr_task=32 and $size=4G, so a total
> > > of 128G memory will be needed and I used memory.limit_in_bytes to
> > > restrict the available memory to 64G, to make sure swap is triggered.
> > >
> > > The reason why cgroup is used is to avoid waking up the per-node kswapd
> > > which can trigger swapping with reclaimer/page/swap device all having the
> > > same node id.
> > >
> > > And I don't see a measuable difference from the result:
> > > case1(using reclaimer's node id) vm-scalability.throughput: 10574 KB/s
> > > case2(using page's node id) vm-scalability.throughput: 10567 KB/s
> > >
> > > My interpretation of the result is, when reclaiming a remote page, it
> > > doesn't matter much which swap device to use if the swap device is a IO
> > > device.
> > >
> > > Later Ying reminded me we have test box that has optane installed on
> > > different nodes so I also tested there: Icelake 2 sockets server with 2
> > > optane installed on each node. I did the test there like this:
> > > 1) restrict the test processes to run on node 0 and allocate on node 1
> > >    and only swapon pmem0, which is the optane backed swap device on node 0;
> > > 2) restrict the test processes to run on node 0 and allocate on node 1
> > >    and only swapon pmem1, which is the optane backed swap device on node 1.
> > >
> > > So case 1) is current behaviour and case 2) is what your patch proposed.
> > >
> > > With the same test and the same nr_task/size, the result is:
> > > case1(using reclaimer's node id) vm-scalability.throughput: 71033 KB/s
> > > case2(using page's node id) vm-scalability.throughput: 58753 KB/s
> > >
> >
> > The per-node swap device support is more about swap-in latency than
> > swap-out throughput. I suspect the test case is more about swap-out
> > throughput. perf profiling can show this.
> >
>
> On another thought, swap out can very well affect swap in latency:
> since swap is involved, the available memory is in short supply and swap
> in may very likely need to reclaim a page and that reclaim can involve a
> swap out, so swap out performance can also affect swap in latency.
>

I think you are talking about thrashing. Thrashing will kill
performance. With proactive reclaim, or something similar (e.g. killing
low priority workloads), we can introduce swapping almost without
thrashing.

I don't want to say that swap out performance isn't important. But
swap out and swap in are different: swap out performance is more about
throughput, while swap in performance is more about latency.

Best Regards,
Huang, Ying

> > For swap-in latency, we can use pmbench, which can output latency
> > information.
> >
> > Best Regards,
> > Huang, Ying
> >
> >
> > [snip]
> >


2022-04-23 07:11:57

by Aaron Lu

[permalink] [raw]
Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

On Fri, Apr 22, 2022 at 10:00:59AM -0700, Yang Shi wrote:
> On Thu, Apr 21, 2022 at 11:24 PM Aaron Lu <[email protected]> wrote:
> >
> > On Thu, Apr 21, 2022 at 04:34:09PM +0800, [email protected] wrote:
> > > On Thu, 2022-04-21 at 16:17 +0800, Aaron Lu wrote:
> > > > On Thu, Apr 21, 2022 at 03:49:21PM +0800, [email protected] wrote:
> >
> > ... ...
> >
> > > > > For swap-in latency, we can use pmbench, which can output latency
> > > > > information.
> > > > >
> > > >
> > > > OK, I'll give pmbench a run, thanks for the suggestion.
> > >
> > > Better to construct a senario with more swapin than swapout. For
> > > example, start a memory eater, then kill it later.
> >
> > What about vm-scalability/case-swapin?
> > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-swapin
> >
> > I think you are pretty familiar with it but still:
> > 1) it starts $nr_task processes and each mmaps $size/$nr_task area and
> > then consumes the memory, after this, it waits for a signal;
> > 2) start another process to consume $size memory to push the memory in
> > step 1) to swap device;
> > 3) kick processes in step 1) to start accessing their memory, thus
> > trigger swapins. The metric of this testcase is the swapin throughput.
> >
> > I plan to restrict the cgroup's limit to $size.
> >
> > Considering there is only one NVMe drive attached to node 0, I will run
> > the test as described before:
> > 1) bind processes to run on node 0, allocate on node 1 to test the
> > performance when reclaimer's node id is the same as swap device's.
> > 2) bind processes to run on node 1, allocate on node 0 to test the
> > performance when page's node id is the same as swap device's.
> >
> > Ying and Yang,
> >
> > Let me know what you think about the case used and the way the test is
> > conducted.
>
> Looks fine to me. To measure the latency, you could also try the below
> bpftrace script:

Yeah, bpftrace can nicely show us the histogram of the latency.
The hard part is integrating bpftrace into the LKP framework though.

>
> #! /usr/bin/bpftrace
>
> kprobe:swap_readpage
> {
> @start[tid] = nsecs;
> }
>
> kretprobe:swap_readpage
> /@start[tid]/
> {
> @us[comm] = hist((nsecs - @start[tid]) / 1000);
> delete(@start[tid]);
> }

2022-04-29 16:07:30

by Aaron Lu

[permalink] [raw]
Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

On Fri, Apr 22, 2022 at 10:00:59AM -0700, Yang Shi wrote:
> On Thu, Apr 21, 2022 at 11:24 PM Aaron Lu <[email protected]> wrote:
> >
> > On Thu, Apr 21, 2022 at 04:34:09PM +0800, [email protected] wrote:
> > > On Thu, 2022-04-21 at 16:17 +0800, Aaron Lu wrote:
> > > > On Thu, Apr 21, 2022 at 03:49:21PM +0800, [email protected] wrote:
> >
> > ... ...
> >
> > > > > For swap-in latency, we can use pmbench, which can output latency
> > > > > information.
> > > > >
> > > >
> > > > OK, I'll give pmbench a run, thanks for the suggestion.
> > >
> > > Better to construct a senario with more swapin than swapout. For
> > > example, start a memory eater, then kill it later.
> >
> > What about vm-scalability/case-swapin?
> > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-swapin
> >
> > I think you are pretty familiar with it but still:
> > 1) it starts $nr_task processes and each mmaps $size/$nr_task area and
> > then consumes the memory, after this, it waits for a signal;
> > 2) start another process to consume $size memory to push the memory in
> > step 1) to swap device;
> > 3) kick processes in step 1) to start accessing their memory, thus
> > trigger swapins. The metric of this testcase is the swapin throughput.
> >
> > I plan to restrict the cgroup's limit to $size.
> >
> > Considering there is only one NVMe drive attached to node 0, I will run
> > the test as described before:
> > 1) bind processes to run on node 0, allocate on node 1 to test the
> > performance when reclaimer's node id is the same as swap device's.
> > 2) bind processes to run on node 1, allocate on node 0 to test the
> > performance when page's node id is the same as swap device's.
> >

Thanks to Tim, who found me a server that has a single Optane disk
attached to node 0.

Let's use task0_mem0 to denote tasks bound to node 0 and memory bound
to node 0 through cgroup cpuset. And for the above swapin case:
when nr_task=1:
task0_mem0 throughput: [571652, 587158, 594316], avg=584375 -> baseline
task0_mem1 throughput: [582944, 583752, 589026], avg=585240 +0.15%
task1_mem0 throughput: [569349, 577459, 581107], avg=575971 -1.4%
task1_mem1 throughput: [564482, 570664, 571466], avg=568870 -2.6%

task0_mem1 is slightly better than task1_mem0.

For nr_task=8 or nr_task=16, I also gave it a run and the result is
almost the same for all 4 cases.

> > Ying and Yang,
> >
> > Let me know what you think about the case used and the way the test is
> > conducted.
>
> Looks fine to me. To measure the latency, you could also try the below
> bpftrace script:
>

Trying to install bpftrace on an old distro (Ubuntu 16.04) is a real
pain, so I gave up... But I managed to get an old bcc installed. Using
the provided funclatency script to profile swap_readpage() for 30
seconds, there is no obvious difference in the histogram.
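
For reference, that kind of bcc profiling looks roughly like this (the
tool path and flags depend on the bcc version/distro):

# histogram of swap_readpage() latency in microseconds, sampled for 30s
/usr/share/bcc/tools/funclatency -u -d 30 swap_readpage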

So for now, the existing results don't show a big difference.
Theoretically, for an IO device, when swapping out a remote page, using
the remote swap device that is on the same node as the page can reduce
interconnect traffic and improve performance. I think this is the
main motivation for this code change?
At swap-in time, it's hard to say which node the task will run on
anyway, so it's hard to say which swap placement is beneficial.

2022-05-02 23:33:01

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] mm: swap: determine swap device by using page nid

On Fri, Apr 29, 2022 at 3:27 AM Aaron Lu <[email protected]> wrote:
>
> On Fri, Apr 22, 2022 at 10:00:59AM -0700, Yang Shi wrote:
> > On Thu, Apr 21, 2022 at 11:24 PM Aaron Lu <[email protected]> wrote:
> > >
> > > On Thu, Apr 21, 2022 at 04:34:09PM +0800, [email protected] wrote:
> > > > On Thu, 2022-04-21 at 16:17 +0800, Aaron Lu wrote:
> > > > > On Thu, Apr 21, 2022 at 03:49:21PM +0800, [email protected] wrote:
> > >
> > > ... ...
> > >
> > > > > > For swap-in latency, we can use pmbench, which can output latency
> > > > > > information.
> > > > > >
> > > > >
> > > > > OK, I'll give pmbench a run, thanks for the suggestion.
> > > >
> > > > Better to construct a senario with more swapin than swapout. For
> > > > example, start a memory eater, then kill it later.
> > >
> > > What about vm-scalability/case-swapin?
> > > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-swapin
> > >
> > > I think you are pretty familiar with it but still:
> > > 1) it starts $nr_task processes and each mmaps $size/$nr_task area and
> > > then consumes the memory, after this, it waits for a signal;
> > > 2) start another process to consume $size memory to push the memory in
> > > step 1) to swap device;
> > > 3) kick processes in step 1) to start accessing their memory, thus
> > > trigger swapins. The metric of this testcase is the swapin throughput.
> > >
> > > I plan to restrict the cgroup's limit to $size.
> > >
> > > Considering there is only one NVMe drive attached to node 0, I will run
> > > the test as described before:
> > > 1) bind processes to run on node 0, allocate on node 1 to test the
> > > performance when reclaimer's node id is the same as swap device's.
> > > 2) bind processes to run on node 1, allocate on node 0 to test the
> > > performance when page's node id is the same as swap device's.
> > >
>
> Thanks to Tim, he has found me a server that has a single Optane disk
> attached to node 0.
>
> Let's use task0_mem0 to denote tasks bound to node 0 and memory bound
> to node 0 through cgroup cpuset. And for the above swapin case:
> when nr_task=1:
> task0_mem0 throughput: [571652, 587158, 594316], avg=584375 -> baseline
> task0_mem1 throughput: [582944, 583752, 589026], avg=585240 +0.15%
> task1_mem0 throughput: [569349, 577459, 581107], avg=575971 -1.4%
> task1_mem1 throughput: [564482, 570664, 571466], avg=568870 -2.6%
>
> task0_mem1 is slightly better than task1_mem0.
>
> For nr_task=8 or nr_task=16, I also gave it a run and the result is
> almost the same for all 4 cases.
>
> > > Ying and Yang,
> > >
> > > Let me know what you think about the case used and the way the test is
> > > conducted.
> >
> > Looks fine to me. To measure the latency, you could also try the below
> > bpftrace script:
> >
>
> Trying to install bpftrace on an old distro(Ubuntu 16.04) is a real
> pain, I gave up... But I managed to get an old bcc installed. Using
> the provided funclatency script to profile 30 seconds swap_readpage(),
> there is no obvious difference from the histrogram.

Thank you so much for the testing.

>
> So for now, from the existing results, it did't show big difference.
> Theoretically, for IO device, when swapping a remote page, using the
> remote swap device that is at the same node as the page can reduce the
> traffic of the interlink and improve performance. I think this is the
> main motivation for this code change?

Yes.

Given the results, it seems better to keep the code as-is.

> On swapin time, it's hard to say which node the task will run on anyway
> so it's hard to say where to swap is beneficial.
>