2020-05-20 03:20:27

by Huang, Ying

Subject: [PATCH -V2] swap: Reduce lock contention on swap cache from swap slots allocation

In some swap scalability tests, heavy lock contention on the swap
cache was observed even though commit 4b3ef9daa4fc ("mm/swap: split
swap cache into 64MB trunks") had already split the swap cache radix
tree from one per swap device into one per 64 MB trunk.

The reason is as follows. After the swap device becomes fragmented
so that there is no free swap cluster, the swap device is scanned
linearly to find free swap slots. swap_info_struct->cluster_next is
the next scanning base and is shared by all CPUs, so nearby free swap
slots are allocated to different CPUs. The probability that multiple
CPUs operate on the same 64 MB trunk is therefore high, which causes
the lock contention on the swap cache.

To solve the issue, this patch adds a per-CPU next scanning base
(cluster_next_cpu) for SSD swap devices. Every CPU uses its own
per-CPU next scanning base, and after finishing the scan of a 64 MB
trunk, the per-CPU scanning base is moved to the beginning of another
randomly selected 64 MB trunk. In this way, the probability that
multiple CPUs operate on the same 64 MB trunk is greatly reduced, and
so is the lock contention. For HDD, the original shared next scanning
base is kept, because sequential access is more important for its IO
performance.

To test the patch, we ran the 16-process pmbench memory benchmark on
a 2-socket server machine with 48 cores. One ram disk per socket was
configured as a swap device. The pmbench working-set size is much
larger than the available memory, so swapping is triggered. The
memory read/write ratio is 80/20 and the access pattern is random.
With the original implementation, the lock contention on the swap
cache is heavy. The perf profile of the lock contention code paths is
as follows:

_raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list: 7.91
_raw_spin_lock_irqsave.__remove_mapping.shrink_page_list: 7.11
_raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 2.51
_raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 1.66
_raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node: 1.29
_raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.03
_raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 0.93

After applying this patch, it becomes:

_raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 3.58
_raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node: 2.3
_raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 2.26
_raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 1.8
_raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.19

The lock contention on the swap cache is almost eliminated.

The pmbench score increases by 18.5%. The swapin throughput
increases by 18.7%, from 2.96 GB/s to 3.51 GB/s, while the swapout
throughput increases by 18.5%, from 2.99 GB/s to 3.54 GB/s.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Hugh Dickins <[email protected]>
---

Changelog:

v2:

- Rebased on the latest mmotm tree (v5.7-rc5-mmots-2020-05-15-16-36);
the mem cgroup change there influences the performance data.

- Fix cluster_next_cpu initialization per Andrew and Daniel's comments.

- Change per-cpu scan base every 64MB per Andrew's comments.

---
include/linux/swap.h | 1 +
mm/swapfile.c | 54 ++++++++++++++++++++++++++++++++++++++++----
2 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b42fb47d8cbe..e96820fb7472 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -252,6 +252,7 @@ struct swap_info_struct {
unsigned int inuse_pages; /* number of those currently in use */
unsigned int cluster_next; /* likely index for next allocation */
unsigned int cluster_nr; /* countdown to next cluster search */
+ unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
struct rb_root swap_extent_root;/* root of the swap extent rbtree */
struct block_device *bdev; /* swap device or bdev of swap file */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 423c234aca15..f5e3ab06bf18 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -615,7 +615,8 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
* discarding, do discard now and reclaim them
*/
swap_do_scheduled_discard(si);
- *scan_base = *offset = si->cluster_next;
+ *scan_base = this_cpu_read(*si->cluster_next_cpu);
+ *offset = *scan_base;
goto new_cluster;
} else
return false;
@@ -721,6 +722,34 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
}
}

+static void set_cluster_next(struct swap_info_struct *si, unsigned long next)
+{
+ unsigned long prev;
+
+ if (!(si->flags & SWP_SOLIDSTATE)) {
+ si->cluster_next = next;
+ return;
+ }
+
+ prev = this_cpu_read(*si->cluster_next_cpu);
+ /*
+ * Cross the swap address space size aligned trunk, choose
+ * another trunk randomly to avoid lock contention on swap
+ * address space if possible.
+ */
+ if ((prev >> SWAP_ADDRESS_SPACE_SHIFT) !=
+ (next >> SWAP_ADDRESS_SPACE_SHIFT)) {
+ /* No free swap slots available */
+ if (si->highest_bit <= si->lowest_bit)
+ return;
+ next = si->lowest_bit +
+ prandom_u32_max(si->highest_bit - si->lowest_bit + 1);
+ next = ALIGN(next, SWAP_ADDRESS_SPACE_PAGES);
+ next = max_t(unsigned int, next, si->lowest_bit);
+ }
+ this_cpu_write(*si->cluster_next_cpu, next);
+}
+
static int scan_swap_map_slots(struct swap_info_struct *si,
unsigned char usage, int nr,
swp_entry_t slots[])
@@ -745,7 +774,16 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
*/

si->flags += SWP_SCANNING;
- scan_base = offset = si->cluster_next;
+ /*
+ * Use percpu scan base for SSD to reduce lock contention on
+ * cluster and swap cache. For HDD, sequential access is more
+ * important.
+ */
+ if (si->flags & SWP_SOLIDSTATE)
+ scan_base = this_cpu_read(*si->cluster_next_cpu);
+ else
+ scan_base = si->cluster_next;
+ offset = scan_base;

/* SSD algorithm */
if (si->cluster_info) {
@@ -834,7 +872,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
unlock_cluster(ci);

swap_range_alloc(si, offset, 1);
- si->cluster_next = offset + 1;
slots[n_ret++] = swp_entry(si->type, offset);

/* got enough slots or reach max slots? */
@@ -883,6 +920,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
}

done:
+ set_cluster_next(si, offset + 1);
si->flags -= SWP_SCANNING;
return n_ret;

@@ -2827,6 +2865,11 @@ static struct swap_info_struct *alloc_swap_info(void)
p = kvzalloc(struct_size(p, avail_lists, nr_node_ids), GFP_KERNEL);
if (!p)
return ERR_PTR(-ENOMEM);
+ p->cluster_next_cpu = alloc_percpu(unsigned int);
+ if (!p->cluster_next_cpu) {
+ kvfree(p);
+ return ERR_PTR(-ENOMEM);
+ }

spin_lock(&swap_lock);
for (type = 0; type < nr_swapfiles; type++) {
@@ -3202,7 +3245,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
* select a random position to start with to help wear leveling
* SSD
*/
- p->cluster_next = 1 + prandom_u32_max(p->highest_bit);
+ for_each_possible_cpu(cpu) {
+ per_cpu(*p->cluster_next_cpu, cpu) =
+ 1 + prandom_u32_max(p->highest_bit);
+ }
nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);

cluster_info = kvcalloc(nr_cluster, sizeof(*cluster_info),
--
2.26.2


2020-05-21 02:55:13

by Andrew Morton

Subject: Re: [PATCH -V2] swap: Reduce lock contention on swap cache from swap slots allocation

On Wed, 20 May 2020 11:15:02 +0800 Huang Ying <[email protected]> wrote:

> [...]
>
> To test the patch, we have run 16-process pmbench memory benchmark on
> a 2-socket server machine with 48 cores. One ram disk is configured

What does "ram disk" mean here? Which driver(s) are in use and backed
by what sort of memory?

> as the swap device per socket. The pmbench working-set size is much
> larger than the available memory so that swapping is triggered. The
> memory read/write ratio is 80/20 and the accessing pattern is random.
> [...]
>
> The lock contention on the swap cache is almost eliminated.
>
> And the pmbench score increases 18.5%. The swapin throughput
> increases 18.7% from 2.96 GB/s to 3.51 GB/s. While the swapout
> throughput increases 18.5% from 2.99 GB/s to 3.54 GB/s.

If this was backed by plain old RAM, can we assume that the performance
improvement on SSD swap is still good?

Does the ram disk actually set SWP_SOLIDSTATE?

2020-05-21 03:26:37

by Huang, Ying

Subject: Re: [PATCH -V2] swap: Reduce lock contention on swap cache from swap slots allocation

Andrew Morton <[email protected]> writes:

> On Wed, 20 May 2020 11:15:02 +0800 Huang Ying <[email protected]> wrote:
>
>> [...]
>>
>> To test the patch, we have run 16-process pmbench memory benchmark on
>> a 2-socket server machine with 48 cores. One ram disk is configured
>
> What does "ram disk" mean here? Which driver(s) are in use and backed
> by what sort of memory?

We use the following kernel command line

memmap=48G!6G memmap=48G!68G

to create two DRAM-backed /dev/pmem disks (48 GB each). Then we use
these pmem disks as swap devices.

>> as the swap device per socket. The pmbench working-set size is much
>> larger than the available memory so that swapping is triggered. The
>> memory read/write ratio is 80/20 and the accessing pattern is random.
>> [...]
>>
>> The lock contention on the swap cache is almost eliminated.
>>
>> And the pmbench score increases 18.5%. The swapin throughput
>> increases 18.7% from 2.96 GB/s to 3.51 GB/s. While the swapout
>> throughput increases 18.5% from 2.99 GB/s to 3.54 GB/s.
>
> If this was backed by plain old RAM, can we assume that the performance
> improvement on SSD swap is still good?

We need a really fast disk to show the benefit. I have tried this on
2 Intel P3600 NVMe disks; the performance improvement is only about
1%. The improvement should be better on faster disks, such as Intel
Optane. I will try to find some to test.

> Does the ram disk actually set SWP_SOLIDSTATE?

Yes. "blk_queue_flag_set(QUEUE_FLAG_NONROT, q)" is called in
drivers/nvdimm/pmem.c.

Best Regards,
Huang, Ying

2020-05-21 13:40:21

by Daniel Jordan

Subject: Re: [PATCH -V2] swap: Reduce lock contention on swap cache from swap slots allocation

On Wed, May 20, 2020 at 11:15:02AM +0800, Huang Ying wrote:
> @@ -2827,6 +2865,11 @@ static struct swap_info_struct *alloc_swap_info(void)
> p = kvzalloc(struct_size(p, avail_lists, nr_node_ids), GFP_KERNEL);
> if (!p)
> return ERR_PTR(-ENOMEM);
> + p->cluster_next_cpu = alloc_percpu(unsigned int);
> + if (!p->cluster_next_cpu) {
> + kvfree(p);
> + return ERR_PTR(-ENOMEM);
> + }

There should be free_percpu()s at two places after this, but I think the
allocation really belongs right...

> @@ -3202,7 +3245,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> * select a random position to start with to help wear leveling
> * SSD
> */
> - p->cluster_next = 1 + prandom_u32_max(p->highest_bit);

...here because then it's only allocated when it's actually used.

> + for_each_possible_cpu(cpu) {
> + per_cpu(*p->cluster_next_cpu, cpu) =
> + 1 + prandom_u32_max(p->highest_bit);
> + }
> nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
>
> cluster_info = kvcalloc(nr_cluster, sizeof(*cluster_info),

2020-05-22 06:00:43

by Huang, Ying

Subject: Re: [PATCH -V2] swap: Reduce lock contention on swap cache from swap slots allocation

Daniel Jordan <[email protected]> writes:

> On Wed, May 20, 2020 at 11:15:02AM +0800, Huang Ying wrote:
>> @@ -2827,6 +2865,11 @@ static struct swap_info_struct *alloc_swap_info(void)
>> p = kvzalloc(struct_size(p, avail_lists, nr_node_ids), GFP_KERNEL);
>> if (!p)
>> return ERR_PTR(-ENOMEM);
>> + p->cluster_next_cpu = alloc_percpu(unsigned int);
>> + if (!p->cluster_next_cpu) {
>> + kvfree(p);
>> + return ERR_PTR(-ENOMEM);
>> + }
>
> There should be free_percpu()s at two places after this, but I think the
> allocation really belongs right...
>
>> @@ -3202,7 +3245,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>> * select a random position to start with to help wear leveling
>> * SSD
>> */
>> - p->cluster_next = 1 + prandom_u32_max(p->highest_bit);
>
> ...here because then it's only allocated when it's actually used.

Good catch! And yes, this is the better place to allocate memory. I
will fix this in the new version! Thanks a lot!

Best Regards,
Huang, Ying

>> + for_each_possible_cpu(cpu) {
>> + per_cpu(*p->cluster_next_cpu, cpu) =
>> + 1 + prandom_u32_max(p->highest_bit);
>> + }
>> nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
>>
>> cluster_info = kvcalloc(nr_cluster, sizeof(*cluster_info),

2020-06-24 03:33:11

by Huang, Ying

Subject: Re: [PATCH -V2] swap: Reduce lock contention on swap cache from swap slots allocation

"Huang, Ying" <[email protected]> writes:

> Andrew Morton <[email protected]> writes:
>
>> On Wed, 20 May 2020 11:15:02 +0800 Huang Ying <[email protected]> wrote:
>>
>>> [...]
>>> To test the patch, we have run 16-process pmbench memory benchmark on
>>> a 2-socket server machine with 48 cores. One ram disk is configured
>>
>> What does "ram disk" mean here? Which driver(s) are in use and backed
>> by what sort of memory?
>
> We use the following kernel command line
>
> memmap=48G!6G memmap=48G!68G
>
> to create 2 DRAM based /dev/pmem disks (48GB each). Then we use these
> ram disks as swap devices.
>
>>> as the swap device per socket. The pmbench working-set size is much
>>> larger than the available memory so that swapping is triggered. The
>>> memory read/write ratio is 80/20 and the accessing pattern is random.
>>> [...]
>>> And the pmbench score increases 18.5%. The swapin throughput
>>> increases 18.7% from 2.96 GB/s to 3.51 GB/s. While the swapout
>>> throughput increases 18.5% from 2.99 GB/s to 3.54 GB/s.
>>
>> If this was backed by plain old RAM, can we assume that the performance
>> improvement on SSD swap is still good?
>
> We need really fast disk to show the benefit. I have tried this on 2
> Intel P3600 NVMe disks. The performance improvement is only about 1%.
> The improvement should be better on the faster disks, such as Intel
> Optane disk. I will try to find some to test.

I finally found 2 Intel Optane disks to test with. The pmbench
throughput (page accesses per second) increases ~1.7% with the patch.
The swapin throughput increases ~2% (from ~1.36 GB/s to ~1.39 GB/s)
and the swapout throughput increases ~1.7% (from ~1.61 GB/s to ~1.63
GB/s). Perf profiling shows that the CPU cycles spent on the swap
cache radix tree spinlock are reduced from ~1.76% to nearly 0. So the
performance difference is much smaller than on the pmem disks, but
still measurable.

Best Regards,
Huang, Ying