2024-02-20 01:31:58

by Andrew Morton

Subject: Re: [PATCH v4] mm/swap: fix race when skipping swapcache

On Mon, 19 Feb 2024 16:20:40 +0800 Kairui Song <[email protected]> wrote:

> From: Kairui Song <[email protected]>
>
> When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads
> swapin the same entry at the same time, they get different pages (A, B).
> Before one thread (T0) finishes the swapin and installs page (A)
> to the PTE, another thread (T1) could finish swapin of page (B),
> swap_free the entry, then swap out the possibly modified page
> reusing the same entry. It breaks the pte_same check in (T0) because
> the PTE value is unchanged, causing an ABA problem. Thread (T0) will
> install a stale page (A) into the PTE and cause data corruption.
>
> @@ -3867,6 +3868,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	if (!folio) {
>  		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>  		    __swap_count(entry) == 1) {
> +			/*
> +			 * Prevent parallel swapin from proceeding with
> +			 * the cache flag. Otherwise, another thread may
> +			 * finish swapin first, free the entry, and swapout
> +			 * reusing the same entry. It's undetectable as
> +			 * pte_same() returns true due to entry reuse.
> +			 */
> +			if (swapcache_prepare(entry)) {
> +				/* Relax a bit to prevent rapid repeated page faults */
> +				schedule_timeout_uninterruptible(1);

Well this is unpleasant. How often can we expect this to occur?

> +				goto out;
> +			}
> +			need_clear_cache = true;
> +
>  			/* skip swapcache */
>  			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>  						vma, vmf->address, false);
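
The interleaving in the patch description can be replayed deterministically in userspace. The sketch below is not kernel code: `struct sim`, `install()`, `swap_out()` and `replay_race()` are hypothetical names for a model with one swap slot and one PTE, and `cache_flag` only stands in for the effect of a successful swapcache_prepare(). It shows why an equality check on the PTE value cannot detect that the entry was freed and reused in between.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model: one swap slot, one PTE. No kernel APIs. */
enum { ENTRY = 42 };          /* arbitrary swap entry value */

struct sim {
	int  slot_data;       /* data stored in the swap slot */
	int  pte_entry;       /* ENTRY while swapped out, 0 once a page is mapped */
	int  mapped_data;     /* data in the page the PTE maps */
	bool cache_flag;      /* stands in for the flag taken by swapcache_prepare() */
};

/* Install a page if the PTE still holds the entry seen at fault time. */
static bool install(struct sim *s, int seen_entry, int page)
{
	if (s->pte_entry != seen_entry)   /* models the pte_same() check */
		return false;
	s->pte_entry = 0;
	s->mapped_data = page;
	return true;
}

/* Swap the mapped page out again, reusing the same entry value. */
static void swap_out(struct sim *s)
{
	s->slot_data = s->mapped_data;
	s->pte_entry = ENTRY;
}

/* Replay one fixed interleaving; returns the data finally mapped. */
int replay_race(bool use_flag)
{
	struct sim s = { .slot_data = 1, .pte_entry = ENTRY };

	/* T0: start the fault and read the slot into private page A. */
	if (use_flag)
		s.cache_flag = true;
	int page_a = s.slot_data;

	/* T1: faults on the same entry while T0 is still in flight. */
	bool t1_backs_off = use_flag && s.cache_flag;
	if (!t1_backs_off) {
		int page_b = s.slot_data;
		bool ok = install(&s, ENTRY, page_b);
		assert(ok);
		(void)ok;
		s.mapped_data = 2;    /* the page is modified ... */
		swap_out(&s);         /* ... and swapped out to the same entry */
	}

	/* T0: the PTE value compares equal, so page A is installed. */
	install(&s, ENTRY, page_a);
	s.cache_flag = false;

	/* With the flag, the write lands after T0 has mapped the page. */
	if (t1_backs_off)
		s.mapped_data = 2;

	return s.mapped_data;
}
```

In this model replay_race(false) returns 1, the stale contents of page A, while replay_race(true) returns 2 because T1 backs off until T0 has finished.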



2024-02-20 03:42:38

by Kairui Song

Subject: Re: [PATCH v4] mm/swap: fix race when skipping swapcache

On Tue, Feb 20, 2024 at 9:31 AM Andrew Morton <[email protected]> wrote:
>
> On Mon, 19 Feb 2024 16:20:40 +0800 Kairui Song <[email protected]> wrote:
>
> > From: Kairui Song <[email protected]>
> >
> > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads
> > swapin the same entry at the same time, they get different pages (A, B).
> > Before one thread (T0) finishes the swapin and installs page (A)
> > to the PTE, another thread (T1) could finish swapin of page (B),
> > swap_free the entry, then swap out the possibly modified page
> > reusing the same entry. It breaks the pte_same check in (T0) because
> > the PTE value is unchanged, causing an ABA problem. Thread (T0) will
> > install a stale page (A) into the PTE and cause data corruption.
> >
> > @@ -3867,6 +3868,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  	if (!folio) {
> >  		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >  		    __swap_count(entry) == 1) {
> > +			/*
> > +			 * Prevent parallel swapin from proceeding with
> > +			 * the cache flag. Otherwise, another thread may
> > +			 * finish swapin first, free the entry, and swapout
> > +			 * reusing the same entry. It's undetectable as
> > +			 * pte_same() returns true due to entry reuse.
> > +			 */
> > +			if (swapcache_prepare(entry)) {
> > +				/* Relax a bit to prevent rapid repeated page faults */
> > +				schedule_timeout_uninterruptible(1);
>
> Well this is unpleasant. How often can we expect this to occur?
>

The chance is very low. Using the current mainline kernel and ZRAM,
even with threads set to race on purpose using the reproducer I
provided, it occurred 1528 times in 647132 page faults (~0.2%).

If I run MySQL and sysbench with 128 threads and a 16G buffer pool, with
a 6G cgroup limit and 32G ZRAM, it occurred 1372 times in 40 min, out of
109930201 page faults in total (~0.001%).
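
The two percentages follow directly from the counts quoted above. A minimal sketch of the arithmetic; `backoff_percent()` and `print_backoff_rate()` are hypothetical helper names, not part of the patch or the reproducer:

```c
#include <stdio.h>

/* Percentage of page faults that took the back-off path. */
double backoff_percent(unsigned long hits, unsigned long faults)
{
	return 100.0 * (double)hits / (double)faults;
}

/* Print one labelled measurement with its hit rate. */
void print_backoff_rate(const char *label, unsigned long hits,
			unsigned long faults)
{
	printf("%s: %lu / %lu faults = %.4f%%\n",
	       label, hits, faults, backoff_percent(hits, faults));
}
```

Calling print_backoff_rate("reproducer", 1528, 647132) and print_backoff_rate("sysbench", 1372, 109930201) gives roughly 0.24% and 0.0012%, matching the rounded figures in the mail.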

2024-02-20 04:01:32

by Barry Song

Subject: Re: [PATCH v4] mm/swap: fix race when skipping swapcache

On Tue, Feb 20, 2024 at 4:42 PM Kairui Song <[email protected]> wrote:
>
> On Tue, Feb 20, 2024 at 9:31 AM Andrew Morton <[email protected]> wrote:
> >
> > On Mon, 19 Feb 2024 16:20:40 +0800 Kairui Song <[email protected]> wrote:
> >
> > > From: Kairui Song <[email protected]>
> > >
> > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads
> > > swapin the same entry at the same time, they get different pages (A, B).
> > > Before one thread (T0) finishes the swapin and installs page (A)
> > > to the PTE, another thread (T1) could finish swapin of page (B),
> > > swap_free the entry, then swap out the possibly modified page
> > > reusing the same entry. It breaks the pte_same check in (T0) because
> > > the PTE value is unchanged, causing an ABA problem. Thread (T0) will
> > > install a stale page (A) into the PTE and cause data corruption.
> > >
> > > @@ -3867,6 +3868,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >  	if (!folio) {
> > >  		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > >  		    __swap_count(entry) == 1) {
> > > +			/*
> > > +			 * Prevent parallel swapin from proceeding with
> > > +			 * the cache flag. Otherwise, another thread may
> > > +			 * finish swapin first, free the entry, and swapout
> > > +			 * reusing the same entry. It's undetectable as
> > > +			 * pte_same() returns true due to entry reuse.
> > > +			 */
> > > +			if (swapcache_prepare(entry)) {
> > > +				/* Relax a bit to prevent rapid repeated page faults */
> > > +				schedule_timeout_uninterruptible(1);
> >
> > Well this is unpleasant. How often can we expect this to occur?
> >
>
> The chance is very low. Using the current mainline kernel and ZRAM,
> even with threads set to race on purpose using the reproducer I
> provided, it occurred 1528 times in 647132 page faults (~0.2%).
>
> If I run MySQL and sysbench with 128 threads and a 16G buffer pool, with
> a 6G cgroup limit and 32G ZRAM, it occurred 1372 times in 40 min, out of
> 109930201 page faults in total (~0.001%).

It might not be a problem for throughput, but for real-time and tail
latency, this hurts. For example, it might increase the number of dropped
UI frames, which is an important metric for evaluating performance :-)

BTW, I wonder if Ying's previous proposal, moving swapcache_prepare()
after swap_read_folio(), would help decrease the number further?

Thanks
Barry
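
To put the tail-latency concern in numbers: the argument to schedule_timeout_uninterruptible() is in jiffies, so the nominal length of the one-tick back-off depends on the kernel tick rate. The sketch below is a rough userspace calculation only. `ticks_to_ms()` and `frame_budget_fraction()` are hypothetical helpers, and the tick rates and the 60 Hz display used in the note after it are illustrative assumptions, not figures from this thread.

```c
/* Nominal duration in milliseconds of `ticks` jiffies at tick rate `hz`. */
double ticks_to_ms(unsigned long ticks, unsigned int hz)
{
	return 1000.0 * (double)ticks / (double)hz;
}

/* Fraction of one display frame consumed by a sleep of `sleep_ms`. */
double frame_budget_fraction(double sleep_ms, double refresh_hz)
{
	double frame_ms = 1000.0 / refresh_hz;

	return sleep_ms / frame_ms;
}
```

With an assumed tick rate of 250, ticks_to_ms(1, 250) is 4 ms, about a quarter of the roughly 16.7 ms frame at 60 Hz; at a tick rate of 100 it is 10 ms, more than half a frame. The actual delay can differ, since it also depends on where in the tick the sleep starts and on when the task is scheduled again.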