2024-02-15 00:44:28

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2] mm/swap: fix race when skipping swapcache

Hi Kairui,

On Tue, Feb 13, 2024 at 03:53:09AM +0800, Kairui Song wrote:
> On Fri, Feb 9, 2024 at 1:30 PM Kairui Song <[email protected]> wrote:
> >
> > On Fri, Feb 9, 2024 at 3:42 AM Chris Li <[email protected]> wrote:
> > >
> > > On Thu, Feb 8, 2024 at 11:01 AM Kairui Song <[email protected]> wrote:
> > > >
> > > > On Thu, Feb 8, 2024 at 2:36 PM Huang, Ying <[email protected]> wrote:
> > > > >
> > > > > Kairui Song <[email protected]> writes:
> > > > >
> > > > > > On Thu, Feb 8, 2024 at 2:31 AM Minchan Kim <[email protected]> wrote:
> > > > > >>
> > > > > >> On Wed, Feb 07, 2024 at 12:06:15PM +0800, Kairui Song wrote:
> > > > >
> > > > > [snip]
> > > > >
> > > > > >> >
> > > > > >> > So I think the thing is, it's getting complex because this patch
> > > > > >> > wanted to make it simple and just reuse the swap cache flags.
> > > > > >>
> > > > > >> I agree that a simple fix would be the important at this point.
> > > > > >>
> > > > > >> Considering your description, here's my understanding of the other idea:
> > > > > >> Other method, such as increasing the swap count, haven't proven effective
> > > > > >> in your tests. The approach risk forcing racers to rely on the swap cache
> > > > > >> again and the potential performance loss in race scenario.
> > > > > >>
> > > > > >> While I understand that simplicity is important, and performance loss
> > > > > >> in this case may be infrequent, I believe swap_count approach could be a
> > > > > >> suitable solution. What do you think?
> > > > > >
> > > > > > Hi Minchan
> > > > > >
> > > > > > Yes, my main concern was about simplicity and performance.
> > > > > >
> > > > > > Increasing swap_count here will also race with another process from
> > > > > > releasing swap_count to 0 (swapcache was able to sync callers in other
> > > > > > call paths but we skipped swapcache here).
> > > > >
> > > > > What is the consequence of the race condition?
> > > >
> > > > Hi Ying,
> > > >
> > > > It will increase the swap count of an already freed entry, this race
> > > > with multiple swap free/alloc logic that checks if count ==
> > > > SWAP_HAS_CACHE or sets count to zero, or repeated free of an entry,
> > > > all result in random corruption of the swap map. This happens a lot
> > > > during stress testing.
> > >
> > > In theory, the original code before your patch can get into a
> > > situation similar to what you try to avoid.
> > > CPU1 enters the do_swap_page() with entry swap count == 1 and
> > > continues handling the swap fault without swap cache. Then some
> > > operation bumps up the swap entry count and CPU2 enters the
> > > do_swap_page() racing with CPU1 with swap count == 2. CPU2 will need
> > > to go through the swap cache case. We still need to handle this
> > > situation correctly.
> >
> > Hi Chris,
> >
> > This won't happen, nothing can bump the swap entry count unless it's
> > swapped in and freed. There are only two places that call
> > swap_duplicate: unmap or fork, unmap need page mapped and entry alloc,
> > so it won't happen unless we hit the entry reuse issue. Fork needs the
> > VMA lock which we hold it during page fault.
> >
> > > So the complexity is already there.
> > >
> > > If we can make sure the above case works correctly, then one way to
> > > avoid the problem is just send the CPU2 to use the swap cache (without
> > > the swap cache by-passing).
> >
> > Yes, more auditing of existing code and explanation is needed to
> > ensure things won't go wrong, that's the reason I tried to avoid
> > things from going too complex...
> >
> > > > > > So the right step is: 1. Lock the cluster/swap lock; 2. Check if still
> > > > > > have swap_count == 1, bail out if not; 3. Set it to 2;
> > > > > > __swap_duplicate can be modified to support this, it's similar to
> > > > > > existing logics for SWAP_HAS_CACHE.
> > > > > >
> > > > > > And swap freeing path will do more things, swapcache clean up needs to
> > > > > > be handled even in the bypassing path since the racer may add it to
> > > > > > swapcache.
> > > > > >
> > > > > > Reusing SWAP_HAS_CACHE seems to make it much simpler and avoided many
> > > > > > overhead, so I used that way in this patch, the only issue is
> > > > > > potentially repeated page faults now.
> > > > > >
> > > > > > I'm currently trying to add a SWAP_MAP_LOCK (or SWAP_MAP_SYNC, I'm bad
> > > > > > at naming it) special value, so any racer can just spin on it to avoid
> > > > > > all the problems, how do you think about this?
> > > > >
> > > > > Let's try some simpler method firstly.
> > > >
> > > > Another simpler idea is, add a schedule() or
> > > > schedule_timeout_uninterruptible(1) in the swapcache_prepare failure
> > > > path before goto out (just like __read_swap_cache_async). I think this
> > > > should ensure in almost all cases, PTE is ready after it returns, also
> > > > yields more CPU.
> > >
> > > I mentioned in my earlier email and Ying points out that as well.
> > > Looping waiting inside do_swap_page() is bad because it is holding
> > > other locks.
> >
> > It's not looping here though, just a tiny delay, since
> > SWP_SYNCHRONOUS_IO is supposed to be super fast devices so a tiny
> > delay should be enough.
> >
> > > Sorry I started this idea but it seems no good.
> >
> > Not at all, more reviewing helps to find a better solution :)
> >
> > > If we can have CPU2 make forward progress without retrying
> > > the page fault would be the best, if possible.
> >
> > Yes, making CPU2 fall back to cached swapin path is doable after
> > careful auditing. Just CPU2 is usually slower than CPU1 due to cache
> > and timing, so what it does will most likely be in vain and need to be
> > reverted, causing more work for both code logic and CPU. The case of
> > the race (CPU2 went faster) is very rare.
> >
> > I'm not against the idea of bump count, it's better if things that can
> > be done without introducing too much noise. Will come back after more
> > tests and work on this.
>
> Hi,
>
> After some more testing, I still think bump swap count is a bad idea
> here. Besides the issues I mentioned above, adding folio into cache is
> fundamentally in conflict with the problem we are trying to solve:
> folio in swapcache can be swapped out without allocating a new entry,
> so the entry reuse issue will be much more serious.
>
> If we want to avoid this we have to release the cache unconditionally
> before folio unlock because it can't tell if there is a parallel cache
> bypassing swapin going on or not. This is unacceptable, it breaks
> normal usage of swap cache. There are also other potential issues like
> read ahead, race causing stalled folio remain in swap cache, so I dont
> think this is a good approach.

I also played the swap refcount stuff ideas and encoutered a lot
problems(e.g., kernel crash and/or data missing).

Now I realize your original solution would be best to fix under such a
small change.

Folks, please chime in if you have another idea.


>
> Instead if we add a schedule or schedule_timeout_uninterruptible(1),

How much your workload is going If we use schedule instead of
schedule_timeout_uninterruptible(1)? If that doesn't increase the
statistics a lot, I prefer the schedule.
(TBH, I didn't care much about the stat since it would be
very rare).


> it seems quite enough to avoid repeated page faults. I did following
> test with the reproducer I provided in the commit message:
>
> Using ZRAM instead of brd for more realistic workload:
> $ perf stat -e minor-faults,major-faults -I 30000 ./a.out
>
> Unpatched kernel:
> # time counts unit events
> 30.000096504 33,089 minor-faults:u
> 30.000096504 958,745 major-faults:u
> 60.000375308 22,150 minor-faults:u
> 60.000375308 1,221,665 major-faults:u
> 90.001062176 24,840 minor-faults:u
> 90.001062176 1,005,427 major-faults:u
> 120.004233268 23,634 minor-faults:u
> 120.004233268 1,045,596 major-faults:u
> 150.004530091 22,358 minor-faults:u
> 150.004530091 1,097,871 major-faults:u
>
> Patched kernel:
> # time counts unit events
> 30.000069753 280,094 minor-faults:u
> 30.000069753 1,198,871 major-faults:u
> 60.000415915 436,862 minor-faults:u
> 60.000415915 919,860 major-faults:u
> 90.000651670 359,176 minor-faults:u
> 90.000651670 955,340 major-faults:u
> 120.001088135 289,137 minor-faults:u
> 120.001088135 992,317 major-faults:u
>
> Indeed there is a huge increase of minor page faults, which indicate
> the raced path returned without handling the fault. This reproducer
> tries to maximize the race chance so the reading is more terrifying
> than usual.
>
> But after adding a schedule/schedule_timeout_uninterruptible(1) after
> swapcache_prepare failed, still using the same test:
>
> Patched kernel (adding a schedule() before goto out):
> # time counts unit events
> 30.000335901 43,066 minor-faults:u
> 30.000335901 1,163,232 major-faults:u
> 60.034629687 35,652 minor-faults:u
> 60.034629687 844,188 major-faults:u
> 90.034941472 34,957 minor-faults:u
> 90.034941472 1,218,174 major-faults:u
> 120.035255386 36,241 minor-faults:u
> 120.035255386 1,073,395 major-faults:u
> 150.035535786 33,057 minor-faults:u
> 150.035535786 1,171,684 major-faults:u
>
> Patched kernel (adding a schedule_timeout_uninterruptible(1) before goto out):
> # time counts unit events
> 30.000044452 26,748 minor-faults:u
> 30.000044452 1,202,064 major-faults:u
> 60.000317379 19,883 minor-faults:u
> 60.000317379 1,186,152 major-faults:u
> 90.000568779 18,333 minor-faults:u
> 90.000568779 1,151,015 major-faults:u
> 120.000787253 22,844 minor-faults:u
> 120.000787253 887,531 major-faults:u
> 150.001309065 18,900 minor-faults:u
> 150.001309065 1,211,894 major-faults:u
>
> The minor faults are basically the same as before, or even lower since
> other races are also reduced, with no observable performance
> regression so far.
> If the vague margin of this schedule call is still concerning, I think
> another approach is just a new swap map value for parallel cache
> bypassed swapping to loop on.

Long term, the swap map vaule would be good but for now, I prefer
your original solution with some delay with schedule.

Thanks, Kairui!


2024-02-15 19:08:16

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH v2] mm/swap: fix race when skipping swapcache

On Thu, Feb 15, 2024 at 8:44 AM Minchan Kim <[email protected]> wrote:
>
> Hi Kairui,
>

Hi, Minchan,

>
> I also played the swap refcount stuff ideas and encoutered a lot
> problems(e.g., kernel crash and/or data missing).
>
> Now I realize your original solution would be best to fix under such a
> small change.
>
> Folks, please chime in if you have another idea.
>
> >
> > Instead if we add a schedule or schedule_timeout_uninterruptible(1),
>
> How much your workload is going If we use schedule instead of
> schedule_timeout_uninterruptible(1)? If that doesn't increase the
> statistics a lot, I prefer the schedule.
> (TBH, I didn't care much about the stat since it would be
> very rare).

I've tested both schedule() and schedule_timeout_uninterruptible(1)
locally using the reproducer, and some other cases, and looked all
good.

The reproducer I provided demonstrates an extreme case, so I think
schedule() is good enough already.

I currently don't have any more benchmarks or tests, as the change is
very small for most workloads. I'll use schedule() here in V3 if no
one else has other suggestions.

I remember Barry's series about large folio swapin may change the ZRAM
read time depending on folio size, and cause strange races that's
different from the reproducer. Not sure if I can test using some known
race cases. That could be helpful to verify this fix and if schedule()
or schedule_timeout_uninterruptible(1) is better here.

> > it seems quite enough to avoid repeated page faults. I did following
> > test with the reproducer I provided in the commit message:
> >
> > Using ZRAM instead of brd for more realistic workload:
> > $ perf stat -e minor-faults,major-faults -I 30000 ./a.out
> >
> > Unpatched kernel:
> > # time counts unit events
> > 30.000096504 33,089 minor-faults:u
> > 30.000096504 958,745 major-faults:u
> > 60.000375308 22,150 minor-faults:u
> > 60.000375308 1,221,665 major-faults:u
> > 90.001062176 24,840 minor-faults:u
> > 90.001062176 1,005,427 major-faults:u
> > 120.004233268 23,634 minor-faults:u
> > 120.004233268 1,045,596 major-faults:u
> > 150.004530091 22,358 minor-faults:u
> > 150.004530091 1,097,871 major-faults:u
> >
> > Patched kernel:
> > # time counts unit events
> > 30.000069753 280,094 minor-faults:u
> > 30.000069753 1,198,871 major-faults:u
> > 60.000415915 436,862 minor-faults:u
> > 60.000415915 919,860 major-faults:u
> > 90.000651670 359,176 minor-faults:u
> > 90.000651670 955,340 major-faults:u
> > 120.001088135 289,137 minor-faults:u
> > 120.001088135 992,317 major-faults:u
> >
> > Indeed there is a huge increase of minor page faults, which indicate
> > the raced path returned without handling the fault. This reproducer
> > tries to maximize the race chance so the reading is more terrifying
> > than usual.
> >
> > But after adding a schedule/schedule_timeout_uninterruptible(1) after
> > swapcache_prepare failed, still using the same test:
> >
> > Patched kernel (adding a schedule() before goto out):
> > # time counts unit events
> > 30.000335901 43,066 minor-faults:u
> > 30.000335901 1,163,232 major-faults:u
> > 60.034629687 35,652 minor-faults:u
> > 60.034629687 844,188 major-faults:u
> > 90.034941472 34,957 minor-faults:u
> > 90.034941472 1,218,174 major-faults:u
> > 120.035255386 36,241 minor-faults:u
> > 120.035255386 1,073,395 major-faults:u
> > 150.035535786 33,057 minor-faults:u
> > 150.035535786 1,171,684 major-faults:u
> >
> > Patched kernel (adding a schedule_timeout_uninterruptible(1) before goto out):
> > # time counts unit events
> > 30.000044452 26,748 minor-faults:u
> > 30.000044452 1,202,064 major-faults:u
> > 60.000317379 19,883 minor-faults:u
> > 60.000317379 1,186,152 major-faults:u
> > 90.000568779 18,333 minor-faults:u
> > 90.000568779 1,151,015 major-faults:u
> > 120.000787253 22,844 minor-faults:u
> > 120.000787253 887,531 major-faults:u
> > 150.001309065 18,900 minor-faults:u
> > 150.001309065 1,211,894 major-faults:u
> >
> > The minor faults are basically the same as before, or even lower since
> > other races are also reduced, with no observable performance
> > regression so far.
> > If the vague margin of this schedule call is still concerning, I think
> > another approach is just a new swap map value for parallel cache
> > bypassed swapping to loop on.
>
> Long term, the swap map vaule would be good but for now, I prefer
> your original solution with some delay with schedule.

Yes, that's also what I have in mind.

With a swap map value, actually I think the cache bypassed path may
even go faster as it can simplify some swap map value change process,
but that requires quite some extra work and change, could be
introduced later as an optimization.

> Thanks, Kairui!

Thanks for the comments!