>> (4) relock the folio. (we do that already, might not want to fail)
>>
>> (4) take the PTE lock. If the PTE did not change, turn it into a present
>> PTE entry. Otherwise, cleanup.
>
> Very interesting idea!
>
> I'm just not sure what actual benefit it brings. The only concern
> about reusing swapcache_prepare so far is repeated page faults that
> may hurt performance or statistics, this issue is basically gone after
> adding a schedule().
I think you know that slapping in random schedule() calls is not a
proper way to wait for an event to happen :) It's pretty much
unpredictable how long the schedule() will take and if there even is
anything to schedule to!
With what I propose, just like with page migration, you really do wait
for the event (page read and unlocked, only the PTE has to be updated)
to happen before you try anything else.
Now, the difference is most likely that the case here happens much less
frequently than page migration. Still, you could have all threads
faulting one the same page and all would do the same dance here.
>
> We can't drop all the operations around swap cache and map anyway. It
> doesn't know if it should skip the swapcache until swapcache lookup
> and swap count look up are all done. So I think it can be done more
> naturally here with a special value, making things simpler, robust,
> and improving performance a bit more.
>
The issue will always be that someone can zap the PTE concurrently,
which would free up the swap cache. With what I propose, that cannot
happen in the sync swapin case we are discussing here.
If someone where to zap the PTE in the meantime, it would only remove
the special non-swap entry, indicating to swapin code that concurrent
zapping happened. The swapin code would handle the swapcache freeing
instead, after finishing reading the content.
So the swapcache would not get freed concurrently anymore if I am not wrong.
At least the theory, I didn't prototype it yet.
> And in another series [1] I'm working on making shmem make use of
> cache bypassed swapin as well, following this approach I'll have to
> implement another shmem map based synchronization too.
>
I'd have to look further into that, if something like that could
similarly apply to shmem. But yes, it's no using PTEs, so a PTE-based
sync mechanism does definitely not apply..
> After all it's only a rare race, I think a simpler solution might be better.
I'm not sure that simpler means better here. Simpler might be better for
a backport, though.
The "use schedule() to wait" looks odd, maybe it's a common primitive
that I simply didn't stumble over yet. (I doubt it but it could be possible)
--
Cheers,
David / dhildenb