From: "Huang, Ying"
To: Kairui Song
Cc: linux-mm@kvack.org, Chris Li, Minchan Kim, Barry Song, Ryan Roberts,
 Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed, Johannes Weiner,
 Matthew Wilcox, Nhat Pham, Chengming Zhou, Andrew Morton,
 linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 10/10] mm/swap: optimize synchronous swapin
In-Reply-To: (Kairui Song's message of "Wed, 27 Mar 2024 15:14:03 +0800")
References: <20240326185032.72159-1-ryncsn@gmail.com>
 <20240326185032.72159-11-ryncsn@gmail.com>
 <87zfukmbwz.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87r0fwmar4.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Wed, 27 Mar 2024 16:16:30 +0800
Message-ID: <87il18m6n5.fsf@yhuang6-desk2.ccr.corp.intel.com>

Kairui Song writes:

> On Wed, Mar 27, 2024 at 2:49 PM Huang, Ying wrote:
>>
>> Kairui Song writes:
>>
>> > On Wed, Mar 27, 2024 at 2:24 PM Huang, Ying wrote:
>> >>
>> >> Kairui Song writes:
>> >>
>> >> > From: Kairui Song
>> >> >
>> >> > Interestingly, the major performance overhead of synchronous swapin
>> >> > is actually from the workingset node updates; that's because
>> >> > synchronous swapin
>> >>
>> >> If it's the major overhead, why not make it the first optimization?
>> >
>> > This performance issue became much more obvious after doing the other
>> > optimizations, and the other optimizations are for general swapin, not
>> > only for synchronous swapin; that's also how I optimized things step
>> > by step, so I kept my patch order...
>> >
>> > And it is easier to do this after Patch 8/10, which introduces the new
>> > interface for the swap cache.
>> >
>> >>
>> >> > keeps adding single folios into an xa_node, making the node no
>> >> > longer a shadow node, so it has to be removed from shadow_nodes;
>> >> > then the folio is removed very shortly after, making the node a
>> >> > shadow node again, so it has to be added back to shadow_nodes.
>> >>
>> >> The folio is removed only if should_try_to_free_swap() returns true?
>> >>
>> >> > Mark synchronous swapin folios with a special bit in the swap entry
>> >> > embedded in folio->swap, as we still have some usable bits there.
>> >> > Skip the workingset node update on insertion of such a folio,
>> >> > because it will be removed very quickly; its removal will trigger
>> >> > the update, keeping the workingset info eventually consistent.
>> >>
>> >> Is this safe?  Is it possible for the shadow node to be reclaimed
>> >> after the folio is added into the node and before it is removed?
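(For reference, the bookkeeping being churned here is the workingset
node update callback.  A simplified sketch, following mm/workingset.c
with the memcg/lruvec accounting and locking assertions elided:

	static void workingset_update_node_sketch(struct xa_node *node)
	{
		if (node->count && node->count == node->nr_values) {
			/* Node now holds only shadow entries: put it on
			 * the reclaimable shadow_nodes list_lru. */
			if (list_empty(&node->private_list))
				list_lru_add_obj(&shadow_nodes,
						 &node->private_list);
		} else {
			/* A folio was inserted: take the node back off. */
			if (!list_empty(&node->private_list))
				list_lru_del_obj(&shadow_nodes,
						 &node->private_list);
		}
	}

Each swapin-then-delete cycle drives a node through both branches, so
the list_lru lock is taken twice per folio.)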
>> >
>> > If an xa_node contains any non-shadow entry, it can't be reclaimed;
>> > shadow_lru_isolate() will check and skip such nodes in case of a
>> > race.
>>
>> In shadow_lru_isolate(),
>>
>>         /*
>>          * The nodes should only contain one or more shadow entries,
>>          * no pages, so we expect to be able to remove them all and
>>          * delete and free the empty node afterwards.
>>          */
>>         if (WARN_ON_ONCE(!node->nr_values))
>>                 goto out_invalid;
>>         if (WARN_ON_ONCE(node->count != node->nr_values))
>>                 goto out_invalid;
>>
>> So, this isn't considered normal and will cause a warning now.
>
> Yes, I added an exception for that in this patch:
>
> -       if (WARN_ON_ONCE(node->count != node->nr_values))
> +       if (WARN_ON_ONCE(node->count != node->nr_values &&
> +                        mapping->host != NULL))
>
> The code is not a good final solution, but the idea might not be that
> bad.  list_lru provides many operations like LRU_ROTATE; we could even
> lazily remove all the nodes as a general optimization, or add a
> threshold for adding/removing a node from the LRU.

We can compare different solutions.  For this one, we still need to
deal with the cases where the folio isn't removed from the swap cache,
that is, where should_try_to_free_swap() returns false.

>> >> If so, we may consider some other methods.  Make shadow_nodes
>> >> per-CPU?
>> >
>> > That's also an alternative solution if there are other risks.
>>
>> This appears to be a more general and cleaner optimization.
>
> I'm not sure whether synchronization between CPUs would add more
> overhead, because shadow nodes are globally shared and one node can be
> referenced by multiple CPUs.  I can have a try to see if this is
> doable.  Maybe a per-CPU batch is better, but synchronization might
> still be an issue.

Yes.  Per-CPU shadow_nodes needs to find the list from the shadow node,
which has some overhead.  If lock contention on the list_lru lock is
the root cause, we can use hashed shadow node lists instead.  That can
reduce lock contention effectively.

--
Best Regards,
Huang, Ying
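P.S. For concreteness, a rough sketch of the hashed shadow node lists
idea; the shift, array, and helper names below are hypothetical and
untested, and reclaim would have to scan every list:

	/* Shard the global shadow_nodes list_lru by a hash of the node
	 * address, so that concurrent updaters usually contend on
	 * different list_lru locks. */
	#define SHADOW_NODES_SHIFT	4

	static struct list_lru shadow_nodes[1 << SHADOW_NODES_SHIFT];

	static struct list_lru *node_shadow_lru(struct xa_node *node)
	{
		return &shadow_nodes[hash_ptr(node, SHADOW_NODES_SHIFT)];
	}

All adds, deletes, and isolations for a given node would then go
through node_shadow_lru(node) instead of a single global list.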