From: "Huang, Ying"
To: Barry Song <21cnbao@gmail.com>
Cc: Ryan Roberts, Matthew Wilcox, akpm@linux-foundation.org,
	linux-mm@kvack.org, chengming.zhou@linux.dev, chrisl@kernel.org,
	david@redhat.com, hannes@cmpxchg.org, kasong@tencent.com,
	linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
	mhocko@suse.com, nphamcs@gmail.com, shy828301@gmail.com,
	steven.price@arm.com, surenb@google.com, wangkefeng.wang@huawei.com,
	xiang@kernel.org, yosryahmed@google.com, yuzhao@google.com,
	Chuanhua Han, Barry Song
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole
In-Reply-To: (Barry Song's message of "Wed, 20 Mar 2024 15:47:50 +1300")
References: <20240304081348.197341-1-21cnbao@gmail.com>
	<20240304081348.197341-6-21cnbao@gmail.com>
	<87wmq3yji6.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87sf0rx3d6.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87jzm0wblq.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<9ec62266-26f1-46b6-8bb7-9917d04ed04e@arm.com>
	<87jzlyvar3.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87zfutsl25.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Wed, 20 Mar 2024 14:20:38 +0800
Message-ID: <87msqts9u1.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)

Barry Song <21cnbao@gmail.com> writes:

> On Wed, Mar 20, 2024 at 3:20 PM Huang, Ying wrote:
>>
>> Ryan Roberts writes:
>>
>> > On 19/03/2024 09:20, Huang, Ying wrote:
>> >> Ryan Roberts writes:
>> >>
>> >>>>>> I agree phones are not the only platform. But Rome wasn't built in a
>> >>>>>> day. I can only get
>> >>>>>> started on a hardware which I can easily reach and have enough hardware/test
>> >>>>>> resources on it. So we may take the first step which can be applied on
>> >>>>>> a real product
>> >>>>>> and improve its performance, and step by step, we broaden it and make it
>> >>>>>> widely useful to various areas in which I can't reach :-)
>> >>>>>
>> >>>>> We must guarantee the normal swap path runs correctly and has no
>> >>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
>> >>>>> So we have to put some effort on the normal path test anyway.
>> >>>>>
>> >>>>>> so probably we can have a sysfs "enable" entry with default "n" or
>> >>>>>> have a maximum
>> >>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
>> >>>>>>
>> >>>>>> "
>> >>>>>> So in the common case, swap-in will pull in the same size of folio as was
>> >>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>> >>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>> >>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>> >>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
>> >>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>> >>>>>> sense to copy the whole folio up to a certain size.
>> >>>>>> "
>> >>>
>> >>> I thought about this a bit more. No clear conclusions, but hoped this might help
>> >>> the discussion around policy:
>> >>>
>> >>> The decision about the size of the THP is made at first fault, with some help
>> >>> from user space and in future we might make decisions to split based on
>> >>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
>> >>> the THP out at some point in its lifetime should not impact on its size. It's
>> >>> just being moved around in the system and the reason for our original decision
>> >>> should still hold.
>> >>>
>> >>> So from that PoV, it would be good to swap-in to the same size that was
>> >>> swapped-out.
>> >>
>> >> Sorry, I don't agree with this. It's better to swap-in and swap-out in
>> >> the smallest size if the page is only seldom accessed, to avoid wasting
>> >> memory.
>> >
>> > If we want to optimize only for memory consumption, I'm sure there are many
>> > things we would do differently. We need to find a balance between memory and
>> > performance. The benefits of folios are well documented and the kernel is
>> > heading in the direction of managing memory in variable-sized blocks. So I don't
>> > think it's as simple as saying we should always swap-in the smallest possible
>> > amount of memory.
>>
>> It's conditional, that is,
>>
>> "if the page is only accessed seldom"
>>
>> Then, the page swapped-in will be swapped-out soon and adjacent pages in
>> the same large folio will not be accessed during this period.
>>
>> So, I suggest creating an algorithm to decide the swap-in order based on
>> swap-readahead information automatically. It can detect the situation
>> above via a reduced swap readahead window size. And, if the page is
>> accessed for quite a long time, and the adjacent pages in the same large
>> folio are accessed too, the swap-readahead window will increase and a
>> larger swap-in order will be used.
>
> The original size of do_anonymous_page() should be honored, considering it
> embodies a decision influenced by not only sysfs settings and per-vma
> HUGEPAGE hints but also architectural characteristics, for example
> CONT-PTE.
>
> The model you're proposing may offer memory-saving benefits or reduce I/O,
> but it entirely disassociates the size of the swap-in from the size prior to the
> swap-out.

Readahead isn't the only factor that determines the folio order. For
example, we must respect the "never" policy and always allocate order-0
folios in that case. There's also no requirement to use the swap-out
order for swap-in. Memory allocation has a different performance
character from storage reading.

> Moreover, there's no guarantee that the large folio generated by
> the readahead window is contiguous in the swap and can be added to the
> swap cache, as we are currently dealing with folio->swap instead of
> subpage->swap.

Yes. We can optimize only when all conditions are satisfied, just like
any other optimization.

> Incidentally, do_anonymous_page() serves as the initial location for allocating
> large folios. Given that memory conservation is a significant consideration in
> do_swap_page(), wouldn't it be even more crucial in do_anonymous_page()?

Yes. We should consider that too. IIUC, that is why mTHP support is
off by default for now. After we find a way to solve the memory usage
issue, we may make the default "on".
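To make the readahead-based order selection I suggested above a bit more
concrete, here is a small stand-alone model of the kind of heuristic I
have in mind. It is only a sketch: the identifiers below are made up for
this illustration (this is not existing kernel code), and a real
implementation would also have to consult the per-size mTHP sysfs
settings and the limits of the faulting VMA.

/*
 * Stand-alone model of the heuristic sketched above: derive the swap-in
 * order from the current swap-readahead window instead of from the
 * swap-out order.  Not kernel code; all identifiers are invented for
 * illustration only.  Build with: cc -o swapin-order swapin-order.c
 */
#include <stdio.h>

#define PMD_ORDER	9	/* 2M THP with 4K base pages */

/* stands in for the per-size mTHP policy ("never"/"always" in sysfs) */
enum mthp_policy {
	POLICY_NEVER,
	POLICY_ALWAYS,
};

/*
 * Pick a swap-in order from the readahead window: a window that has
 * shrunk to one page means the folio is seldom accessed, so fall back
 * to order 0; a window that has grown means the neighbouring pages are
 * being used too, so a larger order is justified, capped by the policy
 * and by the largest order the VMA allows.
 */
static int swapin_order_from_readahead(unsigned int ra_window_pages,
				       enum mthp_policy policy,
				       int max_vma_order)
{
	int order = 0;

	if (policy == POLICY_NEVER)
		return 0;	/* always honour "never": order-0 only */

	/* largest order whose folio still fits inside the readahead window */
	while ((2u << order) <= ra_window_pages && order < max_vma_order)
		order++;

	return order > PMD_ORDER ? PMD_ORDER : order;
}

int main(void)
{
	unsigned int windows[] = { 1, 2, 4, 16, 64, 512 };
	unsigned int i;

	for (i = 0; i < sizeof(windows) / sizeof(windows[0]); i++)
		printf("readahead window %3u pages -> swap-in order %d\n",
		       windows[i],
		       swapin_order_from_readahead(windows[i],
						   POLICY_ALWAYS, 4));
	return 0;
}

With these toy numbers, a window that has collapsed to a single page
always gives order 0, while a window of 16 pages or more gives the
largest order the policy and the VMA allow.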
> A large folio, by its nature, represents a high-quality resource that has the
> potential to leverage hardware characteristics for the benefit of the
> entire system.

But not at the cost of memory wastage.

> Conversely, I don't believe that a randomly determined size dictated by the
> readahead window possesses the same advantageous qualities.

There's a readahead algorithm behind it, which is not purely random.

> SWP_SYNCHRONOUS_IO devices are not reliant on readahead whatsoever;
> their needs should also be respected.

I understand that there are special requirements for SWP_SYNCHRONOUS_IO
devices. I just suggest working on the general code before the specific
optimization.

>> > You also said we should swap *out* in smallest size possible. Have I
>> > misunderstood you? I thought the case for swapping-out a whole folio without
>> > splitting was well established and non-controversial?
>>
>> That is conditional too.
>>
>>
>> >>> But we only kind-of keep that information around, via the swap
>> >>> entry contiguity and alignment. With that scheme it is possible that multiple
>> >>> virtually adjacent but not physically contiguous folios get swapped-out to
>> >>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
>> >>> folio. This is not ideal, and I think it would be valuable to try to maintain
>> >>> the original folio size information with the swap slot. One way to do this would
>> >>> be to store the original order for which the cluster was allocated in the
>> >>> cluster. Then we at least know that a given swap slot is either for a folio of
>> >>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
>> >>> steal a bit from swap_map to determine which case it is? Or are there better
>> >>> approaches?
>> >>
>> >> [snip]

--
Best Regards,
Huang, Ying