From:   Huang Ying <ying.huang@intel.com>
To:     linux-mm@kvack.org
Cc:     linux-kernel@vger.kernel.org,
        Andrew Morton <akpm@linux-foundation.org>,
        "Huang, Ying" <ying.huang@intel.com>, Zi Yan <ziy@nvidia.com>,
        Yang Shi <shy828301@gmail.com>,
        Baolin Wang <baolin.wang@linux.alibaba.com>,
        Oscar Salvador <osalvador@suse.de>,
        Matthew Wilcox <willy@infradead.org>
Subject: [RFC 0/6] migrate_pages(): batch TLB flushing
Date:   Wed, 21 Sep 2022 14:06:10 +0800
Message-Id: <20220921060616.73086-1-ying.huang@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

From: "Huang, Ying" <ying.huang@intel.com>

Currently, migrate_pages() migrates pages one by one, as in the
following pseudocode:

  for each page
    unmap
    flush TLB
    copy
    restore map

If multiple pages are passed to migrate_pages(), there are
opportunities to batch the TLB flushing and copying.  That is, we can
change the code to something like the following:

  for each page
    unmap
  for each page
    flush TLB
  for each page
    copy
  for each page
    restore map

The total number of TLB flushing IPIs can be reduced considerably.
We may also use a hardware accelerator such as DSA to accelerate the
page copying.
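
To illustrate the two loop structures above, here is a toy userspace
model in C.  It is not kernel code and not taken from the patches; it
only counts operations.  The names (toy_stats, toy_migrate_*) are
hypothetical, and the model assumes that the pending flushes of one
batch can be coalesced into a single flush request.

```c
#include <assert.h>
#include <stddef.h>

/* Toy operation counters; hypothetical, for illustration only. */
struct toy_stats {
	size_t unmaps, flushes, copies, remaps;
};

/* Current scheme: each page is unmapped, flushed, copied and remapped
 * before the next page is touched. */
static struct toy_stats toy_migrate_one_by_one(size_t nr_pages)
{
	struct toy_stats s = {0, 0, 0, 0};

	for (size_t i = 0; i < nr_pages; i++) {
		s.unmaps++;
		s.flushes++;	/* one flush request per page */
		s.copies++;
		s.remaps++;
	}
	return s;
}

/* Batched scheme: each phase runs over all pages before the next
 * phase starts. */
static struct toy_stats toy_migrate_batched(size_t nr_pages)
{
	struct toy_stats s = {0, 0, 0, 0};

	for (size_t i = 0; i < nr_pages; i++)
		s.unmaps++;
	/* ASSUMPTION of this model: one coalesced flush per batch. */
	if (nr_pages > 0)
		s.flushes = 1;
	for (size_t i = 0; i < nr_pages; i++)
		s.copies++;
	for (size_t i = 0; i < nr_pages; i++)
		s.remaps++;

	assert(s.unmaps == s.remaps);
	return s;
}
```

In this model, migrating 512 pages issues 512 flush requests in the
one-by-one scheme and 1 in the batched scheme, while the number of
unmap, copy and remap operations is unchanged.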

So in this patchset, we refactor the migrate_pages() implementation
and implement the TLB flushing batching.  Based on this, hardware
accelerated page copying can be implemented.

If too many pages are passed to migrate_pages(), the naive batched
implementation may unmap too many pages at the same time.  This
increases the possibility that a task has to wait for the migrated
pages to be mapped again, so latency may be hurt.  To deal with this
issue, the maximum number of pages unmapped in one batch is restricted
to no more than HPAGE_PMD_NR.  That is, the impact is at the same
level as THP migration.
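
The batch limit can be sketched as a simple chunking loop.  Again this
is a toy userspace model, not the code in the patches.  The function
name and TOY_HPAGE_PMD_NR are hypothetical; the value 512 is an
assumption that corresponds to x86-64 with 4KiB base pages and 2MiB
PMD-sized huge pages.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: stand-in for HPAGE_PMD_NR on x86-64 with 4KiB pages. */
#define TOY_HPAGE_PMD_NR 512

/*
 * Process nr_pages in batches of at most max_batch pages.  Returns the
 * number of batches and stores the largest number of pages that were
 * unmapped at the same time in *peak_unmapped (if non-NULL).
 */
static size_t toy_migrate_in_batches(size_t nr_pages, size_t max_batch,
				     size_t *peak_unmapped)
{
	size_t batches = 0, peak = 0, done = 0;

	assert(max_batch > 0);

	while (done < nr_pages) {
		size_t n = nr_pages - done;

		if (n > max_batch)
			n = max_batch;
		/* unmap n pages, flush, copy n pages, remap n pages */
		if (n > peak)
			peak = n;
		done += n;
		batches++;
	}
	if (peak_unmapped)
		*peak_unmapped = peak;
	return batches;
}
```

For example, 1300 pages with a limit of TOY_HPAGE_PMD_NR are handled
as three batches (512, 512 and 276 pages), so at most 512 pages are
unmapped at any time.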

We use the following test to measure the performance impact of the
patchset.

On a 2-socket Intel server:

 - Run pmbench memory accessing benchmark

 - Run `migratepages` to migrate pages of pmbench between node 0 and
   node 1 back and forth.

With the patchset, the number of TLB flushing IPIs is reduced by 99.1%
during the test, and the number of pages migrated successfully per
second increases by 291.7%.
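
Read as multipliers, these percentages mean roughly 3.9 times as many
pages migrated per second and fewer than one in a hundred of the
original TLB flushing IPIs.  The conversion is plain arithmetic; the
helper names below are hypothetical.

```c
#include <assert.h>

/* Factor by which a quantity grows after an increase of pct percent. */
static double toy_factor_after_increase(double pct)
{
	return 1.0 + pct / 100.0;
}

/* Fraction of a quantity that remains after a reduction of pct percent. */
static double toy_fraction_after_reduction(double pct)
{
	assert(pct >= 0.0 && pct <= 100.0);
	return 1.0 - pct / 100.0;
}
```

With the numbers above, toy_factor_after_increase(291.7) is about
3.917 and toy_fraction_after_reduction(99.1) is about 0.009.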

This patchset is based on v6.0-rc5 and the following patchset,

[PATCH -V3 0/8] migrate_pages(): fix several bugs in error path
https://lore.kernel.org/lkml/20220817081408.513338-1-ying.huang@intel.com/

The migrate_pages() related code is being converted to folios now, so
this patchset cannot be applied to the recent akpm/mm-unstable branch.
This patchset is meant to check the basic idea.  If it is OK, I will
rebase the patchset on top of the folio changes.

Best Regards,
Huang, Ying