Received-SPF: pass (google.com: domain of linux-kernel+bounces-174244-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) client-ip=2604:1380:4601:e00::3;
Date: Thu, 9 May 2024 16:42:34 +0900
From: Byungchul Park <byungchul@sk.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	kernel_team@skhynix.com, akpm@linux-foundation.org,
	vernhao@tencent.com, mgorman@techsingularity.net, hughd@google.com,
	willy@infradead.org, david@redhat.com, peterz@infradead.org,
	luto@kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, rjgolo@gmail.com
Subject: Re: [PATCH v9 rebase on mm-unstable 0/8] Reduce tlb and interrupt
 numbers over 90% by improving folio migration
Message-ID: <20240509074234.GA77328@system.software.com>
References: <20240418061536.11645-1-byungchul@sk.com>
 <87cyqlyjh5.fsf@yhuang6-desk2.ccr.corp.intel.com>
Precedence: bulk
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <87cyqlyjh5.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Mutt/1.9.4 (2018-02-28)

On Fri, Apr 19, 2024 at 02:06:30PM +0800, Huang, Ying wrote:
> Byungchul Park <byungchul@sk.com> writes:
> 
> > The test environment:
> >
> >    Architecture - x86_64
> >    QEMU - kvm enabled, host cpu
> 
> The test is run in VM?  Do you have test results in bare metal
> environment?

I tested it on a bare metal server.  See the results below.

> >    Numa - 2 nodes (16 CPUs 1GB, no CPUs 99GB)
> 
> The configuration looks quite abnormal.  Have you tested with other
> configuration, such 1:4 or 1:8?

I tested with DRAM : CXL expander = 42GB : 98GB.

> >    Linux Kernel - v6.9-rc4, numa balancing tiering on, demotion enabled
> >
> > < measurement: raw data - tlb and interrupt numbers >
> >
> >    $ perf stat -a \
> >            -e itlb.itlb_flush \
> >            -e tlb_flush.dtlb_thread \
> >            -e tlb_flush.stlb_any \
> >            -e dtlb-load-misses \
> >            -e dtlb-store-misses \
> >            -e itlb-load-misses \
> >            XSBench -t 16 -p 50000000
> >
> >    $ grep "TLB shootdowns" /proc/interrupts
> >
> >    BEFORE
> >    ------
> >    40417078     itlb.itlb_flush
> >    234852566    tlb_flush.dtlb_thread
> >    153192357    tlb_flush.stlb_any
> >    119001107892 dTLB-load-misses
> >    307921167    dTLB-store-misses
> >    1355272118   iTLB-load-misses
> >
> >    TLB: 1364803    1303670    1333921    1349607
> >         1356934    1354216    1332972    1342842
> >         1350265    1316443    1355928    1360793
> >         1298239    1326358    1343006    1340971
> >         TLB shootdowns
> >
> >    AFTER
> >    -----
> >    3316495      itlb.itlb_flush
> >    138912511    tlb_flush.dtlb_thread
> >    115199341    tlb_flush.stlb_any
> >    117610390021 dTLB-load-misses
> >    198042233    dTLB-store-misses
> >    840066984    iTLB-load-misses
> >
> >    TLB: 117257     119219     117178     115737
> >         117967     118948     117508     116079
> >         116962     117266     117320     117215
> >         105808     103934     115672     117610
> >         TLB shootdowns
> >
> > < measurement: user experience - runtime >
> >
> >    $ time XSBench -t 16 -p 50000000
> >
> >    BEFORE
> >    ------
> >    Threads:     16
> >    Runtime:     968.783 seconds
> >    Lookups:     1,700,000,000
> >    Lookups/s:   1,754,778
> >
> >    15208.91s user 141.44s system 1564% cpu 16:20.98 total
> >
> >    AFTER
> >    -----
> >    Threads:     16
> >    Runtime:     913.210 seconds
> >    Lookups:     1,700,000,000
> >    Lookups/s:   1,861,565
> >
> >    14351.69s user 138.23s system 1565% cpu 15:25.47 total
> 
> IIUC, the memory footprint will be larger with the patchset.  Do you
> have data?

As I already mentioned, since version 9 the memory footprint is exactly
the same on the patched kernel and the vanilla kernel, because the
patchset lets folios go as they are and only controls the timing of the
TLB flush.

There are two things to note.

1. I changed the patchset and will post the next version shortly:

	BEFORE - Defer the required TLB flush until the folios of
	         interest leave either pcp or buddy.  The folios of
	         interest are the source folios unmapped during folio
	         migration.

	AFTER  - Defer the required TLB flush until the folios of
	         interest leave either pcp or buddy.  The folios of
	         interest are the source folios unmapped during folio
	         migration, *plus the folios unmapped during reclaim
	         in shrink_folio_list()*.

2. I changed the test workload because XSBench does not come under
   memory pressure on such a big server.  Instead, I picked a very
   real workload, the LLM inference engine llama.cpp.

I tested with both changes applied.  The results are:

---

   Kernel version: mm-unstable around v6.9-rc4
   Machine: bare metal, x86_64, Intel(R) Xeon(R) Gold 6430
   CPU: 1 socket, 64 cores, hyper-threading on
   Numa: 2 nodes (64 CPUs DRAM 42GB, no CPUs CXL(expander) 98GB)
   Config: swap off, numa balancing tiering on, demotion enabled

   One set of the test workload:

      echo 3 > /proc/sys/vm/drop_caches
      llama.cpp/main -m $(70G_model1) -p "who are you?" -s 1 -t 15 -n 20 &
      llama.cpp/main -m $(70G_model2) -p "who are you?" -s 1 -t 15 -n 20 &
      llama.cpp/main -m $(70G_model3) -p "who are you?" -s 1 -t 15 -n 20 &
      wait
   
   where -t: number of threads, -s: seed, used to make the runtime
   stable, -n: number of tokens, which determines the runtime,
   -p: prompt to ask, -m: LLM model to use.

   I ran this set 10 times in succession, which gave 30 runtimes in
   total since each inference prints its runtime at the end of its run.
   The results are:

   BEFORE
   ------
   llama_print_timings:       total time = 1002461.95 ms /    24 tokens
   llama_print_timings:       total time = 1044978.38 ms /    24 tokens
   llama_print_timings:       total time = 1000653.09 ms /    24 tokens
   llama_print_timings:       total time = 1047104.80 ms /    24 tokens
   llama_print_timings:       total time = 1069430.36 ms /    24 tokens
   llama_print_timings:       total time = 1068201.16 ms /    24 tokens
   llama_print_timings:       total time = 1078092.59 ms /    24 tokens
   llama_print_timings:       total time = 1073200.45 ms /    24 tokens
   llama_print_timings:       total time = 1067136.00 ms /    24 tokens
   llama_print_timings:       total time = 1076442.56 ms /    24 tokens
   llama_print_timings:       total time = 1004142.64 ms /    24 tokens
   llama_print_timings:       total time = 1042942.65 ms /    24 tokens
   llama_print_timings:       total time =  999933.76 ms /    24 tokens
   llama_print_timings:       total time = 1046548.83 ms /    24 tokens
   llama_print_timings:       total time = 1068671.48 ms /    24 tokens
   llama_print_timings:       total time = 1068285.76 ms /    24 tokens
   llama_print_timings:       total time = 1077789.63 ms /    24 tokens
   llama_print_timings:       total time = 1071558.93 ms /    24 tokens
   llama_print_timings:       total time = 1066181.55 ms /    24 tokens
   llama_print_timings:       total time = 1076767.53 ms /    24 tokens
   llama_print_timings:       total time = 1004065.63 ms /    24 tokens
   llama_print_timings:       total time = 1044522.13 ms /    24 tokens
   llama_print_timings:       total time =  999725.33 ms /    24 tokens
   llama_print_timings:       total time = 1047510.77 ms /    24 tokens
   llama_print_timings:       total time = 1068010.27 ms /    24 tokens
   llama_print_timings:       total time = 1068999.31 ms /    24 tokens
   llama_print_timings:       total time = 1077648.05 ms /    24 tokens
   llama_print_timings:       total time = 1071378.96 ms /    24 tokens
   llama_print_timings:       total time = 1066326.32 ms /    24 tokens
   llama_print_timings:       total time = 1077088.92 ms /    24 tokens

   AFTER
   -----
   llama_print_timings:       total time =  988522.03 ms /    24 tokens
   llama_print_timings:       total time =  997204.52 ms /    24 tokens
   llama_print_timings:       total time =  996605.86 ms /    24 tokens
   llama_print_timings:       total time =  991985.50 ms /    24 tokens
   llama_print_timings:       total time = 1035143.31 ms /    24 tokens
   llama_print_timings:       total time =  993660.18 ms /    24 tokens
   llama_print_timings:       total time =  983082.14 ms /    24 tokens
   llama_print_timings:       total time =  990431.36 ms /    24 tokens
   llama_print_timings:       total time =  992707.09 ms /    24 tokens
   llama_print_timings:       total time =  992673.27 ms /    24 tokens
   llama_print_timings:       total time =  989285.43 ms /    24 tokens
   llama_print_timings:       total time =  996710.06 ms /    24 tokens
   llama_print_timings:       total time =  996534.64 ms /    24 tokens
   llama_print_timings:       total time =  991344.17 ms /    24 tokens
   llama_print_timings:       total time = 1035210.84 ms /    24 tokens
   llama_print_timings:       total time =  994714.13 ms /    24 tokens
   llama_print_timings:       total time =  984184.15 ms /    24 tokens
   llama_print_timings:       total time =  990909.45 ms /    24 tokens
   llama_print_timings:       total time =  991881.48 ms /    24 tokens
   llama_print_timings:       total time =  993918.03 ms /    24 tokens
   llama_print_timings:       total time =  990061.34 ms /    24 tokens
   llama_print_timings:       total time =  998076.69 ms /    24 tokens
   llama_print_timings:       total time =  997082.59 ms /    24 tokens
   llama_print_timings:       total time =  990677.58 ms /    24 tokens
   llama_print_timings:       total time = 1036054.94 ms /    24 tokens
   llama_print_timings:       total time =  994125.93 ms /    24 tokens
   llama_print_timings:       total time =  982467.01 ms /    24 tokens
   llama_print_timings:       total time =  990191.60 ms /    24 tokens
   llama_print_timings:       total time =  993319.24 ms /    24 tokens
   llama_print_timings:       total time =  992540.57 ms /    24 tokens
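
To summarize the two listings above in one number, the per-run totals can be averaged and compared.  The following is only a sketch: the helper names (total_times_ms, relative_change_percent) are made up for illustration and are not part of llama.cpp or the patchset, and the embedded samples are just the first three lines of each listing.  Paste the full BEFORE and AFTER output to get the figure for all 30 runs.

```python
import re
from statistics import mean

# Matches lines such as:
#   llama_print_timings:       total time = 1002461.95 ms /    24 tokens
TIMING_RE = re.compile(r"total time\s*=\s*([0-9.]+)\s*ms")


def total_times_ms(log_text):
    """Return every 'total time' value (in ms) found in llama.cpp output."""
    return [float(m.group(1)) for m in TIMING_RE.finditer(log_text)]


def relative_change_percent(before_ms, after_ms):
    """Percentage change of the mean runtime; a negative value means faster."""
    before_mean = mean(before_ms)
    after_mean = mean(after_ms)
    return (after_mean - before_mean) / before_mean * 100.0


if __name__ == "__main__":
    # Subset only: the first three samples of each listing above.
    before_log = """
    llama_print_timings:       total time = 1002461.95 ms /    24 tokens
    llama_print_timings:       total time = 1044978.38 ms /    24 tokens
    llama_print_timings:       total time = 1000653.09 ms /    24 tokens
    """
    after_log = """
    llama_print_timings:       total time =  988522.03 ms /    24 tokens
    llama_print_timings:       total time =  997204.52 ms /    24 tokens
    llama_print_timings:       total time =  996605.86 ms /    24 tokens
    """
    before = total_times_ms(before_log)
    after = total_times_ms(after_log)
    print(f"mean before: {mean(before):.2f} ms")
    print(f"mean after:  {mean(after):.2f} ms")
    print(f"change:      {relative_change_percent(before, after):+.2f} %")
```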
   
   The difference in TLB shootdowns (/proc/interrupts) is:

   BEFORE
   ------
   TLB:
   125553646  141418810  161932620  176853972  186655697  190399283
   192143823  196414038  192872439  193313658  193395617  192521416
   190788161  195067598  198016061  193607347  194293972  190786732
   191545637  194856822  191801931  189634535  190399803  196365922
   195268398  190115840  188050050  193194908  195317617  190820190
   190164820  185556071  226797214  229592631  216112464  209909495
   205575979  205950252  204948111  197999795  198892232  205287952
   199344631  195015158  195869844  198858745  195692876  200961904
   203463252  205921722  199850838  206145986  199613202  199961345
   200129577  203020521  207873649  203697671  197093386  204243803
   205993323  200934664  204193128  194435376  TLB shootdowns                                                  

   AFTER
   -----
   TLB:
   5648092    6610142    7032849    7882308    8088518    8352310
   8656536    8705136    8647426    8905583    8985408    8704522
   8884344    9026261    8929974    8869066    8877575    8810096
   8770984    8754503    8801694    8865925    8787524    8656432
   8755912    8682034    8773935    8832925    8797997    8515777
   8481240    8891258   10595243   10285973    9756935    9573681
   9398968    9069244    9242984    8899009    9310690    9029095
   9069758    9105825    9092703    9270202    9460287    9258546
   9180415    9232723    9270611    9175020    9490420    9360316
   9420818    9057663    9525631    9310152    9152242    8654483
   9181804    9050847    8919916    8883856    TLB shootdowns                                                  
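
The per-CPU counts above are easier to compare once they are summed.  A minimal sketch, assuming the "TLB:" row has been copied from /proc/interrupts as text; the function names are hypothetical, and the embedded rows are only the first six CPUs of each listing, so the printed figure describes that subset rather than the whole machine.

```python
def tlb_shootdown_counts(row_text):
    """Extract the per-CPU counts from a 'TLB: ... TLB shootdowns' row.

    The row may be wrapped over several lines; every purely numeric
    token is taken as one CPU's counter.
    """
    return [int(tok) for tok in row_text.split() if tok.isdigit()]


def shootdown_reduction_percent(before_counts, after_counts):
    """Percentage by which the summed shootdown count dropped."""
    before_total = sum(before_counts)
    after_total = sum(after_counts)
    return (before_total - after_total) / before_total * 100.0


if __name__ == "__main__":
    # Subset only: the first six CPUs of each listing above.
    before_row = """TLB:
    125553646  141418810  161932620  176853972  186655697  190399283
    TLB shootdowns"""
    after_row = """TLB:
    5648092    6610142    7032849    7882308    8088518    8352310
    TLB shootdowns"""
    before = tlb_shootdown_counts(before_row)
    after = tlb_shootdown_counts(after_row)
    print(f"before total: {sum(before)}")
    print(f"after total:  {sum(after)}")
    print(f"reduction:    {shootdown_reduction_percent(before, after):.1f} %")
```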

   The difference in the 'perf stat' TLB numbers during one set of the
   test, with the drop_caches step excluded, is:

   BEFORE
   ------
   3163679332	dTLB-load-misses     
   2017751856	dTLB-store-misses    
   327092903	iTLB-load-misses     
   1357543886	tlb:tlb_flush        

   AFTER
   -----
   2394694609	dTLB-load-misses     
   861144167	dTLB-store-misses    
   64055579	iTLB-load-misses     
   69175002	tlb:tlb_flush        
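
The same comparison can be made per event for the 'perf stat' counters above.  This is a sketch under the assumption that each line has the form "<count> <event>", as in the listing; parse_counters and reduction_by_event are illustrative names, not perf features.

```python
def parse_counters(text):
    """Parse '<count> <event>' lines into a {event: count} dict."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that does not start with a count.
        if len(fields) >= 2 and fields[0].isdigit():
            counters[fields[1]] = int(fields[0])
    return counters


def reduction_by_event(before, after):
    """Percentage reduction for every event present in both runs."""
    return {
        event: (before[event] - after[event]) / before[event] * 100.0
        for event in before
        if event in after and before[event] > 0
    }


if __name__ == "__main__":
    # Counter values copied from the BEFORE and AFTER listings above.
    before = parse_counters("""
    3163679332 dTLB-load-misses
    2017751856 dTLB-store-misses
    327092903  iTLB-load-misses
    1357543886 tlb:tlb_flush
    """)
    after = parse_counters("""
    2394694609 dTLB-load-misses
    861144167  dTLB-store-misses
    64055579   iTLB-load-misses
    69175002   tlb:tlb_flush
    """)
    for event, pct in reduction_by_event(before, after).items():
        print(f"{event:20s} {pct:6.1f} % fewer")
```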

---

I'm happy to share these results, which I got with a real workload that
is very popular these days.  I will post the next version of the
patchset shortly, after organizing and refining things.

	Byungchul