2023-12-08 06:14:27

by Yu Zhao

Subject: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Unmapped folios accessed through file descriptors can be
underprotected. Those folios are added to the oldest generation based
on:
1. The fact that they are less costly to reclaim (no need to walk the
rmap and flush the TLB) and have less impact on performance (don't
cause major PFs and can be non-blocking if needed again).
2. The observation that they are likely to be single-use. E.g., for
client use cases like Android, its apps parse configuration files
and store the data in heap (anon); for server use cases like MySQL,
it reads from InnoDB files and holds the cached data for tables in
buffer pools (anon).

However, the oldest generation can be very short lived, and if so, it
doesn't provide the PID controller with enough time to respond to a
surge of refaults. (Note that the PID controller uses weighted
refaults and those from evicted generations only take a half of the
whole weight.) In other words, for a short lived generation, the
moving average smooths out the spike quickly.
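As a rough illustration of that weighting (a toy userspace model with made-up
numbers, not the kernel's actual PID controller code):

  #include <stdio.h>

  /*
   * Toy model only: refaults charged to generations still being tracked
   * count with full weight, while refaults folded in from evicted
   * generations count with half weight. With a short-lived oldest
   * generation, most of a refault spike ends up in the half-weight
   * bucket, so the controller sees a much weaker signal.
   */
  static unsigned long weighted_refaults(unsigned long tracked, unsigned long evicted)
  {
          return tracked + evicted / 2;
  }

  int main(void)
  {
          /* the same spike of 1000 refaults, attributed differently */
          printf("long-lived oldest gen:  %lu\n", weighted_refaults(1000, 0));
          printf("short-lived oldest gen: %lu\n", weighted_refaults(100, 900));
          return 0;
  }

In the second case the controller sees barely half of the signal, which is
the "smoothed out" effect described above.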

To fix the problem:
1. For folios that are already on LRU, if they can be beyond the
tracking range of tiers, i.e., five accesses through file
descriptors, move them to the second oldest generation to give them
more time to age. (Note that tiers are used by the PID controller
to statistically determine whether folios accessed multiple times
through file descriptors are worth protecting.)
2. When adding unmapped folios to LRU, adjust the placement of them so
that they are not too close to the tail. The effect of this is
similar to the above.

On Android, launching 55 apps sequentially:
                          Before       After        Change
workingset_refault_anon   25641024     25598972     0%
workingset_refault_file   115016834    106178438    -8%

Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Yu Zhao <[email protected]>
Reported-by: Charan Teja Kalla <[email protected]>
Tested-by: Kalesh Singh <[email protected]>
Cc: [email protected]
---
include/linux/mm_inline.h | 23 ++++++++++++++---------
mm/vmscan.c | 2 +-
mm/workingset.c | 6 +++---
3 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 9ae7def16cb2..f4fe593c1400 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -232,22 +232,27 @@ static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio,
if (folio_test_unevictable(folio) || !lrugen->enabled)
return false;
/*
- * There are three common cases for this page:
- * 1. If it's hot, e.g., freshly faulted in or previously hot and
- * migrated, add it to the youngest generation.
- * 2. If it's cold but can't be evicted immediately, i.e., an anon page
- * not in swapcache or a dirty page pending writeback, add it to the
- * second oldest generation.
- * 3. Everything else (clean, cold) is added to the oldest generation.
+ * There are four common cases for this page:
+ * 1. If it's hot, i.e., freshly faulted in, add it to the youngest
+ * generation, and it's protected over the rest below.
+ * 2. If it can't be evicted immediately, i.e., a dirty page pending
+ * writeback, add it to the second youngest generation.
+ * 3. If it should be evicted first, e.g., cold and clean from
+ * folio_rotate_reclaimable(), add it to the oldest generation.
+ * 4. Everything else falls between 2 & 3 above and is added to the
+ * second oldest generation if it's considered inactive, or the
+ * oldest generation otherwise. See lru_gen_is_active().
*/
if (folio_test_active(folio))
seq = lrugen->max_seq;
else if ((type == LRU_GEN_ANON && !folio_test_swapcache(folio)) ||
(folio_test_reclaim(folio) &&
(folio_test_dirty(folio) || folio_test_writeback(folio))))
- seq = lrugen->min_seq[type] + 1;
- else
+ seq = lrugen->max_seq - 1;
+ else if (reclaiming || lrugen->min_seq[type] + MIN_NR_GENS >= lrugen->max_seq)
seq = lrugen->min_seq[type];
+ else
+ seq = lrugen->min_seq[type] + 1;

gen = lru_gen_from_seq(seq);
flags = (gen + 1UL) << LRU_GEN_PGOFF;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4e3b835c6b4a..e67631c60ac0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4260,7 +4260,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
}

/* protected */
- if (tier > tier_idx) {
+ if (tier > tier_idx || refs == BIT(LRU_REFS_WIDTH)) {
int hist = lru_hist_from_seq(lrugen->min_seq[type]);

gen = folio_inc_gen(lruvec, folio, false);
diff --git a/mm/workingset.c b/mm/workingset.c
index 7d3dacab8451..2a2a34234df9 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -313,10 +313,10 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
* 1. For pages accessed through page tables, hotter pages pushed out
* hot pages which refaulted immediately.
* 2. For pages accessed multiple times through file descriptors,
- * numbers of accesses might have been out of the range.
+ * they would have been protected by sort_folio().
*/
- if (lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH)) {
- folio_set_workingset(folio);
+ if (lru_gen_in_fault() || refs >= BIT(LRU_REFS_WIDTH) - 1) {
+ set_mask_bits(&folio->flags, 0, LRU_REFS_MASK | BIT(PG_workingset));
mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
}
unlock:
--
2.43.0.472.g3155946c3a-goog


2023-12-08 06:14:36

by Yu Zhao

Subject: [PATCH mm-unstable v1 2/4] mm/mglru: try to stop at high watermarks

The initial MGLRU patchset didn't include the memcg LRU support, and
it relied on should_abort_scan(), added by commit f76c83378851 ("mm:
multi-gen LRU: optimize multiple memcgs"), to "backoff to avoid
overshooting their aggregate reclaim target by too much".

Later on when the memcg LRU was added, should_abort_scan() was deemed
unnecessary, and the test results [1] showed no side effects after it
was removed by commit a579086c99ed ("mm: multi-gen LRU: remove
eviction fairness safeguard").

However, that test used memory.reclaim, which sets nr_to_reclaim to
SWAP_CLUSTER_MAX. So it can overshoot only by SWAP_CLUSTER_MAX-1
pages, i.e., from nr_reclaimed=nr_to_reclaim-1 to
nr_reclaimed=nr_to_reclaim+SWAP_CLUSTER_MAX-1. Compared with the batch
size kswapd sets to nr_to_reclaim, SWAP_CLUSTER_MAX is tiny. Therefore
that test isn't able to reproduce the worst case scenario, i.e.,
kswapd overshooting GBs on large systems and "consuming 100% CPU" (see
the Closes tag).
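To put that bound in numbers, a back-of-the-envelope sketch (assuming 4 KiB
pages; SWAP_CLUSTER_MAX is 32 in the kernel), which prints ~124 KiB, versus
kswapd batches that can reach GBs on large systems:

  #include <stdio.h>

  #define PAGE_SIZE_BYTES   4096UL  /* assuming 4 KiB pages */
  #define SWAP_CLUSTER_MAX  32UL    /* same value as the kernel constant */

  int main(void)
  {
          unsigned long overshoot = SWAP_CLUSTER_MAX - 1;

          /* worst-case overshoot for memory.reclaim, per the reasoning above */
          printf("memory.reclaim overshoot bound: %lu pages (~%lu KiB)\n",
                 overshoot, overshoot * PAGE_SIZE_BYTES / 1024);
          return 0;
  }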

Bring back a simplified version of should_abort_scan() on top of the
memcg LRU, so that kswapd stops when all eligible zones are above
their respective high watermarks plus a small delta to lower the
chance of KSWAPD_HIGH_WMARK_HIT_QUICKLY. Note that this only applies
to order-0 reclaim, meaning compaction-induced reclaim can still run
wild (which is a different problem).

On Android, launching 55 apps sequentially:
          Before       After        Change
pgpgin    838377172    802955040    -4%
pgpgout   38037080     34336300     -10%

[1] https://lore.kernel.org/[email protected]/

Fixes: a579086c99ed ("mm: multi-gen LRU: remove eviction fairness safeguard")
Signed-off-by: Yu Zhao <[email protected]>
Reported-by: Charan Teja Kalla <[email protected]>
Reported-by: Jaroslav Pulchart <[email protected]>
Closes: https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/
Tested-by: Jaroslav Pulchart <[email protected]>
Tested-by: Kalesh Singh <[email protected]>
Cc: [email protected]
---
mm/vmscan.c | 36 ++++++++++++++++++++++++++++--------
1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e67631c60ac0..10e964cd0efe 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4676,20 +4676,41 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
return try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false) ? -1 : 0;
}

-static unsigned long get_nr_to_reclaim(struct scan_control *sc)
+static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
{
+ int i;
+ enum zone_watermarks mark;
+
/* don't abort memcg reclaim to ensure fairness */
if (!root_reclaim(sc))
- return -1;
+ return false;

- return max(sc->nr_to_reclaim, compact_gap(sc->order));
+ if (sc->nr_reclaimed >= max(sc->nr_to_reclaim, compact_gap(sc->order)))
+ return true;
+
+ /* check the order to exclude compaction-induced reclaim */
+ if (!current_is_kswapd() || sc->order)
+ return false;
+
+ mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
+ WMARK_PROMO : WMARK_HIGH;
+
+ for (i = 0; i <= sc->reclaim_idx; i++) {
+ struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
+ unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH;
+
+ if (managed_zone(zone) && !zone_watermark_ok(zone, 0, size, sc->reclaim_idx, 0))
+ return false;
+ }
+
+ /* kswapd should abort if all eligible zones are safe */
+ return true;
}

static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
long nr_to_scan;
unsigned long scanned = 0;
- unsigned long nr_to_reclaim = get_nr_to_reclaim(sc);
int swappiness = get_swappiness(lruvec, sc);

/* clean file folios are more likely to exist */
@@ -4711,7 +4732,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
if (scanned >= nr_to_scan)
break;

- if (sc->nr_reclaimed >= nr_to_reclaim)
+ if (should_abort_scan(lruvec, sc))
break;

cond_resched();
@@ -4772,7 +4793,6 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
struct lru_gen_folio *lrugen;
struct mem_cgroup *memcg;
const struct hlist_nulls_node *pos;
- unsigned long nr_to_reclaim = get_nr_to_reclaim(sc);

bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
restart:
@@ -4805,7 +4825,7 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)

rcu_read_lock();

- if (sc->nr_reclaimed >= nr_to_reclaim)
+ if (should_abort_scan(lruvec, sc))
break;
}

@@ -4816,7 +4836,7 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)

mem_cgroup_put(memcg);

- if (sc->nr_reclaimed >= nr_to_reclaim)
+ if (!is_a_nulls(pos))
return;

/* restart if raced with lru_gen_rotate_memcg() */
--
2.43.0.472.g3155946c3a-goog

2023-12-08 06:14:41

by Yu Zhao

Subject: [PATCH mm-unstable v1 3/4] mm/mglru: respect min_ttl_ms with memcgs

While investigating kswapd "consuming 100% CPU" [1] (also see
"mm/mglru: try to stop at high watermarks"), it was discovered that
the memcg LRU can breach the thrashing protection imposed by
min_ttl_ms.

Before the memcg LRU:
  kswapd()
    shrink_node_memcgs()
      mem_cgroup_iter()
        inc_max_seq()  // always hit a different memcg
    lru_gen_age_node()
      mem_cgroup_iter()
        check the timestamp of the oldest generation

After the memcg LRU:
  kswapd()
    shrink_many()
      restart:
        iterate the memcg LRU:
          inc_max_seq()  // occasionally hit the same memcg
          if raced with lru_gen_rotate_memcg():
            goto restart
    lru_gen_age_node()
      mem_cgroup_iter()
        check the timestamp of the oldest generation

Specifically, when the restart happens in shrink_many(), it needs to
stick with the (memcg LRU) generation it began with. In other words,
it should neither re-read memcg_lru->seq nor age an lruvec of a
different generation. Otherwise it can hit the same memcg multiple
times without giving lru_gen_age_node() a chance to check the
timestamp of that memcg's oldest generation (against min_ttl_ms).

[1] https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/

Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: Yu Zhao <[email protected]>
Tested-by: T.J. Mercier <[email protected]>
Cc: [email protected]
---
include/linux/mmzone.h | 30 +++++++++++++++++-------------
mm/vmscan.c | 30 ++++++++++++++++--------------
2 files changed, 33 insertions(+), 27 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b23bc5390240..e3093ef9530f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -510,33 +510,37 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
* the old generation, is incremented when all its bins become empty.
*
* There are four operations:
- * 1. MEMCG_LRU_HEAD, which moves an memcg to the head of a random bin in its
+ * 1. MEMCG_LRU_HEAD, which moves a memcg to the head of a random bin in its
* current generation (old or young) and updates its "seg" to "head";
- * 2. MEMCG_LRU_TAIL, which moves an memcg to the tail of a random bin in its
+ * 2. MEMCG_LRU_TAIL, which moves a memcg to the tail of a random bin in its
* current generation (old or young) and updates its "seg" to "tail";
- * 3. MEMCG_LRU_OLD, which moves an memcg to the head of a random bin in the old
+ * 3. MEMCG_LRU_OLD, which moves a memcg to the head of a random bin in the old
* generation, updates its "gen" to "old" and resets its "seg" to "default";
- * 4. MEMCG_LRU_YOUNG, which moves an memcg to the tail of a random bin in the
+ * 4. MEMCG_LRU_YOUNG, which moves a memcg to the tail of a random bin in the
* young generation, updates its "gen" to "young" and resets its "seg" to
* "default".
*
* The events that trigger the above operations are:
* 1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD;
- * 2. The first attempt to reclaim an memcg below low, which triggers
+ * 2. The first attempt to reclaim a memcg below low, which triggers
* MEMCG_LRU_TAIL;
- * 3. The first attempt to reclaim an memcg below reclaimable size threshold,
+ * 3. The first attempt to reclaim a memcg below reclaimable size threshold,
* which triggers MEMCG_LRU_TAIL;
- * 4. The second attempt to reclaim an memcg below reclaimable size threshold,
+ * 4. The second attempt to reclaim a memcg below reclaimable size threshold,
* which triggers MEMCG_LRU_YOUNG;
- * 5. Attempting to reclaim an memcg below min, which triggers MEMCG_LRU_YOUNG;
+ * 5. Attempting to reclaim a memcg below min, which triggers MEMCG_LRU_YOUNG;
* 6. Finishing the aging on the eviction path, which triggers MEMCG_LRU_YOUNG;
- * 7. Offlining an memcg, which triggers MEMCG_LRU_OLD.
+ * 7. Offlining a memcg, which triggers MEMCG_LRU_OLD.
*
- * Note that memcg LRU only applies to global reclaim, and the round-robin
- * incrementing of their max_seq counters ensures the eventual fairness to all
- * eligible memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
+ * Notes:
+ * 1. Memcg LRU only applies to global reclaim, and the round-robin incrementing
+ * of their max_seq counters ensures the eventual fairness to all eligible
+ * memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
+ * 2. There are only two valid generations: old (seq) and young (seq+1).
+ * MEMCG_NR_GENS is set to three so that when reading the generation counter
+ * locklessly, a stale value (seq-1) does not wraparound to young.
*/
-#define MEMCG_NR_GENS 2
+#define MEMCG_NR_GENS 3
#define MEMCG_NR_BINS 8

struct lru_gen_memcg {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 10e964cd0efe..cac38e9cac86 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4117,6 +4117,9 @@ static void lru_gen_rotate_memcg(struct lruvec *lruvec, int op)
else
VM_WARN_ON_ONCE(true);

+ WRITE_ONCE(lruvec->lrugen.seg, seg);
+ WRITE_ONCE(lruvec->lrugen.gen, new);
+
hlist_nulls_del_rcu(&lruvec->lrugen.list);

if (op == MEMCG_LRU_HEAD || op == MEMCG_LRU_OLD)
@@ -4127,9 +4130,6 @@ static void lru_gen_rotate_memcg(struct lruvec *lruvec, int op)
pgdat->memcg_lru.nr_memcgs[old]--;
pgdat->memcg_lru.nr_memcgs[new]++;

- lruvec->lrugen.gen = new;
- WRITE_ONCE(lruvec->lrugen.seg, seg);
-
if (!pgdat->memcg_lru.nr_memcgs[old] && old == get_memcg_gen(pgdat->memcg_lru.seq))
WRITE_ONCE(pgdat->memcg_lru.seq, pgdat->memcg_lru.seq + 1);

@@ -4152,11 +4152,11 @@ void lru_gen_online_memcg(struct mem_cgroup *memcg)

gen = get_memcg_gen(pgdat->memcg_lru.seq);

+ lruvec->lrugen.gen = gen;
+
hlist_nulls_add_tail_rcu(&lruvec->lrugen.list, &pgdat->memcg_lru.fifo[gen][bin]);
pgdat->memcg_lru.nr_memcgs[gen]++;

- lruvec->lrugen.gen = gen;
-
spin_unlock_irq(&pgdat->memcg_lru.lock);
}
}
@@ -4663,7 +4663,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
DEFINE_MAX_SEQ(lruvec);

if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
- return 0;
+ return -1;

if (!should_run_aging(lruvec, max_seq, sc, can_swap, &nr_to_scan))
return nr_to_scan;
@@ -4738,7 +4738,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
cond_resched();
}

- /* whether try_to_inc_max_seq() was successful */
+ /* whether this lruvec should be rotated */
return nr_to_scan < 0;
}

@@ -4792,13 +4792,13 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
struct lruvec *lruvec;
struct lru_gen_folio *lrugen;
struct mem_cgroup *memcg;
- const struct hlist_nulls_node *pos;
+ struct hlist_nulls_node *pos;

+ gen = get_memcg_gen(READ_ONCE(pgdat->memcg_lru.seq));
bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
restart:
op = 0;
memcg = NULL;
- gen = get_memcg_gen(READ_ONCE(pgdat->memcg_lru.seq));

rcu_read_lock();

@@ -4809,6 +4809,10 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
}

mem_cgroup_put(memcg);
+ memcg = NULL;
+
+ if (gen != READ_ONCE(lrugen->gen))
+ continue;

lruvec = container_of(lrugen, struct lruvec, lrugen);
memcg = lruvec_memcg(lruvec);
@@ -4893,16 +4897,14 @@ static void set_initial_priority(struct pglist_data *pgdat, struct scan_control
if (sc->priority != DEF_PRIORITY || sc->nr_to_reclaim < MIN_LRU_BATCH)
return;
/*
- * Determine the initial priority based on ((total / MEMCG_NR_GENS) >>
- * priority) * reclaimed_to_scanned_ratio = nr_to_reclaim, where the
- * estimated reclaimed_to_scanned_ratio = inactive / total.
+ * Determine the initial priority based on
+ * (total >> priority) * reclaimed_to_scanned_ratio = nr_to_reclaim,
+ * where reclaimed_to_scanned_ratio = inactive / total.
*/
reclaimable = node_page_state(pgdat, NR_INACTIVE_FILE);
if (get_swappiness(lruvec, sc))
reclaimable += node_page_state(pgdat, NR_INACTIVE_ANON);

- reclaimable /= MEMCG_NR_GENS;
-
/* round down reclaimable and round up sc->nr_to_reclaim */
priority = fls_long(reclaimable) - 1 - fls_long(sc->nr_to_reclaim - 1);

--
2.43.0.472.g3155946c3a-goog

2023-12-08 06:14:44

by Yu Zhao

Subject: [PATCH mm-unstable v1 4/4] mm/mglru: reclaim offlined memcgs harder

In the effort to reduce zombie memcgs [1], it was discovered that the
memcg LRU doesn't apply enough pressure on offlined memcgs.
Specifically, instead of rotating them to the tail of the current
generation (MEMCG_LRU_TAIL) for a second attempt, it moves them to the
next generation (MEMCG_LRU_YOUNG) after the first attempt.

Not applying enough pressure on offlined memcgs can cause them to
build up, and this can be particularly harmful to memory-constrained
systems.

On Pixel 8 Pro, launching apps for 50 cycles:
                Before   After   Change
Zombie memcgs   45       35      -22%

[1] https://lore.kernel.org/CABdmKX2M6koq4Q0Cmp_-=wbP0Qa190HdEGGaHfxNS05gAkUtPA@mail.gmail.com/

Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: Yu Zhao <[email protected]>
Reported-by: T.J. Mercier <[email protected]>
Tested-by: T.J. Mercier <[email protected]>
Cc: [email protected]
---
include/linux/mmzone.h | 8 ++++----
mm/vmscan.c | 24 ++++++++++++++++--------
2 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e3093ef9530f..2efd3be484fd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -524,10 +524,10 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
* 1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD;
* 2. The first attempt to reclaim a memcg below low, which triggers
* MEMCG_LRU_TAIL;
- * 3. The first attempt to reclaim a memcg below reclaimable size threshold,
- * which triggers MEMCG_LRU_TAIL;
- * 4. The second attempt to reclaim a memcg below reclaimable size threshold,
- * which triggers MEMCG_LRU_YOUNG;
+ * 3. The first attempt to reclaim a memcg offlined or below reclaimable size
+ * threshold, which triggers MEMCG_LRU_TAIL;
+ * 4. The second attempt to reclaim a memcg offlined or below reclaimable size
+ * threshold, which triggers MEMCG_LRU_YOUNG;
* 5. Attempting to reclaim a memcg below min, which triggers MEMCG_LRU_YOUNG;
* 6. Finishing the aging on the eviction path, which triggers MEMCG_LRU_YOUNG;
* 7. Offlining a memcg, which triggers MEMCG_LRU_OLD.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cac38e9cac86..dad4b80b04cd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4626,7 +4626,12 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
}

/* try to scrape all its memory if this memcg was deleted */
- *nr_to_scan = mem_cgroup_online(memcg) ? (total >> sc->priority) : total;
+ if (!mem_cgroup_online(memcg)) {
+ *nr_to_scan = total;
+ return false;
+ }
+
+ *nr_to_scan = total >> sc->priority;

/*
* The aging tries to be lazy to reduce the overhead, while the eviction
@@ -4747,14 +4752,9 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
bool success;
unsigned long scanned = sc->nr_scanned;
unsigned long reclaimed = sc->nr_reclaimed;
- int seg = lru_gen_memcg_seg(lruvec);
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);

- /* see the comment on MEMCG_NR_GENS */
- if (!lruvec_is_sizable(lruvec, sc))
- return seg != MEMCG_LRU_TAIL ? MEMCG_LRU_TAIL : MEMCG_LRU_YOUNG;
-
mem_cgroup_calculate_protection(NULL, memcg);

if (mem_cgroup_below_min(NULL, memcg))
@@ -4762,7 +4762,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)

if (mem_cgroup_below_low(NULL, memcg)) {
/* see the comment on MEMCG_NR_GENS */
- if (seg != MEMCG_LRU_TAIL)
+ if (lru_gen_memcg_seg(lruvec) != MEMCG_LRU_TAIL)
return MEMCG_LRU_TAIL;

memcg_memory_event(memcg, MEMCG_LOW);
@@ -4778,7 +4778,15 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)

flush_reclaim_state(sc);

- return success ? MEMCG_LRU_YOUNG : 0;
+ if (success && mem_cgroup_online(memcg))
+ return MEMCG_LRU_YOUNG;
+
+ if (!success && lruvec_is_sizable(lruvec, sc))
+ return 0;
+
+ /* one retry if offlined or too small */
+ return lru_gen_memcg_seg(lruvec) != MEMCG_LRU_TAIL ?
+ MEMCG_LRU_TAIL : MEMCG_LRU_YOUNG;
}

#ifdef CONFIG_MEMCG
--
2.43.0.472.g3155946c3a-goog

2023-12-08 08:24:50

by Kairui Song

Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Yu Zhao <[email protected]> wrote on Fri, Dec 8, 2023 at 14:14:
>
> Unmapped folios accessed through file descriptors can be
> underprotected. Those folios are added to the oldest generation based
> on:
> 1. The fact that they are less costly to reclaim (no need to walk the
> rmap and flush the TLB) and have less impact on performance (don't
> cause major PFs and can be non-blocking if needed again).
> 2. The observation that they are likely to be single-use. E.g., for
> client use cases like Android, its apps parse configuration files
> and store the data in heap (anon); for server use cases like MySQL,
> it reads from InnoDB files and holds the cached data for tables in
> buffer pools (anon).
>
> However, the oldest generation can be very short lived, and if so, it
> doesn't provide the PID controller with enough time to respond to a
> surge of refaults. (Note that the PID controller uses weighted
> refaults and those from evicted generations only take a half of the
> whole weight.) In other words, for a short lived generation, the
> moving average smooths out the spike quickly.
>
> To fix the problem:
> 1. For folios that are already on LRU, if they can be beyond the
> tracking range of tiers, i.e., five accesses through file
> descriptors, move them to the second oldest generation to give them
> more time to age. (Note that tiers are used by the PID controller
> to statistically determine whether folios accessed multiple times
> through file descriptors are worth protecting.)
> 2. When adding unmapped folios to LRU, adjust the placement of them so
> that they are not too close to the tail. The effect of this is
> similar to the above.
>
> On Android, launching 55 apps sequentially:
> Before After Change
> workingset_refault_anon 25641024 25598972 0%
> workingset_refault_file 115016834 106178438 -8%

Hi Yu,

Thank you for your amazing work on MGLRU.

I believe this is similar to the issue I was trying to resolve previously:
https://lwn.net/Articles/945266/
The idea is to use the refault distance to decide whether a page should be
placed in the oldest generation or some other gen, which, per my tests,
worked very well, and we have been using the refault distance for MGLRU in
multiple workloads.

There are a few issues left in my previous RFC series, e.g., anon pages
in MGLRU shouldn't be considered. I wanted to collect feedback or test
cases, but unfortunately it didn't seem to get much attention
upstream.

I think both this patch and my previous series are aimed at solving the
underprotected file pages issue, and I did a quick test using this
series. For the MongoDB test, the refault distance still seems to be a better
solution (I'm not saying these two optimizations are mutually exclusive,
though; they just have some conflicts in implementation and solve a
similar problem):

Previous result:
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
              Executed       Time (µs)   Rate
STOCK_LEVEL       2542   27121571486.2   0.09 txn/s
------------------------------------------------------------------
TOTAL             2542   27121571486.2   0.09 txn/s

This patch:
==================================================================
Execution Results after 900 seconds
------------------------------------------------------------------
              Executed       Time (µs)   Rate
STOCK_LEVEL       1594   27061522574.4   0.06 txn/s
------------------------------------------------------------------
TOTAL             1594   27061522574.4   0.06 txn/s

Unpatched version is always around ~500.

I think there are a few points here:
- Refault distance makes use of the page shadow, so it can better
distinguish evicted pages with different access patterns (re-access
distance).
- A throttled refault distance can help hold part of the workingset when
memory is too small to hold the whole workingset.

So maybe parts of this patch and bits of my previous series can be
combined to work better on this issue. What do you think?
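For context, a minimal toy sketch of the refault-distance heuristic referred
to above, as the classic workingset code (mm/workingset.c) uses it; this is
not the RFC's actual code, and all names below are made up:

  #include <stdbool.h>
  #include <stdio.h>

  /*
   * Toy model: a shadow entry records the nonresident "clock" at eviction
   * time; on refault, the distance is how far that clock has advanced. If
   * the distance fits within the amount of protectable memory, the
   * refaulting page is treated as part of the workingset.
   */
  struct toy_shadow {
          unsigned long eviction;   /* clock value recorded at eviction */
  };

  static bool toy_workingset_test(unsigned long clock_now,
                                  const struct toy_shadow *shadow,
                                  unsigned long protectable_pages)
  {
          unsigned long refault_distance = clock_now - shadow->eviction;

          return refault_distance <= protectable_pages;
  }

  int main(void)
  {
          struct toy_shadow s = { .eviction = 1000 };

          /* re-accessed soon (distance 500) vs much later (distance 50000) */
          printf("%d %d\n",
                 toy_workingset_test(1500, &s, 4096),
                 toy_workingset_test(51000, &s, 4096));
          return 0;
  }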

>
> Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
> Signed-off-by: Yu Zhao <[email protected]>
> Reported-by: Charan Teja Kalla <[email protected]>
> Tested-by: Kalesh Singh <[email protected]>
> Cc: [email protected]
> ---
> include/linux/mm_inline.h | 23 ++++++++++++++---------
> mm/vmscan.c | 2 +-
> mm/workingset.c | 6 +++---
> 3 files changed, 18 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 9ae7def16cb2..f4fe593c1400 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -232,22 +232,27 @@ static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio,
> if (folio_test_unevictable(folio) || !lrugen->enabled)
> return false;
> /*
> - * There are three common cases for this page:
> - * 1. If it's hot, e.g., freshly faulted in or previously hot and
> - * migrated, add it to the youngest generation.
> - * 2. If it's cold but can't be evicted immediately, i.e., an anon page
> - * not in swapcache or a dirty page pending writeback, add it to the
> - * second oldest generation.
> - * 3. Everything else (clean, cold) is added to the oldest generation.
> + * There are four common cases for this page:
> + * 1. If it's hot, i.e., freshly faulted in, add it to the youngest
> + * generation, and it's protected over the rest below.
> + * 2. If it can't be evicted immediately, i.e., a dirty page pending
> + * writeback, add it to the second youngest generation.
> + * 3. If it should be evicted first, e.g., cold and clean from
> + * folio_rotate_reclaimable(), add it to the oldest generation.
> + * 4. Everything else falls between 2 & 3 above and is added to the
> + * second oldest generation if it's considered inactive, or the
> + * oldest generation otherwise. See lru_gen_is_active().
> */
> if (folio_test_active(folio))
> seq = lrugen->max_seq;
> else if ((type == LRU_GEN_ANON && !folio_test_swapcache(folio)) ||
> (folio_test_reclaim(folio) &&
> (folio_test_dirty(folio) || folio_test_writeback(folio))))
> - seq = lrugen->min_seq[type] + 1;
> - else
> + seq = lrugen->max_seq - 1;
> + else if (reclaiming || lrugen->min_seq[type] + MIN_NR_GENS >= lrugen->max_seq)
> seq = lrugen->min_seq[type];
> + else
> + seq = lrugen->min_seq[type] + 1;

For example, maybe still keep the pages in the oldest gen by default, but
if the page has an eligible shadow, then put it in min_seq + 1?

>
> gen = lru_gen_from_seq(seq);
> flags = (gen + 1UL) << LRU_GEN_PGOFF;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4e3b835c6b4a..e67631c60ac0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4260,7 +4260,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> }
>
> /* protected */
> - if (tier > tier_idx) {
> + if (tier > tier_idx || refs == BIT(LRU_REFS_WIDTH)) {
> int hist = lru_hist_from_seq(lrugen->min_seq[type]);
>
> gen = folio_inc_gen(lruvec, folio, false);
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 7d3dacab8451..2a2a34234df9 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -313,10 +313,10 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
> * 1. For pages accessed through page tables, hotter pages pushed out
> * hot pages which refaulted immediately.
> * 2. For pages accessed multiple times through file descriptors,
> - * numbers of accesses might have been out of the range.
> + * they would have been protected by sort_folio().
> */
> - if (lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH)) {
> - folio_set_workingset(folio);
> + if (lru_gen_in_fault() || refs >= BIT(LRU_REFS_WIDTH) - 1) {
> + set_mask_bits(&folio->flags, 0, LRU_REFS_MASK | BIT(PG_workingset));
> mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
> }

Also, this could be combined with the refault distance check for setting the
reference flag.

2023-12-11 22:02:22

by Yu Zhao

Subject: Re: [PATCH mm-unstable v1 2/4] mm/mglru: try to stop at high watermarks

On Fri, Dec 8, 2023 at 4:00 AM Hillf Danton <[email protected]> wrote:
>
> On Thu, 7 Dec 2023 23:14:05 -0700 Yu Zhao <[email protected]>
> > -static unsigned long get_nr_to_reclaim(struct scan_control *sc)
> > +static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
> > {
> > + int i;
> > + enum zone_watermarks mark;
> > +
> > /* don't abort memcg reclaim to ensure fairness */
> > if (!root_reclaim(sc))
> > - return -1;
> > + return false;
> >
> > - return max(sc->nr_to_reclaim, compact_gap(sc->order));
> > + if (sc->nr_reclaimed >= max(sc->nr_to_reclaim, compact_gap(sc->order)))
> > + return true;
> > +
> > + /* check the order to exclude compaction-induced reclaim */
> > + if (!current_is_kswapd() || sc->order)
> > + return false;
> > +
> > + mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > + WMARK_PROMO : WMARK_HIGH;
> > +
> > + for (i = 0; i <= sc->reclaim_idx; i++) {
> > + struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > + unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH;
> > +
> > + if (managed_zone(zone) && !zone_watermark_ok(zone, 0, size, sc->reclaim_idx, 0))
> > + return false;
> > + }
> > +
> > + /* kswapd should abort if all eligible zones are safe */
>
> This comment does not align with 86c79f6b5426
> ("mm: vmscan: do not reclaim from kswapd if there is any eligible zone").
> Anything special here?

I don't see how they don't align: they essentially say the same thing ("no
more than needed"), just in different units: zones or pages. IOW,
don't reclaim from more zones or pages than needed.

2023-12-11 22:07:39

by Yu Zhao

Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
>
> Yu Zhao <[email protected]> wrote on Fri, Dec 8, 2023 at 14:14:
> >
> > Unmapped folios accessed through file descriptors can be
> > underprotected. Those folios are added to the oldest generation based
> > on:
> > 1. The fact that they are less costly to reclaim (no need to walk the
> > rmap and flush the TLB) and have less impact on performance (don't
> > cause major PFs and can be non-blocking if needed again).
> > 2. The observation that they are likely to be single-use. E.g., for
> > client use cases like Android, its apps parse configuration files
> > and store the data in heap (anon); for server use cases like MySQL,
> > it reads from InnoDB files and holds the cached data for tables in
> > buffer pools (anon).
> >
> > However, the oldest generation can be very short lived, and if so, it
> > doesn't provide the PID controller with enough time to respond to a
> > surge of refaults. (Note that the PID controller uses weighted
> > refaults and those from evicted generations only take a half of the
> > whole weight.) In other words, for a short lived generation, the
> > moving average smooths out the spike quickly.
> >
> > To fix the problem:
> > 1. For folios that are already on LRU, if they can be beyond the
> > tracking range of tiers, i.e., five accesses through file
> > descriptors, move them to the second oldest generation to give them
> > more time to age. (Note that tiers are used by the PID controller
> > to statistically determine whether folios accessed multiple times
> > through file descriptors are worth protecting.)
> > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > that they are not too close to the tail. The effect of this is
> > similar to the above.
> >
> > On Android, launching 55 apps sequentially:
> > Before After Change
> > workingset_refault_anon 25641024 25598972 0%
> > workingset_refault_file 115016834 106178438 -8%
>
> Hi Yu,
>
> Thanks you for your amazing works on MGLRU.
>
> I believe this is the similar issue I was trying to resolve previously:
> https://lwn.net/Articles/945266/
> The idea is to use refault distance to decide if the page should be
> place in oldest generation or some other gen, which per my test,
> worked very well, and we have been using refault distance for MGLRU in
> multiple workloads.
>
> There are a few issues left in my previous RFC series, like anon pages
> in MGLRU shouldn't be considered, I wanted to collect feedback or test
> cases, but unfortunately it seems didn't get too much attention
> upstream.
>
> I think both this patch and my previous series are for solving the
> file pages underpertected issue, and I did a quick test using this
> series, for mongodb test, refault distance seems still a better
> solution (I'm not saying these two optimization are mutually exclusive
> though, just they do have some conflicts in implementation and solving
> similar problem):
>
> Previous result:
> ==================================================================
> Execution Results after 905 seconds
> ------------------------------------------------------------------
> Executed Time (µs) Rate
> STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> ------------------------------------------------------------------
> TOTAL 2542 27121571486.2 0.09 txn/s
>
> This patch:
> ==================================================================
> Execution Results after 900 seconds
> ------------------------------------------------------------------
> Executed Time (µs) Rate
> STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> ------------------------------------------------------------------
> TOTAL 1594 27061522574.4 0.06 txn/s
>
> Unpatched version is always around ~500.

Thanks for the test results!

> I think there are a few points here:
> - Refault distance make use of page shadow so it can better
> distinguish evicted pages of different access pattern (re-access
> distance).
> - Throttled refault distance can help hold part of workingset when
> memory is too small to hold the whole workingset.
>
> So maybe part of this patch and the bits of previous series can be
> combined to work better on this issue, how do you think?

I'll try to find some time this week to look at your RFC. It'd be a
lot easier for me if you could share
1. your latest tree, preferably based on the mainline, and
2. your VM image containing the above test.

2023-12-12 06:52:52

by Kairui Song

Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Yu Zhao <[email protected]> wrote on Tue, Dec 12, 2023 at 06:07:
>
> On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> >
> > Yu Zhao <[email protected]> wrote on Fri, Dec 8, 2023 at 14:14:
> > >
> > > Unmapped folios accessed through file descriptors can be
> > > underprotected. Those folios are added to the oldest generation based
> > > on:
> > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > rmap and flush the TLB) and have less impact on performance (don't
> > > cause major PFs and can be non-blocking if needed again).
> > > 2. The observation that they are likely to be single-use. E.g., for
> > > client use cases like Android, its apps parse configuration files
> > > and store the data in heap (anon); for server use cases like MySQL,
> > > it reads from InnoDB files and holds the cached data for tables in
> > > buffer pools (anon).
> > >
> > > However, the oldest generation can be very short lived, and if so, it
> > > doesn't provide the PID controller with enough time to respond to a
> > > surge of refaults. (Note that the PID controller uses weighted
> > > refaults and those from evicted generations only take a half of the
> > > whole weight.) In other words, for a short lived generation, the
> > > moving average smooths out the spike quickly.
> > >
> > > To fix the problem:
> > > 1. For folios that are already on LRU, if they can be beyond the
> > > tracking range of tiers, i.e., five accesses through file
> > > descriptors, move them to the second oldest generation to give them
> > > more time to age. (Note that tiers are used by the PID controller
> > > to statistically determine whether folios accessed multiple times
> > > through file descriptors are worth protecting.)
> > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > that they are not too close to the tail. The effect of this is
> > > similar to the above.
> > >
> > > On Android, launching 55 apps sequentially:
> > > Before After Change
> > > workingset_refault_anon 25641024 25598972 0%
> > > workingset_refault_file 115016834 106178438 -8%
> >
> > Hi Yu,
> >
> > Thanks you for your amazing works on MGLRU.
> >
> > I believe this is the similar issue I was trying to resolve previously:
> > https://lwn.net/Articles/945266/
> > The idea is to use refault distance to decide if the page should be
> > place in oldest generation or some other gen, which per my test,
> > worked very well, and we have been using refault distance for MGLRU in
> > multiple workloads.
> >
> > There are a few issues left in my previous RFC series, like anon pages
> > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > cases, but unfortunately it seems didn't get too much attention
> > upstream.
> >
> > I think both this patch and my previous series are for solving the
> > file pages underpertected issue, and I did a quick test using this
> > series, for mongodb test, refault distance seems still a better
> > solution (I'm not saying these two optimization are mutually exclusive
> > though, just they do have some conflicts in implementation and solving
> > similar problem):
> >
> > Previous result:
> > ==================================================================
> > Execution Results after 905 seconds
> > ------------------------------------------------------------------
> > Executed Time (µs) Rate
> > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > ------------------------------------------------------------------
> > TOTAL 2542 27121571486.2 0.09 txn/s
> >
> > This patch:
> > ==================================================================
> > Execution Results after 900 seconds
> > ------------------------------------------------------------------
> > Executed Time (µs) Rate
> > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > ------------------------------------------------------------------
> > TOTAL 1594 27061522574.4 0.06 txn/s
> >
> > Unpatched version is always around ~500.
>
> Thanks for the test results!
>
> > I think there are a few points here:
> > - Refault distance make use of page shadow so it can better
> > distinguish evicted pages of different access pattern (re-access
> > distance).
> > - Throttled refault distance can help hold part of workingset when
> > memory is too small to hold the whole workingset.
> >
> > So maybe part of this patch and the bits of previous series can be
> > combined to work better on this issue, how do you think?
>
> I'll try to find some time this week to look at your RFC. It'd be a

Thanks!

> lot easier for me if you could share
> 1. your latest tree, preferably based on the mainline, and
> 2. your VM image containing the above test.

Sure, I'll update the RFC and try to provide an easier test reproducer.

2023-12-13 03:03:35

by Kairui Song

Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Kairui Song <[email protected]> wrote on Tue, Dec 12, 2023 at 14:52:
>
> Yu Zhao <[email protected]> wrote on Tue, Dec 12, 2023 at 06:07:
> >
> > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > >
> > > Yu Zhao <[email protected]> wrote on Fri, Dec 8, 2023 at 14:14:
> > > >
> > > > Unmapped folios accessed through file descriptors can be
> > > > underprotected. Those folios are added to the oldest generation based
> > > > on:
> > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > cause major PFs and can be non-blocking if needed again).
> > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > client use cases like Android, its apps parse configuration files
> > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > it reads from InnoDB files and holds the cached data for tables in
> > > > buffer pools (anon).
> > > >
> > > > However, the oldest generation can be very short lived, and if so, it
> > > > doesn't provide the PID controller with enough time to respond to a
> > > > surge of refaults. (Note that the PID controller uses weighted
> > > > refaults and those from evicted generations only take a half of the
> > > > whole weight.) In other words, for a short lived generation, the
> > > > moving average smooths out the spike quickly.
> > > >
> > > > To fix the problem:
> > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > tracking range of tiers, i.e., five accesses through file
> > > > descriptors, move them to the second oldest generation to give them
> > > > more time to age. (Note that tiers are used by the PID controller
> > > > to statistically determine whether folios accessed multiple times
> > > > through file descriptors are worth protecting.)
> > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > that they are not too close to the tail. The effect of this is
> > > > similar to the above.
> > > >
> > > > On Android, launching 55 apps sequentially:
> > > > Before After Change
> > > > workingset_refault_anon 25641024 25598972 0%
> > > > workingset_refault_file 115016834 106178438 -8%
> > >
> > > Hi Yu,
> > >
> > > Thanks you for your amazing works on MGLRU.
> > >
> > > I believe this is the similar issue I was trying to resolve previously:
> > > https://lwn.net/Articles/945266/
> > > The idea is to use refault distance to decide if the page should be
> > > place in oldest generation or some other gen, which per my test,
> > > worked very well, and we have been using refault distance for MGLRU in
> > > multiple workloads.
> > >
> > > There are a few issues left in my previous RFC series, like anon pages
> > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > cases, but unfortunately it seems didn't get too much attention
> > > upstream.
> > >
> > > I think both this patch and my previous series are for solving the
> > > file pages underpertected issue, and I did a quick test using this
> > > series, for mongodb test, refault distance seems still a better
> > > solution (I'm not saying these two optimization are mutually exclusive
> > > though, just they do have some conflicts in implementation and solving
> > > similar problem):
> > >
> > > Previous result:
> > > ==================================================================
> > > Execution Results after 905 seconds
> > > ------------------------------------------------------------------
> > > Executed Time (µs) Rate
> > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > ------------------------------------------------------------------
> > > TOTAL 2542 27121571486.2 0.09 txn/s
> > >
> > > This patch:
> > > ==================================================================
> > > Execution Results after 900 seconds
> > > ------------------------------------------------------------------
> > > Executed Time (µs) Rate
> > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > ------------------------------------------------------------------
> > > TOTAL 1594 27061522574.4 0.06 txn/s
> > >
> > > Unpatched version is always around ~500.
> >
> > Thanks for the test results!
> >
> > > I think there are a few points here:
> > > - Refault distance make use of page shadow so it can better
> > > distinguish evicted pages of different access pattern (re-access
> > > distance).
> > > - Throttled refault distance can help hold part of workingset when
> > > memory is too small to hold the whole workingset.
> > >
> > > So maybe part of this patch and the bits of previous series can be
> > > combined to work better on this issue, how do you think?
> >
> > I'll try to find some time this week to look at your RFC. It'd be a

Hi Yu,

I'm working on V4 of the RFC now, which just updates some comments and
skips anon page re-activation in the refault path for MGLRU, which was not
very helpful; only some tiny adjustments.
And I found it easier to test with fio, using the following test script:

#!/bin/bash
swapoff -a

modprobe brd rd_nr=1 rd_size=16777216
mkfs.ext4 /dev/ram0
mount /dev/ram0 /mnt

mkdir -p /sys/fs/cgroup/benchmark
cd /sys/fs/cgroup/benchmark

echo 4G > memory.max
echo $$ > cgroup.procs
echo 3 > /proc/sys/vm/drop_caches

fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=zipf:0.5 --norandommap \
--time_based --ramp_time=5m --runtime=5m --group_reporting

zipf:0.5 is used here to simulate a cached read with a slight bias
towards certain pages.
Unpatched 6.7-rc4:
Run status group 0 (all jobs):
READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
(6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec

Patched with RFC v4:
Run status group 0 (all jobs):
READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
(7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec

Patched with this series:
Run status group 0 (all jobs):
READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
(7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec

MGLRU off:
Run status group 0 (all jobs):
READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
(6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec

- If I change zipf:0.5 to random:
Unpatched 6.7-rc4:
Patched with this series:
Run status group 0 (all jobs):
READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
(6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec

Patched with RFC v4:
Run status group 0 (all jobs):
READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
(6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec

Patched with this series:
Run status group 0 (all jobs):
READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
(6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec

MGLRU off:
Run status group 0 (all jobs):
READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
(5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec

fio uses a ramdisk, so LRU accuracy will have a smaller impact. The MongoDB
test I provided before uses a SATA SSD, so it will have a much higher
impact. I'll provide a script to set up the test case and run it; it's
more complex to set up than fio since it involves setting up multiple
replicas and auth and hundreds of GB of test fixtures. I'm currently
occupied by some other tasks but will try my best to send them out as
soon as possible.

2023-12-13 08:00:42

by Yu Zhao

Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
>
> Kairui Song <[email protected]> wrote on Tue, Dec 12, 2023 at 14:52:
> >
> > Yu Zhao <[email protected]> wrote on Tue, Dec 12, 2023 at 06:07:
> > >
> > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > >
> > > > Yu Zhao <[email protected]> wrote on Fri, Dec 8, 2023 at 14:14:
> > > > >
> > > > > Unmapped folios accessed through file descriptors can be
> > > > > underprotected. Those folios are added to the oldest generation based
> > > > > on:
> > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > cause major PFs and can be non-blocking if needed again).
> > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > client use cases like Android, its apps parse configuration files
> > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > buffer pools (anon).
> > > > >
> > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > refaults and those from evicted generations only take a half of the
> > > > > whole weight.) In other words, for a short lived generation, the
> > > > > moving average smooths out the spike quickly.
> > > > >
> > > > > To fix the problem:
> > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > tracking range of tiers, i.e., five accesses through file
> > > > > descriptors, move them to the second oldest generation to give them
> > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > to statistically determine whether folios accessed multiple times
> > > > > through file descriptors are worth protecting.)
> > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > that they are not too close to the tail. The effect of this is
> > > > > similar to the above.
> > > > >
> > > > > On Android, launching 55 apps sequentially:
> > > > > Before After Change
> > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > workingset_refault_file 115016834 106178438 -8%
> > > >
> > > > Hi Yu,
> > > >
> > > > Thanks you for your amazing works on MGLRU.
> > > >
> > > > I believe this is the similar issue I was trying to resolve previously:
> > > > https://lwn.net/Articles/945266/
> > > > The idea is to use refault distance to decide if the page should be
> > > > place in oldest generation or some other gen, which per my test,
> > > > worked very well, and we have been using refault distance for MGLRU in
> > > > multiple workloads.
> > > >
> > > > There are a few issues left in my previous RFC series, like anon pages
> > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > cases, but unfortunately it seems didn't get too much attention
> > > > upstream.
> > > >
> > > > I think both this patch and my previous series are for solving the
> > > > file pages underpertected issue, and I did a quick test using this
> > > > series, for mongodb test, refault distance seems still a better
> > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > though, just they do have some conflicts in implementation and solving
> > > > similar problem):
> > > >
> > > > Previous result:
> > > > ==================================================================
> > > > Execution Results after 905 seconds
> > > > ------------------------------------------------------------------
> > > > Executed Time (µs) Rate
> > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > ------------------------------------------------------------------
> > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > >
> > > > This patch:
> > > > ==================================================================
> > > > Execution Results after 900 seconds
> > > > ------------------------------------------------------------------
> > > > Executed Time (µs) Rate
> > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > ------------------------------------------------------------------
> > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > >
> > > > Unpatched version is always around ~500.
> > >
> > > Thanks for the test results!
> > >
> > > > I think there are a few points here:
> > > > - Refault distance make use of page shadow so it can better
> > > > distinguish evicted pages of different access pattern (re-access
> > > > distance).
> > > > - Throttled refault distance can help hold part of workingset when
> > > > memory is too small to hold the whole workingset.
> > > >
> > > > So maybe part of this patch and the bits of previous series can be
> > > > combined to work better on this issue, how do you think?
> > >
> > > I'll try to find some time this week to look at your RFC. It'd be a
>
> Hi Yu,
>
> I'm working on V4 of the RFC now, which just update some comments, and
> skip anon page re-activation in refault path for mglru which was not
> very helpful, only some tiny adjustment.
> And I found it easier to test with fio, using following test script:
>
> #!/bin/bash
> swapoff -a
>
> modprobe brd rd_nr=1 rd_size=16777216
> mkfs.ext4 /dev/ram0
> mount /dev/ram0 /mnt
>
> mkdir -p /sys/fs/cgroup/benchmark
> cd /sys/fs/cgroup/benchmark
>
> echo 4G > memory.max
> echo $$ > cgroup.procs
> echo 3 > /proc/sys/vm/drop_caches
>
> fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> --buffered=1 --ioengine=io_uring --iodepth=128 \
> --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> --rw=randread --random_distribution=zipf:0.5 --norandommap \
> --time_based --ramp_time=5m --runtime=5m --group_reporting
>
> zipf:0.5 is used here to simulate a cached read with slight bias
> towards certain pages.
> Unpatched 6.7-rc4:
> Run status group 0 (all jobs):
> READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
>
> Patched with RFC v4:
> Run status group 0 (all jobs):
> READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
>
> Patched with this series:
> Run status group 0 (all jobs):
> READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
>
> MGLRU off:
> Run status group 0 (all jobs):
> READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
>
> - If I change zipf:0.5 to random:
> Unpatched 6.7-rc4:
> Patched with this series:
> Run status group 0 (all jobs):
> READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
>
> Patched with RFC v4:
> Run status group 0 (all jobs):
> READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
>
> Patched with this series:
> Run status group 0 (all jobs):
> READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
>
> MGLRU off:
> Run status group 0 (all jobs):
> READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
>
> fio uses a ramdisk, so LRU accuracy has a smaller impact. The MongoDB
> test I provided before uses a SATA SSD, so it will have a much higher
> impact. I'll provide a script to set up the test case and run it; it's
> more complex to set up than fio since it involves setting up multiple
> replicas, auth, and hundreds of GB of test fixtures. I'm currently
> occupied by some other tasks but will try my best to send it out as
> soon as possible.

Thanks! Apparently your RFC did show better IOPS with both access
patterns, which was a surprise to me because it had higher refaults
and usually higher refaults result in worse performance.

So I'm still trying to figure out why it turned out the opposite. My
current guess is that:
1. It had a very small but stable inactive LRU list, which was able to
fit into the L3 cache entirely.
2. It counted few folios as workingset and therefore incurred less
overhead from CONFIG_PSI and/or CONFIG_TASK_DELAY_ACCT.

Did you save workingset_refault_file when you ran the test? If so, can
you check the difference between this series and your RFC?
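
For reference, a quick way to capture that counter around each fio run,
assuming the cgroup v2 "benchmark" group from the test script above, is
to sample the cgroup's memory.stat before and after the run and compare
the deltas:

grep workingset_refault_file /sys/fs/cgroup/benchmark/memory.stat

(workingset_refault_anon is reported in the same file.)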

2023-12-14 03:09:47

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> >
> > Kairui Song <[email protected]> wrote on Tue, Dec 12, 2023 at 14:52:
> > >
> > > Yu Zhao <[email protected]> wrote on Tue, Dec 12, 2023 at 06:07:
> > > >
> > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > >
> > > > > Yu Zhao <[email protected]> wrote on Fri, Dec 8, 2023 at 14:14:
> > > > > >
> > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > on:
> > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > client use cases like Android, its apps parse configuration files
> > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > buffer pools (anon).
> > > > > >
> > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > refaults and those from evicted generations only take a half of the
> > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > moving average smooths out the spike quickly.
> > > > > >
> > > > > > To fix the problem:
> > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > to statistically determine whether folios accessed multiple times
> > > > > > through file descriptors are worth protecting.)
> > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > that they are not too close to the tail. The effect of this is
> > > > > > similar to the above.
> > > > > >
> > > > > > On Android, launching 55 apps sequentially:
> > > > > > Before After Change
> > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > >
> > > > > Hi Yu,
> > > > >
> > > > > Thanks you for your amazing works on MGLRU.
> > > > >
> > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > https://lwn.net/Articles/945266/
> > > > > The idea is to use refault distance to decide if the page should be
> > > > > place in oldest generation or some other gen, which per my test,
> > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > multiple workloads.
> > > > >
> > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > upstream.
> > > > >
> > > > > I think both this patch and my previous series are for solving the
> > > > > file pages underpertected issue, and I did a quick test using this
> > > > > series, for mongodb test, refault distance seems still a better
> > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > though, just they do have some conflicts in implementation and solving
> > > > > similar problem):
> > > > >
> > > > > Previous result:
> > > > > ==================================================================
> > > > > Execution Results after 905 seconds
> > > > > ------------------------------------------------------------------
> > > > > Executed Time (µs) Rate
> > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > ------------------------------------------------------------------
> > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > >
> > > > > This patch:
> > > > > ==================================================================
> > > > > Execution Results after 900 seconds
> > > > > ------------------------------------------------------------------
> > > > > Executed Time (µs) Rate
> > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > ------------------------------------------------------------------
> > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > >
> > > > > Unpatched version is always around ~500.
> > > >
> > > > Thanks for the test results!
> > > >
> > > > > I think there are a few points here:
> > > > > - Refault distance makes use of page shadows, so it can better
> > > > > distinguish evicted pages with different access patterns (re-access
> > > > > distance).
> > > > > - Throttled refault distance can help hold part of the workingset
> > > > > when memory is too small to hold the whole workingset.
> > > > >
> > > > > So maybe parts of this patch and bits of the previous series can be
> > > > > combined to work better on this issue; what do you think?
> > > >
> > > > I'll try to find some time this week to look at your RFC. It'd be a
> >
> > Hi Yu,
> >
> > I'm working on V4 of the RFC now, which just updates some comments and
> > skips anon page re-activation in the refault path for MGLRU, which was
> > not very helpful; only some tiny adjustments.
> > And I found it easier to test with fio, using the following test script:
> >
> > #!/bin/bash
> > swapoff -a
> >
> > modprobe brd rd_nr=1 rd_size=16777216
> > mkfs.ext4 /dev/ram0
> > mount /dev/ram0 /mnt
> >
> > mkdir -p /sys/fs/cgroup/benchmark
> > cd /sys/fs/cgroup/benchmark
> >
> > echo 4G > memory.max
> > echo $$ > cgroup.procs
> > echo 3 > /proc/sys/vm/drop_caches
> >
> > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > --time_based --ramp_time=5m --runtime=5m --group_reporting
> >
> > zipf:0.5 is used here to simulate cached reads with a slight bias
> > towards certain pages.
> > Unpatched 6.7-rc4:
> > Run status group 0 (all jobs):
> > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> >
> > Patched with RFC v4:
> > Run status group 0 (all jobs):
> > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> >
> > Patched with this series:
> > Run status group 0 (all jobs):
> > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> >
> > MGLRU off:
> > Run status group 0 (all jobs):
> > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> >
> > - If I change zipf:0.5 to random:
> > Unpatched 6.7-rc4:
> > Run status group 0 (all jobs):
> > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> >
> > Patched with RFC v4:
> > Run status group 0 (all jobs):
> > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> >
> > Patched with this series:
> > Run status group 0 (all jobs):
> > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> >
> > MGLRU off:
> > Run status group 0 (all jobs):
> > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> >
> > fio uses a ramdisk, so LRU accuracy has a smaller impact. The MongoDB
> > test I provided before uses a SATA SSD, so it will have a much higher
> > impact. I'll provide a script to set up the test case and run it; it's
> > more complex to set up than fio since it involves setting up multiple
> > replicas, auth, and hundreds of GB of test fixtures. I'm currently
> > occupied by some other tasks but will try my best to send it out as
> > soon as possible.
>
> Thanks! Apparently your RFC did show better IOPS with both access
> patterns, which was a surprise to me because it had higher refaults
> and usually higher refaults result in worse performance.
>
> So I'm still trying to figure out why it turned out the opposite. My
> current guess is that:
> 1. It had a very small but stable inactive LRU list, which was able to
> fit into the L3 cache entirely.
> 2. It counted few folios as workingset and therefore incurred less
> overhead from CONFIG_PSI and/or CONFIG_TASK_DELAY_ACCT.
>
> Did you save workingset_refault_file when you ran the test? If so, can
> you check the difference between this series and your RFC?


It seems I was right about #1 above. After I scaled your test up by 20x,
I saw my series performed ~5% faster with zipf and ~9% faster with random
accesses.
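
(Taking the aggregate READ bandwidths reported below: 19.4 GiB/s vs
18.4 GiB/s for zipf is ~5.4%, and 14.7 GiB/s vs 13.4 GiB/s for random is
~9.7%.)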

IOW, I increased rd_size from 16GB to 320GB, memory.max from 4GB to 80GB,
--numjobs from 12 to 60 and --size from 1GB to 4GB.
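
That keeps the same 3:1 ratio of file data to memory as the original
script: 60 jobs x 4 GiB = 240 GiB of files against an 80 GiB memory.max,
versus 12 GiB against 4 GiB before. Assuming rd_size is still given in
KiB as in the original modprobe line, the scaled ramdisk would be created
with something like:

modprobe brd rd_nr=1 rd_size=335544320    # 320 GiB, i.e. 20x 16777216

(the rest of the setup is in fio.sh below.)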

v6.7-rc5 + this series
======================

zipf
----

mglru: (groupid=0, jobs=60): err= 0: pid=12155: Wed Dec 13 17:50:36 2023
read: IOPS=5074k, BW=19.4GiB/s (20.8GB/s)(5807GiB/300007msec)
slat (usec): min=36, max=109326, avg=363.67, stdev=1829.97
clat (nsec): min=783, max=113292k, avg=1136755.10, stdev=3162056.05
lat (usec): min=37, max=149232, avg=1500.43, stdev=3644.21
clat percentiles (usec):
| 1.00th=[ 490], 5.00th=[ 519], 10.00th=[ 537], 20.00th=[ 553],
| 30.00th=[ 570], 40.00th=[ 586], 50.00th=[ 627], 60.00th=[ 840],
| 70.00th=[ 988], 80.00th=[ 1074], 90.00th=[ 1188], 95.00th=[ 1336],
| 99.00th=[ 7308], 99.50th=[31327], 99.90th=[36963], 99.95th=[45351],
| 99.99th=[53216]
bw ( MiB/s): min= 8332, max=27116, per=100.00%, avg=19846.67, stdev=58.20, samples=35903
iops : min=2133165, max=6941826, avg=5080741.79, stdev=14899.13, samples=35903
lat (nsec) : 1000=0.01%
lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
lat (usec) : 250=0.01%, 500=1.76%, 750=52.94%, 1000=16.65%
lat (msec) : 2=26.22%, 4=0.15%, 10=1.36%, 20=0.01%, 50=0.90%
lat (msec) : 100=0.02%, 250=0.01%
cpu : usr=5.42%, sys=87.59%, ctx=470315, majf=0, minf=2184
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=100.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
issued rwts: total=1522384845,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
READ: bw=19.4GiB/s (20.8GB/s), 19.4GiB/s-19.4GiB/s (20.8GB/s-20.8GB/s), io=5807GiB (6236GB), run=300007-300007msec

Disk stats (read/write):
ram0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
mglru: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=128

random
------

mglru: (groupid=0, jobs=60): err= 0: pid=12576: Wed Dec 13 18:00:50 2023
read: IOPS=3853k, BW=14.7GiB/s (15.8GB/s)(4410GiB/300014msec)
slat (usec): min=58, max=118605, avg=486.45, stdev=2311.45
clat (usec): min=3, max=169810, avg=1496.60, stdev=3982.89
lat (usec): min=73, max=170019, avg=1983.06, stdev=4585.87
clat percentiles (usec):
| 1.00th=[ 586], 5.00th=[ 627], 10.00th=[ 644], 20.00th=[ 668],
| 30.00th=[ 693], 40.00th=[ 725], 50.00th=[ 816], 60.00th=[ 1123],
| 70.00th=[ 1221], 80.00th=[ 1352], 90.00th=[ 1516], 95.00th=[ 1713],
| 99.00th=[31851], 99.50th=[34866], 99.90th=[41681], 99.95th=[54264],
| 99.99th=[61080]
bw ( MiB/s): min= 6049, max=21328, per=100.00%, avg=15070.00, stdev=45.96, samples=35940
iops : min=1548543, max=5459997, avg=3857912.87, stdev=11765.30, samples=35940
lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 100=0.01%, 250=0.01%
lat (usec) : 500=0.01%, 750=44.64%, 1000=8.20%
lat (msec) : 2=43.84%, 4=0.27%, 10=1.79%, 20=0.01%, 50=1.20%
lat (msec) : 100=0.07%, 250=0.01%
cpu : usr=3.19%, sys=89.87%, ctx=463840, majf=0, minf=2248
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
issued rwts: total=1155923744,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
READ: bw=14.7GiB/s (15.8GB/s), 14.7GiB/s-14.7GiB/s (15.8GB/s-15.8GB/s), io=4410GiB (4735GB), run=300014-300014msec

Disk stats (read/write):
ram0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

memcg 3 /zipf
node 0
0 1521654 0 0x
0 0r 0e 0p 0 0 0
1 0r 0e 0p 0 0 0
2 0r 0e 0p 0 0 0
3 0r 0e 0p 0 0 0
0 0 0 0 0 0
1 1521654 0 21
0 0 0 0 1077016797r 1111542014e 0p
1 0 0 0 317997853r 324814007e 0p
2 0 0 0 68064253r 68866308e 124302p
3 0 0 0 0r 0e 12282816p
0 0 0 0 0 0
2 1521654 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
3 1521654 0 0
0 0R 0T 0 0R 0T 0
1 0R 0T 0 0R 0T 0
2 0R 0T 0 0R 0T 0
3 0R 0T 0 0R 0T 0
0L 0O 0Y 0N 0F 0A
node 1
0 1521654 0 0
0 0r 0e 0p 0r 0e 0p
1 0r 0e 0p 0r 0e 0p
2 0r 0e 0p 0r 0e 0p
3 0r 0e 0p 0r 0e 0p
0 0 0 0 0 0
1 1521654 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
2 1521654 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
3 1521654 0 0
0 0R 0T 0 0R 0T 0
1 0R 0T 0 0R 0T 0
2 0R 0T 0 0R 0T 0
3 0R 0T 0 0R 0T 0
0L 0O 0Y 0N 0F 0A
memcg 4 /random
node 0
0 600431 0 0x
0 0r 0e 0p 0 0 0
1 0r 0e 0p 0 0 0
2 0r 0e 0p 0 0 0
3 0r 0e 0p 0 0 0
0 0 0 0 0 0
1 600431 0 11169201
0 0 0 0 1071724785r 1103937007e 0p
1 0 0 0 376193810r 384852629e 0p
2 0 0 0 77315518r 78596395e 0p
3 0 0 0 0r 0e 9593442p
0 0 0 0 0 0
2 600431 1 9593442
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
3 600431 36 754
0 0R 0T 0 0R 0T 0
1 0R 0T 0 0R 0T 0
2 0R 0T 0 0R 0T 0
3 0R 0T 0 0R 0T 0
0L 0O 0Y 0N 0F 0A
node 1
0 600431 0 0
0 0r 0e 0p 0r 0e 0p
1 0r 0e 0p 0r 0e 0p
2 0r 0e 0p 0r 0e 0p
3 0r 0e 0p 0r 0e 0p
0 0 0 0 0 0
1 600431 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
2 600431 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
3 600431 0 0
0 0R 0T 0 0R 0T 0
1 0R 0T 0 0R 0T 0
2 0R 0T 0 0R 0T 0
3 0R 0T 0 0R 0T 0
0L 0O 0Y 0N 0F 0A

v6.7-rc5 + RFC v3
=================

zipf
----

mglru: (groupid=0, jobs=60): err= 0: pid=11600: Wed Dec 13 18:34:31 2023
read: IOPS=4816k, BW=18.4GiB/s (19.7GB/s)(5512GiB/300014msec)
slat (usec): min=3, max=121722, avg=384.46, stdev=2066.10
clat (nsec): min=356, max=174717k, avg=1197513.60, stdev=3568734.58
lat (usec): min=3, max=174919, avg=1581.97, stdev=4112.49
clat percentiles (usec):
| 1.00th=[ 486], 5.00th=[ 515], 10.00th=[ 529], 20.00th=[ 553],
| 30.00th=[ 570], 40.00th=[ 594], 50.00th=[ 652], 60.00th=[ 898],
| 70.00th=[ 988], 80.00th=[ 1139], 90.00th=[ 1254], 95.00th=[ 1369],
| 99.00th=[ 6915], 99.50th=[35914], 99.90th=[42206], 99.95th=[52167],
| 99.99th=[61604]
bw ( MiB/s): min= 7716, max=26325, per=100.00%, avg=18836.65, stdev=57.20, samples=35880
iops : min=1975306, max=6739280, avg=4822176.85, stdev=14642.35, samples=35880
lat (nsec) : 500=0.01%, 750=0.01%, 1000=0.01%
lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 100=0.01%, 250=0.01%
lat (usec) : 500=2.57%, 750=50.99%, 1000=17.56%
lat (msec) : 2=26.41%, 4=0.16%, 10=1.41%, 20=0.01%, 50=0.84%
lat (msec) : 100=0.05%, 250=0.01%
cpu : usr=4.95%, sys=88.09%, ctx=457609, majf=0, minf=2184
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=0.1%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
issued rwts: total=1445015808,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
READ: bw=18.4GiB/s (19.7GB/s), 18.4GiB/s-18.4GiB/s (19.7GB/s-19.7GB/s), io=5512GiB (5919GB), run=300014-300014msec

Disk stats (read/write):
ram0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
mglru: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=128

random
------

mglru: (groupid=0, jobs=60): err= 0: pid=12024: Wed Dec 13 18:44:45 2023
read: IOPS=3519k, BW=13.4GiB/s (14.4GB/s)(4027GiB/300011msec)
slat (usec): min=54, max=136278, avg=534.57, stdev=2738.72
clat (usec): min=3, max=176186, avg=1638.66, stdev=4714.55
lat (usec): min=78, max=176426, avg=2173.23, stdev=5426.40
clat percentiles (usec):
| 1.00th=[ 627], 5.00th=[ 676], 10.00th=[ 693], 20.00th=[ 725],
| 30.00th=[ 766], 40.00th=[ 816], 50.00th=[ 1090], 60.00th=[ 1205],
| 70.00th=[ 1270], 80.00th=[ 1369], 90.00th=[ 1500], 95.00th=[ 1614],
| 99.00th=[38536], 99.50th=[41681], 99.90th=[47973], 99.95th=[65799],
| 99.99th=[72877]
bw ( MiB/s): min= 5586, max=20476, per=100.00%, avg=13760.26, stdev=45.33, samples=35904
iops : min=1430070, max=5242110, avg=3522621.15, stdev=11604.46, samples=35904
lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 100=0.01%, 250=0.01%
lat (usec) : 500=0.01%, 750=26.33%, 1000=21.81%
lat (msec) : 2=48.54%, 4=0.16%, 10=1.91%, 20=0.01%, 50=1.17%
lat (msec) : 100=0.09%, 250=0.01%
cpu : usr=2.74%, sys=90.35%, ctx=481356, majf=0, minf=2244
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
issued rwts: total=1055590880,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
READ: bw=13.4GiB/s (14.4GB/s), 13.4GiB/s-13.4GiB/s (14.4GB/s-14.4GB/s), io=4027GiB (4324GB), run=300011-300011msec

Disk stats (read/write):
ram0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

memcg 3 /zipf
node 0
0 1522519 0 22
0 0r 0e 0p 996363383r 1092111170e 0p
1 0r 0e 0p 274581982r 235766575e 0p
2 0r 0e 0p 85176438r 71356676e 96114p
3 0r 0e 0p 12470364r 11510461e 221796p
0 0 0 0 0 0
1 1522519 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
2 1522519 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
3 1522519 0 0
0 0R 0T 0 0R 0T 0
1 0R 0T 0 0R 0T 0
2 0R 0T 0 0R 0T 0
3 0R 0T 0 0R 0T 0
0L 0O 0Y 0N 0F 0A
node 1
0 1522519 0 0
0 0r 0e 0p 0r 0e 0p
1 0r 0e 0p 0r 0e 0p
2 0r 0e 0p 0r 0e 0p
3 0r 0e 0p 0r 0e 0p
0 0 0 0 0 0
1 1522519 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
2 1522519 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
3 1522519 0 0
0 0R 0T 0 0R 0T 0
1 0R 0T 0 0R 0T 0
2 0R 0T 0 0R 0T 0
3 0R 0T 0 0R 0T 0
0L 0O 0Y 0N 0F 0A
memcg 4 /random
node 0
0 600413 0 2289676
0 0r 0e 0p 875605725r 960492874e 0p
1 0r 0e 0p 411230731r 383704269e 0p
2 0r 0e 0p 112639317r 97774351e 0p
3 0r 0e 0p 2103334r 1766407e 0p
0 0 0 0 0 0
1 600413 1 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
2 600413 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
3 600413 35 18466878
0 0R 0T 0 0R 0T 0
1 0R 0T 0 0R 0T 0
2 0R 0T 0 0R 0T 0
3 0R 0T 0 0R 0T 0
0L 0O 0Y 0N 0F 0A
node 1
0 600413 0 0
0 0r 0e 0p 0r 0e 0p
1 0r 0e 0p 0r 0e 0p
2 0r 0e 0p 0r 0e 0p
3 0r 0e 0p 0r 0e 0p
0 0 0 0 0 0
1 600413 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
2 600413 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
3 600413 0 0
0 0R 0T 0 0R 0T 0
1 0R 0T 0 0R 0T 0
2 0R 0T 0 0R 0T 0
3 0R 0T 0 0R 0T 0
0L 0O 0Y 0N 0F 0A

# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
node 0 size: 385748 MB
node 0 free: 383735 MB
node 1 cpus: 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 1 size: 387047 MB
node 1 free: 137896 MB
node distances:
node 0 1
0: 10 21
1: 21 10

# git diff
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 970bd6ff38c4..ca51cfdf34af 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -75,7 +75,7 @@ static int brd_insert_page(struct brd_device *brd, sector_t sector, gfp_t gfp)
if (page)
return 0;

- page = alloc_page(gfp | __GFP_ZERO | __GFP_HIGHMEM);
+ page = alloc_pages_node(1, gfp | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE, 0);
if (!page)
return -ENOMEM;
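
(For what it's worth, this hunk appears to pin the ramdisk's backing
pages to NUMA node 1: alloc_pages_node(1, ... | __GFP_THISNODE, 0)
allocates order-0 pages from node 1 only, while the "numactl -N 0 -m 0"
invocation below keeps fio and its page cache on node 0, presumably so
the ramdisk itself does not compete for node 0 memory.)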


# cat /proc/swaps
Filename Type Size Used Priority

# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

# fio --version
fio-3.36

# cat fio.sh
mkfs.ext4 /dev/ram0
mount /dev/ram0 /mnt
cd /sys/fs/cgroup/

mkdir zipf
cd zipf
echo 80G >memory.max
echo $$ >cgroup.procs
echo 3 >/proc/sys/vm/drop_caches
fio -name=mglru --numjobs=60 --directory=/mnt --size=4096m --buffered=1 --ioengine=io_uring --iodepth=128 --iodepth_batch_submit=32 --iodepth_batch_complete=32 --rw=randread --random_distribution=zipf:0.5 --norandommap --time_based --ramp_time=5m --runtime=5m --group_reporting

umount /mnt
mount /dev/ram0 /mnt
cd ..

mkdir random
cd random/
echo $$ >cgroup.procs
echo 80G >memory.max
echo 3 >/proc/sys/vm/drop_caches
fio -name=mglru --numjobs=60 --directory=/mnt --size=4096m --buffered=1 --ioengine=io_uring --iodepth=128 --iodepth_batch_submit=32 --iodepth_batch_complete=32 --rw=randread --random_distribution=random --norandommap --time_based --ramp_time=5m --runtime=5m --group_reporting

cat /sys/kernel/debug/lru_gen_full

# numactl -N 0 -m 0 bash fio.sh

# zcat /proc/config.gz
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 6.7.0 Kernel Configuration
#
CONFIG_CC_VERSION_TEXT="clang version google3-trunk (a91cb9ce39dc42e6a7a2c4fe97580e51eb1c2961)"
CONFIG_GCC_VERSION=0
CONFIG_CC_IS_CLANG=y
CONFIG_CLANG_VERSION=99990000
CONFIG_AS_IS_LLVM=y
CONFIG_AS_VERSION=99990000
CONFIG_LD_VERSION=0
CONFIG_LD_IS_LLD=y
CONFIG_LLD_VERSION=160666
CONFIG_CC_HAS_ASM_GOTO_OUTPUT=y
CONFIG_CC_HAS_ASM_GOTO_TIED_OUTPUT=y
CONFIG_TOOLS_SUPPORT_RELR=y
CONFIG_CC_HAS_ASM_INLINE=y
CONFIG_CC_HAS_NO_PROFILE_FN_ATTR=y
CONFIG_PAHOLE_VERSION=124
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_TABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
# CONFIG_WERROR is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_HAVE_KERNEL_ZSTD=y
# CONFIG_KERNEL_GZIP is not set
# CONFIG_KERNEL_BZIP2 is not set
CONFIG_KERNEL_LZMA=y
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
# CONFIG_KERNEL_ZSTD is not set
CONFIG_DEFAULT_INIT=""
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_WATCH_QUEUE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
# CONFIG_USELIB is not set
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_IRQ_MSI_IOMMU=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
# end of IRQ subsystem

CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_INIT=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK=y
CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y
CONFIG_CONTEXT_TRACKING=y
CONFIG_CONTEXT_TRACKING_IDLE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_CONTEXT_TRACKING_USER=y
# CONFIG_CONTEXT_TRACKING_USER_FORCE is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US=125
# end of Timers subsystem

CONFIG_BPF=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_ARCH_WANT_DEFAULT_BPF_JIT=y

#
# BPF subsystem
#
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
# CONFIG_BPF_JIT_ALWAYS_ON is not set
CONFIG_BPF_JIT_DEFAULT_ON=y
CONFIG_BPF_UNPRIV_DEFAULT_OFF=y
# CONFIG_BPF_PRELOAD is not set
CONFIG_BPF_LSM=y
# end of BPF subsystem

CONFIG_PREEMPT_NONE_BUILD=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_DYNAMIC is not set
# CONFIG_SCHED_CORE is not set

#
# CPU/Task time and stats accounting
#
CONFIG_VIRT_CPU_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_SCHED_AVG_IRQ=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
# CONFIG_TASK_DELAY_ACCT is not set
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
# CONFIG_PSI is not set
# end of CPU/Task time and stats accounting

CONFIG_CPU_ISOLATION=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_NOCB_CPU=y
# CONFIG_RCU_NOCB_CPU_DEFAULT_ALL is not set
# CONFIG_RCU_LAZY is not set
# end of RCU Subsystem

CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_IKHEADERS is not set
CONFIG_LOG_BUF_SHIFT=20
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
# CONFIG_PRINTK_INDEX is not set
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y

#
# Scheduler features
#
# end of Scheduler features

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_CC_HAS_INT128=y
CONFIG_CC_IMPLICIT_FALLTHROUGH="-Wimplicit-fallthrough"
CONFIG_GCC11_NO_ARRAY_BOUNDS=y
CONFIG_ARCH_SUPPORTS_INT128=y
# CONFIG_NUMA_BALANCING is not set
CONFIG_CGROUPS=y
CONFIG_PAGE_COUNTER=y
# CONFIG_CGROUP_FAVOR_DYNMODS is not set
CONFIG_MEMCG=y
CONFIG_MEMCG_KMEM=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_SCHED_MM_CID=y
CONFIG_CGROUP_PIDS=y
# CONFIG_CGROUP_RDMA is not set
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_HUGETLB=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_BPF=y
CONFIG_CGROUP_MISC=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_SOCK_CGROUP_DATA=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_TIME_NS=y
CONFIG_IPC_NS=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
CONFIG_CHECKPOINT_RESTORE=y
# CONFIG_SCHED_AUTOGROUP is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE="gbuild-obj/initramfs-symlink.cpio.xz"
CONFIG_INITRAMFS_ROOT_UID=0
CONFIG_INITRAMFS_ROOT_GID=0
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
CONFIG_RD_LZ4=y
CONFIG_RD_ZSTD=y
# CONFIG_INITRAMFS_COMPRESSION_GZIP is not set
# CONFIG_INITRAMFS_COMPRESSION_BZIP2 is not set
# CONFIG_INITRAMFS_COMPRESSION_LZMA is not set
CONFIG_INITRAMFS_COMPRESSION_XZ=y
# CONFIG_INITRAMFS_COMPRESSION_LZO is not set
# CONFIG_INITRAMFS_COMPRESSION_LZ4 is not set
# CONFIG_INITRAMFS_COMPRESSION_ZSTD is not set
# CONFIG_INITRAMFS_COMPRESSION_NONE is not set
# CONFIG_BOOT_CONFIG is not set
CONFIG_INITRAMFS_PRESERVE_MTIME=y
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_LD_ORPHAN_WARN=y
CONFIG_LD_ORPHAN_WARN_LEVEL="warn"
CONFIG_SYSCTL=y
CONFIG_HAVE_UID16=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_EXPERT=y
CONFIG_UID16=y
CONFIG_MULTIUSER=y
CONFIG_SGETMASK_SYSCALL=y
CONFIG_SYSFS_SYSCALL=y
CONFIG_FHANDLE=y
CONFIG_POSIX_TIMERS=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_FUTEX_PI=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_IO_URING=y
CONFIG_ADVISE_SYSCALLS=y
CONFIG_MEMBARRIER=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_SELFTEST is not set
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_ABSOLUTE_PERCPU=y
CONFIG_KALLSYMS_BASE_RELATIVE=y
CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE=y
CONFIG_KCMP=y
CONFIG_RSEQ=y
CONFIG_CACHESTAT_SYSCALL=y
# CONFIG_DEBUG_RSEQ is not set
CONFIG_HAVE_PERF_EVENTS=y
CONFIG_GUEST_PERF_EVENTS=y
# CONFIG_PC104 is not set

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
# end of Kernel Performance Events And Counters

CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y

#
# Kexec and crash features
#
CONFIG_CRASH_CORE=y
CONFIG_KEXEC_CORE=y
CONFIG_KEXEC=y
# CONFIG_KEXEC_FILE is not set
CONFIG_CRASH_DUMP=y
CONFIG_CRASH_HOTPLUG=y
CONFIG_CRASH_MAX_MEMORY_RANGES=8192
# end of Kexec and crash features
# end of General setup

CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_AUDIT_ARCH=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_DYNAMIC_PHYSICAL_MASK=y
CONFIG_PGTABLE_LEVELS=5
CONFIG_CC_HAS_SANE_STACKPROTECTOR=y

#
# Processor type and features
#
CONFIG_SMP=y
CONFIG_X86_X2APIC=y
CONFIG_X86_MPPARSE=y
# CONFIG_GOLDFISH is not set
CONFIG_X86_CPU_RESCTRL=y
CONFIG_X86_EXTENDED_PLATFORM=y
# CONFIG_X86_NUMACHIP is not set
# CONFIG_X86_VSMP is not set
# CONFIG_X86_UV is not set
# CONFIG_X86_GOLDFISH is not set
# CONFIG_X86_INTEL_MID is not set
# CONFIG_X86_INTEL_LPSS is not set
# CONFIG_X86_AMD_PLATFORM_DEVICE is not set
# CONFIG_IOSF_MBI is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
# CONFIG_HYPERVISOR_GUEST is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_IA32_FEAT_CTL=y
CONFIG_X86_VMX_FEATURE_NAMES=y
# CONFIG_PROCESSOR_SELECT is not set
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_HYGON=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_CPU_SUP_ZHAOXIN=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
# CONFIG_GART_IOMMU is not set
CONFIG_BOOT_VESA_SUPPORT=y
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS_RANGE_BEGIN=2
CONFIG_NR_CPUS_RANGE_END=512
CONFIG_NR_CPUS_DEFAULT=64
CONFIG_NR_CPUS=512
CONFIG_SCHED_CLUSTER=y
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_SCHED_MC_PRIO is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
CONFIG_X86_MCE=y
CONFIG_X86_MCELOG_LEGACY=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
CONFIG_X86_MCE_INJECT=m

#
# Performance monitoring
#
CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_PERF_EVENTS_INTEL_RAPL=y
CONFIG_PERF_EVENTS_INTEL_CSTATE=y
# CONFIG_PERF_EVENTS_AMD_POWER is not set
CONFIG_PERF_EVENTS_AMD_UNCORE=y
# CONFIG_PERF_EVENTS_AMD_BRS is not set
# end of Performance monitoring

CONFIG_X86_16BIT=y
CONFIG_X86_ESPFIX64=y
CONFIG_X86_VSYSCALL_EMULATION=y
CONFIG_X86_IOPL_IOPERM=y
CONFIG_MICROCODE=y
# CONFIG_MICROCODE_LATE_LOADING is not set
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_X86_5LEVEL=y
CONFIG_X86_DIRECT_GBPAGES=y
# CONFIG_X86_CPA_STATISTICS is not set
CONFIG_X86_MEM_ENCRYPT=y
CONFIG_AMD_MEM_ENCRYPT=y
# CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT is not set
CONFIG_NUMA=y
CONFIG_AMD_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=4
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
# CONFIG_X86_PMEM_LEGACY is not set
# CONFIG_X86_CHECK_BIOS_CORRUPTION is not set
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_X86_UMIP=y
CONFIG_CC_HAS_IBT=y
CONFIG_X86_CET=y
CONFIG_X86_KERNEL_IBT=y
CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS=y
# CONFIG_X86_INTEL_TSX_MODE_OFF is not set
CONFIG_X86_INTEL_TSX_MODE_ON=y
# CONFIG_X86_INTEL_TSX_MODE_AUTO is not set
# CONFIG_X86_SGX is not set
# CONFIG_X86_USER_SHADOW_STACK is not set
# CONFIG_INTEL_TDX_HOST is not set
CONFIG_EFI=y
CONFIG_EFI_STUB=y
CONFIG_EFI_HANDOVER_PROTOCOL=y
# CONFIG_EFI_MIXED is not set
# CONFIG_EFI_FAKE_MEMMAP is not set
CONFIG_EFI_RUNTIME_MAP=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_SCHED_HRTICK=y
CONFIG_ARCH_SUPPORTS_KEXEC=y
CONFIG_ARCH_SUPPORTS_KEXEC_FILE=y
CONFIG_ARCH_SUPPORTS_KEXEC_SIG=y
CONFIG_ARCH_SUPPORTS_KEXEC_SIG_FORCE=y
CONFIG_ARCH_SUPPORTS_KEXEC_BZIMAGE_VERIFY_SIG=y
CONFIG_ARCH_SUPPORTS_KEXEC_JUMP=y
CONFIG_ARCH_SUPPORTS_CRASH_DUMP=y
CONFIG_ARCH_SUPPORTS_CRASH_HOTPLUG=y
CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
CONFIG_RANDOMIZE_BASE=y
CONFIG_X86_NEED_RELOCS=y
CONFIG_PHYSICAL_ALIGN=0x200000
CONFIG_DYNAMIC_MEMORY_LAYOUT=y
CONFIG_RANDOMIZE_MEMORY=y
CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING=0x0
# CONFIG_ADDRESS_MASKING is not set
CONFIG_HOTPLUG_CPU=y
# CONFIG_COMPAT_VDSO is not set
CONFIG_LEGACY_VSYSCALL_XONLY=y
# CONFIG_LEGACY_VSYSCALL_NONE is not set
CONFIG_CMDLINE_BOOL=y
CONFIG_CMDLINE="oops=panic panic=10 io_delay=0xed libata.allow_tpm=1 nmi_watchdog=panic tco_start=1 quiet slab_nomerge fb_tunnels=none mce=print_all msr.allow_writes=on acpi_enforce_resources=lax video=efifb:off hest_disable=1 erst_disable=1 bert_disable=1 retbleed=off eagerfpu=on kvm_amd.nested=0"
# CONFIG_CMDLINE_OVERRIDE is not set
CONFIG_MODIFY_LDT_SYSCALL=y
# CONFIG_STRICT_SIGALTSTACK_SIZE is not set
CONFIG_HAVE_LIVEPATCH=y
# CONFIG_LIVEPATCH is not set
# end of Processor type and features

CONFIG_CC_HAS_SLS=y
CONFIG_CC_HAS_RETURN_THUNK=y
CONFIG_CC_HAS_ENTRY_PADDING=y
CONFIG_FUNCTION_PADDING_CFI=11
CONFIG_FUNCTION_PADDING_BYTES=16
CONFIG_SPECULATION_MITIGATIONS=y
CONFIG_PAGE_TABLE_ISOLATION=y
# CONFIG_RETPOLINE is not set
CONFIG_CPU_IBPB_ENTRY=y
CONFIG_CPU_IBRS_ENTRY=y
CONFIG_SLS=y
# CONFIG_GDS_FORCE_MITIGATION is not set
CONFIG_ARCH_HAS_ADD_PAGES=y

#
# Power management and ACPI options
#
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
# CONFIG_SUSPEND_SKIP_SYNC is not set
# CONFIG_HIBERNATION is not set
CONFIG_PM_SLEEP=y
CONFIG_PM_SLEEP_SMP=y
# CONFIG_PM_AUTOSLEEP is not set
# CONFIG_PM_USERSPACE_AUTOSLEEP is not set
# CONFIG_PM_WAKELOCKS is not set
CONFIG_PM=y
CONFIG_PM_DEBUG=y
# CONFIG_PM_ADVANCED_DEBUG is not set
# CONFIG_PM_TEST_SUSPEND is not set
CONFIG_PM_SLEEP_DEBUG=y
# CONFIG_DPM_WATCHDOG is not set
CONFIG_PM_TRACE=y
CONFIG_PM_TRACE_RTC=y
# CONFIG_WQ_POWER_EFFICIENT_DEFAULT is not set
# CONFIG_ENERGY_MODEL is not set
CONFIG_ARCH_SUPPORTS_ACPI=y
CONFIG_ACPI=y
CONFIG_ACPI_LEGACY_TABLES_LOOKUP=y
CONFIG_ARCH_MIGHT_HAVE_ACPI_PDC=y
CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT=y
# CONFIG_ACPI_DEBUGGER is not set
CONFIG_ACPI_SPCR_TABLE=y
# CONFIG_ACPI_FPDT is not set
CONFIG_ACPI_LPIT=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_REV_OVERRIDE_POSSIBLE=y
# CONFIG_ACPI_EC_DEBUGFS is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_FAN=y
# CONFIG_ACPI_TAD is not set
# CONFIG_ACPI_DOCK is not set
CONFIG_ACPI_CPU_FREQ_PSS=y
CONFIG_ACPI_PROCESSOR_CSTATE=y
CONFIG_ACPI_PROCESSOR_IDLE=y
CONFIG_ACPI_PROCESSOR=y
# CONFIG_ACPI_IPMI is not set
CONFIG_ACPI_HOTPLUG_CPU=y
# CONFIG_ACPI_PROCESSOR_AGGREGATOR is not set
CONFIG_ACPI_THERMAL=m
CONFIG_ARCH_HAS_ACPI_TABLE_UPGRADE=y
CONFIG_ACPI_TABLE_UPGRADE=y
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_PCI_SLOT is not set
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_HOTPLUG_IOAPIC=y
# CONFIG_ACPI_SBS is not set
# CONFIG_ACPI_HED is not set
# CONFIG_ACPI_CUSTOM_METHOD is not set
# CONFIG_ACPI_BGRT is not set
# CONFIG_ACPI_REDUCED_HARDWARE_ONLY is not set
CONFIG_ACPI_NFIT=y
# CONFIG_NFIT_SECURITY_DEBUG is not set
CONFIG_ACPI_NUMA=y
# CONFIG_ACPI_HMAT is not set
CONFIG_HAVE_ACPI_APEI=y
CONFIG_HAVE_ACPI_APEI_NMI=y
CONFIG_ACPI_APEI=y
# CONFIG_ACPI_APEI_GHES is not set
# CONFIG_ACPI_APEI_MEMORY_FAILURE is not set
CONFIG_ACPI_APEI_EINJ=m
# CONFIG_ACPI_APEI_ERST_DEBUG is not set
# CONFIG_ACPI_DPTF is not set
# CONFIG_ACPI_CONFIGFS is not set
# CONFIG_ACPI_PFRUT is not set
# CONFIG_ACPI_FFH is not set
# CONFIG_PMIC_OPREGION is not set
CONFIG_ACPI_PRMT=y
CONFIG_X86_PM_TIMER=y

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_ATTR_SET=y
CONFIG_CPU_FREQ_GOV_COMMON=y
CONFIG_CPU_FREQ_STAT=y
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=m
CONFIG_CPU_FREQ_GOV_ONDEMAND=m
# CONFIG_CPU_FREQ_GOV_CONSERVATIVE is not set
# CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set

#
# CPU frequency scaling drivers
#
# CONFIG_X86_INTEL_PSTATE is not set
# CONFIG_X86_PCC_CPUFREQ is not set
# CONFIG_X86_AMD_PSTATE is not set
# CONFIG_X86_AMD_PSTATE_UT is not set
CONFIG_X86_ACPI_CPUFREQ=m
CONFIG_X86_ACPI_CPUFREQ_CPB=y
CONFIG_X86_POWERNOW_K8=m
# CONFIG_X86_AMD_FREQ_SENSITIVITY is not set
CONFIG_X86_SPEEDSTEP_CENTRINO=m
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#
# end of CPU Frequency scaling

#
# CPU Idle
#
CONFIG_CPU_IDLE=y
# CONFIG_CPU_IDLE_GOV_LADDER is not set
CONFIG_CPU_IDLE_GOV_MENU=y
# CONFIG_CPU_IDLE_GOV_TEO is not set
# end of CPU Idle

CONFIG_INTEL_IDLE=y
# end of Power management and ACPI options

#
# Bus options (PCI etc.)
#
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_MMCONF_FAM10H=y
# CONFIG_PCI_CNB20LE_QUIRK is not set
# CONFIG_ISA_BUS is not set
# CONFIG_ISA_DMA_API is not set
CONFIG_AMD_NB=y
# end of Bus options (PCI etc.)

#
# Binary Emulations
#
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_EMULATION_DEFAULT_DISABLED is not set
CONFIG_COMPAT_32=y
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
# end of Binary Emulations

CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_PFNCACHE=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_IRQFD=y
CONFIG_HAVE_KVM_IRQ_ROUTING=y
CONFIG_HAVE_KVM_DIRTY_RING=y
CONFIG_HAVE_KVM_DIRTY_RING_TSO=y
CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_MMIO=y
CONFIG_KVM_ASYNC_PF=y
CONFIG_HAVE_KVM_MSI=y
CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT=y
CONFIG_KVM_VFIO=y
CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT=y
CONFIG_KVM_COMPAT=y
CONFIG_HAVE_KVM_IRQ_BYPASS=y
CONFIG_HAVE_KVM_NO_POLL=y
CONFIG_KVM_XFER_TO_GUEST_WORK=y
CONFIG_HAVE_KVM_PM_NOTIFIER=y
CONFIG_KVM_GENERIC_HARDWARE_ENABLING=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=y
CONFIG_KVM_WERROR=y
CONFIG_KVM_INTEL=y
CONFIG_KVM_AMD=y
CONFIG_KVM_AMD_SEV=y
CONFIG_KVM_SMM=y
# CONFIG_KVM_XEN is not set
# CONFIG_KVM_PROVE_MMU is not set
CONFIG_KVM_MAX_NR_VCPUS=1024
CONFIG_AS_AVX512=y
CONFIG_AS_SHA1_NI=y
CONFIG_AS_SHA256_NI=y
CONFIG_AS_TPAUSE=y
CONFIG_AS_GFNI=y
CONFIG_AS_WRUSS=y

#
# General architecture-dependent options
#
CONFIG_HOTPLUG_SMT=y
CONFIG_HOTPLUG_CORE_SYNC=y
CONFIG_HOTPLUG_CORE_SYNC_DEAD=y
CONFIG_HOTPLUG_CORE_SYNC_FULL=y
CONFIG_HOTPLUG_SPLIT_STARTUP=y
CONFIG_HOTPLUG_PARALLEL=y
CONFIG_GENERIC_ENTRY=y
CONFIG_KPROBES=y
CONFIG_JUMP_LABEL=y
# CONFIG_STATIC_KEYS_SELFTEST is not set
# CONFIG_STATIC_CALL_SELFTEST is not set
CONFIG_OPTPROBES=y
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_UPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_ARCH_USE_BUILTIN_BSWAP=y
CONFIG_KRETPROBES=y
CONFIG_KRETPROBE_ON_RETHOOK=y
CONFIG_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_ARCH_CORRECT_STACKTRACE_ON_KRETPROBE=y
CONFIG_HAVE_FUNCTION_ERROR_INJECTION=y
CONFIG_HAVE_NMI=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_TRACE_IRQFLAGS_NMI_SUPPORT=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_ARCH_HAS_FORTIFY_SOURCE=y
CONFIG_ARCH_HAS_SET_MEMORY=y
CONFIG_ARCH_HAS_SET_DIRECT_MAP=y
CONFIG_ARCH_HAS_CPU_FINALIZE_INIT=y
CONFIG_HAVE_ARCH_THREAD_STRUCT_WHITELIST=y
CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT=y
CONFIG_ARCH_WANTS_NO_INSTR=y
CONFIG_HAVE_ASM_MODVERSIONS=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_RSEQ=y
CONFIG_HAVE_RUST=y
CONFIG_HAVE_FUNCTION_ARG_ACCESS_API=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_HAVE_ARCH_JUMP_LABEL_RELATIVE=y
CONFIG_MMU_GATHER_MERGE_VMAS=y
CONFIG_MMU_LAZY_TLB_REFCOUNT=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_ARCH_HAS_NMI_SAFE_THIS_CPU_OPS=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_HAVE_ARCH_SECCOMP=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP=y
CONFIG_SECCOMP_FILTER=y
# CONFIG_SECCOMP_CACHE_DEBUG is not set
CONFIG_HAVE_ARCH_STACKLEAK=y
CONFIG_HAVE_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR_STRONG=y
CONFIG_ARCH_SUPPORTS_LTO_CLANG=y
CONFIG_ARCH_SUPPORTS_LTO_CLANG_THIN=y
CONFIG_HAS_LTO_CLANG=y
CONFIG_LTO_NONE=y
# CONFIG_LTO_CLANG_FULL is not set
# CONFIG_LTO_CLANG_THIN is not set
CONFIG_ARCH_SUPPORTS_CFI_CLANG=y
# CONFIG_CFI_CLANG is not set
CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES=y
CONFIG_HAVE_CONTEXT_TRACKING_USER=y
CONFIG_HAVE_CONTEXT_TRACKING_USER_OFFSTACK=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_MOVE_PUD=y
CONFIG_HAVE_MOVE_PMD=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=y
CONFIG_HAVE_ARCH_HUGE_VMAP=y
CONFIG_HAVE_ARCH_HUGE_VMALLOC=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_PMD_MKWRITE=y
CONFIG_HAVE_ARCH_SOFT_DIRTY=y
CONFIG_HAVE_MOD_ARCH_SPECIFIC=y
CONFIG_MODULES_USE_ELF_RELA=y
CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_HAVE_SOFTIRQ_ON_OWN_STACK=y
CONFIG_SOFTIRQ_ON_OWN_STACK=y
CONFIG_ARCH_HAS_ELF_RANDOMIZE=y
CONFIG_HAVE_ARCH_MMAP_RND_BITS=y
CONFIG_HAVE_EXIT_THREAD=y
CONFIG_ARCH_MMAP_RND_BITS=28
CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS=y
CONFIG_ARCH_MMAP_RND_COMPAT_BITS=8
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES=y
CONFIG_PAGE_SIZE_LESS_THAN_64KB=y
CONFIG_PAGE_SIZE_LESS_THAN_256KB=y
CONFIG_HAVE_OBJTOOL=y
CONFIG_HAVE_JUMP_LABEL_HACK=y
CONFIG_HAVE_NOINSTR_HACK=y
CONFIG_HAVE_NOINSTR_VALIDATION=y
CONFIG_HAVE_UACCESS_VALIDATION=y
CONFIG_HAVE_STACK_VALIDATION=y
CONFIG_OLD_SIGSUSPEND3=y
CONFIG_COMPAT_OLD_SIGACTION=y
CONFIG_COMPAT_32BIT_TIME=y
CONFIG_HAVE_ARCH_VMAP_STACK=y
# CONFIG_VMAP_STACK is not set
CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET=y
CONFIG_RANDOMIZE_KSTACK_OFFSET=y
# CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT is not set
CONFIG_ARCH_HAS_STRICT_KERNEL_RWX=y
CONFIG_STRICT_KERNEL_RWX=y
CONFIG_ARCH_HAS_STRICT_MODULE_RWX=y
CONFIG_STRICT_MODULE_RWX=y
CONFIG_HAVE_ARCH_PREL32_RELOCATIONS=y
CONFIG_ARCH_USE_MEMREMAP_PROT=y
# CONFIG_LOCK_EVENT_COUNTS is not set
CONFIG_ARCH_HAS_MEM_ENCRYPT=y
CONFIG_ARCH_HAS_CC_PLATFORM=y
CONFIG_HAVE_STATIC_CALL=y
CONFIG_HAVE_STATIC_CALL_INLINE=y
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
CONFIG_ARCH_WANT_LD_ORPHAN_WARN=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_ARCH_SUPPORTS_PAGE_TABLE_CHECK=y
CONFIG_ARCH_HAS_ELFCORE_COMPAT=y
CONFIG_ARCH_HAS_PARANOID_L1D_FLUSH=y
CONFIG_DYNAMIC_SIGFRAME=y
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
CONFIG_ARCH_HAS_GCOV_PROFILE_ALL=y
# end of GCOV-based kernel profiling

CONFIG_HAVE_GCC_PLUGINS=y
CONFIG_FUNCTION_ALIGNMENT_4B=y
CONFIG_FUNCTION_ALIGNMENT_16B=y
CONFIG_FUNCTION_ALIGNMENT=16
# end of General architecture-dependent options

CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_DEBUG is not set
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
# CONFIG_MODULE_UNLOAD_TAINT_TRACKING is not set
CONFIG_MODVERSIONS=y
CONFIG_ASM_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
# CONFIG_MODULE_SIG is not set
CONFIG_MODULE_COMPRESS_NONE=y
# CONFIG_MODULE_COMPRESS_GZIP is not set
# CONFIG_MODULE_COMPRESS_XZ is not set
# CONFIG_MODULE_COMPRESS_ZSTD is not set
# CONFIG_MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS is not set
CONFIG_MODPROBE_PATH="/sbin/modprobe"
# CONFIG_TRIM_UNUSED_KSYMS is not set
CONFIG_MODULES_TREE_LOOKUP=y
CONFIG_BLOCK=y
CONFIG_BLOCK_LEGACY_AUTOLOAD=y
CONFIG_BLK_CGROUP_RWSTAT=y
CONFIG_BLK_DEV_BSG_COMMON=y
CONFIG_BLK_ICQ=y
CONFIG_BLK_DEV_BSGLIB=y
# CONFIG_BLK_DEV_INTEGRITY is not set
# CONFIG_BLK_DEV_ZONED is not set
CONFIG_BLK_DEV_THROTTLING=y
# CONFIG_BLK_DEV_THROTTLING_LOW is not set
# CONFIG_BLK_WBT is not set
# CONFIG_BLK_CGROUP_IOLATENCY is not set
# CONFIG_BLK_CGROUP_IOCOST is not set
# CONFIG_BLK_CGROUP_IOPRIO is not set
CONFIG_BLK_DEBUG_FS=y
CONFIG_BLK_SED_OPAL=y
# CONFIG_BLK_INLINE_ENCRYPTION is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
# CONFIG_AIX_PARTITION is not set
# CONFIG_OSF_PARTITION is not set
# CONFIG_AMIGA_PARTITION is not set
# CONFIG_ATARI_PARTITION is not set
# CONFIG_MAC_PARTITION is not set
CONFIG_MSDOS_PARTITION=y
# CONFIG_BSD_DISKLABEL is not set
# CONFIG_MINIX_SUBPARTITION is not set
# CONFIG_SOLARIS_X86_PARTITION is not set
# CONFIG_UNIXWARE_DISKLABEL is not set
# CONFIG_LDM_PARTITION is not set
# CONFIG_SGI_PARTITION is not set
# CONFIG_ULTRIX_PARTITION is not set
# CONFIG_SUN_PARTITION is not set
# CONFIG_KARMA_PARTITION is not set
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
# CONFIG_CMDLINE_PARTITION is not set
# end of Partition Types

CONFIG_BLK_MQ_PCI=y
CONFIG_BLK_MQ_VIRTIO=y
CONFIG_BLK_PM=y
CONFIG_BLOCK_HOLDER_DEPRECATED=y
CONFIG_BLK_MQ_STACKING=y

#
# IO Schedulers
#
CONFIG_MQ_IOSCHED_DEADLINE=y
CONFIG_MQ_IOSCHED_KYBER=y
CONFIG_IOSCHED_BFQ=y
CONFIG_BFQ_GROUP_IOSCHED=y
CONFIG_BFQ_CGROUP_DEBUG=y
# end of IO Schedulers

CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_ASN1=m
CONFIG_INLINE_SPIN_UNLOCK_IRQ=y
CONFIG_INLINE_READ_UNLOCK=y
CONFIG_INLINE_READ_UNLOCK_IRQ=y
CONFIG_INLINE_WRITE_UNLOCK=y
CONFIG_INLINE_WRITE_UNLOCK_IRQ=y
CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
CONFIG_MUTEX_SPIN_ON_OWNER=y
CONFIG_RWSEM_SPIN_ON_OWNER=y
CONFIG_LOCK_SPIN_ON_OWNER=y
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
CONFIG_QUEUED_RWLOCKS=y
CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE=y
CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE=y
CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y
CONFIG_FREEZER=y

#
# Executable file formats
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_ELFCORE=y
CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
CONFIG_BINFMT_SCRIPT=y
CONFIG_BINFMT_MISC=y
CONFIG_COREDUMP=y
# end of Executable file formats

#
# Memory Management options
#
CONFIG_ZPOOL=y
CONFIG_SWAP=y
CONFIG_ZSWAP=y
# CONFIG_ZSWAP_DEFAULT_ON is not set
# CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_DEFLATE is not set
CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZO=y
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_842 is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZ4 is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZ4HC is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_ZSTD is not set
CONFIG_ZSWAP_COMPRESSOR_DEFAULT="lzo"
# CONFIG_ZSWAP_ZPOOL_DEFAULT_ZBUD is not set
# CONFIG_ZSWAP_ZPOOL_DEFAULT_Z3FOLD is not set
CONFIG_ZSWAP_ZPOOL_DEFAULT_ZSMALLOC=y
CONFIG_ZSWAP_ZPOOL_DEFAULT="zsmalloc"
# CONFIG_ZBUD is not set
# CONFIG_Z3FOLD is not set
CONFIG_ZSMALLOC=y
# CONFIG_ZSMALLOC_STAT is not set
CONFIG_ZSMALLOC_CHAIN_SIZE=8

#
# SLAB allocator options
#
CONFIG_SLAB_DEPRECATED=y
# CONFIG_SLUB is not set
CONFIG_SLAB=y
CONFIG_SLAB_MERGE_DEFAULT=y
CONFIG_SLAB_FREELIST_RANDOM=y
# CONFIG_SLAB_FREELIST_HARDENED is not set
# end of SLAB allocator options

CONFIG_SHUFFLE_PAGE_ALLOCATOR=y
# CONFIG_COMPAT_BRK is not set
CONFIG_SPARSEMEM=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP=y
CONFIG_ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP=y
CONFIG_HAVE_FAST_GUP=y
CONFIG_MEMORY_ISOLATION=y
CONFIG_EXCLUSIVE_SYSTEM_RAM=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK=y
CONFIG_MEMORY_BALLOON=y
CONFIG_BALLOON_COMPACTION=y
CONFIG_COMPACTION=y
CONFIG_COMPACT_UNEVICTABLE_DEFAULT=1
CONFIG_PAGE_REPORTING=y
CONFIG_MIGRATION=y
CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION=y
CONFIG_ARCH_ENABLE_THP_MIGRATION=y
CONFIG_CONTIG_ALLOC=y
CONFIG_PCP_BATCH_SCALE_MAX=5
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_MMU_NOTIFIER=y
# CONFIG_KSM is not set
CONFIG_DEFAULT_MMAP_MIN_ADDR=65536
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
CONFIG_MEMORY_FAILURE=y
# CONFIG_HWPOISON_INJECT is not set
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ARCH_WANTS_THP_SWAP=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
CONFIG_THP_SWAP=y
CONFIG_READ_ONLY_THP_FOR_FS=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
# CONFIG_CMA is not set
CONFIG_MEM_SOFT_DIRTY=y
CONFIG_GENERIC_EARLY_IOREMAP=y
# CONFIG_DEFERRED_STRUCT_PAGE_INIT is not set
CONFIG_PAGE_IDLE_FLAG=y
CONFIG_IDLE_PAGE_TRACKING=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_ARCH_HAS_CURRENT_STACK_POINTER=y
CONFIG_ARCH_HAS_PTE_DEVMAP=y
CONFIG_ARCH_HAS_ZONE_DMA_SET=y
# CONFIG_ZONE_DMA is not set
CONFIG_ZONE_DMA32=y
CONFIG_HMM_MIRROR=y
CONFIG_ARCH_USES_HIGH_VMA_FLAGS=y
CONFIG_ARCH_HAS_PKEYS=y
CONFIG_VM_EVENT_COUNTERS=y
# CONFIG_PERCPU_STATS is not set
# CONFIG_GUP_TEST is not set
# CONFIG_DMAPOOL_TEST is not set
CONFIG_ARCH_HAS_PTE_SPECIAL=y
CONFIG_MEMFD_CREATE=y
CONFIG_SECRETMEM=y
CONFIG_ANON_VMA_NAME=y
CONFIG_HAVE_ARCH_USERFAULTFD_WP=y
CONFIG_HAVE_ARCH_USERFAULTFD_MINOR=y
CONFIG_USERFAULTFD=y
CONFIG_PTE_MARKER_UFFD_WP=y
CONFIG_LRU_GEN=y
CONFIG_LRU_GEN_ENABLED=y
# CONFIG_LRU_GEN_STATS is not set
CONFIG_ARCH_SUPPORTS_PER_VMA_LOCK=y
CONFIG_PER_VMA_LOCK=y
CONFIG_LOCK_MM_AND_FIND_VMA=y

#
# Data Access Monitoring
#
# CONFIG_DAMON is not set
# end of Data Access Monitoring
# end of Memory Management options

CONFIG_NET=y
CONFIG_NET_INGRESS=y
CONFIG_NET_EGRESS=y
CONFIG_NET_XGRESS=y
CONFIG_NET_REDIRECT=y
CONFIG_SKB_EXTENSIONS=y

#
# Networking options
#
CONFIG_PACKET=y
CONFIG_PACKET_DIAG=y
CONFIG_UNIX=y
CONFIG_UNIX_SCM=y
CONFIG_AF_UNIX_OOB=y
CONFIG_UNIX_DIAG=y
# CONFIG_TLS is not set
# CONFIG_XFRM_USER is not set
# CONFIG_NET_KEY is not set
# CONFIG_SMC is not set
CONFIG_XDP_SOCKETS=y
# CONFIG_XDP_SOCKETS_DIAG is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
# CONFIG_IP_FIB_TRIE_STATS is not set
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
CONFIG_IP_ROUTE_CLASSID=y
# CONFIG_IP_PNP is not set
CONFIG_NET_IPIP=y
CONFIG_NET_IPGRE_DEMUX=y
CONFIG_NET_IP_TUNNEL=y
CONFIG_NET_IPGRE=y
# CONFIG_NET_IPGRE_BROADCAST is not set
# CONFIG_IP_MROUTE is not set
CONFIG_SYN_COOKIES=y
# CONFIG_NET_IPVTI is not set
CONFIG_NET_UDP_TUNNEL=y
CONFIG_NET_FOU=y
CONFIG_NET_FOU_IP_TUNNELS=y
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
CONFIG_INET_TABLE_PERTURB_ORDER=16
CONFIG_INET_TUNNEL=y
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
CONFIG_INET_UDP_DIAG=y
# CONFIG_INET_RAW_DIAG is not set
CONFIG_INET_DIAG_DESTROY=y
CONFIG_TCP_CONG_ADVANCED=y
# CONFIG_TCP_CONG_BIC is not set
CONFIG_TCP_CONG_CUBIC=y
# CONFIG_TCP_CONG_WESTWOOD is not set
# CONFIG_TCP_CONG_HTCP is not set
# CONFIG_TCP_CONG_HSTCP is not set
# CONFIG_TCP_CONG_HYBLA is not set
# CONFIG_TCP_CONG_VEGAS is not set
# CONFIG_TCP_CONG_NV is not set
# CONFIG_TCP_CONG_SCALABLE is not set
# CONFIG_TCP_CONG_LP is not set
# CONFIG_TCP_CONG_VENO is not set
# CONFIG_TCP_CONG_YEAH is not set
# CONFIG_TCP_CONG_ILLINOIS is not set
CONFIG_TCP_CONG_DCTCP=m
# CONFIG_TCP_CONG_CDG is not set
CONFIG_TCP_CONG_BBR=y
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_BBR is not set
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_TCP_SIGPOOL=y
# CONFIG_TCP_AO is not set
CONFIG_TCP_MD5SIG=y
CONFIG_IPV6=y
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
# CONFIG_INET6_AH is not set
# CONFIG_INET6_ESP is not set
# CONFIG_INET6_IPCOMP is not set
# CONFIG_IPV6_MIP6 is not set
# CONFIG_IPV6_ILA is not set
CONFIG_INET6_TUNNEL=y
# CONFIG_IPV6_VTI is not set
CONFIG_IPV6_SIT=y
CONFIG_IPV6_SIT_6RD=y
CONFIG_IPV6_NDISC_NODETYPE=y
CONFIG_IPV6_TUNNEL=y
CONFIG_IPV6_GRE=y
CONFIG_IPV6_FOU=y
CONFIG_IPV6_FOU_TUNNEL=y
CONFIG_IPV6_MULTIPLE_TABLES=y
# CONFIG_IPV6_SUBTREES is not set
# CONFIG_IPV6_MROUTE is not set
# CONFIG_IPV6_SEG6_LWTUNNEL is not set
# CONFIG_IPV6_SEG6_HMAC is not set
# CONFIG_IPV6_RPL_LWTUNNEL is not set
# CONFIG_IPV6_IOAM6_LWTUNNEL is not set
# CONFIG_NETLABEL is not set
# CONFIG_MPTCP is not set
# CONFIG_NETWORK_SECMARK is not set
CONFIG_NET_PTP_CLASSIFY=y
# CONFIG_NETWORK_PHY_TIMESTAMPING is not set
CONFIG_NETFILTER=y
CONFIG_NETFILTER_ADVANCED=y
CONFIG_BRIDGE_NETFILTER=m

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_INGRESS=y
CONFIG_NETFILTER_EGRESS=y
CONFIG_NETFILTER_SKIP_EGRESS=y
CONFIG_NETFILTER_NETLINK=m
CONFIG_NETFILTER_FAMILY_BRIDGE=y
CONFIG_NETFILTER_BPF_LINK=y
# CONFIG_NETFILTER_NETLINK_ACCT is not set
CONFIG_NETFILTER_NETLINK_QUEUE=m
CONFIG_NETFILTER_NETLINK_LOG=m
# CONFIG_NETFILTER_NETLINK_OSF is not set
CONFIG_NF_CONNTRACK=m
CONFIG_NF_LOG_SYSLOG=m
CONFIG_NETFILTER_CONNCOUNT=m
CONFIG_NF_CONNTRACK_MARK=y
# CONFIG_NF_CONNTRACK_ZONES is not set
# CONFIG_NF_CONNTRACK_PROCFS is not set
CONFIG_NF_CONNTRACK_EVENTS=y
# CONFIG_NF_CONNTRACK_TIMEOUT is not set
# CONFIG_NF_CONNTRACK_TIMESTAMP is not set
# CONFIG_NF_CONNTRACK_LABELS is not set
CONFIG_NF_CT_PROTO_DCCP=y
CONFIG_NF_CT_PROTO_SCTP=y
CONFIG_NF_CT_PROTO_UDPLITE=y
# CONFIG_NF_CONNTRACK_AMANDA is not set
CONFIG_NF_CONNTRACK_FTP=m
# CONFIG_NF_CONNTRACK_H323 is not set
# CONFIG_NF_CONNTRACK_IRC is not set
# CONFIG_NF_CONNTRACK_NETBIOS_NS is not set
# CONFIG_NF_CONNTRACK_SNMP is not set
# CONFIG_NF_CONNTRACK_PPTP is not set
# CONFIG_NF_CONNTRACK_SANE is not set
# CONFIG_NF_CONNTRACK_SIP is not set
# CONFIG_NF_CONNTRACK_TFTP is not set
CONFIG_NF_CT_NETLINK=m
# CONFIG_NETFILTER_NETLINK_GLUE_CT is not set
CONFIG_NF_NAT=m
CONFIG_NF_NAT_FTP=m
CONFIG_NF_NAT_MASQUERADE=y
# CONFIG_NF_TABLES is not set
CONFIG_NETFILTER_XTABLES=y
# CONFIG_NETFILTER_XTABLES_COMPAT is not set

#
# Xtables combined modules
#
CONFIG_NETFILTER_XT_MARK=m
CONFIG_NETFILTER_XT_CONNMARK=m
CONFIG_NETFILTER_XT_SET=m

#
# Xtables targets
#
# CONFIG_NETFILTER_XT_TARGET_AUDIT is not set
# CONFIG_NETFILTER_XT_TARGET_CHECKSUM is not set
CONFIG_NETFILTER_XT_TARGET_CLASSIFY=m
CONFIG_NETFILTER_XT_TARGET_CONNMARK=m
CONFIG_NETFILTER_XT_TARGET_CT=m
CONFIG_NETFILTER_XT_TARGET_DSCP=m
CONFIG_NETFILTER_XT_TARGET_HL=m
# CONFIG_NETFILTER_XT_TARGET_HMARK is not set
# CONFIG_NETFILTER_XT_TARGET_IDLETIMER is not set
CONFIG_NETFILTER_XT_TARGET_LOG=m
CONFIG_NETFILTER_XT_TARGET_MARK=m
CONFIG_NETFILTER_XT_NAT=m
# CONFIG_NETFILTER_XT_TARGET_NETMAP is not set
CONFIG_NETFILTER_XT_TARGET_NFLOG=m
CONFIG_NETFILTER_XT_TARGET_NFQUEUE=m
CONFIG_NETFILTER_XT_TARGET_NOTRACK=m
CONFIG_NETFILTER_XT_TARGET_RATEEST=m
# CONFIG_NETFILTER_XT_TARGET_REDIRECT is not set
CONFIG_NETFILTER_XT_TARGET_MASQUERADE=m
# CONFIG_NETFILTER_XT_TARGET_TEE is not set
CONFIG_NETFILTER_XT_TARGET_TPROXY=y
CONFIG_NETFILTER_XT_TARGET_TRACE=m
CONFIG_NETFILTER_XT_TARGET_TCPMSS=y
CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP=m

#
# Xtables matches
#
CONFIG_NETFILTER_XT_MATCH_ADDRTYPE=m
# CONFIG_NETFILTER_XT_MATCH_BPF is not set
# CONFIG_NETFILTER_XT_MATCH_CGROUP is not set
# CONFIG_NETFILTER_XT_MATCH_CLUSTER is not set
CONFIG_NETFILTER_XT_MATCH_COMMENT=m
CONFIG_NETFILTER_XT_MATCH_CONNBYTES=m
# CONFIG_NETFILTER_XT_MATCH_CONNLABEL is not set
CONFIG_NETFILTER_XT_MATCH_CONNLIMIT=m
CONFIG_NETFILTER_XT_MATCH_CONNMARK=m
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=m
# CONFIG_NETFILTER_XT_MATCH_CPU is not set
CONFIG_NETFILTER_XT_MATCH_DCCP=m
# CONFIG_NETFILTER_XT_MATCH_DEVGROUP is not set
CONFIG_NETFILTER_XT_MATCH_DSCP=m
CONFIG_NETFILTER_XT_MATCH_ECN=m
CONFIG_NETFILTER_XT_MATCH_ESP=m
CONFIG_NETFILTER_XT_MATCH_HASHLIMIT=m
CONFIG_NETFILTER_XT_MATCH_HELPER=m
CONFIG_NETFILTER_XT_MATCH_HL=m
# CONFIG_NETFILTER_XT_MATCH_IPCOMP is not set
CONFIG_NETFILTER_XT_MATCH_IPRANGE=m
# CONFIG_NETFILTER_XT_MATCH_L2TP is not set
CONFIG_NETFILTER_XT_MATCH_LENGTH=m
# CONFIG_NETFILTER_XT_MATCH_LIMIT is not set
CONFIG_NETFILTER_XT_MATCH_MAC=m
CONFIG_NETFILTER_XT_MATCH_MARK=m
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=y
# CONFIG_NETFILTER_XT_MATCH_NFACCT is not set
# CONFIG_NETFILTER_XT_MATCH_OSF is not set
CONFIG_NETFILTER_XT_MATCH_OWNER=m
# CONFIG_NETFILTER_XT_MATCH_PHYSDEV is not set
CONFIG_NETFILTER_XT_MATCH_PKTTYPE=m
CONFIG_NETFILTER_XT_MATCH_QUOTA=m
CONFIG_NETFILTER_XT_MATCH_RATEEST=m
CONFIG_NETFILTER_XT_MATCH_REALM=m
CONFIG_NETFILTER_XT_MATCH_RECENT=m
CONFIG_NETFILTER_XT_MATCH_SCTP=m
CONFIG_NETFILTER_XT_MATCH_SOCKET=m
CONFIG_NETFILTER_XT_MATCH_STATE=m
CONFIG_NETFILTER_XT_MATCH_STATISTIC=m
CONFIG_NETFILTER_XT_MATCH_STRING=m
CONFIG_NETFILTER_XT_MATCH_TCPMSS=y
CONFIG_NETFILTER_XT_MATCH_TIME=m
CONFIG_NETFILTER_XT_MATCH_U32=m
# end of Core Netfilter Configuration

CONFIG_IP_SET=m
CONFIG_IP_SET_MAX=1024
# CONFIG_IP_SET_BITMAP_IP is not set
# CONFIG_IP_SET_BITMAP_IPMAC is not set
# CONFIG_IP_SET_BITMAP_PORT is not set
CONFIG_IP_SET_HASH_IP=m
# CONFIG_IP_SET_HASH_IPMARK is not set
# CONFIG_IP_SET_HASH_IPPORT is not set
# CONFIG_IP_SET_HASH_IPPORTIP is not set
# CONFIG_IP_SET_HASH_IPPORTNET is not set
# CONFIG_IP_SET_HASH_IPMAC is not set
# CONFIG_IP_SET_HASH_MAC is not set
# CONFIG_IP_SET_HASH_NETPORTNET is not set
CONFIG_IP_SET_HASH_NET=m
# CONFIG_IP_SET_HASH_NETNET is not set
# CONFIG_IP_SET_HASH_NETPORT is not set
# CONFIG_IP_SET_HASH_NETIFACE is not set
# CONFIG_IP_SET_LIST_SET is not set
# CONFIG_IP_VS is not set

#
# IP: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV4=y
CONFIG_NF_SOCKET_IPV4=m
CONFIG_NF_TPROXY_IPV4=y
# CONFIG_NF_DUP_IPV4 is not set
# CONFIG_NF_LOG_ARP is not set
# CONFIG_NF_LOG_IPV4 is not set
CONFIG_NF_REJECT_IPV4=m
CONFIG_IP_NF_IPTABLES=y
# CONFIG_IP_NF_MATCH_AH is not set
CONFIG_IP_NF_MATCH_ECN=m
# CONFIG_IP_NF_MATCH_RPFILTER is not set
CONFIG_IP_NF_MATCH_TTL=m
CONFIG_IP_NF_FILTER=m
CONFIG_IP_NF_TARGET_REJECT=m
# CONFIG_IP_NF_TARGET_SYNPROXY is not set
CONFIG_IP_NF_NAT=m
CONFIG_IP_NF_TARGET_MASQUERADE=m
# CONFIG_IP_NF_TARGET_NETMAP is not set
# CONFIG_IP_NF_TARGET_REDIRECT is not set
CONFIG_IP_NF_MANGLE=y
CONFIG_IP_NF_TARGET_ECN=m
CONFIG_IP_NF_TARGET_TTL=m
CONFIG_IP_NF_RAW=m
# CONFIG_IP_NF_SECURITY is not set
# CONFIG_IP_NF_ARPTABLES is not set
# end of IP: Netfilter Configuration

#
# IPv6: Netfilter Configuration
#
CONFIG_NF_SOCKET_IPV6=m
CONFIG_NF_TPROXY_IPV6=y
# CONFIG_NF_DUP_IPV6 is not set
CONFIG_NF_REJECT_IPV6=m
CONFIG_NF_LOG_IPV6=m
CONFIG_IP6_NF_IPTABLES=y
# CONFIG_IP6_NF_MATCH_AH is not set
# CONFIG_IP6_NF_MATCH_EUI64 is not set
CONFIG_IP6_NF_MATCH_FRAG=m
# CONFIG_IP6_NF_MATCH_OPTS is not set
CONFIG_IP6_NF_MATCH_HL=m
CONFIG_IP6_NF_MATCH_IPV6HEADER=m
# CONFIG_IP6_NF_MATCH_MH is not set
# CONFIG_IP6_NF_MATCH_RPFILTER is not set
# CONFIG_IP6_NF_MATCH_RT is not set
# CONFIG_IP6_NF_MATCH_SRH is not set
CONFIG_IP6_NF_TARGET_HL=m
CONFIG_IP6_NF_FILTER=m
CONFIG_IP6_NF_TARGET_REJECT=m
# CONFIG_IP6_NF_TARGET_SYNPROXY is not set
CONFIG_IP6_NF_MANGLE=y
CONFIG_IP6_NF_RAW=m
# CONFIG_IP6_NF_SECURITY is not set
# CONFIG_IP6_NF_NAT is not set
# end of IPv6: Netfilter Configuration

CONFIG_NF_DEFRAG_IPV6=y
# CONFIG_NF_CONNTRACK_BRIDGE is not set
# CONFIG_BRIDGE_NF_EBTABLES is not set
# CONFIG_BPFILTER is not set
# CONFIG_IP_DCCP is not set
CONFIG_IP_SCTP=m
# CONFIG_SCTP_DBG_OBJCNT is not set
CONFIG_SCTP_DEFAULT_COOKIE_HMAC_MD5=y
# CONFIG_SCTP_DEFAULT_COOKIE_HMAC_SHA1 is not set
# CONFIG_SCTP_DEFAULT_COOKIE_HMAC_NONE is not set
CONFIG_SCTP_COOKIE_HMAC_MD5=y
# CONFIG_SCTP_COOKIE_HMAC_SHA1 is not set
CONFIG_INET_SCTP_DIAG=m
# CONFIG_RDS is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_L2TP is not set
CONFIG_STP=m
CONFIG_BRIDGE=m
CONFIG_BRIDGE_IGMP_SNOOPING=y
# CONFIG_BRIDGE_VLAN_FILTERING is not set
# CONFIG_BRIDGE_MRP is not set
# CONFIG_BRIDGE_CFM is not set
# CONFIG_NET_DSA is not set
CONFIG_VLAN_8021Q=m
# CONFIG_VLAN_8021Q_GVRP is not set
# CONFIG_VLAN_8021Q_MVRP is not set
CONFIG_LLC=m
# CONFIG_LLC2 is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_PHONET is not set
# CONFIG_6LOWPAN is not set
# CONFIG_IEEE802154 is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
CONFIG_NET_SCH_HTB=y
CONFIG_NET_SCH_HFSC=m
CONFIG_NET_SCH_PRIO=y
CONFIG_NET_SCH_MULTIQ=m
CONFIG_NET_SCH_RED=m
# CONFIG_NET_SCH_SFB is not set
CONFIG_NET_SCH_SFQ=m
CONFIG_NET_SCH_TEQL=m
CONFIG_NET_SCH_TBF=m
# CONFIG_NET_SCH_CBS is not set
# CONFIG_NET_SCH_ETF is not set
CONFIG_NET_SCH_MQPRIO_LIB=m
# CONFIG_NET_SCH_TAPRIO is not set
CONFIG_NET_SCH_GRED=m
CONFIG_NET_SCH_NETEM=m
CONFIG_NET_SCH_DRR=m
CONFIG_NET_SCH_MQPRIO=m
# CONFIG_NET_SCH_SKBPRIO is not set
# CONFIG_NET_SCH_CHOKE is not set
# CONFIG_NET_SCH_QFQ is not set
CONFIG_NET_SCH_CODEL=y
CONFIG_NET_SCH_FQ_CODEL=m
# CONFIG_NET_SCH_CAKE is not set
CONFIG_NET_SCH_FQ=y
# CONFIG_NET_SCH_HHF is not set
CONFIG_NET_SCH_PIE=m
# CONFIG_NET_SCH_FQ_PIE is not set
CONFIG_NET_SCH_INGRESS=y
# CONFIG_NET_SCH_PLUG is not set
# CONFIG_NET_SCH_ETS is not set
# CONFIG_NET_SCH_DEFAULT is not set

#
# Classification
#
CONFIG_NET_CLS=y
CONFIG_NET_CLS_BASIC=m
CONFIG_NET_CLS_ROUTE4=m
CONFIG_NET_CLS_FW=m
CONFIG_NET_CLS_U32=y
# CONFIG_CLS_U32_PERF is not set
# CONFIG_CLS_U32_MARK is not set
CONFIG_NET_CLS_FLOW=m
# CONFIG_NET_CLS_CGROUP is not set
CONFIG_NET_CLS_BPF=y
# CONFIG_NET_CLS_FLOWER is not set
# CONFIG_NET_CLS_MATCHALL is not set
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
CONFIG_NET_EMATCH_CMP=m
CONFIG_NET_EMATCH_NBYTE=m
CONFIG_NET_EMATCH_U32=m
CONFIG_NET_EMATCH_META=m
CONFIG_NET_EMATCH_TEXT=m
# CONFIG_NET_EMATCH_IPSET is not set
# CONFIG_NET_EMATCH_IPT is not set
CONFIG_NET_CLS_ACT=y
CONFIG_NET_ACT_POLICE=m
CONFIG_NET_ACT_GACT=m
# CONFIG_GACT_PROB is not set
CONFIG_NET_ACT_MIRRED=m
# CONFIG_NET_ACT_SAMPLE is not set
CONFIG_NET_ACT_IPT=m
CONFIG_NET_ACT_NAT=m
CONFIG_NET_ACT_PEDIT=m
CONFIG_NET_ACT_SIMP=m
CONFIG_NET_ACT_SKBEDIT=m
# CONFIG_NET_ACT_CSUM is not set
# CONFIG_NET_ACT_MPLS is not set
# CONFIG_NET_ACT_VLAN is not set
CONFIG_NET_ACT_BPF=y
# CONFIG_NET_ACT_CONNMARK is not set
# CONFIG_NET_ACT_CTINFO is not set
# CONFIG_NET_ACT_SKBMOD is not set
# CONFIG_NET_ACT_IFE is not set
# CONFIG_NET_ACT_TUNNEL_KEY is not set
# CONFIG_NET_ACT_GATE is not set
# CONFIG_NET_TC_SKB_EXT is not set
CONFIG_NET_SCH_FIFO=y
CONFIG_DCB=y
# CONFIG_DNS_RESOLVER is not set
CONFIG_BATMAN_ADV=m
# CONFIG_BATMAN_ADV_BATMAN_V is not set
CONFIG_BATMAN_ADV_BLA=y
CONFIG_BATMAN_ADV_DAT=y
CONFIG_BATMAN_ADV_NC=y
CONFIG_BATMAN_ADV_MCAST=y
# CONFIG_BATMAN_ADV_DEBUG is not set
# CONFIG_BATMAN_ADV_TRACING is not set
# CONFIG_OPENVSWITCH is not set
CONFIG_VSOCKETS=y
CONFIG_VSOCKETS_DIAG=y
CONFIG_VSOCKETS_LOOPBACK=y
CONFIG_VIRTIO_VSOCKETS=y
CONFIG_VIRTIO_VSOCKETS_COMMON=y
CONFIG_NETLINK_DIAG=y
# CONFIG_MPLS is not set
# CONFIG_NET_NSH is not set
# CONFIG_HSR is not set
# CONFIG_NET_SWITCHDEV is not set
CONFIG_NET_L3_MASTER_DEV=y
# CONFIG_QRTR is not set
# CONFIG_NET_NCSI is not set
CONFIG_PCPU_DEV_REFCNT=y
CONFIG_MAX_SKB_FRAGS=17
CONFIG_RPS=y
CONFIG_RFS_ACCEL=y
CONFIG_SOCK_RX_QUEUE_MAPPING=y
CONFIG_XPS=y
# CONFIG_CGROUP_NET_PRIO is not set
# CONFIG_CGROUP_NET_CLASSID is not set
CONFIG_NET_RX_BUSY_POLL=y
CONFIG_BQL=y
CONFIG_BPF_STREAM_PARSER=y
CONFIG_NET_FLOW_LIMIT=y

#
# Network testing
#
CONFIG_NET_PKTGEN=m
# CONFIG_NET_DROP_MONITOR is not set
# end of Network testing
# end of Networking options

# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set
# CONFIG_AF_KCM is not set
CONFIG_STREAM_PARSER=y
# CONFIG_MCTP is not set
CONFIG_FIB_RULES=y
# CONFIG_WIRELESS is not set
# CONFIG_RFKILL is not set
CONFIG_NET_9P=m
CONFIG_NET_9P_FD=m
CONFIG_NET_9P_VIRTIO=m
# CONFIG_NET_9P_RDMA is not set
# CONFIG_NET_9P_DEBUG is not set
# CONFIG_CAIF is not set
# CONFIG_CEPH_LIB is not set
# CONFIG_NFC is not set
# CONFIG_PSAMPLE is not set
# CONFIG_NET_IFE is not set
# CONFIG_LWTUNNEL is not set
CONFIG_DST_CACHE=y
CONFIG_GRO_CELLS=y
CONFIG_NET_SELFTESTS=y
CONFIG_NET_SOCK_MSG=y
CONFIG_NET_DEVLINK=y
CONFIG_PAGE_POOL=y
# CONFIG_PAGE_POOL_STATS is not set
CONFIG_FAILOVER=m
CONFIG_ETHTOOL_NETLINK=y

#
# Device Drivers
#
CONFIG_HAVE_EISA=y
# CONFIG_EISA is not set
CONFIG_HAVE_PCI=y
CONFIG_PCI=y
CONFIG_PCI_DOMAINS=y
# CONFIG_PCIEPORTBUS is not set
CONFIG_PCIEASPM=y
CONFIG_PCIEASPM_DEFAULT=y
# CONFIG_PCIEASPM_POWERSAVE is not set
# CONFIG_PCIEASPM_POWER_SUPERSAVE is not set
# CONFIG_PCIEASPM_PERFORMANCE is not set
# CONFIG_PCIE_PTM is not set
CONFIG_PCI_MSI=y
CONFIG_PCI_QUIRKS=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_REALLOC_ENABLE_AUTO is not set
# CONFIG_PCI_STUB is not set
CONFIG_PCI_PF_STUB=m
CONFIG_PCI_ATS=y
CONFIG_PCI_LOCKLESS_CONFIG=y
CONFIG_PCI_IOV=y
CONFIG_PCI_PRI=y
CONFIG_PCI_PASID=y
CONFIG_PCI_LABEL=y
# CONFIG_PCIE_BUS_TUNE_OFF is not set
CONFIG_PCIE_BUS_DEFAULT=y
# CONFIG_PCIE_BUS_SAFE is not set
# CONFIG_PCIE_BUS_PERFORMANCE is not set
# CONFIG_PCIE_BUS_PEER2PEER is not set
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=16
CONFIG_HOTPLUG_PCI=y
# CONFIG_HOTPLUG_PCI_ACPI is not set
# CONFIG_HOTPLUG_PCI_CPCI is not set
# CONFIG_HOTPLUG_PCI_SHPC is not set

#
# PCI controller drivers
#
# CONFIG_VMD is not set

#
# Cadence-based PCIe controllers
#
# end of Cadence-based PCIe controllers

#
# DesignWare-based PCIe controllers
#
# CONFIG_PCI_MESON is not set
# CONFIG_PCIE_DW_PLAT_HOST is not set
# end of DesignWare-based PCIe controllers

#
# Mobiveil-based PCIe controllers
#
# end of Mobiveil-based PCIe controllers
# end of PCI controller drivers

#
# PCI Endpoint
#
# CONFIG_PCI_ENDPOINT is not set
# end of PCI Endpoint

#
# PCI switch controller drivers
#
# CONFIG_PCI_SW_SWITCHTEC is not set
# end of PCI switch controller drivers

# CONFIG_CXL_BUS is not set
# CONFIG_PCCARD is not set
# CONFIG_RAPIDIO is not set

#
# Generic Driver Options
#
CONFIG_AUXILIARY_BUS=y
# CONFIG_UEVENT_HELPER is not set
CONFIG_DEVTMPFS=y
# CONFIG_DEVTMPFS_MOUNT is not set
# CONFIG_DEVTMPFS_SAFE is not set
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y

#
# Firmware loader
#
CONFIG_FW_LOADER=y
CONFIG_FW_LOADER_PAGED_BUF=y
CONFIG_FW_LOADER_SYSFS=y
CONFIG_EXTRA_FIRMWARE=""
CONFIG_FW_LOADER_USER_HELPER=y
# CONFIG_FW_LOADER_USER_HELPER_FALLBACK is not set
# CONFIG_FW_LOADER_COMPRESS is not set
CONFIG_FW_CACHE=y
# CONFIG_FW_UPLOAD is not set
# end of Firmware loader

CONFIG_ALLOW_DEV_COREDUMP=y
# CONFIG_DEBUG_DRIVER is not set
CONFIG_DEBUG_DEVRES=y
# CONFIG_DEBUG_TEST_DRIVER_REMOVE is not set
# CONFIG_TEST_ASYNC_DRIVER_PROBE is not set
CONFIG_GENERIC_CPU_AUTOPROBE=y
CONFIG_GENERIC_CPU_VULNERABILITIES=y
CONFIG_DMA_SHARED_BUFFER=y
# CONFIG_DMA_FENCE_TRACE is not set
# CONFIG_FW_DEVLINK_SYNC_STATE_TIMEOUT is not set
# end of Generic Driver Options

#
# Bus devices
#
# CONFIG_MHI_BUS is not set
# CONFIG_MHI_BUS_EP is not set
# end of Bus devices

#
# Cache Drivers
#
# end of Cache Drivers

CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y

#
# Firmware Drivers
#

#
# ARM System Control and Management Interface Protocol
#
# end of ARM System Control and Management Interface Protocol

# CONFIG_EDD is not set
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_DMIID=y
# CONFIG_DMI_SYSFS is not set
CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK=y
# CONFIG_ISCSI_IBFT is not set
# CONFIG_FW_CFG_SYSFS is not set
CONFIG_SYSFB=y
# CONFIG_SYSFB_SIMPLEFB is not set
CONFIG_GOOGLE_FIRMWARE=y
CONFIG_GOOGLE_SMI=y
# CONFIG_GOOGLE_COREBOOT_TABLE is not set
# CONFIG_GOOGLE_MEMCONSOLE_X86_LEGACY is not set

#
# EFI (Extensible Firmware Interface) Support
#
CONFIG_EFI_ESRT=y
CONFIG_EFI_VARS_PSTORE=y
# CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE is not set
CONFIG_EFI_DXE_MEM_ATTRIBUTES=y
CONFIG_EFI_RUNTIME_WRAPPERS=y
# CONFIG_EFI_BOOTLOADER_CONTROL is not set
# CONFIG_EFI_CAPSULE_LOADER is not set
# CONFIG_EFI_TEST is not set
# CONFIG_APPLE_PROPERTIES is not set
# CONFIG_RESET_ATTACK_MITIGATION is not set
# CONFIG_EFI_RCI2_TABLE is not set
# CONFIG_EFI_DISABLE_PCI_DMA is not set
CONFIG_EFI_EARLYCON=y
# CONFIG_EFI_CUSTOM_SSDT_OVERLAYS is not set
# CONFIG_EFI_DISABLE_RUNTIME is not set
# CONFIG_EFI_COCO_SECRET is not set
CONFIG_UNACCEPTED_MEMORY=y
# end of EFI (Extensible Firmware Interface) Support

CONFIG_UEFI_CPER=y
CONFIG_UEFI_CPER_X86=y

#
# Qualcomm firmware drivers
#
# end of Qualcomm firmware drivers

#
# Tegra firmware driver
#
# end of Tegra firmware driver
# end of Firmware Drivers

# CONFIG_GNSS is not set
CONFIG_MTD=y
# CONFIG_MTD_TESTS is not set

#
# Partition parsers
#
CONFIG_MTD_CMDLINE_PARTS=y
# CONFIG_MTD_REDBOOT_PARTS is not set
# end of Partition parsers

#
# User Modules And Translation Layers
#
CONFIG_MTD_BLKDEVS=y
# CONFIG_MTD_BLOCK is not set
CONFIG_MTD_BLOCK_RO=y

#
# Note that in some cases UBI block is preferred. See MTD_UBI_BLOCK.
#
# CONFIG_FTL is not set
# CONFIG_NFTL is not set
# CONFIG_INFTL is not set
# CONFIG_RFD_FTL is not set
# CONFIG_SSFDC is not set
# CONFIG_SM_FTL is not set
CONFIG_MTD_OOPS=y
# CONFIG_MTD_SWAP is not set
# CONFIG_MTD_PARTITIONED_MASTER is not set

#
# RAM/ROM/Flash chip drivers
#
CONFIG_MTD_CFI=y
# CONFIG_MTD_JEDECPROBE is not set
CONFIG_MTD_GEN_PROBE=y
# CONFIG_MTD_CFI_ADV_OPTIONS is not set
CONFIG_MTD_MAP_BANK_WIDTH_1=y
CONFIG_MTD_MAP_BANK_WIDTH_2=y
CONFIG_MTD_MAP_BANK_WIDTH_4=y
CONFIG_MTD_CFI_I1=y
CONFIG_MTD_CFI_I2=y
CONFIG_MTD_CFI_INTELEXT=y
CONFIG_MTD_CFI_AMDSTD=y
# CONFIG_MTD_CFI_STAA is not set
CONFIG_MTD_CFI_UTIL=y
# CONFIG_MTD_RAM is not set
# CONFIG_MTD_ROM is not set
# CONFIG_MTD_ABSENT is not set
# end of RAM/ROM/Flash chip drivers

#
# Mapping drivers for chip access
#
# CONFIG_MTD_COMPLEX_MAPPINGS is not set
# CONFIG_MTD_PHYSMAP is not set
# CONFIG_MTD_INTEL_VR_NOR is not set
# CONFIG_MTD_PLATRAM is not set
# end of Mapping drivers for chip access

#
# Self-contained MTD device drivers
#
# CONFIG_MTD_PMC551 is not set
# CONFIG_MTD_DATAFLASH is not set
# CONFIG_MTD_MCHP23K256 is not set
# CONFIG_MTD_MCHP48L640 is not set
# CONFIG_MTD_SST25L is not set
# CONFIG_MTD_SLRAM is not set
# CONFIG_MTD_PHRAM is not set
# CONFIG_MTD_MTDRAM is not set
# CONFIG_MTD_BLOCK2MTD is not set

#
# Disk-On-Chip Device Drivers
#
# CONFIG_MTD_DOCG3 is not set
# end of Self-contained MTD device drivers

#
# NAND
#
# CONFIG_MTD_ONENAND is not set
# CONFIG_MTD_RAW_NAND is not set
# CONFIG_MTD_SPI_NAND is not set

#
# ECC engine support
#
# CONFIG_MTD_NAND_ECC_SW_HAMMING is not set
# CONFIG_MTD_NAND_ECC_SW_BCH is not set
# CONFIG_MTD_NAND_ECC_MXIC is not set
# end of ECC engine support
# end of NAND

#
# LPDDR & LPDDR2 PCM memory drivers
#
# CONFIG_MTD_LPDDR is not set
# end of LPDDR & LPDDR2 PCM memory drivers

CONFIG_MTD_SPI_NOR=y
CONFIG_MTD_SPI_NOR_USE_4K_SECTORS=y
# CONFIG_MTD_SPI_NOR_SWP_DISABLE is not set
CONFIG_MTD_SPI_NOR_SWP_DISABLE_ON_VOLATILE=y
# CONFIG_MTD_SPI_NOR_SWP_KEEP is not set
# CONFIG_MTD_UBI is not set
# CONFIG_MTD_HYPERBUS is not set
# CONFIG_OF is not set
CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
# CONFIG_PARPORT is not set
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_NULL_BLK is not set
CONFIG_CDROM=m
# CONFIG_BLK_DEV_PCIESSD_MTIP32XX is not set
# CONFIG_ZRAM is not set
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
# CONFIG_BLK_DEV_DRBD is not set
# CONFIG_BLK_DEV_NBD is not set
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=1
CONFIG_BLK_DEV_RAM_SIZE=335544320
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set
CONFIG_VIRTIO_BLK=m
# CONFIG_BLK_DEV_RBD is not set
# CONFIG_BLK_DEV_UBLK is not set

#
# NVME Support
#
CONFIG_NVME_CORE=y
CONFIG_BLK_DEV_NVME=y
# CONFIG_NVME_MULTIPATH is not set
# CONFIG_NVME_VERBOSE_ERRORS is not set
# CONFIG_NVME_HWMON is not set
# CONFIG_NVME_RDMA is not set
# CONFIG_NVME_FC is not set
# CONFIG_NVME_TCP is not set
# CONFIG_NVME_HOST_AUTH is not set
# CONFIG_NVME_TARGET is not set
# end of NVME Support

#
# Misc devices
#
# CONFIG_AD525X_DPOT is not set
# CONFIG_DUMMY_IRQ is not set
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_HP_ILO is not set
# CONFIG_APDS9802ALS is not set
# CONFIG_ISL29003 is not set
# CONFIG_ISL29020 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_SENSORS_BH1770 is not set
# CONFIG_SENSORS_APDS990X is not set
# CONFIG_HMC6352 is not set
# CONFIG_DS1682 is not set
# CONFIG_LATTICE_ECP3_CONFIG is not set
# CONFIG_SRAM is not set
# CONFIG_DW_XDATA_PCIE is not set
# CONFIG_PCI_ENDPOINT_TEST is not set
# CONFIG_XILINX_SDFEC is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_AT25 is not set
# CONFIG_EEPROM_MAX6875 is not set
# CONFIG_EEPROM_93CX6 is not set
# CONFIG_EEPROM_93XX46 is not set
# CONFIG_EEPROM_IDT_89HPESX is not set
# CONFIG_EEPROM_EE1004 is not set
# end of EEPROM support

# CONFIG_CB710_CORE is not set

#
# Texas Instruments shared transport line discipline
#
# CONFIG_TI_ST is not set
# end of Texas Instruments shared transport line discipline

# CONFIG_SENSORS_LIS3_I2C is not set
# CONFIG_ALTERA_STAPL is not set
# CONFIG_INTEL_MEI is not set
# CONFIG_INTEL_MEI_ME is not set
# CONFIG_INTEL_MEI_TXE is not set
# CONFIG_VMWARE_VMCI is not set
# CONFIG_GENWQE is not set
# CONFIG_ECHO is not set
# CONFIG_BCM_VK is not set
# CONFIG_MISC_ALCOR_PCI is not set
# CONFIG_MISC_RTSX_PCI is not set
# CONFIG_MISC_RTSX_USB is not set
# CONFIG_UACCE is not set
# CONFIG_PVPANIC is not set
# CONFIG_GP_PCI1XXXX is not set
# end of Misc devices

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
CONFIG_RAID_ATTRS=m
CONFIG_SCSI_COMMON=y
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
CONFIG_CHR_DEV_ST=m
CONFIG_BLK_DEV_SR=m
CONFIG_CHR_DEV_SG=y
CONFIG_BLK_DEV_BSG=y
# CONFIG_CHR_DEV_SCH is not set
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
# CONFIG_SCSI_SCAN_ASYNC is not set

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=m
CONFIG_SCSI_FC_ATTRS=m
CONFIG_SCSI_ISCSI_ATTRS=y
CONFIG_SCSI_SAS_ATTRS=m
CONFIG_SCSI_SAS_LIBSAS=m
CONFIG_SCSI_SAS_ATA=y
CONFIG_SCSI_SAS_HOST_SMP=y
# CONFIG_SCSI_SRP_ATTRS is not set
# end of SCSI Transports

CONFIG_SCSI_LOWLEVEL=y
CONFIG_ISCSI_TCP=y
# CONFIG_ISCSI_BOOT_SYSFS is not set
# CONFIG_SCSI_CXGB3_ISCSI is not set
# CONFIG_SCSI_CXGB4_ISCSI is not set
# CONFIG_SCSI_BNX2_ISCSI is not set
# CONFIG_BE2ISCSI is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_HPSA is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_3W_SAS is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AACRAID is not set
CONFIG_SCSI_AIC7XXX=m
CONFIG_AIC7XXX_CMDS_PER_DEVICE=253
CONFIG_AIC7XXX_RESET_DELAY_MS=15000
CONFIG_AIC7XXX_DEBUG_ENABLE=y
CONFIG_AIC7XXX_DEBUG_MASK=0
CONFIG_AIC7XXX_REG_PRETTY_PRINT=y
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_MVSAS is not set
# CONFIG_SCSI_MVUMI is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_ARCMSR is not set
# CONFIG_SCSI_ESAS2R is not set
CONFIG_MEGARAID_NEWGEN=y
CONFIG_MEGARAID_MM=y
CONFIG_MEGARAID_MAILBOX=y
# CONFIG_MEGARAID_LEGACY is not set
CONFIG_MEGARAID_SAS=y
CONFIG_SCSI_MPT3SAS=m
CONFIG_SCSI_MPT2SAS_MAX_SGE=128
CONFIG_SCSI_MPT3SAS_MAX_SGE=128
CONFIG_SCSI_MPT2SAS=m
# CONFIG_SCSI_MPI3MR is not set
# CONFIG_SCSI_SMARTPQI is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_MYRB is not set
# CONFIG_SCSI_MYRS is not set
# CONFIG_VMWARE_PVSCSI is not set
# CONFIG_LIBFC is not set
# CONFIG_SCSI_SNIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_FDOMAIN_PCI is not set
# CONFIG_SCSI_ISCI is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
CONFIG_SCSI_QLA_FC=m
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_AM53C974 is not set
# CONFIG_SCSI_WD719X is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_PMCRAID is not set
CONFIG_SCSI_PM8001=m
# CONFIG_SCSI_BFA_FC is not set
CONFIG_SCSI_VIRTIO=y
# CONFIG_SCSI_CHELSIO_FCOE is not set
# CONFIG_SCSI_DH is not set
# end of SCSI device support

CONFIG_ATA=y
CONFIG_SATA_HOST=y
CONFIG_PATA_TIMINGS=y
CONFIG_ATA_VERBOSE_ERROR=y
CONFIG_ATA_FORCE=y
CONFIG_ATA_ACPI=y
# CONFIG_SATA_ZPODD is not set
CONFIG_SATA_PMP=y

#
# Controllers with non-SFF native interface
#
CONFIG_SATA_AHCI=y
CONFIG_SATA_MOBILE_LPM_POLICY=0
# CONFIG_SATA_AHCI_PLATFORM is not set
# CONFIG_AHCI_DWC is not set
# CONFIG_SATA_INIC162X is not set
# CONFIG_SATA_ACARD_AHCI is not set
CONFIG_SATA_SIL24=m
CONFIG_ATA_SFF=y

#
# SFF controllers with custom DMA interface
#
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_SX4 is not set
CONFIG_ATA_BMDMA=y

#
# SATA SFF controllers with BMDMA
#
CONFIG_ATA_PIIX=y
# CONFIG_SATA_DWC is not set
CONFIG_SATA_MV=m
# CONFIG_SATA_NV is not set
# CONFIG_SATA_PROMISE is not set
CONFIG_SATA_SIL=m
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_SVW is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set

#
# PATA SFF controllers with BMDMA
#
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_ATP867X is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87415 is not set
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RDC is not set
# CONFIG_PATA_SCH is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_TOSHIBA is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set

#
# PIO-only SFF controllers
#
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_RZ1000 is not set

#
# Generic fallback / legacy drivers
#
# CONFIG_PATA_ACPI is not set
# CONFIG_ATA_GENERIC is not set
# CONFIG_PATA_LEGACY is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
CONFIG_MD_BITMAP_FILE=y
CONFIG_MD_LINEAR=m
CONFIG_MD_RAID0=m
CONFIG_MD_RAID1=y
# CONFIG_MD_RAID10 is not set
# CONFIG_MD_RAID456 is not set
# CONFIG_MD_MULTIPATH is not set
# CONFIG_MD_FAULTY is not set
# CONFIG_BCACHE is not set
CONFIG_BLK_DEV_DM_BUILTIN=y
CONFIG_BLK_DEV_DM=y
# CONFIG_DM_DEBUG is not set
CONFIG_DM_BUFIO=y
# CONFIG_DM_DEBUG_BLOCK_MANAGER_LOCKING is not set
CONFIG_DM_BIO_PRISON=y
CONFIG_DM_PERSISTENT_DATA=y
# CONFIG_DM_UNSTRIPED is not set
CONFIG_DM_CRYPT=y
# CONFIG_DM_SNAPSHOT is not set
CONFIG_DM_THIN_PROVISIONING=y
# CONFIG_DM_CACHE is not set
# CONFIG_DM_WRITECACHE is not set
# CONFIG_DM_EBS is not set
# CONFIG_DM_ERA is not set
# CONFIG_DM_CLONE is not set
CONFIG_DM_MIRROR=y
# CONFIG_DM_LOG_USERSPACE is not set
# CONFIG_DM_RAID is not set
CONFIG_DM_ZERO=y
CONFIG_DM_MULTIPATH=y
CONFIG_DM_MULTIPATH_QL=y
CONFIG_DM_MULTIPATH_ST=y
CONFIG_DM_MULTIPATH_HST=y
# CONFIG_DM_MULTIPATH_IOA is not set
# CONFIG_DM_DELAY is not set
# CONFIG_DM_DUST is not set
# CONFIG_DM_INIT is not set
# CONFIG_DM_UEVENT is not set
# CONFIG_DM_FLAKEY is not set
CONFIG_DM_VERITY=y
# CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG is not set
# CONFIG_DM_VERITY_FEC is not set
# CONFIG_DM_SWITCH is not set
# CONFIG_DM_LOG_WRITES is not set
# CONFIG_DM_INTEGRITY is not set
# CONFIG_DM_AUDIT is not set
# CONFIG_TARGET_CORE is not set
# CONFIG_FUSION is not set

#
# IEEE 1394 (FireWire) support
#
CONFIG_FIREWIRE=m
CONFIG_FIREWIRE_OHCI=m
# CONFIG_FIREWIRE_SBP2 is not set
# CONFIG_FIREWIRE_NET is not set
# CONFIG_FIREWIRE_NOSY is not set
# end of IEEE 1394 (FireWire) support

# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
CONFIG_MII=m
CONFIG_NET_CORE=y
CONFIG_BONDING=y
CONFIG_DUMMY=m
# CONFIG_WIREGUARD is not set
# CONFIG_EQUALIZER is not set
# CONFIG_NET_FC is not set
CONFIG_IFB=m
CONFIG_NET_TEAM=y
CONFIG_NET_TEAM_MODE_BROADCAST=y
CONFIG_NET_TEAM_MODE_ROUNDROBIN=y
CONFIG_NET_TEAM_MODE_RANDOM=y
CONFIG_NET_TEAM_MODE_ACTIVEBACKUP=y
CONFIG_NET_TEAM_MODE_LOADBALANCE=y
CONFIG_MACVLAN=m
CONFIG_MACVTAP=m
CONFIG_IPVLAN_L3S=y
CONFIG_IPVLAN=y
CONFIG_IPVTAP=y
# CONFIG_VXLAN is not set
# CONFIG_GENEVE is not set
# CONFIG_BAREUDP is not set
# CONFIG_GTP is not set
# CONFIG_AMT is not set
# CONFIG_MACSEC is not set
CONFIG_NETCONSOLE=m
# CONFIG_NETCONSOLE_DYNAMIC is not set
# CONFIG_NETCONSOLE_EXTENDED_LOG is not set
CONFIG_NETPOLL=y
CONFIG_NET_POLL_CONTROLLER=y
CONFIG_TUN=y
CONFIG_TAP=y
# CONFIG_TUN_VNET_CROSS_LE is not set
CONFIG_VETH=y
CONFIG_VIRTIO_NET=m
# CONFIG_NLMON is not set
# CONFIG_NETKIT is not set
# CONFIG_NET_VRF is not set
# CONFIG_VSOCKMON is not set
# CONFIG_ARCNET is not set
CONFIG_ETHERNET=y
CONFIG_MDIO=m
CONFIG_NET_VENDOR_3COM=y
# CONFIG_VORTEX is not set
# CONFIG_TYPHOON is not set
CONFIG_NET_VENDOR_ADAPTEC=y
# CONFIG_ADAPTEC_STARFIRE is not set
CONFIG_NET_VENDOR_AGERE=y
# CONFIG_ET131X is not set
CONFIG_NET_VENDOR_ALACRITECH=y
# CONFIG_SLICOSS is not set
CONFIG_NET_VENDOR_ALTEON=y
# CONFIG_ACENIC is not set
# CONFIG_ALTERA_TSE is not set
CONFIG_NET_VENDOR_AMAZON=y
# CONFIG_ENA_ETHERNET is not set
CONFIG_NET_VENDOR_AMD=y
# CONFIG_AMD8111_ETH is not set
# CONFIG_PCNET32 is not set
# CONFIG_AMD_XGBE is not set
# CONFIG_PDS_CORE is not set
CONFIG_NET_VENDOR_AQUANTIA=y
# CONFIG_AQTION is not set
CONFIG_NET_VENDOR_ARC=y
CONFIG_NET_VENDOR_ASIX=y
# CONFIG_SPI_AX88796C is not set
CONFIG_NET_VENDOR_ATHEROS=y
# CONFIG_ATL2 is not set
# CONFIG_ATL1 is not set
# CONFIG_ATL1E is not set
# CONFIG_ATL1C is not set
# CONFIG_ALX is not set
# CONFIG_CX_ECAT is not set
CONFIG_NET_VENDOR_BROADCOM=y
# CONFIG_B44 is not set
# CONFIG_BCMGENET is not set
CONFIG_BNX2=m
# CONFIG_CNIC is not set
CONFIG_TIGON3=m
CONFIG_TIGON3_HWMON=y
CONFIG_BNX2X=m
CONFIG_BNX2X_SRIOV=y
# CONFIG_SYSTEMPORT is not set
CONFIG_BNXT=m
CONFIG_BNXT_SRIOV=y
CONFIG_BNXT_FLOWER_OFFLOAD=y
# CONFIG_BNXT_DCB is not set
CONFIG_BNXT_HWMON=y
CONFIG_NET_VENDOR_CADENCE=y
CONFIG_NET_VENDOR_CAVIUM=y
# CONFIG_THUNDER_NIC_PF is not set
# CONFIG_THUNDER_NIC_VF is not set
# CONFIG_THUNDER_NIC_BGX is not set
# CONFIG_THUNDER_NIC_RGX is not set
# CONFIG_CAVIUM_PTP is not set
# CONFIG_LIQUIDIO is not set
# CONFIG_LIQUIDIO_VF is not set
CONFIG_NET_VENDOR_CHELSIO=y
# CONFIG_CHELSIO_T1 is not set
CONFIG_CHELSIO_T3=m
# CONFIG_CHELSIO_T4 is not set
# CONFIG_CHELSIO_T4VF is not set
CONFIG_NET_VENDOR_CISCO=y
# CONFIG_ENIC is not set
CONFIG_NET_VENDOR_CORTINA=y
CONFIG_NET_VENDOR_DAVICOM=y
# CONFIG_DM9051 is not set
# CONFIG_DNET is not set
CONFIG_NET_VENDOR_DEC=y
CONFIG_NET_TULIP=y
CONFIG_DE2104X=m
CONFIG_DE2104X_DSL=0
# CONFIG_TULIP is not set
# CONFIG_WINBOND_840 is not set
# CONFIG_DM9102 is not set
# CONFIG_ULI526X is not set
CONFIG_NET_VENDOR_DLINK=y
# CONFIG_DL2K is not set
# CONFIG_SUNDANCE is not set
CONFIG_NET_VENDOR_EMULEX=y
# CONFIG_BE2NET is not set
CONFIG_NET_VENDOR_ENGLEDER=y
# CONFIG_TSNEP is not set
CONFIG_NET_VENDOR_EZCHIP=y
CONFIG_NET_VENDOR_FUNGIBLE=y
# CONFIG_FUN_ETH is not set
CONFIG_NET_VENDOR_GOOGLE=y
CONFIG_GVE=m
CONFIG_NET_VENDOR_HUAWEI=y
# CONFIG_HINIC is not set
CONFIG_NET_VENDOR_I825XX=y
CONFIG_NET_VENDOR_INTEL=y
# CONFIG_E100 is not set
# CONFIG_E1000 is not set
CONFIG_E1000E=m
CONFIG_E1000E_HWTS=y
# CONFIG_IGB is not set
# CONFIG_IGBVF is not set
# CONFIG_IXGBE is not set
# CONFIG_IXGBEVF is not set
# CONFIG_I40E is not set
# CONFIG_I40EVF is not set
# CONFIG_ICE is not set
# CONFIG_FM10K is not set
# CONFIG_IGC is not set
# CONFIG_IDPF is not set
# CONFIG_JME is not set
CONFIG_NET_VENDOR_ADI=y
CONFIG_NET_VENDOR_LITEX=y
CONFIG_NET_VENDOR_MARVELL=y
# CONFIG_MVMDIO is not set
# CONFIG_SKGE is not set
CONFIG_SKY2=m
# CONFIG_SKY2_DEBUG is not set
# CONFIG_OCTEON_EP is not set
CONFIG_NET_VENDOR_MELLANOX=y
CONFIG_MLX4_EN=m
# CONFIG_MLX4_EN_DCB is not set
CONFIG_MLX4_CORE=m
CONFIG_MLX4_DEBUG=y
CONFIG_MLX4_CORE_GEN2=y
CONFIG_MLX5_CORE=m
# CONFIG_MLX5_FPGA is not set
CONFIG_MLX5_CORE_EN=y
CONFIG_MLX5_EN_ARFS=y
CONFIG_MLX5_EN_RXNFC=y
CONFIG_MLX5_MPFS=y
CONFIG_MLX5_CORE_EN_DCB=y
# CONFIG_MLX5_CORE_IPOIB is not set
# CONFIG_MLX5_SF is not set
# CONFIG_MLX5_DPLL is not set
# CONFIG_MLXSW_CORE is not set
# CONFIG_MLXFW is not set
CONFIG_NET_VENDOR_MICREL=y
# CONFIG_KS8842 is not set
# CONFIG_KS8851 is not set
# CONFIG_KS8851_MLL is not set
# CONFIG_KSZ884X_PCI is not set
CONFIG_NET_VENDOR_MICROCHIP=y
# CONFIG_ENC28J60 is not set
# CONFIG_ENCX24J600 is not set
# CONFIG_LAN743X is not set
# CONFIG_VCAP is not set
CONFIG_NET_VENDOR_MICROSEMI=y
CONFIG_NET_VENDOR_MICROSOFT=y
CONFIG_NET_VENDOR_MYRI=y
# CONFIG_MYRI10GE is not set
# CONFIG_FEALNX is not set
CONFIG_NET_VENDOR_NI=y
# CONFIG_NI_XGE_MANAGEMENT_ENET is not set
CONFIG_NET_VENDOR_NATSEMI=y
# CONFIG_NATSEMI is not set
# CONFIG_NS83820 is not set
CONFIG_NET_VENDOR_NETERION=y
# CONFIG_S2IO is not set
CONFIG_NET_VENDOR_NETRONOME=y
# CONFIG_NFP is not set
CONFIG_NET_VENDOR_8390=y
# CONFIG_NE2K_PCI is not set
CONFIG_NET_VENDOR_NVIDIA=y
# CONFIG_FORCEDETH is not set
CONFIG_NET_VENDOR_OKI=y
# CONFIG_ETHOC is not set
CONFIG_NET_VENDOR_PACKET_ENGINES=y
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
CONFIG_NET_VENDOR_PENSANDO=y
# CONFIG_IONIC is not set
CONFIG_NET_VENDOR_QLOGIC=y
# CONFIG_QLA3XXX is not set
# CONFIG_QLCNIC is not set
# CONFIG_NETXEN_NIC is not set
# CONFIG_QED is not set
CONFIG_NET_VENDOR_BROCADE=y
# CONFIG_BNA is not set
CONFIG_NET_VENDOR_QUALCOMM=y
# CONFIG_QCOM_EMAC is not set
# CONFIG_RMNET is not set
CONFIG_NET_VENDOR_RDC=y
# CONFIG_R6040 is not set
CONFIG_NET_VENDOR_REALTEK=y
# CONFIG_8139CP is not set
# CONFIG_8139TOO is not set
# CONFIG_R8169 is not set
CONFIG_NET_VENDOR_RENESAS=y
CONFIG_NET_VENDOR_ROCKER=y
CONFIG_NET_VENDOR_SAMSUNG=y
# CONFIG_SXGBE_ETH is not set
CONFIG_NET_VENDOR_SEEQ=y
CONFIG_NET_VENDOR_SILAN=y
# CONFIG_SC92031 is not set
CONFIG_NET_VENDOR_SIS=y
# CONFIG_SIS900 is not set
# CONFIG_SIS190 is not set
CONFIG_NET_VENDOR_SOLARFLARE=y
# CONFIG_SFC is not set
# CONFIG_SFC_FALCON is not set
# CONFIG_SFC_SIENA is not set
CONFIG_NET_VENDOR_SMSC=y
# CONFIG_EPIC100 is not set
# CONFIG_SMSC911X is not set
# CONFIG_SMSC9420 is not set
CONFIG_NET_VENDOR_SOCIONEXT=y
CONFIG_NET_VENDOR_STMICRO=y
# CONFIG_STMMAC_ETH is not set
CONFIG_NET_VENDOR_SUN=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NIU is not set
CONFIG_NET_VENDOR_SYNOPSYS=y
# CONFIG_DWC_XLGMAC is not set
CONFIG_NET_VENDOR_TEHUTI=y
# CONFIG_TEHUTI is not set
CONFIG_NET_VENDOR_TI=y
# CONFIG_TI_CPSW_PHY_SEL is not set
# CONFIG_TLAN is not set
CONFIG_NET_VENDOR_VERTEXCOM=y
# CONFIG_MSE102X is not set
CONFIG_NET_VENDOR_VIA=y
# CONFIG_VIA_RHINE is not set
# CONFIG_VIA_VELOCITY is not set
CONFIG_NET_VENDOR_WANGXUN=y
# CONFIG_NGBE is not set
CONFIG_NET_VENDOR_WIZNET=y
# CONFIG_WIZNET_W5100 is not set
# CONFIG_WIZNET_W5300 is not set
CONFIG_NET_VENDOR_XILINX=y
# CONFIG_XILINX_EMACLITE is not set
# CONFIG_XILINX_AXI_EMAC is not set
# CONFIG_XILINX_LL_TEMAC is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_NET_SB1000 is not set
CONFIG_PHYLIB=y
CONFIG_SWPHY=y
CONFIG_FIXED_PHY=y

#
# MII PHY device drivers
#
# CONFIG_AMD_PHY is not set
# CONFIG_ADIN_PHY is not set
# CONFIG_ADIN1100_PHY is not set
# CONFIG_AQUANTIA_PHY is not set
# CONFIG_AX88796B_PHY is not set
# CONFIG_BROADCOM_PHY is not set
# CONFIG_BCM54140_PHY is not set
# CONFIG_BCM7XXX_PHY is not set
# CONFIG_BCM84881_PHY is not set
# CONFIG_BCM87XX_PHY is not set
# CONFIG_CICADA_PHY is not set
# CONFIG_CORTINA_PHY is not set
# CONFIG_DAVICOM_PHY is not set
# CONFIG_ICPLUS_PHY is not set
# CONFIG_LXT_PHY is not set
# CONFIG_INTEL_XWAY_PHY is not set
# CONFIG_LSI_ET1011C_PHY is not set
# CONFIG_MARVELL_PHY is not set
# CONFIG_MARVELL_10G_PHY is not set
# CONFIG_MARVELL_88Q2XXX_PHY is not set
# CONFIG_MARVELL_88X2222_PHY is not set
# CONFIG_MAXLINEAR_GPHY is not set
# CONFIG_MEDIATEK_GE_PHY is not set
# CONFIG_MICREL_PHY is not set
# CONFIG_MICROCHIP_T1S_PHY is not set
# CONFIG_MICROCHIP_PHY is not set
# CONFIG_MICROCHIP_T1_PHY is not set
# CONFIG_MICROSEMI_PHY is not set
# CONFIG_MOTORCOMM_PHY is not set
# CONFIG_NATIONAL_PHY is not set
# CONFIG_NXP_CBTX_PHY is not set
# CONFIG_NXP_C45_TJA11XX_PHY is not set
# CONFIG_NXP_TJA11XX_PHY is not set
# CONFIG_NCN26000_PHY is not set
# CONFIG_QSEMI_PHY is not set
# CONFIG_REALTEK_PHY is not set
# CONFIG_RENESAS_PHY is not set
# CONFIG_ROCKCHIP_PHY is not set
# CONFIG_SMSC_PHY is not set
# CONFIG_STE10XP is not set
# CONFIG_TERANETICS_PHY is not set
# CONFIG_DP83822_PHY is not set
# CONFIG_DP83TC811_PHY is not set
# CONFIG_DP83848_PHY is not set
# CONFIG_DP83867_PHY is not set
# CONFIG_DP83869_PHY is not set
# CONFIG_DP83TD510_PHY is not set
# CONFIG_VITESSE_PHY is not set
# CONFIG_XILINX_GMII2RGMII is not set
# CONFIG_MICREL_KS8995MA is not set
# CONFIG_PSE_CONTROLLER is not set
CONFIG_MDIO_DEVICE=y
CONFIG_MDIO_BUS=y
CONFIG_FWNODE_MDIO=y
CONFIG_ACPI_MDIO=y
CONFIG_MDIO_DEVRES=y
# CONFIG_MDIO_BITBANG is not set
# CONFIG_MDIO_BCM_UNIMAC is not set
# CONFIG_MDIO_MVUSB is not set
# CONFIG_MDIO_THUNDER is not set

#
# MDIO Multiplexers
#

#
# PCS device drivers
#
# end of PCS device drivers

# CONFIG_PPP is not set
# CONFIG_SLIP is not set
CONFIG_USB_NET_DRIVERS=y
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_RTL8152 is not set
# CONFIG_USB_LAN78XX is not set
CONFIG_USB_USBNET=m
# CONFIG_USB_NET_AX8817X is not set
# CONFIG_USB_NET_AX88179_178A is not set
CONFIG_USB_NET_CDCETHER=m
CONFIG_USB_NET_CDC_EEM=m
CONFIG_USB_NET_CDC_NCM=m
# CONFIG_USB_NET_HUAWEI_CDC_NCM is not set
# CONFIG_USB_NET_CDC_MBIM is not set
# CONFIG_USB_NET_DM9601 is not set
# CONFIG_USB_NET_SR9700 is not set
# CONFIG_USB_NET_SR9800 is not set
# CONFIG_USB_NET_SMSC75XX is not set
# CONFIG_USB_NET_SMSC95XX is not set
# CONFIG_USB_NET_GL620A is not set
# CONFIG_USB_NET_NET1080 is not set
# CONFIG_USB_NET_PLUSB is not set
# CONFIG_USB_NET_MCS7830 is not set
# CONFIG_USB_NET_RNDIS_HOST is not set
CONFIG_USB_NET_CDC_SUBSET=m
# CONFIG_USB_ALI_M5632 is not set
# CONFIG_USB_AN2720 is not set
# CONFIG_USB_BELKIN is not set
# CONFIG_USB_ARMLINUX is not set
# CONFIG_USB_EPSON2888 is not set
# CONFIG_USB_KC2190 is not set
# CONFIG_USB_NET_ZAURUS is not set
# CONFIG_USB_NET_CX82310_ETH is not set
# CONFIG_USB_NET_KALMIA is not set
# CONFIG_USB_NET_QMI_WWAN is not set
# CONFIG_USB_NET_INT51X1 is not set
# CONFIG_USB_IPHETH is not set
# CONFIG_USB_SIERRA_NET is not set
# CONFIG_USB_VL600 is not set
# CONFIG_USB_NET_CH9200 is not set
# CONFIG_USB_NET_AQC111 is not set
CONFIG_USB_RTL8153_ECM=m
# CONFIG_WLAN is not set
# CONFIG_WAN is not set

#
# Wireless WAN
#
# CONFIG_WWAN is not set
# end of Wireless WAN

# CONFIG_VMXNET3 is not set
# CONFIG_FUJITSU_ES is not set
# CONFIG_NETDEVSIM is not set
CONFIG_NET_FAILOVER=m
# CONFIG_ISDN is not set

#
# Input device support
#
CONFIG_INPUT=y
# CONFIG_INPUT_FF_MEMLESS is not set
# CONFIG_INPUT_SPARSEKMAP is not set
# CONFIG_INPUT_MATRIXKMAP is not set
CONFIG_INPUT_VIVALDIFMAP=y

#
# Userland interfaces
#
# CONFIG_INPUT_MOUSEDEV is not set
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=m
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
# CONFIG_KEYBOARD_ADP5588 is not set
# CONFIG_KEYBOARD_ADP5589 is not set
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_QT1050 is not set
# CONFIG_KEYBOARD_QT1070 is not set
# CONFIG_KEYBOARD_QT2160 is not set
# CONFIG_KEYBOARD_DLINK_DIR685 is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_GPIO is not set
# CONFIG_KEYBOARD_GPIO_POLLED is not set
# CONFIG_KEYBOARD_TCA6416 is not set
# CONFIG_KEYBOARD_TCA8418 is not set
# CONFIG_KEYBOARD_MATRIX is not set
# CONFIG_KEYBOARD_LM8333 is not set
# CONFIG_KEYBOARD_MAX7359 is not set
# CONFIG_KEYBOARD_MCS is not set
# CONFIG_KEYBOARD_MPR121 is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_OPENCORES is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_CYPRESS_SF is not set
# CONFIG_INPUT_MOUSE is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TABLET is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_AD714X is not set
# CONFIG_INPUT_BMA150 is not set
# CONFIG_INPUT_E3X0_BUTTON is not set
CONFIG_INPUT_PCSPKR=m
# CONFIG_INPUT_MMA8450 is not set
# CONFIG_INPUT_GPIO_BEEPER is not set
# CONFIG_INPUT_GPIO_DECODER is not set
# CONFIG_INPUT_GPIO_VIBRA is not set
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_KXTJ9 is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
# CONFIG_INPUT_UINPUT is not set
# CONFIG_INPUT_PCF8574 is not set
# CONFIG_INPUT_GPIO_ROTARY_ENCODER is not set
# CONFIG_INPUT_DA7280_HAPTICS is not set
# CONFIG_INPUT_ADXL34X is not set
# CONFIG_INPUT_IQS269A is not set
# CONFIG_INPUT_IQS626A is not set
# CONFIG_INPUT_IQS7222 is not set
# CONFIG_INPUT_CMA3000 is not set
# CONFIG_INPUT_IDEAPAD_SLIDEBAR is not set
# CONFIG_INPUT_DRV260X_HAPTICS is not set
# CONFIG_INPUT_DRV2665_HAPTICS is not set
# CONFIG_INPUT_DRV2667_HAPTICS is not set
# CONFIG_RMI4_CORE is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_ARCH_MIGHT_HAVE_PC_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_SERIO_ALTERA_PS2 is not set
# CONFIG_SERIO_PS2MULT is not set
# CONFIG_SERIO_ARC_PS2 is not set
# CONFIG_SERIO_GPIO_PS2 is not set
# CONFIG_USERIO is not set
# CONFIG_GAMEPORT is not set
# end of Hardware I/O ports
# end of Input device support

#
# Character devices
#
CONFIG_TTY=y
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_VT_CONSOLE_SLEEP=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_UNIX98_PTYS=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256
CONFIG_LEGACY_TIOCSTI=y
CONFIG_LDISC_AUTOLOAD=y

#
# Serial drivers
#
CONFIG_SERIAL_EARLYCON=y
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_DEPRECATED_OPTIONS=y
CONFIG_SERIAL_8250_PNP=y
# CONFIG_SERIAL_8250_16550A_VARIANTS is not set
# CONFIG_SERIAL_8250_FINTEK is not set
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_SERIAL_8250_DMA=y
CONFIG_SERIAL_8250_PCILIB=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_EXAR=y
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
# CONFIG_SERIAL_8250_PCI1XXXX is not set
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y
CONFIG_SERIAL_8250_DWLIB=y
# CONFIG_SERIAL_8250_DW is not set
# CONFIG_SERIAL_8250_RT288X is not set
CONFIG_SERIAL_8250_LPSS=y
CONFIG_SERIAL_8250_MID=y
CONFIG_SERIAL_8250_PERICOM=y

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_MAX3100 is not set
# CONFIG_SERIAL_MAX310X is not set
# CONFIG_SERIAL_UARTLITE is not set
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
# CONFIG_SERIAL_JSM is not set
# CONFIG_SERIAL_LANTIQ is not set
# CONFIG_SERIAL_SCCNXP is not set
# CONFIG_SERIAL_SC16IS7XX is not set
# CONFIG_SERIAL_ALTERA_JTAGUART is not set
# CONFIG_SERIAL_ALTERA_UART is not set
# CONFIG_SERIAL_ARC is not set
# CONFIG_SERIAL_RP2 is not set
# CONFIG_SERIAL_FSL_LPUART is not set
# CONFIG_SERIAL_FSL_LINFLEXUART is not set
# end of Serial drivers

CONFIG_SERIAL_MCTRL_GPIO=y
# CONFIG_SERIAL_NONSTANDARD is not set
# CONFIG_N_GSM is not set
# CONFIG_NOZOMI is not set
# CONFIG_NULL_TTY is not set
CONFIG_HVC_DRIVER=y
# CONFIG_SERIAL_DEV_BUS is not set
# CONFIG_TTY_PRINTK is not set
CONFIG_VIRTIO_CONSOLE=y
CONFIG_IPMI_HANDLER=y
CONFIG_IPMI_DMI_DECODE=y
CONFIG_IPMI_PLAT_DATA=y
# CONFIG_IPMI_PANIC_EVENT is not set
CONFIG_IPMI_DEVICE_INTERFACE=y
CONFIG_IPMI_SI=y
# CONFIG_IPMI_SSIF is not set
CONFIG_IPMI_WATCHDOG=y
CONFIG_IPMI_POWEROFF=m
CONFIG_HW_RANDOM=y
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
CONFIG_HW_RANDOM_INTEL=y
CONFIG_HW_RANDOM_AMD=y
# CONFIG_HW_RANDOM_BA431 is not set
CONFIG_HW_RANDOM_VIA=y
CONFIG_HW_RANDOM_VIRTIO=y
# CONFIG_HW_RANDOM_XIPHERA is not set
# CONFIG_APPLICOM is not set
# CONFIG_MWAVE is not set
CONFIG_DEVMEM=y
CONFIG_NVRAM=m
CONFIG_DEVPORT=y
CONFIG_HPET=y
# CONFIG_HPET_MMAP is not set
# CONFIG_HANGCHECK_TIMER is not set
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
# CONFIG_XILLYBUS is not set
# CONFIG_XILLYUSB is not set
# end of Character devices

#
# I2C support
#
CONFIG_I2C=y
CONFIG_ACPI_I2C_OPREGION=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
CONFIG_I2C_CHARDEV=y
CONFIG_I2C_MUX=m

#
# Multiplexer I2C Chip support
#
# CONFIG_I2C_MUX_GPIO is not set
# CONFIG_I2C_MUX_LTC4306 is not set
CONFIG_I2C_MUX_PCA9541=m
CONFIG_I2C_MUX_PCA954x=m
# CONFIG_I2C_MUX_REG is not set
# CONFIG_I2C_MUX_MLXCPLD is not set
# end of Multiplexer I2C Chip support

CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_SMBUS=y

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_AMD_MP2 is not set
CONFIG_I2C_I801=y
# CONFIG_I2C_ISCH is not set
# CONFIG_I2C_ISMT is not set
CONFIG_I2C_PIIX4=m
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_NVIDIA_GPU is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set

#
# ACPI drivers
#
# CONFIG_I2C_SCMI is not set

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_CBUS_GPIO is not set
# CONFIG_I2C_DESIGNWARE_PCI is not set
# CONFIG_I2C_GPIO is not set
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_PCA_PLATFORM is not set
# CONFIG_I2C_SIMTEC is not set
# CONFIG_I2C_XILINX is not set

#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_DIOLAN_U2C is not set
# CONFIG_I2C_CP2615 is not set
# CONFIG_I2C_PCI1XXXX is not set
# CONFIG_I2C_ROBOTFUZZ_OSIF is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_MLXCPLD is not set
# CONFIG_I2C_VIRTIO is not set
# end of I2C Hardware Bus support

# CONFIG_I2C_STUB is not set
# CONFIG_I2C_SLAVE is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# end of I2C support

# CONFIG_I3C is not set
CONFIG_SPI=y
# CONFIG_SPI_DEBUG is not set
CONFIG_SPI_MASTER=y
CONFIG_SPI_MEM=y

#
# SPI Master Controller Drivers
#
# CONFIG_SPI_ALTERA is not set
# CONFIG_SPI_AXI_SPI_ENGINE is not set
CONFIG_SPI_BITBANG=m
# CONFIG_SPI_CADENCE is not set
# CONFIG_SPI_DESIGNWARE is not set
# CONFIG_SPI_GPIO is not set
CONFIG_SPI_INTEL=y
CONFIG_SPI_INTEL_PCI=y
CONFIG_SPI_INTEL_PLATFORM=y
# CONFIG_SPI_MICROCHIP_CORE is not set
# CONFIG_SPI_MICROCHIP_CORE_QSPI is not set
# CONFIG_SPI_LANTIQ_SSC is not set
# CONFIG_SPI_OC_TINY is not set
# CONFIG_SPI_PCI1XXXX is not set
# CONFIG_SPI_PXA2XX is not set
# CONFIG_SPI_SC18IS602 is not set
# CONFIG_SPI_SIFIVE is not set
# CONFIG_SPI_MXIC is not set
# CONFIG_SPI_XCOMM is not set
CONFIG_SPI_XILINX=m
# CONFIG_SPI_ZYNQMP_GQSPI is not set
# CONFIG_SPI_AMD is not set

#
# SPI Multiplexer support
#
# CONFIG_SPI_MUX is not set

#
# SPI Protocol Masters
#
CONFIG_SPI_SPIDEV=m
# CONFIG_SPI_LOOPBACK_TEST is not set
# CONFIG_SPI_TLE62X0 is not set
# CONFIG_SPI_SLAVE is not set
CONFIG_SPI_DYNAMIC=y
# CONFIG_SPMI is not set
# CONFIG_HSI is not set
CONFIG_PPS=y
# CONFIG_PPS_DEBUG is not set

#
# PPS clients support
#
# CONFIG_PPS_CLIENT_KTIMER is not set
# CONFIG_PPS_CLIENT_LDISC is not set
# CONFIG_PPS_CLIENT_GPIO is not set

#
# PPS generators support
#

#
# PTP clock support
#
CONFIG_PTP_1588_CLOCK=y
CONFIG_PTP_1588_CLOCK_OPTIONAL=y

#
# Enable PHYLIB and NETWORK_PHY_TIMESTAMPING to see the additional clocks.
#
# CONFIG_PTP_1588_CLOCK_IDT82P33 is not set
# CONFIG_PTP_1588_CLOCK_IDTCM is not set
# CONFIG_PTP_1588_CLOCK_MOCK is not set
# end of PTP clock support

# CONFIG_PINCTRL is not set
CONFIG_GPIOLIB=y
CONFIG_GPIOLIB_FASTPATH_LIMIT=512
CONFIG_GPIO_ACPI=y
# CONFIG_DEBUG_GPIO is not set
# CONFIG_GPIO_SYSFS is not set
CONFIG_GPIO_CDEV=y
CONFIG_GPIO_CDEV_V1=y

#
# Memory mapped GPIO drivers
#
# CONFIG_GPIO_AMDPT is not set
# CONFIG_GPIO_DWAPB is not set
# CONFIG_GPIO_EXAR is not set
# CONFIG_GPIO_GENERIC_PLATFORM is not set
# CONFIG_GPIO_ICH is not set
# CONFIG_GPIO_MB86S7X is not set
# CONFIG_GPIO_AMD_FCH is not set
# end of Memory mapped GPIO drivers

#
# Port-mapped I/O GPIO drivers
#
# CONFIG_GPIO_VX855 is not set
# CONFIG_GPIO_F7188X is not set
# CONFIG_GPIO_IT87 is not set
# CONFIG_GPIO_SCH311X is not set
# CONFIG_GPIO_WINBOND is not set
# CONFIG_GPIO_WS16C48 is not set
# end of Port-mapped I/O GPIO drivers

#
# I2C GPIO expanders
#
# CONFIG_GPIO_FXL6408 is not set
# CONFIG_GPIO_DS4520 is not set
# CONFIG_GPIO_MAX7300 is not set
# CONFIG_GPIO_MAX732X is not set
# CONFIG_GPIO_PCA953X is not set
# CONFIG_GPIO_PCA9570 is not set
# CONFIG_GPIO_PCF857X is not set
# CONFIG_GPIO_TPIC2810 is not set
# end of I2C GPIO expanders

#
# MFD GPIO expanders
#
# CONFIG_GPIO_ELKHARTLAKE is not set
# end of MFD GPIO expanders

#
# PCI GPIO expanders
#
# CONFIG_GPIO_AMD8111 is not set
# CONFIG_GPIO_BT8XX is not set
# CONFIG_GPIO_ML_IOH is not set
# CONFIG_GPIO_PCI_IDIO_16 is not set
# CONFIG_GPIO_PCIE_IDIO_24 is not set
# CONFIG_GPIO_RDC321X is not set
# end of PCI GPIO expanders

#
# SPI GPIO expanders
#
# CONFIG_GPIO_MAX3191X is not set
# CONFIG_GPIO_MAX7301 is not set
# CONFIG_GPIO_MC33880 is not set
# CONFIG_GPIO_PISOSR is not set
# CONFIG_GPIO_XRA1403 is not set
# end of SPI GPIO expanders

#
# USB GPIO expanders
#
# end of USB GPIO expanders

#
# Virtual GPIO drivers
#
# CONFIG_GPIO_AGGREGATOR is not set
# CONFIG_GPIO_LATCH is not set
# CONFIG_GPIO_MOCKUP is not set
# CONFIG_GPIO_VIRTIO is not set
# CONFIG_GPIO_SIM is not set
# end of Virtual GPIO drivers

CONFIG_W1=m
CONFIG_W1_CON=y

#
# 1-wire Bus Masters
#
# CONFIG_W1_MASTER_MATROX is not set
# CONFIG_W1_MASTER_DS2490 is not set
CONFIG_W1_MASTER_DS2482=m
# CONFIG_W1_MASTER_GPIO is not set
# CONFIG_W1_MASTER_SGI is not set
# end of 1-wire Bus Masters

#
# 1-wire Slaves
#
CONFIG_W1_SLAVE_THERM=m
# CONFIG_W1_SLAVE_SMEM is not set
# CONFIG_W1_SLAVE_DS2405 is not set
# CONFIG_W1_SLAVE_DS2408 is not set
# CONFIG_W1_SLAVE_DS2413 is not set
# CONFIG_W1_SLAVE_DS2406 is not set
# CONFIG_W1_SLAVE_DS2423 is not set
# CONFIG_W1_SLAVE_DS2805 is not set
# CONFIG_W1_SLAVE_DS2430 is not set
# CONFIG_W1_SLAVE_DS2431 is not set
# CONFIG_W1_SLAVE_DS2433 is not set
# CONFIG_W1_SLAVE_DS2438 is not set
# CONFIG_W1_SLAVE_DS250X is not set
# CONFIG_W1_SLAVE_DS2780 is not set
# CONFIG_W1_SLAVE_DS2781 is not set
# CONFIG_W1_SLAVE_DS28E04 is not set
# CONFIG_W1_SLAVE_DS28E17 is not set
# end of 1-wire Slaves

# CONFIG_POWER_RESET is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
CONFIG_POWER_SUPPLY_HWMON=y
# CONFIG_IP5XXX_POWER is not set
# CONFIG_TEST_POWER is not set
# CONFIG_CHARGER_ADP5061 is not set
# CONFIG_BATTERY_CW2015 is not set
# CONFIG_BATTERY_DS2760 is not set
# CONFIG_BATTERY_DS2780 is not set
# CONFIG_BATTERY_DS2781 is not set
# CONFIG_BATTERY_DS2782 is not set
# CONFIG_BATTERY_SAMSUNG_SDI is not set
# CONFIG_BATTERY_SBS is not set
# CONFIG_CHARGER_SBS is not set
# CONFIG_MANAGER_SBS is not set
# CONFIG_BATTERY_BQ27XXX is not set
# CONFIG_BATTERY_MAX17042 is not set
# CONFIG_BATTERY_MAX1721X is not set
# CONFIG_CHARGER_MAX8903 is not set
# CONFIG_CHARGER_LP8727 is not set
# CONFIG_CHARGER_GPIO is not set
# CONFIG_CHARGER_LT3651 is not set
# CONFIG_CHARGER_LTC4162L is not set
# CONFIG_CHARGER_MAX77976 is not set
# CONFIG_CHARGER_BQ2415X is not set
# CONFIG_CHARGER_BQ24257 is not set
# CONFIG_CHARGER_BQ24735 is not set
# CONFIG_CHARGER_BQ2515X is not set
# CONFIG_CHARGER_BQ25890 is not set
# CONFIG_CHARGER_BQ25980 is not set
# CONFIG_CHARGER_BQ256XX is not set
# CONFIG_BATTERY_GAUGE_LTC2941 is not set
# CONFIG_BATTERY_GOLDFISH is not set
# CONFIG_BATTERY_RT5033 is not set
# CONFIG_CHARGER_RT9455 is not set
# CONFIG_CHARGER_BD99954 is not set
# CONFIG_BATTERY_UG3105 is not set
# CONFIG_FUEL_GAUGE_MM8013 is not set
CONFIG_HWMON=y
CONFIG_HWMON_VID=m
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Native drivers
#
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7314 is not set
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM1177 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7310 is not set
# CONFIG_SENSORS_ADT7410 is not set
# CONFIG_SENSORS_ADT7411 is not set
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7475 is not set
# CONFIG_SENSORS_AHT10 is not set
# CONFIG_SENSORS_AQUACOMPUTER_D5NEXT is not set
# CONFIG_SENSORS_AS370 is not set
# CONFIG_SENSORS_ASC7621 is not set
# CONFIG_SENSORS_AXI_FAN_CONTROL is not set
# CONFIG_SENSORS_K8TEMP is not set
CONFIG_SENSORS_K10TEMP=m
# CONFIG_SENSORS_FAM15H_POWER is not set
# CONFIG_SENSORS_APPLESMC is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_CORSAIR_CPRO is not set
# CONFIG_SENSORS_CORSAIR_PSU is not set
# CONFIG_SENSORS_DRIVETEMP is not set
# CONFIG_SENSORS_DS620 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_DELL_SMM is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_FTSTEUTATES is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_G760A is not set
# CONFIG_SENSORS_G762 is not set
# CONFIG_SENSORS_HIH6130 is not set
# CONFIG_SENSORS_HS3001 is not set
# CONFIG_SENSORS_IBMAEM is not set
# CONFIG_SENSORS_IBMPEX is not set
# CONFIG_SENSORS_I5500 is not set
CONFIG_SENSORS_CORETEMP=m
CONFIG_SENSORS_IT87=m
# CONFIG_SENSORS_JC42 is not set
# CONFIG_SENSORS_POWERZ is not set
# CONFIG_SENSORS_POWR1220 is not set
# CONFIG_SENSORS_LINEAGE is not set
# CONFIG_SENSORS_LTC2945 is not set
# CONFIG_SENSORS_LTC2947_I2C is not set
# CONFIG_SENSORS_LTC2947_SPI is not set
# CONFIG_SENSORS_LTC2990 is not set
# CONFIG_SENSORS_LTC2991 is not set
# CONFIG_SENSORS_LTC2992 is not set
# CONFIG_SENSORS_LTC4151 is not set
# CONFIG_SENSORS_LTC4215 is not set
# CONFIG_SENSORS_LTC4222 is not set
# CONFIG_SENSORS_LTC4245 is not set
# CONFIG_SENSORS_LTC4260 is not set
# CONFIG_SENSORS_LTC4261 is not set
# CONFIG_SENSORS_MAX1111 is not set
# CONFIG_SENSORS_MAX127 is not set
# CONFIG_SENSORS_MAX16065 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX1668 is not set
# CONFIG_SENSORS_MAX197 is not set
# CONFIG_SENSORS_MAX31722 is not set
# CONFIG_SENSORS_MAX31730 is not set
# CONFIG_SENSORS_MAX31760 is not set
# CONFIG_MAX31827 is not set
# CONFIG_SENSORS_MAX6620 is not set
# CONFIG_SENSORS_MAX6621 is not set
# CONFIG_SENSORS_MAX6639 is not set
# CONFIG_SENSORS_MAX6642 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_MAX6697 is not set
# CONFIG_SENSORS_MAX31790 is not set
# CONFIG_SENSORS_MC34VR500 is not set
# CONFIG_SENSORS_MCP3021 is not set
# CONFIG_SENSORS_TC654 is not set
# CONFIG_SENSORS_TPS23861 is not set
# CONFIG_SENSORS_MR75203 is not set
# CONFIG_SENSORS_ADCXX is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM70 is not set
# CONFIG_SENSORS_LM73 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LM95234 is not set
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_LM95245 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_NCT6683 is not set
# CONFIG_SENSORS_NCT6775 is not set
# CONFIG_SENSORS_NCT6775_I2C is not set
# CONFIG_SENSORS_NCT7802 is not set
# CONFIG_SENSORS_NCT7904 is not set
# CONFIG_SENSORS_NPCM7XX is not set
# CONFIG_SENSORS_NZXT_KRAKEN2 is not set
# CONFIG_SENSORS_NZXT_SMART2 is not set
# CONFIG_SENSORS_OCC_P8_I2C is not set
# CONFIG_SENSORS_OXP is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_PMBUS is not set
# CONFIG_SENSORS_SBTSI is not set
# CONFIG_SENSORS_SBRMI is not set
# CONFIG_SENSORS_SHT15 is not set
# CONFIG_SENSORS_SHT21 is not set
# CONFIG_SENSORS_SHT3x is not set
# CONFIG_SENSORS_SHT4x is not set
# CONFIG_SENSORS_SHTC1 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_EMC1403 is not set
# CONFIG_SENSORS_EMC2103 is not set
# CONFIG_SENSORS_EMC2305 is not set
# CONFIG_SENSORS_EMC6W201 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_SCH5627 is not set
# CONFIG_SENSORS_SCH5636 is not set
# CONFIG_SENSORS_STTS751 is not set
# CONFIG_SENSORS_ADC128D818 is not set
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_ADS7871 is not set
# CONFIG_SENSORS_AMC6821 is not set
# CONFIG_SENSORS_INA209 is not set
# CONFIG_SENSORS_INA2XX is not set
# CONFIG_SENSORS_INA238 is not set
# CONFIG_SENSORS_INA3221 is not set
# CONFIG_SENSORS_TC74 is not set
# CONFIG_SENSORS_THMC50 is not set
# CONFIG_SENSORS_TMP102 is not set
# CONFIG_SENSORS_TMP103 is not set
# CONFIG_SENSORS_TMP108 is not set
# CONFIG_SENSORS_TMP401 is not set
# CONFIG_SENSORS_TMP421 is not set
# CONFIG_SENSORS_TMP464 is not set
# CONFIG_SENSORS_TMP513 is not set
# CONFIG_SENSORS_VIA_CPUTEMP is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83773G is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83795 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set

#
# ACPI drivers
#
# CONFIG_SENSORS_ACPI_POWER is not set
# CONFIG_SENSORS_ATK0110 is not set
# CONFIG_SENSORS_ASUS_EC is not set
CONFIG_THERMAL=y
# CONFIG_THERMAL_NETLINK is not set
# CONFIG_THERMAL_STATISTICS is not set
CONFIG_THERMAL_EMERGENCY_POWEROFF_DELAY_MS=0
CONFIG_THERMAL_HWMON=y
CONFIG_THERMAL_WRITABLE_TRIPS=y
CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE=y
# CONFIG_THERMAL_DEFAULT_GOV_FAIR_SHARE is not set
# CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE is not set
# CONFIG_THERMAL_GOV_FAIR_SHARE is not set
CONFIG_THERMAL_GOV_STEP_WISE=y
# CONFIG_THERMAL_GOV_BANG_BANG is not set
CONFIG_THERMAL_GOV_USER_SPACE=y
# CONFIG_THERMAL_EMULATION is not set

#
# Intel thermal drivers
#
# CONFIG_INTEL_POWERCLAMP is not set
CONFIG_X86_THERMAL_VECTOR=y
CONFIG_INTEL_TCC=y
CONFIG_X86_PKG_TEMP_THERMAL=m
# CONFIG_INTEL_SOC_DTS_THERMAL is not set

#
# ACPI INT340X thermal drivers
#
# CONFIG_INT340X_THERMAL is not set
# end of ACPI INT340X thermal drivers

# CONFIG_INTEL_PCH_THERMAL is not set
# CONFIG_INTEL_TCC_COOLING is not set
# CONFIG_INTEL_HFI_THERMAL is not set
# end of Intel thermal drivers

CONFIG_WATCHDOG=y
CONFIG_WATCHDOG_CORE=y
# CONFIG_WATCHDOG_NOWAYOUT is not set
CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED=y
CONFIG_WATCHDOG_OPEN_TIMEOUT=0
# CONFIG_WATCHDOG_SYSFS is not set
# CONFIG_WATCHDOG_HRTIMER_PRETIMEOUT is not set

#
# Watchdog Pretimeout Governors
#
# CONFIG_WATCHDOG_PRETIMEOUT_GOV is not set

#
# Watchdog Device Drivers
#
# CONFIG_SOFT_WATCHDOG is not set
# CONFIG_WDAT_WDT is not set
# CONFIG_XILINX_WATCHDOG is not set
# CONFIG_ZIIRAVE_WATCHDOG is not set
# CONFIG_CADENCE_WATCHDOG is not set
# CONFIG_DW_WATCHDOG is not set
# CONFIG_MAX63XX_WATCHDOG is not set
# CONFIG_ACQUIRE_WDT is not set
# CONFIG_ADVANTECH_WDT is not set
# CONFIG_ADVANTECH_EC_WDT is not set
# CONFIG_ALIM1535_WDT is not set
# CONFIG_ALIM7101_WDT is not set
# CONFIG_EBC_C384_WDT is not set
# CONFIG_EXAR_WDT is not set
# CONFIG_F71808E_WDT is not set
# CONFIG_SP5100_TCO is not set
# CONFIG_SBC_FITPC2_WATCHDOG is not set
# CONFIG_EUROTECH_WDT is not set
# CONFIG_IB700_WDT is not set
# CONFIG_IBMASR is not set
# CONFIG_WAFER_WDT is not set
# CONFIG_I6300ESB_WDT is not set
# CONFIG_IE6XX_WDT is not set
CONFIG_ITCO_WDT=y
# CONFIG_ITCO_VENDOR_SUPPORT is not set
CONFIG_IT8712F_WDT=m
# CONFIG_IT87_WDT is not set
# CONFIG_HP_WATCHDOG is not set
# CONFIG_SC1200_WDT is not set
# CONFIG_PC87413_WDT is not set
# CONFIG_NV_TCO is not set
# CONFIG_60XX_WDT is not set
# CONFIG_CPU5_WDT is not set
# CONFIG_SMSC_SCH311X_WDT is not set
# CONFIG_SMSC37B787_WDT is not set
# CONFIG_TQMX86_WDT is not set
# CONFIG_VIA_WDT is not set
# CONFIG_W83627HF_WDT is not set
# CONFIG_W83877F_WDT is not set
# CONFIG_W83977F_WDT is not set
# CONFIG_MACHZ_WDT is not set
# CONFIG_SBC_EPX_C3_WATCHDOG is not set
# CONFIG_NI903X_WDT is not set
# CONFIG_NIC7018_WDT is not set
# CONFIG_MEN_A21_WDT is not set

#
# PCI-based Watchdog Cards
#
# CONFIG_PCIPCWATCHDOG is not set
# CONFIG_WDTPCI is not set

#
# USB-based Watchdog Cards
#
# CONFIG_USBPCWATCHDOG is not set
CONFIG_SSB_POSSIBLE=y
# CONFIG_SSB is not set
CONFIG_BCMA_POSSIBLE=y
# CONFIG_BCMA is not set

#
# Multifunction device drivers
#
CONFIG_MFD_CORE=y
# CONFIG_MFD_AS3711 is not set
# CONFIG_MFD_SMPRO is not set
# CONFIG_PMIC_ADP5520 is not set
# CONFIG_MFD_AAT2870_CORE is not set
# CONFIG_MFD_BCM590XX is not set
# CONFIG_MFD_BD9571MWV is not set
# CONFIG_MFD_AXP20X_I2C is not set
# CONFIG_MFD_CS42L43_I2C is not set
# CONFIG_MFD_MADERA is not set
# CONFIG_PMIC_DA903X is not set
# CONFIG_MFD_DA9052_SPI is not set
# CONFIG_MFD_DA9052_I2C is not set
# CONFIG_MFD_DA9055 is not set
# CONFIG_MFD_DA9062 is not set
# CONFIG_MFD_DA9063 is not set
# CONFIG_MFD_DA9150 is not set
# CONFIG_MFD_DLN2 is not set
# CONFIG_MFD_MC13XXX_SPI is not set
# CONFIG_MFD_MC13XXX_I2C is not set
# CONFIG_MFD_MP2629 is not set
CONFIG_LPC_ICH=y
# CONFIG_LPC_SCH is not set
# CONFIG_MFD_INTEL_LPSS_ACPI is not set
# CONFIG_MFD_INTEL_LPSS_PCI is not set
# CONFIG_MFD_INTEL_PMC_BXT is not set
# CONFIG_MFD_IQS62X is not set
# CONFIG_MFD_JANZ_CMODIO is not set
# CONFIG_MFD_KEMPLD is not set
# CONFIG_MFD_88PM800 is not set
# CONFIG_MFD_88PM805 is not set
# CONFIG_MFD_88PM860X is not set
# CONFIG_MFD_MAX14577 is not set
# CONFIG_MFD_MAX77541 is not set
# CONFIG_MFD_MAX77693 is not set
# CONFIG_MFD_MAX77843 is not set
# CONFIG_MFD_MAX8907 is not set
# CONFIG_MFD_MAX8925 is not set
# CONFIG_MFD_MAX8997 is not set
# CONFIG_MFD_MAX8998 is not set
# CONFIG_MFD_MT6360 is not set
# CONFIG_MFD_MT6370 is not set
# CONFIG_MFD_MT6397 is not set
# CONFIG_MFD_MENF21BMC is not set
# CONFIG_MFD_OCELOT is not set
# CONFIG_EZX_PCAP is not set
# CONFIG_MFD_VIPERBOARD is not set
# CONFIG_MFD_RETU is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_MFD_SY7636A is not set
# CONFIG_MFD_RDC321X is not set
# CONFIG_MFD_RT4831 is not set
# CONFIG_MFD_RT5033 is not set
# CONFIG_MFD_RT5120 is not set
# CONFIG_MFD_RC5T583 is not set
# CONFIG_MFD_SI476X_CORE is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_MFD_SKY81452 is not set
# CONFIG_MFD_SYSCON is not set
# CONFIG_MFD_TI_AM335X_TSCADC is not set
# CONFIG_MFD_LP3943 is not set
# CONFIG_MFD_LP8788 is not set
# CONFIG_MFD_TI_LMU is not set
# CONFIG_MFD_PALMAS is not set
# CONFIG_TPS6105X is not set
# CONFIG_TPS65010 is not set
# CONFIG_TPS6507X is not set
# CONFIG_MFD_TPS65086 is not set
# CONFIG_MFD_TPS65090 is not set
# CONFIG_MFD_TI_LP873X is not set
# CONFIG_MFD_TPS6586X is not set
# CONFIG_MFD_TPS65910 is not set
# CONFIG_MFD_TPS65912_I2C is not set
# CONFIG_MFD_TPS65912_SPI is not set
# CONFIG_MFD_TPS6594_I2C is not set
# CONFIG_MFD_TPS6594_SPI is not set
# CONFIG_TWL4030_CORE is not set
# CONFIG_TWL6040_CORE is not set
# CONFIG_MFD_WL1273_CORE is not set
# CONFIG_MFD_LM3533 is not set
# CONFIG_MFD_TQMX86 is not set
# CONFIG_MFD_VX855 is not set
# CONFIG_MFD_ARIZONA_I2C is not set
# CONFIG_MFD_ARIZONA_SPI is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM831X_I2C is not set
# CONFIG_MFD_WM831X_SPI is not set
# CONFIG_MFD_WM8350_I2C is not set
# CONFIG_MFD_WM8994 is not set
# CONFIG_MFD_ATC260X_I2C is not set
# CONFIG_MFD_INTEL_M10_BMC_SPI is not set
# end of Multifunction device drivers

# CONFIG_REGULATOR is not set
# CONFIG_RC_CORE is not set

#
# CEC support
#
# CONFIG_MEDIA_CEC_SUPPORT is not set
# end of CEC support

# CONFIG_MEDIA_SUPPORT is not set

#
# Graphics support
#
CONFIG_APERTURE_HELPERS=y
CONFIG_VIDEO_CMDLINE=y
# CONFIG_AUXDISPLAY is not set
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
# CONFIG_AGP_INTEL is not set
# CONFIG_AGP_SIS is not set
# CONFIG_AGP_VIA is not set
# CONFIG_VGA_SWITCHEROO is not set
# CONFIG_DRM is not set
# CONFIG_DRM_DEBUG_MODESET_LOCK is not set
CONFIG_DRM_PANEL_ORIENTATION_QUIRKS=y

#
# Frame buffer Devices
#
CONFIG_FB=y
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_UVESA is not set
# CONFIG_FB_VESA is not set
CONFIG_FB_EFI=y
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_OPENCORES is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_I740 is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_VIA is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_SMSCUFX is not set
# CONFIG_FB_UDL is not set
# CONFIG_FB_IBM_GXT4500 is not set
# CONFIG_FB_VIRTUAL is not set
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_SIMPLE is not set
# CONFIG_FB_SSD1307 is not set
# CONFIG_FB_SM712 is not set
CONFIG_FB_CORE=y
CONFIG_FB_NOTIFY=y
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB_DEVICE=y
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_FOREIGN_ENDIAN is not set
CONFIG_FB_IOMEM_FOPS=y
CONFIG_FB_IOMEM_HELPERS=y
# CONFIG_FB_MODE_HELPERS is not set
# CONFIG_FB_TILEBLITTING is not set
# end of Frame buffer Devices

#
# Backlight & LCD device support
#
# CONFIG_LCD_CLASS_DEVICE is not set
# CONFIG_BACKLIGHT_CLASS_DEVICE is not set
# end of Backlight & LCD device support

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_DUMMY_CONSOLE_COLUMNS=80
CONFIG_DUMMY_CONSOLE_ROWS=25
CONFIG_FRAMEBUFFER_CONSOLE=y
# CONFIG_FRAMEBUFFER_CONSOLE_LEGACY_ACCELERATION is not set
# CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY is not set
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
# CONFIG_FRAMEBUFFER_CONSOLE_DEFERRED_TAKEOVER is not set
# end of Console display driver support

# CONFIG_LOGO is not set
# end of Graphics support

# CONFIG_SOUND is not set
CONFIG_HID_SUPPORT=y
CONFIG_HID=y
# CONFIG_HID_BATTERY_STRENGTH is not set
CONFIG_HIDRAW=y
# CONFIG_UHID is not set
CONFIG_HID_GENERIC=y

#
# Special HID drivers
#
# CONFIG_HID_A4TECH is not set
# CONFIG_HID_ACCUTOUCH is not set
# CONFIG_HID_ACRUX is not set
# CONFIG_HID_APPLEIR is not set
# CONFIG_HID_AUREAL is not set
# CONFIG_HID_BELKIN is not set
# CONFIG_HID_BETOP_FF is not set
# CONFIG_HID_CHERRY is not set
# CONFIG_HID_CHICONY is not set
# CONFIG_HID_COUGAR is not set
# CONFIG_HID_MACALLY is not set
# CONFIG_HID_CMEDIA is not set
# CONFIG_HID_CP2112 is not set
# CONFIG_HID_CREATIVE_SB0540 is not set
# CONFIG_HID_CYPRESS is not set
# CONFIG_HID_DRAGONRISE is not set
# CONFIG_HID_EMS_FF is not set
# CONFIG_HID_ELECOM is not set
# CONFIG_HID_ELO is not set
# CONFIG_HID_EVISION is not set
# CONFIG_HID_EZKEY is not set
# CONFIG_HID_FT260 is not set
# CONFIG_HID_GEMBIRD is not set
# CONFIG_HID_GFRM is not set
# CONFIG_HID_GLORIOUS is not set
# CONFIG_HID_HOLTEK is not set
# CONFIG_HID_GOOGLE_STADIA_FF is not set
# CONFIG_HID_VIVALDI is not set
# CONFIG_HID_KEYTOUCH is not set
# CONFIG_HID_KYE is not set
# CONFIG_HID_UCLOGIC is not set
# CONFIG_HID_WALTOP is not set
# CONFIG_HID_VIEWSONIC is not set
# CONFIG_HID_VRC2 is not set
# CONFIG_HID_XIAOMI is not set
# CONFIG_HID_GYRATION is not set
# CONFIG_HID_ICADE is not set
# CONFIG_HID_ITE is not set
# CONFIG_HID_JABRA is not set
# CONFIG_HID_TWINHAN is not set
# CONFIG_HID_KENSINGTON is not set
# CONFIG_HID_LCPOWER is not set
# CONFIG_HID_LENOVO is not set
# CONFIG_HID_LETSKETCH is not set
# CONFIG_HID_MAGICMOUSE is not set
# CONFIG_HID_MALTRON is not set
# CONFIG_HID_MAYFLASH is not set
# CONFIG_HID_MEGAWORLD_FF is not set
# CONFIG_HID_REDRAGON is not set
# CONFIG_HID_MICROSOFT is not set
# CONFIG_HID_MONTEREY is not set
# CONFIG_HID_MULTITOUCH is not set
# CONFIG_HID_NTI is not set
# CONFIG_HID_NTRIG is not set
# CONFIG_HID_ORTEK is not set
# CONFIG_HID_PANTHERLORD is not set
# CONFIG_HID_PENMOUNT is not set
# CONFIG_HID_PETALYNX is not set
# CONFIG_HID_PICOLCD is not set
# CONFIG_HID_PLANTRONICS is not set
# CONFIG_HID_PXRC is not set
# CONFIG_HID_RAZER is not set
# CONFIG_HID_PRIMAX is not set
# CONFIG_HID_RETRODE is not set
# CONFIG_HID_ROCCAT is not set
# CONFIG_HID_SAITEK is not set
# CONFIG_HID_SAMSUNG is not set
# CONFIG_HID_SEMITEK is not set
# CONFIG_HID_SIGMAMICRO is not set
# CONFIG_HID_SPEEDLINK is not set
# CONFIG_HID_STEAM is not set
# CONFIG_HID_STEELSERIES is not set
# CONFIG_HID_SUNPLUS is not set
# CONFIG_HID_RMI is not set
# CONFIG_HID_GREENASIA is not set
# CONFIG_HID_SMARTJOYPLUS is not set
# CONFIG_HID_TIVO is not set
# CONFIG_HID_TOPSEED is not set
# CONFIG_HID_TOPRE is not set
# CONFIG_HID_THRUSTMASTER is not set
# CONFIG_HID_UDRAW_PS3 is not set
# CONFIG_HID_WACOM is not set
# CONFIG_HID_XINMO is not set
# CONFIG_HID_ZEROPLUS is not set
# CONFIG_HID_ZYDACRON is not set
# CONFIG_HID_SENSOR_HUB is not set
# CONFIG_HID_ALPS is not set
# CONFIG_HID_MCP2221 is not set
# end of Special HID drivers

#
# HID-BPF support
#
# CONFIG_HID_BPF is not set
# end of HID-BPF support

#
# USB HID support
#
CONFIG_USB_HID=y
CONFIG_HID_PID=y
CONFIG_USB_HIDDEV=y
# end of USB HID support

CONFIG_I2C_HID=y
# CONFIG_I2C_HID_ACPI is not set
# CONFIG_I2C_HID_OF is not set

#
# Intel ISH HID support
#
# CONFIG_INTEL_ISH_HID is not set
# end of Intel ISH HID support

#
# AMD SFH HID Support
#
# CONFIG_AMD_SFH_HID is not set
# end of AMD SFH HID Support

CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_COMMON=y
# CONFIG_USB_ULPI_BUS is not set
# CONFIG_USB_CONN_GPIO is not set
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB=y
CONFIG_USB_PCI=y
CONFIG_USB_PCI_AMD=y
CONFIG_USB_ANNOUNCE_NEW_DEVICES=y

#
# Miscellaneous USB options
#
CONFIG_USB_DEFAULT_PERSIST=y
# CONFIG_USB_FEW_INIT_RETRIES is not set
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_OTG is not set
# CONFIG_USB_OTG_PRODUCTLIST is not set
# CONFIG_USB_OTG_DISABLE_EXTERNAL_HUB is not set
CONFIG_USB_AUTOSUSPEND_DELAY=2
# CONFIG_USB_MON is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_XHCI_HCD=m
# CONFIG_USB_XHCI_DBGCAP is not set
CONFIG_USB_XHCI_PCI=m
# CONFIG_USB_XHCI_PCI_RENESAS is not set
# CONFIG_USB_XHCI_PLATFORM is not set
CONFIG_USB_EHCI_HCD=m
# CONFIG_USB_EHCI_ROOT_HUB_TT is not set
# CONFIG_USB_EHCI_TT_NEWSCHED is not set
CONFIG_USB_EHCI_PCI=m
# CONFIG_USB_EHCI_FSL is not set
# CONFIG_USB_EHCI_HCD_PLATFORM is not set
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_MAX3421_HCD is not set
CONFIG_USB_OHCI_HCD=m
CONFIG_USB_OHCI_HCD_PCI=m
# CONFIG_USB_OHCI_HCD_PLATFORM is not set
CONFIG_USB_UHCI_HCD=m
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_HCD_TEST_MODE is not set

#
# USB Device Class drivers
#
CONFIG_USB_ACM=m
# CONFIG_USB_PRINTER is not set
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#

#
# also be needed; see USB_STORAGE Help for more info
#
CONFIG_USB_STORAGE=m
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_REALTEK is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_ISD200 is not set
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_ONETOUCH is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_STORAGE_CYPRESS_ATACB is not set
# CONFIG_USB_STORAGE_ENE_UB6250 is not set
# CONFIG_USB_UAS is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set
# CONFIG_USBIP_CORE is not set

#
# USB dual-mode controller drivers
#
# CONFIG_USB_CDNS_SUPPORT is not set
# CONFIG_USB_MUSB_HDRC is not set
# CONFIG_USB_DWC3 is not set
# CONFIG_USB_DWC2 is not set
# CONFIG_USB_CHIPIDEA is not set
# CONFIG_USB_ISP1760 is not set

#
# USB port drivers
#
CONFIG_USB_SERIAL=m
CONFIG_USB_SERIAL_GENERIC=y
# CONFIG_USB_SERIAL_SIMPLE is not set
# CONFIG_USB_SERIAL_AIRCABLE is not set
# CONFIG_USB_SERIAL_ARK3116 is not set
# CONFIG_USB_SERIAL_BELKIN is not set
# CONFIG_USB_SERIAL_CH341 is not set
# CONFIG_USB_SERIAL_WHITEHEAT is not set
# CONFIG_USB_SERIAL_DIGI_ACCELEPORT is not set
# CONFIG_USB_SERIAL_CP210X is not set
# CONFIG_USB_SERIAL_CYPRESS_M8 is not set
# CONFIG_USB_SERIAL_EMPEG is not set
CONFIG_USB_SERIAL_FTDI_SIO=m
# CONFIG_USB_SERIAL_VISOR is not set
# CONFIG_USB_SERIAL_IPAQ is not set
# CONFIG_USB_SERIAL_IR is not set
# CONFIG_USB_SERIAL_EDGEPORT is not set
# CONFIG_USB_SERIAL_EDGEPORT_TI is not set
# CONFIG_USB_SERIAL_F81232 is not set
# CONFIG_USB_SERIAL_F8153X is not set
# CONFIG_USB_SERIAL_GARMIN is not set
# CONFIG_USB_SERIAL_IPW is not set
# CONFIG_USB_SERIAL_IUU is not set
# CONFIG_USB_SERIAL_KEYSPAN_PDA is not set
CONFIG_USB_SERIAL_KEYSPAN=m
# CONFIG_USB_SERIAL_KLSI is not set
# CONFIG_USB_SERIAL_KOBIL_SCT is not set
# CONFIG_USB_SERIAL_MCT_U232 is not set
# CONFIG_USB_SERIAL_METRO is not set
# CONFIG_USB_SERIAL_MOS7720 is not set
# CONFIG_USB_SERIAL_MOS7840 is not set
# CONFIG_USB_SERIAL_MXUPORT is not set
# CONFIG_USB_SERIAL_NAVMAN is not set
CONFIG_USB_SERIAL_PL2303=m
# CONFIG_USB_SERIAL_OTI6858 is not set
# CONFIG_USB_SERIAL_QCAUX is not set
# CONFIG_USB_SERIAL_QUALCOMM is not set
# CONFIG_USB_SERIAL_SPCP8X5 is not set
# CONFIG_USB_SERIAL_SAFE is not set
# CONFIG_USB_SERIAL_SIERRAWIRELESS is not set
# CONFIG_USB_SERIAL_SYMBOL is not set
# CONFIG_USB_SERIAL_TI is not set
# CONFIG_USB_SERIAL_CYBERJACK is not set
# CONFIG_USB_SERIAL_OPTION is not set
# CONFIG_USB_SERIAL_OMNINET is not set
# CONFIG_USB_SERIAL_OPTICON is not set
# CONFIG_USB_SERIAL_XSENS_MT is not set
# CONFIG_USB_SERIAL_WISHBONE is not set
# CONFIG_USB_SERIAL_SSU100 is not set
# CONFIG_USB_SERIAL_QT2 is not set
# CONFIG_USB_SERIAL_UPD78F0730 is not set
# CONFIG_USB_SERIAL_XR is not set
# CONFIG_USB_SERIAL_DEBUG is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_APPLE_MFI_FASTCHARGE is not set
# CONFIG_USB_LJCA is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
CONFIG_USB_TEST=m
# CONFIG_USB_EHSET_TEST_FIXTURE is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_YUREX is not set
CONFIG_USB_EZUSB_FX2=m
# CONFIG_USB_HUB_USB251XB is not set
# CONFIG_USB_HSIC_USB3503 is not set
# CONFIG_USB_HSIC_USB4604 is not set
# CONFIG_USB_LINK_LAYER_TEST is not set
# CONFIG_USB_CHAOSKEY is not set

#
# USB Physical Layer drivers
#
# CONFIG_NOP_USB_XCEIV is not set
# CONFIG_USB_GPIO_VBUS is not set
# CONFIG_USB_ISP1301 is not set
# end of USB Physical Layer drivers

# CONFIG_USB_GADGET is not set
# CONFIG_TYPEC is not set
# CONFIG_USB_ROLE_SWITCH is not set
# CONFIG_MMC is not set
# CONFIG_SCSI_UFSHCD is not set
# CONFIG_MEMSTICK is not set
# CONFIG_NEW_LEDS is not set
# CONFIG_ACCESSIBILITY is not set
CONFIG_INFINIBAND=m
CONFIG_INFINIBAND_USER_MAD=m
CONFIG_INFINIBAND_USER_ACCESS=m
CONFIG_INFINIBAND_USER_MEM=y
CONFIG_INFINIBAND_ON_DEMAND_PAGING=y
CONFIG_INFINIBAND_ADDR_TRANS=y
CONFIG_INFINIBAND_ADDR_TRANS_CONFIGFS=y
CONFIG_INFINIBAND_VIRT_DMA=y
# CONFIG_INFINIBAND_BNXT_RE is not set
# CONFIG_INFINIBAND_EFA is not set
# CONFIG_INFINIBAND_ERDMA is not set
CONFIG_MLX4_INFINIBAND=m
CONFIG_MLX5_INFINIBAND=m
# CONFIG_INFINIBAND_MTHCA is not set
# CONFIG_INFINIBAND_OCRDMA is not set
# CONFIG_INFINIBAND_USNIC is not set
# CONFIG_INFINIBAND_RDMAVT is not set
# CONFIG_RDMA_RXE is not set
# CONFIG_RDMA_SIW is not set
# CONFIG_INFINIBAND_IPOIB is not set
# CONFIG_INFINIBAND_SRP is not set
# CONFIG_INFINIBAND_ISER is not set
# CONFIG_INFINIBAND_RTRS_CLIENT is not set
# CONFIG_INFINIBAND_RTRS_SERVER is not set
# CONFIG_INFINIBAND_OPA_VNIC is not set
CONFIG_EDAC_ATOMIC_SCRUB=y
CONFIG_EDAC_SUPPORT=y
# CONFIG_EDAC is not set
CONFIG_RTC_LIB=y
CONFIG_RTC_MC146818_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
CONFIG_RTC_SYSTOHC=y
CONFIG_RTC_SYSTOHC_DEVICE="rtc0"
# CONFIG_RTC_DEBUG is not set
CONFIG_RTC_NVMEM=y

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
# CONFIG_RTC_DRV_ABB5ZES3 is not set
# CONFIG_RTC_DRV_ABEOZ9 is not set
# CONFIG_RTC_DRV_ABX80X is not set
# CONFIG_RTC_DRV_DS1307 is not set
# CONFIG_RTC_DRV_DS1374 is not set
# CONFIG_RTC_DRV_DS1672 is not set
# CONFIG_RTC_DRV_MAX6900 is not set
# CONFIG_RTC_DRV_RS5C372 is not set
# CONFIG_RTC_DRV_ISL1208 is not set
# CONFIG_RTC_DRV_ISL12022 is not set
# CONFIG_RTC_DRV_X1205 is not set
# CONFIG_RTC_DRV_PCF8523 is not set
# CONFIG_RTC_DRV_PCF85063 is not set
# CONFIG_RTC_DRV_PCF85363 is not set
# CONFIG_RTC_DRV_PCF8563 is not set
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_BQ32K is not set
# CONFIG_RTC_DRV_S35390A is not set
# CONFIG_RTC_DRV_FM3130 is not set
# CONFIG_RTC_DRV_RX8010 is not set
# CONFIG_RTC_DRV_RX8581 is not set
# CONFIG_RTC_DRV_RX8025 is not set
# CONFIG_RTC_DRV_EM3027 is not set
# CONFIG_RTC_DRV_RV3028 is not set
# CONFIG_RTC_DRV_RV3032 is not set
# CONFIG_RTC_DRV_RV8803 is not set
# CONFIG_RTC_DRV_SD3078 is not set

#
# SPI RTC drivers
#
# CONFIG_RTC_DRV_M41T93 is not set
# CONFIG_RTC_DRV_M41T94 is not set
# CONFIG_RTC_DRV_DS1302 is not set
# CONFIG_RTC_DRV_DS1305 is not set
# CONFIG_RTC_DRV_DS1343 is not set
# CONFIG_RTC_DRV_DS1347 is not set
# CONFIG_RTC_DRV_DS1390 is not set
# CONFIG_RTC_DRV_MAX6916 is not set
# CONFIG_RTC_DRV_R9701 is not set
# CONFIG_RTC_DRV_RX4581 is not set
# CONFIG_RTC_DRV_RS5C348 is not set
# CONFIG_RTC_DRV_MAX6902 is not set
# CONFIG_RTC_DRV_PCF2123 is not set
# CONFIG_RTC_DRV_MCP795 is not set
CONFIG_RTC_I2C_AND_SPI=y

#
# SPI and I2C RTC drivers
#
# CONFIG_RTC_DRV_DS3232 is not set
# CONFIG_RTC_DRV_PCF2127 is not set
# CONFIG_RTC_DRV_RV3029C2 is not set
# CONFIG_RTC_DRV_RX6110 is not set

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
# CONFIG_RTC_DRV_DS1286 is not set
# CONFIG_RTC_DRV_DS1511 is not set
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1685_FAMILY is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_DS2404 is not set
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T35 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_MSM6242 is not set
# CONFIG_RTC_DRV_RP5C01 is not set

#
# on-CPU RTC drivers
#
# CONFIG_RTC_DRV_FTRTC010 is not set

#
# HID Sensor RTC drivers
#
# CONFIG_RTC_DRV_GOLDFISH is not set
CONFIG_DMADEVICES=y
# CONFIG_DMADEVICES_DEBUG is not set

#
# DMA Devices
#
CONFIG_DMA_ENGINE=y
CONFIG_DMA_VIRTUAL_CHANNELS=y
CONFIG_DMA_ACPI=y
# CONFIG_ALTERA_MSGDMA is not set
# CONFIG_INTEL_IDMA64 is not set
# CONFIG_INTEL_IDXD is not set
# CONFIG_INTEL_IDXD_COMPAT is not set
# CONFIG_INTEL_IOATDMA is not set
# CONFIG_PLX_DMA is not set
# CONFIG_XILINX_DMA is not set
# CONFIG_XILINX_XDMA is not set
# CONFIG_AMD_PTDMA is not set
# CONFIG_QCOM_HIDMA_MGMT is not set
# CONFIG_QCOM_HIDMA is not set
CONFIG_DW_DMAC_CORE=y
# CONFIG_DW_DMAC is not set
# CONFIG_DW_DMAC_PCI is not set
# CONFIG_DW_EDMA is not set
CONFIG_HSU_DMA=y
# CONFIG_SF_PDMA is not set
# CONFIG_INTEL_LDMA is not set

#
# DMA Clients
#
# CONFIG_ASYNC_TX_DMA is not set
# CONFIG_DMATEST is not set

#
# DMABUF options
#
# CONFIG_SYNC_FILE is not set
# CONFIG_UDMABUF is not set
# CONFIG_DMABUF_MOVE_NOTIFY is not set
# CONFIG_DMABUF_DEBUG is not set
# CONFIG_DMABUF_SELFTESTS is not set
# CONFIG_DMABUF_HEAPS is not set
# CONFIG_DMABUF_SYSFS_STATS is not set
# end of DMABUF options

# CONFIG_UIO is not set
# CONFIG_VFIO is not set
CONFIG_IRQ_BYPASS_MANAGER=y
# CONFIG_VIRT_DRIVERS is not set
CONFIG_VIRTIO_ANCHOR=y
CONFIG_VIRTIO=y
CONFIG_VIRTIO_PCI_LIB=y
CONFIG_VIRTIO_PCI_LIB_LEGACY=y
CONFIG_VIRTIO_MENU=y
CONFIG_VIRTIO_PCI=y
CONFIG_VIRTIO_PCI_LEGACY=y
# CONFIG_VIRTIO_PMEM is not set
CONFIG_VIRTIO_BALLOON=m
# CONFIG_VIRTIO_INPUT is not set
# CONFIG_VIRTIO_MMIO is not set
# CONFIG_VDPA is not set
CONFIG_VHOST_IOTLB=y
CONFIG_VHOST_TASK=y
CONFIG_VHOST=y
CONFIG_VHOST_MENU=y
CONFIG_VHOST_NET=y
CONFIG_VHOST_VSOCK=y
# CONFIG_VHOST_CROSS_ENDIAN_LEGACY is not set

#
# Microsoft Hyper-V guest support
#
# end of Microsoft Hyper-V guest support

# CONFIG_GREYBUS is not set
# CONFIG_COMEDI is not set
# CONFIG_STAGING is not set
# CONFIG_CHROME_PLATFORMS is not set
# CONFIG_MELLANOX_PLATFORM is not set
CONFIG_SURFACE_PLATFORMS=y
# CONFIG_SURFACE_3_POWER_OPREGION is not set
# CONFIG_SURFACE_GPE is not set
# CONFIG_SURFACE_HOTPLUG is not set
# CONFIG_SURFACE_PRO3_BUTTON is not set
CONFIG_X86_PLATFORM_DEVICES=y
# CONFIG_ACPI_WMI is not set
# CONFIG_ACERHDF is not set
# CONFIG_ACER_WIRELESS is not set
# CONFIG_AMD_PMF is not set
# CONFIG_AMD_PMC is not set
# CONFIG_AMD_HSMP is not set
# CONFIG_ADV_SWBUTTON is not set
# CONFIG_ASUS_WIRELESS is not set
# CONFIG_ASUS_TF103C_DOCK is not set
CONFIG_X86_PLATFORM_DRIVERS_DELL=y
CONFIG_DCDBAS=m
CONFIG_DELL_RBU=m
CONFIG_DELL_SMBIOS=m
CONFIG_DELL_SMBIOS_SMM=y
CONFIG_DELL_SMO8800=m
# CONFIG_FUJITSU_TABLET is not set
# CONFIG_GPD_POCKET_FAN is not set
# CONFIG_X86_PLATFORM_DRIVERS_HP is not set
# CONFIG_WIRELESS_HOTKEY is not set
# CONFIG_IBM_RTL is not set
# CONFIG_SENSORS_HDAPS is not set
CONFIG_INTEL_IFS=m
# CONFIG_INTEL_SAR_INT1092 is not set
# CONFIG_INTEL_PMC_CORE is not set

#
# Intel Speed Select Technology interface support
#
# CONFIG_INTEL_SPEED_SELECT_INTERFACE is not set
# end of Intel Speed Select Technology interface support

#
# Intel Uncore Frequency Control
#
# CONFIG_INTEL_UNCORE_FREQ_CONTROL is not set
# end of Intel Uncore Frequency Control

# CONFIG_INTEL_HID_EVENT is not set
# CONFIG_INTEL_VBTN is not set
# CONFIG_INTEL_INT0002_VGPIO is not set
# CONFIG_INTEL_PUNIT_IPC is not set
# CONFIG_INTEL_RST is not set
# CONFIG_INTEL_SMARTCONNECT is not set
# CONFIG_INTEL_VSEC is not set
# CONFIG_MSI_EC is not set
# CONFIG_BARCO_P50_GPIO is not set
# CONFIG_SAMSUNG_Q10 is not set
# CONFIG_TOSHIBA_BT_RFKILL is not set
# CONFIG_TOSHIBA_HAPS is not set
# CONFIG_ACPI_CMPC is not set
# CONFIG_SYSTEM76_ACPI is not set
# CONFIG_TOPSTAR_LAPTOP is not set
# CONFIG_SERIAL_MULTI_INSTANTIATE is not set
# CONFIG_MLX_PLATFORM is not set
# CONFIG_INTEL_IPS is not set
# CONFIG_INTEL_SCU_PCI is not set
# CONFIG_INTEL_SCU_PLATFORM is not set
# CONFIG_SIEMENS_SIMATIC_IPC is not set
# CONFIG_WINMATE_FM07_KEYS is not set
CONFIG_P2SB=y
# CONFIG_COMMON_CLK is not set
# CONFIG_HWSPINLOCK is not set

#
# Clock Source drivers
#
CONFIG_CLKEVT_I8253=y
CONFIG_I8253_LOCK=y
CONFIG_CLKBLD_I8253=y
# end of Clock Source drivers

# CONFIG_MAILBOX is not set
CONFIG_IOMMU_IOVA=y
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y

#
# Generic IOMMU Pagetable Support
#
CONFIG_IOMMU_IO_PGTABLE=y
# end of Generic IOMMU Pagetable Support

# CONFIG_IOMMU_DEBUGFS is not set
# CONFIG_IOMMU_DEFAULT_DMA_STRICT is not set
CONFIG_IOMMU_DEFAULT_DMA_LAZY=y
# CONFIG_IOMMU_DEFAULT_PASSTHROUGH is not set
CONFIG_IOMMU_DMA=y
CONFIG_AMD_IOMMU=y
CONFIG_DMAR_TABLE=y
CONFIG_INTEL_IOMMU=y
# CONFIG_INTEL_IOMMU_SVM is not set
# CONFIG_INTEL_IOMMU_DEFAULT_ON is not set
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
CONFIG_INTEL_IOMMU_SCALABLE_MODE_DEFAULT_ON=y
CONFIG_INTEL_IOMMU_PERF_EVENTS=y
# CONFIG_IOMMUFD is not set
CONFIG_IRQ_REMAP=y
# CONFIG_VIRTIO_IOMMU is not set

#
# Remoteproc drivers
#
# CONFIG_REMOTEPROC is not set
# end of Remoteproc drivers

#
# Rpmsg drivers
#
# CONFIG_RPMSG_VIRTIO is not set
# end of Rpmsg drivers

# CONFIG_SOUNDWIRE is not set

#
# SOC (System On Chip) specific Drivers
#

#
# Amlogic SoC drivers
#
# end of Amlogic SoC drivers

#
# Broadcom SoC drivers
#
# end of Broadcom SoC drivers

#
# NXP/Freescale QorIQ SoC drivers
#
# end of NXP/Freescale QorIQ SoC drivers

#
# fujitsu SoC drivers
#
# end of fujitsu SoC drivers

#
# i.MX SoC drivers
#
# end of i.MX SoC drivers

#
# Enable LiteX SoC Builder specific drivers
#
# end of Enable LiteX SoC Builder specific drivers

# CONFIG_WPCM450_SOC is not set

#
# Qualcomm SoC drivers
#
# end of Qualcomm SoC drivers

# CONFIG_SOC_TI is not set

#
# Xilinx SoC drivers
#
# end of Xilinx SoC drivers
# end of SOC (System On Chip) specific Drivers

#
# PM Domains
#

#
# Amlogic PM Domains
#
# end of Amlogic PM Domains

#
# Broadcom PM Domains
#
# end of Broadcom PM Domains

#
# i.MX PM Domains
#
# end of i.MX PM Domains

#
# Qualcomm PM Domains
#
# end of Qualcomm PM Domains
# end of PM Domains

# CONFIG_PM_DEVFREQ is not set
# CONFIG_EXTCON is not set
# CONFIG_MEMORY is not set
# CONFIG_IIO is not set
# CONFIG_NTB is not set
# CONFIG_PWM is not set

#
# IRQ chip support
#
# end of IRQ chip support

# CONFIG_IPACK_BUS is not set
# CONFIG_RESET_CONTROLLER is not set

#
# PHY Subsystem
#
CONFIG_GENERIC_PHY=y
# CONFIG_USB_LGM_PHY is not set
# CONFIG_PHY_CAN_TRANSCEIVER is not set

#
# PHY drivers for Broadcom platforms
#
# CONFIG_BCM_KONA_USB2_PHY is not set
# end of PHY drivers for Broadcom platforms

# CONFIG_PHY_PXA_28NM_HSIC is not set
# CONFIG_PHY_PXA_28NM_USB2 is not set
# CONFIG_PHY_INTEL_LGM_EMMC is not set
# end of PHY Subsystem

# CONFIG_POWERCAP is not set
# CONFIG_MCB is not set

#
# Performance monitor support
#
# end of Performance monitor support

CONFIG_RAS=y
# CONFIG_RAS_CEC is not set
# CONFIG_USB4 is not set

#
# Android
#
# CONFIG_ANDROID_BINDER_IPC is not set
# end of Android

CONFIG_LIBNVDIMM=y
# CONFIG_BLK_DEV_PMEM is not set
# CONFIG_BTT is not set
CONFIG_NVDIMM_KEYS=y
# CONFIG_NVDIMM_SECURITY_TEST is not set
# CONFIG_DAX is not set
CONFIG_NVMEM=y
CONFIG_NVMEM_SYSFS=y

#
# Layout Types
#
# CONFIG_NVMEM_LAYOUT_SL28_VPD is not set
# CONFIG_NVMEM_LAYOUT_ONIE_TLV is not set
# end of Layout Types

# CONFIG_NVMEM_RMEM is not set

#
# HW tracing support
#
# CONFIG_STM is not set
# CONFIG_INTEL_TH is not set
# end of HW tracing support

# CONFIG_FPGA is not set
# CONFIG_TEE is not set
# CONFIG_SIOX is not set
# CONFIG_SLIMBUS is not set
# CONFIG_INTERCONNECT is not set
# CONFIG_COUNTER is not set
# CONFIG_MOST is not set
# CONFIG_PECI is not set
# CONFIG_HTE is not set
# end of Device Drivers

#
# File systems
#
CONFIG_DCACHE_WORD_ACCESS=y
# CONFIG_VALIDATE_FS_PARSER is not set
CONFIG_FS_IOMAP=y
CONFIG_BUFFER_HEAD=y
CONFIG_LEGACY_DIRECT_IO=y
# CONFIG_EXT2_FS is not set
# CONFIG_EXT3_FS is not set
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT2=y
# CONFIG_EXT4_FS_POSIX_ACL is not set
CONFIG_EXT4_FS_SECURITY=y
# CONFIG_EXT4_DEBUG is not set
CONFIG_JBD2=y
CONFIG_JBD2_DEBUG=y
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
# CONFIG_BTRFS_FS is not set
# CONFIG_NILFS2_FS is not set
# CONFIG_F2FS_FS is not set
# CONFIG_BCACHEFS_FS is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_EXPORTFS=y
# CONFIG_EXPORTFS_BLOCK_OPS is not set
CONFIG_FILE_LOCKING=y
CONFIG_FS_ENCRYPTION=y
CONFIG_FS_ENCRYPTION_ALGS=y
# CONFIG_FS_VERITY is not set
CONFIG_FSNOTIFY=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_FANOTIFY=y
# CONFIG_FANOTIFY_ACCESS_PERMISSIONS is not set
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
# CONFIG_QUOTA_DEBUG is not set
CONFIG_QUOTA_TREE=y
# CONFIG_QFMT_V1 is not set
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
# CONFIG_AUTOFS_FS is not set
CONFIG_FUSE_FS=y
CONFIG_CUSE=m
# CONFIG_VIRTIO_FS is not set
CONFIG_OVERLAY_FS=y
# CONFIG_OVERLAY_FS_REDIRECT_DIR is not set
CONFIG_OVERLAY_FS_REDIRECT_ALWAYS_FOLLOW=y
# CONFIG_OVERLAY_FS_INDEX is not set
# CONFIG_OVERLAY_FS_XINO_AUTO is not set
# CONFIG_OVERLAY_FS_METACOPY is not set
# CONFIG_OVERLAY_FS_DEBUG is not set

#
# Caches
#
CONFIG_NETFS_SUPPORT=m
# CONFIG_NETFS_STATS is not set
# CONFIG_FSCACHE is not set
# end of Caches

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=m
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
# CONFIG_UDF_FS is not set
# end of CD-ROM/DVD Filesystems

#
# DOS/FAT/EXFAT/NT Filesystems
#
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
# CONFIG_FAT_DEFAULT_UTF8 is not set
# CONFIG_EXFAT_FS is not set
# CONFIG_NTFS_FS is not set
# CONFIG_NTFS3_FS is not set
# end of DOS/FAT/EXFAT/NT Filesystems

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
# CONFIG_PROC_VMCORE_DEVICE_DUMP is not set
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_PROC_CHILDREN=y
CONFIG_PROC_PID_ARCH_STATUS=y
CONFIG_PROC_CPU_RESCTRL=y
CONFIG_KERNFS=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
CONFIG_TMPFS_INODE64=y
# CONFIG_TMPFS_QUOTA is not set
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
# CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON is not set
CONFIG_ARCH_HAS_GIGANTIC_PAGE=y
CONFIG_CONFIGFS_FS=y
CONFIG_EFIVAR_FS=m
# end of Pseudo filesystems

CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ORANGEFS_FS is not set
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_ECRYPT_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_JFFS2_FS is not set
# CONFIG_CRAMFS is not set
# CONFIG_SQUASHFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_QNX6FS_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_PSTORE=y
CONFIG_PSTORE_DEFAULT_KMSG_BYTES=10240
CONFIG_PSTORE_COMPRESS=y
# CONFIG_PSTORE_CONSOLE is not set
# CONFIG_PSTORE_PMSG is not set
# CONFIG_PSTORE_FTRACE is not set
# CONFIG_PSTORE_RAM is not set
# CONFIG_PSTORE_BLK is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
# CONFIG_EROFS_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
# CONFIG_NFS_FS is not set
# CONFIG_NFSD is not set
# CONFIG_CEPH_FS is not set
# CONFIG_CIFS is not set
# CONFIG_SMB_SERVER is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
CONFIG_9P_FS=m
# CONFIG_9P_FS_POSIX_ACL is not set
# CONFIG_9P_FS_SECURITY is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-1"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
CONFIG_NLS_ASCII=y
CONFIG_NLS_ISO8859_1=y
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_MAC_ROMAN is not set
# CONFIG_NLS_MAC_CELTIC is not set
# CONFIG_NLS_MAC_CENTEURO is not set
# CONFIG_NLS_MAC_CROATIAN is not set
# CONFIG_NLS_MAC_CYRILLIC is not set
# CONFIG_NLS_MAC_GAELIC is not set
# CONFIG_NLS_MAC_GREEK is not set
# CONFIG_NLS_MAC_ICELAND is not set
# CONFIG_NLS_MAC_INUIT is not set
# CONFIG_NLS_MAC_ROMANIAN is not set
# CONFIG_NLS_MAC_TURKISH is not set
CONFIG_NLS_UTF8=y
# CONFIG_DLM is not set
# CONFIG_UNICODE is not set
CONFIG_IO_WQ=y
# end of File systems

#
# Security options
#
CONFIG_KEYS=y
# CONFIG_KEYS_REQUEST_CACHE is not set
CONFIG_PERSISTENT_KEYRINGS=y
# CONFIG_TRUSTED_KEYS is not set
CONFIG_ENCRYPTED_KEYS=y
# CONFIG_USER_DECRYPTED_DATA is not set
# CONFIG_KEY_DH_OPERATIONS is not set
# CONFIG_SECURITY_DMESG_RESTRICT is not set
CONFIG_SECURITY=y
# CONFIG_SECURITYFS is not set
CONFIG_SECURITY_NETWORK=y
# CONFIG_SECURITY_INFINIBAND is not set
# CONFIG_SECURITY_PATH is not set
# CONFIG_INTEL_TXT is not set
# CONFIG_HARDENED_USERCOPY is not set
# CONFIG_FORTIFY_SOURCE is not set
# CONFIG_STATIC_USERMODEHELPER is not set
# CONFIG_SECURITY_SELINUX is not set
# CONFIG_SECURITY_SMACK is not set
# CONFIG_SECURITY_TOMOYO is not set
# CONFIG_SECURITY_APPARMOR is not set
# CONFIG_SECURITY_LOADPIN is not set
# CONFIG_SECURITY_YAMA is not set
# CONFIG_SECURITY_SAFESETID is not set
# CONFIG_SECURITY_LOCKDOWN_LSM is not set
# CONFIG_SECURITY_LANDLOCK is not set
CONFIG_INTEGRITY=y
# CONFIG_INTEGRITY_SIGNATURE is not set
CONFIG_INTEGRITY_AUDIT=y
# CONFIG_IMA is not set
# CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT is not set
# CONFIG_EVM is not set
CONFIG_DEFAULT_SECURITY_DAC=y
CONFIG_LSM="landlock,lockdown,yama,loadpin,safesetid,bpf"

#
# Kernel hardening options
#

#
# Memory initialization
#
CONFIG_CC_HAS_AUTO_VAR_INIT_PATTERN=y
CONFIG_CC_HAS_AUTO_VAR_INIT_ZERO_BARE=y
CONFIG_CC_HAS_AUTO_VAR_INIT_ZERO=y
CONFIG_INIT_STACK_NONE=y
# CONFIG_INIT_STACK_ALL_PATTERN is not set
# CONFIG_INIT_STACK_ALL_ZERO is not set
# CONFIG_INIT_ON_ALLOC_DEFAULT_ON is not set
# CONFIG_INIT_ON_FREE_DEFAULT_ON is not set
CONFIG_CC_HAS_ZERO_CALL_USED_REGS=y
# CONFIG_ZERO_CALL_USED_REGS is not set
# end of Memory initialization

#
# Hardening of kernel data structures
#
CONFIG_LIST_HARDENED=y
CONFIG_BUG_ON_DATA_CORRUPTION=y
# end of Hardening of kernel data structures

CONFIG_CC_HAS_RANDSTRUCT=y
CONFIG_RANDSTRUCT_NONE=y
# CONFIG_RANDSTRUCT_FULL is not set
# end of Kernel hardening options
# end of Security options

CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_SIG2=y
CONFIG_CRYPTO_SKCIPHER=y
CONFIG_CRYPTO_SKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_AKCIPHER2=y
CONFIG_CRYPTO_AKCIPHER=m
CONFIG_CRYPTO_KPP2=y
CONFIG_CRYPTO_ACOMP2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_USER is not set
CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y
CONFIG_CRYPTO_NULL=y
CONFIG_CRYPTO_NULL2=y
# CONFIG_CRYPTO_PCRYPT is not set
CONFIG_CRYPTO_CRYPTD=y
CONFIG_CRYPTO_AUTHENC=y
# CONFIG_CRYPTO_TEST is not set
CONFIG_CRYPTO_SIMD=y
# end of Crypto core or helper

#
# Public-key cryptography
#
CONFIG_CRYPTO_RSA=m
# CONFIG_CRYPTO_DH is not set
# CONFIG_CRYPTO_ECDH is not set
# CONFIG_CRYPTO_ECDSA is not set
# CONFIG_CRYPTO_ECRDSA is not set
# CONFIG_CRYPTO_SM2 is not set
# CONFIG_CRYPTO_CURVE25519 is not set
# end of Public-key cryptography

#
# Block ciphers
#
CONFIG_CRYPTO_AES=y
# CONFIG_CRYPTO_AES_TI is not set
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_ARIA is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
CONFIG_CRYPTO_DES=y
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SEED is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_SM4_GENERIC is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set
# end of Block ciphers

#
# Length-preserving ciphers and modes
#
# CONFIG_CRYPTO_ADIANTUM is not set
CONFIG_CRYPTO_ARC4=y
# CONFIG_CRYPTO_CHACHA20 is not set
CONFIG_CRYPTO_CBC=y
# CONFIG_CRYPTO_CFB is not set
# CONFIG_CRYPTO_CTR is not set
CONFIG_CRYPTO_CTS=y
CONFIG_CRYPTO_ECB=y
# CONFIG_CRYPTO_HCTR2 is not set
# CONFIG_CRYPTO_KEYWRAP is not set
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_OFB is not set
# CONFIG_CRYPTO_PCBC is not set
CONFIG_CRYPTO_XTS=y
# end of Length-preserving ciphers and modes

#
# AEAD (authenticated encryption with associated data) ciphers
#
# CONFIG_CRYPTO_AEGIS128 is not set
# CONFIG_CRYPTO_CHACHA20POLY1305 is not set
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_SEQIV is not set
# CONFIG_CRYPTO_ECHAINIV is not set
CONFIG_CRYPTO_ESSIV=y
# end of AEAD (authenticated encryption with associated data) ciphers

#
# Hashes, digests, and MACs
#
# CONFIG_CRYPTO_BLAKE2B is not set
# CONFIG_CRYPTO_CMAC is not set
# CONFIG_CRYPTO_GHASH is not set
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_MICHAEL_MIC is not set
# CONFIG_CRYPTO_POLY1305 is not set
# CONFIG_CRYPTO_RMD160 is not set
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_SHA256=y
CONFIG_CRYPTO_SHA512=y
CONFIG_CRYPTO_SHA3=m
# CONFIG_CRYPTO_SM3_GENERIC is not set
# CONFIG_CRYPTO_STREEBOG is not set
CONFIG_CRYPTO_VMAC=y
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_XXHASH is not set
# end of Hashes, digests, and MACs

#
# CRCs (cyclic redundancy checks)
#
CONFIG_CRYPTO_CRC32C=y
# CONFIG_CRYPTO_CRC32 is not set
# CONFIG_CRYPTO_CRCT10DIF is not set
# end of CRCs (cyclic redundancy checks)

#
# Compression
#
# CONFIG_CRYPTO_DEFLATE is not set
CONFIG_CRYPTO_LZO=y
# CONFIG_CRYPTO_842 is not set
# CONFIG_CRYPTO_LZ4 is not set
# CONFIG_CRYPTO_LZ4HC is not set
# CONFIG_CRYPTO_ZSTD is not set
# end of Compression

#
# Random number generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
# CONFIG_CRYPTO_DRBG_MENU is not set
# CONFIG_CRYPTO_JITTERENTROPY is not set
# end of Random number generation

#
# Userspace interface
#
CONFIG_CRYPTO_USER_API=y
CONFIG_CRYPTO_USER_API_HASH=y
CONFIG_CRYPTO_USER_API_SKCIPHER=y
# CONFIG_CRYPTO_USER_API_RNG is not set
# CONFIG_CRYPTO_USER_API_AEAD is not set
CONFIG_CRYPTO_USER_API_ENABLE_OBSOLETE=y
# end of Userspace interface

CONFIG_CRYPTO_HASH_INFO=y

#
# Accelerated Cryptographic Algorithms for CPU (x86)
#
# CONFIG_CRYPTO_CURVE25519_X86 is not set
CONFIG_CRYPTO_AES_NI_INTEL=y
# CONFIG_CRYPTO_BLOWFISH_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_CAST5_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAST6_AVX_X86_64 is not set
# CONFIG_CRYPTO_DES3_EDE_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_SSE2_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX2_X86_64 is not set
# CONFIG_CRYPTO_SM4_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_SM4_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH_X86_64_3WAY is not set
# CONFIG_CRYPTO_TWOFISH_AVX_X86_64 is not set
# CONFIG_CRYPTO_ARIA_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_ARIA_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_ARIA_GFNI_AVX512_X86_64 is not set
# CONFIG_CRYPTO_CHACHA20_X86_64 is not set
# CONFIG_CRYPTO_AEGIS128_AESNI_SSE2 is not set
# CONFIG_CRYPTO_NHPOLY1305_SSE2 is not set
# CONFIG_CRYPTO_NHPOLY1305_AVX2 is not set
# CONFIG_CRYPTO_BLAKE2S_X86 is not set
# CONFIG_CRYPTO_POLYVAL_CLMUL_NI is not set
# CONFIG_CRYPTO_POLY1305_X86_64 is not set
# CONFIG_CRYPTO_SHA1_SSSE3 is not set
# CONFIG_CRYPTO_SHA256_SSSE3 is not set
# CONFIG_CRYPTO_SHA512_SSSE3 is not set
# CONFIG_CRYPTO_SM3_AVX_X86_64 is not set
# CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL is not set
# CONFIG_CRYPTO_CRC32C_INTEL is not set
# CONFIG_CRYPTO_CRC32_PCLMUL is not set
# end of Accelerated Cryptographic Algorithms for CPU (x86)

CONFIG_CRYPTO_HW=y
# CONFIG_CRYPTO_DEV_PADLOCK is not set
# CONFIG_CRYPTO_DEV_ATMEL_ECC is not set
# CONFIG_CRYPTO_DEV_ATMEL_SHA204A is not set
CONFIG_CRYPTO_DEV_CCP=y
CONFIG_CRYPTO_DEV_CCP_DD=y
CONFIG_CRYPTO_DEV_SP_CCP=y
CONFIG_CRYPTO_DEV_CCP_CRYPTO=m
CONFIG_CRYPTO_DEV_SP_PSP=y
# CONFIG_CRYPTO_DEV_CCP_DEBUGFS is not set
# CONFIG_CRYPTO_DEV_NITROX_CNN55XX is not set
# CONFIG_CRYPTO_DEV_QAT_DH895xCC is not set
# CONFIG_CRYPTO_DEV_QAT_C3XXX is not set
# CONFIG_CRYPTO_DEV_QAT_C62X is not set
# CONFIG_CRYPTO_DEV_QAT_4XXX is not set
# CONFIG_CRYPTO_DEV_QAT_DH895xCCVF is not set
# CONFIG_CRYPTO_DEV_QAT_C3XXXVF is not set
# CONFIG_CRYPTO_DEV_QAT_C62XVF is not set
# CONFIG_CRYPTO_DEV_VIRTIO is not set
# CONFIG_CRYPTO_DEV_SAFEXCEL is not set
# CONFIG_CRYPTO_DEV_AMLOGIC_GXL is not set
CONFIG_ASYMMETRIC_KEY_TYPE=y
CONFIG_ASYMMETRIC_PUBLIC_KEY_SUBTYPE=m
CONFIG_X509_CERTIFICATE_PARSER=m
# CONFIG_PKCS8_PRIVATE_KEY_PARSER is not set
CONFIG_PKCS7_MESSAGE_PARSER=m
# CONFIG_FIPS_SIGNATURE_SELFTEST is not set

#
# Certificates for signature checking
#
# CONFIG_SYSTEM_BLACKLIST_KEYRING is not set
# end of Certificates for signature checking

CONFIG_BINARY_PRINTF=y

#
# Library routines
#
# CONFIG_PACKING is not set
CONFIG_BITREVERSE=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
CONFIG_GENERIC_NET_UTILS=y
# CONFIG_CORDIC is not set
# CONFIG_PRIME_NUMBERS is not set
CONFIG_RATIONAL=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_USE_CMPXCHG_LOCKREF=y
CONFIG_ARCH_HAS_FAST_MULTIPLIER=y
CONFIG_ARCH_USE_SYM_ANNOTATIONS=y

#
# Crypto library routines
#
CONFIG_CRYPTO_LIB_UTILS=y
CONFIG_CRYPTO_LIB_AES=y
CONFIG_CRYPTO_LIB_ARC4=y
CONFIG_CRYPTO_LIB_BLAKE2S_GENERIC=y
# CONFIG_CRYPTO_LIB_CHACHA is not set
# CONFIG_CRYPTO_LIB_CURVE25519 is not set
CONFIG_CRYPTO_LIB_DES=y
CONFIG_CRYPTO_LIB_POLY1305_RSIZE=11
# CONFIG_CRYPTO_LIB_POLY1305 is not set
# CONFIG_CRYPTO_LIB_CHACHA20POLY1305 is not set
CONFIG_CRYPTO_LIB_SHA1=y
CONFIG_CRYPTO_LIB_SHA256=y
# end of Crypto library routines

# CONFIG_CRC_CCITT is not set
CONFIG_CRC16=y
# CONFIG_CRC_T10DIF is not set
# CONFIG_CRC64_ROCKSOFT is not set
CONFIG_CRC_ITU_T=m
CONFIG_CRC32=y
# CONFIG_CRC32_SELFTEST is not set
CONFIG_CRC32_SLICEBY8=y
# CONFIG_CRC32_SLICEBY4 is not set
# CONFIG_CRC32_SARWATE is not set
# CONFIG_CRC32_BIT is not set
# CONFIG_CRC64 is not set
# CONFIG_CRC4 is not set
# CONFIG_CRC7 is not set
CONFIG_LIBCRC32C=y
# CONFIG_CRC8 is not set
CONFIG_XXHASH=y
# CONFIG_RANDOM32_SELFTEST is not set
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_LZ4_DECOMPRESS=y
CONFIG_ZSTD_COMMON=y
CONFIG_ZSTD_DECOMPRESS=y
CONFIG_XZ_DEC=y
CONFIG_XZ_DEC_X86=y
CONFIG_XZ_DEC_POWERPC=y
CONFIG_XZ_DEC_ARM=y
CONFIG_XZ_DEC_ARMTHUMB=y
CONFIG_XZ_DEC_SPARC=y
# CONFIG_XZ_DEC_MICROLZMA is not set
CONFIG_XZ_DEC_BCJ=y
# CONFIG_XZ_DEC_TEST is not set
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_BZIP2=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_DECOMPRESS_XZ=y
CONFIG_DECOMPRESS_LZO=y
CONFIG_DECOMPRESS_LZ4=y
CONFIG_DECOMPRESS_ZSTD=y
CONFIG_GENERIC_ALLOCATOR=y
CONFIG_TEXTSEARCH=y
CONFIG_TEXTSEARCH_KMP=m
CONFIG_TEXTSEARCH_BM=m
CONFIG_TEXTSEARCH_FSM=m
CONFIG_BTREE=y
CONFIG_INTERVAL_TREE=y
CONFIG_XARRAY_MULTI=y
CONFIG_ASSOCIATIVE_ARRAY=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_IOPORT_MAP=y
CONFIG_HAS_DMA=y
CONFIG_DMA_OPS=y
CONFIG_NEED_SG_DMA_FLAGS=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_ARCH_HAS_FORCE_DMA_UNENCRYPTED=y
CONFIG_SWIOTLB=y
# CONFIG_SWIOTLB_DYNAMIC is not set
CONFIG_DMA_COHERENT_POOL=y
# CONFIG_DMA_API_DEBUG is not set
# CONFIG_DMA_MAP_BENCHMARK is not set
CONFIG_SGL_ALLOC=y
CONFIG_CHECK_SIGNATURE=y
# CONFIG_FORCE_NR_CPUS is not set
CONFIG_CPU_RMAP=y
CONFIG_DQL=y
CONFIG_GLOB=y
# CONFIG_GLOB_SELFTEST is not set
CONFIG_NLATTR=y
CONFIG_CLZ_TAB=y
CONFIG_IRQ_POLL=y
CONFIG_MPILIB=m
CONFIG_DIMLIB=y
CONFIG_OID_REGISTRY=m
CONFIG_UCS2_STRING=y
CONFIG_HAVE_GENERIC_VDSO=y
CONFIG_GENERIC_GETTIMEOFDAY=y
CONFIG_GENERIC_VDSO_TIME_NS=y
CONFIG_FONT_SUPPORT=y
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
CONFIG_SG_POOL=y
CONFIG_ARCH_HAS_PMEM_API=y
CONFIG_MEMREGION=y
CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION=y
CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE=y
CONFIG_ARCH_HAS_COPY_MC=y
CONFIG_ARCH_STACKWALK=y
CONFIG_SBITMAP=y
# CONFIG_LWQ_TEST is not set
# end of Library routines

CONFIG_FIRMWARE_TABLE=y

#
# Kernel hacking
#

#
# printk and dmesg options
#
CONFIG_PRINTK_TIME=y
# CONFIG_PRINTK_CALLER is not set
# CONFIG_STACKTRACE_BUILD_ID is not set
CONFIG_CONSOLE_LOGLEVEL_DEFAULT=7
CONFIG_CONSOLE_LOGLEVEL_QUIET=4
CONFIG_MESSAGE_LOGLEVEL_DEFAULT=4
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_DYNAMIC_DEBUG is not set
# CONFIG_DYNAMIC_DEBUG_CORE is not set
CONFIG_SYMBOLIC_ERRNAME=y
CONFIG_DEBUG_BUGVERBOSE=y
# end of printk and dmesg options

CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_MISC=y

#
# Compile-time checks and compiler options
#
CONFIG_DEBUG_INFO=y
CONFIG_AS_HAS_NON_CONST_LEB128=y
# CONFIG_DEBUG_INFO_NONE is not set
CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
# CONFIG_DEBUG_INFO_DWARF4 is not set
# CONFIG_DEBUG_INFO_DWARF5 is not set
# CONFIG_DEBUG_INFO_REDUCED is not set
CONFIG_DEBUG_INFO_COMPRESSED_NONE=y
# CONFIG_DEBUG_INFO_COMPRESSED_ZLIB is not set
# CONFIG_DEBUG_INFO_COMPRESSED_ZSTD is not set
# CONFIG_DEBUG_INFO_SPLIT is not set
CONFIG_DEBUG_INFO_BTF=y
CONFIG_PAHOLE_HAS_SPLIT_BTF=y
CONFIG_PAHOLE_HAS_BTF_TAG=y
CONFIG_PAHOLE_HAS_LANG_EXCLUDE=y
CONFIG_DEBUG_INFO_BTF_MODULES=y
CONFIG_MODULE_ALLOW_BTF_MISMATCH=y
CONFIG_GDB_SCRIPTS=y
CONFIG_FRAME_WARN=2048
# CONFIG_STRIP_ASM_SYMS is not set
# CONFIG_HEADERS_INSTALL is not set
# CONFIG_SECTION_MISMATCH_WARN_ONLY is not set
# CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B is not set
CONFIG_FRAME_POINTER=y
CONFIG_OBJTOOL=y
# CONFIG_STACK_VALIDATION is not set
# CONFIG_VMLINUX_MAP is not set
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
# end of Compile-time checks and compiler options

#
# Generic Kernel Debugging Instruments
#
CONFIG_MAGIC_SYSRQ=y
CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x1
CONFIG_MAGIC_SYSRQ_SERIAL=y
CONFIG_MAGIC_SYSRQ_SERIAL_SEQUENCE=""
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_FS_ALLOW_ALL=y
# CONFIG_DEBUG_FS_DISALLOW_MOUNT is not set
# CONFIG_DEBUG_FS_ALLOW_NONE is not set
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
CONFIG_ARCH_HAS_UBSAN_SANITIZE_ALL=y
# CONFIG_UBSAN is not set
CONFIG_HAVE_ARCH_KCSAN=y
CONFIG_HAVE_KCSAN_COMPILER=y
# CONFIG_KCSAN is not set
# end of Generic Kernel Debugging Instruments

#
# Networking Debugging
#
# CONFIG_NET_DEV_REFCNT_TRACKER is not set
# CONFIG_NET_NS_REFCNT_TRACKER is not set
# CONFIG_DEBUG_NET is not set
# end of Networking Debugging

#
# Memory Debugging
#
# CONFIG_PAGE_EXTENSION is not set
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_DEBUG_SLAB is not set
# CONFIG_PAGE_OWNER is not set
# CONFIG_PAGE_TABLE_CHECK is not set
# CONFIG_PAGE_POISONING is not set
# CONFIG_DEBUG_PAGE_REF is not set
# CONFIG_DEBUG_RODATA_TEST is not set
CONFIG_ARCH_HAS_DEBUG_WX=y
# CONFIG_DEBUG_WX is not set
CONFIG_GENERIC_PTDUMP=y
# CONFIG_PTDUMP_DEBUGFS is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
# CONFIG_DEBUG_KMEMLEAK is not set
# CONFIG_PER_VMA_LOCK_STATS is not set
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SHRINKER_DEBUG is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_SCHED_STACK_END_CHECK is not set
CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_VM_PGTABLE is not set
CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
# CONFIG_DEBUG_VIRTUAL is not set
CONFIG_DEBUG_MEMORY_INIT=y
# CONFIG_DEBUG_PER_CPU_MAPS is not set
CONFIG_ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP=y
# CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP is not set
CONFIG_HAVE_ARCH_KASAN=y
CONFIG_HAVE_ARCH_KASAN_VMALLOC=y
CONFIG_CC_HAS_KASAN_GENERIC=y
CONFIG_CC_HAS_KASAN_SW_TAGS=y
CONFIG_CC_HAS_WORKING_NOSANITIZE_ADDRESS=y
# CONFIG_KASAN is not set
CONFIG_HAVE_ARCH_KFENCE=y
CONFIG_KFENCE=y
CONFIG_KFENCE_SAMPLE_INTERVAL=100
CONFIG_KFENCE_NUM_OBJECTS=1023
# CONFIG_KFENCE_DEFERRABLE is not set
# CONFIG_KFENCE_STATIC_KEYS is not set
CONFIG_KFENCE_STRESS_TEST_FAULTS=0
CONFIG_HAVE_ARCH_KMSAN=y
CONFIG_HAVE_KMSAN_COMPILER=y
# end of Memory Debugging

# CONFIG_DEBUG_SHIRQ is not set

#
# Debug Oops, Lockups and Hangs
#
# CONFIG_PANIC_ON_OOPS is not set
CONFIG_PANIC_ON_OOPS_VALUE=0
CONFIG_PANIC_TIMEOUT=0
CONFIG_LOCKUP_DETECTOR=y
CONFIG_SOFTLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_HAVE_HARDLOCKUP_DETECTOR_BUDDY=y
CONFIG_HARDLOCKUP_DETECTOR=y
# CONFIG_HARDLOCKUP_DETECTOR_PREFER_BUDDY is not set
CONFIG_HARDLOCKUP_DETECTOR_PERF=y
# CONFIG_HARDLOCKUP_DETECTOR_BUDDY is not set
# CONFIG_HARDLOCKUP_DETECTOR_ARCH is not set
CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER=y
CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
# CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set
# CONFIG_DETECT_HUNG_TASK is not set
# CONFIG_WQ_WATCHDOG is not set
# CONFIG_WQ_CPU_INTENSIVE_REPORT is not set
# CONFIG_TEST_LOCKUP is not set
# end of Debug Oops, Lockups and Hangs

#
# Scheduler Debugging
#
CONFIG_SCHED_DEBUG=y
CONFIG_SCHED_INFO=y
CONFIG_SCHEDSTATS=y
# end of Scheduler Debugging

# CONFIG_DEBUG_TIMEKEEPING is not set

#
# Lock Debugging (spinlocks, mutexes, etc...)
#
CONFIG_LOCK_DEBUGGING_SUPPORT=y
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set
# CONFIG_DEBUG_RWSEMS is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_DEBUG_ATOMIC_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
# CONFIG_SCF_TORTURE_TEST is not set
# CONFIG_CSD_LOCK_WAIT_DEBUG is not set
# end of Lock Debugging (spinlocks, mutexes, etc...)

# CONFIG_NMI_CHECK_CPU is not set
# CONFIG_DEBUG_IRQFLAGS is not set
CONFIG_STACKTRACE=y
# CONFIG_WARN_ALL_UNSEEDED_RANDOM is not set
# CONFIG_DEBUG_KOBJECT is not set

#
# Debug kernel data structures
#
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_PLIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_DEBUG_MAPLE_TREE is not set
# end of Debug kernel data structures

# CONFIG_DEBUG_CREDENTIALS is not set

#
# RCU Debugging
#
# CONFIG_RCU_SCALE_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_REF_SCALE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=21
CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
# CONFIG_RCU_CPU_STALL_CPUTIME is not set
CONFIG_RCU_TRACE=y
# CONFIG_RCU_EQS_DEBUG is not set
# end of RCU Debugging

# CONFIG_DEBUG_WQ_FORCE_RR_CPU is not set
# CONFIG_CPU_HOTPLUG_STATE_CONTROL is not set
# CONFIG_LATENCYTOP is not set
# CONFIG_DEBUG_CGROUP_REF is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_RETHOOK=y
CONFIG_RETHOOK=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_RETVAL=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=y
CONFIG_HAVE_DYNAMIC_FTRACE_NO_PATCHABLE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_FENTRY=y
CONFIG_HAVE_OBJTOOL_MCOUNT=y
CONFIG_HAVE_OBJTOOL_NOP_MCOUNT=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_HAVE_BUILDTIME_MCOUNT_SORT=y
CONFIG_BUILDTIME_MCOUNT_SORT=y
CONFIG_TRACE_CLOCK=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
# CONFIG_BOOTTIME_TRACING is not set
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
# CONFIG_FUNCTION_GRAPH_RETVAL is not set
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_DYNAMIC_FTRACE_WITH_ARGS=y
# CONFIG_FPROBE is not set
# CONFIG_FUNCTION_PROFILER is not set
# CONFIG_STACK_TRACER is not set
# CONFIG_IRQSOFF_TRACER is not set
# CONFIG_SCHED_TRACER is not set
# CONFIG_HWLAT_TRACER is not set
# CONFIG_OSNOISE_TRACER is not set
# CONFIG_TIMERLAT_TRACER is not set
# CONFIG_MMIOTRACE is not set
CONFIG_FTRACE_SYSCALLS=y
# CONFIG_TRACER_SNAPSHOT is not set
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_PROFILE_ALL_BRANCHES is not set
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_PROBE_EVENTS_BTF_ARGS=y
CONFIG_KPROBE_EVENTS=y
# CONFIG_KPROBE_EVENTS_ON_NOTRACE is not set
CONFIG_UPROBE_EVENTS=y
CONFIG_BPF_EVENTS=y
CONFIG_DYNAMIC_EVENTS=y
CONFIG_PROBE_EVENTS=y
# CONFIG_BPF_KPROBE_OVERRIDE is not set
CONFIG_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE_MCOUNT_USE_OBJTOOL=y
# CONFIG_SYNTH_EVENTS is not set
# CONFIG_USER_EVENTS is not set
# CONFIG_HIST_TRIGGERS is not set
# CONFIG_TRACE_EVENT_INJECT is not set
# CONFIG_TRACEPOINT_BENCHMARK is not set
# CONFIG_RING_BUFFER_BENCHMARK is not set
# CONFIG_TRACE_EVAL_MAP_FILE is not set
# CONFIG_FTRACE_RECORD_RECURSION is not set
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_FTRACE_SORT_STARTUP_TEST is not set
# CONFIG_RING_BUFFER_STARTUP_TEST is not set
# CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS is not set
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set
# CONFIG_KPROBE_EVENT_GEN_TEST is not set
# CONFIG_RV is not set
CONFIG_PROVIDE_OHCI1394_DMA_INIT=y
# CONFIG_SAMPLES is not set
CONFIG_HAVE_SAMPLE_FTRACE_DIRECT=y
CONFIG_HAVE_SAMPLE_FTRACE_DIRECT_MULTI=y
CONFIG_ARCH_HAS_DEVMEM_IS_ALLOWED=y
CONFIG_STRICT_DEVMEM=y
# CONFIG_IO_STRICT_DEVMEM is not set

#
# x86 Debugging
#
CONFIG_EARLY_PRINTK_USB=y
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
CONFIG_EARLY_PRINTK_DBGP=y
# CONFIG_EARLY_PRINTK_USB_XDBC is not set
# CONFIG_EFI_PGT_DUMP is not set
# CONFIG_DEBUG_TLBFLUSH is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
# CONFIG_X86_DECODER_SELFTEST is not set
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEBUG_BOOT_PARAMS=y
# CONFIG_CPA_DEBUG is not set
# CONFIG_DEBUG_ENTRY is not set
# CONFIG_DEBUG_NMI_SELFTEST is not set
CONFIG_X86_DEBUG_FPU=y
# CONFIG_PUNIT_ATOM_DEBUG is not set
# CONFIG_UNWINDER_ORC is not set
CONFIG_UNWINDER_FRAME_POINTER=y
# CONFIG_UNWINDER_GUESS is not set
# end of x86 Debugging

#
# Kernel Testing and Coverage
#
# CONFIG_KUNIT is not set
# CONFIG_NOTIFIER_ERROR_INJECTION is not set
CONFIG_FUNCTION_ERROR_INJECTION=y
# CONFIG_FAULT_INJECTION is not set
CONFIG_ARCH_HAS_KCOV=y
CONFIG_CC_HAS_SANCOV_TRACE_PC=y
# CONFIG_KCOV is not set
CONFIG_RUNTIME_TESTING_MENU=y
# CONFIG_TEST_DHRY is not set
CONFIG_LKDTM=y
# CONFIG_TEST_MIN_HEAP is not set
# CONFIG_TEST_DIV64 is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_TEST_REF_TRACKER is not set
# CONFIG_RBTREE_TEST is not set
# CONFIG_REED_SOLOMON_TEST is not set
# CONFIG_INTERVAL_TREE_TEST is not set
# CONFIG_PERCPU_TEST is not set
# CONFIG_ATOMIC64_SELFTEST is not set
# CONFIG_TEST_HEXDUMP is not set
# CONFIG_STRING_SELFTEST is not set
# CONFIG_TEST_STRING_HELPERS is not set
# CONFIG_TEST_KSTRTOX is not set
# CONFIG_TEST_PRINTF is not set
# CONFIG_TEST_SCANF is not set
# CONFIG_TEST_BITMAP is not set
# CONFIG_TEST_UUID is not set
# CONFIG_TEST_XARRAY is not set
# CONFIG_TEST_MAPLE_TREE is not set
# CONFIG_TEST_RHASHTABLE is not set
# CONFIG_TEST_IDA is not set
# CONFIG_TEST_LKM is not set
# CONFIG_TEST_BITOPS is not set
# CONFIG_TEST_VMALLOC is not set
# CONFIG_TEST_USER_COPY is not set
CONFIG_TEST_BPF=m
# CONFIG_TEST_BLACKHOLE_DEV is not set
# CONFIG_FIND_BIT_BENCHMARK is not set
# CONFIG_TEST_FIRMWARE is not set
# CONFIG_TEST_SYSCTL is not set
# CONFIG_TEST_UDELAY is not set
# CONFIG_TEST_STATIC_KEYS is not set
# CONFIG_TEST_KMOD is not set
# CONFIG_TEST_MEMCAT_P is not set
# CONFIG_TEST_MEMINIT is not set
# CONFIG_TEST_FREE_PAGES is not set
# CONFIG_TEST_FPU is not set
# CONFIG_TEST_CLOCKSOURCE_WATCHDOG is not set
# CONFIG_TEST_OBJPOOL is not set
CONFIG_ARCH_USE_MEMTEST=y
# CONFIG_MEMTEST is not set
# end of Kernel Testing and Coverage

#
# Rust hacking
#
# end of Rust hacking
# end of Kernel hacking

CONFIG_DIORITE_IDPF=y
CONFIG_DIORITE_IDPF_HW=m
# CONFIG_IDPF_GGL_PET_GBMC_WATCHDOG is not set

2023-12-14 18:38:46

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > >
> > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > >
> > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > >
> > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > >
> > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > >
> > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > on:
> > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > buffer pools (anon).
> > > > > > >
> > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > moving average smooths out the spike quickly.
> > > > > > >
> > > > > > > To fix the problem:
> > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > through file descriptors are worth protecting.)
> > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > similar to the above.
> > > > > > >
> > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > Before After Change
> > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > >
> > > > > > Hi Yu,
> > > > > >
> > > > > > Thanks you for your amazing works on MGLRU.
> > > > > >
> > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > https://lwn.net/Articles/945266/
> > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > multiple workloads.
> > > > > >
> > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > upstream.
> > > > > >
> > > > > > I think both this patch and my previous series are for solving the
> > > > > > file pages underpertected issue, and I did a quick test using this
> > > > > > series, for mongodb test, refault distance seems still a better
> > > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > > though, just they do have some conflicts in implementation and solving
> > > > > > similar problem):
> > > > > >
> > > > > > Previous result:
> > > > > > ==================================================================
> > > > > > Execution Results after 905 seconds
> > > > > > ------------------------------------------------------------------
> > > > > > Executed Time (µs) Rate
> > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > ------------------------------------------------------------------
> > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > >
> > > > > > This patch:
> > > > > > ==================================================================
> > > > > > Execution Results after 900 seconds
> > > > > > ------------------------------------------------------------------
> > > > > > Executed Time (µs) Rate
> > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > ------------------------------------------------------------------
> > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > >
> > > > > > Unpatched version is always around ~500.
> > > > >
> > > > > Thanks for the test results!
> > > > >
> > > > > > I think there are a few points here:
> > > > > > - Refault distance make use of page shadow so it can better
> > > > > > distinguish evicted pages of different access pattern (re-access
> > > > > > distance).
> > > > > > - Throttled refault distance can help hold part of workingset when
> > > > > > memory is too small to hold the whole workingset.
> > > > > >
> > > > > > So maybe part of this patch and the bits of previous series can be
> > > > > > combined to work better on this issue, how do you think?
> > > > >
> > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > >
> > > Hi Yu,
> > >
> > > I'm working on V4 of the RFC now, which just update some comments, and
> > > skip anon page re-activation in refault path for mglru which was not
> > > very helpful, only some tiny adjustment.
> > > And I found it easier to test with fio, using following test script:
> > >
> > > #!/bin/bash
> > > swapoff -a
> > >
> > > modprobe brd rd_nr=1 rd_size=16777216
> > > mkfs.ext4 /dev/ram0
> > > mount /dev/ram0 /mnt
> > >
> > > mkdir -p /sys/fs/cgroup/benchmark
> > > cd /sys/fs/cgroup/benchmark
> > >
> > > echo 4G > memory.max
> > > echo $$ > cgroup.procs
> > > echo 3 > /proc/sys/vm/drop_caches
> > >
> > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > >
> > > zipf:0.5 is used here to simulate a cached read with slight bias
> > > towards certain pages.
> > > Unpatched 6.7-rc4:
> > > Run status group 0 (all jobs):
> > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > >
> > > Patched with RFC v4:
> > > Run status group 0 (all jobs):
> > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > >
> > > Patched with this series:
> > > Run status group 0 (all jobs):
> > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > >
> > > MGLRU off:
> > > Run status group 0 (all jobs):
> > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > >
> > > - If I change zipf:0.5 to random:
> > > Unpatched 6.7-rc4:
> > > Patched with this series:
> > > Run status group 0 (all jobs):
> > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > >
> > > Patched with RFC v4:
> > > Run status group 0 (all jobs):
> > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > >
> > > Patched with this series:
> > > Run status group 0 (all jobs):
> > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > >
> > > MGLRU off:
> > > Run status group 0 (all jobs):
> > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > >
> > > fio uses ramdisk so LRU accuracy will have smaller impact. The Mongodb
> > > test I provided before uses a SATA SSD so it will have a much higher
> > > impact. I'll provides a script to setup the test case and run it, it's
> > > more complex to setup than fio since involving setting up multiple
> > > replicas and auth and hundreds of GB of test fixtures, I'm currently
> > > occupied by some other tasks but will try best to send them out as
> > > soon as possible.
> >
> > Thanks! Apparently your RFC did show better IOPS with both access
> > patterns, which was a surprise to me because it had higher refaults
> > and usually higher refautls result in worse performance.
> >
> > So I'm still trying to figure out why it turned out the opposite. My
> > current guess is that:
> > 1. It had a very small but stable inactive LRU list, which was able to
> > fit into the L3 cache entirely.
> > 2. It counted few folios as workingset and therefore incurred less
> > overhead from CONFIG_PSI and/or CONFIG_TASK_DELAY_ACCT.
> >
> > Did you save workingset_refault_file when you ran the test? If so, can
> > you check the difference between this series and your RFC?
>
>
> It seems I was right about #1 above. After I scaled your test up by 20x,
> I saw my series performed ~5% faster with zipf and ~9% faster with random
> accesses.

Hi Yu,

Thank you so much for testing and sharing this result.

I'm not sure about #1: the ramdisk size and the accessed data are far
larger than the L3 (16M on my CPU) even in the scaled-down test, and
both random and zipf show similar results.

>
> IOW, I made rd_size from 16GB to 320GB, memory.max from 4GB to 80GB,
> --numjobs from 12 to 60 and --size from 1GB to 4GB.
>
> v6.7-c5 + this series
> =====================
>
> zipf
> ----
>
> mglru: (groupid=0, jobs=60): err= 0: pid=12155: Wed Dec 13 17:50:36 2023
> read: IOPS=5074k, BW=19.4GiB/s (20.8GB/s)(5807GiB/300007msec)
> slat (usec): min=36, max=109326, avg=363.67, stdev=1829.97
> clat (nsec): min=783, max=113292k, avg=1136755.10, stdev=3162056.05
> lat (usec): min=37, max=149232, avg=1500.43, stdev=3644.21
> clat percentiles (usec):
> | 1.00th=[ 490], 5.00th=[ 519], 10.00th=[ 537], 20.00th=[ 553],
> | 30.00th=[ 570], 40.00th=[ 586], 50.00th=[ 627], 60.00th=[ 840],
> | 70.00th=[ 988], 80.00th=[ 1074], 90.00th=[ 1188], 95.00th=[ 1336],
> | 99.00th=[ 7308], 99.50th=[31327], 99.90th=[36963], 99.95th=[45351],
> | 99.99th=[53216]
> bw ( MiB/s): min= 8332, max=27116, per=100.00%, avg=19846.67, stdev=58.20, samples=35903
> iops : min=2133165, max=6941826, avg=5080741.79, stdev=14899.13, samples=35903
> lat (nsec) : 1000=0.01%
> lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
> lat (usec) : 250=0.01%, 500=1.76%, 750=52.94%, 1000=16.65%
> lat (msec) : 2=26.22%, 4=0.15%, 10=1.36%, 20=0.01%, 50=0.90%
> lat (msec) : 100=0.02%, 250=0.01%
> cpu : usr=5.42%, sys=87.59%, ctx=470315, majf=0, minf=2184
> IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
> submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=100.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
> issued rwts: total=1522384845,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=128
>
> Run status group 0 (all jobs):
> READ: bw=19.4GiB/s (20.8GB/s), 19.4GiB/s-19.4GiB/s (20.8GB/s-20.8GB/s), io=5807GiB (6236GB), run=300007-300007msec
>
> Disk stats (read/write):
> ram0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> mglru: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=128
>
> random
> ------
>
> mglru: (groupid=0, jobs=60): err= 0: pid=12576: Wed Dec 13 18:00:50 2023
> read: IOPS=3853k, BW=14.7GiB/s (15.8GB/s)(4410GiB/300014msec)
> slat (usec): min=58, max=118605, avg=486.45, stdev=2311.45
> clat (usec): min=3, max=169810, avg=1496.60, stdev=3982.89
> lat (usec): min=73, max=170019, avg=1983.06, stdev=4585.87
> clat percentiles (usec):
> | 1.00th=[ 586], 5.00th=[ 627], 10.00th=[ 644], 20.00th=[ 668],
> | 30.00th=[ 693], 40.00th=[ 725], 50.00th=[ 816], 60.00th=[ 1123],
> | 70.00th=[ 1221], 80.00th=[ 1352], 90.00th=[ 1516], 95.00th=[ 1713],
> | 99.00th=[31851], 99.50th=[34866], 99.90th=[41681], 99.95th=[54264],
> | 99.99th=[61080]
> bw ( MiB/s): min= 6049, max=21328, per=100.00%, avg=15070.00, stdev=45.96, samples=35940
> iops : min=1548543, max=5459997, avg=3857912.87, stdev=11765.30, samples=35940
> lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 100=0.01%, 250=0.01%
> lat (usec) : 500=0.01%, 750=44.64%, 1000=8.20%
> lat (msec) : 2=43.84%, 4=0.27%, 10=1.79%, 20=0.01%, 50=1.20%
> lat (msec) : 100=0.07%, 250=0.01%
> cpu : usr=3.19%, sys=89.87%, ctx=463840, majf=0, minf=2248
> IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
> submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
> issued rwts: total=1155923744,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=128
>
> Run status group 0 (all jobs):
> READ: bw=14.7GiB/s (15.8GB/s), 14.7GiB/s-14.7GiB/s (15.8GB/s-15.8GB/s), io=4410GiB (4735GB), run=300014-300014msec
>
> Disk stats (read/write):
> ram0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>
> memcg 3 /zipf
> node 0
> 0 1521654 0 0x
> 0 0r 0e 0p 0 0 0
> 1 0r 0e 0p 0 0 0
> 2 0r 0e 0p 0 0 0
> 3 0r 0e 0p 0 0 0
> 0 0 0 0 0 0
> 1 1521654 0 21
> 0 0 0 0 1077016797r 1111542014e 0p
> 1 0 0 0 317997853r 324814007e 0p
> 2 0 0 0 68064253r 68866308e 124302p
> 3 0 0 0 0r 0e 12282816p
> 0 0 0 0 0 0
> 2 1521654 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 3 1521654 0 0
> 0 0R 0T 0 0R 0T 0
> 1 0R 0T 0 0R 0T 0
> 2 0R 0T 0 0R 0T 0
> 3 0R 0T 0 0R 0T 0
> 0L 0O 0Y 0N 0F 0A
> node 1
> 0 1521654 0 0
> 0 0r 0e 0p 0r 0e 0p
> 1 0r 0e 0p 0r 0e 0p
> 2 0r 0e 0p 0r 0e 0p
> 3 0r 0e 0p 0r 0e 0p
> 0 0 0 0 0 0
> 1 1521654 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 2 1521654 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 3 1521654 0 0
> 0 0R 0T 0 0R 0T 0
> 1 0R 0T 0 0R 0T 0
> 2 0R 0T 0 0R 0T 0
> 3 0R 0T 0 0R 0T 0
> 0L 0O 0Y 0N 0F 0A
> memcg 4 /random
> node 0
> 0 600431 0 0x
> 0 0r 0e 0p 0 0 0
> 1 0r 0e 0p 0 0 0
> 2 0r 0e 0p 0 0 0
> 3 0r 0e 0p 0 0 0
> 0 0 0 0 0 0
> 1 600431 0 11169201
> 0 0 0 0 1071724785r 1103937007e 0p
> 1 0 0 0 376193810r 384852629e 0p
> 2 0 0 0 77315518r 78596395e 0p
> 3 0 0 0 0r 0e 9593442p
> 0 0 0 0 0 0
> 2 600431 1 9593442
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 3 600431 36 754
> 0 0R 0T 0 0R 0T 0
> 1 0R 0T 0 0R 0T 0
> 2 0R 0T 0 0R 0T 0
> 3 0R 0T 0 0R 0T 0
> 0L 0O 0Y 0N 0F 0A
> node 1
> 0 600431 0 0
> 0 0r 0e 0p 0r 0e 0p
> 1 0r 0e 0p 0r 0e 0p
> 2 0r 0e 0p 0r 0e 0p
> 3 0r 0e 0p 0r 0e 0p
> 0 0 0 0 0 0
> 1 600431 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 2 600431 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 3 600431 0 0
> 0 0R 0T 0 0R 0T 0
> 1 0R 0T 0 0R 0T 0
> 2 0R 0T 0 0R 0T 0
> 3 0R 0T 0 0R 0T 0
> 0L 0O 0Y 0N 0F 0A
>
> v6.7-c5 + RFC v3
> ================
>
> zipf
> ----
>
> mglru: (groupid=0, jobs=60): err= 0: pid=11600: Wed Dec 13 18:34:31 2023
> read: IOPS=4816k, BW=18.4GiB/s (19.7GB/s)(5512GiB/300014msec)
> slat (usec): min=3, max=121722, avg=384.46, stdev=2066.10
> clat (nsec): min=356, max=174717k, avg=1197513.60, stdev=3568734.58
> lat (usec): min=3, max=174919, avg=1581.97, stdev=4112.49
> clat percentiles (usec):
> | 1.00th=[ 486], 5.00th=[ 515], 10.00th=[ 529], 20.00th=[ 553],
> | 30.00th=[ 570], 40.00th=[ 594], 50.00th=[ 652], 60.00th=[ 898],
> | 70.00th=[ 988], 80.00th=[ 1139], 90.00th=[ 1254], 95.00th=[ 1369],
> | 99.00th=[ 6915], 99.50th=[35914], 99.90th=[42206], 99.95th=[52167],
> | 99.99th=[61604]
> bw ( MiB/s): min= 7716, max=26325, per=100.00%, avg=18836.65, stdev=57.20, samples=35880
> iops : min=1975306, max=6739280, avg=4822176.85, stdev=14642.35, samples=35880
> lat (nsec) : 500=0.01%, 750=0.01%, 1000=0.01%
> lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 100=0.01%, 250=0.01%
> lat (usec) : 500=2.57%, 750=50.99%, 1000=17.56%
> lat (msec) : 2=26.41%, 4=0.16%, 10=1.41%, 20=0.01%, 50=0.84%
> lat (msec) : 100=0.05%, 250=0.01%
> cpu : usr=4.95%, sys=88.09%, ctx=457609, majf=0, minf=2184
> IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
> submit : 0=0.0%, 4=0.1%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
> issued rwts: total=1445015808,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=128
>
> Run status group 0 (all jobs):
> READ: bw=18.4GiB/s (19.7GB/s), 18.4GiB/s-18.4GiB/s (19.7GB/s-19.7GB/s), io=5512GiB (5919GB), run=300014-300014msec
>
> Disk stats (read/write):
> ram0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> mglru: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=128
>
> random
> ------
>
> mglru: (groupid=0, jobs=60): err= 0: pid=12024: Wed Dec 13 18:44:45 2023
> read: IOPS=3519k, BW=13.4GiB/s (14.4GB/s)(4027GiB/300011msec)
> slat (usec): min=54, max=136278, avg=534.57, stdev=2738.72
> clat (usec): min=3, max=176186, avg=1638.66, stdev=4714.55
> lat (usec): min=78, max=176426, avg=2173.23, stdev=5426.40
> clat percentiles (usec):
> | 1.00th=[ 627], 5.00th=[ 676], 10.00th=[ 693], 20.00th=[ 725],
> | 30.00th=[ 766], 40.00th=[ 816], 50.00th=[ 1090], 60.00th=[ 1205],
> | 70.00th=[ 1270], 80.00th=[ 1369], 90.00th=[ 1500], 95.00th=[ 1614],
> | 99.00th=[38536], 99.50th=[41681], 99.90th=[47973], 99.95th=[65799],
> | 99.99th=[72877]
> bw ( MiB/s): min= 5586, max=20476, per=100.00%, avg=13760.26, stdev=45.33, samples=35904
> iops : min=1430070, max=5242110, avg=3522621.15, stdev=11604.46, samples=35904
> lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 100=0.01%, 250=0.01%
> lat (usec) : 500=0.01%, 750=26.33%, 1000=21.81%
> lat (msec) : 2=48.54%, 4=0.16%, 10=1.91%, 20=0.01%, 50=1.17%
> lat (msec) : 100=0.09%, 250=0.01%
> cpu : usr=2.74%, sys=90.35%, ctx=481356, majf=0, minf=2244
> IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
> submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
> issued rwts: total=1055590880,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=128
>
> Run status group 0 (all jobs):
> READ: bw=13.4GiB/s (14.4GB/s), 13.4GiB/s-13.4GiB/s (14.4GB/s-14.4GB/s), io=4027GiB (4324GB), run=300011-300011msec
>
> Disk stats (read/write):
> ram0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>
> memcg 3 /zipf
> node 0
> 0 1522519 0 22
> 0 0r 0e 0p 996363383r 1092111170e 0p
> 1 0r 0e 0p 274581982r 235766575e 0p
> 2 0r 0e 0p 85176438r 71356676e 96114p
> 3 0r 0e 0p 12470364r 11510461e 221796p
> 0 0 0 0 0 0
> 1 1522519 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 2 1522519 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 3 1522519 0 0
> 0 0R 0T 0 0R 0T 0
> 1 0R 0T 0 0R 0T 0
> 2 0R 0T 0 0R 0T 0
> 3 0R 0T 0 0R 0T 0
> 0L 0O 0Y 0N 0F 0A
> node 1
> 0 1522519 0 0
> 0 0r 0e 0p 0r 0e 0p
> 1 0r 0e 0p 0r 0e 0p
> 2 0r 0e 0p 0r 0e 0p
> 3 0r 0e 0p 0r 0e 0p
> 0 0 0 0 0 0
> 1 1522519 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 2 1522519 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 3 1522519 0 0
> 0 0R 0T 0 0R 0T 0
> 1 0R 0T 0 0R 0T 0
> 2 0R 0T 0 0R 0T 0
> 3 0R 0T 0 0R 0T 0
> 0L 0O 0Y 0N 0F 0A
> memcg 4 /random
> node 0
> 0 600413 0 2289676
> 0 0r 0e 0p 875605725r 960492874e 0p
> 1 0r 0e 0p 411230731r 383704269e 0p
> 2 0r 0e 0p 112639317r 97774351e 0p
> 3 0r 0e 0p 2103334r 1766407e 0p
> 0 0 0 0 0 0
> 1 600413 1 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 2 600413 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 3 600413 35 18466878
> 0 0R 0T 0 0R 0T 0
> 1 0R 0T 0 0R 0T 0
> 2 0R 0T 0 0R 0T 0
> 3 0R 0T 0 0R 0T 0
> 0L 0O 0Y 0N 0F 0A
> node 1
> 0 600413 0 0
> 0 0r 0e 0p 0r 0e 0p
> 1 0r 0e 0p 0r 0e 0p
> 2 0r 0e 0p 0r 0e 0p
> 3 0r 0e 0p 0r 0e 0p
> 0 0 0 0 0 0
> 1 600413 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 2 600413 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 3 600413 0 0
> 0 0R 0T 0 0R 0T 0
> 1 0R 0T 0 0R 0T 0
> 2 0R 0T 0 0R 0T 0
> 3 0R 0T 0 0R 0T 0
> 0L 0O 0Y 0N 0F 0A

And I reran the scaled-down zipf test again:

RFC:
Jobs: 12 (f=12): [r(12)][100.0%][r=7267MiB/s][r=1860k IOPS][eta 00m:00s]
mglru: (groupid=0, jobs=12): err= 0: pid=5159: Thu Dec 14 23:57:01 2023
read: IOPS=1862k, BW=7274MiB/s (7628MB/s)(2131GiB/300001msec)
slat (usec): min=60, max=4711, avg=195.05, stdev=138.41
clat (usec): min=2, max=5097, avg=619.70, stdev=215.90
lat (usec): min=112, max=5271, avg=814.78, stdev=237.75
clat percentiles (usec):
| 1.00th=[ 388], 5.00th=[ 408], 10.00th=[ 424], 20.00th=[ 457],
| 30.00th=[ 482], 40.00th=[ 502], 50.00th=[ 523], 60.00th=[ 545],
| 70.00th=[ 603], 80.00th=[ 889], 90.00th=[ 988], 95.00th=[ 1037],
| 99.00th=[ 1106], 99.50th=[ 1139], 99.90th=[ 1237], 99.95th=[ 1369],
| 99.99th=[ 1483]
bw ( MiB/s): min= 6526, max= 8474, per=100.00%, avg=7284.26,
stdev=48.62, samples=7176
iops : min=1670753, max=2169575, avg=1864770.39,
stdev=12446.01, samples=7176
lat (usec) : 4=0.01%, 10=0.01%, 250=0.01%, 500=38.35%, 750=33.88%
lat (usec) : 1000=19.46%
lat (msec) : 2=8.30%, 4=0.01%, 10=0.01%
cpu : usr=8.62%, sys=91.24%, ctx=531703, majf=0, minf=700
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
issued rwts: total=558664800,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
READ: bw=7274MiB/s (7628MB/s), 7274MiB/s-7274MiB/s
(7628MB/s-7628MB/s), io=2131GiB (2288GB), run=300001-300001msec

workingset_refault_file 628192729

memcg 73 /benchmark
node 0
0 1092186 0 0x
0 0r 0e 0p 0 0 0
1 0r 0e 0p 0 0 0
2 0r 0e 0p 0 0 0
3 0r 0e 0p 0 0 0
0 0 0 0 0 0
1 1092186 0 4283
0 0 0 0 507816078r 511714221e 0p
1 0 0 0 4682206r 3201136e 0p
2 0 0 0 64762r 43587e 0p
3 0 0 0 0r 0e 0p
0 0 0 0 0 0
2 1092186 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
3 1092186 0 750308
0 0R 0T 0 49689099R 52516254T 0
1 0R 0T 0 5786054R 5786054T 0
2 0R 0T 0 1140749R 1140749T 0
3 0R 0T 0 0R 0T 0
0L 0O 0Y 0N 0F 0A

This series:
Jobs: 12 (f=12): [r(12)][100.0%][r=6447MiB/s][r=1650k IOPS][eta 00m:00s]
mglru: (groupid=0, jobs=12): err= 0: pid=3665: Fri Dec 15 00:16:06 2023
read: IOPS=1830k, BW=7148MiB/s (7495MB/s)(2094GiB/300001msec)
slat (usec): min=59, max=35006, avg=198.58, stdev=201.99
clat (nsec): min=972, max=37489k, avg=630651.61, stdev=384748.50
lat (usec): min=108, max=39688, avg=829.26, stdev=461.06
clat percentiles (usec):
| 1.00th=[ 355], 5.00th=[ 379], 10.00th=[ 392], 20.00th=[ 424],
| 30.00th=[ 478], 40.00th=[ 510], 50.00th=[ 529], 60.00th=[ 553],
| 70.00th=[ 635], 80.00th=[ 898], 90.00th=[ 1012], 95.00th=[ 1090],
| 99.00th=[ 1221], 99.50th=[ 1401], 99.90th=[ 2606], 99.95th=[ 3654],
| 99.99th=[18220]
bw ( MiB/s): min= 4870, max= 9145, per=100.00%, avg=7157.39,
stdev=81.13, samples=7176
iops : min=1246811, max=2341342, avg=1832289.80,
stdev=20768.76, samples=7176
lat (nsec) : 1000=0.01%
lat (usec) : 4=0.01%, 10=0.01%, 250=0.01%, 500=36.53%, 750=36.20%
lat (usec) : 1000=15.90%
lat (msec) : 2=11.18%, 4=0.15%, 10=0.02%, 20=0.01%, 50=0.01%
cpu : usr=8.59%, sys=91.27%, ctx=512635, majf=0, minf=711
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
issued rwts: total=548956313,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
READ: bw=7148MiB/s (7495MB/s), 7148MiB/s-7148MiB/s
(7495MB/s-7495MB/s), io=2094GiB (2249GB), run=300001-300001msec

workingset_refault_file 596790506

memcg 68 /benchmark
node 0
122 160248 0 0x
0 0r 0e 0p 0 0 0
1 0r 0e 0p 0 0 0
2 0r 0e 0p 0 0 0
3 0r 0e 0p 0 0 0
0 0 0 0 0 0
123 155360 0 239405
0 0 0 0 301462r 1186271e 0p
1 0 0 0 80013r 218961e 0p
2 0 0 0 0r 0e 516139p
3 0 0 0 0r 0e 0p
0 0 0 0 0 0
124 150495 0 516188
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
125 145582 0 1345
0 0R 0T 0 2577270R 4518284T 0
1 0R 0T 0 290933R 369324T 0
2 0R 0T 0 0R 752170T 0
3 0R 0T 0 0R 0T 0
388483L 17226O 18419Y 95408N 1314F 578A

I think the problem might be that this series ages faster and so has
higher overhead in some cases. In your test the workload is scaled up,
so MGLRU just keeps reclaiming the last generation with no aging,
while my RFC adds extra overhead due to workingset checking and memcg
flushing (the memcg flushing patch in the unstable tree may help?).
Also, the current refault distance checking model is simply glued onto
MGLRU and has some known issues; the most obvious one is that the
refault distance check can't prevent the file page underprotection
issue at all when the active list is small or empty, and using
active/inactive is not accurate enough for MGLRU, so it doesn't
perform well enough.
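
(To make the aging point easier to check, here is a small illustrative
helper -- assuming a kernel that exposes the MGLRU debugfs interface at
/sys/kernel/debug/lru_gen -- which just samples the generation list and
the refault counters during a run, to see whether new generations are
being created or eviction keeps hitting the oldest one:)

#!/bin/bash
# Illustrative only: sample MGLRU generations and refault counters
# every 10 seconds while the benchmark runs.
while sleep 10; do
    date
    grep workingset_refault /proc/vmstat
    # per-memcg, per-node list of generations (seq, age in ms, sizes)
    cat /sys/kernel/debug/lru_gen
done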

As for the MongoDB test, I still haven't had time to tidy up the setup
scripts and the modified repo, sorry about that; in the past few days
I've only had time to look at this issue late at night... but a quick
test shows an interesting reading too:

RFC:
==================================================================
Execution Results after 902 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 2544 27114484261.0 0.09 txn/s
------------------------------------------------------------------
TOTAL 2544 27114484261.0 0.09 txn/s

workingset_refault_anon 10512
workingset_refault_file 22751782

memcg 44 /system.slice/docker-1313de5323016713a0efa95d3b3f1aeafc9f43df80051bd013f3d29f1e13fa58.scope
node 0
12 190714 41736 640699
0 0r 2e 0p 0r 1293703e 0p
1 0r 0e 0p 0r 0e 463477p
2 0r 0e 0p 0r 0e 5029378p
3 0r 0e 0p 0r 0e 0p
0 0 0 0 0 0
13 139686 462351 5483828
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
14 86529 692892 3795
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
15 41548 47767 366
0 12R 1113T 0 3497R 1857252T 0
1 0R 0T 0 1000193R 1692818T 0
2 0R 0T 0 0R 5422505T 0
3 0R 0T 0 0R 0T 0
3889671L 42917O 3674613Y 11910N 7609F 7547A

This series:
==================================================================
Execution Results after 904 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 1668 27108414456.6 0.06 txn/s
------------------------------------------------------------------
TOTAL 1668 27108414456.6 0.06 txn/s

workingset_refault_anon 35277
workingset_refault_file 20335355

memcg 77 /system.slice/docker-731f3d33dca1dbea9d763a7a9519bb92c4ca1bbdb06c6a23d5203f8baad97f6e.scope
node 0
14 218191 0x 0x
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
15 170722 1923 6172558
0 0r 0e 0p 9r 29052e 0p
1 0r 0e 0p 0r 10643e 0p
2 0r 0e 0p 0r 0e 5714p
3 0r 0e 0p 0r 0e 0p
0 0 0 0 0 0
16 127628 1223689 10249
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
17 79949 40444 408
0 1413R 5628T 0 352479R 1259370T 0
1 0R 0T 0 252950R 439843T 0
2 0R 1T 0 0R 5083446T 0
3 0R 0T 0 0R 0T 0
18667726L 229222O 17641112Y 40116N 36473F 35963A

I've turned off all unrelated features (PSI, delayacct) for the above
tests. When PSI is on, the MongoDB test shows 70 - 100 PSI SOME, as
it's not using a very high performance disk.
I think this could suggest that evicting file pages early is sometimes
not that costly. And since page shadows can store fine-grained data
about a page's access distance, maybe I can tune the refault distance
checking model for MGLRU and combine it with this series, which may
help make the protection policy more balanced (not too fast, and still
accurate)?
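
(For completeness, a minimal sketch of how those features can be
toggled on a recent kernel -- assuming the kernel.task_delayacct sysctl
and the psi= boot parameter are available; these are not the exact
commands used for the runs above:)

# delay accounting can be toggled at runtime; PSI is controlled by the
# psi=0/psi=1 kernel command line parameter (or CONFIG_PSI=n at build time)
sysctl kernel.task_delayacct=0
cat /proc/pressure/io 2>/dev/null || echo 'PSI disabled'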

2023-12-14 23:51:52

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <[email protected]> wrote:
>
> Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > > >
> > > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > > >
> > > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > > >
> > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > > >
> > > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > > >
> > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > on:
> > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > buffer pools (anon).
> > > > > > > >
> > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > moving average smooths out the spike quickly.
> > > > > > > >
> > > > > > > > To fix the problem:
> > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > > through file descriptors are worth protecting.)
> > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > > similar to the above.
> > > > > > > >
> > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > Before After Change
> > > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > > >
> > > > > > > Hi Yu,
> > > > > > >
> > > > > > > Thanks you for your amazing works on MGLRU.
> > > > > > >
> > > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > > https://lwn.net/Articles/945266/
> > > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > > multiple workloads.
> > > > > > >
> > > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > > upstream.
> > > > > > >
> > > > > > > I think both this patch and my previous series are for solving the
> > > > > > > file pages underpertected issue, and I did a quick test using this
> > > > > > > series, for mongodb test, refault distance seems still a better
> > > > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > > > though, just they do have some conflicts in implementation and solving
> > > > > > > similar problem):
> > > > > > >
> > > > > > > Previous result:
> > > > > > > ==================================================================
> > > > > > > Execution Results after 905 seconds
> > > > > > > ------------------------------------------------------------------
> > > > > > > Executed Time (µs) Rate
> > > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > > ------------------------------------------------------------------
> > > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > > >
> > > > > > > This patch:
> > > > > > > ==================================================================
> > > > > > > Execution Results after 900 seconds
> > > > > > > ------------------------------------------------------------------
> > > > > > > Executed Time (µs) Rate
> > > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > > ------------------------------------------------------------------
> > > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > > >
> > > > > > > Unpatched version is always around ~500.
> > > > > >
> > > > > > Thanks for the test results!
> > > > > >
> > > > > > > I think there are a few points here:
> > > > > > > - Refault distance make use of page shadow so it can better
> > > > > > > distinguish evicted pages of different access pattern (re-access
> > > > > > > distance).
> > > > > > > - Throttled refault distance can help hold part of workingset when
> > > > > > > memory is too small to hold the whole workingset.
> > > > > > >
> > > > > > > So maybe part of this patch and the bits of previous series can be
> > > > > > > combined to work better on this issue, how do you think?
> > > > > >
> > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > >
> > > > Hi Yu,
> > > >
> > > > I'm working on V4 of the RFC now, which just update some comments, and
> > > > skip anon page re-activation in refault path for mglru which was not
> > > > very helpful, only some tiny adjustment.
> > > > And I found it easier to test with fio, using following test script:
> > > >
> > > > #!/bin/bash
> > > > swapoff -a
> > > >
> > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > mkfs.ext4 /dev/ram0
> > > > mount /dev/ram0 /mnt
> > > >
> > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > cd /sys/fs/cgroup/benchmark
> > > >
> > > > echo 4G > memory.max
> > > > echo $$ > cgroup.procs
> > > > echo 3 > /proc/sys/vm/drop_caches
> > > >
> > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > >
> > > > zipf:0.5 is used here to simulate a cached read with slight bias
> > > > towards certain pages.
> > > > Unpatched 6.7-rc4:
> > > > Run status group 0 (all jobs):
> > > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > >
> > > > Patched with RFC v4:
> > > > Run status group 0 (all jobs):
> > > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > >
> > > > Patched with this series:
> > > > Run status group 0 (all jobs):
> > > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > >
> > > > MGLRU off:
> > > > Run status group 0 (all jobs):
> > > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > >
> > > > - If I change zipf:0.5 to random:
> > > > Unpatched 6.7-rc4:
> > > > Patched with this series:
> > > > Run status group 0 (all jobs):
> > > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > >
> > > > Patched with RFC v4:
> > > > Run status group 0 (all jobs):
> > > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > >
> > > > Patched with this series:
> > > > Run status group 0 (all jobs):
> > > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > >
> > > > MGLRU off:
> > > > Run status group 0 (all jobs):
> > > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > >
> > > > fio uses ramdisk so LRU accuracy will have smaller impact. The Mongodb
> > > > test I provided before uses a SATA SSD so it will have a much higher
> > > > impact. I'll provides a script to setup the test case and run it, it's
> > > > more complex to setup than fio since involving setting up multiple
> > > > replicas and auth and hundreds of GB of test fixtures, I'm currently
> > > > occupied by some other tasks but will try best to send them out as
> > > > soon as possible.
> > >
> > > Thanks! Apparently your RFC did show better IOPS with both access
> > > patterns, which was a surprise to me because it had higher refaults
> > > and usually higher refautls result in worse performance.
> > >
> > > So I'm still trying to figure out why it turned out the opposite. My
> > > current guess is that:
> > > 1. It had a very small but stable inactive LRU list, which was able to
> > > fit into the L3 cache entirely.
> > > 2. It counted few folios as workingset and therefore incurred less
> > > overhead from CONFIG_PSI and/or CONFIG_TASK_DELAY_ACCT.
> > >
> > > Did you save workingset_refault_file when you ran the test? If so, can
> > > you check the difference between this series and your RFC?
> >
> >
> > It seems I was right about #1 above. After I scaled your test up by 20x,
> > I saw my series performed ~5% faster with zipf and ~9% faster with random
> > accesses.
>
> Hi Yu,
>
> Thank you so much for testing and sharing this result.
>
> I'm not sure about #1, the ramdisk size, access data, are far larger
> than L3 (16M on my CPU) even in down scaled test, and both random/zipf
> shows similar result.

It's the LRU list, not the pages. IOW, the kernel data structure, not
the contents of the LRU pages. Does that make sense?
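
As a rough back-of-the-envelope (illustrative numbers only, assuming
4KiB folios and 16 bytes of list_head linkage per folio):

awk 'BEGIN { folios = 4 * 1024^3 / 4096; printf "list linkage: ~%.0f MiB vs page contents: 4096 MiB\n", folios * 16 / 1024^2 }'

The linkage (plus the rest of the folio metadata) is what reclaim
actually walks, so a small, stable inactive list can stay
cache-resident even though the page contents never fit in a 16M L3.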

> > IOW, I made rd_size from 16GB to 320GB, memory.max from 4GB to 80GB,
> > --numjobs from 12 to 60 and --size from 1GB to 4GB.

Would you be able to try a larger configuration like above instead?
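
Concretely, the scaled-up script would look roughly like this (your
script with the numbers above swapped in; brd takes rd_size in KiB, so
320GB is 335544320 -- untested as written here):

#!/bin/bash
swapoff -a

modprobe brd rd_nr=1 rd_size=335544320
mkfs.ext4 /dev/ram0
mount /dev/ram0 /mnt

mkdir -p /sys/fs/cgroup/benchmark
cd /sys/fs/cgroup/benchmark

echo 80G > memory.max
echo $$ > cgroup.procs
echo 3 > /proc/sys/vm/drop_caches

fio -name=mglru --numjobs=60 --directory=/mnt --size=4096m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=zipf:0.5 --norandommap \
    --time_based --ramp_time=5m --runtime=5m --group_reporting

(and the same with --random_distribution=random for the random run)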

2023-12-15 04:56:52

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Thu, Dec 14, 2023 at 04:51:00PM -0700, Yu Zhao wrote:
> On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <[email protected]> wrote:
> >
> > Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> > > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > > > >
> > > > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > > > >
> > > > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > > > >
> > > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > > > >
> > > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > > on:
> > > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > > buffer pools (anon).
> > > > > > > > >
> > > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > > moving average smooths out the spike quickly.
> > > > > > > > >
> > > > > > > > > To fix the problem:
> > > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > > > through file descriptors are worth protecting.)
> > > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > > > similar to the above.
> > > > > > > > >
> > > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > > Before After Change
> > > > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > > > >
> > > > > > > > Hi Yu,
> > > > > > > >
> > > > > > > > Thanks you for your amazing works on MGLRU.
> > > > > > > >
> > > > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > > > https://lwn.net/Articles/945266/
> > > > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > > > multiple workloads.
> > > > > > > >
> > > > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > > > upstream.
> > > > > > > >
> > > > > > > > I think both this patch and my previous series are for solving the
> > > > > > > > file pages underpertected issue, and I did a quick test using this
> > > > > > > > series, for mongodb test, refault distance seems still a better
> > > > > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > > > > though, just they do have some conflicts in implementation and solving
> > > > > > > > similar problem):
> > > > > > > >
> > > > > > > > Previous result:
> > > > > > > > ==================================================================
> > > > > > > > Execution Results after 905 seconds
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > Executed Time (µs) Rate
> > > > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > > > >
> > > > > > > > This patch:
> > > > > > > > ==================================================================
> > > > > > > > Execution Results after 900 seconds
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > Executed Time (µs) Rate
> > > > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > > > >
> > > > > > > > Unpatched version is always around ~500.
> > > > > > >
> > > > > > > Thanks for the test results!
> > > > > > >
> > > > > > > > I think there are a few points here:
> > > > > > > > - Refault distance make use of page shadow so it can better
> > > > > > > > distinguish evicted pages of different access pattern (re-access
> > > > > > > > distance).
> > > > > > > > - Throttled refault distance can help hold part of workingset when
> > > > > > > > memory is too small to hold the whole workingset.
> > > > > > > >
> > > > > > > > So maybe part of this patch and the bits of previous series can be
> > > > > > > > combined to work better on this issue, how do you think?
> > > > > > >
> > > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > > >
> > > > > Hi Yu,
> > > > >
> > > > > I'm working on V4 of the RFC now, which just update some comments, and
> > > > > skip anon page re-activation in refault path for mglru which was not
> > > > > very helpful, only some tiny adjustment.
> > > > > And I found it easier to test with fio, using following test script:
> > > > >
> > > > > #!/bin/bash
> > > > > swapoff -a
> > > > >
> > > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > > mkfs.ext4 /dev/ram0
> > > > > mount /dev/ram0 /mnt
> > > > >
> > > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > > cd /sys/fs/cgroup/benchmark
> > > > >
> > > > > echo 4G > memory.max
> > > > > echo $$ > cgroup.procs
> > > > > echo 3 > /proc/sys/vm/drop_caches
> > > > >
> > > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > > >
> > > > > zipf:0.5 is used here to simulate a cached read with slight bias
> > > > > towards certain pages.
> > > > > Unpatched 6.7-rc4:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > > >
> > > > > Patched with RFC v4:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > > >
> > > > > Patched with this series:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > > >
> > > > > MGLRU off:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > > >
> > > > > - If I change zipf:0.5 to random:
> > > > > Unpatched 6.7-rc4:
> > > > > Patched with this series:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > > >
> > > > > Patched with RFC v4:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > > >
> > > > > Patched with this series:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > > >
> > > > > MGLRU off:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > > >
> > > > > fio uses ramdisk so LRU accuracy will have smaller impact. The Mongodb
> > > > > test I provided before uses a SATA SSD so it will have a much higher
> > > > > impact. I'll provides a script to setup the test case and run it, it's
> > > > > more complex to setup than fio since involving setting up multiple
> > > > > replicas and auth and hundreds of GB of test fixtures, I'm currently
> > > > > occupied by some other tasks but will try best to send them out as
> > > > > soon as possible.
> > > >
> > > > Thanks! Apparently your RFC did show better IOPS with both access
> > > > patterns, which was a surprise to me because it had higher refaults
> > > > and usually higher refautls result in worse performance.

And thanks for providing the refault counts I requested -- your data
below confirms what I mentioned above:

For fio:
Your RFC This series Change
workingset_refault_file 628192729 596790506 -5%
IOPS 1862k 1830k -2%

For MongoDB:
Your RFC This series Change
workingset_refault_anon 10512 35277 +236%
workingset_refault_file 22751782 20335355 -11%
total 22762294 20370632 -11%
TPS 0.09 0.06 -33%
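
(The percentages above can be recomputed directly from the quoted
counters, e.g.:)

awk 'BEGIN {
    printf "anon:  %+.0f%%\n", (35277 - 10512) / 10512 * 100
    printf "file:  %+.0f%%\n", (20335355 - 22751782) / 22751782 * 100
    printf "total: %+.0f%%\n", (20370632 - 22762294) / 22762294 * 100
}'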

For MongoDB, this series should be a big win (but apparently it's not),
especially when using zram, since an anon refault should be a lot
cheaper than a file refault.

So, I'm baffled...

One important detail I forgot to mention: based on your data from
lru_gen_full, I think there is another difference between our Kconfigs:

Your Kconfig My Kconfig Max possible
LRU_REFS_WIDTH 1 2 2

This means you can only track 3 accesses through file descriptors
at most, and when you hit 3, the folio is moved to the next generation
by sort_folio() in the eviction path. IOW, your aging runs faster than
mine sine more folios are moved to the next generation (mine only does
so when it hits 5).

In case you want to try a larger LRU_REFS_WIDTH, you can make
CONFIG_NODES_SHIFT smaller or disable CONFIG_IDLE_PAGE_TRACKING.
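
A quick way to check the relevant knobs in a build tree (illustrative;
LRU_REFS_WIDTH itself is derived at build time from the leftover page
flag bits, so only the options that consume those bits show up in
.config):

grep -E 'CONFIG_NODES_SHIFT|CONFIG_IDLE_PAGE_TRACKING|CONFIG_LRU_GEN' .config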

2023-12-17 18:32:31

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Thu, Dec 14, 2023 at 4:51 PM Yu Zhao <[email protected]> wrote:
>
> On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <[email protected]> wrote:
> >
> > Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> > > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > > > >
> > > > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > > > >
> > > > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > > > >
> > > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > > > >
> > > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > > on:
> > > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > > buffer pools (anon).
> > > > > > > > >
> > > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > > moving average smooths out the spike quickly.
> > > > > > > > >
> > > > > > > > > To fix the problem:
> > > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > > > through file descriptors are worth protecting.)
> > > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > > > similar to the above.
> > > > > > > > >
> > > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > > Before After Change
> > > > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > > > >
> > > > > > > > Hi Yu,
> > > > > > > >
> > > > > > > > Thanks you for your amazing works on MGLRU.
> > > > > > > >
> > > > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > > > https://lwn.net/Articles/945266/
> > > > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > > > multiple workloads.
> > > > > > > >
> > > > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > > > upstream.
> > > > > > > >
> > > > > > > > I think both this patch and my previous series are for solving the
> > > > > > > > file pages underpertected issue, and I did a quick test using this
> > > > > > > > series, for mongodb test, refault distance seems still a better
> > > > > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > > > > though, just they do have some conflicts in implementation and solving
> > > > > > > > similar problem):
> > > > > > > >
> > > > > > > > Previous result:
> > > > > > > > ==================================================================
> > > > > > > > Execution Results after 905 seconds
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > Executed Time (µs) Rate
> > > > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > > > >
> > > > > > > > This patch:
> > > > > > > > ==================================================================
> > > > > > > > Execution Results after 900 seconds
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > Executed Time (µs) Rate
> > > > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > > > >
> > > > > > > > Unpatched version is always around ~500.
> > > > > > >
> > > > > > > Thanks for the test results!
> > > > > > >
> > > > > > > > I think there are a few points here:
> > > > > > > > - Refault distance makes use of the page shadow entries, so it can better
> > > > > > > > distinguish evicted pages with different access patterns (re-access
> > > > > > > > distance).
> > > > > > > > - A throttled refault distance can help hold on to part of the workingset
> > > > > > > > when memory is too small to hold the whole workingset.
> > > > > > > >
> > > > > > > > So maybe parts of this patch and bits of the previous series can be
> > > > > > > > combined to work better on this issue; what do you think?
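
To make the refault-distance idea above concrete, here is a minimal, purely
illustrative sketch; it is not code from this patch or from the RFC, and the
helper name, its parameters and the bucketing are hypothetical. The gist: the
smaller the refault distance recovered from the shadow entry, relative to how
much the memcg can protect, the younger the generation the refaulting folio is
placed in.

/*
 * Illustrative only -- not code from either series. The caller is assumed
 * to have recovered refault_distance from the shadow entry (as
 * mm/workingset.c already does for the classic LRU) and to know the
 * lruvec's current oldest/youngest generation numbers.
 */
static int refault_target_gen(unsigned long refault_distance,
			      unsigned long nr_protectable,
			      int min_gen, int max_gen)
{
	int nr_gens = max_gen - min_gen + 1;
	int i;

	/*
	 * A distance within 1/nr_gens of the protectable memory maps to the
	 * youngest generation, within 2/nr_gens to the second youngest, etc.
	 */
	for (i = 1; i < nr_gens; i++) {
		if (refault_distance < nr_protectable * i / nr_gens)
			return max_gen - (i - 1);
	}

	/* Re-accessed long after memory could have held it: oldest gen. */
	return min_gen;
}
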
> > > > > > >
> > > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > > >
> > > > > Hi Yu,
> > > > >
> > > > > I'm working on V4 of the RFC now, which just updates some comments and
> > > > > skips anon page re-activation in the refault path for MGLRU, which was not
> > > > > very helpful; only some tiny adjustments.
> > > > > And I found it easier to test with fio, using the following test script:
> > > > >
> > > > > #!/bin/bash
> > > > > swapoff -a
> > > > >
> > > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > > mkfs.ext4 /dev/ram0
> > > > > mount /dev/ram0 /mnt
> > > > >
> > > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > > cd /sys/fs/cgroup/benchmark
> > > > >
> > > > > echo 4G > memory.max
> > > > > echo $$ > cgroup.procs
> > > > > echo 3 > /proc/sys/vm/drop_caches
> > > > >
> > > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > > >
> > > > > zipf:0.5 is used here to simulate a cached read with slight bias
> > > > > towards certain pages.
> > > > > Unpatched 6.7-rc4:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > > >
> > > > > Patched with RFC v4:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > > >
> > > > > Patched with this series:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > > >
> > > > > MGLRU off:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > > >
> > > > > - If I change zipf:0.5 to random:
> > > > > Unpatched 6.7-rc4:
> > > > > Patched with this series:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > > >
> > > > > Patched with RFC v4:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > > >
> > > > > Patched with this series:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > > >
> > > > > MGLRU off:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > > >
> > > > > fio uses a ramdisk, so LRU accuracy will have a smaller impact. The MongoDB
> > > > > test I provided before uses a SATA SSD, so it will have a much higher
> > > > > impact. I'll provide a script to set up the test case and run it; it's
> > > > > more complex to set up than fio since it involves setting up multiple
> > > > > replicas, auth, and hundreds of GB of test fixtures. I'm currently
> > > > > occupied by some other tasks but will try my best to send them out as
> > > > > soon as possible.
> > > >
> > > > Thanks! Apparently your RFC did show better IOPS with both access
> > > > patterns, which was a surprise to me because it had higher refaults
> > > > and usually higher refaults result in worse performance.
> > > >
> > > > So I'm still trying to figure out why it turned out the opposite. My
> > > > current guess is that:
> > > > 1. It had a very small but stable inactive LRU list, which was able to
> > > > fit into the L3 cache entirely.
> > > > 2. It counted few folios as workingset and therefore incurred less
> > > > overhead from CONFIG_PSI and/or CONFIG_TASK_DELAY_ACCT.
> > > >
> > > > Did you save workingset_refault_file when you ran the test? If so, can
> > > > you check the difference between this series and your RFC?
> > >
> > >
> > > It seems I was right about #1 above. After I scaled your test up by 20x,
> > > I saw my series performed ~5% faster with zipf and ~9% faster with random
> > > accesses.
> >
> > Hi Yu,
> >
> > Thank you so much for testing and sharing this result.
> >
> > I'm not sure about #1; the ramdisk size and accessed data are far larger
> > than the L3 (16M on my CPU) even in the down-scaled test, and both random/zipf
> > show similar results.
>
> It's the LRU list, not the pages. IOW, the kernel data structure, not the
> content of the LRU pages. Does that make sense?

FYI. Willy just reminded me that he explained it a lot better than I
did: https://lore.kernel.org/linux-mm/[email protected]/

> > > IOW, I made rd_size from 16GB to 320GB, memory.max from 4GB to 80GB,
> > > --numjobs from 12 to 60 and --size from 1GB to 4GB.
>
> Would you be able to try a larger configuration like above instead?

2023-12-18 18:15:20

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Yu Zhao <[email protected]> 于2023年12月15日周五 12:56写道:
>
> And thanks for providing the refaults I requested for -- your data
> below confirms what I mentioned above:
>
> For fio:
> Your RFC This series Change
> workingset_refault_file 628192729 596790506 -5%
> IOPS 1862k 1830k -2%
>
> For MongoDB:
> Your RFC This series Change
> workingset_refault_anon 10512 35277 +30%
> workingset_refault_file 22751782 20335355 -11%
> total 22762294 20370632 -11%
> TPS 0.09 0.06 -33%
>
> For MongoDB, this series should be a big win (but apparently it's not),
> especially when using zram, since an anon refault should be a lot
> cheaper than a file refault.
>
> So, I'm baffled...
>
> One important detail I forgot to mention: based on your data from
> lru_gen_full, I think there is another difference between our Kconfigs:
>
> Your Kconfig My Kconfig Max possible
> LRU_REFS_WIDTH 1 2 2

Hi Yu,

Thanks for the info; my fault, I forgot to update my config as I was
testing some other features.
But after I changed LRU_REFS_WIDTH to 2 by disabling IDLE_PAGE, things
got much worse for the MongoDB test:

With LRU_REFS_WIDTH == 2:

This patch:
==================================================================
Execution Results after 919 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 488 27598136201.9 0.02 txn/s
------------------------------------------------------------------
TOTAL 488 27598136201.9 0.02 txn/s

memcg 86 /system.slice/docker-1c3a90be9f0a072f5719332419550cd0e1455f2cd5863bc2780ca4d3f913ece5.scope
 node 0
    1     948187          0x          0x
        0          0          0          0          0          0          0
        1          0          0          0          0          0          0
        2          0          0          0          0          0          0
        3          0          0          0          0          0          0
                   0          0          0          0          0          0
    2     948187           0     6051788
        0          0r         0e         0p     11916r     66442e         0p
        1          0r         0e         0p       903r     16888e         0p
        2          0r         0e         0p       459r      9764e         0p
        3          0r         0e         0p         0r         0e      2874p
                   0          0          0          0          0          0
    3     948187     1353160        6351
        0          0          0          0          0          0          0
        1          0          0          0          0          0          0
        2          0          0          0          0          0          0
        3          0          0          0          0          0          0
                   0          0          0          0          0          0
    4      73045       23573          12
        0          0R         0T          0   3498607R   4868605T          0
        1          0R         0T          0   3012246R   3270261T          0
        2          0R         0T          0   2498608R   2839104T          0
        3          0R         0T          0         0R   1983947T          0
             1486579L         0O   1380614Y      2945N      2945F      2734A

workingset_refault_anon 0
workingset_refault_file 18130598

total used free shared buff/cache available
Mem: 31978 6705 312 20 24960 24786
Swap: 31977 4 31973

RFC:
==================================================================
Execution Results after 908 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 2252 27159962888.2 0.08 txn/s
------------------------------------------------------------------
TOTAL 2252 27159962888.2 0.08 txn/s

workingset_refault_anon 22585
workingset_refault_file 22715256

memcg 66 /system.slice/docker-0989446ff78106e32d3f400a0cf371c9a703281bded86d6d6bb1af706ebb25da.scope
 node 0
   22     563007        2274     1198225
        0          0r         1e         0p         0r    697076e         0p
        1          0r         0e         0p         0r         0e    325661p
        2          0r         0e         0p         0r         0e    888728p
        3          0r         0e         0p         0r         0e   3602238p
                   0          0          0          0          0          0
   23     532222        7525     4948747
        0          0          0          0          0          0          0
        1          0          0          0          0          0          0
        2          0          0          0          0          0          0
        3          0          0          0          0          0          0
                   0          0          0          0          0          0
   24     500367     1214667        3292
        0          0          0          0          0          0          0
        1          0          0          0          0          0          0
        2          0          0          0          0          0          0
        3          0          0          0          0          0          0
                   0          0          0          0          0          0
   25     469692       40797         466
        0          0R       271T          0         0R   1162165T          0
        1          0R         0T          0    774028R   1205332T          0
        2          0R         0T          0         0R    932484T          0
        3          0R         1T          0         0R   4252158T          0
            25178380L    156515O  23953602Y     59234N     49391F     48664A

total used free shared buff/cache available
Mem: 31978 6968 338 5 24671 24555
Swap: 31977 1533 30444

Using the same MongoDB config (a 3-replica cluster, with every replica using the same config):
{
    "net": {
        "bindIpAll": true,
        "ipv6": false,
        "maxIncomingConnections": 10000,
    },
    "setParameter": {
        "disabledSecureAllocatorDomains": "*"
    },
    "replication": {
        "oplogSizeMB": 10480,
        "replSetName": "issa-tpcc_0"
    },
    "security": {
        "keyFile": "/data/db/keyfile"
    },
    "storage": {
        "dbPath": "/data/db/",
        "syncPeriodSecs": 60,
        "directoryPerDB": true,
        "wiredTiger": {
            "engineConfig": {
                "cacheSizeGB": 5
            }
        }
    },
    "systemLog": {
        "destination": "file",
        "logAppend": true,
        "logRotate": "rename",
        "path": "/data/db/mongod.log",
        "verbosity": 0
    }
}

The test environment has 32G of memory and 16 cores.

Per my analysis, the access pattern of the MongoDB test is that pages
are re-accessed long after they are evicted, so the PID controller won't
protect the higher tiers. The RFC makes use of the long-lived shadow
entries to feed back into the PID controller/generations, so the result
is much better. It still needs more adjusting though; I will try to do a
rebase on top of mm-unstable, which includes your patch.
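
To illustrate why late re-accesses defeat tier protection, here is a toy
sketch; it is not kernel code, and NR_HIST, struct tier_hist and the two
helpers are made up for illustration. The point is that a refault can only
be credited to a tier while the eviction history it belongs to is still
being kept, so a refault that arrives after that history has been recycled
no longer improves the refault/eviction ratio the PID controller compares.

/* Toy illustration only -- names and constants are hypothetical. */
#define NR_HIST 2

struct tier_hist {
	unsigned long refaulted[NR_HIST];
	unsigned long evicted[NR_HIST];
};

/* Called when a folio belonging to this tier is evicted. */
static void record_eviction(struct tier_hist *th, unsigned long hist_seq)
{
	th->evicted[hist_seq % NR_HIST]++;
}

/* Called on refault; evict_seq is recovered from the shadow entry. */
static void record_refault(struct tier_hist *th, unsigned long cur_seq,
			   unsigned long evict_seq)
{
	/* Re-accessed too late: that history slot has been recycled. */
	if (cur_seq - evict_seq >= NR_HIST)
		return;

	th->refaulted[evict_seq % NR_HIST]++;
}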

I've no idea why workingset_refault_* is higher in the better case;
this is clearly an IO-bound workload, with memory and IO busy while the
CPU is not fully utilized...

I've uploaded my local reproducer here:
https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
https://github.com/ryncsn/py-tpcc

2023-12-19 03:21:52

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Mon, Dec 18, 2023 at 11:05 AM Kairui Song <[email protected]> wrote:
>
> I've uploaded my local reproducer here:
> https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> https://github.com/ryncsn/py-tpcc

Thanks for the repos -- I'm trying them right now. Which MongoDB
version did you use? setup.sh didn't seem to install it.

Also do you have a QEMU image? It'd be a lot easier for me to
duplicate the exact environment by looking into it.

2023-12-19 03:45:13

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Mon, Dec 18, 2023 at 8:21 PM Yu Zhao <[email protected]> wrote:
>
> Thanks for the repos -- I'm trying them right now. Which MongoDB
> version did you use? setup.sh didn't seem to install it.
>
> Also do you have a QEMU image? It'd be a lot easier for me to
> duplicate the exact environment by looking into it.

I ended up using docker.io/mongodb/mongodb-community-server:latest,
and it's not working:

# docker exec -it mongo-r1 mongosh --eval \
'"rs.initiate({
_id: "issa-tpcc_0",
members: [
{_id: 0, host: "mongo-r1"},
{_id: 1, host: "mongo-r2"},
{_id: 2, host: "mongo-r3"}
]
})"'
Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
Error: can only create exec sessions on running containers: container
state improper

2023-12-19 19:00:48

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Yu Zhao <[email protected]> 于2023年12月19日周二 11:45写道:
>
> On Mon, Dec 18, 2023 at 8:21 PM Yu Zhao <[email protected]> wrote:
> >
> > On Mon, Dec 18, 2023 at 11:05 AM Kairui Song <[email protected]> wrote:
> > >
> > > Yu Zhao <[email protected]> 于2023年12月15日周五 12:56写道:
> > > >
> > > > On Thu, Dec 14, 2023 at 04:51:00PM -0700, Yu Zhao wrote:
> > > > > On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <[email protected]> wrote:
> > > > > >
> > > > > > Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> > > > > > > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > > > > > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > > > > > > > >
> > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > > > > > > on:
> > > > > > > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > > > > > > buffer pools (anon).
> > > > > > > > > > > > >
> > > > > > > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > > > > > > moving average smooths out the spike quickly.
> > > > > > > > > > > > >
> > > > > > > > > > > > > To fix the problem:
> > > > > > > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > > > > > > > through file descriptors are worth protecting.)
> > > > > > > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > > > > > > > similar to the above.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > > > > > > Before After Change
> > > > > > > > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks you for your amazing works on MGLRU.
> > > > > > > > > > > >
> > > > > > > > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > > > > > > > https://lwn.net/Articles/945266/
> > > > > > > > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > > > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > > > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > > > > > > > multiple workloads.
> > > > > > > > > > > >
> > > > > > > > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > > > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > > > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > > > > > > > upstream.
> > > > > > > > > > > >
> > > > > > > > > > > > I think both this patch and my previous series are for solving the
> > > > > > > > > > > > file pages underpertected issue, and I did a quick test using this
> > > > > > > > > > > > series, for mongodb test, refault distance seems still a better
> > > > > > > > > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > > > > > > > > though, just they do have some conflicts in implementation and solving
> > > > > > > > > > > > similar problem):
> > > > > > > > > > > >
> > > > > > > > > > > > Previous result:
> > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > Execution Results after 905 seconds
> > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > >
> > > > > > > > > > > > This patch:
> > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > Execution Results after 900 seconds
> > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > >
> > > > > > > > > > > > Unpatched version is always around ~500.
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the test results!
> > > > > > > > > > >
> > > > > > > > > > > > I think there are a few points here:
> > > > > > > > > > > > - Refault distance make use of page shadow so it can better
> > > > > > > > > > > > distinguish evicted pages of different access pattern (re-access
> > > > > > > > > > > > distance).
> > > > > > > > > > > > - Throttled refault distance can help hold part of workingset when
> > > > > > > > > > > > memory is too small to hold the whole workingset.
> > > > > > > > > > > >
> > > > > > > > > > > > So maybe part of this patch and the bits of previous series can be
> > > > > > > > > > > > combined to work better on this issue, how do you think?
> > > > > > > > > > >
> > > > > > > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > > > > > > >
> > > > > > > > > Hi Yu,
> > > > > > > > >
> > > > > > > > > I'm working on V4 of the RFC now, which just update some comments, and
> > > > > > > > > skip anon page re-activation in refault path for mglru which was not
> > > > > > > > > very helpful, only some tiny adjustment.
> > > > > > > > > And I found it easier to test with fio, using following test script:
> > > > > > > > >
> > > > > > > > > #!/bin/bash
> > > > > > > > > swapoff -a
> > > > > > > > >
> > > > > > > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > > > > > > mkfs.ext4 /dev/ram0
> > > > > > > > > mount /dev/ram0 /mnt
> > > > > > > > >
> > > > > > > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > > > > > > cd /sys/fs/cgroup/benchmark
> > > > > > > > >
> > > > > > > > > echo 4G > memory.max
> > > > > > > > > echo $$ > cgroup.procs
> > > > > > > > > echo 3 > /proc/sys/vm/drop_caches
> > > > > > > > >
> > > > > > > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > > > > > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > > > > > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > > > > > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > > > > > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > > > > > > >
> > > > > > > > > zipf:0.5 is used here to simulate a cached read with slight bias
> > > > > > > > > towards certain pages.
> > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > > > > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > > > > > > >
> > > > > > > > > Patched with RFC v4:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > > > > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > > > > > > >
> > > > > > > > > Patched with this series:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > > > > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > > > > > > >
> > > > > > > > > MGLRU off:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > > > > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > > > > > > >
> > > > > > > > > - If I change zipf:0.5 to random:
> > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > Patched with this series:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > > > > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > > > > > > >
> > > > > > > > > Patched with RFC v4:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > > > > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > > > > > > >
> > > > > > > > > Patched with this series:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > > > > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > > > > > > >
> > > > > > > > > MGLRU off:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > > > > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > > > > > > >
> > > > > > > > > fio uses ramdisk so LRU accuracy will have smaller impact. The Mongodb
> > > > > > > > > test I provided before uses a SATA SSD so it will have a much higher
> > > > > > > > > impact. I'll provides a script to setup the test case and run it, it's
> > > > > > > > > more complex to setup than fio since involving setting up multiple
> > > > > > > > > replicas and auth and hundreds of GB of test fixtures, I'm currently
> > > > > > > > > occupied by some other tasks but will try best to send them out as
> > > > > > > > > soon as possible.
> > > > > > > >
> > > > > > > > Thanks! Apparently your RFC did show better IOPS with both access
> > > > > > > > patterns, which was a surprise to me because it had higher refaults
> > > > > > > > and usually higher refautls result in worse performance.
> > > >
> > > > And thanks for providing the refaults I requested for -- your data
> > > > below confirms what I mentioned above:
> > > >
> > > > For fio:
> > > > Your RFC This series Change
> > > > workingset_refault_file 628192729 596790506 -5%
> > > > IOPS 1862k 1830k -2%
> > > >
> > > > For MongoDB:
> > > > Your RFC This series Change
> > > > workingset_refault_anon 10512 35277 +30%
> > > > workingset_refault_file 22751782 20335355 -11%
> > > > total 22762294 20370632 -11%
> > > > TPS 0.09 0.06 -33%
> > > >
> > > > For MongoDB, this series should be a big win (but apparently it's not),
> > > > especially when using zram, since an anon refault should be a lot
> > > > cheaper than a file refault.
> > > >
> > > > So, I'm baffled...
> > > >
> > > > One important detail I forgot to mention: based on your data from
> > > > lru_gen_full, I think there is another difference between our Kconfigs:
> > > >
> > > > Your Kconfig My Kconfig Max possible
> > > > LRU_REFS_WIDTH 1 2 2
> > >
> > > Hi Yu,
> > >
> > > Thanks for the info, my fault, I forgot to update my config as I was
> > > testing some other features.
> > > Buf after I changed LRU_REFS_WIDTH to 2 by disabling IDLE_PAGE, thing
> > > got much worse for MongoDB test:
> > >
> > > With LRU_REFS_WIDTH == 2:
> > >
> > > This patch:
> > > ==================================================================
> > > Execution Results after 919 seconds
> > > ------------------------------------------------------------------
> > > Executed Time (µs) Rate
> > > STOCK_LEVEL 488 27598136201.9 0.02 txn/s
> > > ------------------------------------------------------------------
> > > TOTAL 488 27598136201.9 0.02 txn/s
> > >
> > > memcg 86 /system.slice/docker-1c3a90be9f0a072f5719332419550cd0e1455f2cd5863bc2780ca4d3f913ece5.scope
> > > node 0
> > > 1 948187 0x 0x
> > > 0 0 0 0 0
> > > 0 0·
> > > 1 0 0 0 0
> > > 0 0·
> > > 2 0 0 0 0
> > > 0 0·
> > > 3 0 0 0 0
> > > 0 0·
> > > 0 0 0 0
> > > 0 0·
> > > 2 948187 0 6051788·
> > > 0 0r 0e 0p 11916r
> > > 66442e 0p
> > > 1 0r 0e 0p 903r
> > > 16888e 0p
> > > 2 0r 0e 0p 459r
> > > 9764e 0p
> > > 3 0r 0e 0p 0r
> > > 0e 2874p
> > > 0 0 0 0
> > > 0 0·
> > > 3 948187 1353160 6351·
> > > 0 0 0 0 0
> > > 0 0·
> > > 1 0 0 0 0
> > > 0 0·
> > > 2 0 0 0 0
> > > 0 0·
> > > 3 0 0 0 0
> > > 0 0·
> > > 0 0 0 0
> > > 0 0·
> > > 4 73045 23573 12·
> > > 0 0R 0T 0 3498607R
> > > 4868605T 0·
> > > 1 0R 0T 0 3012246R
> > > 3270261T 0·
> > > 2 0R 0T 0 2498608R
> > > 2839104T 0·
> > > 3 0R 0T 0 0R
> > > 1983947T 0·
> > > 1486579L 0O 1380614Y 2945N
> > > 2945F 2734A
> > >
> > > workingset_refault_anon 0
> > > workingset_refault_file 18130598
> > >
> > > total used free shared buff/cache available
> > > Mem: 31978 6705 312 20 24960 24786
> > > Swap: 31977 4 31973
> > >
> > > RFC:
> > > ==================================================================
> > > Execution Results after 908 seconds
> > > ------------------------------------------------------------------
> > > Executed Time (µs) Rate
> > > STOCK_LEVEL 2252 27159962888.2 0.08 txn/s
> > > ------------------------------------------------------------------
> > > TOTAL 2252 27159962888.2 0.08 txn/s
> > >
> > > workingset_refault_anon 22585
> > > workingset_refault_file 22715256
> > >
> > > memcg 66 /system.slice/docker-0989446ff78106e32d3f400a0cf371c9a703281bded86d6d6bb1af706ebb25da.scope
> > > node 0
> > > 22 563007 2274 1198225·
> > > 0 0r 1e 0p 0r
> > > 697076e 0p
> > > 1 0r 0e 0p 0r
> > > 0e 325661p
> > > 2 0r 0e 0p 0r
> > > 0e 888728p
> > > 3 0r 0e 0p 0r
> > > 0e 3602238p
> > > 0 0 0 0
> > > 0 0·
> > > 23 532222 7525 4948747·
> > > 0 0 0 0 0
> > > 0 0·
> > > 1 0 0 0 0
> > > 0 0·
> > > 2 0 0 0 0
> > > 0 0·
> > > 3 0 0 0 0
> > > 0 0·
> > > 0 0 0 0
> > > 0 0·
> > > 24 500367 1214667 3292·
> > > 0 0 0 0 0
> > > 0 0·
> > > 1 0 0 0 0
> > > 0 0·
> > > 2 0 0 0 0
> > > 0 0·
> > > 3 0 0 0 0
> > > 0 0·
> > > 0 0 0 0
> > > 0 0·
> > > 25 469692 40797 466·
> > > 0 0R 271T 0 0R
> > > 1162165T 0·
> > > 1 0R 0T 0 774028R
> > > 1205332T 0·
> > > 2 0R 0T 0 0R
> > > 932484T 0·
> > > 3 0R 1T 0 0R
> > > 4252158T 0·
> > > 25178380L 156515O 23953602Y 59234N
> > > 49391F 48664A
> > >
> > > total used free shared buff/cache available
> > > Mem: 31978 6968 338 5 24671 24555
> > > Swap: 31977 1533 30444
> > >
> > > Using same mongodb config (a 3 replica cluster using the same config):
> > > {
> > > "net": {
> > > "bindIpAll": true,
> > > "ipv6": false,
> > > "maxIncomingConnections": 10000,
> > > },
> > > "setParameter": {
> > > "disabledSecureAllocatorDomains": "*"
> > > },
> > > "replication": {
> > > "oplogSizeMB": 10480,
> > > "replSetName": "issa-tpcc_0"
> > > },
> > > "security": {
> > > "keyFile": "/data/db/keyfile"
> > > },
> > > "storage": {
> > > "dbPath": "/data/db/",
> > > "syncPeriodSecs": 60,
> > > "directoryPerDB": true,
> > > "wiredTiger": {
> > > "engineConfig": {
> > > "cacheSizeGB": 5
> > > }
> > > }
> > > },
> > > "systemLog": {
> > > "destination": "file",
> > > "logAppend": true,
> > > "logRotate": "rename",
> > > "path": "/data/db/mongod.log",
> > > "verbosity": 0
> > > }
> > > }
> > >
> > > The test environment have 32g memory and 16 core.
> > >
> > > Per my analyze, the access pattern for the mongodb test is that page
> > > will be re-access long after it's evicted so PID controller won't
> > > protect higher tier. That RFC will make use of the long existing
> > > shadow to do feedback to PID/Gen so the result will be much better.
> > > Still need more adjusting though, will try to do a rebase on top of
> > > mm-unstable which includes your patch.
> > >
> > > I've no idea why the workingset_refault_* is higher in the better
> > > case, this a clearly an IO bound workload, Memory and IO is busy while
> > > CPU is not full...
> > >
> > > I've uploaded my local reproducer here:
> > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > https://github.com/ryncsn/py-tpcc
> >
> > Thanks for the repos -- I'm trying them right now. Which MongoDB
> > version did you use? setup.sh didn't seem to install it.
> >
> > Also do you have a QEMU image? It'd be a lot easier for me to
> > duplicate the exact environment by looking into it.
>
> I ended up using docker.io/mongodb/mongodb-community-server:latest,
> and it's not working:
>
> # docker exec -it mongo-r1 mongosh --eval \
> '"rs.initiate({
> _id: "issa-tpcc_0",
> members: [
> {_id: 0, host: "mongo-r1"},
> {_id: 1, host: "mongo-r2"},
> {_id: 2, host: "mongo-r3"}
> ]
> })"'
> Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
> Error: can only create exec sessions on running containers: container
> state improper

Hi Yu,

I've updated the test repo:
https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster

I've tested it on top of the latest Fedora Cloud Image 39 and it worked
well for me; the README now contains detailed, easy-to-follow steps to
reproduce this test.

Also, I've updated the patch series. I plan to send out RFC v4 later, but
I need another day or two to tidy it up and collect test results:
https://github.com/ryncsn/linux/commits/kasong/devel/refault-distance-v4/

You may want to test on top of it; I'd be very grateful for any feedback.

It's on top of the current mm-unstable so that it works well with your fix
too. I managed to tweak it to be compatible with this series, but it seems
it might cause over-protection of pages, so the performance is slightly
worse than with RFC v3.
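
For anyone who hasn't read the RFC, here is a minimal userspace model of
the refault-distance idea being discussed (an illustration only, not the
RFC's actual code; MAX_NR_GENS, the function name and the clamping policy
are assumptions made for this sketch): a refaulting page whose shadow
entry shows it was evicted only a generation or two ago is placed into a
correspondingly younger generation, while a page re-accessed long after
eviction goes to the oldest one.

/*
 * Illustration only -- not the RFC's actual code. MAX_NR_GENS and the
 * clamping policy are assumptions made for this sketch.
 */
#include <stdio.h>

#define MAX_NR_GENS 4UL		/* MGLRU keeps at most this many generations */

/*
 * Pick a target generation for a refaulting page: the smaller its
 * refault distance (in generations), the younger the generation it
 * lands in; anything beyond the generation window goes to the oldest.
 */
static unsigned long target_gen(unsigned long max_seq, unsigned long evict_seq)
{
	unsigned long distance = max_seq - evict_seq;

	if (distance >= MAX_NR_GENS - 1)
		return max_seq - (MAX_NR_GENS - 1);	/* oldest generation */
	return max_seq - distance;			/* proportionally younger */
}

int main(void)
{
	unsigned long max_seq = 122;	/* e.g. the max_seq in the lru_gen dump below */

	/* evicted one generation ago -> distance 1 -> second youngest gen */
	printf("evicted at 121 -> gen %lu\n", target_gen(max_seq, 121));
	/* evicted long before -> clamped to the oldest generation */
	printf("evicted at 110 -> gen %lu\n", target_gen(max_seq, 110));
	return 0;
}

The actual series derives the eviction point from the workingset shadow
entry and also feeds it back to the PID controller, as described earlier
in the thread; the sketch above only models the placement half.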

This commit message contains my latest test results for the MongoDB case:
https://github.com/ryncsn/linux/commit/cd84e5c8e2449d33d411bce1d863bc391f36d7c8
As you can see, it's an IO-bound task (100% IO utilization and low CPU
usage) and the anon pages are mostly idle; using ZRAM or the same disk as
swap results in similar performance on the patched kernel.

As for the aging overhead issue I suspected before (the FIO regression due
to more frequent aging), I think it's real, and I added two patches:
https://github.com/ryncsn/linux/commit/f80cc280752da59272870378947aad6c822be2b4
https://github.com/ryncsn/linux/commit/01d091c98077a74bc70153cc7a0179a17da4f26f

In the test cases we talked about above, where more than ~100 generations
are created during the FIO run, I suspected that the aging overhead was
large and draining performance.
After these two patches, for a similar test case, FIO improved from this:

Run status group 0 (all jobs):
READ: bw=7593MiB/s (7962MB/s), 7593MiB/s-7593MiB/s
(7962MB/s-7962MB/s), io=2225GiB (2389GB), run=300002-300002msec
workingset_refault_anon 0
workingset_refault_file 641594126

To this:
Run status group 0 (all jobs):
READ: bw=7747MiB/s (8124MB/s), 7747MiB/s-7747MiB/s
(8124MB/s-8124MB/s), io=2270GiB (2437GB), run=300001-300001msec
workingset_refault_anon 0
workingset_refault_file 641511205

The lru_gen stats are similar for both cases:
memcg 66 /benchmark
 node 0
  119     155874          0          0x
    0         0r         0e         0p          0          0          0
    1         0r         0e         0p          0          0          0
    2         0r         0e         0p          0          0          0
    3         0r         0e         0p          0          0          0
               0          0          0          0          0          0
  120     151024          0      71410
    0          0          0          0         0r    587382e         0p
    1          0          0          0         0r         0e    117796p
    2          0          0          0         0r         0e    193086p
    3          0          0          0         0r         0e    371926p
               0          0          0          0          0          0
  121     146375          0     682854
    0          0          0          0          0          0          0
    1          0          0          0          0          0          0
    2          0          0          0          0          0          0
    3          0          0          0          0          0          0
               0          0          0          0          0          0
  122     141469          0       1348
    0         0R         0T          0         0R   5132602T          0
    1         0R         0T          0     86010R    244504T          0
    2         0R         0T          0         0R    196061T          0
    3         0R         0T          0         0R    397253T          0
        367101L     15850O     15820Y     93396N      1275F       459A

The overhead of cmpxchg on page flag updates is unavoidable though. I
think I could send out the two bulk-update patches first for a proper
review?
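
To make the cmpxchg point concrete: the generation number lives in a few
bits of a shared per-page flags word, so every per-page generation update
is a compare-and-swap retry loop along the lines of the userspace sketch
below (the GEN_* bit layout here is an assumption for illustration, not
the kernel's actual page flags layout). Batching amortizes the
surrounding work across many pages, but each page still pays for at least
one such atomic update, hence "unavoidable".

/*
 * Illustration only -- a generic cmpxchg retry loop that updates a
 * bitfield inside a shared flags word without disturbing other bits.
 */
#include <stdatomic.h>
#include <stdio.h>

#define GEN_SHIFT 3UL
#define GEN_MASK  (0x7UL << GEN_SHIFT)	/* 3 bits hold the generation */

static void set_gen(atomic_ulong *flags, unsigned long gen)
{
	unsigned long old = atomic_load(flags);
	unsigned long new;

	do {
		/* keep every other flag bit, replace only the generation bits */
		new = (old & ~GEN_MASK) | ((gen << GEN_SHIFT) & GEN_MASK);
		/* on failure, 'old' is refreshed and the loop retries */
	} while (!atomic_compare_exchange_weak(flags, &old, new));
}

int main(void)
{
	atomic_ulong flags;

	atomic_init(&flags, 0x1UL);	/* pretend an unrelated flag bit is set */
	set_gen(&flags, 5);
	printf("flags = %#lx\n", (unsigned long)atomic_load(&flags));	/* prints 0x29 */
	return 0;
}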

2023-12-20 06:39:24

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Tue, Dec 19, 2023 at 11:58 AM Kairui Song <[email protected]> wrote:
>
> Yu Zhao <[email protected]> 于2023年12月19日周二 11:45写道:
> >
> > On Mon, Dec 18, 2023 at 8:21 PM Yu Zhao <[email protected]> wrote:
> > >
> > > On Mon, Dec 18, 2023 at 11:05 AM Kairui Song <[email protected]> wrote:
> > > >
> > > > Yu Zhao <[email protected]> 于2023年12月15日周五 12:56写道:
> > > > >
> > > > > On Thu, Dec 14, 2023 at 04:51:00PM -0700, Yu Zhao wrote:
> > > > > > On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <[email protected]> wrote:
> > > > > > >
> > > > > > > Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> > > > > > > > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > > > > > > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > > > > > > > > >
> > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > > > > > > > on:
> > > > > > > > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > > > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > > > > > > > buffer pools (anon).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > > > > > > > moving average smooths out the spike quickly.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > To fix the problem:
> > > > > > > > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > > > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > > > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > > > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > > > > > > > > through file descriptors are worth protecting.)
> > > > > > > > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > > > > > > > > similar to the above.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > > > > > > > Before After Change
> > > > > > > > > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > > > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks you for your amazing works on MGLRU.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > > > > > > > > https://lwn.net/Articles/945266/
> > > > > > > > > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > > > > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > > > > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > > > > > > > > multiple workloads.
> > > > > > > > > > > > >
> > > > > > > > > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > > > > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > > > > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > > > > > > > > upstream.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think both this patch and my previous series are for solving the
> > > > > > > > > > > > > file pages underpertected issue, and I did a quick test using this
> > > > > > > > > > > > > series, for mongodb test, refault distance seems still a better
> > > > > > > > > > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > > > > > > > > > though, just they do have some conflicts in implementation and solving
> > > > > > > > > > > > > similar problem):
> > > > > > > > > > > > >
> > > > > > > > > > > > > Previous result:
> > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > Execution Results after 905 seconds
> > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > >
> > > > > > > > > > > > > This patch:
> > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > Execution Results after 900 seconds
> > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > >
> > > > > > > > > > > > > Unpatched version is always around ~500.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the test results!
> > > > > > > > > > > >
> > > > > > > > > > > > > I think there are a few points here:
> > > > > > > > > > > > > - Refault distance make use of page shadow so it can better
> > > > > > > > > > > > > distinguish evicted pages of different access pattern (re-access
> > > > > > > > > > > > > distance).
> > > > > > > > > > > > > - Throttled refault distance can help hold part of workingset when
> > > > > > > > > > > > > memory is too small to hold the whole workingset.
> > > > > > > > > > > > >
> > > > > > > > > > > > > So maybe part of this patch and the bits of previous series can be
> > > > > > > > > > > > > combined to work better on this issue, how do you think?
> > > > > > > > > > > >
> > > > > > > > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > > > > > > > >
> > > > > > > > > > Hi Yu,
> > > > > > > > > >
> > > > > > > > > > I'm working on V4 of the RFC now, which just update some comments, and
> > > > > > > > > > skip anon page re-activation in refault path for mglru which was not
> > > > > > > > > > very helpful, only some tiny adjustment.
> > > > > > > > > > And I found it easier to test with fio, using following test script:
> > > > > > > > > >
> > > > > > > > > > #!/bin/bash
> > > > > > > > > > swapoff -a
> > > > > > > > > >
> > > > > > > > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > > > > > > > mkfs.ext4 /dev/ram0
> > > > > > > > > > mount /dev/ram0 /mnt
> > > > > > > > > >
> > > > > > > > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > > > > > > > cd /sys/fs/cgroup/benchmark
> > > > > > > > > >
> > > > > > > > > > echo 4G > memory.max
> > > > > > > > > > echo $$ > cgroup.procs
> > > > > > > > > > echo 3 > /proc/sys/vm/drop_caches
> > > > > > > > > >
> > > > > > > > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > > > > > > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > > > > > > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > > > > > > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > > > > > > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > > > > > > > >
> > > > > > > > > > zipf:0.5 is used here to simulate a cached read with slight bias
> > > > > > > > > > towards certain pages.
> > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > > > > > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > > > > > > > >
> > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > > > > > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > > > > > > > >
> > > > > > > > > > Patched with this series:
> > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > > > > > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > > > > > > > >
> > > > > > > > > > MGLRU off:
> > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > > > > > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > > > > > > > >
> > > > > > > > > > - If I change zipf:0.5 to random:
> > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > Patched with this series:
> > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > > > > > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > > > > > > > >
> > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > > > > > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > > > > > > > >
> > > > > > > > > > Patched with this series:
> > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > > > > > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > > > > > > > >
> > > > > > > > > > MGLRU off:
> > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > > > > > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > > > > > > > >
> > > > > > > > > > fio uses ramdisk so LRU accuracy will have smaller impact. The Mongodb
> > > > > > > > > > test I provided before uses a SATA SSD so it will have a much higher
> > > > > > > > > > impact. I'll provides a script to setup the test case and run it, it's
> > > > > > > > > > more complex to setup than fio since involving setting up multiple
> > > > > > > > > > replicas and auth and hundreds of GB of test fixtures, I'm currently
> > > > > > > > > > occupied by some other tasks but will try best to send them out as
> > > > > > > > > > soon as possible.
> > > > > > > > >
> > > > > > > > > Thanks! Apparently your RFC did show better IOPS with both access
> > > > > > > > > patterns, which was a surprise to me because it had higher refaults
> > > > > > > > > and usually higher refautls result in worse performance.
> > > > >
> > > > > And thanks for providing the refaults I requested for -- your data
> > > > > below confirms what I mentioned above:
> > > > >
> > > > > For fio:
> > > > > Your RFC This series Change
> > > > > workingset_refault_file 628192729 596790506 -5%
> > > > > IOPS 1862k 1830k -2%
> > > > >
> > > > > For MongoDB:
> > > > > Your RFC This series Change
> > > > > workingset_refault_anon 10512 35277 +30%
> > > > > workingset_refault_file 22751782 20335355 -11%
> > > > > total 22762294 20370632 -11%
> > > > > TPS 0.09 0.06 -33%
> > > > >
> > > > > For MongoDB, this series should be a big win (but apparently it's not),
> > > > > especially when using zram, since an anon refault should be a lot
> > > > > cheaper than a file refault.
> > > > >
> > > > > So, I'm baffled...
> > > > >
> > > > > One important detail I forgot to mention: based on your data from
> > > > > lru_gen_full, I think there is another difference between our Kconfigs:
> > > > >
> > > > > Your Kconfig My Kconfig Max possible
> > > > > LRU_REFS_WIDTH 1 2 2
> > > >
> > > > Hi Yu,
> > > >
> > > > Thanks for the info, my fault, I forgot to update my config as I was
> > > > testing some other features.
> > > > Buf after I changed LRU_REFS_WIDTH to 2 by disabling IDLE_PAGE, thing
> > > > got much worse for MongoDB test:
> > > >
> > > > With LRU_REFS_WIDTH == 2:
> > > >
> > > > This patch:
> > > > ==================================================================
> > > > Execution Results after 919 seconds
> > > > ------------------------------------------------------------------
> > > > Executed Time (µs) Rate
> > > > STOCK_LEVEL 488 27598136201.9 0.02 txn/s
> > > > ------------------------------------------------------------------
> > > > TOTAL 488 27598136201.9 0.02 txn/s
> > > >
> > > > memcg 86 /system.slice/docker-1c3a90be9f0a072f5719332419550cd0e1455f2cd5863bc2780ca4d3f913ece5.scope
> > > > node 0
> > > > 1 948187 0x 0x
> > > > 0 0 0 0 0
> > > > 0 0·
> > > > 1 0 0 0 0
> > > > 0 0·
> > > > 2 0 0 0 0
> > > > 0 0·
> > > > 3 0 0 0 0
> > > > 0 0·
> > > > 0 0 0 0
> > > > 0 0·
> > > > 2 948187 0 6051788·
> > > > 0 0r 0e 0p 11916r
> > > > 66442e 0p
> > > > 1 0r 0e 0p 903r
> > > > 16888e 0p
> > > > 2 0r 0e 0p 459r
> > > > 9764e 0p
> > > > 3 0r 0e 0p 0r
> > > > 0e 2874p
> > > > 0 0 0 0
> > > > 0 0·
> > > > 3 948187 1353160 6351·
> > > > 0 0 0 0 0
> > > > 0 0·
> > > > 1 0 0 0 0
> > > > 0 0·
> > > > 2 0 0 0 0
> > > > 0 0·
> > > > 3 0 0 0 0
> > > > 0 0·
> > > > 0 0 0 0
> > > > 0 0·
> > > > 4 73045 23573 12·
> > > > 0 0R 0T 0 3498607R
> > > > 4868605T 0·
> > > > 1 0R 0T 0 3012246R
> > > > 3270261T 0·
> > > > 2 0R 0T 0 2498608R
> > > > 2839104T 0·
> > > > 3 0R 0T 0 0R
> > > > 1983947T 0·
> > > > 1486579L 0O 1380614Y 2945N
> > > > 2945F 2734A
> > > >
> > > > workingset_refault_anon 0
> > > > workingset_refault_file 18130598
> > > >
> > > > total used free shared buff/cache available
> > > > Mem: 31978 6705 312 20 24960 24786
> > > > Swap: 31977 4 31973
> > > >
> > > > RFC:
> > > > ==================================================================
> > > > Execution Results after 908 seconds
> > > > ------------------------------------------------------------------
> > > > Executed Time (µs) Rate
> > > > STOCK_LEVEL 2252 27159962888.2 0.08 txn/s
> > > > ------------------------------------------------------------------
> > > > TOTAL 2252 27159962888.2 0.08 txn/s
> > > >
> > > > workingset_refault_anon 22585
> > > > workingset_refault_file 22715256
> > > >
> > > > memcg 66 /system.slice/docker-0989446ff78106e32d3f400a0cf371c9a703281bded86d6d6bb1af706ebb25da.scope
> > > > node 0
> > > > 22 563007 2274 1198225·
> > > > 0 0r 1e 0p 0r
> > > > 697076e 0p
> > > > 1 0r 0e 0p 0r
> > > > 0e 325661p
> > > > 2 0r 0e 0p 0r
> > > > 0e 888728p
> > > > 3 0r 0e 0p 0r
> > > > 0e 3602238p
> > > > 0 0 0 0
> > > > 0 0·
> > > > 23 532222 7525 4948747·
> > > > 0 0 0 0 0
> > > > 0 0·
> > > > 1 0 0 0 0
> > > > 0 0·
> > > > 2 0 0 0 0
> > > > 0 0·
> > > > 3 0 0 0 0
> > > > 0 0·
> > > > 0 0 0 0
> > > > 0 0·
> > > > 24 500367 1214667 3292·
> > > > 0 0 0 0 0
> > > > 0 0·
> > > > 1 0 0 0 0
> > > > 0 0·
> > > > 2 0 0 0 0
> > > > 0 0·
> > > > 3 0 0 0 0
> > > > 0 0·
> > > > 0 0 0 0
> > > > 0 0·
> > > > 25 469692 40797 466·
> > > > 0 0R 271T 0 0R
> > > > 1162165T 0·
> > > > 1 0R 0T 0 774028R
> > > > 1205332T 0·
> > > > 2 0R 0T 0 0R
> > > > 932484T 0·
> > > > 3 0R 1T 0 0R
> > > > 4252158T 0·
> > > > 25178380L 156515O 23953602Y 59234N
> > > > 49391F 48664A
> > > >
> > > > total used free shared buff/cache available
> > > > Mem: 31978 6968 338 5 24671 24555
> > > > Swap: 31977 1533 30444
> > > >
> > > > Using same mongodb config (a 3 replica cluster using the same config):
> > > > {
> > > > "net": {
> > > > "bindIpAll": true,
> > > > "ipv6": false,
> > > > "maxIncomingConnections": 10000,
> > > > },
> > > > "setParameter": {
> > > > "disabledSecureAllocatorDomains": "*"
> > > > },
> > > > "replication": {
> > > > "oplogSizeMB": 10480,
> > > > "replSetName": "issa-tpcc_0"
> > > > },
> > > > "security": {
> > > > "keyFile": "/data/db/keyfile"
> > > > },
> > > > "storage": {
> > > > "dbPath": "/data/db/",
> > > > "syncPeriodSecs": 60,
> > > > "directoryPerDB": true,
> > > > "wiredTiger": {
> > > > "engineConfig": {
> > > > "cacheSizeGB": 5
> > > > }
> > > > }
> > > > },
> > > > "systemLog": {
> > > > "destination": "file",
> > > > "logAppend": true,
> > > > "logRotate": "rename",
> > > > "path": "/data/db/mongod.log",
> > > > "verbosity": 0
> > > > }
> > > > }
> > > >
> > > > The test environment have 32g memory and 16 core.
> > > >
> > > > Per my analyze, the access pattern for the mongodb test is that page
> > > > will be re-access long after it's evicted so PID controller won't
> > > > protect higher tier. That RFC will make use of the long existing
> > > > shadow to do feedback to PID/Gen so the result will be much better.
> > > > Still need more adjusting though, will try to do a rebase on top of
> > > > mm-unstable which includes your patch.
> > > >
> > > > I've no idea why the workingset_refault_* is higher in the better
> > > > case, this a clearly an IO bound workload, Memory and IO is busy while
> > > > CPU is not full...
> > > >
> > > > I've uploaded my local reproducer here:
> > > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > > https://github.com/ryncsn/py-tpcc
> > >
> > > Thanks for the repos -- I'm trying them right now. Which MongoDB
> > > version did you use? setup.sh didn't seem to install it.
> > >
> > > Also do you have a QEMU image? It'd be a lot easier for me to
> > > duplicate the exact environment by looking into it.
> >
> > I ended up using docker.io/mongodb/mongodb-community-server:latest,
> > and it's not working:
> >
> > # docker exec -it mongo-r1 mongosh --eval \
> > '"rs.initiate({
> > _id: "issa-tpcc_0",
> > members: [
> > {_id: 0, host: "mongo-r1"},
> > {_id: 1, host: "mongo-r2"},
> > {_id: 2, host: "mongo-r3"}
> > ]
> > })"'
> > Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
> > Error: can only create exec sessions on running containers: container
> > state improper
>
> Hi Yu,
>
> I've updated the test repo:
> https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
>
> I've tested it on top of latest Fedora Cloud Image 39 and it worked
> well for me, the README now contains detailed and not hard to follow
> steps to reproduce this test.

Thanks. I followed the instructions to the letter, and it fell apart again
at line 46 (./tpcc.py).

Were you able to successfully run the benchmark on a fresh VM by
following the instructions? If not, I'd appreciate it if you could do
so and document all the missing steps.

2023-12-20 08:17:40

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Tue, Dec 19, 2023 at 11:38 PM Yu Zhao <[email protected]> wrote:
>
> On Tue, Dec 19, 2023 at 11:58 AM Kairui Song <[email protected]> wrote:
> >
> > Yu Zhao <[email protected]> 于2023年12月19日周二 11:45写道:
> > >
> > > On Mon, Dec 18, 2023 at 8:21 PM Yu Zhao <[email protected]> wrote:
> > > >
> > > > On Mon, Dec 18, 2023 at 11:05 AM Kairui Song <[email protected]> wrote:
> > > > >
> > > > > Yu Zhao <[email protected]> 于2023年12月15日周五 12:56写道:
> > > > > >
> > > > > > On Thu, Dec 14, 2023 at 04:51:00PM -0700, Yu Zhao wrote:
> > > > > > > On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> > > > > > > > > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > > > > > > > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > > > > > > > > > >
> > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > > > > > > > > on:
> > > > > > > > > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > > > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > > > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > > > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > > > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > > > > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > > > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > > > > > > > > buffer pools (anon).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > > > > > > > > moving average smooths out the spike quickly.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > To fix the problem:
> > > > > > > > > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > > > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > > > > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > > > > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > > > > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > > > > > > > > > through file descriptors are worth protecting.)
> > > > > > > > > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > > > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > > > > > > > > > similar to the above.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > > > > > > > > Before After Change
> > > > > > > > > > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > > > > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks you for your amazing works on MGLRU.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > > > > > > > > > https://lwn.net/Articles/945266/
> > > > > > > > > > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > > > > > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > > > > > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > > > > > > > > > multiple workloads.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > > > > > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > > > > > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > > > > > > > > > upstream.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think both this patch and my previous series are for solving the
> > > > > > > > > > > > > > file pages underpertected issue, and I did a quick test using this
> > > > > > > > > > > > > > series, for mongodb test, refault distance seems still a better
> > > > > > > > > > > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > > > > > > > > > > though, just they do have some conflicts in implementation and solving
> > > > > > > > > > > > > > similar problem):
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Previous result:
> > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > Execution Results after 905 seconds
> > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This patch:
> > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > Execution Results after 900 seconds
> > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Unpatched version is always around ~500.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the test results!
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I think there are a few points here:
> > > > > > > > > > > > > > - Refault distance make use of page shadow so it can better
> > > > > > > > > > > > > > distinguish evicted pages of different access pattern (re-access
> > > > > > > > > > > > > > distance).
> > > > > > > > > > > > > > - Throttled refault distance can help hold part of workingset when
> > > > > > > > > > > > > > memory is too small to hold the whole workingset.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So maybe part of this patch and the bits of previous series can be
> > > > > > > > > > > > > > combined to work better on this issue, how do you think?
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > > > > > > > > >
> > > > > > > > > > > Hi Yu,
> > > > > > > > > > >
> > > > > > > > > > > I'm working on V4 of the RFC now, which just update some comments, and
> > > > > > > > > > > skip anon page re-activation in refault path for mglru which was not
> > > > > > > > > > > very helpful, only some tiny adjustment.
> > > > > > > > > > > And I found it easier to test with fio, using following test script:
> > > > > > > > > > >
> > > > > > > > > > > #!/bin/bash
> > > > > > > > > > > swapoff -a
> > > > > > > > > > >
> > > > > > > > > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > > > > > > > > mkfs.ext4 /dev/ram0
> > > > > > > > > > > mount /dev/ram0 /mnt
> > > > > > > > > > >
> > > > > > > > > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > > > > > > > > cd /sys/fs/cgroup/benchmark
> > > > > > > > > > >
> > > > > > > > > > > echo 4G > memory.max
> > > > > > > > > > > echo $$ > cgroup.procs
> > > > > > > > > > > echo 3 > /proc/sys/vm/drop_caches
> > > > > > > > > > >
> > > > > > > > > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > > > > > > > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > > > > > > > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > > > > > > > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > > > > > > > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > > > > > > > > >
> > > > > > > > > > > zipf:0.5 is used here to simulate a cached read with slight bias
> > > > > > > > > > > towards certain pages.
> > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > > > > > > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > > > > > > > > >
> > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > > > > > > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > > > > > > > > >
> > > > > > > > > > > Patched with this series:
> > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > > > > > > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > > > > > > > > >
> > > > > > > > > > > MGLRU off:
> > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > > > > > > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > > > > > > > > >
> > > > > > > > > > > - If I change zipf:0.5 to random:
> > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > Patched with this series:
> > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > > > > > > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > > > > > > > > >
> > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > > > > > > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > > > > > > > > >
> > > > > > > > > > > Patched with this series:
> > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > > > > > > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > > > > > > > > >
> > > > > > > > > > > MGLRU off:
> > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > > > > > > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > > > > > > > > >
> > > > > > > > > > > fio uses ramdisk so LRU accuracy will have smaller impact. The Mongodb
> > > > > > > > > > > test I provided before uses a SATA SSD so it will have a much higher
> > > > > > > > > > > impact. I'll provides a script to setup the test case and run it, it's
> > > > > > > > > > > more complex to setup than fio since involving setting up multiple
> > > > > > > > > > > replicas and auth and hundreds of GB of test fixtures, I'm currently
> > > > > > > > > > > occupied by some other tasks but will try best to send them out as
> > > > > > > > > > > soon as possible.
> > > > > > > > > >
> > > > > > > > > > Thanks! Apparently your RFC did show better IOPS with both access
> > > > > > > > > > patterns, which was a surprise to me because it had higher refaults
> > > > > > > > > > and usually higher refautls result in worse performance.
> > > > > >
> > > > > > And thanks for providing the refaults I requested for -- your data
> > > > > > below confirms what I mentioned above:
> > > > > >
> > > > > > For fio:
> > > > > > Your RFC This series Change
> > > > > > workingset_refault_file 628192729 596790506 -5%
> > > > > > IOPS 1862k 1830k -2%
> > > > > >
> > > > > > For MongoDB:
> > > > > > Your RFC This series Change
> > > > > > workingset_refault_anon 10512 35277 +30%
> > > > > > workingset_refault_file 22751782 20335355 -11%
> > > > > > total 22762294 20370632 -11%
> > > > > > TPS 0.09 0.06 -33%
> > > > > >
> > > > > > For MongoDB, this series should be a big win (but apparently it's not),
> > > > > > especially when using zram, since an anon refault should be a lot
> > > > > > cheaper than a file refault.
> > > > > >
> > > > > > So, I'm baffled...
> > > > > >
> > > > > > One important detail I forgot to mention: based on your data from
> > > > > > lru_gen_full, I think there is another difference between our Kconfigs:
> > > > > >
> > > > > > Your Kconfig My Kconfig Max possible
> > > > > > LRU_REFS_WIDTH 1 2 2
> > > > >
> > > > > Hi Yu,
> > > > >
> > > > > Thanks for the info, my fault, I forgot to update my config as I was
> > > > > testing some other features.
> > > > > Buf after I changed LRU_REFS_WIDTH to 2 by disabling IDLE_PAGE, thing
> > > > > got much worse for MongoDB test:
> > > > >
> > > > > With LRU_REFS_WIDTH == 2:
> > > > >
> > > > > This patch:
> > > > > ==================================================================
> > > > > Execution Results after 919 seconds
> > > > > ------------------------------------------------------------------
> > > > > Executed Time (µs) Rate
> > > > > STOCK_LEVEL 488 27598136201.9 0.02 txn/s
> > > > > ------------------------------------------------------------------
> > > > > TOTAL 488 27598136201.9 0.02 txn/s
> > > > >
> > > > > memcg 86 /system.slice/docker-1c3a90be9f0a072f5719332419550cd0e1455f2cd5863bc2780ca4d3f913ece5.scope
> > > > > node 0
> > > > > 1 948187 0x 0x
> > > > > 0 0 0 0 0
> > > > > 0 0·
> > > > > 1 0 0 0 0
> > > > > 0 0·
> > > > > 2 0 0 0 0
> > > > > 0 0·
> > > > > 3 0 0 0 0
> > > > > 0 0·
> > > > > 0 0 0 0
> > > > > 0 0·
> > > > > 2 948187 0 6051788·
> > > > > 0 0r 0e 0p 11916r
> > > > > 66442e 0p
> > > > > 1 0r 0e 0p 903r
> > > > > 16888e 0p
> > > > > 2 0r 0e 0p 459r
> > > > > 9764e 0p
> > > > > 3 0r 0e 0p 0r
> > > > > 0e 2874p
> > > > > 0 0 0 0
> > > > > 0 0·
> > > > > 3 948187 1353160 6351·
> > > > > 0 0 0 0 0
> > > > > 0 0·
> > > > > 1 0 0 0 0
> > > > > 0 0·
> > > > > 2 0 0 0 0
> > > > > 0 0·
> > > > > 3 0 0 0 0
> > > > > 0 0·
> > > > > 0 0 0 0
> > > > > 0 0·
> > > > > 4 73045 23573 12·
> > > > > 0 0R 0T 0 3498607R
> > > > > 4868605T 0·
> > > > > 1 0R 0T 0 3012246R
> > > > > 3270261T 0·
> > > > > 2 0R 0T 0 2498608R
> > > > > 2839104T 0·
> > > > > 3 0R 0T 0 0R
> > > > > 1983947T 0·
> > > > > 1486579L 0O 1380614Y 2945N
> > > > > 2945F 2734A
> > > > >
> > > > > workingset_refault_anon 0
> > > > > workingset_refault_file 18130598
> > > > >
> > > > > total used free shared buff/cache available
> > > > > Mem: 31978 6705 312 20 24960 24786
> > > > > Swap: 31977 4 31973
> > > > >
> > > > > RFC:
> > > > > ==================================================================
> > > > > Execution Results after 908 seconds
> > > > > ------------------------------------------------------------------
> > > > > Executed Time (µs) Rate
> > > > > STOCK_LEVEL 2252 27159962888.2 0.08 txn/s
> > > > > ------------------------------------------------------------------
> > > > > TOTAL 2252 27159962888.2 0.08 txn/s
> > > > >
> > > > > workingset_refault_anon 22585
> > > > > workingset_refault_file 22715256
> > > > >
> > > > > memcg 66 /system.slice/docker-0989446ff78106e32d3f400a0cf371c9a703281bded86d6d6bb1af706ebb25da.scope
> > > > > node 0
> > > > > 22 563007 2274 1198225·
> > > > > 0 0r 1e 0p 0r
> > > > > 697076e 0p
> > > > > 1 0r 0e 0p 0r
> > > > > 0e 325661p
> > > > > 2 0r 0e 0p 0r
> > > > > 0e 888728p
> > > > > 3 0r 0e 0p 0r
> > > > > 0e 3602238p
> > > > > 0 0 0 0
> > > > > 0 0·
> > > > > 23 532222 7525 4948747·
> > > > > 0 0 0 0 0
> > > > > 0 0·
> > > > > 1 0 0 0 0
> > > > > 0 0·
> > > > > 2 0 0 0 0
> > > > > 0 0·
> > > > > 3 0 0 0 0
> > > > > 0 0·
> > > > > 0 0 0 0
> > > > > 0 0·
> > > > > 24 500367 1214667 3292·
> > > > > 0 0 0 0 0
> > > > > 0 0·
> > > > > 1 0 0 0 0
> > > > > 0 0·
> > > > > 2 0 0 0 0
> > > > > 0 0·
> > > > > 3 0 0 0 0
> > > > > 0 0·
> > > > > 0 0 0 0
> > > > > 0 0·
> > > > > 25 469692 40797 466·
> > > > > 0 0R 271T 0 0R
> > > > > 1162165T 0·
> > > > > 1 0R 0T 0 774028R
> > > > > 1205332T 0·
> > > > > 2 0R 0T 0 0R
> > > > > 932484T 0·
> > > > > 3 0R 1T 0 0R
> > > > > 4252158T 0·
> > > > > 25178380L 156515O 23953602Y 59234N
> > > > > 49391F 48664A
> > > > >
> > > > > total used free shared buff/cache available
> > > > > Mem: 31978 6968 338 5 24671 24555
> > > > > Swap: 31977 1533 30444
> > > > >
> > > > > Using same mongodb config (a 3 replica cluster using the same config):
> > > > > {
> > > > > "net": {
> > > > > "bindIpAll": true,
> > > > > "ipv6": false,
> > > > > "maxIncomingConnections": 10000,
> > > > > },
> > > > > "setParameter": {
> > > > > "disabledSecureAllocatorDomains": "*"
> > > > > },
> > > > > "replication": {
> > > > > "oplogSizeMB": 10480,
> > > > > "replSetName": "issa-tpcc_0"
> > > > > },
> > > > > "security": {
> > > > > "keyFile": "/data/db/keyfile"
> > > > > },
> > > > > "storage": {
> > > > > "dbPath": "/data/db/",
> > > > > "syncPeriodSecs": 60,
> > > > > "directoryPerDB": true,
> > > > > "wiredTiger": {
> > > > > "engineConfig": {
> > > > > "cacheSizeGB": 5
> > > > > }
> > > > > }
> > > > > },
> > > > > "systemLog": {
> > > > > "destination": "file",
> > > > > "logAppend": true,
> > > > > "logRotate": "rename",
> > > > > "path": "/data/db/mongod.log",
> > > > > "verbosity": 0
> > > > > }
> > > > > }
> > > > >
> > > > > The test environment has 32 GB of memory and 16 cores.
> > > > >
> > > > > Per my analysis, the access pattern of the MongoDB test is that a page
> > > > > is re-accessed long after it's evicted, so the PID controller won't
> > > > > protect the higher tiers. The RFC makes use of the long-lived shadow
> > > > > entries to feed back into the PID controller and the generations, so
> > > > > the result is much better. It still needs more tuning though; I will
> > > > > try to rebase on top of mm-unstable, which includes your patch.
> > > > >
> > > > > I have no idea why workingset_refault_* is higher in the better case;
> > > > > this is clearly an IO-bound workload, with memory and IO busy while
> > > > > the CPU is not fully utilized...
> > > > >
> > > > > I've uploaded my local reproducer here:
> > > > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > > > https://github.com/ryncsn/py-tpcc
> > > >
> > > > Thanks for the repos -- I'm trying them right now. Which MongoDB
> > > > version did you use? setup.sh didn't seem to install it.
> > > >
> > > > Also do you have a QEMU image? It'd be a lot easier for me to
> > > > duplicate the exact environment by looking into it.
> > >
> > > I ended up using docker.io/mongodb/mongodb-community-server:latest,
> > > and it's not working:
> > >
> > > # docker exec -it mongo-r1 mongosh --eval \
> > > '"rs.initiate({
> > > _id: "issa-tpcc_0",
> > > members: [
> > > {_id: 0, host: "mongo-r1"},
> > > {_id: 1, host: "mongo-r2"},
> > > {_id: 2, host: "mongo-r3"}
> > > ]
> > > })"'
> > > Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
> > > Error: can only create exec sessions on running containers: container
> > > state improper
> >
> > Hi Yu,
> >
> > I've updated the test repo:
> > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> >
> > I've tested it on top of the latest Fedora Cloud Image 39 and it worked
> > well for me; the README now contains detailed, easy-to-follow steps to
> > reproduce this test.
>
> Thanks. I was following the instructions down to the letter and it
> fell apart again at line 46 (./tpcc.py).

I think you just broke it by
https://github.com/ryncsn/py-tpcc/commit/7b9b380d636cb84faa5b11b5562e531f924eeb7e

(But it's also possible you actually wanted me to use this latest
commit but forgot to account for it in your instructions.)

> Were you able to successfully run the benchmark on a fresh VM by
> following the instructions? If not, I'd appreciate it if you could do
> so and document all the missing steps.

2023-12-20 08:24:58

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Yu Zhao <[email protected]> 于2023年12月20日周三 16:17写道:
>
> On Tue, Dec 19, 2023 at 11:38 PM Yu Zhao <[email protected]> wrote:
> >
> > On Tue, Dec 19, 2023 at 11:58 AM Kairui Song <[email protected]> wrote:
> > >
> > > Yu Zhao <[email protected]> 于2023年12月19日周二 11:45写道:
> > > >
> > > > On Mon, Dec 18, 2023 at 8:21 PM Yu Zhao <[email protected]> wrote:
> > > > >
> > > > > On Mon, Dec 18, 2023 at 11:05 AM Kairui Song <[email protected]> wrote:
> > > > > >
> > > > > > Yu Zhao <[email protected]> 于2023年12月15日周五 12:56写道:
> > > > > > >
> > > > > > > On Thu, Dec 14, 2023 at 04:51:00PM -0700, Yu Zhao wrote:
> > > > > > > > On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> > > > > > > > > > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > > > > > > > > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > > > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > > > > > > > > > on:
> > > > > > > > > > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > > > > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > > > > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > > > > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > > > > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > > > > > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > > > > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > > > > > > > > > buffer pools (anon).
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > > > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > > > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > > > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > > > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > > > > > > > > > moving average smooths out the spike quickly.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > To fix the problem:
> > > > > > > > > > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > > > > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > > > > > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > > > > > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > > > > > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > > > > > > > > > > through file descriptors are worth protecting.)
> > > > > > > > > > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > > > > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > > > > > > > > > > similar to the above.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > > > > > > > > > Before After Change
> > > > > > > > > > > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > > > > > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks you for your amazing works on MGLRU.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > > > > > > > > > > https://lwn.net/Articles/945266/
> > > > > > > > > > > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > > > > > > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > > > > > > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > > > > > > > > > > multiple workloads.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > > > > > > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > > > > > > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > > > > > > > > > > upstream.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I think both this patch and my previous series are for solving the
> > > > > > > > > > > > > > > file pages underpertected issue, and I did a quick test using this
> > > > > > > > > > > > > > > series, for mongodb test, refault distance seems still a better
> > > > > > > > > > > > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > > > > > > > > > > > though, just they do have some conflicts in implementation and solving
> > > > > > > > > > > > > > > similar problem):
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Previous result:
> > > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > > Execution Results after 905 seconds
> > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This patch:
> > > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > > Execution Results after 900 seconds
> > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Unpatched version is always around ~500.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the test results!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I think there are a few points here:
> > > > > > > > > > > > > > > - Refault distance make use of page shadow so it can better
> > > > > > > > > > > > > > > distinguish evicted pages of different access pattern (re-access
> > > > > > > > > > > > > > > distance).
> > > > > > > > > > > > > > > - Throttled refault distance can help hold part of workingset when
> > > > > > > > > > > > > > > memory is too small to hold the whole workingset.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > So maybe part of this patch and the bits of previous series can be
> > > > > > > > > > > > > > > combined to work better on this issue, how do you think?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > >
> > > > > > > > > > > > I'm working on V4 of the RFC now, which just update some comments, and
> > > > > > > > > > > > skip anon page re-activation in refault path for mglru which was not
> > > > > > > > > > > > very helpful, only some tiny adjustment.
> > > > > > > > > > > > And I found it easier to test with fio, using following test script:
> > > > > > > > > > > >
> > > > > > > > > > > > #!/bin/bash
> > > > > > > > > > > > swapoff -a
> > > > > > > > > > > >
> > > > > > > > > > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > > > > > > > > > mkfs.ext4 /dev/ram0
> > > > > > > > > > > > mount /dev/ram0 /mnt
> > > > > > > > > > > >
> > > > > > > > > > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > > > > > > > > > cd /sys/fs/cgroup/benchmark
> > > > > > > > > > > >
> > > > > > > > > > > > echo 4G > memory.max
> > > > > > > > > > > > echo $$ > cgroup.procs
> > > > > > > > > > > > echo 3 > /proc/sys/vm/drop_caches
> > > > > > > > > > > >
> > > > > > > > > > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > > > > > > > > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > > > > > > > > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > > > > > > > > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > > > > > > > > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > > > > > > > > > >
> > > > > > > > > > > > zipf:0.5 is used here to simulate a cached read with slight bias
> > > > > > > > > > > > towards certain pages.
> > > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > > > > > > > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > > > > > > > > > >
> > > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > > > > > > > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > > > > > > > > > >
> > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > > > > > > > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > > > > > > > > > >
> > > > > > > > > > > > MGLRU off:
> > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > > > > > > > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > > > > > > > > > >
> > > > > > > > > > > > - If I change zipf:0.5 to random:
> > > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > > > > > > > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > > > > > > > > > >
> > > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > > > > > > > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > > > > > > > > > >
> > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > > > > > > > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > > > > > > > > > >
> > > > > > > > > > > > MGLRU off:
> > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > > > > > > > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > > > > > > > > > >
> > > > > > > > > > > > fio uses ramdisk so LRU accuracy will have smaller impact. The Mongodb
> > > > > > > > > > > > test I provided before uses a SATA SSD so it will have a much higher
> > > > > > > > > > > > impact. I'll provides a script to setup the test case and run it, it's
> > > > > > > > > > > > more complex to setup than fio since involving setting up multiple
> > > > > > > > > > > > replicas and auth and hundreds of GB of test fixtures, I'm currently
> > > > > > > > > > > > occupied by some other tasks but will try best to send them out as
> > > > > > > > > > > > soon as possible.
> > > > > > > > > > >
> > > > > > > > > > > Thanks! Apparently your RFC did show better IOPS with both access
> > > > > > > > > > > patterns, which was a surprise to me because it had higher refaults
> > > > > > > > > > > and usually higher refautls result in worse performance.
> > > > > > >
> > > > > > > And thanks for providing the refaults I requested for -- your data
> > > > > > > below confirms what I mentioned above:
> > > > > > >
> > > > > > > For fio:
> > > > > > > Your RFC This series Change
> > > > > > > workingset_refault_file 628192729 596790506 -5%
> > > > > > > IOPS 1862k 1830k -2%
> > > > > > >
> > > > > > > For MongoDB:
> > > > > > > Your RFC This series Change
> > > > > > > workingset_refault_anon 10512 35277 +30%
> > > > > > > workingset_refault_file 22751782 20335355 -11%
> > > > > > > total 22762294 20370632 -11%
> > > > > > > TPS 0.09 0.06 -33%
> > > > > > >
> > > > > > > For MongoDB, this series should be a big win (but apparently it's not),
> > > > > > > especially when using zram, since an anon refault should be a lot
> > > > > > > cheaper than a file refault.
> > > > > > >
> > > > > > > So, I'm baffled...
> > > > > > >
> > > > > > > One important detail I forgot to mention: based on your data from
> > > > > > > lru_gen_full, I think there is another difference between our Kconfigs:
> > > > > > >
> > > > > > > Your Kconfig My Kconfig Max possible
> > > > > > > LRU_REFS_WIDTH 1 2 2
> > > > > >
> > > > > > Hi Yu,
> > > > > >
> > > > > > Thanks for the info, my fault, I forgot to update my config as I was
> > > > > > testing some other features.
> > > > > > Buf after I changed LRU_REFS_WIDTH to 2 by disabling IDLE_PAGE, thing
> > > > > > got much worse for MongoDB test:
> > > > > >
> > > > > > With LRU_REFS_WIDTH == 2:
> > > > > >
> > > > > > This patch:
> > > > > > ==================================================================
> > > > > > Execution Results after 919 seconds
> > > > > > ------------------------------------------------------------------
> > > > > > Executed Time (µs) Rate
> > > > > > STOCK_LEVEL 488 27598136201.9 0.02 txn/s
> > > > > > ------------------------------------------------------------------
> > > > > > TOTAL 488 27598136201.9 0.02 txn/s
> > > > > >
> > > > > > memcg 86 /system.slice/docker-1c3a90be9f0a072f5719332419550cd0e1455f2cd5863bc2780ca4d3f913ece5.scope
> > > > > > node 0
> > > > > >          1     948187          0x          0x
> > > > > >             0          0           0           0           0           0           0
> > > > > >             1          0           0           0           0           0           0
> > > > > >             2          0           0           0           0           0           0
> > > > > >             3          0           0           0           0           0           0
> > > > > >                        0           0           0           0           0           0
> > > > > >          2     948187           0     6051788
> > > > > >             0          0r          0e          0p      11916r      66442e          0p
> > > > > >             1          0r          0e          0p        903r      16888e          0p
> > > > > >             2          0r          0e          0p        459r       9764e          0p
> > > > > >             3          0r          0e          0p          0r          0e       2874p
> > > > > >                        0           0           0           0           0           0
> > > > > >          3     948187     1353160        6351
> > > > > >             0          0           0           0           0           0           0
> > > > > >             1          0           0           0           0           0           0
> > > > > >             2          0           0           0           0           0           0
> > > > > >             3          0           0           0           0           0           0
> > > > > >                        0           0           0           0           0           0
> > > > > >          4      73045       23573          12
> > > > > >             0          0R          0T          0     3498607R    4868605T          0
> > > > > >             1          0R          0T          0     3012246R    3270261T          0
> > > > > >             2          0R          0T          0     2498608R    2839104T          0
> > > > > >             3          0R          0T          0           0R    1983947T          0
> > > > > >                  1486579L          0O    1380614Y       2945N       2945F       2734A
> > > > > >
> > > > > > workingset_refault_anon 0
> > > > > > workingset_refault_file 18130598
> > > > > >
> > > > > > total used free shared buff/cache available
> > > > > > Mem: 31978 6705 312 20 24960 24786
> > > > > > Swap: 31977 4 31973
> > > > > >
> > > > > > RFC:
> > > > > > ==================================================================
> > > > > > Execution Results after 908 seconds
> > > > > > ------------------------------------------------------------------
> > > > > > Executed Time (µs) Rate
> > > > > > STOCK_LEVEL 2252 27159962888.2 0.08 txn/s
> > > > > > ------------------------------------------------------------------
> > > > > > TOTAL 2252 27159962888.2 0.08 txn/s
> > > > > >
> > > > > > workingset_refault_anon 22585
> > > > > > workingset_refault_file 22715256
> > > > > >
> > > > > > memcg 66 /system.slice/docker-0989446ff78106e32d3f400a0cf371c9a703281bded86d6d6bb1af706ebb25da.scope
> > > > > > node 0
> > > > > >         22     563007        2274     1198225
> > > > > >             0          0r          1e          0p          0r     697076e          0p
> > > > > >             1          0r          0e          0p          0r          0e     325661p
> > > > > >             2          0r          0e          0p          0r          0e     888728p
> > > > > >             3          0r          0e          0p          0r          0e    3602238p
> > > > > >                        0           0           0           0           0           0
> > > > > >         23     532222        7525     4948747
> > > > > >             0          0           0           0           0           0           0
> > > > > >             1          0           0           0           0           0           0
> > > > > >             2          0           0           0           0           0           0
> > > > > >             3          0           0           0           0           0           0
> > > > > >                        0           0           0           0           0           0
> > > > > >         24     500367     1214667        3292
> > > > > >             0          0           0           0           0           0           0
> > > > > >             1          0           0           0           0           0           0
> > > > > >             2          0           0           0           0           0           0
> > > > > >             3          0           0           0           0           0           0
> > > > > >                        0           0           0           0           0           0
> > > > > >         25     469692       40797         466
> > > > > >             0          0R        271T          0          0R    1162165T          0
> > > > > >             1          0R          0T          0     774028R    1205332T          0
> > > > > >             2          0R          0T          0          0R     932484T          0
> > > > > >             3          0R          1T          0          0R    4252158T          0
> > > > > >                 25178380L     156515O   23953602Y      59234N      49391F      48664A
> > > > > >
> > > > > > total used free shared buff/cache available
> > > > > > Mem: 31978 6968 338 5 24671 24555
> > > > > > Swap: 31977 1533 30444
> > > > > >
> > > > > > Using the same MongoDB config (a 3-replica cluster, every replica using the config below):
> > > > > > {
> > > > > > "net": {
> > > > > > "bindIpAll": true,
> > > > > > "ipv6": false,
> > > > > > "maxIncomingConnections": 10000,
> > > > > > },
> > > > > > "setParameter": {
> > > > > > "disabledSecureAllocatorDomains": "*"
> > > > > > },
> > > > > > "replication": {
> > > > > > "oplogSizeMB": 10480,
> > > > > > "replSetName": "issa-tpcc_0"
> > > > > > },
> > > > > > "security": {
> > > > > > "keyFile": "/data/db/keyfile"
> > > > > > },
> > > > > > "storage": {
> > > > > > "dbPath": "/data/db/",
> > > > > > "syncPeriodSecs": 60,
> > > > > > "directoryPerDB": true,
> > > > > > "wiredTiger": {
> > > > > > "engineConfig": {
> > > > > > "cacheSizeGB": 5
> > > > > > }
> > > > > > }
> > > > > > },
> > > > > > "systemLog": {
> > > > > > "destination": "file",
> > > > > > "logAppend": true,
> > > > > > "logRotate": "rename",
> > > > > > "path": "/data/db/mongod.log",
> > > > > > "verbosity": 0
> > > > > > }
> > > > > > }
> > > > > >
> > > > > > The test environment has 32 GB of memory and 16 cores.
> > > > > >
> > > > > > Per my analysis, the access pattern of the MongoDB test is that a page
> > > > > > is re-accessed long after it's evicted, so the PID controller won't
> > > > > > protect the higher tiers. The RFC makes use of the long-lived shadow
> > > > > > entries to feed back into the PID controller and the generations, so
> > > > > > the result is much better. It still needs more tuning though; I will
> > > > > > try to rebase on top of mm-unstable, which includes your patch.
> > > > > >
> > > > > > I have no idea why workingset_refault_* is higher in the better case;
> > > > > > this is clearly an IO-bound workload, with memory and IO busy while
> > > > > > the CPU is not fully utilized...
> > > > > >
> > > > > > I've uploaded my local reproducer here:
> > > > > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > > > > https://github.com/ryncsn/py-tpcc
> > > > >
> > > > > Thanks for the repos -- I'm trying them right now. Which MongoDB
> > > > > version did you use? setup.sh didn't seem to install it.
> > > > >
> > > > > Also do you have a QEMU image? It'd be a lot easier for me to
> > > > > duplicate the exact environment by looking into it.
> > > >
> > > > I ended up using docker.io/mongodb/mongodb-community-server:latest,
> > > > and it's not working:
> > > >
> > > > # docker exec -it mongo-r1 mongosh --eval \
> > > > '"rs.initiate({
> > > > _id: "issa-tpcc_0",
> > > > members: [
> > > > {_id: 0, host: "mongo-r1"},
> > > > {_id: 1, host: "mongo-r2"},
> > > > {_id: 2, host: "mongo-r3"}
> > > > ]
> > > > })"'
> > > > Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
> > > > Error: can only create exec sessions on running containers: container
> > > > state improper
> > >
> > > Hi Yu,
> > >
> > > I've updated the test repo:
> > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > >
> > > I've tested it on top of the latest Fedora Cloud Image 39 and it worked
> > > well for me; the README now contains detailed, easy-to-follow steps to
> > > reproduce this test.
> >
> > Thanks. I was following the instructions down to the letter and it
> > fell apart again at line 46 (./tpcc.py).
>
> I think you just broke it by
> https://github.com/ryncsn/py-tpcc/commit/7b9b380d636cb84faa5b11b5562e531f924eeb7e
>
> (But it's also possible you actually wanted me to use this latest
> commit but forgot to account for it in your instructions.)
>
> > Were you able to successfully run the benchmark on a fresh VM by
> > following the instructions? If not, I'd appreciate it if you could do
> > so and document all the missing steps.

Ah, you are right. I attempted to convert it to Python 3 but found it
only brought more trouble, so I gave up, and the instructions still
use Python 2. However, I accidentally pushed the WIP Python 3 conversion
commit... I've reset the repo to
https://github.com/ryncsn/py-tpcc/commit/86e862c5cf3b2d1f51e0297742fa837c7a99ebf8,
which is working well. Sorry for the inconvenience.

2023-12-25 06:31:10

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Wed, Dec 20, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
>
> Yu Zhao <[email protected]> 于2023年12月20日周三 16:17写道:
> >
> > On Tue, Dec 19, 2023 at 11:38 PM Yu Zhao <[email protected]> wrote:
> > >
> > > On Tue, Dec 19, 2023 at 11:58 AM Kairui Song <[email protected]> wrote:
> > > >
> > > > Yu Zhao <[email protected]> 于2023年12月19日周二 11:45写道:
> > > > >
> > > > > On Mon, Dec 18, 2023 at 8:21 PM Yu Zhao <[email protected]> wrote:
> > > > > >
> > > > > > On Mon, Dec 18, 2023 at 11:05 AM Kairui Song <[email protected]> wrote:
> > > > > > >
> > > > > > > Yu Zhao <[email protected]> 于2023年12月15日周五 12:56写道:
> > > > > > > >
> > > > > > > > On Thu, Dec 14, 2023 at 04:51:00PM -0700, Yu Zhao wrote:
> > > > > > > > > On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> > > > > > > > > > > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > > > > > > > > > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > > > > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > > > > > > > > > > on:
> > > > > > > > > > > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > > > > > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > > > > > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > > > > > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > > > > > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > > > > > > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > > > > > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > > > > > > > > > > buffer pools (anon).
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > > > > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > > > > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > > > > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > > > > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > > > > > > > > > > moving average smooths out the spike quickly.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > To fix the problem:
> > > > > > > > > > > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > > > > > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > > > > > > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > > > > > > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > > > > > > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > > > > > > > > > > > through file descriptors are worth protecting.)
> > > > > > > > > > > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > > > > > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > > > > > > > > > > > similar to the above.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > > > > > > > > > > Before After Change
> > > > > > > > > > > > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > > > > > > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks you for your amazing works on MGLRU.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > > > > > > > > > > > https://lwn.net/Articles/945266/
> > > > > > > > > > > > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > > > > > > > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > > > > > > > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > > > > > > > > > > > multiple workloads.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > > > > > > > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > > > > > > > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > > > > > > > > > > > upstream.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I think both this patch and my previous series are for solving the
> > > > > > > > > > > > > > > > file pages underpertected issue, and I did a quick test using this
> > > > > > > > > > > > > > > > series, for mongodb test, refault distance seems still a better
> > > > > > > > > > > > > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > > > > > > > > > > > > though, just they do have some conflicts in implementation and solving
> > > > > > > > > > > > > > > > similar problem):
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Previous result:
> > > > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > > > Execution Results after 905 seconds
> > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This patch:
> > > > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > > > Execution Results after 900 seconds
> > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Unpatched version is always around ~500.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the test results!
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I think there are a few points here:
> > > > > > > > > > > > > > > > - Refault distance make use of page shadow so it can better
> > > > > > > > > > > > > > > > distinguish evicted pages of different access pattern (re-access
> > > > > > > > > > > > > > > > distance).
> > > > > > > > > > > > > > > > - Throttled refault distance can help hold part of workingset when
> > > > > > > > > > > > > > > > memory is too small to hold the whole workingset.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > So maybe part of this patch and the bits of previous series can be
> > > > > > > > > > > > > > > > combined to work better on this issue, how do you think?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'm working on V4 of the RFC now, which just update some comments, and
> > > > > > > > > > > > > skip anon page re-activation in refault path for mglru which was not
> > > > > > > > > > > > > very helpful, only some tiny adjustment.
> > > > > > > > > > > > > And I found it easier to test with fio, using following test script:
> > > > > > > > > > > > >
> > > > > > > > > > > > > #!/bin/bash
> > > > > > > > > > > > > swapoff -a
> > > > > > > > > > > > >
> > > > > > > > > > > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > > > > > > > > > > mkfs.ext4 /dev/ram0
> > > > > > > > > > > > > mount /dev/ram0 /mnt
> > > > > > > > > > > > >
> > > > > > > > > > > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > > > > > > > > > > cd /sys/fs/cgroup/benchmark
> > > > > > > > > > > > >
> > > > > > > > > > > > > echo 4G > memory.max
> > > > > > > > > > > > > echo $$ > cgroup.procs
> > > > > > > > > > > > > echo 3 > /proc/sys/vm/drop_caches
> > > > > > > > > > > > >
> > > > > > > > > > > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > > > > > > > > > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > > > > > > > > > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > > > > > > > > > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > > > > > > > > > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > > > > > > > > > > >
> > > > > > > > > > > > > zipf:0.5 is used here to simulate a cached read with slight bias
> > > > > > > > > > > > > towards certain pages.
> > > > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > > > > > > > > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > > > > > > > > > > >
> > > > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > > > > > > > > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > > > > > > > > > > >
> > > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > > > > > > > > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > > > > > > > > > > >
> > > > > > > > > > > > > MGLRU off:
> > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > > > > > > > > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > > > > > > > > > > >
> > > > > > > > > > > > > - If I change zipf:0.5 to random:
> > > > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > > > > > > > > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > > > > > > > > > > >
> > > > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > > > > > > > > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > > > > > > > > > > >
> > > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > > > > > > > > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > > > > > > > > > > >
> > > > > > > > > > > > > MGLRU off:
> > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > > > > > > > > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > > > > > > > > > > >
> > > > > > > > > > > > > fio uses ramdisk so LRU accuracy will have smaller impact. The Mongodb
> > > > > > > > > > > > > test I provided before uses a SATA SSD so it will have a much higher
> > > > > > > > > > > > > impact. I'll provides a script to setup the test case and run it, it's
> > > > > > > > > > > > > more complex to setup than fio since involving setting up multiple
> > > > > > > > > > > > > replicas and auth and hundreds of GB of test fixtures, I'm currently
> > > > > > > > > > > > > occupied by some other tasks but will try best to send them out as
> > > > > > > > > > > > > soon as possible.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks! Apparently your RFC did show better IOPS with both access
> > > > > > > > > > > > patterns, which was a surprise to me because it had higher refaults
> > > > > > > > > > > > and usually higher refautls result in worse performance.
> > > > > > > >
> > > > > > > > And thanks for providing the refaults I requested for -- your data
> > > > > > > > below confirms what I mentioned above:
> > > > > > > >
> > > > > > > > For fio:
> > > > > > > > Your RFC This series Change
> > > > > > > > workingset_refault_file 628192729 596790506 -5%
> > > > > > > > IOPS 1862k 1830k -2%
> > > > > > > >
> > > > > > > > For MongoDB:
> > > > > > > > Your RFC This series Change
> > > > > > > > workingset_refault_anon 10512 35277 +30%
> > > > > > > > workingset_refault_file 22751782 20335355 -11%
> > > > > > > > total 22762294 20370632 -11%
> > > > > > > > TPS 0.09 0.06 -33%
> > > > > > > >
> > > > > > > > For MongoDB, this series should be a big win (but apparently it's not),
> > > > > > > > especially when using zram, since an anon refault should be a lot
> > > > > > > > cheaper than a file refault.
> > > > > > > >
> > > > > > > > So, I'm baffled...
> > > > > > > >
> > > > > > > > One important detail I forgot to mention: based on your data from
> > > > > > > > lru_gen_full, I think there is another difference between our Kconfigs:
> > > > > > > >
> > > > > > > > Your Kconfig My Kconfig Max possible
> > > > > > > > LRU_REFS_WIDTH 1 2 2
> > > > > > >
> > > > > > > Hi Yu,
> > > > > > >
> > > > > > > Thanks for the info, my fault, I forgot to update my config as I was
> > > > > > > testing some other features.
> > > > > > > Buf after I changed LRU_REFS_WIDTH to 2 by disabling IDLE_PAGE, thing
> > > > > > > got much worse for MongoDB test:
> > > > > > >
> > > > > > > With LRU_REFS_WIDTH == 2:
> > > > > > >
> > > > > > > This patch:
> > > > > > > ==================================================================
> > > > > > > Execution Results after 919 seconds
> > > > > > > ------------------------------------------------------------------
> > > > > > > Executed Time (µs) Rate
> > > > > > > STOCK_LEVEL 488 27598136201.9 0.02 txn/s
> > > > > > > ------------------------------------------------------------------
> > > > > > > TOTAL 488 27598136201.9 0.02 txn/s
> > > > > > >
> > > > > > > memcg 86 /system.slice/docker-1c3a90be9f0a072f5719332419550cd0e1455f2cd5863bc2780ca4d3f913ece5.scope
> > > > > > > node 0
> > > > > > >          1     948187          0x          0x
> > > > > > >             0          0           0           0           0           0           0
> > > > > > >             1          0           0           0           0           0           0
> > > > > > >             2          0           0           0           0           0           0
> > > > > > >             3          0           0           0           0           0           0
> > > > > > >                        0           0           0           0           0           0
> > > > > > >          2     948187           0     6051788
> > > > > > >             0          0r          0e          0p      11916r      66442e          0p
> > > > > > >             1          0r          0e          0p        903r      16888e          0p
> > > > > > >             2          0r          0e          0p        459r       9764e          0p
> > > > > > >             3          0r          0e          0p          0r          0e       2874p
> > > > > > >                        0           0           0           0           0           0
> > > > > > >          3     948187     1353160        6351
> > > > > > >             0          0           0           0           0           0           0
> > > > > > >             1          0           0           0           0           0           0
> > > > > > >             2          0           0           0           0           0           0
> > > > > > >             3          0           0           0           0           0           0
> > > > > > >                        0           0           0           0           0           0
> > > > > > >          4      73045       23573          12
> > > > > > >             0          0R          0T          0     3498607R    4868605T          0
> > > > > > >             1          0R          0T          0     3012246R    3270261T          0
> > > > > > >             2          0R          0T          0     2498608R    2839104T          0
> > > > > > >             3          0R          0T          0           0R    1983947T          0
> > > > > > >                  1486579L          0O    1380614Y       2945N       2945F       2734A
> > > > > > >
> > > > > > > workingset_refault_anon 0
> > > > > > > workingset_refault_file 18130598
> > > > > > >
> > > > > > > total used free shared buff/cache available
> > > > > > > Mem: 31978 6705 312 20 24960 24786
> > > > > > > Swap: 31977 4 31973
> > > > > > >
> > > > > > > RFC:
> > > > > > > ==================================================================
> > > > > > > Execution Results after 908 seconds
> > > > > > > ------------------------------------------------------------------
> > > > > > > Executed Time (µs) Rate
> > > > > > > STOCK_LEVEL 2252 27159962888.2 0.08 txn/s
> > > > > > > ------------------------------------------------------------------
> > > > > > > TOTAL 2252 27159962888.2 0.08 txn/s
> > > > > > >
> > > > > > > workingset_refault_anon 22585
> > > > > > > workingset_refault_file 22715256
> > > > > > >
> > > > > > > memcg 66 /system.slice/docker-0989446ff78106e32d3f400a0cf371c9a703281bded86d6d6bb1af706ebb25da.scope
> > > > > > > node 0
> > > > > > >         22     563007        2274     1198225
> > > > > > >             0          0r          1e          0p          0r     697076e          0p
> > > > > > >             1          0r          0e          0p          0r          0e     325661p
> > > > > > >             2          0r          0e          0p          0r          0e     888728p
> > > > > > >             3          0r          0e          0p          0r          0e    3602238p
> > > > > > >                        0           0           0           0           0           0
> > > > > > >         23     532222        7525     4948747
> > > > > > >             0          0           0           0           0           0           0
> > > > > > >             1          0           0           0           0           0           0
> > > > > > >             2          0           0           0           0           0           0
> > > > > > >             3          0           0           0           0           0           0
> > > > > > >                        0           0           0           0           0           0
> > > > > > >         24     500367     1214667        3292
> > > > > > >             0          0           0           0           0           0           0
> > > > > > >             1          0           0           0           0           0           0
> > > > > > >             2          0           0           0           0           0           0
> > > > > > >             3          0           0           0           0           0           0
> > > > > > >                        0           0           0           0           0           0
> > > > > > >         25     469692       40797         466
> > > > > > >             0          0R        271T          0          0R    1162165T          0
> > > > > > >             1          0R          0T          0     774028R    1205332T          0
> > > > > > >             2          0R          0T          0          0R     932484T          0
> > > > > > >             3          0R          1T          0          0R    4252158T          0
> > > > > > >                 25178380L     156515O   23953602Y      59234N      49391F      48664A
> > > > > > >
> > > > > > > total used free shared buff/cache available
> > > > > > > Mem: 31978 6968 338 5 24671 24555
> > > > > > > Swap: 31977 1533 30444
> > > > > > >
> > > > > > > Using the same MongoDB config (a 3-replica cluster, every replica using the config below):
> > > > > > > {
> > > > > > > "net": {
> > > > > > > "bindIpAll": true,
> > > > > > > "ipv6": false,
> > > > > > > "maxIncomingConnections": 10000,
> > > > > > > },
> > > > > > > "setParameter": {
> > > > > > > "disabledSecureAllocatorDomains": "*"
> > > > > > > },
> > > > > > > "replication": {
> > > > > > > "oplogSizeMB": 10480,
> > > > > > > "replSetName": "issa-tpcc_0"
> > > > > > > },
> > > > > > > "security": {
> > > > > > > "keyFile": "/data/db/keyfile"
> > > > > > > },
> > > > > > > "storage": {
> > > > > > > "dbPath": "/data/db/",
> > > > > > > "syncPeriodSecs": 60,
> > > > > > > "directoryPerDB": true,
> > > > > > > "wiredTiger": {
> > > > > > > "engineConfig": {
> > > > > > > "cacheSizeGB": 5
> > > > > > > }
> > > > > > > }
> > > > > > > },
> > > > > > > "systemLog": {
> > > > > > > "destination": "file",
> > > > > > > "logAppend": true,
> > > > > > > "logRotate": "rename",
> > > > > > > "path": "/data/db/mongod.log",
> > > > > > > "verbosity": 0
> > > > > > > }
> > > > > > > }
> > > > > > >
> > > > > > > The test environment has 32 GB of memory and 16 cores.
> > > > > > >
> > > > > > > Per my analysis, the access pattern of the MongoDB test is that a page
> > > > > > > is re-accessed long after it's evicted, so the PID controller won't
> > > > > > > protect the higher tiers. The RFC makes use of the long-lived shadow
> > > > > > > entries to feed back into the PID controller and the generations, so
> > > > > > > the result is much better. It still needs more tuning though; I will
> > > > > > > try to rebase on top of mm-unstable, which includes your patch.
> > > > > > >
> > > > > > > I have no idea why workingset_refault_* is higher in the better case;
> > > > > > > this is clearly an IO-bound workload, with memory and IO busy while
> > > > > > > the CPU is not fully utilized...
> > > > > > >
> > > > > > > I've uploaded my local reproducer here:
> > > > > > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > > > > > https://github.com/ryncsn/py-tpcc
> > > > > >
> > > > > > Thanks for the repos -- I'm trying them right now. Which MongoDB
> > > > > > version did you use? setup.sh didn't seem to install it.
> > > > > >
> > > > > > Also do you have a QEMU image? It'd be a lot easier for me to
> > > > > > duplicate the exact environment by looking into it.
> > > > >
> > > > > I ended up using docker.io/mongodb/mongodb-community-server:latest,
> > > > > and it's not working:
> > > > >
> > > > > # docker exec -it mongo-r1 mongosh --eval \
> > > > > '"rs.initiate({
> > > > > _id: "issa-tpcc_0",
> > > > > members: [
> > > > > {_id: 0, host: "mongo-r1"},
> > > > > {_id: 1, host: "mongo-r2"},
> > > > > {_id: 2, host: "mongo-r3"}
> > > > > ]
> > > > > })"'
> > > > > Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
> > > > > Error: can only create exec sessions on running containers: container
> > > > > state improper
> > > >
> > > > Hi Yu,
> > > >
> > > > I've updated the test repo:
> > > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > >
> > > > I've tested it on top of the latest Fedora Cloud Image 39 and it worked
> > > > well for me; the README now contains detailed, easy-to-follow steps to
> > > > reproduce this test.
> > >
> > > Thanks. I was following the instructions down to the letter and it
> > > fell apart again at line 46 (./tpcc.py).
> >
> > I think you just broke it by
> > https://github.com/ryncsn/py-tpcc/commit/7b9b380d636cb84faa5b11b5562e531f924eeb7e
> >
> > (But it's also possible you actually wanted me to use this latest
> > commit but forgot to account for it in your instructions.)
> >
> > > Were you able to successfully run the benchmark on a fresh VM by
> > > following the instructions? If not, I'd appreciate it if you could do
> > > so and document all the missing steps.
>
> Ah, you are right. I attempted to convert it to Python 3 but found it
> only brought more trouble, so I gave up, and the instructions still
> use Python 2. However, I accidentally pushed the WIP Python 3 conversion
> commit... I've reset the repo to
> https://github.com/ryncsn/py-tpcc/commit/86e862c5cf3b2d1f51e0297742fa837c7a99ebf8,
> which is working well. Sorry for the inconvenience.

Thanks -- I was able to reproduce results similar to yours.

It turned out the mystery (fewer refaults but worse performance) was caused by
  13.89%  13.89%  kswapd0  [kernel.vmlinux]  [k] __list_del_entry_valid_or_report

Apparently Fedora has CONFIG_DEBUG_LIST=y by default, and after I
turned it off (the only change I made), this series showed better TPS
(I used "--duration=10800" for more reliable results):
                             v6.7-rc6      RFC [1]    change
total txns                      25024        24672       +1%
workingset_refault_anon        573668       680248      -16%
workingset_refault_file     260631976    265808452       -2%

I think this is easy to explain: this series is "lazy", i.e., it
defers the protection to eviction time, whereas your RFC tries to do
it upfront, i.e., at (re)fault time. The advantage of the former is
that it has more up-to-date information, because a folio that is hot
when it's faulted in isn't necessarily still hot later when memory
pressure kicks in. The disadvantage is that it needs to protect folios
that are still hot at eviction time by moving them to a younger
generation, which is where the slowdown happened with CONFIG_DEBUG_LIST=y.
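
To make the contrast concrete, below is a minimal, self-contained sketch of
the two placement strategies described above. It is not the actual
mm/vmscan.c code, nor the RFC's code; the struct, the thresholds and the
four-generation layout are invented purely for illustration.

/*
 * A minimal, self-contained sketch of the two placement strategies discussed
 * above. This is NOT the actual mm/vmscan.c code or the RFC's code; all
 * names, thresholds and the four-generation layout are made up to illustrate
 * the difference between deciding at eviction time and at (re)fault time.
 */
#include <stdbool.h>
#include <stdio.h>

enum { NR_GENS_SKETCH = 4, PROTECT_THRESHOLD = 2, REFAULT_CUTOFF = 1000 };

struct folio_sketch {
	int gen;	/* 0 = oldest generation, NR_GENS_SKETCH - 1 = youngest */
	int refs;	/* accesses seen through file descriptors */
};

/*
 * "Lazy" (this series): decide at eviction time, when the information is
 * most up to date, at the cost of a list move under memory pressure, i.e.,
 * the list_del()/list_add() pair that CONFIG_DEBUG_LIST makes expensive.
 */
static bool protect_at_eviction(struct folio_sketch *folio)
{
	if (folio->gen == 0 && folio->refs >= PROTECT_THRESHOLD) {
		folio->gen = 1;		/* move to the second oldest generation */
		return true;		/* spared by this round of eviction */
	}
	return false;			/* left for eviction */
}

/*
 * "Upfront" (the RFC): decide at (re)fault time from the refault distance
 * carried by the shadow entry; cheaper at reclaim time, but the folio may
 * have gone cold by the time memory pressure actually kicks in.
 */
static void place_at_refault(struct folio_sketch *folio, unsigned long refault_distance)
{
	folio->gen = refault_distance < REFAULT_CUTOFF ? NR_GENS_SKETCH - 2 : 0;
}

int main(void)
{
	struct folio_sketch lazy = { .gen = 0, .refs = 3 };
	struct folio_sketch upfront = { .gen = 0, .refs = 0 };
	bool spared = protect_at_eviction(&lazy);

	printf("lazy path: spared=%d, gen=%d\n", spared, lazy.gen);
	place_at_refault(&upfront, 500);
	printf("upfront path: gen=%d\n", upfront.gen);
	return 0;
}

The only point of the sketch is that the lazy path pays with a list move
under reclaim, which is exactly where the CONFIG_DEBUG_LIST checks were
being hit.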

(It's not really a priority for me to investigate why
__list_del_entry_valid_or_report() is so heavy. Hopefully someone else
can shed some light on it.)
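
For context, CONFIG_DEBUG_LIST makes every list deletion validate the
neighbouring pointers before the entry is unlinked, and with MGLRU that
check runs once per folio moved between generation lists. A rough userspace
rendition of the idea (not the exact lib/list_debug.c code) is sketched
below:

/*
 * A rough userspace rendition of the kind of check CONFIG_DEBUG_LIST adds
 * to every list deletion (see lib/list_debug.c). This is NOT the exact
 * kernel code; it only shows why the extra loads, compares and branches
 * add up when many folios are moved between generation lists.
 */
#include <stdbool.h>
#include <stdio.h>

struct list_head {
	struct list_head *next, *prev;
};

static bool list_del_entry_valid_sketch(struct list_head *entry)
{
	struct list_head *prev = entry->prev;
	struct list_head *next = entry->next;

	/* The kernel additionally checks for LIST_POISON values here. */
	if (prev->next != entry || next->prev != entry) {
		fprintf(stderr, "list corruption around %p\n", (void *)entry);
		return false;
	}
	return true;
}

int main(void)
{
	struct list_head a, b, c;

	/* Build a tiny circular list: a <-> b <-> c <-> a. */
	a.next = &b; b.prev = &a;
	b.next = &c; c.prev = &b;
	c.next = &a; a.prev = &c;

	printf("b safe to delete: %d\n", list_del_entry_valid_sketch(&b));
	return 0;
}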

[1] v3 on top of v6.7-rc6 with "mm/mglru: fix underprotected page
cache" reverted.

2023-12-25 12:03:24

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Yu Zhao <[email protected]> 于2023年12月25日周一 14:30写道:
>
> On Wed, Dec 20, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> >
> > Yu Zhao <[email protected]> 于2023年12月20日周三 16:17写道:
> > >
> > > On Tue, Dec 19, 2023 at 11:38 PM Yu Zhao <[email protected]> wrote:
> > > >
> > > > On Tue, Dec 19, 2023 at 11:58 AM Kairui Song <[email protected]> wrote:
> > > > >
> > > > > Yu Zhao <[email protected]> 于2023年12月19日周二 11:45写道:
> > > > > >
> > > > > > On Mon, Dec 18, 2023 at 8:21 PM Yu Zhao <[email protected]> wrote:
> > > > > > >
> > > > > > > On Mon, Dec 18, 2023 at 11:05 AM Kairui Song <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Yu Zhao <[email protected]> 于2023年12月15日周五 12:56写道:
> > > > > > > > >
> > > > > > > > > On Thu, Dec 14, 2023 at 04:51:00PM -0700, Yu Zhao wrote:
> > > > > > > > > > On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> > > > > > > > > > > > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > > > > > > > > > > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > > > > > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > > > > > > > > > > > on:
> > > > > > > > > > > > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > > > > > > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > > > > > > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > > > > > > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > > > > > > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > > > > > > > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > > > > > > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > > > > > > > > > > > buffer pools (anon).
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > > > > > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > > > > > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > > > > > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > > > > > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > > > > > > > > > > > moving average smooths out the spike quickly.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > To fix the problem:
> > > > > > > > > > > > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > > > > > > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > > > > > > > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > > > > > > > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > > > > > > > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > > > > > > > > > > > > through file descriptors are worth protecting.)
> > > > > > > > > > > > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > > > > > > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > > > > > > > > > > > > similar to the above.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > > > > > > > > > > > Before After Change
> > > > > > > > > > > > > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > > > > > > > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks you for your amazing works on MGLRU.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > > > > > > > > > > > > https://lwn.net/Articles/945266/
> > > > > > > > > > > > > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > > > > > > > > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > > > > > > > > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > > > > > > > > > > > > multiple workloads.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > > > > > > > > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > > > > > > > > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > > > > > > > > > > > > upstream.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I think both this patch and my previous series are trying to solve the
> > > > > > > > > > > > > > > > > issue of file pages being underprotected, and I did a quick test using
> > > > > > > > > > > > > > > > > this series; for the MongoDB test, refault distance still seems to be
> > > > > > > > > > > > > > > > > a better solution (I'm not saying these two optimizations are mutually
> > > > > > > > > > > > > > > > > exclusive, though -- they just have some conflicts in implementation
> > > > > > > > > > > > > > > > > and solve a similar problem):
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Previous result:
> > > > > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > > > > Execution Results after 905 seconds
> > > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This patch:
> > > > > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > > > > Execution Results after 900 seconds
> > > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > The unpatched version is always around 500.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for the test results!
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I think there are a few points here:
> > > > > > > > > > > > > > > > > - Refault distance makes use of page shadows, so it can better
> > > > > > > > > > > > > > > > > distinguish evicted pages with different access patterns (re-access
> > > > > > > > > > > > > > > > > distance).
> > > > > > > > > > > > > > > > > - A throttled refault distance can help hold part of the workingset
> > > > > > > > > > > > > > > > > when memory is too small to hold the whole workingset.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > So maybe parts of this patch and bits of the previous series can be
> > > > > > > > > > > > > > > > > combined to work better on this issue; what do you think?
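
A minimal sketch of the refault-distance idea described above, assuming it works
roughly along the lines of the generic workingset shadow-entry scheme; the names
and the placement rule below are invented for illustration and are not the
actual kernel code:

/*
 * Minimal sketch only, not the kernel implementation; all names are invented.
 */
static unsigned long nonresident_age;   /* bumped once per eviction */

struct shadow {                         /* left behind in the page cache slot */
        unsigned long eviction;         /* snapshot of nonresident_age */
};

/* Called when a folio is evicted: remember "when" it left memory. */
static struct shadow folio_evicted(void)
{
        struct shadow s = { .eviction = nonresident_age++ };
        return s;
}

/* How many other folios were evicted between eviction and refault. */
static unsigned long refault_distance(struct shadow s)
{
        return nonresident_age - s.eviction;
}

/*
 * Hypothetical placement on refault: a short re-access distance suggests the
 * folio would have stayed resident had it been protected, so it starts in a
 * younger generation; otherwise it starts in the oldest one.
 */
static int pick_gen(struct shadow s, unsigned long workingset_pages, int max_gen)
{
        return refault_distance(s) <= workingset_pages ? max_gen : 0;
}

A real policy would also have to account for per-memcg and per-node state and
for anon vs file pages, which is glossed over here.
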
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I'm working on V4 of the RFC now, which just updates some comments and
> > > > > > > > > > > > > > skips anon page re-activation in the refault path for MGLRU, which was
> > > > > > > > > > > > > > not very helpful -- only some tiny adjustments.
> > > > > > > > > > > > > > And I found it easier to test with fio, using the following test script:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > #!/bin/bash
> > > > > > > > > > > > > > swapoff -a
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > > > > > > > > > > > mkfs.ext4 /dev/ram0
> > > > > > > > > > > > > > mount /dev/ram0 /mnt
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > > > > > > > > > > > cd /sys/fs/cgroup/benchmark
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > echo 4G > memory.max
> > > > > > > > > > > > > > echo $$ > cgroup.procs
> > > > > > > > > > > > > > echo 3 > /proc/sys/vm/drop_caches
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > > > > > > > > > > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > > > > > > > > > > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > > > > > > > > > > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > > > > > > > > > > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > zipf:0.5 is used here to simulate cached reads with a slight bias
> > > > > > > > > > > > > > towards certain pages.
> > > > > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > > > > > > > > > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > > > > > > > > > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > > > > > > > > > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > MGLRU off:
> > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > > > > > > > > > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - If I change zipf:0.5 to random:
> > > > > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > > > > > > > > > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > > > > > > > > > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > > > > > > > > > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > MGLRU off:
> > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > > > > > > > > > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > fio uses a ramdisk, so LRU accuracy will have a smaller impact. The
> > > > > > > > > > > > > > MongoDB test I provided before uses a SATA SSD, so it will have a much
> > > > > > > > > > > > > > higher impact. I'll provide a script to set up the test case and run
> > > > > > > > > > > > > > it; it's more complex to set up than fio since it involves setting up
> > > > > > > > > > > > > > multiple replicas, auth, and hundreds of GB of test fixtures. I'm
> > > > > > > > > > > > > > currently occupied by some other tasks but will try my best to send
> > > > > > > > > > > > > > them out as soon as possible.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks! Apparently your RFC did show better IOPS with both access
> > > > > > > > > > > > > patterns, which was a surprise to me because it had higher refaults,
> > > > > > > > > > > > > and usually higher refaults result in worse performance.
> > > > > > > > >
> > > > > > > > > And thanks for providing the refaults I requested -- your data
> > > > > > > > > below confirms what I mentioned above:
> > > > > > > > >
> > > > > > > > > For fio:
> > > > > > > > > Your RFC This series Change
> > > > > > > > > workingset_refault_file 628192729 596790506 -5%
> > > > > > > > > IOPS 1862k 1830k -2%
> > > > > > > > >
> > > > > > > > > For MongoDB:
> > > > > > > > > Your RFC This series Change
> > > > > > > > > workingset_refault_anon 10512 35277 +30%
> > > > > > > > > workingset_refault_file 22751782 20335355 -11%
> > > > > > > > > total 22762294 20370632 -11%
> > > > > > > > > TPS 0.09 0.06 -33%
> > > > > > > > >
> > > > > > > > > For MongoDB, this series should be a big win (but apparently it's not),
> > > > > > > > > especially when using zram, since an anon refault should be a lot
> > > > > > > > > cheaper than a file refault.
> > > > > > > > >
> > > > > > > > > So, I'm baffled...
> > > > > > > > >
> > > > > > > > > One important detail I forgot to mention: based on your data from
> > > > > > > > > lru_gen_full, I think there is another difference between our Kconfigs:
> > > > > > > > >
> > > > > > > > > Your Kconfig My Kconfig Max possible
> > > > > > > > > LRU_REFS_WIDTH 1 2 2
> > > > > > > >
> > > > > > > > Hi Yu,
> > > > > > > >
> > > > > > > > Thanks for the info -- my fault, I forgot to update my config as I was
> > > > > > > > testing some other features.
> > > > > > > > But after I changed LRU_REFS_WIDTH to 2 by disabling IDLE_PAGE, things
> > > > > > > > got much worse for the MongoDB test:
> > > > > > > >
> > > > > > > > With LRU_REFS_WIDTH == 2:
> > > > > > > >
> > > > > > > > This patch:
> > > > > > > > ==================================================================
> > > > > > > > Execution Results after 919 seconds
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > Executed Time (µs) Rate
> > > > > > > > STOCK_LEVEL 488 27598136201.9 0.02 txn/s
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > TOTAL 488 27598136201.9 0.02 txn/s
> > > > > > > >
> > > > > > > > memcg 86 /system.slice/docker-1c3a90be9f0a072f5719332419550cd0e1455f2cd5863bc2780ca4d3f913ece5.scope
> > > > > > > > node 0
> > > > > > > > 1 948187 0x 0x
> > > > > > > >         0 0 0 0 0 0 0
> > > > > > > >         1 0 0 0 0 0 0
> > > > > > > >         2 0 0 0 0 0 0
> > > > > > > >         3 0 0 0 0 0 0
> > > > > > > >                 0 0 0 0 0 0
> > > > > > > > 2 948187 0 6051788
> > > > > > > >         0 0r 0e 0p 11916r 66442e 0p
> > > > > > > >         1 0r 0e 0p 903r 16888e 0p
> > > > > > > >         2 0r 0e 0p 459r 9764e 0p
> > > > > > > >         3 0r 0e 0p 0r 0e 2874p
> > > > > > > >                 0 0 0 0 0 0
> > > > > > > > 3 948187 1353160 6351
> > > > > > > >         0 0 0 0 0 0 0
> > > > > > > >         1 0 0 0 0 0 0
> > > > > > > >         2 0 0 0 0 0 0
> > > > > > > >         3 0 0 0 0 0 0
> > > > > > > >                 0 0 0 0 0 0
> > > > > > > > 4 73045 23573 12
> > > > > > > >         0 0R 0T 0 3498607R 4868605T 0
> > > > > > > >         1 0R 0T 0 3012246R 3270261T 0
> > > > > > > >         2 0R 0T 0 2498608R 2839104T 0
> > > > > > > >         3 0R 0T 0 0R 1983947T 0
> > > > > > > >         1486579L 0O 1380614Y 2945N 2945F 2734A
> > > > > > > >
> > > > > > > > workingset_refault_anon 0
> > > > > > > > workingset_refault_file 18130598
> > > > > > > >
> > > > > > > > total used free shared buff/cache available
> > > > > > > > Mem: 31978 6705 312 20 24960 24786
> > > > > > > > Swap: 31977 4 31973
> > > > > > > >
> > > > > > > > RFC:
> > > > > > > > ==================================================================
> > > > > > > > Execution Results after 908 seconds
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > Executed Time (µs) Rate
> > > > > > > > STOCK_LEVEL 2252 27159962888.2 0.08 txn/s
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > TOTAL 2252 27159962888.2 0.08 txn/s
> > > > > > > >
> > > > > > > > workingset_refault_anon 22585
> > > > > > > > workingset_refault_file 22715256
> > > > > > > >
> > > > > > > > memcg 66 /system.slice/docker-0989446ff78106e32d3f400a0cf371c9a703281bded86d6d6bb1af706ebb25da.scope
> > > > > > > > node 0
> > > > > > > > 22 563007 2274 1198225
> > > > > > > >         0 0r 1e 0p 0r 697076e 0p
> > > > > > > >         1 0r 0e 0p 0r 0e 325661p
> > > > > > > >         2 0r 0e 0p 0r 0e 888728p
> > > > > > > >         3 0r 0e 0p 0r 0e 3602238p
> > > > > > > >                 0 0 0 0 0 0
> > > > > > > > 23 532222 7525 4948747
> > > > > > > >         0 0 0 0 0 0 0
> > > > > > > >         1 0 0 0 0 0 0
> > > > > > > >         2 0 0 0 0 0 0
> > > > > > > >         3 0 0 0 0 0 0
> > > > > > > >                 0 0 0 0 0 0
> > > > > > > > 24 500367 1214667 3292
> > > > > > > >         0 0 0 0 0 0 0
> > > > > > > >         1 0 0 0 0 0 0
> > > > > > > >         2 0 0 0 0 0 0
> > > > > > > >         3 0 0 0 0 0 0
> > > > > > > >                 0 0 0 0 0 0
> > > > > > > > 25 469692 40797 466
> > > > > > > >         0 0R 271T 0 0R 1162165T 0
> > > > > > > >         1 0R 0T 0 774028R 1205332T 0
> > > > > > > >         2 0R 0T 0 0R 932484T 0
> > > > > > > >         3 0R 1T 0 0R 4252158T 0
> > > > > > > >         25178380L 156515O 23953602Y 59234N 49391F 48664A
> > > > > > > >
> > > > > > > > total used free shared buff/cache available
> > > > > > > > Mem: 31978 6968 338 5 24671 24555
> > > > > > > > Swap: 31977 1533 30444
> > > > > > > >
> > > > > > > > Using the same MongoDB config (a 3-replica cluster using the same config):
> > > > > > > > {
> > > > > > > > "net": {
> > > > > > > > "bindIpAll": true,
> > > > > > > > "ipv6": false,
> > > > > > > > "maxIncomingConnections": 10000
> > > > > > > > },
> > > > > > > > "setParameter": {
> > > > > > > > "disabledSecureAllocatorDomains": "*"
> > > > > > > > },
> > > > > > > > "replication": {
> > > > > > > > "oplogSizeMB": 10480,
> > > > > > > > "replSetName": "issa-tpcc_0"
> > > > > > > > },
> > > > > > > > "security": {
> > > > > > > > "keyFile": "/data/db/keyfile"
> > > > > > > > },
> > > > > > > > "storage": {
> > > > > > > > "dbPath": "/data/db/",
> > > > > > > > "syncPeriodSecs": 60,
> > > > > > > > "directoryPerDB": true,
> > > > > > > > "wiredTiger": {
> > > > > > > > "engineConfig": {
> > > > > > > > "cacheSizeGB": 5
> > > > > > > > }
> > > > > > > > }
> > > > > > > > },
> > > > > > > > "systemLog": {
> > > > > > > > "destination": "file",
> > > > > > > > "logAppend": true,
> > > > > > > > "logRotate": "rename",
> > > > > > > > "path": "/data/db/mongod.log",
> > > > > > > > "verbosity": 0
> > > > > > > > }
> > > > > > > > }
> > > > > > > >
> > > > > > > > The test environment has 32G of memory and 16 cores.
> > > > > > > >
> > > > > > > > Per my analysis, the access pattern of the MongoDB test is that pages
> > > > > > > > are re-accessed long after they are evicted, so the PID controller
> > > > > > > > won't protect the higher tiers. The RFC makes use of the long-existing
> > > > > > > > shadow entries to feed back into the PID controller/generations, so
> > > > > > > > the result is much better.
> > > > > > > > It still needs more adjusting though; I will try to do a rebase on top
> > > > > > > > of mm-unstable, which includes your patch.
> > > > > > > >
> > > > > > > > I've no idea why workingset_refault_* is higher in the better case;
> > > > > > > > this is clearly an IO-bound workload, with memory and IO busy while
> > > > > > > > the CPU is not full...
> > > > > > > >
> > > > > > > > I've uploaded my local reproducer here:
> > > > > > > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > > > > > > https://github.com/ryncsn/py-tpcc
> > > > > > >
> > > > > > > Thanks for the repos -- I'm trying them right now. Which MongoDB
> > > > > > > version did you use? setup.sh didn't seem to install it.
> > > > > > >
> > > > > > > Also do you have a QEMU image? It'd be a lot easier for me to
> > > > > > > duplicate the exact environment by looking into it.
> > > > > >
> > > > > > I ended up using docker.io/mongodb/mongodb-community-server:latest,
> > > > > > and it's not working:
> > > > > >
> > > > > > # docker exec -it mongo-r1 mongosh --eval \
> > > > > > '"rs.initiate({
> > > > > > _id: "issa-tpcc_0",
> > > > > > members: [
> > > > > > {_id: 0, host: "mongo-r1"},
> > > > > > {_id: 1, host: "mongo-r2"},
> > > > > > {_id: 2, host: "mongo-r3"}
> > > > > > ]
> > > > > > })"'
> > > > > > Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
> > > > > > Error: can only create exec sessions on running containers: container
> > > > > > state improper
> > > > >
> > > > > Hi Yu,
> > > > >
> > > > > I've updated the test repo:
> > > > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > > >
> > > > > I've tested it on top of the latest Fedora Cloud Image 39 and it
> > > > > worked well for me; the README now contains detailed and
> > > > > easy-to-follow steps to reproduce this test.
> > > >
> > > > Thanks. I was following the instructions down to the letter and it
> > > > fell apart again at line 46 (./tpcc.py).
> > >
> > > I think you just broke it by
> > > https://github.com/ryncsn/py-tpcc/commit/7b9b380d636cb84faa5b11b5562e531f924eeb7e
> > >
> > > (But it's also possible you actually wanted me to use this latest
> > > commit but forgot to account for it in your instructions.)
> > >
> > > > Were you able to successfully run the benchmark on a fresh VM by
> > > > following the instructions? If not, I'd appreciate it if you could do
> > > > so and document all the missing steps.
> >
> > Ah, you are right. I attempted to convert it to Python 3 but found it
> > only brought more trouble, so I gave up, and the instructions still
> > use Python 2. However, I accidentally pushed the WIP Python 3
> > conversion commit... I've reset the repo to
> > https://github.com/ryncsn/py-tpcc/commit/86e862c5cf3b2d1f51e0297742fa837c7a99ebf8,
> > which is working well. Sorry for the inconvenience.
>
> Thanks -- I was able to reproduce results similar to yours.
>

Hi Yu,

Thanks for the testing, and merry xmas.

> It turned out the mystery (fewer refaults but worse performance) was caused by
> 13.89% 13.89% kswapd0 [kernel.vmlinux] [k]
> __list_del_entry_valid_or_report

I'm not sure about this. If the task were CPU bound, this could
explain it. But it's not: the performance gap is larger when tested on
a slow IO device.

The iostat output during my test run:
avg-cpu: %user %nice %system %iowait %steal %idle
7.40 0.00 2.42 83.37 0.00 6.80
Device            r/s    w/s     rkB/s  wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm  %util
vda             35.00   0.80    167.60  17.20   6.90   3.50 16.47 81.40    0.47    1.62   0.02     4.79    21.50  0.63   2.27
vdb           5999.30   4.80 104433.60  84.00   0.00   8.30  0.00 63.36    6.54    1.31  39.25    17.41    17.50  0.17 100.00
zram0            0.00   0.00      0.00   0.00   0.00   0.00  0.00  0.00    0.00    0.00   0.00     0.00     0.00  0.00   0.00

You can see the CPU is waiting for IO; %user is always around 10%.
The hotspot you posted only takes up 13.89% of the runtime, which
shouldn't cause such a large performance drop.
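
As a rough upper bound on that last point, treating the 13.89% of samples as
serial CPU time (which, if anything, overstates its possible impact on an
IO-bound run):

\[
  S_{\max} = \frac{1}{1 - 0.1389} \approx 1.16
\]

so even eliminating that hotspot entirely would buy at most about 16%, well
short of the TPS gaps reported earlier in the thread.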

>
> Apparently Fedora has CONFIG_DEBUG_LIST=y by default, and after I
> turned it off (the only change I made), this series showed better TPS
> (I used "--duration=10800" for more reliable results):
> v6.7-rc6 RFC [1] change
> total txns 25024 24672 +1%
> workingset_refault_anon 573668 680248 -16%
> workingset_refault_file 260631976 265808452 -2%

I have disabled CONFIG_DEBUG_LIST when doing the performance comparison tests.

I believe you are using a higher-performance SSD, so the bottleneck is
the CPU, and the RFC involves more lru/memcg counter updates/iterations,
so it is slower by 1%.

> I think this is easy to explain: this series is "lazy", i.e.,
> deferring the protection to eviction time, whereas your RFC tries to
> do it upfront, i.e., at (re)fault time. The advantage of the former is
> that it has more up-to-date information because a folio that is hot
> when it's faulted in doesn't mean it's still hot later when memory
> pressure kicks in. The disadvantage is that it needs to protect folios
> that are still hot at eviction time, by moving them to a younger
> generation, where the slow down happened with CONFIG_DEBUG_LIST=y.
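
To make the contrast concrete, a rough sketch of the two placement strategies
with invented types and helpers (they do not correspond to the kernel's data
structures):

/* Illustrative only; nothing here matches the actual kernel code. */
#include <stdbool.h>

struct folio_stub {
        unsigned int refs;      /* accesses seen through file descriptors */
        int gen;                /* generation index; 0 is the oldest */
};

/* "Upfront": the RFC decides placement when the folio is (re)faulted in. */
static void add_folio_upfront(struct folio_stub *f, int guessed_gen)
{
        f->gen = guessed_gen;   /* e.g. based on refault distance at fault time */
}

/*
 * "Lazy": this series keeps the folio in an old generation and re-checks at
 * eviction time, when its access history is most up to date; protecting it
 * means moving it to a younger generation, i.e. a cross-list move.
 */
static bool should_evict_lazy(struct folio_stub *f, int max_gen)
{
        if (f->refs > 1) {              /* still hot when reclaim reaches it */
                f->gen = max_gen - 1;   /* protect: move to a younger gen */
                return false;
        }
        return true;                    /* cold: evict from the oldest gen */
}

The cross-generation list move in the lazy path is the operation that
CONFIG_DEBUG_LIST makes more expensive.
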
>
> (It's not really a priority for me to investigate why
> __list_del_entry_valid_or_report() is so heavy. Hopefully someone else
> can shed some light on it.)

I've just set up another cluster with a high-performance SSD, where
the CPU is now the bottleneck, to better understand this. I will try
to do more tests to see if I can find out something.

2023-12-25 21:53:33

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Mon, Dec 25, 2023 at 5:03 AM Kairui Song <[email protected]> wrote:
>
> Hi Yu,
>
> Thanks for the testing, and merry xmas.
>
> > It turned out the mystery (fewer refaults but worse performance) was caused by
> > 13.89% 13.89% kswapd0 [kernel.vmlinux] [k]
> > __list_del_entry_valid_or_report
>
> I'm not sure about this. If the task were CPU bound, this could
> explain it. But it's not: the performance gap is larger when tested on
> a slow IO device.
>
> The iostat output during my test run:
> avg-cpu: %user %nice %system %iowait %steal %idle
> 7.40 0.00 2.42 83.37 0.00 6.80
> Device            r/s    w/s     rkB/s  wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm  %util
> vda             35.00   0.80    167.60  17.20   6.90   3.50 16.47 81.40    0.47    1.62   0.02     4.79    21.50  0.63   2.27
> vdb           5999.30   4.80 104433.60  84.00   0.00   8.30  0.00 63.36    6.54    1.31  39.25    17.41    17.50  0.17 100.00
> zram0            0.00   0.00      0.00   0.00   0.00   0.00  0.00  0.00    0.00    0.00   0.00     0.00     0.00  0.00   0.00

I ran the benchmark on the slowest bare metal I have that roughly
matches your CPU/DRAM configurations (ThinkPad P1 G4
https://support.lenovo.com/us/en/solutions/pd031426).

But it seems you used a VM (vda/vdb) -- I never run performance
benchmarks in VMs because the host and hypervisor can complicate
things. For example, in this case, is it possible that the host page
cache cached your disk image containing the database files?
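
For what it's worth, a small illustration of that concern, assuming the image
sits on a host filesystem: unless the image file is opened with O_DIRECT
(which is what a hypervisor cache mode such as QEMU's cache=none uses), guest
disk reads can be satisfied from the host page cache. The path below is made
up:

/* Sketch only: read a VM image directly, bypassing the host page cache. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        void *buf;

        /* O_DIRECT requires block-aligned buffers, offsets and sizes. */
        if (posix_memalign(&buf, 4096, 4096))
                return 1;

        int fd = open("/var/lib/libvirt/images/test.img", O_RDONLY | O_DIRECT);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* This read goes to the backing device, not the host page cache. */
        ssize_t n = pread(fd, buf, 4096, 0);
        printf("read %zd bytes directly from the backing device\n", n);

        close(fd);
        free(buf);
        return 0;
}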

> You can see the CPU is waiting for IO; %user is always around 10%.
> The hotspot you posted only takes up 13.89% of the runtime, which
> shouldn't cause such a large performance drop.
>
> >
> > Apparently Fedora has CONFIG_DEBUG_LIST=y by default, and after I
> > turned it off (the only change I made), this series showed better TPS
> > (I used "--duration=10800" for more reliable results):
> > v6.7-rc6 RFC [1] change
> > total txns 25024 24672 +1%
> > workingset_refault_anon 573668 680248 -16%
> > workingset_refault_file 260631976 265808452 -2%
>
> I have disabled CONFIG_DEBUG_LIST when doing the performance comparison tests.
>
> I believe you are using a higher-performance SSD, so the bottleneck is
> the CPU, and the RFC involves more lru/memcg counter updates/iterations,
> so it is slower by 1%.
>
> > I think this is easy to explain: this series is "lazy", i.e.,
> > deferring the protection to eviction time, whereas your RFC tries to
> > do it upfront, i.e., at (re)fault time. The advantage of the former is
> > that it has more up-to-date information because a folio that is hot
> > when it's faulted in doesn't mean it's still hot later when memory
> > pressure kicks in. The disadvantage is that it needs to protect folios
> > that are still hot at eviction time, by moving them to a younger
> > generation, where the slow down happened with CONFIG_DEBUG_LIST=y.
> >
> > (It's not really a priority for me to investigate why
> > __list_del_entry_valid_or_report() is so heavy. Hopefully someone else
> > can shed some light on it.)
>
> I've just set up another cluster with a high-performance SSD, where
> the CPU is now the bottleneck, to better understand this. I will try
> to do more tests to see if I can find out something.

I'd suggest we both stick to bare metal until we can reconcile our
test results. Otherwise, there'd be too many moving parts for us to
get to the bottom of this.

2023-12-25 22:01:49

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Mon, Dec 25, 2023 at 2:52 PM Yu Zhao <[email protected]> wrote:
>
> On Mon, Dec 25, 2023 at 5:03 AM Kairui Song <[email protected]> wrote:
> >
> > Yu Zhao <[email protected]> 于2023年12月25日周一 14:30写道:
> > >
> > > On Wed, Dec 20, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > >
> > > > Yu Zhao <[email protected]> 于2023年12月20日周三 16:17写道:
> > > > >
> > > > > On Tue, Dec 19, 2023 at 11:38 PM Yu Zhao <[email protected]> wrote:
> > > > > >
> > > > > > On Tue, Dec 19, 2023 at 11:58 AM Kairui Song <[email protected]> wrote:
> > > > > > >
> > > > > > > Yu Zhao <[email protected]> 于2023年12月19日周二 11:45写道:
> > > > > > > >
> > > > > > > > On Mon, Dec 18, 2023 at 8:21 PM Yu Zhao <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, Dec 18, 2023 at 11:05 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月15日周五 12:56写道:
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Dec 14, 2023 at 04:51:00PM -0700, Yu Zhao wrote:
> > > > > > > > > > > > On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> > > > > > > > > > > > > > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > > > > > > > > > > > > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > > > > > > > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > > > > > > > > > > > > > on:
> > > > > > > > > > > > > > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > > > > > > > > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > > > > > > > > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > > > > > > > > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > > > > > > > > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > > > > > > > > > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > > > > > > > > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > > > > > > > > > > > > > buffer pools (anon).
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > > > > > > > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > > > > > > > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > > > > > > > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > > > > > > > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > > > > > > > > > > > > > moving average smooths out the spike quickly.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > To fix the problem:
> > > > > > > > > > > > > > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > > > > > > > > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > > > > > > > > > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > > > > > > > > > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > > > > > > > > > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > > > > > > > > > > > > > > through file descriptors are worth protecting.)
> > > > > > > > > > > > > > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > > > > > > > > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > > > > > > > > > > > > > > similar to the above.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > > > > > > > > > > > > > Before After Change
> > > > > > > > > > > > > > > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > > > > > > > > > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thanks you for your amazing works on MGLRU.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > > > > > > > > > > > > > > https://lwn.net/Articles/945266/
> > > > > > > > > > > > > > > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > > > > > > > > > > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > > > > > > > > > > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > > > > > > > > > > > > > > multiple workloads.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > > > > > > > > > > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > > > > > > > > > > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > > > > > > > > > > > > > > upstream.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I think both this patch and my previous series are for solving the
> > > > > > > > > > > > > > > > > > > file pages underpertected issue, and I did a quick test using this
> > > > > > > > > > > > > > > > > > > series, for mongodb test, refault distance seems still a better
> > > > > > > > > > > > > > > > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > > > > > > > > > > > > > > > though, just they do have some conflicts in implementation and solving
> > > > > > > > > > > > > > > > > > > similar problem):
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Previous result:
> > > > > > > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > > > > > > Execution Results after 905 seconds
> > > > > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > This patch:
> > > > > > > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > > > > > > Execution Results after 900 seconds
> > > > > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Unpatched version is always around ~500.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks for the test results!
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I think there are a few points here:
> > > > > > > > > > > > > > > > > > > - Refault distance makes use of page shadows so it can better
> > > > > > > > > > > > > > > > > > > distinguish evicted pages with different access patterns (re-access
> > > > > > > > > > > > > > > > > > > distance).
> > > > > > > > > > > > > > > > > > > - Throttled refault distance can help hold part of the workingset when
> > > > > > > > > > > > > > > > > > > memory is too small to hold the whole workingset.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > So maybe parts of this patch and bits of the previous series can be
> > > > > > > > > > > > > > > > > > > combined to work better on this issue, what do you think?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I'm working on V4 of the RFC now, which just updates some comments and
> > > > > > > > > > > > > > > > skips anon page re-activation in the refault path for MGLRU, which was
> > > > > > > > > > > > > > > > not very helpful; only some tiny adjustments.
> > > > > > > > > > > > > > > > And I found it easier to test with fio, using the following test script:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > #!/bin/bash
> > > > > > > > > > > > > > > > swapoff -a
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > > > > > > > > > > > > > mkfs.ext4 /dev/ram0
> > > > > > > > > > > > > > > > mount /dev/ram0 /mnt
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > > > > > > > > > > > > > cd /sys/fs/cgroup/benchmark
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > echo 4G > memory.max
> > > > > > > > > > > > > > > > echo $$ > cgroup.procs
> > > > > > > > > > > > > > > > echo 3 > /proc/sys/vm/drop_caches
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > > > > > > > > > > > > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > > > > > > > > > > > > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > > > > > > > > > > > > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > > > > > > > > > > > > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > zipf:0.5 is used here to simulate a cached read with a slight bias
> > > > > > > > > > > > > > > > towards certain pages.
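
(For readers unfamiliar with fio's zipf option, the standalone sketch
below -- plain C, illustrative only and unrelated to fio's internals,
with the page count picked as an assumption to roughly match
--size=1024m -- draws page indices with probability proportional to
1/k^0.5, i.e. the mild skew described above: low-numbered pages are
favoured, but the tail still gets plenty of accesses.)

/*
 * Toy illustration of a zipf(0.5)-like access pattern, similar in
 * spirit to fio's --random_distribution=zipf:0.5; not fio's code.
 */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_PAGES 262144         /* 1024m of 4KiB pages per job */

int main(void)
{
        static double cdf[NR_PAGES];
        double total = 0.0;
        int i;

        /* weight of page k is k^-0.5: a mild bias towards low pages */
        for (i = 0; i < NR_PAGES; i++) {
                total += pow(i + 1, -0.5);
                cdf[i] = total;
        }

        /* draw a few sample page indices by inverting the CDF */
        for (i = 0; i < 10; i++) {
                double r = (double)rand() / RAND_MAX * total;
                int lo = 0, hi = NR_PAGES - 1;

                while (lo < hi) {       /* first index with cdf[] >= r */
                        int mid = (lo + hi) / 2;

                        if (cdf[mid] < r)
                                lo = mid + 1;
                        else
                                hi = mid;
                }
                printf("access page %d\n", lo);
        }
        return 0;
}

(Build with something like "gcc -O2 zipf_sketch.c -lm"; the file name is
only a placeholder.)
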
> > > > > > > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > > > > > > > > > > > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > > > > > > > > > > > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > > > > > > > > > > > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > MGLRU off:
> > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > > > > > > > > > > > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > - If I change zipf:0.5 to random:
> > > > > > > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > > > > > > > > > > > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > > > > > > > > > > > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > > > > > > > > > > > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > MGLRU off:
> > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > > > > > > > > > > > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > fio uses a ramdisk so LRU accuracy will have a smaller impact. The MongoDB
> > > > > > > > > > > > > > > > test I provided before uses a SATA SSD so it will have a much higher
> > > > > > > > > > > > > > > > impact. I'll provide a script to set up the test case and run it; it's
> > > > > > > > > > > > > > > > more complex to set up than fio since it involves setting up multiple
> > > > > > > > > > > > > > > > replicas, auth, and hundreds of GB of test fixtures. I'm currently
> > > > > > > > > > > > > > > > occupied by some other tasks but will try my best to send it out as
> > > > > > > > > > > > > > > > soon as possible.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks! Apparently your RFC did show better IOPS with both access
> > > > > > > > > > > > > > > patterns, which was a surprise to me because it had higher refaults,
> > > > > > > > > > > > > > > and usually higher refaults result in worse performance.
> > > > > > > > > > >
> > > > > > > > > > > And thanks for providing the refaults I requested -- your data
> > > > > > > > > > > below confirms what I mentioned above:
> > > > > > > > > > >
> > > > > > > > > > > For fio:
> > > > > > > > > > > Your RFC This series Change
> > > > > > > > > > > workingset_refault_file 628192729 596790506 -5%
> > > > > > > > > > > IOPS 1862k 1830k -2%
> > > > > > > > > > >
> > > > > > > > > > > For MongoDB:
> > > > > > > > > > > Your RFC This series Change
> > > > > > > > > > > workingset_refault_anon 10512 35277 +30%
> > > > > > > > > > > workingset_refault_file 22751782 20335355 -11%
> > > > > > > > > > > total 22762294 20370632 -11%
> > > > > > > > > > > TPS 0.09 0.06 -33%
> > > > > > > > > > >
> > > > > > > > > > > For MongoDB, this series should be a big win (but apparently it's not),
> > > > > > > > > > > especially when using zram, since an anon refault should be a lot
> > > > > > > > > > > cheaper than a file refault.
> > > > > > > > > > >
> > > > > > > > > > > So, I'm baffled...
> > > > > > > > > > >
> > > > > > > > > > > One important detail I forgot to mention: based on your data from
> > > > > > > > > > > lru_gen_full, I think there is another difference between our Kconfigs:
> > > > > > > > > > >
> > > > > > > > > > > Your Kconfig My Kconfig Max possible
> > > > > > > > > > > LRU_REFS_WIDTH 1 2 2
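
(A rough way to see why LRU_REFS_WIDTH matters: the toy program below
models the per-folio refs field as a saturating counter. It is plain
userspace C and only a sketch of the idea -- not the kernel's actual
tier calculation in include/linux/mm_inline.h. With a 1-bit counter, a
folio accessed twice and a folio accessed five times look identical to
the PID controller; with 2 bits they can still be told apart.)

#include <stdio.h>

/* saturating counter: "width" bits can count at most 2^width - 1 */
static unsigned int saturating_inc(unsigned int refs, unsigned int width)
{
        unsigned int max = (1u << width) - 1;   /* 1 bit -> 1, 2 bits -> 3 */

        return refs < max ? refs + 1 : max;
}

int main(void)
{
        unsigned int width;

        for (width = 1; width <= 2; width++) {
                unsigned int refs = 0, access;

                printf("LRU_REFS_WIDTH=%u:", width);
                for (access = 1; access <= 5; access++) {
                        refs = saturating_inc(refs, width);
                        printf(" access %u -> refs %u", access, refs);
                }
                printf("\n");
        }
        return 0;
}
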
> > > > > > > > > >
> > > > > > > > > > Hi Yu,
> > > > > > > > > >
> > > > > > > > > > Thanks for the info, my fault, I forgot to update my config as I was
> > > > > > > > > > testing some other features.
> > > > > > > > > > But after I changed LRU_REFS_WIDTH to 2 by disabling IDLE_PAGE, things
> > > > > > > > > > got much worse for the MongoDB test:
> > > > > > > > > >
> > > > > > > > > > With LRU_REFS_WIDTH == 2:
> > > > > > > > > >
> > > > > > > > > > This patch:
> > > > > > > > > > ==================================================================
> > > > > > > > > > Execution Results after 919 seconds
> > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > STOCK_LEVEL 488 27598136201.9 0.02 txn/s
> > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > TOTAL 488 27598136201.9 0.02 txn/s
> > > > > > > > > >
> > > > > > > > > > memcg 86 /system.slice/docker-1c3a90be9f0a072f5719332419550cd0e1455f2cd5863bc2780ca4d3f913ece5.scope
> > > > > > > > > > node 0
> > > > > > > > > > 1 948187 0x 0x
> > > > > > > > > > 0 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 1 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 2 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 3 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 2 948187 0 6051788·
> > > > > > > > > > 0 0r 0e 0p 11916r
> > > > > > > > > > 66442e 0p
> > > > > > > > > > 1 0r 0e 0p 903r
> > > > > > > > > > 16888e 0p
> > > > > > > > > > 2 0r 0e 0p 459r
> > > > > > > > > > 9764e 0p
> > > > > > > > > > 3 0r 0e 0p 0r
> > > > > > > > > > 0e 2874p
> > > > > > > > > > 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 3 948187 1353160 6351·
> > > > > > > > > > 0 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 1 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 2 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 3 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 4 73045 23573 12·
> > > > > > > > > > 0 0R 0T 0 3498607R
> > > > > > > > > > 4868605T 0·
> > > > > > > > > > 1 0R 0T 0 3012246R
> > > > > > > > > > 3270261T 0·
> > > > > > > > > > 2 0R 0T 0 2498608R
> > > > > > > > > > 2839104T 0·
> > > > > > > > > > 3 0R 0T 0 0R
> > > > > > > > > > 1983947T 0·
> > > > > > > > > > 1486579L 0O 1380614Y 2945N
> > > > > > > > > > 2945F 2734A
> > > > > > > > > >
> > > > > > > > > > workingset_refault_anon 0
> > > > > > > > > > workingset_refault_file 18130598
> > > > > > > > > >
> > > > > > > > > > total used free shared buff/cache available
> > > > > > > > > > Mem: 31978 6705 312 20 24960 24786
> > > > > > > > > > Swap: 31977 4 31973
> > > > > > > > > >
> > > > > > > > > > RFC:
> > > > > > > > > > ==================================================================
> > > > > > > > > > Execution Results after 908 seconds
> > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > STOCK_LEVEL 2252 27159962888.2 0.08 txn/s
> > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > TOTAL 2252 27159962888.2 0.08 txn/s
> > > > > > > > > >
> > > > > > > > > > workingset_refault_anon 22585
> > > > > > > > > > workingset_refault_file 22715256
> > > > > > > > > >
> > > > > > > > > > memcg 66 /system.slice/docker-0989446ff78106e32d3f400a0cf371c9a703281bded86d6d6bb1af706ebb25da.scope
> > > > > > > > > > node 0
> > > > > > > > > > 22 563007 2274 1198225·
> > > > > > > > > > 0 0r 1e 0p 0r
> > > > > > > > > > 697076e 0p
> > > > > > > > > > 1 0r 0e 0p 0r
> > > > > > > > > > 0e 325661p
> > > > > > > > > > 2 0r 0e 0p 0r
> > > > > > > > > > 0e 888728p
> > > > > > > > > > 3 0r 0e 0p 0r
> > > > > > > > > > 0e 3602238p
> > > > > > > > > > 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 23 532222 7525 4948747·
> > > > > > > > > > 0 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 1 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 2 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 3 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 24 500367 1214667 3292·
> > > > > > > > > > 0 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 1 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 2 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 3 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 25 469692 40797 466·
> > > > > > > > > > 0 0R 271T 0 0R
> > > > > > > > > > 1162165T 0·
> > > > > > > > > > 1 0R 0T 0 774028R
> > > > > > > > > > 1205332T 0·
> > > > > > > > > > 2 0R 0T 0 0R
> > > > > > > > > > 932484T 0·
> > > > > > > > > > 3 0R 1T 0 0R
> > > > > > > > > > 4252158T 0·
> > > > > > > > > > 25178380L 156515O 23953602Y 59234N
> > > > > > > > > > 49391F 48664A
> > > > > > > > > >
> > > > > > > > > > total used free shared buff/cache available
> > > > > > > > > > Mem: 31978 6968 338 5 24671 24555
> > > > > > > > > > Swap: 31977 1533 30444
> > > > > > > > > >
> > > > > > > > > > Using the same mongodb config (a 3-replica cluster, all replicas using the same config):
> > > > > > > > > > {
> > > > > > > > > > "net": {
> > > > > > > > > > "bindIpAll": true,
> > > > > > > > > > "ipv6": false,
> > > > > > > > > > "maxIncomingConnections": 10000
> > > > > > > > > > },
> > > > > > > > > > "setParameter": {
> > > > > > > > > > "disabledSecureAllocatorDomains": "*"
> > > > > > > > > > },
> > > > > > > > > > "replication": {
> > > > > > > > > > "oplogSizeMB": 10480,
> > > > > > > > > > "replSetName": "issa-tpcc_0"
> > > > > > > > > > },
> > > > > > > > > > "security": {
> > > > > > > > > > "keyFile": "/data/db/keyfile"
> > > > > > > > > > },
> > > > > > > > > > "storage": {
> > > > > > > > > > "dbPath": "/data/db/",
> > > > > > > > > > "syncPeriodSecs": 60,
> > > > > > > > > > "directoryPerDB": true,
> > > > > > > > > > "wiredTiger": {
> > > > > > > > > > "engineConfig": {
> > > > > > > > > > "cacheSizeGB": 5
> > > > > > > > > > }
> > > > > > > > > > }
> > > > > > > > > > },
> > > > > > > > > > "systemLog": {
> > > > > > > > > > "destination": "file",
> > > > > > > > > > "logAppend": true,
> > > > > > > > > > "logRotate": "rename",
> > > > > > > > > > "path": "/data/db/mongod.log",
> > > > > > > > > > "verbosity": 0
> > > > > > > > > > }
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > The test environment has 32G of memory and 16 cores.
> > > > > > > > > >
> > > > > > > > > > Per my analysis, the access pattern of the mongodb test is that pages
> > > > > > > > > > will be re-accessed long after they are evicted, so the PID controller
> > > > > > > > > > won't protect the higher tiers. That RFC makes use of the long-existing
> > > > > > > > > > shadows to feed back into the PID controller/generations, so the result
> > > > > > > > > > will be much better. Still needs more adjusting though; I will try to do
> > > > > > > > > > a rebase on top of mm-unstable, which includes your patch.
> > > > > > > > > >
> > > > > > > > > > I've no idea why workingset_refault_* is higher in the better case;
> > > > > > > > > > this is clearly an IO-bound workload, and memory and IO are busy while
> > > > > > > > > > the CPU is not full...
> > > > > > > > > >
> > > > > > > > > > I've uploaded my local reproducer here:
> > > > > > > > > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > > > > > > > > https://github.com/ryncsn/py-tpcc
> > > > > > > > >
> > > > > > > > > Thanks for the repos -- I'm trying them right now. Which MongoDB
> > > > > > > > > version did you use? setup.sh didn't seem to install it.
> > > > > > > > >
> > > > > > > > > Also do you have a QEMU image? It'd be a lot easier for me to
> > > > > > > > > duplicate the exact environment by looking into it.
> > > > > > > >
> > > > > > > > I ended up using docker.io/mongodb/mongodb-community-server:latest,
> > > > > > > > and it's not working:
> > > > > > > >
> > > > > > > > # docker exec -it mongo-r1 mongosh --eval \
> > > > > > > > '"rs.initiate({
> > > > > > > > _id: "issa-tpcc_0",
> > > > > > > > members: [
> > > > > > > > {_id: 0, host: "mongo-r1"},
> > > > > > > > {_id: 1, host: "mongo-r2"},
> > > > > > > > {_id: 2, host: "mongo-r3"}
> > > > > > > > ]
> > > > > > > > })"'
> > > > > > > > Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
> > > > > > > > Error: can only create exec sessions on running containers: container
> > > > > > > > state improper
> > > > > > >
> > > > > > > Hi Yu,
> > > > > > >
> > > > > > > I've updated the test repo:
> > > > > > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > > > > >
> > > > > > > I've tested it on top of the latest Fedora Cloud Image 39 and it worked
> > > > > > > well for me; the README now contains detailed and easy-to-follow
> > > > > > > steps to reproduce this test.
> > > > > >
> > > > > > Thanks. I was following the instructions down to the letter and it
> > > > > > fell apart again at line 46 (./tpcc.py).
> > > > >
> > > > > I think you just broke it by
> > > > > https://github.com/ryncsn/py-tpcc/commit/7b9b380d636cb84faa5b11b5562e531f924eeb7e
> > > > >
> > > > > (But it's also possible you actually wanted me to use this latest
> > > > > commit but forgot to account for it in your instructions.)
> > > > >
> > > > > > Were you able to successfully run the benchmark on a fresh VM by
> > > > > > following the instructions? If not, I'd appreciate it if you could do
> > > > > > so and document all the missing steps.
> > > >
> > > > Ah, you are right, I attempted to convert it to Python3 but found it
> > > > only brought more trouble, so I gave up and the instructions are still
> > > > using Python2. However I accidentally pushed the WIP Python3 conversion
> > > > commit... I've reset the repo to
> > > > https://github.com/ryncsn/py-tpcc/commit/86e862c5cf3b2d1f51e0297742fa837c7a99ebf8,
> > > > which is working well. Sorry for the inconvenience.
> > >
> > > Thanks -- I was able to reproduce results similar to yours.
> > >
> >
> > Hi Yu,
> >
> > Thanks for the testing, and merry xmas.
> >
> > > It turned out the mystery (fewer refaults but worse performance) was caused by
> > > 13.89% 13.89% kswapd0 [kernel.vmlinux] [k]
> > > __list_del_entry_valid_or_report
> >
> > I'm not sure about this; if the task were CPU bound, this could
> > explain it. But it's not, and the performance gap is larger when tested
> > on a slow IO device.
> >
> > The iostat output during my test run:
> > avg-cpu: %user %nice %system %iowait %steal %idle
> > 7.40 0.00 2.42 83.37 0.00 6.80
> > Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s
> > %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
> > vda 35.00 0.80 167.60 17.20 6.90 3.50
> > 16.47 81.40 0.47 1.62 0.02 4.79 21.50 0.63 2.27
> > vdb 5999.30 4.80 104433.60 84.00 0.00 8.30
> > 0.00 63.36 6.54 1.31 39.25 17.41 17.50 0.17 100.00
> > zram0 0.00 0.00 0.00 0.00 0.00 0.00
> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>
> I ran the benchmark on the slowest bare metal I have that roughly
> matches your CPU/DRAM configurations (ThinkPad P1 G4
> https://support.lenovo.com/us/en/solutions/pd031426).
>
> But it seems you used a VM (vda/vdb) -- I never run performance
> benchmarks in VMs because the host and hypervisor can complicate
> things, for example, in this case, is it possible the host page cache
> cached your disk image containing the database files?
>
> > You can see the CPU is waiting for IO, %user is always around 10%.
> > The hotspot you posted only takes up 13.89% of the runtime, which
> > shouldn't cause such a large performance drop.
> >
> > >
> > > Apparently Fedora has CONFIG_DEBUG_LIST=y by default, and after I
> > > turned it off (the only change I made), this series showed better TPS
> > > (I used "--duration=10800" for more reliable results):
> > > v6.7-rc6 RFC [1] change
> > > total txns 25024 24672 +1%
> > > workingset_refault_anon 573668 680248 -16%
> > > workingset_refault_file 260631976 265808452 -2%
> >
> > I have disabled CONFIG_DEBUG_LIST when doing performance comparison test.

Also I'd suggest we both use the same distro you shared with me and
the default .config except CONFIG_DEBUG_LIST=n, and v6.7-rc6 for now.

(I'm attaching the default .config based on /boot/config-6.5.6-300.fc39.x86_64.)

> > I believe you are using a higher-performance SSD, so the bottleneck is
> > the CPU, and the RFC involves more lru/memcg counter updates/iteration,
> > so it is slower by 1%.
> >
> > > I think this is easy to explain: this series is "lazy", i.e.,
> > > deferring the protection to eviction time, whereas your RFC tries to
> > > do it upfront, i.e., at (re)fault time. The advantage of the former is
> > > that it has more up-to-date information, because a folio being hot
> > > when it's faulted in doesn't mean it's still hot later when memory
> > > pressure kicks in. The disadvantage is that it needs to protect folios
> > > that are still hot at eviction time by moving them to a younger
> > > generation, which is where the slowdown happened with CONFIG_DEBUG_LIST=y.
> > >
> > > (It's not really a priority for me to investigate why
> > > __list_del_entry_valid_or_report() is so heavy. Hopefully someone else
> > > can shed some light on it.)
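
(For context on why that symbol can dominate a kswapd profile: with
CONFIG_DEBUG_LIST=y, every list unlink first validates the neighbouring
pointers, so each folio moved to a younger generation pays extra, often
cache-missing, loads and branches. Below is a simplified userspace
sketch of that check, loosely modelled on lib/list_debug.c rather than
being the exact kernel code.)

#include <stdbool.h>
#include <stdio.h>

struct list_head {
        struct list_head *next, *prev;
};

/* simplified stand-in for the debug-build validation before an unlink */
static bool sketch_del_entry_valid(struct list_head *entry)
{
        if (entry->prev->next != entry || entry->next->prev != entry) {
                fprintf(stderr, "list_del corruption on %p\n", (void *)entry);
                return false;
        }
        return true;
}

static void sketch_list_del(struct list_head *entry)
{
        /* two extra pointer dereferences (often cache misses) per unlink */
        if (!sketch_del_entry_valid(entry))
                return;
        entry->next->prev = entry->prev;
        entry->prev->next = entry->next;
}

int main(void)
{
        struct list_head head = { &head, &head };
        struct list_head node;

        /* insert node right after head, then unlink it again */
        node.next = head.next;
        node.prev = &head;
        head.next->prev = &node;
        head.next = &node;

        sketch_list_del(&node);
        return 0;
}
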
> >
> > I've just set up another cluster with a high-performance SSD, where the
> > CPU is now the bottleneck, to better understand this. Will try to do
> > more tests to see if I can find out something.
>
> I'd suggest we both stick to bare metal until we can reconcile our
> test results. Otherwise, there'd be too many moving parts for us to
> get to the bottom of this.


Attachments:
config (261.79 kB)

2024-01-10 19:17:41

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Yu Zhao <[email protected]> wrote on Tue, Dec 26, 2023 at 06:01:
>
> Also I'd suggest we both use the same distro you shared with me and
> the default .config except CONFIG_DEBUG_LIST=n, and v6.7-rc6 for now.
>
> (I'm attaching the default .config based on /boot/config-6.5.6-300.fc39.x86_64.)
>

Hi Yu

I've been adapting and testing the refault distance series based on the
latest 6.7. Also I found a serious bug in my previous V3, so I updated
it here with some important changes (using a separate refault
distance model, instead of gluing it to the active/inactive model):
https://github.com/ryncsn/linux/commits/kasong/devel/refault-distance-2024-1/
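
(For readers following along, here is a minimal userspace sketch of the
refault-distance idea the series is built around -- hypothetical names,
and far simpler than the real logic in mm/workingset.c and the branch
above: on eviction a folio leaves behind a shadow recording the current
eviction "clock"; on refault, the distance between then and now decides
whether the folio still counts as workingset and deserves a younger
generation.)

#include <stdbool.h>
#include <stdio.h>

static unsigned long eviction_clock;    /* advances once per eviction */

struct shadow {
        unsigned long evicted_at;       /* clock snapshot at eviction */
};

/* on eviction, remember "when" the folio left memory */
static struct shadow make_shadow(void)
{
        struct shadow s = { .evicted_at = eviction_clock++ };

        return s;
}

/*
 * On refault, the number of evictions since this folio was evicted
 * approximates its re-access distance: if it is smaller than the number
 * of folios that could have been kept, the folio is workingset and
 * deserves a younger generation; otherwise the oldest one is fine.
 */
static bool refault_is_workingset(struct shadow s, unsigned long nr_protected)
{
        unsigned long distance = eviction_clock - s.evicted_at;

        return distance < nr_protected;
}

int main(void)
{
        struct shadow early = make_shadow();
        unsigned long i;

        for (i = 0; i < 1000; i++)      /* 1000 more evictions go by */
                make_shadow();

        printf("workingset: %d\n", refault_is_workingset(early, 4096));
        return 0;
}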

So far I can conclude that the previous result was not caused by the host
cache. I set up a bare metal test environment, strictly using your
config, and gathered some data (I also updated the refault distance
patch series, updated version in the link above; also the bare metal
machine has a fast NVMe so the performance gap wasn't as large, but it is
still stably observable):

With latest 6.7 (Your config):
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 4025 27162035181.5 0.15 txn/s
------------------------------------------------------------------
TOTAL 4025 27162035181.5 0.15 txn/s

vmstat:
workingset_nodes 82996
workingset_refault_anon 269371
workingset_refault_file 37671044
workingset_activate_anon 121061
workingset_activate_file 8927227
workingset_restore_anon 121061
workingset_restore_file 2578269
workingset_nodereclaim 62394

lru_gen_full:
memcg 67 /machine.slice/libpod-38b33777db34724cf8edfbef1ac2e4fd0621f14151e241bbf1430d397d3dee51.scope/container
node 0
34 60565 21248 1254331
0 0r 0e 0p 121186r
169948e 0p
1 0r 0e 0p 156224r
222553e 0p
2 0r 0e 0p 0r
0e 4227858p
3 0r 0e 0p 0r
0e 0p
0 0 0 0
0 0
35 41132 714504 4300280
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
36 20586 473476 2105
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
37 2035 817 876
0 6647R 9116T 0 166836R
871850T 0
1 0R 0T 0 110807R
296447T 0
2 0R 268T 0 0R
4655276T 0
3 0R 0T 0 0R
0T 0
12510062L 639646O 11048666Y 45512N
24520F 23613A
iostat:
avg-cpu: %user %nice %system %iowait %steal %idle
76.29 0.00 12.09 3.50 1.44 6.69

Device tps kB_read/s kB_wrtn/s kB_dscd/s
kB_read kB_wrtn kB_dscd
dm-0 16.12 684.50 36.93 0.00
649996 35070 0
dm-1 0.05 1.10 0.00 0.00
1044 0 0
nvme0n1 16.47 700.22 39.09 0.00 664922
37118 0
nvme1n1 4905.93 205287.92 1030.70 0.00
194939353 978740 0
zram0 4440.17 5356.90 12404.81 0.00
5086856 11779480 0

free -m:
total used free shared buff/cache available
Mem: 31830 9475 403 0 21951 21918
Swap: 31829 6500 25329

With latest refault distance series (Your config):
==================================================================
Execution Results after 902 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 4260 27065448172.8 0.16 txn/s
------------------------------------------------------------------
TOTAL 4260 27065448172.8 0.16 txn/s

workingset_nodes 113824
workingset_refault_anon 293426
workingset_refault_file 42700484
workingset_activate_anon 0
workingset_activate_file 13410174
workingset_restore_anon 289106
workingset_restore_file 5592042
workingset_nodereclaim 33249

memcg 67 /machine.slice/libpod-8eff6b7b65e34fe0497ff5c0c88c750f6896c43a06bb26e8cd6470de596be76e.scope/container
node 0
15 261222 266350 65081
0 0r 0e 0p 185212r
2314329e 0p
1 0r 0e 0p 40887r
710312e 0p
2 0r 0e 0p 0r
0e 5026031p
3 0r 0e 0p 0r
0e 0p
0 0 0 0
0 0
16 199341 267661 5034442
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
17 120655 547852 592
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
18 55172 127769 3855
0 1910R 2975T 0 1894614R
4375361T 0
1 0R 0T 0 2099208R
2861460T 0
2 0R 27T 0 446000R
5569781T 0
3 0R 0T 0 0R
0T 0
2817512L 35421O 2579377Y 10452N
5517F 5414A

avg-cpu: %user %nice %system %iowait %steal %idle
76.34 0.00 11.25 4.22 1.29 6.90

Device tps kB_read/s kB_wrtn/s kB_dscd/s
kB_read kB_wrtn kB_dscd
dm-0 12.85 563.18 30.75 0.00
532390 29070 0
dm-1 0.05 1.10 0.00 0.00
1044 0 0
nvme0n1 13.22 578.97 32.92 0.00
547315 31119 0
nvme1n1 5384.11 229164.12 1038.95 0.00
216635713 982152 0
zram0 3590.88 4730.84 9633.71 0.00
4472204 9107032 0

total used free shared buff/cache available
Mem: 31830 10854 520 0 20455 20541
Swap: 31829 4508 27321

You can see that refault distance is actually protecting more anon pages
now, total IO on ZRAM is lower, it's mostly CPU bound, and the NVMe is
fast enough, resulting in better performance.

Things get more interesting if I disable the page idle flag (so the refs
bits are extended; in your config, the refs field is only one bit, so it
may overprotect file pages):

Latest 6.7 (your config with page idle flag disabled):
==================================================================
Execution Results after 904 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 4016 27122163703.9 0.15 txn/s
------------------------------------------------------------------
TOTAL 4016 27122163703.9 0.15 txn/s

workingset_nodes 99637
workingset_refault_anon 309548
workingset_refault_file 45896663
workingset_activate_anon 129401
workingset_activate_file 18461236
workingset_restore_anon 129400
workingset_restore_file 4963707
workingset_nodereclaim 43970

memcg 67 /machine.slice/libpod-7546463bd2b257a9b799817ca11bee1389d7deec20032529098520a89a207d7e.scope/container
node 0
27 103004 328070 269733
0 0r 0e 0p 509949r
1957117e 0p
1 0r 0e 0p 141642r
319695e 0p
2 0r 0e 0p 777835r
793518e 0p
3 0r 0e 0p 0r
0e 4333835p
0 0 0 0
0 0
28 82361 24748 5192182
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
29 57025 786386 5681
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
30 18619 76289 1273
0 4295R 8888T 0 222326R
1044601T 0
1 0R 0T 0 117646R
301735T 0
2 0R 0T 0 433431R
825516T 0
3 0R 1T 0 0R
4076839T 0
13369819L 603360O 11981074Y 47388N
26235F 25276A

avg-cpu: %user %nice %system %iowait %steal %idle
74.90 0.00 11.96 4.92 1.62 6.60

Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
dm-0 14.93 645.44 36.54 0.00 610150 34540 0
dm-1 0.05 1.10 0.00 0.00 1044 0 0
nvme0n1 15.30 661.23 38.71 0.00 625076 36589 0
nvme1n1 6352.42 240726.35 1036.47 0.00 227565845 979808 0
zram0 4189.65 4883.27 11876.36 0.00 4616304 11227080 0

total used free shared buff/cache available
Mem: 31830 9529 509 0 21791 21867
Swap: 31829 6441 25388

Refault distance series (your config with page idle flag disabled):
==================================================================
Execution Results after 901 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 4268 27060267967.7 0.16 txn/s
------------------------------------------------------------------
TOTAL 4268 27060267967.7 0.16 txn/s

workingset_nodes 115394
workingset_refault_anon 144774
workingset_refault_file 41055081
workingset_activate_anon 8
workingset_activate_file 13194460
workingset_restore_anon 144629
workingset_restore_file 187419
workingset_nodereclaim 19598

memcg 66 /machine.slice/libpod-4866051af817731602b37017b0e71feb2a8f2cbaa949f577e0444af01b4f3b0c.scope/container
node 0
12 213402 18054 1287510
0 0r 0e 0p 0r
15755e 0p
1 0r 0e 0p 0r
4771e 0p
2 0r 0e 0p 908r
6810e 0p
3 0r 0e 0p 0r
0e 3533888p
0 0 0 0
0 0
13 141209 10690 3571958
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
14 69327 1165064 34657
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
15 6404 21574 3363
0 953R 1773T 0 1263395R
3816639T 0
1 0R 0T 0 1164069R
1936973T 0
2 0R 0T 0 350041R
409121T 0
3 0R 3T 0 12305R
4767303T 0
3622197L 36916O 3338446Y 10409N
7120F 6945A

avg-cpu: %user %nice %system %iowait %steal %idle
75.79 0.00 10.68 3.91 1.18 8.44

Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
dm-0 12.66 547.71 38.73 0.00 526782 37248 0
dm-1 0.05 1.09 0.00 0.00 1044 0 0
nvme0n1 13.02 563.23 40.86 0.00 541708 39297 0
nvme1n1 4916.00 217529.48 1018.04 0.00 209217677 979136 0
zram0 1744.90 1970.86 5009.75 0.00 1895556 4818328 0

total used free shared buff/cache available
Mem: 31830 11713 485 0 19630 19684
Swap: 31829 2847 28982

The refault distance series is still better, and refaults are also
lower for both anon and file pages.

------
I did some more tests using MySQL and other workloads; no performance
drop observed so far.
And with a looped MongoDB test (keeping the 900s test running in a
loop) using my previous VM env (the SATA SSD vdb uses cache bypass, so
this is not a host cache issue here), I found one thing interesting
(refs is set to 2 bits in the config):

Loop test using 6.7:
STOCK_LEVEL 874 27246011368.3 0.03 txn/s
STOCK_LEVEL 1533 27023181579.6 0.06 txn/s
STOCK_LEVEL 1122 28044867987.6 0.04 txn/s
STOCK_LEVEL 1032 27378070931.9 0.04 txn/s
STOCK_LEVEL 1021 27612530579.1 0.04 txn/s
STOCK_LEVEL 750 28076187896.3 0.03 txn/s
STOCK_LEVEL 780 27519993034.8 0.03 txn/s
Refault stat here:
workingset_refault_anon 126369
workingset_refault_file 170389428
STOCK_LEVEL 750 27464016123.5 0.03 txn/s
STOCK_LEVEL 780 27529550313.0 0.03 txn/s
STOCK_LEVEL 750 28296286486.1 0.03 txn/s
STOCK_LEVEL 690 27504193850.3 0.03 txn/s
STOCK_LEVEL 716 28089360754.5 0.03 txn/s
STOCK_LEVEL 607 27852180474.3 0.02 txn/s
STOCK_LEVEL 689 27703367075.4 0.02 txn/s
STOCK_LEVEL 630 28184685482.7 0.02 txn/s
STOCK_LEVEL 450 28667721196.2 0.02 txn/s
STOCK_LEVEL 450 28047985314.4 0.02 txn/s
STOCK_LEVEL 450 28125609857.3 0.02 txn/s
STOCK_LEVEL 420 27393478488.0 0.02 txn/s
STOCK_LEVEL 420 27435537312.3 0.02 txn/s
STOCK_LEVEL 420 29060748699.2 0.01 txn/s
STOCK_LEVEL 420 28155584095.2 0.01 txn/s
STOCK_LEVEL 420 27888635407.0 0.02 txn/s
STOCK_LEVEL 420 27307856858.5 0.02 txn/s
STOCK_LEVEL 420 28842280889.0 0.01 txn/s
STOCK_LEVEL 390 27640696814.1 0.01 txn/s
STOCK_LEVEL 420 28471605716.7 0.01 txn/s
STOCK_LEVEL 420 27648174237.5 0.02 txn/s
STOCK_LEVEL 420 27848217938.7 0.02 txn/s
STOCK_LEVEL 420 27344698602.2 0.02 txn/s
STOCK_LEVEL 420 27046819537.2 0.02 txn/s
STOCK_LEVEL 420 27855626843.2 0.02 txn/s
STOCK_LEVEL 420 27971873627.9 0.02 txn/s
STOCK_LEVEL 420 28007014046.4 0.01 txn/s
STOCK_LEVEL 420 28445164626.1 0.01 txn/s
STOCK_LEVEL 420 27902621006.5 0.02 txn/s
STOCK_LEVEL 420 28282574433.3 0.01 txn/s
STOCK_LEVEL 390 27161599608.7 0.01 txn/s

Using the refault distance series:
STOCK_LEVEL 2605 27120667462.8 0.10 txn/s
STOCK_LEVEL 3000 27106854857.2 0.11 txn/s
STOCK_LEVEL 2925 27066601064.4 0.11 txn/s
STOCK_LEVEL 2757 27035248005.2 0.10 txn/s
STOCK_LEVEL 1325 28053716046.8 0.05 txn/s
STOCK_LEVEL 717 27455091366.3 0.03 txn/s
STOCK_LEVEL 967 27404085208.2 0.04 txn/s
Refault stat here:
workingset_refault_anon 109337
workingset_refault_file 191249716
STOCK_LEVEL 716 27448213557.2 0.03 txn/s
STOCK_LEVEL 807 28607974517.8 0.03 txn/s
STOCK_LEVEL 760 28081442513.2 0.03 txn/s
STOCK_LEVEL 745 28594555797.6 0.03 txn/s
STOCK_LEVEL 450 27999536348.3 0.02 txn/s
STOCK_LEVEL 598 27095531895.4 0.02 txn/s
STOCK_LEVEL 711 27623112841.1 0.03 txn/s
STOCK_LEVEL 540 28358770820.6 0.02 txn/s
STOCK_LEVEL 480 27734277554.5 0.02 txn/s
STOCK_LEVEL 450 27313906125.3 0.02 txn/s
STOCK_LEVEL 480 27487299100.4 0.02 txn/s
STOCK_LEVEL 480 27804589683.5 0.02 txn/s
STOCK_LEVEL 480 28242205820.8 0.02 txn/s
STOCK_LEVEL 480 27540680102.3 0.02 txn/s
STOCK_LEVEL 450 27428645816.8 0.02 txn/s
STOCK_LEVEL 480 27946866129.2 0.02 txn/s
STOCK_LEVEL 480 27266068262.3 0.02 txn/s
STOCK_LEVEL 450 27267487051.5 0.02 txn/s
STOCK_LEVEL 480 27896369224.8 0.02 txn/s
STOCK_LEVEL 480 28784662706.1 0.02 txn/s
STOCK_LEVEL 450 27179853217.8 0.02 txn/s
STOCK_LEVEL 480 28170594101.7 0.02 txn/s
STOCK_LEVEL 450 28084651341.0 0.02 txn/s
STOCK_LEVEL 480 27901608868.6 0.02 txn/s
STOCK_LEVEL 480 27323790886.6 0.02 txn/s
STOCK_LEVEL 480 28891008895.4 0.02 txn/s
STOCK_LEVEL 480 27964563148.0 0.02 txn/s
STOCK_LEVEL 450 27942421198.4 0.02 txn/s
STOCK_LEVEL 480 28833968825.8 0.02 txn/s
STOCK_LEVEL 480 28090975437.9 0.02 txn/s
STOCK_LEVEL 480 27915246877.4 0.02 txn/s

It seems the performance drains as the test keeps running (possibly
caused by MongoDB's anon usage rising, or by the DB's internal
caching/logging), which explains why the performance gap looks smaller
in a long-term test. The VM also has poor IO performance, so the test
runs much more slowly and takes a long time to warm up.

But I think it's clear that the refault distance series boosts
performance during warm-up, and it also looks better for long-running
workloads, especially on machines with low IO performance.

I still can't explain why workingset_refault is higher for the
better-performing case in the VM environment... I can re-setup, reboot
or randomize the test and the performance stays the same here. My
guess is that it may be related to readahead or some kernel-space IO
path issue; the actual IO usage is lower when the refault distance
series is applied.

I noticed a slight performance regression (1 - 3%) for pure in-memory
FIO though; the "bulk series" I sent previously can help improve it.

There is a bug in my previous V3 that causes the PID controller to
lose control over the long term (due to a buggy bit operation, my
bad), which I've fixed in the link above. I can send out a new series
if you think it's acceptable.

2024-01-11 18:25:01

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Yu Zhao <[email protected]> 于2024年1月11日周四 15:02写道:
> Could you try the attached patch on the mainline v6.7 and see how it
> compares with the results above? Thanks.

Hi Yu,

Thanks for the patch. It helped to some degree, but is not as
effective. On that exclusive bare metal machine, I redid the setup,
rebased on 6.7 mainline and reran the test:

Refault distance series:
==================================================================
Execution Results after 901 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 4224 27030724835.9 0.16 txn/s
------------------------------------------------------------------
TOTAL 4224 27030724835.9 0.16 txn/s

workingset_nodes 111349
workingset_refault_anon 261331
workingset_refault_file 42862224
workingset_activate_anon 0
workingset_activate_file 13803763
workingset_restore_anon 250743
workingset_restore_file 599031
workingset_nodereclaim 23708

memcg 67 /machine.slice/libpod-edbf5a3cb2574c60180c1fb5ddb2fb160df00bcee3758b7649f2b31baa97ed78.scope/container
node 0
10 347163 518379 207449
0 0r 2e 0p 33017r
1726749e 0p
1 0r 0e 0p 7278r
496268e 0p
2 0r 0e 0p 19789r
55418e 0p
3 0r 0e 0p 0r
0e 4747801p
0 0 0 0
0 0
11 283279 154400 4791558
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
12 158723 431513 37647
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
13 44775 104986 27258
0 576R 982T 0 2488768R
5769505T 0
1 0R 0T 0 2335910R
3357277T 0
2 0R 0T 0 647398R
753021T 0
3 0R 20T 0 52725R
4740516T 0
2819476L 31196O 2551928Y 8298N
5549F 5329A

Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
dm-0 12.81 546.32 39.04 0.00 520178 37171 0
dm-1 0.05 1.10 0.00 0.00 1044 0 0
nvme0n1 13.17 561.99 41.19 0.00 535103 39219 0
nvme1n1 5220.39 227385.96 1028.17 0.00 216505545 978976 0
zram0 2440.61 2856.32 6907.13 0.00 2719644 6576628 0

total used free shared buff/cache available
Mem: 31830 11251 332 0 20246 20144
Swap: 31829 3761 28068

Your attachment:
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 4070 27170023578.4 0.15 txn/s
------------------------------------------------------------------
TOTAL 4070 27170023578.4 0.15 txn/s

workingset_nodes 121864
workingset_refault_anon 430917
workingset_refault_file 42915675
workingset_activate_anon 100194
workingset_activate_file 21619480
workingset_restore_anon 100194
workingset_restore_file 165054
workingset_nodereclaim 26851

memcg 65 /machine.slice/libpod-c6d8c5fedb9b390ec7f1db7d0d7c57d6a284a94e74a3923d93ea0ce4e4ffdf28.scope/container
node 0
8 418689 55033 106862
0 16r 17e 0p 2789768r
6034831e 0p
1 0r 0e 0p 239664r
490278e 0p
2 0r 0e 0p 79145r
126408e 0p
3 23r 23e 0p 23404r
27107e 4736933p
0 0 0 0
0 0
9 322798 237713 4759110
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
10 182729 942701 5348
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
11 120287 560 375
0 25187R 29324T 0 1679308R
4256147T 0
1 0R 0T 0 153592R
364122T 0
2 0R 0T 0 51825R
98646T 0
3 101R 2944T 0 13985R
4743515T 0
7702245L 865749O 6514831Y 16843N
15088F 14167A

Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
dm-0 11.49 489.97 41.80 0.00 488006 41633 0
dm-1 0.05 1.05 0.00 0.00 1044 0 0
nvme0n1 11.83 504.95 43.86 0.00 502932 43682 0
nvme0n1 5145.44 218803.29 984.46 0.00 217928081 980520 0
zram0 3164.11 4399.55 8257.84 0.00 4381952 8224812 0

total used free shared buff/cache available
Mem: 31830 11583 310 1 19935 19809
Swap: 31829 3710 28119

The refault distance series still has better performance and lower
total IO.

Similar result on that VM:
==================================================================
Execution Results after 907 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 1667 27151581934.5 0.06 txn/s
------------------------------------------------------------------
TOTAL 1667 27151581934.5 0.06 txn/s

While the refault distance series had about 2500 - 2600 txns, mainline
6.7 had about 800 - 900 txns.

Loop test so far:
Using the refault distance series (previous result; it doesn't change
much anyway):
STOCK_LEVEL 2605 27120667462.8 0.10 txn/s
STOCK_LEVEL 3000 27106854857.2 0.11 txn/s
STOCK_LEVEL 2925 27066601064.4 0.11 txn/s
STOCK_LEVEL 2757 27035248005.2 0.10 txn/s
STOCK_LEVEL 1325 28053716046.8 0.05 txn/s
STOCK_LEVEL 717 27455091366.3 0.03 txn/s
STOCK_LEVEL 967 27404085208.2 0.04 txn/s
Refault stat here:
workingset_refault_anon 109337
workingset_refault_file 191249716

Using the attached patch:
STOCK_LEVEL 1667 27151581934.5 0.06 txn/s
STOCK_LEVEL 2999 27085125092.3 0.11 txn/s
STOCK_LEVEL 2874 27120635371.2 0.11 txn/s
STOCK_LEVEL 2658 27139142413.9 0.10 txn/s
STOCK_LEVEL 1254 27526009063.7 0.05 txn/s
STOCK_LEVEL 993 28065506801.8 0.04 txn/s
STOCK_LEVEL 954 27226012906.3 0.04 txn/s
Refault stat here:
workingset_refault_anon 383579
workingset_refault_file 205493832

The peak performance is almost equal, but it still starts slowly, and
refaults are higher too. File refaults might be skewed by some
IO-layer issue, but anon refaults are always accurate.

I see the improvements you made in the attached patch; I think they
actually do not conflict with the refault distance series. Maybe they
can be combined for an even better result.

Refault distance (originally used by the active/inactive LRU) is used
here to prioritize evicted pages based on their eviction distance and
to add extra feedback to the PID controller and generation selection.
Meanwhile, the PID info recorded in the page flags/shadow represents a
page's access pattern before eviction, and all the checks and logic
around it can also be improved.
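
For reference, here is a minimal standalone sketch of the classic
refault distance check, loosely modeled on mm/workingset.c; the
constants and the shadow packing below are simplified illustrations,
not the kernel's exact definitions:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Simplified illustration: the shadow entry left behind at eviction
 * stores the LRU "age" (an eviction counter). On refault, the distance
 * between then and now tells whether the LRU could have kept the folio
 * had it been larger. The kernel also packs memcg/node/workingset bits
 * into the shadow; that is omitted here.
 */
#define EVICTION_BITS  40
#define EVICTION_MASK  ((1ULL << EVICTION_BITS) - 1)

static uint64_t pack_shadow(uint64_t eviction)
{
        return eviction & EVICTION_MASK;
}

static bool refault_in_workingset(uint64_t shadow, uint64_t now,
                                  uint64_t workingset_size)
{
        uint64_t eviction = shadow & EVICTION_MASK;
        uint64_t distance = (now - eviction) & EVICTION_MASK;

        /* Short distance: recently evicted, worth protecting. */
        return distance <= workingset_size;
}

int main(void)
{
        uint64_t shadow = pack_shadow(1000);

        printf("refault in working set: %d\n",
               refault_in_workingset(shadow, 1500, 2048));
        return 0;
}

The series, as described above, feeds a comparable distance signal into
MGLRU's PID controller and generation placement instead of a binary
activation decision.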

One critical effect of the refault distance series that boosts the
MongoDB startup (and I haven't seen any negative effect of it on other
tests / workloads / benchmarks yet, except the overhead of the memcg
statistics themselves) is that it prevents overprotection of tier 0
pages: a tier 0 page that is evicted but refaults very quickly
(refault distance < LRU size / MAX_NR_GENS; this value may be worth
further adjustment, but with LRU size / MAX_NR_GENS it can be imagined
as a small shadow generation holding these page shadows...) will be
categorised as tier 1 and protected. Otherwise, if I got everything
right, when most pages are stuck in tier 0 and keep refaulting, tier 0
has a very high refault rate and no pages are protected, until
randomness causes quick repeated reads of some pages, so they get
promoted to tier 3 and protected.
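
Roughly, that policy can be sketched like this (illustrative only, not
the actual patch; lru_size and the plain integer tier are
simplifications made for the sketch):

/*
 * Illustrative-only: a tier 0 folio that refaults with a short
 * distance relative to the LRU size is treated as tier 1 so it can be
 * protected; everything else keeps its tier.
 */
#define MAX_NR_GENS     4

static inline int tier_after_quick_refault(unsigned long refault_distance,
                                           unsigned long lru_size, int tier)
{
        if (tier == 0 && refault_distance < lru_size / MAX_NR_GENS)
                return 1;       /* quick refault: lift out of tier 0 */

        return tier;
}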

Now min_seq contains the lower-tier pages, and new pages are added to
min_seq too, so min_seq stays around for a long time, while min_seq + 1
holds the protected, full-refs tier 3 pages; those stay long enough to
get promoted to tier 3 again, so they are always kept in memory. At
that point MongoDB performs well even without the refault distance
series, but reaching it may take a long time (~15 min for the MongoDB
test on the SATA SSD, which is based on a real workload), long enough
to cause a real issue.

And this also means the PID controller won't react to workload changes
fast enough.

The series also adjusts the refs value of anon refaults based on
refault distance; it tries to split the whole LRU into at least two
generations for refaulted pages (only a page with refault distance <
LRU size / MIN_NR_GENS gets the full refs set; otherwise refs - 1 is
set as a penalty for a page that was evicted and unused for a long
time, which complies with the LRU's nature). This actually seems to
decrease anon page refaults.
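
A rough sketch of that refs adjustment (again illustrative only; the
function name and lru_size are hypothetical, MIN_NR_GENS matches the
kernel's value of 2):

/*
 * Illustrative-only: a short refault distance keeps the full refs
 * value, a long one pays a one-step refs penalty for having been
 * evicted and unused for a long time.
 */
#define MIN_NR_GENS     2

static inline int refs_after_anon_refault(unsigned long refault_distance,
                                          unsigned long lru_size, int refs)
{
        if (refault_distance < lru_size / MIN_NR_GENS)
                return refs;                    /* keep full refs */

        return refs > 0 ? refs - 1 : 0;         /* penalty */
}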

There are some other issues that the refault distance series is trying
to solve too, e.g. if a user agent forces MGLRU to age periodically
for proactive memory reclaim, or MGLRU simply ages quickly, min_seq
grows periodically and the PID controller doesn't get enough feedback
with the previous logic.
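
For context, such a user agent usually just writes the aging command to
the lru_gen debugfs interface; a minimal sketch (the command format
follows Documentation/admin-guide/mm/multigen_lru.rst, and the IDs
below are placeholders a real agent would read back from
/sys/kernel/debug/lru_gen first):

#include <stdio.h>

/* Force MGLRU aging: "+ memcg_id node_id max_gen [can_swap [force_scan]]" */
int main(void)
{
        FILE *f = fopen("/sys/kernel/debug/lru_gen", "w");

        if (!f)
                return 1;

        fprintf(f, "+ 67 0 37\n");      /* placeholder memcg/node/max_gen */
        return fclose(f) ? 1 : 0;
}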

2024-01-12 01:46:35

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Thu, Jan 11, 2024 at 11:24 AM Kairui Song <[email protected]> wrote:
>
> Yu Zhao <[email protected]> 于2024年1月11日周四 15:02写道:
> > Could you try the attached patch on the mainline v6.7 and see how it
> > compares with the results above? Thanks.
>
> Hi Yu,
>
> Thanks for the patch, it helped in some degrees, but not as effective:
> On that exclusive baremetal, I did a resetup, rebase on 6.7 mainline
> and reran the test:
>
> Refault distance series:
> ==================================================================
> Execution Results after 901 seconds
> ------------------------------------------------------------------
> Executed Time (µs) Rate
> STOCK_LEVEL 4224 27030724835.9 0.16 txn/s
> ------------------------------------------------------------------
> TOTAL 4224 27030724835.9 0.16 txn/s
>
> workingset_nodes 111349
> workingset_refault_anon 261331
> workingset_refault_file 42862224
> workingset_activate_anon 0
> workingset_activate_file 13803763
> workingset_restore_anon 250743
> workingset_restore_file 599031
> workingset_nodereclaim 23708
>
> memcg 67 /machine.slice/libpod-edbf5a3cb2574c60180c1fb5ddb2fb160df00bcee3758b7649f2b31baa97ed78.scope/container
> node 0
> 10 347163 518379 207449
> 0 0r 2e 0p 33017r
> 1726749e 0p
> 1 0r 0e 0p 7278r
> 496268e 0p
> 2 0r 0e 0p 19789r
> 55418e 0p
> 3 0r 0e 0p 0r
> 0e 4747801p
> 0 0 0 0
> 0 0
> 11 283279 154400 4791558
> 0 0 0 0 0
> 0 0
> 1 0 0 0 0
> 0 0
> 2 0 0 0 0
> 0 0
> 3 0 0 0 0
> 0 0
> 0 0 0 0
> 0 0
> 12 158723 431513 37647
> 0 0 0 0 0
> 0 0
> 1 0 0 0 0
> 0 0
> 2 0 0 0 0
> 0 0
> 3 0 0 0 0
> 0 0
> 0 0 0 0
> 0 0
> 13 44775 104986 27258
> 0 576R 982T 0 2488768R
> 5769505T 0
> 1 0R 0T 0 2335910R
> 3357277T 0
> 2 0R 0T 0 647398R
> 753021T 0
> 3 0R 20T 0 52725R
> 4740516T 0
> 2819476L 31196O 2551928Y 8298N
> 5549F 5329A
>
> Device tps kB_read/s kB_wrtn/s kB_dscd/s
> kB_read kB_wrtn kB_dscd
> dm-0 12.81 546.32 39.04 0.00
> 520178 37171 0
> dm-1 0.05 1.10 0.00 0.00
> 1044 0 0
> nvme0n1 13.17 561.99 41.19 0.00
> 535103 39219 0
> nvme1n1 5220.39 227385.96 1028.17 0.00
> 216505545 978976 0
> zram0 2440.61 2856.32 6907.13 0.00
> 2719644 6576628 0
>
> total used free shared buff/cache available
> Mem: 31830 11251 332 0 20246 20144
> Swap: 31829 3761 28068
>
> Your attachment:
> ==================================================================
> Execution Results after 905 seconds
> ------------------------------------------------------------------
> Executed Time (µs) Rate
> STOCK_LEVEL 4070 27170023578.4 0.15 txn/s
> ------------------------------------------------------------------
> TOTAL 4070 27170023578.4 0.15 txn/s
>
> workingset_nodes 121864
> workingset_refault_anon 430917
> workingset_refault_file 42915675
> workingset_activate_anon 100194
> workingset_activate_file 21619480
> workingset_restore_anon 100194
> workingset_restore_file 165054
> workingset_nodereclaim 26851
>
> memcg 65 /machine.slice/libpod-c6d8c5fedb9b390ec7f1db7d0d7c57d6a284a94e74a3923d93ea0ce4e4ffdf28.scope/container
> node 0
> 8 418689 55033 106862
> 0 16r 17e 0p 2789768r
> 6034831e 0p
> 1 0r 0e 0p 239664r
> 490278e 0p
> 2 0r 0e 0p 79145r
> 126408e 0p
> 3 23r 23e 0p 23404r
> 27107e 4736933p
> 0 0 0 0
> 0 0
> 9 322798 237713 4759110
> 0 0 0 0 0
> 0 0
> 1 0 0 0 0
> 0 0
> 2 0 0 0 0
> 0 0
> 3 0 0 0 0
> 0 0
> 0 0 0 0
> 0 0
> 10 182729 942701 5348
> 0 0 0 0 0
> 0 0
> 1 0 0 0 0
> 0 0
> 2 0 0 0 0
> 0 0
> 3 0 0 0 0
> 0 0
> 0 0 0 0
> 0 0
> 11 120287 560 375
> 0 25187R 29324T 0 1679308R
> 4256147T 0
> 1 0R 0T 0 153592R
> 364122T 0
> 2 0R 0T 0 51825R
> 98646T 0
> 3 101R 2944T 0 13985R
> 4743515T 0
> 7702245L 865749O 6514831Y 16843N
> 15088F 14167A
>
> Device tps kB_read/s kB_wrtn/s kB_dscd/s
> kB_read kB_wrtn kB_dscd
> dm-0 11.49 489.97 41.80 0.00
> 488006 41633 0
> dm-1 0.05 1.05 0.00 0.00
> 1044 0 0
> nvme0n1 11.83 504.95 43.86 0.00
> 502932 43682 0
> nvme0n1 5145.44 218803.29 984.46 0.00
> 217928081 980520 0
> zram0 3164.11 4399.55 8257.84 0.00
> 4381952 8224812 0
>
> total used free shared buff/cache available
> Mem: 31830 11583 310 1 19935 19809
> Swap: 31829 3710 28119
>
> Refault distance series still have a better performance and lower total IO.
>
> Similar result on that VM:
> ==================================================================
> Execution Results after 907 seconds
> ------------------------------------------------------------------
> Executed Time (µs) Rate
> STOCK_LEVEL 1667 27151581934.5 0.06 txn/s
> ------------------------------------------------------------------
> TOTAL 1667 27151581934.5 0.06 txn/s
>
> While refault distance series had about ~2500 - 2600 txns, mainline
> 6.7 had about ~800 - 900 txns.
>
> Loop test so far:
> Using refault distance seriese (previous result, it doesn't change much anyway):
> STOCK_LEVEL 2605 27120667462.8 0.10 txn/s
> STOCK_LEVEL 3000 27106854857.2 0.11 txn/s
> STOCK_LEVEL 2925 27066601064.4 0.11 txn/s
> STOCK_LEVEL 2757 27035248005.2 0.10 txn/s
> STOCK_LEVEL 1325 28053716046.8 0.05 txn/s
> STOCK_LEVEL 717 27455091366.3 0.03 txn/s
> STOCK_LEVEL 967 27404085208.2 0.04 txn/s
> Refault stat here:
> workingset_refault_anon 109337
> workingset_refault_file 191249716
>
> Using the attached patch:
> STOCK_LEVEL 1667 27151581934.5 0.06 txn/s
> STOCK_LEVEL 2999 27085125092.3 0.11 txn/s
> STOCK_LEVEL 2874 27120635371.2 0.11 txn/s
> STOCK_LEVEL 2658 27139142413.9 0.10 txn/s
> STOCK_LEVEL 1254 27526009063.7 0.05 txn/s
> STOCK_LEVEL 993 28065506801.8 0.04 txn/s
> STOCK_LEVEL 954 27226012906.3 0.04 txn/s
> Refault stat here:
> workingset_refault_anon 383579
> workingset_refault_file 205493832
>
> The peak performance almost equal, but still starts slow, refault is
> higher too. File refault might be interfered due to some IO layer
> issue, but anon refault is always accurate.
>
> I see the improvement you did in the attachment patch, I think
> actually they are not in conflict with the refault distance series.
> Maybe they can be combined into a even better result.
>
> Refault distance (which originally used by active/inactive LRU) is
> used here to give evicted pages priorities based on eviction distance
> and add extra feedback to PID and gen. While the PID info recorded in
> page flags/shadow represents pages's access pattern before eviction,
> and all the check and logics about it can also be improved.
>
> One critical effect of the refault distance series that boost the
> MongoDB startup (and I haven't see any negative effect of it on other
> test / workload / benchmark yet, except the overhead of memcg
> statistics itself) is it prevents overprotecting of tier 0 page: that
> is, a tier 0 page evicted but refaulted very quickly (refault distance
> < LRU / MAX_NR_GEN, this value may worth some more adjustment, but
> with LRU / MAX_NR_GEN, it can be imaged as an idea that having a small
> shadow gen holding these page shadows...) will be categorised as tier
> 1 and get protect. Other wise, if I got everything right, when most
> pages are stuck in tier 0 and keep refaulting, tier 0 will have a very
> high refault rate, and no pages will be protect, until randomness
> causes quick repeated read of some page, so they get promoted to tier
> 3 get get protected.
>
> Now min_seq contains lower tier pages and new pages will be added to
> min_seq too, so min_seq will stay for a long time, while min_seq + 1
> holds protected full ref tier 3 pages and they stay long enough to get
> promoted as tier 3 again, so they will always be kept in memory.
> Now MongoDB will perform well even without refault distance series,
> but this period may take a long time (~15 min for the MongoDB test for
> SATA SSD, which is based on a real workload), long enough to cause
> real issue.
>
> And this also means PID won't react to workload change fast enough.
>
> Also the anon refault's refs value is adjusted by refault distance too
> in the series, it tries to split the whole LRU as at least two gens
> for refaulted pages (only page with refault distance < LRU /
> MIN_NR_GEN will have full refs set, else will have refs - 1 set as
> penalty for long time evicted and unused page, which complies with
> LRU's nature). Which seems actually decreased refault of anon pages.
>
> There are some other issue that refault distance series is trying to
> solve too, eg. if there is a user agent force MGLRU to age
> periodically for proactive memory reclaim, or MGLRU simply ages fast,
> min_seq will grow periodically and PID won't catch enough feedback
> using previous logic.

Thanks. So far I've been taking shots in the dark since I haven't been
able to reproduce your results on bare metal or in VMs. So either the
benchmark itself is not reliable, which according to your results is
unlikely, or I've been using different hardware configurations. Do you
think you can share an off-the-shelf hardware configuration that I can
buy and use to reliably reproduce your results? Ideally we'd use the
exact same model from, for example, Dell, HP or Lenovo.