From: Eric Dumazet
Date: Tue, 22 Nov 2022 10:01:16 -0800
Subject: Re: Low TCP throughput due to vmpressure with swap enabled
To: Ivan Babrou
Cc: Linux MM, Linux Kernel Network Developers, linux-kernel, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
Miller" , Hideaki YOSHIFUJI , David Ahern , Jakub Kicinski , Paolo Abeni , cgroups@vger.kernel.org, kernel-team Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-17.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF, ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS, USER_IN_DEF_DKIM_WL,USER_IN_DEF_SPF_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Nov 21, 2022 at 4:53 PM Ivan Babrou wrote: > > Hello, > > We have observed a negative TCP throughput behavior from the following commit: > > * 8e8ae645249b mm: memcontrol: hook up vmpressure to socket pressure > > It landed back in 2016 in v4.5, so it's not exactly a new issue. > > The crux of the issue is that in some cases with swap present the > workload can be unfairly throttled in terms of TCP throughput. I guess defining 'fairness' in such a scenario is nearly impossible. Have you tried changing /proc/sys/net/ipv4/tcp_rmem (and/or tcp_wmem) ? Defaults are quite conservative. If for your workload you want to ensure a minimum amount of memory per TCP socket, that might be good enough. Of course, if your proxy has to deal with millions of concurrent TCP sockets, I fear this is not an option. > > I am able to reproduce this issue in a VM locally on v6.1-rc6 with 8 > GiB of RAM with zram enabled. > > The setup is fairly simple: > > 1. Run the following go proxy in one cgroup (it has some memory > ballast to simulate useful memory usage): > > * https://gist.github.com/bobrik/2c1a8a19b921fefe22caac21fda1be82 > > sudo systemd-run --scope -p MemoryLimit=6G go run main.go > > 2. Run the following fio config in another cgroup to simulate mmapped > page cache usage: > > [global] > size=8g > bs=256k > iodepth=256 > direct=0 > ioengine=mmap > group_reporting > time_based > runtime=86400 > numjobs=8 > name=randread > rw=randread > > [job1] > filename=derp > > sudo systemd-run --scope fio randread.fio > > 3. Run curl to request a large file via proxy: > > curl -o /dev/null http://localhost:4444 > > 4. Observe low throughput. The numbers here are dependent on your > location, but in my VM the throughput drops from 60MB/s to 10MB/s > depending on whether fio is running or not. > > I can see that this happens because of the commit I mentioned with > some perf tracing: > > sudo perf probe --add 'vmpressure:48 memcg->css.cgroup->kn->id scanned > vmpr_scanned=vmpr->scanned reclaimed vmpr_reclaimed=vmpr->reclaimed' > sudo perf probe --add 'vmpressure:72 memcg->css.cgroup->kn->id' > > I can record the probes above during curl runtime: > > sudo perf record -a -e probe:vmpressure_L48,probe:vmpressure_L72 -- sleep 5 > > Line 48 allows me to observe scanned and reclaimed page counters, line > 72 is the actual throttling. > > Here's an example trace showing my go proxy cgroup: > > kswapd0 89 [002] 2351.221995: probe:vmpressure_L48: (ffffffed2639dd90) > id=0xf23 scanned=0x140 vmpr_scanned=0x0 reclaimed=0x0 > vmpr_reclaimed=0x0 > kswapd0 89 [007] 2351.333407: probe:vmpressure_L48: (ffffffed2639dd90) > id=0xf23 scanned=0x2b3 vmpr_scanned=0x140 reclaimed=0x0 > vmpr_reclaimed=0x0 > kswapd0 89 [007] 2351.333408: probe:vmpressure_L72: (ffffffed2639de2c) id=0xf23 > > We scanned lots of pages, but weren't able to reclaim anything. 
> When throttling happens, it's in tcp_prune_queue, where rcv_ssthresh
> (the TCP window clamp) is set to 4 x advmss:
>
> * https://elixir.bootlin.com/linux/v5.15.76/source/net/ipv4/tcp_input.c#L5373
>
> else if (tcp_under_memory_pressure(sk))
>         tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
>
> I can see plenty of memory available, both in my Go proxy cgroup and in
> the system in general:
>
> $ free -h
>                total        used        free      shared  buff/cache   available
> Mem:           7.8Gi       4.3Gi       104Mi       0.0Ki       3.3Gi       3.3Gi
> Swap:           11Gi       242Mi        11Gi
>
> It just so happens that all of the memory is hot and is not eligible
> to be reclaimed. Since swap is enabled, the memory is still eligible
> to be scanned. If swap is disabled, then my Go proxy is not eligible
> for scanning anymore (all of its memory is anonymous, with nowhere to
> reclaim it to), so the whole issue goes away.
>
> Punishing well-behaved programs like that doesn't seem fair. We saw
> production metals with 200GB of page cache out of 384GB of RAM, where
> a well-behaved proxy with 60GB of RAM + 15GB of swap was throttled
> like that. The fact that it only happens with swap makes it extra
> weird.
>
> I'm not really sure what to do with this. From our end we'll probably
> just pass cgroup.memory=nosocket on the kernel cmdline to disable this
> behavior altogether, since it's not like we're running out of TCP
> memory (and we can deal with that better if it ever comes to that).
> There should probably be a better general-case solution.

Probably :)

> I don't know how widespread this issue can be. You need a fair amount
> of page cache pressure pushing reclaim toward anonymous memory to
> trigger this.
>
> Either way, this seems like a bit of a landmine.
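
As a rough sanity check on how hard that 4 * advmss clamp bites: once
rcv_ssthresh stops growing past a few kilobytes, steady-state throughput
is bounded by roughly window / RTT. The back-of-envelope program below
is purely illustrative (the MSS and RTT values are assumptions, and real
behaviour also depends on the sender's congestion window, delayed ACKs
and so on):

/* Back-of-envelope throughput ceiling for a receive window clamped to
 * 4 * advmss (illustrative only). Build with: cc -o clamp clamp.c */
#include <stdio.h>

int main(void)
{
        const double advmss = 1448.0;           /* typical MSS on a 1500-MTU path */
        const double window = 4.0 * advmss;     /* the rcv_ssthresh clamp, ~5.8 KB */
        const double rtts_ms[] = { 0.2, 1.0, 10.0, 50.0 };

        for (unsigned int i = 0; i < sizeof(rtts_ms) / sizeof(rtts_ms[0]); i++) {
                double rtt_s = rtts_ms[i] / 1000.0;

                /* throughput <= window / RTT */
                printf("RTT %5.1f ms -> ceiling ~%7.2f MB/s\n",
                       rtts_ms[i], window / rtt_s / 1e6);
        }
        return 0;
}

The point is only that once the clamp kicks in, the advertised window
rather than available memory becomes the limiting factor, which is
consistent with the kind of throughput drop described above.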