From: Ivan Babrou
Date: Mon, 21 Nov 2022 16:53:43 -0800
Subject: Low TCP throughput due to vmpressure with swap enabled
To: Linux MM
Cc: Linux Kernel Network Developers, linux-kernel, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton, Eric Dumazet, "David S. Miller", Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, Paolo Abeni, cgroups@vger.kernel.org, kernel-team

Hello,

We have observed degraded TCP throughput caused by the following commit:

* 8e8ae645249b mm: memcontrol: hook up vmpressure to socket pressure

It landed back in 2016 in v4.5, so it's not exactly a new issue.

The crux of the issue is that in some cases, with swap present, a workload can be unfairly throttled in terms of TCP throughput.

I am able to reproduce this issue locally in a VM on v6.1-rc6 with 8 GiB of RAM and zram enabled.

The setup is fairly simple:

1. Run the following go proxy in one cgroup (it has some memory ballast to simulate useful memory usage):

* https://gist.github.com/bobrik/2c1a8a19b921fefe22caac21fda1be82

    sudo systemd-run --scope -p MemoryLimit=6G go run main.go

2. Run the following fio config in another cgroup to simulate mmapped page cache usage:

    [global]
    size=8g
    bs=256k
    iodepth=256
    direct=0
    ioengine=mmap
    group_reporting
    time_based
    runtime=86400
    numjobs=8
    name=randread
    rw=randread

    [job1]
    filename=derp

    sudo systemd-run --scope fio randread.fio

3. Run curl to request a large file via the proxy:

    curl -o /dev/null http://localhost:4444

4. Observe low throughput. The numbers depend on your location, but in my VM the throughput drops from 60MB/s to 10MB/s when fio is running.

With some perf tracing I can see that this happens because of the commit I mentioned:

    sudo perf probe --add 'vmpressure:48 memcg->css.cgroup->kn->id scanned vmpr_scanned=vmpr->scanned reclaimed vmpr_reclaimed=vmpr->reclaimed'
    sudo perf probe --add 'vmpressure:72 memcg->css.cgroup->kn->id'

I can record the probes above while curl is running:

    sudo perf record -a -e probe:vmpressure_L48,probe:vmpressure_L72 -- sleep 5

Line 48 lets me observe the scanned and reclaimed page counters; line 72 is the actual throttling. Here's an example trace showing my go proxy cgroup:

    kswapd0 89 [002] 2351.221995: probe:vmpressure_L48: (ffffffed2639dd90) id=0xf23 scanned=0x140 vmpr_scanned=0x0 reclaimed=0x0 vmpr_reclaimed=0x0
    kswapd0 89 [007] 2351.333407: probe:vmpressure_L48: (ffffffed2639dd90) id=0xf23 scanned=0x2b3 vmpr_scanned=0x140 reclaimed=0x0 vmpr_reclaimed=0x0
    kswapd0 89 [007] 2351.333408: probe:vmpressure_L72: (ffffffed2639de2c) id=0xf23

We scanned lots of pages, but weren't able to reclaim anything.

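To make the trace easier to read, here is a rough user-space model of how I understand the vmpressure window accounting to turn scanned/reclaimed counters into a pressure level. It is a sketch, not kernel code: the window size (512 pages) and the 60/95 thresholds are my reading of mm/vmpressure.c and should be treated as assumptions, and the function names are made up for illustration.

```python
# User-space model of the vmpressure accounting; NOT kernel code.
# Assumed constants (my reading of mm/vmpressure.c):
VMPRESSURE_WIN = 512     # pages scanned before a level is computed
LEVEL_MEDIUM = 60        # percent
LEVEL_CRITICAL = 95      # percent


def pressure_percent(scanned, reclaimed):
    """Share of scanned pages that could not be reclaimed, in percent."""
    if reclaimed >= scanned:
        return 0
    scale = scanned + reclaimed
    pressure = scale - (reclaimed * scale // scanned)
    return pressure * 100 // scale


def level(pressure):
    """Map a pressure percentage to a level name."""
    if pressure >= LEVEL_CRITICAL:
        return "critical"
    if pressure >= LEVEL_MEDIUM:
        return "medium"
    return "low"


def replay(events):
    """Replay (scanned, reclaimed) pairs; return one level per full window."""
    acc_scanned = acc_reclaimed = 0
    levels = []
    for scanned, reclaimed in events:
        acc_scanned += scanned
        acc_reclaimed += reclaimed
        if acc_scanned < VMPRESSURE_WIN:
            continue  # window not full yet, keep accumulating
        levels.append(level(pressure_percent(acc_scanned, acc_reclaimed)))
        acc_scanned = acc_reclaimed = 0
    return levels


if __name__ == "__main__":
    # The two L48 events from the trace: 0x140 and 0x2b3 pages scanned,
    # nothing reclaimed in either.
    print(replay([(0x140, 0), (0x2B3, 0)]))
```

In this model the first event (0x140 = 320 pages) does not fill the window, and the second one pushes the total to 1011 pages with zero reclaimed, which evaluates to 100% pressure. Any level above "low" is what I understand to set socket pressure for the cgroup, which would match the L72 probe firing right after the second L48 event.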
When throttling happens, it's in tcp_prune_queue, where rcv_ssthresh (TCP window clamp) is capped at 4 x advmss:

* https://elixir.bootlin.com/linux/v5.15.76/source/net/ipv4/tcp_input.c#L5373

    else if (tcp_under_memory_pressure(sk))
            tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);

I can see plenty of memory available in both my go proxy cgroup and in the system in general:

    $ free -h
                   total        used        free      shared  buff/cache   available
    Mem:           7.8Gi       4.3Gi       104Mi       0.0Ki       3.3Gi       3.3Gi
    Swap:           11Gi       242Mi        11Gi

It just so happens that all of the memory is hot and is not eligible to be reclaimed. Since swap is enabled, the memory is still eligible to be scanned. If swap is disabled, then my go proxy is not eligible for scanning anymore (all of its memory is anonymous and there is nowhere to reclaim it to), so the whole issue goes away.

Punishing well-behaved programs like that doesn't seem fair. We saw production bare-metal machines with 200GB of page cache out of 384GB of RAM, where a well-behaved proxy with 60GB of RAM + 15GB of swap is throttled like that. The fact that it only happens with swap makes it extra weird.

I'm not really sure what to do with this. From our end we'll probably just pass cgroup.memory=nosocket on the kernel cmdline to disable this behavior altogether, since it's not like we're running out of TCP memory (and we can deal with that better if it ever comes to that). There should probably be a better general-case solution.

I don't know how widespread this issue can be. You need a fair amount of page cache pressure for reclaim to go after anonymous memory and trigger this.

Either way, this seems like a bit of a landmine.
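
For a sense of scale, the rcv_ssthresh clamp quoted above can be put into numbers. This is a back-of-the-envelope sketch only: the advmss of 1460 bytes and the RTT values are assumptions picked for illustration (they were not measured in my setup), the helper names are made up, and the window/RTT bound ignores window scaling granularity and everything else the stack does.

```python
def clamped_rcv_ssthresh(rcv_ssthresh, advmss):
    """Mirror of the quoted clamp: min(rcv_ssthresh, 4 * advmss)."""
    return min(rcv_ssthresh, 4 * advmss)


def max_throughput_bytes_per_sec(window_bytes, rtt_ms):
    """Upper bound for one flow: at most one window per round trip."""
    return window_bytes * 1000 / rtt_ms


if __name__ == "__main__":
    advmss = 1460  # assumption: typical Ethernet MSS, not from my trace
    window = clamped_rcv_ssthresh(1 << 20, advmss)
    print("clamped window:", window, "bytes")
    for rtt_ms in (1, 10, 50):  # hypothetical round-trip times
        bound = max_throughput_bytes_per_sec(window, rtt_ms)
        print(f"rtt={rtt_ms}ms -> at most {bound / 1e6:.2f} MB/s")
```

The point is only that a window of a few MSS caps a single connection at window/RTT, so any non-trivial latency to the origin turns the clamp into a hard throughput ceiling regardless of how much memory is actually free.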