Message-ID: <82406da2-799e-f0b4-bce0-7d47486030d4@quicinc.com>
Date: Mon, 13 Feb 2023 18:12:11 -0800
Subject: Re: [PATCH] psi: reduce min window size to 50ms
From: Sudarshan Rajagopalan
To: Suren Baghdasaryan
CC: David Hildenbrand, Johannes Weiner, Mike Rapoport, Oscar Salvador,
 Anshuman Khandual, Trilok Soni, Sukadev Bhattiprolu, Srivatsa Vaddagiri,
 Patrick Daly, Sudarshan Rajagopalan
References: <15cd8816-b474-0535-d854-41982d3bbe5c@quicinc.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

On 2/10/2023 6:13 PM, Suren Baghdasaryan wrote:
> On Fri, Feb 10, 2023 at 5:46 PM Sudarshan Rajagopalan wrote:
>>
>> On 2/10/2023 5:09 PM, Suren Baghdasaryan wrote:
>>> On Fri, Feb 10, 2023 at 4:45 PM Sudarshan Rajagopalan wrote:
>>>> On 2/10/2023 3:03 PM, Suren Baghdasaryan wrote:
>>>>> On Fri, Feb 10, 2023 at 2:31 PM Sudarshan Rajagopalan wrote:
>>>>>> The PSI mechanism is a useful tool to monitor pressure stall
>>>>>> information in the system. Currently, the minimum window size
>>>>>> is set to 500ms. May we know what is the rationale for this?
>>>>> The limit was set to avoid regressions in performance and power
>>>>> consumption if the window is set too small and the system ends up
>>>>> polling too frequently. That said, the limit was chosen based on
>>>>> results of specific experiments which might not represent all
>>>> Rightly as you said, the effect on power and performance depends on type
>>>> of the system - embedded systems, or Android mobile, or commercial VMs
>>>> or servers. With higher PSI sampling, it may not be much of power impact
>>>> to embedded systems with low-tier chipsets or performance impact to
>>>> powerful servers.
>>>>
>>>>> usecases. If you want to change this limit, you would need to describe
>>>>> why the new limit is inherently better than the current one (why not
>>>>> higher, why not lower).
>>>> This is in regards to the userspace daemon [1] that we are working on,
>>>> that dynamically resizes the VM memory based on PSI memory pressure
>>>> events. With current min window size of 500ms, the PSI monitor sampling
>>>> period would be 50ms. So to detect increase in memory demand in system
>>>> and plug-in memory into VM when pressure goes up, the minimum time the
>>>> process needs to stall for is 50ms before an event can be generated and
>>>> sent out to userspace and the daemon can do actions.
>>>>
>>>> This again I'm talking w.r.t. lightweight embedded systems, where even
>>>> background kswapd/kcompd (which I'm calling it as natural memory
>>>> pressure) in the system would be less than 5-10ms stall. So any stall
>>>> more than 5-10ms would "hint" us that a memory consuming usecase has
>>>> ran and memory may need to be plugged in.
>>>>
>>>> So in these cases, having as low as 5ms psimon sampling time would give
>>>> us faster reaction time and daemon can be responsive more quickly. In
>>>> general, this will reduce the malloc latencies significantly.
>>>>
>>>> Pasting here the same excerpt I mentioned in [1].
>>> My question is: why do you think 5ms is the optimal limit here? I want
>>> to avoid a race to the bottom where next time someone can argue that
>>> they would like to detect a stall within a lower period than 5ms.
>>> Technically the limit can be as small as one wants but at some point I
>>> think we should consider the possibility of this being used for a DoS
>>> attack.
>> Well the optimal limit should be something which is least destructive? I
>> do understand about possibility of DoS attacks, but wouldn't that still
>> be possible with 500ms window today? Which will at least be 1/10th less
>> severe compared to 50ms window. The way I see it is - min pressure
>> sampling should be such that even the least pressure stall which we
>> think is significant should be captured (this could be 5ms or 50ms at
>> present) while balancing the power and performance impact across all
>> usecases.
>>
>> At present, Android's LMKD sets 1000ms as window for which it considers
>> 100ms sampling to be significant. And here, with psi_daemon usecase we
>> are saying 5ms sampling would be significant. So there's no actual
>> optimal limit, but we must limit as much possible without affecting
>> power or performance as a whole. Also, this is just the "minimum
>> allowable" window, and system admins can configure it as per the system
>> type/requirement.
> Ok, let me ask you another way which might be more productive. What
> caused you to choose 5ms as the time you care to react to a stall
> buildup?

We basically want to capture any stall caused by direct reclaim, and
ignore stalls caused by indirect reclaim and alloc retries. Stalls due
to direct reclaim are what indicate that memory pressure is building up
in the system and that memory needs to be freed (by the oom-killer or
LMKD killing apps) or made available (by plugging in any available
memory or requesting memory from the Primary host). We see that any
stall above 5ms is significant enough that the alloc request would have
invoked direct reclaim, hinting that memory pressure is starting to
build up.

Keeping the 5ms and other numbers aside, let's see what the smallest
pressure worth capturing is.

A PSI memory stall is made up of: compaction (kcompactd), thrashing,
kswapd, direct compaction and direct reclaim. Out of these, compaction,
thrashing and kswapd stalls do not necessarily indicate that memory
demand is building up (i.e. that the system is in need of more memory).
A direct compaction stall would indicate that memory is fragmented. But
a significant direct reclaim stall would indicate that the system is
under memory pressure. Usually, direct compaction and direct reclaim are
the smallest components of an aggregated PSI memory stall.

So now the question: what is the smallest direct reclaim stall that we
should capture, which would be significant to us? This depends on the
system type and configuration and on the nature of the workloads. For
Android mobile maybe 100ms (lmkd), and for servers maybe 1s (because the
max window is 10s?). For Linux Embedded Systems, this would be even
smaller. From our experiments, we observed that a 5ms stall is
significant enough to capture the direct reclaim stalls that indicate
pressure building up.

I think the min window size should be set such that even the smallest
pressure stall that we think is significant is captured.
Rather than hard-coding the min window to 500ms, why not let the system
admin choose what's best? We already have a larger cap of 10s for the
max window (though I doubt anyone would consider 1s the smallest stall
they care about for cpu/io/memory). Also, these window thresholds have
never changed since the psi monitor was introduced in kernel.org, and
are based on previous experiments which may not have represented all
workloads.

Finding the true bottom of the well would be hard. But to keep things in
the ms range, we can define a range of 1ms-500ms in Kconfig:

--- a/init/Kconfig
+++ b/init/Kconfig
+config PSI_MIN_WINDOW_MS
+       int "Minimum PSI window (ms)"
+       range 1 500
+       default 500
+

With the PSI mechanism finding more uses, the same requirement might
apply to io and cpu as well. Giving more flexibility in setting the
window size and sampling period would be beneficial.

>> Also, about possible DoS attacks - file permissions for
>> /proc/pressure/... can be set such that not any random user can register
>> to psi events right?
> True. We have a CAP_SYS_RESOURCE check for the writers of these files.
>
>>>> "
>>>>
>>>> 4. Detecting increase in memory demand - when a certain usecase starts
>>>> in VM that does memory allocations, it will stall causing PSI mechanism
>>>> to generate a memory pressure event to userspace. To simply put, when
>>>> pressure increases certain set threshold, it can make educated guess
>>>> that a memory requiring usecase has ran and VM system needs memory to be
>>>> added.
>>>>
>>>> "
>>>>
>>>> [1]
>>>> https://lore.kernel.org/linux-arm-kernel/1bf30145-22a5-cc46-e583-25053460b105@redhat.com/T/#m95ccf038c568271e759a277a08b8e44e51e8f90b
>>>>
>>>>> Thanks,
>>>>> Suren.
>>>>>
>>>>>> For lightweight systems such as Linux Embedded Systems, PSI
>>>>>> can be used to monitor and track memory pressure building up
>>>>>> in the system and respond quickly to such memory demands.
>>>>>> Example, the Linux Embedded Systems could be a secondary VM
>>>>>> system which requests for memory from Primary host. With 500ms
>>>>>> window size, the sampling period is 50ms (one-tenth of window
>>>>>> size). So the minimum amount of time the process needs to stall,
>>>>>> so that a PSI event can be generated and actions can be done
>>>>>> is 50ms. This reaction time can be much reduced by reducing the
>>>>>> sampling time (by reducing window size), so that responses to
>>>>>> such memory pressures in system can be serviced much quicker.
>>>>>>
>>>>>> Please let us know your thoughts on reducing window size to 50ms.
>>>>>>
>>>>>> Sudarshan Rajagopalan (1):
>>>>>> psi: reduce min window size to 50ms
>>>>>>
>>>>>> kernel/sched/psi.c | 2 +-
>>>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> --
>>>>>> 2.7.4
>>>>>>
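
For reference, the window and sampling-period arithmetic discussed in
this thread can be sketched in userspace C. This is only a minimal
illustration and not the code in kernel/sched/psi.c: the macro and
helper names below are chosen here for illustration, the values are the
ones quoted in the thread (sampling period of one-tenth of the window,
500ms current minimum, 10s maximum), and the trigger line is assumed to
follow the "<some|full> <stall us> <window us>" format described in the
kernel's PSI documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Per the thread: the sampling period is one-tenth of the window. */
#define UPDATES_PER_WINDOW 10UL

/* Bounds quoted in the thread: 500ms current minimum, 10s maximum. */
#define WINDOW_MIN_US 500000UL
#define WINDOW_MAX_US 10000000UL

/* Hypothetical helper: sampling period in microseconds for a window. */
static unsigned long psi_sampling_period_us(unsigned long window_us)
{
    return window_us / UPDATES_PER_WINDOW;
}

/*
 * Hypothetical helper: returns 1 if window_us lies within
 * [min_us, WINDOW_MAX_US], 0 otherwise. min_us is a parameter so the
 * current 500ms floor can be compared with a proposed lower one.
 */
static int psi_window_allowed(unsigned long window_us, unsigned long min_us)
{
    return window_us >= min_us && window_us <= WINDOW_MAX_US;
}

/*
 * Hypothetical helper: build the trigger line that a daemon would write
 * to /proc/pressure/memory, "<some|full> <stall us> <window us>".
 * Returns the snprintf() result.
 */
static int psi_format_trigger(char *buf, size_t len, const char *kind,
                              unsigned long stall_us,
                              unsigned long window_us)
{
    assert(buf != NULL && kind != NULL);
    return snprintf(buf, len, "%s %lu %lu", kind, stall_us, window_us);
}
```

With a 50ms window, psi_sampling_period_us(50000) gives 5000us, i.e. the
5ms sampling period argued for above, while
psi_window_allowed(50000, WINDOW_MIN_US) is 0 under the current 500ms
minimum. Passing a lower min_us models what a configurable floor such as
the proposed PSI_MIN_WINDOW_MS would permit.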