Message-ID: <602dcc82-519b-bafd-19e6-b373abe572d4@quicinc.com>
Date: Thu, 24 Mar 2022 21:15:57 +0530
From: Charan Teja Kalla
Subject: Re: [PATCH 2/2] mm: madvise: return exact bytes advised with process_madvise under error
To: Michal Hocko
CC: [...], Johannes Weiner
References: <0fa1bdb5009e898189f339610b90ecca16f243f4.1648046642.git.quic_charante@quicinc.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Thanks Michal for the inputs.

On 3/24/2022 6:44 PM, Michal Hocko wrote:
> On Wed 23-03-22 20:54:10, Charan Teja Kalla wrote:
>> From: Charan Teja Reddy
>>
>> The commit 5bd009c7c9a9 ("mm: madvise: return correct bytes advised with
>> process_madvise") fixes the issue to return number of bytes that are
>> successfully advised before hitting error with iovec elements
>> processing.
>> But, when the user passed unmapped ranges in iovec, the
>> syscall ignores these holes and continues processing and returns ENOMEM
>> in the end, which is same as madvise semantic. This is a problem for
>> vector processing where user may want to know how many bytes were
>> exactly processed in a iovec element to make better decissions in the
>> user space. As in ENOMEM case, we processed all bytes in a iovec element
>> but still returned error which will confuse the user whether it is
>> failed or succeeded to advise.
> Do you have any specific example where the initial semantic is really
> problematic or is this mostly a theoretical problem you have found when
> reading the code?
>
>
>> As an example, consider below ranges were passed by the user in struct
>> iovec: iovec1(ranges: vma1), iovec2(ranges: vma2 -- vma3 -- hole) and
>> iovec3(ranges: vma4). In the current implementation, it fully advise
>> iovec1 and iovec2 but just returns number of processed bytes as iovec1
>> range. Then user may repeat the processing of iovec2, which is already
>> processed, which then returns with ENOMEM. Then user may want to skip
>> iovec2 and starts processing from iovec3. Here because of wrong return
>> processed bytes, iovec2 is processed twice.
> I think you should be much more specific why this is actually a problem.
> This would surely be less optimal but is this a correctness issue?
>

Yes, this is a problem found when reading the code, but IMO we can
easily expect an invalid VMA or hole in the passed range because we are
operating on another process's VMAs.

Beyond solving the problem of being less optimal, this can be seen as
helping the user make better policy decisions with this syscall. A user
who cannot make those decisions is merely being suboptimal with this
syscall (i.e. issuing the syscall again on an already processed range).

Having said that, at present I don't have any reports or unit tests
showing that the existing semantic is really problematic.

> [...]
>> +	vma = find_vma_prev(mm, start, &prev);
>> +	if (vma && start > vma->vm_start)
>> +		prev = vma;
>> +
>> +	blk_start_plug(&plug);
>> +	for (;;) {
>> +		/*
>> +		 * It it hits a unmapped address range in the [start, end),
>> +		 * stop processing and return ENOMEM.
>> +		 */
>> +		if (!vma || start < vma->vm_start) {
>> +			error = -ENOMEM;
>> +			goto out;
>> +		}
>> +
>> +		tmp = vma->vm_end;
>> +		if (end < tmp)
>> +			tmp = end;
>> +
>> +		error = madvise_vma_behavior(vma, &prev, start, tmp, behavior);
>> +		if (error)
>> +			goto out;
>> +		tmp_bytes_advised += tmp - start;
>> +		start = tmp;
>> +		if (prev && start < prev->vm_end)
>> +			start = prev->vm_end;
>> +		if (start >= end)
>> +			goto out;
>> +		if (prev)
>> +			vma = prev->vm_next;
>> +		else
>> +			vma = find_vma(mm, start);
>> +	}
>> +out:
>> +	/*
>> +	 * partial_bytes_advised may contain non-zero bytes indicating
>> +	 * the number of bytes advised before failure. Holds zero incase
>> +	 * of success.
>> +	 */
>> +	*partial_bytes_advised = error ? tmp_bytes_advised : 0;
> Although this looks like a fix I am not sure it is future proof.
> madvise_vma_behavior doesn't report which part of the range has been
> really processed. I do not think that currently supported madvise modes
> for process_madvise support an early break out with return to the
> userspace (madvise_cold_or_pageout_pte_range bails on fatal signals for
> example) but this can change in the future and then you are back to
> "imprecise" return value problem. Yes, this is a theoretical problem

I agree with the "imprecise" return value problem when processing a VMA
range. Yes, once it is decided to return the properly processed value
from madvise_vma_behavior(), this code too may need maintenance.

> but so it sounds the problem you are trying to fix IMHO. I think it
> would be better to live with imprecise return values reporting rather
> than aiming for perfection which would be fragile and add a future
> maintenance burden.
>

Hmm.
Should at least these imprecise return values be documented in the man
page or in the madvise.c file?

> On the other hand if there are _real_ workloads which suffer from the
> existing semantic then sure the above seems to be an appropriate fix
> AFAICS.
> --
> Michal Hocko
> SUSE Labs
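
For illustration only, a minimal user-space sketch in C of the
stop-at-first-hole walk from the quoted hunk, showing the partial-byte
accounting in isolation. It is a model, not kernel code: struct range
and advise_until_hole() are hypothetical names, a sorted array stands in
for the VMA list, and no advice is actually applied.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA: a mapped range [start, end). */
struct range {
	size_t start;
	size_t end;
};

/*
 * Walk [start, end) over a sorted, non-overlapping array of mapped ranges.
 * Stops at the first hole and returns -ENOMEM, storing the bytes covered
 * before the hole in *partial. On success returns 0 and stores 0 in
 * *partial, matching the convention in the quoted hunk.
 */
static int advise_until_hole(const struct range *maps, size_t nmaps,
			     size_t start, size_t end, size_t *partial)
{
	size_t advised = 0;
	size_t i = 0;
	int error = 0;

	/* Skip ranges that end at or before start. */
	while (i < nmaps && maps[i].end <= start)
		i++;

	while (start < end) {
		size_t stop;

		/* No mapping covers start: this is a hole. */
		if (i == nmaps || start < maps[i].start) {
			error = -ENOMEM;
			break;
		}

		stop = maps[i].end < end ? maps[i].end : end;
		/* A real implementation would advise [start, stop) here. */
		advised += stop - start;
		start = stop;
		i++;
	}

	*partial = error ? advised : 0;
	return error;
}
```

With mapped ranges [0x1000, 0x2000), [0x2000, 0x3000) and
[0x5000, 0x6000), a request for [0x1000, 0x4000) returns -ENOMEM with
*partial set to 0x2000, which corresponds to the
"vma2 -- vma3 -- hole" case in the example.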