Subject: Re: [PATCH v11 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time
To: Thomas Gleixner
Cc: mingo@redhat.com, peterz@infradead.org, hpa@zytor.com, ak@linux.intel.com, tim.c.chen@linux.intel.com, dave.hansen@intel.com, arjan@linux.intel.com, aubrey.li@intel.com, linux-kernel@vger.kernel.org
References: <20190213023748.6614-1-aubrey.li@linux.intel.com> <20190213023748.6614-2-aubrey.li@linux.intel.com> <85b22e16-fffc-7ebe-2ab1-3b6fe7e036ab@linux.intel.com>
From: "Li, Aubrey"
Date: Sun, 17 Feb 2019 01:05:20 +0800
X-Mailing-List: linux-kernel@vger.kernel.org

On 2019/2/16 20:55, Thomas Gleixner wrote:
> On Fri, 15 Feb 2019, Li, Aubrey wrote:
>> On 2019/2/14 19:29, Thomas Gleixner wrote:
>> Under this scenario, the elapsed time becomes longer than normal indeed, see below:
>>
>> $ while [ 1 ]; do cat /proc/6985/status | grep AVX; sleep 1; done
>> AVX512_elapsed_ms: 3432
>> AVX512_elapsed_ms: 440
>> AVX512_elapsed_ms: 1444
>> AVX512_elapsed_ms: 2448
>> AVX512_elapsed_ms: 3456
>> AVX512_elapsed_ms: 460
>> AVX512_elapsed_ms: 1464
>> AVX512_elapsed_ms: 2468
>>
>> But AFAIK, google's Heracles do a 15s polling, so this worst case is still acceptable.?
>
> I have no idea what Google's thingy does and you surely have to ask those
> people who want to use this whether they are OK with that. I personally
> think the numbers are largely useless, but I don't know the use case.
>
>>> IOW, this needs crystal ball magic to decode because
>>> there is no correlation between that elapsed time and the time when the
>>> last context switch happened simply because that time is not available in
>>> /proc/$PID/status. Sure you can oracle it out from /proc/$PID/stat with
>>> even more crystal ball magic, but there is no explanation at all.
>>>
>>> There may be use case scenarios where this crystal ball prediction is
>>> actually useful, but the inaccuracy of that information and the possible
>>> pitfalls for any user space application which uses it need to be documented
>>> in detail. Without that, this is going to cause more trouble and confusion
>>> than benefit.
>>>
>> Not sure if the above experiment addressed your concern, please correct me if
>> I totally misunderstood.
>
> The above experiment just confirms what I said: The numbers are inaccurate
> and potentially misleading to a large extent when the AVX using task is not
> scheduled out for a longer time.
>
> So what I'm asking for is proper documentation which explains how this
> 'hint' is generated in the kernel and why it can be completely inaccurate
> and misleading. We don't want to end up in a situation where people start
> to rely on this information and then have to go and read kernel code to
> understand why the numbers do not make sense.
>
> I'm not convinced that this interface in the current form is actually
> useful. Even if you ignore the single task example, then on a loaded
> machine where tasks are scheduled in and out by time slices, then the
> calculation is:
>
>     delta = (long)(jiffies - timestamp);
>
> delta is what you expose as elapsed_ms. Now assume that the task is seen as
> using AVX when being scheduled out. So depending on the time it is
> scheduled out, whether it's due to lots of other tasks occupying the CPU or
> due to a blocking syscall, the result can be completely misleading. The job
> scheduler will see for example: 80ms ago was last AVX usage recorded and
> decide that this is just an occasional usage and migrate it away. Now the
> task gets on the other CPU and starts using AVX again, which makes the
> scheduler see a short delta and decide to move it back.

In the end, what matters is the threshold applied to the AVX512 elapsed
time. I guess you'll be more comfortable if the job scheduler bases its
migration decision on the condition AVX512_elapsed_ms > 10000 (10s) ;)

That is, every time the interface is queried, if AVX512 usage was recorded
within the last 10s, the task is classified as an AVX512 task and should be
grouped. This is reasonable because an AVX512-using task is usually a
long-running one. For example, I have an intern who trained a deep
reinforcement learning model for 21 days...
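
The threshold rule above could be sketched in user space roughly as follows.
This is only an illustration, not part of the patch: the helper names, the
regular expression, and the choice to treat a missing or negative value as
"no AVX512 usage recorded" are assumptions of this sketch. The sample values
are the ones from the polling experiment quoted earlier in this thread.

```python
import re
from typing import Optional

# Threshold proposed in the discussion above: 10 s, expressed in milliseconds.
AVX512_THRESHOLD_MS = 10_000

# Matches a line such as "AVX512_elapsed_ms:   3432" in /proc/<pid>/status text.
_FIELD = re.compile(r"^AVX512_elapsed_ms:\s*(-?\d+)\s*$", re.MULTILINE)


def parse_avx512_elapsed_ms(status_text: str) -> Optional[int]:
    """Return the AVX512_elapsed_ms value from status text, or None if absent."""
    match = _FIELD.search(status_text)
    return int(match.group(1)) if match else None


def is_avx512_task(elapsed_ms: Optional[int],
                   threshold_ms: int = AVX512_THRESHOLD_MS) -> bool:
    """Classify a task as an AVX512 user if usage was seen within threshold_ms.

    Assumption of this sketch: a missing or negative value means that no
    AVX512 usage has been recorded for the task.
    """
    if elapsed_ms is None or elapsed_ms < 0:
        return False
    return elapsed_ms <= threshold_ms


if __name__ == "__main__":
    # Values observed by polling once per second in the experiment above.
    samples = [3432, 440, 1444, 2448, 3456, 460, 1464, 2468]
    # A 1 s threshold flips its verdict between polls of the same task.
    print([is_avx512_task(s, threshold_ms=1000) for s in samples])
    # The 10 s threshold classifies every sample the same way.
    print([is_avx512_task(s) for s in samples])
```

With a short threshold the classification flips between polls of the same
task, which is the ping-pong behaviour described in the quoted text. With the
10s threshold every sample yields the same answer, at the price of reacting
more slowly once a task really stops using AVX512.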
>
> So interpreting the value is voodoo which becomes even harder when there is
> no documentation.

Sure, let me try to put more details in Documentation/filesystems/proc.txt

Thanks,
-Aubrey