Subject: Re: Detecting page cache trashing state
From: "Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)"
To: Johannes Weiner
Cc: Taras Kondratiuk, Michal Hocko, linux-mm@kvack.org, xe-linux-external@cisco.com, linux-kernel@vger.kernel.org
References: <150543458765.3781.10192373650821598320@takondra-t460s> <20170915143619.2ifgex2jxck2xt5u@dhcp22.suse.cz> <150549651001.4512.15084374619358055097@takondra-t460s> <20170918163434.GA11236@cmpxchg.org> <20171025175424.GA14039@cmpxchg.org>
Date: Mon, 20 Nov 2017 21:40:56 +0200

Hi Johannes,

I tested with your patches, but the situation is still mostly the same.

I spent some time debugging and found that the problem is squashfs-specific (and probably affects some other filesystems too). The point is that the iowait for squashfs reads is waited out inside the squashfs readpage() callback.
Here is a backtrace of the page fault handling to illustrate this:

 1)               |  handle_mm_fault() {
 1)               |    filemap_fault() {
 1)               |      __do_page_cache_readahead()
 1)               |        add_to_page_cache_lru()
 1)               |        squashfs_readpage() {
 1)               |          squashfs_readpage_block() {
 1)               |            squashfs_get_datablock() {
 1)               |              squashfs_cache_get() {
 1)               |                squashfs_read_data() {
 1)               |                  ll_rw_block() {
 1)               |                    submit_bh_wbc.isra.42()
 1)               |                  __wait_on_buffer() {
 1)               |                    io_schedule() {
 ------------------------------------------
 0)   kworker-79  =>  <idle>-0
 ------------------------------------------
 0)   0.382 us    |  blk_complete_request();
 0)               |  blk_done_softirq() {
 0)               |    blk_update_request() {
 0)               |      end_buffer_read_sync()
 0) + 38.559 us   |    }
 0) + 48.367 us   |  }
 ------------------------------------------
 0)   kworker-79  =>  memhog-781
 ------------------------------------------
 0) ! 278.848 us  |                    }
 0) ! 279.612 us  |                  }
 0)               |                  squashfs_decompress() {
 0) # 4919.082 us |                    squashfs_xz_uncompress();
 0) # 4919.864 us |                  }
 0) # 5479.212 us |                } /* squashfs_read_data */
 0) # 5479.749 us |              } /* squashfs_cache_get */
 0) # 5480.177 us |            } /* squashfs_get_datablock */
 0)               |            squashfs_copy_cache() {
 0)   0.057 us    |              unlock_page();
 0) ! 142.773 us  |            }
 0) # 5624.113 us |          } /* squashfs_readpage_block */
 0) # 5628.814 us |        } /* squashfs_readpage */
 0) # 5665.097 us |      } /* __do_page_cache_readahead */
 0) # 5667.437 us |    } /* filemap_fault */
 0) # 5672.880 us |  } /* handle_mm_fault */

As you can see, squashfs_read_data() schedules the IO with ll_rw_block() and then waits for the IO to finish inside wait_on_buffer(). After that, the read buffer is decompressed and the page is unlocked, all inside the squashfs_readpage() handler. Thus, by the time filemap_fault() calls lock_page_or_retry(), the page is already uptodate and unlocked, wait_on_page_bit() is not called at all, and the time spent on the read/decompress is not accounted.
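For contrast, here is a minimal sketch of the case the accounting does catch. This is a paraphrase for illustration only, not the actual mm/filemap.c or memdelay code, and the exact placement of memdelay_enter()/memdelay_leave() around the page-lock wait is my assumption, based on Johannes' note below that the lock_page()/wait_on_page_locked() calls are where the iowaits happen on a cache miss:

/*
 * Illustration only.  On most filesystems ->readpage() just submits
 * the bio and returns with the page still locked, so the faulting
 * task has to sleep on the page lock until the read completes, and
 * that sleep is what gets charged as memory delay.
 */
static void fault_wait_sketch(struct page *page)
{
        unsigned long mdflags;

        /*
         * squashfs always takes this early return: the read and the
         * decompression already happened inside ->readpage(), which
         * unlocked the page before returning.
         */
        if (trylock_page(page))
                return;

        memdelay_enter(&mdflags);       /* refault stall starts here   */
        lock_page(page);                /* sleep until the IO finishes */
        memdelay_leave(&mdflags);       /* stall time gets recorded    */
}

Because squashfs ends up on the early-return path, none of the time it spends on the read or the decompression is ever charged.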
I tried a quick workaround for the test:

diff --git a/mm/readahead.c b/mm/readahead.c
index c4ca702..5e2be2b 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -126,9 +126,21 @@ static int read_pages(struct address_space *mapping, struct file *filp,
         for (page_idx = 0; page_idx < nr_pages; page_idx++) {
                 struct page *page = lru_to_page(pages);
+                bool refault = false;
+                unsigned long mdflags;
+
                 list_del(&page->lru);
-                if (!add_to_page_cache_lru(page, mapping, page->index, gfp))
+                if (!add_to_page_cache_lru(page, mapping, page->index, gfp)) {
+                        if (!PageUptodate(page) && PageWorkingset(page)) {
+                                memdelay_enter(&mdflags);
+                                refault = true;
+                        }
+
                         mapping->a_ops->readpage(filp, page);
+
+                        if (refault)
+                                memdelay_leave(&mdflags);
+                }
                 put_page(page);

But with this applied, the situation is not much different. The reason is that, at least in my synthetic tests, I am exhausting the whole memory, leaving almost no room for the page cache:

Active(anon):   15901788 kB
Inactive(anon):    44844 kB
Active(file):        488 kB
Inactive(file):      612 kB

As a result, the refault distance is always higher than the LRU_ACTIVE_FILE size, and the Workingset flag is not set for a refaulting page even if it was active during its lifetime before eviction:

workingset_refault        7773
workingset_activate        250
workingset_restore         233
workingset_nodereclaim      49
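For reference, the check that rejects these refaults looks roughly like this; this is a paraphrase of the relevant part of the unpatched workingset_refault() for illustration, not the exact source of this tree:

/*
 * Paraphrase of workingset_refault() before the workaround below.
 * With Active(file) at ~488 kB, i.e. only ~122 pages assuming a 4K
 * page size, virtually every eviction-to-refault distance is larger
 * than the active file list, so we bail out before the flag is set.
 */
refault_distance = (refault - eviction) & EVICTION_MASK;

inc_lruvec_state(lruvec, WORKINGSET_REFAULT);

if (refault_distance > active_file)     /* almost always true here  */
        goto out;                       /* PageWorkingset never set */

SetPageActive(page);
SetPageWorkingset(page);
inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);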
So I tried the following workaround, which sets the Workingset flag before the distance check:

diff --git a/mm/workingset.c b/mm/workingset.c
index 264f049..8035ef6 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -305,6 +305,11 @@ void workingset_refault(struct page *page, void *shadow)
         inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
+        /* Page was active prior to eviction */
+        if (workingset) {
+                SetPageWorkingset(page);
+                inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
+        }
         /*
          * Compare the distance to the existing workingset size. We
          * don't act on pages that couldn't stay resident even if all
@@ -314,13 +319,9 @@ void workingset_refault(struct page *page, void *shadow)
                 goto out;
         SetPageActive(page);
-        SetPageWorkingset(page);
         atomic_long_inc(&lruvec->inactive_age);
         inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
-        /* Page was active prior to eviction */
-        if (workingset)
-                inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
 out:
         rcu_read_unlock();
 }

Now I see that refaults for such pages are indeed accounted:

workingset_refault        4987
workingset_activate        590
workingset_restore        4358
workingset_nodereclaim     944

And the memdelay counters are actively incrementing too, indicating the thrashing state:

[:~]$ cat /proc/memdelay
7539897381
63.22 63.19 44.58
14.36 15.11 11.80

So, do you know what the proper way to fix both issues would be?

--
Thanks,
Ruslan

On 10/27/2017 11:19 PM, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) wrote:
> Hi Johannes,
>
> On 10/25/2017 08:54 PM, Johannes Weiner wrote:
>> Hi Ruslan,
>>
>> sorry about the delayed response, I missed the new activity in this
>> older thread.
>>
>> On Thu, Sep 28, 2017 at 06:49:07PM +0300, Ruslan Ruslichenko -X
>> (rruslich - GLOBALLOGIC INC at Cisco) wrote:
>>> Hi Johannes,
>>>
>>> Hopefully I was able to rebase the patch on top of v4.9.26 (the latest
>>> version we support right now) and test it a bit.
>>> The overall idea definitely looks promising, although I have one
>>> question on usage.
>>> Will it be able to account the time which processes spend on handling
>>> major page faults (including fs and iowait time) for refaulting pages?
>> That's the main thing it should measure! :)
>>
>> The lock_page() and wait_on_page_locked() calls are where iowaits
>> happen on a cache miss. If those are refaults, they'll be counted.
>>
>>> We have one big application whose code occupies a large amount of the
>>> page cache. When the system, under heavy memory usage, reclaims some
>>> of it, the application starts thrashing constantly. Since its code is
>>> placed on squashfs, it spends the whole CPU time decompressing the
>>> pages, and it seems the memdelay counters are not detecting this
>>> situation.
>>> Here are some counters to indicate this:
>>>
>>> 19:02:44        CPU     %user     %nice   %system   %iowait    %steal     %idle
>>> 19:02:45        all      0.00      0.00    100.00      0.00      0.00      0.00
>>>
>>> 19:02:44     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
>>> 19:02:45     15284.00      0.00    428.00    352.00  19990.00      0.00      0.00  15802.00      0.00
>>>
>>> And as nobody is actively allocating memory anymore, it looks like the
>>> memdelay counters are not actively incremented:
>>>
>>> [:~]$ cat /proc/memdelay
>>> 268035776
>>> 6.13 5.43 3.58
>>> 1.90 1.89 1.26
>> How does it correlate with /proc/vmstat::workingset_activate during
>> that time? It only counts thrashing time of refaults it can actively
>> detect.
> The workingset counters are growing quite actively too. Here are
> some numbers per second:
>
> workingset_refault        8201
> workingset_activate        389
> workingset_restore         187
> workingset_nodereclaim     313
>
>> Btw, how many CPUs does this system have? There is a bug in this
>> version in how idle time is aggregated across multiple CPUs. The error
>> compounds with the number of CPUs in the system.
> The system has 2 CPU cores.
>> I'm attaching 3 bugfixes that go on top of what you have. There might
>> be some conflicts, but they should be minor variable naming issues.
>>
> I will test with your patches and get back to you.
>
> Thanks,
> Ruslan