Date: Fri, 27 Jul 2018 12:32:59 -0500
From: John Allen
To: Michal Hocko
Cc: linux-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
    kamezawa.hiroyu@jp.fujitsu.com, n-horiguchi@ah.jp.nec.com,
    mgorman@suse.de, nfont@linux.vnet.ibm.com
Subject: Re: Infinite looping observed in __offline_pages
References: <20180725181115.hmlyd3tmnu3mn3sf@p50.austin.ibm.com>
    <20180725200336.GP28386@dhcp22.suse.cz>
In-Reply-To: <20180725200336.GP28386@dhcp22.suse.cz>
Message-Id: <20180727173259.htdxpn4i2fxprpaj@p50.austin.ibm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Jul 25, 2018 at 10:03:36PM +0200, Michal Hocko wrote:
>On Wed 25-07-18 13:11:15, John Allen wrote:
>[...]
>> Does a failure in do_migrate_range indicate that the range is unmigratable
>> and the loop in __offline_pages should terminate and goto failed_removal? Or
>> should we allow a certain number of retrys before we
>> give up on migrating the range?
>
>Unfortunatelly not. Migration code doesn't tell a difference between
>ephemeral and permanent failures. We are relying on
>start_isolate_page_range to tell us this. So the question is, what kind
>of page is not migratable and for what reason.
>
>Are you able to add some debugging to give us more information. The
>current debugging code in the hotplug/migration sucks...

After reproducing the problem a couple times, it seems that it can occur
for different types of pages.
Running page-types on the offending page over two separate instances
produced the following:

# tools/vm/page-types -a 307968-308224
             flags  page-count  MB  symbolic-flags                               long-symbolic-flags
0x0000000000000400           1   0  __________B________________________________  buddy
             total           1   0

And the following on a separate run:

# tools/vm/page-types -a 313088-313344
             flags  page-count  MB  symbolic-flags                               long-symbolic-flags
0x000000000000006c           1   0  __RU_lA____________________________________  referenced,uptodate,lru,active
             total           1   0

The source of the failure in migrate_pages actually doesn't seem to be
that we're hitting the case of the permanent failure, but instead the
-EAGAIN case. I traced the -EAGAIN return back to
migrate_page_move_mapping, which I've seen return -EAGAIN in two places:

mm/migrate.c:453
	if (!mapping) {
		/* Anonymous page without mapping */
		if (page_count(page) != expected_count)
			return -EAGAIN;

mm/migrate.c:476
	if (page_count(page) != expected_count ||
		radix_tree_deref_slot_protected(pslot,
					&mapping->i_pages.xa_lock) != page) {
		xa_unlock_irq(&mapping->i_pages);
		return -EAGAIN;
	}

So it seems in each case, the actual reference count for the page is not
what it is expected to be.
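
To make the retry question from earlier in the thread concrete, here is a
minimal userspace sketch of a bounded retry around a refcount check shaped
like the two quoted above. It is not kernel code and not a proposed patch:
struct fake_page, fake_move_mapping, fake_release_transient and
migrate_with_retry_limit are all hypothetical names, the "transient
reference" model is an assumption, and returning -EBUSY once the limit is
reached is an arbitrary choice for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical userspace stand-in for a page; this is NOT struct page. */
struct fake_page {
	int refcount;        /* what page_count() would report */
	int expected_count;  /* what the migration path expects to see */
	int transient_refs;  /* extra references assumed to go away over time */
};

/* Same shape as the quoted checks: a count mismatch yields -EAGAIN. */
static int fake_move_mapping(const struct fake_page *page)
{
	if (page->refcount != page->expected_count)
		return -EAGAIN;
	return 0;
}

/* Assumption: one transient holder drops its reference between attempts. */
static void fake_release_transient(struct fake_page *page)
{
	if (page->transient_refs > 0) {
		page->transient_refs--;
		page->refcount--;
	}
}

/*
 * Try once, then retry up to max_retries more times while the result is
 * -EAGAIN. Gives up with -EBUSY instead of looping forever.
 */
static int migrate_with_retry_limit(struct fake_page *page, int max_retries)
{
	assert(max_retries >= 0);

	for (int attempt = 0; attempt <= max_retries; attempt++) {
		int ret = fake_move_mapping(page);

		if (ret != -EAGAIN)
			return ret;
		fake_release_transient(page);
	}
	return -EBUSY;
}
```

In this model, a page whose extra references are all transient succeeds
once they drain, while a page holding a reference that never goes away
hits the limit and fails instead of spinning. Whether any fixed limit is
appropriate for __offline_pages is exactly the open question here.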