Date: Mon, 30 Jul 2018 11:16:05 +0200
From: Michal Hocko
To: John Allen
Cc: linux-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, kamezawa.hiroyu@jp.fujitsu.com, n-horiguchi@ah.jp.nec.com, mgorman@suse.de, nfont@linux.vnet.ibm.com
Subject: Re: Infinite looping observed in __offline_pages
Message-ID: <20180730091605.GF24267@dhcp22.suse.cz>
References: <20180725181115.hmlyd3tmnu3mn3sf@p50.austin.ibm.com> <20180725200336.GP28386@dhcp22.suse.cz> <20180727173259.htdxpn4i2fxprpaj@p50.austin.ibm.com>
In-Reply-To: <20180727173259.htdxpn4i2fxprpaj@p50.austin.ibm.com>
User-Agent: Mutt/1.10.1 (2018-07-13)
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri 27-07-18 12:32:59, John Allen wrote:
> On Wed, Jul 25, 2018
> at 10:03:36PM +0200, Michal Hocko wrote:
> > On Wed 25-07-18 13:11:15, John Allen wrote:
> > [...]
> > > Does a failure in do_migrate_range indicate that the range is unmigratable
> > > and the loop in __offline_pages should terminate and goto failed_removal? Or
> > > should we allow a certain number of retries before we
> > > give up on migrating the range?
> >
> > Unfortunately not. Migration code doesn't tell the difference between
> > ephemeral and permanent failures. We are relying on
> > start_isolate_page_range to tell us this. So the question is, what kind
> > of page is not migratable and for what reason.
> >
> > Are you able to add some debugging to give us more information? The
> > current debugging code in the hotplug/migration sucks...
>
> After reproducing the problem a couple times, it seems that it can occur for
> different types of pages. Running page-types on the offending page over two
> separate instances produced the following:
>
> # tools/vm/page-types -a 307968-308224
>              flags  page-count  MB  symbolic-flags                               long-symbolic-flags
> 0x0000000000000400           1   0  __________B________________________________  buddy
>              total           1   0

Huh! How come a buddy page has a non-zero reference count?

>
> And the following on a separate run:
>
> # tools/vm/page-types -a 313088-313344
>              flags  page-count  MB  symbolic-flags                               long-symbolic-flags
> 0x000000000000006c           1   0  __RU_lA____________________________________  referenced,uptodate,lru,active
>              total           1   0

Hmm, what is the expected page count in this case? Seeing 1 doesn't look
particularly wrong.

-- 
Michal Hocko
SUSE Labs
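
For readers following the page-types output quoted above, here is a minimal sketch of how the raw flags values map to the long symbolic names. It assumes the low bit positions documented for /proc/kpageflags (bit 0 locked through bit 10 buddy). The table KPF_NAMES and the helper decode_kpageflags are hypothetical names used only for illustration and are not part of tools/vm/page-types.

```python
# Low bit positions as documented for /proc/kpageflags (bits 0..10 only).
# KPF_NAMES and decode_kpageflags are illustrative names, not a kernel API.
KPF_NAMES = [
    "locked",      # bit 0
    "error",       # bit 1
    "referenced",  # bit 2
    "uptodate",    # bit 3
    "dirty",       # bit 4
    "lru",         # bit 5
    "active",      # bit 6
    "slab",        # bit 7
    "writeback",   # bit 8
    "reclaim",     # bit 9
    "buddy",       # bit 10
]


def decode_kpageflags(flags):
    """Return the names of the low kpageflags bits that are set in flags."""
    return [name for bit, name in enumerate(KPF_NAMES) if flags & (1 << bit)]


# The two values reported in the thread.
for value in (0x400, 0x6C):
    print(f"{value:#018x} {','.join(decode_kpageflags(value))}")
```

Under these assumptions, 0x400 decodes to buddy and 0x6c to referenced,uptodate,lru,active, which matches the long-symbolic-flags column in both runs.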