From: Pavel Tatashin
Date: Sat, 7 Apr 2018 10:45:03 -0400
Subject: Re: [PATCH v2 1/2] mm: uninitialized struct page poisoning sanity checking
To: Sasha Levin
Cc: steven.sistare@oracle.com, daniel.m.jordan@oracle.com,
 akpm@linux-foundation.org, mgorman@techsingularity.net, mhocko@suse.com,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 gregkh@linuxfoundation.org, vbabka@suse.cz, bharata@linux.vnet.ibm.com
In-Reply-To: <20180406124535.k3qyxjfrlo55d5if@xakep.localdomain>

> Let me study your trace, perhaps I will able to figure out the issue
> without reproducing it.

Hi Sasha,

I've been studying this problem more. The issue happens in this stack:

...subsys_init...
 topology_init()
  register_one_node(nid)
   link_mem_sections(nid, pgdat->node_start_pfn, pgdat->node_spanned_pages)
    register_mem_sect_under_node(mem_blk, nid)
     get_nid_for_pfn(pfn)
      pfn_to_nid(pfn)
       page_to_nid(page)
        PF_POISONED_CHECK(page)

We are trying to get the nid from a struct page that has not been
initialized. My patches add this extra scrutiny, by adding
PF_POISONED_CHECK() to page_to_nid(), to make sure that we never get an
invalid nid from a "struct page". So the bug, where an incorrect nid is
read, already exists in Linux.

The question is: why does this happen?

First, I thought that perhaps the struct page is not yet initialized.
But the initcalls are done after deferred pages are initialized, and
thus every struct page must be initialized by now. Also, if deferred
pages were enabled, we would take a slightly different path and avoid
this bug by getting the nid from memblock instead of from struct page:

get_nid_for_pfn(pfn)
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
	if (system_state < SYSTEM_RUNNING)
		return early_pfn_to_nid(pfn);
#endif

I also verified in your config that CONFIG_DEFERRED_STRUCT_PAGE_INIT is
not set.
So, one way to fix this issue is to remove the "#ifdef" in
get_nid_for_pfn() (I have not checked for dependencies), but this
simply addresses the symptom and does not fix the actual issue.

Thus, we have a "struct page" backing memory for this pfn, but we have
not initialized it. For some reason memmap_init_zone() decided to skip
it, and I am not sure why. Looking at the code, we skip initializing if:

  !early_pfn_valid(pfn), aka !pfn_valid(pfn), or if
  !early_pfn_in_nid(pfn, nid).

I suspect this has something to do with !pfn_valid(pfn). But without a
machine on which I can reproduce this problem, I cannot study it
further to determine exactly why the pfn is not valid.

Please replace !pfn_valid_within() with !pfn_valid() in
get_nid_for_pfn() and see if the problem still happens. If it does not
happen, let's study the memory map, the pgdat's start and end, and the
value of this pfn.

Thank you,
Pasha
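
Note: the skip conditions and the proposed experiment can be sketched
as a userspace C model. Everything in it is an assumption made for
illustration: the arrays, memmap_init_range(), get_nid_checked() and
setup_example() are hypothetical stand-ins, not the kernel functions.
The model also assumes a configuration in which pfn_valid_within()
always evaluates to true, which should be the case when
CONFIG_HOLES_IN_ZONE is not set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NR_PFNS 8
#define POISON UINT64_MAX		/* pattern left by memset(0xff) */

enum { NID_SKIPPED = -1, NID_POISONED = -2 };

struct page { uint64_t flags; };	/* in this model, flags holds only the nid */

static struct page mem_map[NR_PFNS];
static bool pfn_is_valid[NR_PFNS];	/* what pfn_valid() would report */
static int pfn_nid[NR_PFNS];		/* node that owns each pfn */

static bool early_pfn_valid(unsigned long pfn)
{
	return pfn_is_valid[pfn];
}

static bool early_pfn_in_nid(unsigned long pfn, int nid)
{
	return pfn_nid[pfn] == nid;
}

/* Model of the init loop: a pfn that fails either test is skipped and
 * its struct page keeps the poison pattern. Returns pages initialized. */
static unsigned long memmap_init_range(int nid, unsigned long start_pfn,
				       unsigned long end_pfn)
{
	unsigned long pfn, done = 0;

	assert(start_pfn <= end_pfn && end_pfn <= NR_PFNS);
	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		if (!early_pfn_valid(pfn))
			continue;
		if (!early_pfn_in_nid(pfn, nid))
			continue;
		mem_map[pfn].flags = (uint64_t)nid;
		done++;
	}
	return done;
}

/* use_pfn_valid == false models the pfn_valid_within() check (always
 * true here); use_pfn_valid == true models the suggested replacement. */
static int get_nid_checked(unsigned long pfn, bool use_pfn_valid)
{
	assert(pfn < NR_PFNS);
	if (use_pfn_valid && !pfn_is_valid[pfn])
		return NID_SKIPPED;
	if (mem_map[pfn].flags == POISON)
		return NID_POISONED;	/* the poison check would fire here */
	return (int)mem_map[pfn].flags;
}

/* One node, eight pfns, with pfn 5 reported as not valid. */
static void setup_example(void)
{
	unsigned long pfn;

	memset(mem_map, 0xff, sizeof(mem_map));
	for (pfn = 0; pfn < NR_PFNS; pfn++) {
		pfn_is_valid[pfn] = true;
		pfn_nid[pfn] = 0;
	}
	pfn_is_valid[5] = false;
}
```

With setup_example() followed by memmap_init_range(0, 0, NR_PFNS), pfn 5
has a struct page but is never initialized. Looking it up with the
always-true check hits the poison, while looking it up with the
pfn_valid()-style check skips it, which is the difference the
experiment above is meant to reveal.
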