Subject: Re: [v2] mm: access to uninitialized struct page
To: Andrei Vagin
Cc: Steven Sistare, Daniel Jordan, Andrew Morton, LKML, tglx@linutronix.de, Michal Hocko, Linux Memory Management List, mgorman@techsingularity.net, mingo@kernel.org, peterz@infradead.org, Steven Rostedt, Fengguang Wu, Dennis Zhou
References: <20180426202619.2768-1-pasha.tatashin@oracle.com> <20180504082731.GA2782@outlook.office365.com> <20180504160139.GA4693@outlook.office365.com>
From: Pavel Tatashin
Message-ID: <22c34b1b-15d9-6a28-d7f2-697bac42bde2@oracle.com>
Date: Fri, 4 May 2018 12:03:02 -0400
In-Reply-To: <20180504160139.GA4693@outlook.office365.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Thank you, I will try to figure out what is happening.
Pavel

On 05/04/2018 12:01 PM, Andrei Vagin wrote:
> On Fri, May 04, 2018 at 12:47:53PM +0000, Pavel Tatashin wrote:
>> Hi Andrei,
>>
>> Could you please provide me with scripts to reproduce this issue?
>
> I boot this kernel in a kvm virtual machine. The kernel is built without
> modules. A config file is attached.
>
> Here is the qemu command line that I use to reproduce the problem:
>
> qemu-kvm -kernel /home/avagin/git/linux-next/arch/x86/boot/bzImage \
> -append 'root=/dev/vda2 ro debug console=ttyS0,115200 LANG=en_US.UTF-8 slub_debug=FZP raid=noautodetect selinux=0 earlyprintk=serial,ttyS0,115200' \
> -boot c \
> -smp 2,sockets=2,cores=1,threads=1 \
> -drive file=/home/vms/fc22.img,format=raw,if=none,id=drive-virtio-disk0 \
> --display none \
> -serial telnet:127.0.0.1:4444,server,nowait -cpu Skylake-Client-IBRS,ss=on,hypervisor=on,tsc_adjust=on,clflushopt=on,xsaves=on,pdpe1gb=on,ibpb=on \
> -m 4096 \
> -realtime mlock=off \
> -machine pc-i440fx-2.3,accel=kvm,usb=off,dump-guest-core=off \
> -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x6.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x6 \
> -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x6.0x1 \
> -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x6.0x2 \
> -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on
>
>
> [avagin@laptop linux-next]$ cat /proc/cpuinfo
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 78
> model name : Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
> stepping : 3
> microcode : 0xc2
> cpu MHz : 1213.986
> cache size : 3072 KB
> physical id : 0
> siblings : 4
> core id : 0
> cpu cores : 2
> apicid : 0
> initial apicid : 0
> fpu : yes
> fpu_exception : yes
> cpuid level : 22
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves ibpb ibrs stibp dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
> bugs : cpu_meltdown spectre_v1 spectre_v2
> bogomips : 4992.00
> clflush size : 64
> cache_alignment : 64
> address sizes : 39 bits physical, 48 bits virtual
> power management:
>
>>
>> Thank you,
>> Pavel
>> On Fri, May 4, 2018 at 4:27 AM Andrei Vagin wrote:
>>
>>> Hello,
>>
>>> We have a robot which runs criu tests on linux-next kernels.
>>
>>> All tests passed on 4.17.0-rc3-next-20180502.
>>
>>> But the 4.17.0-rc3-next-20180504 kernel didn't boot.
>>
>>> git bisect points to this patch.
>>
>>> On Thu, Apr 26, 2018 at 04:26:19PM -0400, Pavel Tatashin wrote:
>>>> The following two bugs were reported by Fengguang Wu:
>>>>
>>>> kernel reboot-without-warning in early-boot stage, last printk:
>>>> early console in setup code
>>>>
>>>> http://lkml.kernel.org/r/20180418135300.inazvpxjxowogyge@wfg-t540p.sh.intel.com
>>
>>> The problem looks similar to this one.
>>
>>> [ 5.596975] devtmpfs: mounted
>>> [ 5.855754] Freeing unused kernel memory: 1704K
>>> [ 5.858162] Write protecting the kernel read-only data: 18432k
>>> [ 5.860772] Freeing unused kernel memory: 2012K
>>> [ 5.861838] Freeing unused kernel memory: 160K
>>> [ 5.862572] rodata_test: all tests were successful
>>> [ 5.866857] random: fast init done
>>> early console in setup code
>>> [ 0.000000] Linux version 4.17.0-rc3-00023-g7c4cc2d022a1 (avagin@laptop) (gcc version 8.0.1 20180324 (Red Hat 8.0.1-0.20) (GCC)) #13 SMP Fri May 4 01:10:51 PDT 2018
>>> [ 0.000000] Command line: root=/dev/vda2 ro debug console=ttyS0,115200 LANG=en_US.UTF-8 slub_debug=FZP raid=noautodetect selinux=0 earlyprintk=serial,ttyS0,115200
>>> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
>>> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
>>> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
>>> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
>>
>>> $ git describe HEAD
>>> v4.17-rc3-23-g7c4cc2d022a1
>>
>>> [avagin@laptop linux-next]$ git log --pretty=oneline | head -n 1
>>> 7c4cc2d022a1fd56eb2ee555533b8666bc780f1e mm: access to uninitialized struct page
>>
>>
>>>>
>>>> And, also:
>>>> [per_cpu_ptr_to_phys] PANIC: early exception 0x0d
>>>> IP 10:ffffffffa892f15f error 0 cr2 0xffff88001fbff000
>>>>
>>>> http://lkml.kernel.org/r/20180419013128.iurzouiqxvcnpbvz@wfg-t540p.sh.intel.com
>>>>
>>>> Both of the problems are due to accessing uninitialized struct page from
>>>> trap_init(). We must first do mm_init() in order to initialize allocated
>>>> struct pages, and then we can access fields of any struct page that belongs
>>>> to memory that's been allocated.
>>>>
>>>> Below is an explanation of the root cause.
>>>>
>>>> The issue arises in this stack:
>>>>
>>>> start_kernel()
>>>>  trap_init()
>>>>   setup_cpu_entry_areas()
>>>>    setup_cpu_entry_area(cpu)
>>>>     get_cpu_gdt_paddr(cpu)
>>>>      per_cpu_ptr_to_phys(addr)
>>>>       pcpu_addr_to_page(addr)
>>>>        virt_to_page(addr)
>>>>         pfn_to_page(__pa(addr) >> PAGE_SHIFT)
>>>> The returned "struct page" is sometimes uninitialized, and thus
>>>> fails later when used. It happens only sometimes because it depends
>>>> on KASLR.
>>>>
>>>> When boot is failing we have this when pfn_to_page() is called:
>>>> kaslr: 0x000000000d600000
>>>> addr: ffffffff83e0d000
>>>> pa: 1040d000
>>>> pfn: 1040d
>>>> page: ffff88001f113340
>>>> page->flags ffffffffffffffff <- Uninitialized!
>>>>
>>>> When boot is successful:
>>>> kaslr: 0x000000000a800000
>>>> addr: ffffffff83e0d000
>>>> pa: d60d000
>>>> pfn: d60d
>>>> page: ffff88001f05b340
>>>> page->flags 280000000000 <- Initialized!
>>>>
>>>> Here are physical addresses that BIOS provided to us:
>>>> e820: BIOS-provided physical RAM map:
>>>> BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
>>>> BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
>>>> BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
>>>> BIOS-e820: [mem 0x0000000000100000-0x000000001ffdffff] usable
>>>> BIOS-e820: [mem 0x000000001ffe0000-0x000000001fffffff] reserved
>>>> BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
>>>> BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
>>>>
>>>> In both cases, working and non-working, the real physical address is
>>>> the same:
>>>>
>>>> pa - kaslr = 0x2E0D000
>>>>
>>>> The only thing that is different is PFN.
>>>>
>>>> We initialize struct pages in four places:
>>>>
>>>> 1. Early in boot a small set of struct pages is initialized to fill
>>>> the first section, and lower zones.
>>>> 2. During mm_init() we initialize "struct pages" for all the memory
>>>> that is allocated, i.e. reserved in memblock.
>>>> 3. Using on-demand logic when pages are allocated after the mm_init() call.
>>>> 4. After smp_init(), when the rest of the free deferred pages are initialized.
>>>>
>>>> The above path happens before deferred memory is initialized, and thus
>>>> it must be covered either by 1, 2 or 3.
>>>>
>>>> So, let's check what PFNs are initialized after (1).
>>>>
>>>> memmap_init_zone() is called for pfn ranges:
>>>> 1 - 1000, and 1000 - 1ffe0, but it quits after reaching pfn 0x10000,
>>>> as it leaves the rest to be initialized as deferred pages.
>>>>
>>>> In the working scenario pfn ended up being below 0x10000, but in the
>>>> failing scenario it is above. Hence, we must initialize this page in
>>>> (2). But trap_init() is called before mm_init().
>>>>
>>>> The bug was introduced by "mm: initialize pages on demand during boot"
>>>> because we lowered the number of pages that are initialized in step
>>>> (1). But it could still happen, because the number of initialized
>>>> pages was a guess.
>>>>
>>>> The current fix moves trap_init() to be called after mm_init(), but as
>>>> an alternative, we could increase pgdat->static_init_pgcnt:
>>>> In free_area_init_node() we can increase:
>>>> pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
>>>> pgdat->node_spanned_pages);
>>>> Instead of one PAGES_PER_SECTION, set several, so the text is
>>>> covered for all KASLR offsets. But this would still be guessing.
>>>> Therefore, I prefer the current fix.
>>>>
>>>> Fixes: c9e97a1997fb ("mm: initialize pages on demand during boot")
>>>>
>>>> Signed-off-by: Pavel Tatashin
>>>> Reviewed-by: Steven Rostedt (VMware)
>>>> ---
>>>> init/main.c | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/init/main.c b/init/main.c
>>>> index b795aa341a3a..870f75581cea 100644
>>>> --- a/init/main.c
>>>> +++ b/init/main.c
>>>> @@ -585,8 +585,8 @@ asmlinkage __visible void __init start_kernel(void)
>>>>  	setup_log_buf(0);
>>>>  	vfs_caches_init_early();
>>>>  	sort_main_extable();
>>>> -	trap_init();
>>>>  	mm_init();
>>>> +	trap_init();
>>>>
>>>>  	ftrace_init();
>>>>
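
For anyone following the arithmetic in the patch description, the pfn values can be checked with a small user-space sketch. This is not kernel code. PAGE_SHIFT is assumed to be 12 (4 KiB pages on x86-64), the 0x10000 cutoff is the pfn at which the description says memmap_init_zone() quits, and the names EARLY_INIT_END_PFN, PA_TO_PFN(), PA_MINUS_KASLR() and pfn_early_initialized() are hypothetical, introduced only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as on x86-64. */
#define PAGE_SHIFT 12

/*
 * Assumption: pfn at which early struct page initialization stopped
 * in the reported boots (taken from the patch description).
 */
#define EARLY_INIT_END_PFN 0x10000ULL

/* Physical address to page frame number. */
#define PA_TO_PFN(pa) ((uint64_t)(pa) >> PAGE_SHIFT)

/* Physical address with the KASLR offset removed. */
#define PA_MINUS_KASLR(pa, kaslr) ((uint64_t)(pa) - (uint64_t)(kaslr))

/*
 * Hypothetical helper: true if the pfn falls in the range whose
 * struct pages were already initialized early in boot (step 1).
 */
static inline bool pfn_early_initialized(uint64_t pfn)
{
	return pfn < EARLY_INIT_END_PFN;
}
```

With the values from the report, PA_TO_PFN(0x1040d000) is 0x1040d, which is not below the cutoff (the failing boot), while PA_TO_PFN(0x0d60d000) is 0xd60d, which is below it (the working boot). In both cases PA_MINUS_KASLR() yields 0x2e0d000, matching the "pa - kaslr" line in the description.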