Date: Thu, 3 Mar 2016 15:59:24 -0500
From: Paul Gortmaker
To: Borislav Petkov, Richard Purdie, Toshi Kani
CC: Bruce Ashfield, openembedded-core, "Hart, Darren", "saul.wold"
Subject: runtime regression with "x86/mm/pat: Emulate PAT when it is disabled"
Message-ID: <20160303205924.GA25222@windriver.com>
X-Mailing-List: linux-kernel@vger.kernel.org

So, the yocto folks moved from 4.1 to 4.4 and one of their automated
qemu x86-32 boot tests started failing. None of the yocto details seem
to matter, since I offered to help and I've reproduced it using 100%
mainline kernels and a generic distro toolchain as well.

The test case is slightly complicated, in that it relies on uvesafb
being modular, and so one has to juggle modules within an ext4 image
that qemu boots from. We tried making uvesafb builtin, but that made
the issue magically vanish. Given PAT, this isn't too surprising.

Richard did the preliminary investigation and analysis, and from that
I did a bisect, and found the commit in $SUBJECT to be the root cause,
as per the discussion here:

http://lists.openembedded.org/pipermail/openembedded-core/2016-March/118397.html

I'd mentioned the above to bpetkov on IRC and after confirming it was
still an issue on 4.5-rc6, he'd asked if I had a portable reproducer.
Not sure how complicated that would be, I set out to make one from my
build.
With a little LD_PRELOAD type magic and ensuring all the qemu
components are in ./ I have one that runs on an otherwise qemu-free
x86-64 box.

The standalone reproducer is here; launched in 00-runme:

http://openlinux.wrs.com/pat-splat/reproducer.tar.bz2

It is nothing fancy, just a generic yocto build of "sato" (gfx enabled
rootfs). When it "works" it boots to a UI touchscreen interface. When
it fails, you get a black screen with a blinking cursor (as seen in
"vncviewer localhost:0"). Upon failure, you can do --<2> to get to a
passwd-less root login ; there you can run dmesg and see the splat.

The image is currently using 4.5-rc6 ; but any kernel can be inserted:
run "make modules_install INSTALL_MOD_PATH=here" and then populate
those modules from "here" into /lib/modules of the loopback mounted
image. And of course update the bzImage on the qemu cmdline. Currently
it contains a bzImage and modules for 4.5-rc6 as I last tested that.

Also note that vncviewer will disconnect when it goes from early boot
80x25 to a higher res gfx mode; just reconnect and continue observing
the target.

I've ruled out yocto kernel changes, and yocto toolchain -- but maybe
it is a qemu issue this commit triggers ; who knows at this point.
Since I've NFI what component(s) cause this, I wanted to have the qemu
binary, all libraries etc as part of the reproducer and nothing left
to chance, and I've tested the reproducer on an ancient dual core w/o
vmx and w/o any qemu binaries installed. Bruce also tested it on a
slightly more modern dual socket xeon with vmx and confirmed it failed
there.

Inside there is a 00-runme ; mostly a copy of qemu args the yocto
automated tests were using. There is also everything the qemu binaries
need to run ; toplevel dir is noisy since qemu only looks in ./ it
seems. There is also an ext4.img ; as mentioned earlier, this only
happens when uvesafb.ko is a module, so one has to loopback mount that
image and repopulate /lib/modules/ for each boot test/bisect step.
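
For anyone repeating the bisect, the per-step module refresh described
above might be scripted roughly as follows. This is only a sketch: the
KSRC, STAGE, MNT and DRY_RUN variables, the ./mnt mount point and the
./bzImage destination are assumptions made for illustration (only
ext4.img and the "here" staging dir come from the text above), and the
tarball layout has not been checked against it. It defaults to a dry
run that just prints the commands.

```sh
#!/bin/sh
# Sketch: refresh /lib/modules in the reproducer image for one bisect step.
# ASSUMPTIONS (not from the original mail): KSRC, STAGE, MNT, DRY_RUN,
# the ./mnt mount point, and ./bzImage as the kernel qemu is pointed at.
set -eu

KSRC=${KSRC:-./linux}        # built kernel tree (hypothetical path)
IMG=${IMG:-./ext4.img}       # rootfs image shipped in the reproducer
STAGE=${STAGE:-$PWD/here}    # staging dir for modules_install
MNT=${MNT:-$PWD/mnt}         # loopback mount point (hypothetical)
DRY_RUN=${DRY_RUN:-1}        # 1 = only print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

refresh_modules() {
    # Stage the freshly built modules under $STAGE/lib/modules/<version>.
    run make -C "$KSRC" modules_install INSTALL_MOD_PATH="$STAGE"
    # Loopback mount the image and copy the staged modules into it.
    run mkdir -p "$MNT"
    run sudo mount -o loop "$IMG" "$MNT"
    run sudo cp -a "$STAGE/lib/modules" "$MNT/lib/"
    run sudo umount "$MNT"
    # Put the matching kernel where the qemu command line expects it.
    run cp "$KSRC/arch/x86/boot/bzImage" ./bzImage
}

refresh_modules
```

Run it once with the default DRY_RUN=1 to check the paths, then with
DRY_RUN=0 after each "git bisect good/bad" and rebuild.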

I've also included 00-bisect.txt as the output of git bisect log. And
there is also a 00-configs/ dir that has the ".config" kernel file for
each build done for the bisect, plus the latest mainline build (dir
names are "git describe" in here for easy correlation). The failing
commit in the subject is v4.1-rc5-22-g9cd25aac1f44.

My contribution here is largely a bisect that can be relied on and
providing a portable reproducer of the regression; I am by no means a
PAT expert ; Richard invested more time into actually understanding
the problem than I did, so I'm going to totally throw him under the
bus on this when it comes to considering the ultimate root cause and
possible fixes. :)

Paul.