Date: Mon, 26 Aug 2013 13:15:59 -0700
Subject: Re: unused swap offset / bad page map.
From: Linus Torvalds
To: Dave Jones, Hillf Danton, Linux-MM, Linux Kernel, Hugh Dickins
In-Reply-To: <20130826190757.GB27768@redhat.com>

On Mon, Aug 26, 2013 at 12:08 PM, Dave Jones wrote:
>
> [ 4588.541886] swap_free: Unused swap offset entry 00002d15
> [ 4588.541952] BUG: Bad page map in process trinity-kid12 pte:005a2a80 pmd:22c01f067
>
> I can reproduce this pretty quickly by driving the system into swapping using
> a few instances of 'trinity -C64' (this creates 64 threads)
>
> I'm not sure how far back this bug goes, so I'll try some older kernels
> and see if I can bisect it, because we don't seem to be getting closer
> to figuring out what's actually happening..

Bisecting would indeed be good. But I get the feeling that you'll need
to go back a *long* time, because the swap_map[] code hasn't changed
in ages.

I'm adding Hugh Dickins to the cc just in case he hasn't seen this on
linux-mm, because the swap_map[] code is complex as hell, and Hugh did
touch some of it last.

The whole swap_map[] thing is complicated by:

 - it's a single byte per swap entry

 - it's not even a *structured* byte, but a single counter that has
   several "fields" maintained by hand

 - it has a count in the low 6 bits, with a magic "bad" value (which
   is also a magic "continuation" value if one of the high bits is set)

 - it has two magic bits: HAS_CACHE and CONTINUED

 - it has a _third_ magic value (SWAP_MAP_SHMEM) which is "CONTINUED+BAD"

 - we increment this nasty pseudo-counter wildly hackily, and have
   magic special case checks for the odd cases

and if we get any of the special cases wrong, we'll increment/decrement
it wrong, and we're screwed.

The *locking* looks pretty simple, though. It's a simple spinlock. We
do some optimistic tests outside the spinlock, but the actual
allocation and modification seem to all be inside the lock and
re-check any optimistic values afaik.

So I'm inclined to think that we are more likely to have something
wrong in the messy magical special cases.

I'm wondering if we should get rid of the continuation crap, for
example, and expand the "one byte per swap page" to two bytes instead.

Hugh, I think you know this code best, because you added the last
special case (that SWAP_MAP_SHMEM value). Comments?

              Linus
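
For anyone following along without the source open, the byte encoding listed above can be sketched in C. This is only an illustration: the constant values are what I recall from include/linux/swap.h of that era and should be treated as assumptions, and the demo_* names are hypothetical helpers, not kernel functions.

```c
/* Assumed values, as recalled from include/linux/swap.h around this time. */
#define SWAP_MAP_MAX    0x3e  /* largest count the byte itself can hold */
#define SWAP_MAP_BAD    0x3f  /* magic "bad" value in the count field */
#define SWAP_HAS_CACHE  0x40  /* magic bit: entry has a swap cache page */
#define COUNT_CONTINUED 0x80  /* magic bit: count continues elsewhere */
#define SWAP_MAP_SHMEM  0xbf  /* third magic value: CONTINUED + BAD */

/* Hypothetical classification of one swap_map[] byte. */
enum demo_kind {
    DEMO_UNUSED,      /* byte is zero: no users at all */
    DEMO_CACHE_ONLY,  /* only the HAS_CACHE bit is set */
    DEMO_BAD,         /* count field holds the "bad" value */
    DEMO_SHMEM,       /* the CONTINUED+BAD special case */
    DEMO_CONTINUED,   /* real count, overflowing into a continuation */
    DEMO_COUNTED      /* plain count that fits in the byte */
};

/* Hypothetical helper: drop the cache bit, keep count and CONTINUED. */
static unsigned char demo_swap_count(unsigned char ent)
{
    return (unsigned char)(ent & ~SWAP_HAS_CACHE);
}

/* Every branch below is one of the hand-rolled special cases. */
static enum demo_kind demo_classify(unsigned char ent)
{
    unsigned char count = demo_swap_count(ent);

    if (ent == 0)
        return DEMO_UNUSED;
    if (count == 0)
        return DEMO_CACHE_ONLY;
    if (count == SWAP_MAP_BAD)
        return DEMO_BAD;
    if (count == SWAP_MAP_SHMEM)   /* must be tested before CONTINUED */
        return DEMO_SHMEM;
    if (count & COUNT_CONTINUED)
        return DEMO_CONTINUED;
    return DEMO_COUNTED;
}
```

With these assumed values, demo_classify(0xbf) yields DEMO_SHMEM and demo_classify(0x83) yields DEMO_CONTINUED, and the only thing separating the two is the order of the checks. That is the kind of special case that is easy to get wrong.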