Date: Thu, 20 Nov 2014 22:58:26 +0100 (CET)
From: Thomas Gleixner
To: Tejun Heo
Cc: Frederic Weisbecker, Linus Torvalds, Dave Jones, Don Zickus,
    Linux Kernel, the arch/x86 maintainers, Peter Zijlstra,
    Andy Lutomirski, Arnaldo Carvalho de Melo
Subject: Re: frequent lockups in 3.18rc4
In-Reply-To: <20141120122339.GA14877@htj.dyndns.org>

On Thu, 20 Nov 2014, Tejun Heo wrote:
> On Thu, Nov 20, 2014 at 12:50:36AM +0100, Frederic Weisbecker wrote:
> > > Are we talking about different per cpu allocators here or am I
> > > missing something completely non obvious?
> >
> > That's the same allocator, yeah. So if the whole memory is
> > dereferenced, faults shouldn't happen indeed.
> >
> > Maybe that was a bug a few years ago, but not anymore.
>
> It has always been like that, though. Percpu memory given out is
> always populated and cleared.
>
> > Is it possible that, somehow, some part isn't zeroed by pcpu_alloc()?
> > After all it's allocated with vzalloc() so that part could be skipped.
> > The memset(0)
>
> The vzalloc call is for the internal allocation bitmap, not the actual
> percpu memory area. The actual address areas for percpu memory are
> obtained via a pcpu_get_vm_areas() call and later get populated using
> map_kernel_range_noflush() (the flush is performed after the mapping
> is complete).
>
> Trying to remember what happens with vmalloc_fault(). Ah okay, so when
> a new PUD gets created for the vmalloc area, we don't go through all
> PGDs and update them. The PGD entries get faulted in lazily. The
> percpu memory allocator clearing or not clearing the allocated area
> doesn't have anything to do with it. The memory area is always fully
> populated in the kernel page table. It's just that the population
> happened while a different PGD was active, and this PGD hasn't been
> populated with the new PUD yet.

It's completely undocumented behaviour, whether it has been that way
forever or not. And I agree with Frederic that it is insane. Actually
it's beyond insane, really.

> So, yeap, vmalloc_fault() can always happen when accessing vmalloc
> areas, and the only way to avoid that would be removing lazy PGD
> population - going through all PGDs and populating new PUDs
> immediately.

There is no requirement to go through ALL PGDs and populate that stuff
immediately. Let's look at the two types of allocations:

 1) Kernel percpu allocations

 2) Per process/task percpu allocations

Of course we do not have a way to distinguish those, but we really
should have one.

#1 Kernel percpu allocations usually happen in the context of driver
   bringup, subsystem initialization, interrupt setup etc. So this is
   functionality which is not a hotpath and usually requires some form
   of synchronization versus the rest of the system anyway.

   The percpu population stuff is serialized with a mutex anyway, so
   what's wrong with having a globally visible percpu sequence counter
   which is incremented whenever a new allocation is populated or torn
   down?
   We can make that sequence counter a percpu variable as well to avoid
   the issues of a global variable (preferably a compile/boot time
   allocated percpu variable, to avoid the obvious circulus vitiosus).

   Now, after that increment, the allocation side needs to wait for a
   scheduling cycle on all cpus (we have mechanisms for that).

   So in the scheduler, if the same task gets reselected, you check that
   sequence count and update the PGD if it differs. If a task switch
   happens, then you also need to check the sequence count and act
   accordingly. If we make the sequence counter a percpu variable as
   outlined above, the overhead of checking this is just noise versus
   the other nonsense we do in schedule().

#2 That's process related statistics and instrumentation stuff. That
   just needs an immediate population in process->mm->pgd aside of
   init_mm.pgd, but that's really not a big deal.

Of course that does not solve the issues we have with the current
infrastructure retroactively, but it allows us to avoid fuckups like
the one Frederic was talking about, where perf invented its own
kmalloc based 'percpu' replacement just to work around the shortcoming
in a particular place.

What really frightens me is the well hidden fuckup potential which
lurks around the corner and the hard to debug, once in a while fallout
which might be caused by it.

Thanks,

	tglx