Date: Wed, 3 Dec 2014 14:40:49 -0600
From: Alex Thorlton
To: linux-kernel@vger.kernel.org
Cc: Andrew Morton, Peter Zijlstra, Fabian Frederick, Ingo Molnar,
    Alex Thorlton, Russ Anderson, linux-kernel@vger.kernel.org
Subject: [BUG] Possible locking issues in stop_machine code on 6k core machine
Message-ID: <20141203204048.GJ4720@sgi.com>

Hey guys,

While working to get our newly upgraded 6k core machine online, we've
discovered a few possible locking issues in the stop_machine code that
we're trying to get sorted out.

We think the problems we're seeing stem from an interaction between
stop_cpus and stop_one_cpu.  The issue presents as a deadlock, and it
only shows itself intermittently.  After quite a bit of debugging, we've
narrowed the issue down to the fact that stop_one_cpu does not respect
many of the locks that are taken on the stop_cpus code path.  For
reference, the stop_cpus code path takes the stop_cpus_mutex, then
stop_cpus_lock, and then takes each cpu's stopper->lock.  stop_one_cpu
relies solely on the stopper->lock.

What appears to be happening to cause our deadlock is this: stop_cpus
works its way down to queue_stop_cpus_work, which tells each cpu's
stopper task to wake up, take its lock, and do its work.  As the loop
that queues this work progresses, the lowest-numbered cpus complete
their work and are allowed to go on about their business.  The problem
occurs when one of these lower-numbered cpus calls stop_one_cpu,
targeting a higher-numbered cpu that the stop_cpus loop has not yet
reached.  If this happens, that higher-numbered cpu's completion
variable gets stomped on, and the wait_for_completion in the stop_cpus
code path never returns.

A quick example: CPU 0 calls stop_cpus, which will hit all 6,000 cores.
CPU 50 completes its stopper work and, at some point in the near future,
calls stop_one_cpu on CPU 5000.  This clobbers CPU 5000's pointer to the
cpu_stop_done struct set up in queue_stop_cpus_work, meaning that, once
CPU 5000 completes its work, it will decrement nr_todo on the wrong
cpu_stop_done struct, and CPU 0's wait_for_completion will never return.

Again, much of this is semi-educated guesswork, put together from
examining lots of debug output in an attempt to spot the problem.  We're
fairly certain that we've pinned down our issue, but we'd like to ask
those who are more knowledgeable about these code paths to weigh in with
their opinions here.  We'd really appreciate any help that anyone can
offer.

Thanks!

- Alex
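
P.S.  In case it helps to see the failure mode we're describing in
miniature, here's a toy userspace model of it.  To be clear, this is
*not* the real kernel/stop_machine.c code: the structs and helpers below
are stripped-down stand-ins (no locking, no stopper threads, no struct
completion), and NR_CPUS, stopper_runs, done_all, and done_one are made
up purely for the illustration.  It just shows how, if each cpu has a
single pending work slot, a later stop_one_cpu can stomp on the 'done'
pointer that stop_cpus is still waiting on:

/*
 * Toy model of the suspected race -- NOT actual kernel code.
 * Everything here is simplified so the pointer clobbering is
 * easy to see.
 */
#include <stdio.h>

#define NR_CPUS 8	/* stand-in for the 6k cores */

struct cpu_stop_done {
	int nr_todo;	/* cpus that still owe this waiter a completion */
};

struct cpu_stop_work {
	struct cpu_stop_done *done;	/* waiter to credit when work runs */
};

/* one pending work slot per cpu, modeled after queue_stop_cpus_work */
static struct cpu_stop_work work[NR_CPUS];

/* stop_cpus path: aim every cpu's pending work at one shared 'done' */
static void queue_stop_cpus_work(struct cpu_stop_done *done)
{
	done->nr_todo = NR_CPUS;
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		work[cpu].done = done;
}

/* stop_one_cpu path: aim one cpu's pending work at a private 'done' */
static void stop_one_cpu(int cpu, struct cpu_stop_done *done)
{
	done->nr_todo = 1;
	work[cpu].done = done;	/* stomps a still-pending pointer! */
}

/* the stopper task finally runs on 'cpu' and credits whoever is there */
static void stopper_runs(int cpu)
{
	work[cpu].done->nr_todo--;
}

int main(void)
{
	struct cpu_stop_done done_all = { 0 }, done_one = { 0 };

	queue_stop_cpus_work(&done_all);	/* "CPU 0" calls stop_cpus */

	/* the low-numbered cpus get to their work first... */
	for (int cpu = 0; cpu < 4; cpu++)
		stopper_runs(cpu);

	/* ...and one of them targets a cpu that hasn't run yet */
	stop_one_cpu(5, &done_one);

	/* the remaining stoppers run, but cpu 5 now credits done_one */
	for (int cpu = 4; cpu < NR_CPUS; cpu++)
		stopper_runs(cpu);

	/* done_all never reaches 0: wait_for_completion would hang */
	printf("done_all.nr_todo = %d\n", done_all.nr_todo);	/* 1 */
	printf("done_one.nr_todo = %d\n", done_one.nr_todo);	/* 0 */
	return 0;
}

Run it and done_all.nr_todo lands at 1 while done_one.nr_todo hits 0,
which is the analogue of what we're seeing: the stop_one_cpu caller
returns happily while CPU 0's wait_for_completion hangs forever.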