Message-ID: <4AB23B17.2040204@rtr.ca>
Date: Thu, 17 Sep 2009 09:35:19 -0400
From: Mark Lord
Organization: Real-Time Remedies Inc.
To: Tejun Heo
Cc: Chris Webb, Ric Wheeler, Andrei Tanas, NeilBrown, linux-kernel@vger.kernel.org, IDE/ATA development list, linux-scsi@vger.kernel.org, Jeff Garzik, Mark Lord
Subject: Re: MD/RAID time out writing superblock
References: <20090916222842.GB16053@arachsys.com> <4AB17905.90606@kernel.org>
In-Reply-To: <4AB17905.90606@kernel.org>

Tejun Heo wrote:
> Hello,
>
> Chris Webb wrote:
>> Hi Tejun. Thanks for following up to this. We've done some more
>> experimentation over the last couple of days based on your
>> suggestions and thoughts.
>>
>> Tejun Heo writes:
>>> Seriously, it's most likely a hardware malfunction although I can't tell
>>> where the problem is with the given data. Get the hardware fixed.
>> We know this isn't caused by a single faulty piece of hardware,
>> because we have a cluster of identical machines and all have shown
>> this behaviour. This doesn't mean that there isn't a hardware
>> problem, but if there is one, it's a design problem or firmware bug
>> affecting all of our hosts.
>
> If it's multiple machines, it's much less likely to be faulty drives,
> but if the machines are configured mostly identically, hardware
> problems can't be ruled out either.
>
>> There have also been a few reports of problems which look very
>> similar in this thread from people with somewhat different hardware
>> and drives to ours.
>
> I wouldn't connect the reported cases too eagerly at this point. Too
> many different causes end up showing similar symptoms especially with
> timeouts.
>
>>> The aboves are IDENTIFY. Who's issuing IDENTIFY regularly? It isn't
>>> from the regular IO paths or md. It's probably being issued via SG_IO
>>> from userland. These failures don't affect normal operation.
>> [...]
>>> Oooh, another possibility is the above continuous IDENTIFY tries.
>>> Doing things like that generally isn't a good idea because vendors
>>> don't expect IDENTIFY to be mixed regularly with normal IOs and
>>> firmwares aren't tested against that. Even smart commands sometimes
>>> cause problems. So, finding out the thing which is obsessed with the
>>> identity of the drive and stopping it might help.
>> We tracked this down to some (excessively frequent!) monitoring we
>> were doing using smartctl. Things were improved considerably by
>> stopping smartd and disabling all callers of smartctl, although it
>> doesn't appear to have been a cure. The frequency of these timeouts
>> during resync seems to have gone from about once every two hours to
>> about once a day, which means we've been able to complete some
>> resyncs whereas we were unable to before.
>
> That's interesting. One important side effect of issuing IDENTIFY is
> that they will serialize command streams as they are not NCQ commands
> and thus could change command patterns significantly.
..

SMART is the opcode that is most frequently implicated here, not IDENTIFY.
Note that even a barrier FLUSH CACHE is non NCQ and will serialize the stream.
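
A rough illustration of the serialization point, as a toy model only. It
is not how libata or any drive firmware actually schedules commands. The
function name count_batches, the command labels, and the queue depth of
32 are all assumptions made up for this sketch. It just counts how many
separate "issue rounds" a command sequence needs if queued commands can
share a round but every non-NCQ command must drain the queue and run alone.

```python
def count_batches(commands, depth=32):
    """Count issue rounds in a toy model of a queued command stream.

    Assumptions (hypothetical, for illustration only):
      - "ncq" commands may share a round, up to `depth` per round.
      - Any other label (e.g. "identify", "smart", "flush") is treated
        as non-NCQ: the open round must finish first, then the command
        runs in a round of its own.
    """
    batches = 0
    in_batch = 0  # queued commands in the currently open round
    for cmd in commands:
        if cmd == "ncq":
            if in_batch == depth:
                # Queue is full: close this round and start a new one.
                batches += 1
                in_batch = 0
            in_batch += 1
        else:
            # Non-NCQ command: drain whatever is queued ...
            if in_batch:
                batches += 1
                in_batch = 0
            # ... then issue the command by itself.
            batches += 1
    if in_batch:
        batches += 1  # close the final partially filled round
    return batches


if __name__ == "__main__":
    pure = ["ncq"] * 64
    mixed = (["ncq"] * 8 + ["flush"]) * 8  # same 64 writes, plus 8 flushes
    print(count_batches(pure), count_batches(mixed))
```

In this model the same 64 queued writes need far more rounds once a
non-NCQ command is interleaved every 8 writes, which is the sense in
which frequent SMART, IDENTIFY, or FLUSH CACHE traffic changes the
command pattern the drive sees.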

Cheers