From: Jeff Lessem
Date: Fri, 09 Nov 2007 13:36:11 -0700
To: Dan Williams
CC: Bill Davidsen, BERTRAND Joël, Justin Piszcz, Neil Brown,
    linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org
Subject: Re: 2.6.23.1: mdadm/raid5 hung/d-state
Message-ID: <4734C4BB.1050000@Lessem.org>
References: <18222.16003.92062.970530@notabene.brown>
    <47303FB8.7000801@systella.fr>
    <1194398700.2970.18.camel@dwillia2-linux.ch.intel.com>
    <47314653.80905@Lessem.org> <47334B46.7000809@tmr.com>

Dan Williams wrote:
> On 11/8/07, Bill Davidsen wrote:
>> Jeff Lessem wrote:
>>> Dan Williams wrote:
>>>> The following patch, also attached, cleans up cases where the code looks
>>>> at sh->ops.pending when it should be looking at the consistent
>>>> stack-based snapshot of the operations flags.
>>> I tried this patch (against a stock 2.6.23), and it did not work for
>>> me. Not only did I/O to the affected RAID5 & XFS partition stop, but
>>> also I/O to all other disks.
>>> I was not able to capture any debugging
>>> information, but I should be able to do that tomorrow when I can hook
>>> a serial console to the machine.
>> That can't be good! This is worrisome because Joel is giddy with joy
>> because it fixes his iSCSI problems. I was going to try it with nbd, but
>> perhaps I'll wait a week or so and see if others have more information.
>> Applying patches before a holiday weekend is a good way to avoid time
>> off. :-(
>
> We need to see more information on the failure that Jeff is seeing,
> and whether it goes away with the two known patches applied. He
> applied this most recent patch against stock 2.6.23 which means that
> the platform was still open to the first biofill flags issue.

I applied both of the patches. The biofill one did not apply cleanly,
as it was adding biofill to one section and removing it from another,
but it appears that biofill does not need to be removed from a stock
2.6.23 kernel. The second patch applies with a slight offset, but no
errors.

I can report success so far with both patches applied. I created an
1100GB RAID5, formatted it as XFS, and successfully copied 895GB of
data onto it with "tar c | tar x". I'm also in the process of
rsync-ing the 895GB of data from the (slightly changed) original. In
the past, I would always get a hang within 0-50GB of data transfer.

For each drive in the RAID I also:

echo 128 > /sys/block/"$i"/queue/max_sectors_kb
echo 512 > /sys/block/"$i"/queue/nr_requests
echo 1 > /sys/block/"$i"/device/queue_depth
blockdev --setra 65536 /dev/md3
echo 16384 > /sys/block/md3/md/stripe_cache_size

These changes appear to improve performance, along with a RAID5 chunk
size of 1024k, but these changes alone (without the patches) do not
fix the problem.
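
For anyone wanting to repeat the tuning above, here is a minimal sketch
of how those settings might be wrapped in a script. The values are the
ones quoted above; the helper names (set_attr, tune_drive, tune_array)
and the SYSFS_ROOT variable are hypothetical and only there so the
script can be tried against a scratch directory first. It assumes the
per-drive settings go to each member disk, while readahead and
stripe_cache_size are set once on the md array.

```shell
#!/bin/sh
# Sketch only: helper names and SYSFS_ROOT are illustrative assumptions,
# not part of any existing tool. Values are the ones quoted in this mail.

SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Write value $1 to attribute file $2; report and skip if it is not writable.
set_attr() {
    if [ -w "$2" ]; then
        echo "$1" > "$2"
    else
        echo "skip: $2 is not writable" >&2
        return 1
    fi
}

# Per-drive settings; $1 is a member disk name such as sda.
tune_drive() {
    tune_dir="$SYSFS_ROOT/block/$1"
    set_attr 128 "$tune_dir/queue/max_sectors_kb"
    set_attr 512 "$tune_dir/queue/nr_requests"
    set_attr 1 "$tune_dir/device/queue_depth"
}

# Per-array settings; $1 is the md device name such as md3.
tune_array() {
    set_attr 16384 "$SYSFS_ROOT/block/$1/md/stripe_cache_size"
    # Readahead goes through blockdev, so only attempt it when the
    # device node actually exists and the tool is installed.
    if [ -b "/dev/$1" ] && command -v blockdev >/dev/null 2>&1; then
        blockdev --setra 65536 "/dev/$1"
    fi
}
```

Run as root, it would be used along the lines of
"for i in sda sdb sdc; do tune_drive "$i"; done; tune_array md3",
with the disk names replaced by the actual array members. As noted
above, this only affects performance and does not avoid the hang
without the patches.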