From: Vladislav Bolkhovitin <4C07A442.1030502@vlnb.net>
Date: Thu, 03 Jun 2010 16:46:58 +0400
To: Boaz Harrosh
CC: James Bottomley, Christof Schmitt, "Martin K. Petersen", linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Chris Mason, Gennadiy Nerubayev
Subject: Re: Wrong DIF guard tag on ext2 write

Vladislav Bolkhovitin, on 06/03/2010 04:41 PM wrote:
> Boaz Harrosh, on 06/03/2010 04:07 PM wrote:
>> On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
>>> There's one interesting problem here, at least theoretically, with
>>> SCSI or similar transports which allow command queue depths >1 and
>>> are allowed to internally reorder queued requests.
>>> I don't know the FS/block layers sufficiently well to tell whether
>>> sending several requests for the same page is really possible or
>>> not, but we can see a real-life problem which is well explained if
>>> it is.
>>>
>>> The problem would arise if the second (rewrite) request (SCSI
>>> command) for the same page were queued to the corresponding device
>>> before the original request finished. Since the device is allowed to
>>> freely reorder requests, there is a chance that the original write
>>> request would hit the permanent storage *AFTER* the rewrite request,
>>> hence the data changes the rewrite is carrying would be lost:
>>> welcome, data corruption.
>>>
>> I might be totally wrong here, but I think NCQ can reorder sectors
>> but not writes. That is, if the sector is cached in device memory and
>> a later write comes to modify the same sector, then the original
>> should be replaced, not two values of the same sector kept in the
>> device cache at the same time.
>>
>> Failing to do so is a SCSI device problem.
>
> SCSI devices supporting the full task management model (almost all)
> and having the QUEUE ALGORITHM MODIFIER bits in the Control mode page
> set to 1 are allowed to freely reorder any commands with the SIMPLE
> task attribute. If an application wants to maintain the order of some
> commands for such devices, it must issue them with the ORDERED task
> attribute and over a _single_ MPIO path to the device.
>
> Linux neither uses the ORDERED attribute, nor honors or enforces the
> QUEUE ALGORITHM MODIFIER bits in any way, nor takes care to send
> commands with order dependencies (overlapping writes in our case) over
> a single MPIO path.
>
>> Please note that the page-to-sector mapping is not necessarily
>> constant. The same page might get written to a different sector next
>> time. But FSs will have to barrier in this case.
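[Editor's note: the lost-rewrite hazard described above can be sketched as a toy simulation. All names here are invented for illustration and are not from any real initiator or target code; the point is only that when two SIMPLE-attribute writes to the same LBA are in flight at once, a device that is free to reorder them may commit the older data last.]

```python
import random

def device_commit(storage, in_flight, reorder=True):
    """Commit queued writes; a reordering device may pick any order."""
    order = list(in_flight)
    if reorder:
        random.shuffle(order)  # SIMPLE-attribute commands: any order is legal
    for lba, data in order:
        storage[lba] = data

storage = {}
# The original write and its rewrite for the same LBA are queued
# concurrently (LBA taken from the DRBD log message below).
in_flight = [(144072784, b"old"), (144072784, b"new")]
device_commit(storage, in_flight)
# Depending on the order the device chose, storage[144072784] may now
# be b"old": the rewrite was silently lost.
```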
>>
>>> For single parallel SCSI or SAS devices such a race may look
>>> practically impossible, but for sophisticated clusters, where many
>>> nodes pretend to be a single SCSI device in a load-balancing
>>> configuration, it becomes very real.
>>>
>>> We can see the real-life problem in an active-active DRBD setup. In
>>> this configuration 2 nodes act as a single SCST-powered SCSI device,
>>> and they both run DRBD to keep their backstorage in sync. The
>>> initiator uses them as a single multipath device in an active-active
>>> round-robin load-balancing configuration, i.e. it sends requests to
>>> both nodes in parallel, and DRBD then takes care of replicating the
>>> requests to the other node.
>>>
>>> The problem is that sometimes DRBD complains about concurrent local
>>> writes, like:
>>>
>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
>>> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>>
>>> This message means that DRBD detected that both nodes received
>>> overlapping writes to the same block(s) and DRBD can't figure out
>>> which one to store. This is possible only if the initiator sent the
>>> second write request before the first one completed.
>>
>> It is totally possible in today's code.
>>
>> DRBD should store the original command_sn of the write and discard
>> the sector with the lower SN. It should appear as a single device
>> to the initiator.
>
> How can it find the SN? The commands were sent over _different_ MPIO
> paths to the device, so by the time they were sent all the ordering
> information had been lost.
>
> Unless SCSI is generally extended to preserve ordering information
> across MPIO paths in such configurations, the only way to maintain
> command order is queue draining. Hence, for safety, all initiators
> working with such devices must do it.
>
> But it looks like Linux doesn't do it, so it is unsafe with MPIO
> clusters?

I meant load-balancing MPIO clusters.
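[Editor's note: queue draining, the workaround named above, can be sketched as follows. This is a minimal illustration with invented helper names, not real initiator code: before issuing a write that overlaps an in-flight one, the initiator waits for the earlier command to complete, so no device- or path-level reordering can invert them.]

```python
class Initiator:
    """Toy initiator that drains overlapping writes before reissuing."""

    def __init__(self, device):
        self.device = device      # stands in for the target's storage
        self.in_flight = {}       # lba -> data of the pending write

    def write(self, lba, data):
        if lba in self.in_flight:
            self.drain(lba)       # overlapping write: drain the queue first
        self.in_flight[lba] = data

    def drain(self, lba):
        # Real code would block on command completion over the wire;
        # here we just commit the pending write synchronously.
        self.device[lba] = self.in_flight.pop(lba)

    def flush(self):
        for lba in list(self.in_flight):
            self.drain(lba)

dev = {}
ini = Initiator(dev)
ini.write(100, b"old")
ini.write(100, b"new")    # forces a drain of the first write
ini.flush()
# dev[100] is b"new": ordering is preserved regardless of how the
# device would have reordered two concurrently queued commands.
```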
Vlad