Date: Wed, 10 Dec 2008 21:34:05 +0300
From: Vladislav Bolkhovitin
To: linux-scsi@vger.kernel.org
Cc: James Bottomley, Andrew Morton, FUJITA Tomonori, Mike Christie,
    Jeff Garzik, Boaz Harrosh, Linus Torvalds, linux-kernel@vger.kernel.org,
    scst-devel@lists.sourceforge.net, Bart Van Assche,
    "Nicholas A. Bellinger"
Subject: [PATCH][RFC 3/23]: SCST core docs

This patch contains the documentation for the SCST core.

Signed-off-by: Vladislav Bolkhovitin

---
 Documentation/scst/README.scst | 823 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 823 insertions(+)

diff -uprN orig/linux-2.6.27/Documentation/scst/README.scst linux-2.6.27/Documentation/scst/README.scst
--- orig/linux-2.6.27/Documentation/scst/README.scst
+++ linux-2.6.27/Documentation/scst/README.scst
@@ -0,0 +1,823 @@

Generic SCSI target mid-level for Linux (SCST)
==============================================

SCST is designed to provide a unified, consistent interface between
SCSI target drivers and the Linux kernel, and to simplify target driver
development as much as possible. A detailed description of SCST's
features and internals can be found in the "Generic SCSI Target Middle
Level for Linux" document on SCST's Internet page,
http://scst.sourceforge.net.

SCST supports the following I/O modes:

 * Pass-through mode with a one-to-many relationship, i.e. multiple
   initiators can connect to the exported pass-through devices, for
   the following SCSI device types: disks (type 0), tapes (type 1),
   processors (type 3), CDROMs (type 5), MO disks (type 7), medium
   changers (type 8) and RAID controllers (type 0xC)

 * FILEIO mode, which allows using files on file systems or block
   devices as virtual, remotely available SCSI disks or CDROMs with
   the benefits of the Linux page cache

 * BLOCKIO mode, which performs direct block I/O with a block device,
   bypassing the page cache for all operations.
   This mode works ideally with high-end storage HBAs and for
   applications that either do not need caching between the application
   and the disk or need large-block throughput

 * User space mode using the scst_user device handler, which allows
   implementing virtual SCSI devices in user space within the SCST
   environment

 * "Performance" device handlers, which provide, in a pseudo
   pass-through mode, a way to take direct performance measurements
   without the overhead of actually transferring data from/to the
   underlying SCSI device

In addition, SCST supports advanced per-initiator access and device
visibility management, so different initiators can see different sets
of devices with different access permissions. See below for details.

This is a quite stable (but still beta) version.

Installation
------------

To make your devices visible remotely, you need to add them to at
least the "Default" security group (see below for how). By default, no
local devices are visible remotely. There must be a LUN 0 in each
security group, i.e. LU numbering must not start from, e.g., 1.

It is highly recommended to use the scstadmin utility for configuring
devices and security groups.

If you experience problems during module load or operation, check your
kernel logs (or run the dmesg command for the most recent messages).
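For example, a minimal first-time setup could look like the sketch
below. Note that this is only a sketch: the handler module name
scst_vdisk is an assumption that depends on how you built and
installed SCST.

  # Load the SCST core and a device handler (module names assumed,
  # adjust to the handlers you actually built):
  modprobe scst
  modprobe scst_vdisk

  # Check that the load succeeded and the /proc interface appeared:
  dmesg | tail
  ls /proc/scsi_tgt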
IMPORTANT: Without the appropriate device handler loaded, the
=========  corresponding devices will be invisible to remote
           initiators, which can lead to holes in the LUN addressing,
           so automatic device scanning by the remote SCSI mid-level
           may not notice the devices. In that case you will have to
           add them manually via
           'echo "- - -" >/sys/class/scsi_host/hostX/scan',
           where X is the host number.

IMPORTANT: Running the target and the initiator on the same host is
=========  not supported. This is a limitation of the Linux
           memory/cache manager: an OOM deadlock is possible, e.g. the
           system needs some memory -> it decides to flush some cache
           -> the cache needs to write to a target-exported device ->
           the initiator sends a request to the target -> the target
           needs memory.

IMPORTANT: In the current version, simultaneous access to local SCSI
=========  devices via the standard high-level SCSI drivers (sd, st,
           sg, etc.) and SCST's target drivers is unsupported. This is
           especially important for commands executed via sg and st
           that change the state of devices and their parameters,
           because that could lead to data corruption. If any such
           command is issued, at least the related device handler(s)
           must be restarted. For block devices, READ/WRITE commands
           using the direct disk handler look safe.

Device handlers
---------------

Device-specific drivers (device handlers) are plugins for SCST which
help SCST analyze incoming requests and determine parameters specific
to various types of devices. If an appropriate device handler for a
SCSI device type isn't loaded, SCST doesn't know how to handle devices
of this type, so they will be invisible to remote initiators (more
precisely, the "LUN not supported" sense code will be returned).

In addition to device handlers for real devices, there are VDISK, user
space and "performance" device handlers.

The VDISK device handler works over files on file systems and turns
them into virtual, remotely available SCSI disks or CDROMs. In
addition, it can work directly over a block device, e.g. a local IDE
or SCSI disk or even a disk partition, without any file system
overhead. Compared to sending SCSI commands directly to the SCSI
mid-level via scsi_do_req()/scsi_execute_async(), using block devices
has the advantage that data are transferred via the system cache, so
it is possible to fully benefit from the caching and read-ahead
performed by Linux's VM subsystem. The only disadvantage is that in
FILEIO mode there is superfluous data copying between the cache and
SCST's buffers. This issue is going to be addressed in the next
release. Virtual CDROMs are useful for remote installation. See below
for details on how to set up and use the VDISK device handler.

The SCST user space device handler provides an interface between SCST
and user space, which allows creating pure user space devices. The
simplest example where one would want this is writing a VTL: with
scst_user it can be written purely in user space. Another example is
the need for sophisticated processing of the passed data that is too
complex for kernel space, like encrypting it or making snapshots.

The "performance" device handlers for disks, MO disks and tapes skip
(pretend to execute) all READ and WRITE operations in their exec()
method, and thus provide a way to take direct link performance
measurements without the overhead of actually transferring data
from/to the underlying SCSI device.

NOTE: Since the "perf" device handlers don't touch the commands' data
====  buffer on READ operations, it is returned to remote initiators
      as it was allocated, without even being zeroed. Thus, the "perf"
      device handlers pose some security risk, so use them with
      caution.

Compilation options
-------------------

There are the following compilation options, which can be changed
using your favorite kernel configuration Makefile target, e.g.
"make xconfig":

 - CONFIG_SCST_DEBUG - if defined, turns on some debugging code,
   including some logging. Makes the driver considerably bigger and
   slower, and produces a large amount of log data.

 - CONFIG_SCST_TRACING - if defined, turns on the ability to log
   events. Makes the driver considerably bigger and leads to some
   performance loss.

 - CONFIG_SCST_EXTRACHECKS - if defined, adds extra validity checks in
   various places.

 - CONFIG_SCST_USE_EXPECTED_VALUES - if not defined (default), the
   initiator-supplied expected data transfer length and direction are
   used only for verification purposes, to return an error or warn if
   one of them is invalid, and the values locally decoded from the
   SCSI command are used instead. This is necessary for security
   reasons: otherwise a faulty initiator could crash the target by
   supplying an invalid value in one of those parameters, which is
   especially important in pass-through mode. If
   CONFIG_SCST_USE_EXPECTED_VALUES is defined, the initiator-supplied
   expected data transfer length and direction override the locally
   decoded values. This might be necessary if SCST's internal command
   translation table doesn't contain a SCSI command that is used in
   your environment. You can detect that if you have messages like
   "Unknown opcode XX for YY. Should you update scst_scsi_op_table?"
   in your kernel log and your initiator returns an error. Please also
   report such messages to the SCST mailing list,
   scst-devel@lists.sourceforge.net. Note that not all SCSI transports
   support supplying the expected values.
 - CONFIG_SCST_DEBUG_TM - if defined, turns on debugging of the task
   management functions: some of the commands on LUN 0 in the default
   access control group will be delayed for about 60 sec., making the
   remote initiator send TM functions, e.g. ABORT TASK and TARGET
   RESET. Also define the CONFIG_SCST_TM_DBG_GO_OFFLINE symbol in the
   Makefile if you want the device to eventually become completely
   unresponsive; otherwise it will cycle through the ABORT and RESET
   code paths. Requires CONFIG_SCST_DEBUG to be turned on.

 - CONFIG_SCST_STRICT_SERIALIZING - if defined, makes SCST send all
   commands to the underlying SCSI device synchronously, one after
   another. This makes task management more reliable, at the cost of
   some performance penalty. This mostly matters for stateful SCSI
   devices like tapes, where the result of a command's execution
   depends on the device's settings defined by previous commands. Disk
   and RAID devices are stateless in most cases. The current SCSI core
   in Linux doesn't allow aborting all commands reliably if they were
   sent asynchronously to a stateful device. Turned off by default;
   turn it on if you use stateful device(s) and need as much error
   recovery reliability as possible. As a side effect, no kernel
   patching is necessary.

 - CONFIG_SCST_ALLOW_PASSTHROUGH_IO_SUBMIT_IN_SIRQ - if defined,
   pass-through commands may be submitted to real SCSI devices via the
   SCSI middle layer using the scsi_execute_async() function from soft
   IRQ context (tasklets). This used to be the default, but currently
   the SCSI middle layer seems to expect only thread context on the
   I/O submit path, so it is now disabled by default. Enabling it will
   decrease the number of context switches and improve performance. It
   is more or less safe: in the worst case, if in your configuration
   the SCSI middle layer really doesn't expect SIRQ context in the
   scsi_execute_async() function, you will get a warning message in
   the kernel log.

 - CONFIG_SCST_STRICT_SECURITY - if defined, makes SCST zero allocated
   data buffers. Leaving it undefined (the default) considerably
   improves performance and eases CPU load, but could create a
   security hole (information leakage), so enable it if you have
   strict security requirements.

 - CONFIG_SCST_ABORT_CONSIDER_FINISHED_TASKS_AS_NOT_EXISTING - if
   defined, then when the ABORT TASK task management function tries to
   abort a command that has already finished, the remote initiator
   that sent the ABORT TASK request will receive a TASK NOT EXIST (or
   ABORT FAILED) response to it. This is the more logical response:
   since the command has finished, the attempt to abort it failed. But
   some initiators, particularly the VMware iSCSI initiator, interpret
   a TASK NOT EXIST response as the target having gone crazy and try
   to RESET it, then sometimes go crazy themselves. So this option is
   disabled by default.

 - CONFIG_SCST_MEASURE_LATENCY - if defined, provides the average
   command processing latency in the /proc/scsi_tgt/latency file. You
   can clear the already measured results by writing 0 into this file.
   Note that you need a non-preemptible kernel for the results to be
   correct.

HIGHMEM kernel configurations are fully supported, but not recommended
for performance reasons, except for scst_user, where they are not
supported, because this module deals with user-supplied memory in a
zero-copy manner. If you need HIGHMEM, consider changing the VMSPLIT
option or using a 64-bit system configuration instead.

To change the VMSPLIT option (CONFIG_VMSPLIT, to be precise), set the
following variables in "make menuconfig":

 - General setup->Configure standard kernel features (for small
   systems): ON

 - General setup->Prompt for development and/or incomplete
   code/drivers: ON

 - Processor type and features->High Memory Support: OFF

 - Processor type and features->Memory split: according to the amount
   of memory you have. If it is less than 800MB, you may not need to
   touch this option at all.

Module parameters
-----------------

The scst module supports the following parameters:

 - scst_threads - sets the number of SCST's threads. By default it is
   the CPU count.

 - scst_max_cmd_mem - sets the maximum amount of memory in MB allowed
   to be consumed by SCST commands for data buffers at any given time.
   By default it is approximately TotalMem/4.
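For example, the following sketch loads the core with both parameters
set; the values are arbitrary illustrations, not recommendations:

  # Run 8 SCST threads and cap command data buffers at 512 MB:
  modprobe scst scst_threads=8 scst_max_cmd_mem=512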
SCST "/proc" commands
---------------------

For communication with user space programs, SCST provides a proc-based
interface in the "/proc/scsi_tgt" directory. It contains the following
entries:

 - "help" file, which provides online help for the SCST commands

 - "scsi_tgt" file, which on read provides information about the
   devices served by SCST and their dev handlers. On write it supports
   the following command:

   * "assign H:C:I:L HANDLER_NAME" assigns dev handler "HANDLER_NAME"
     to the device with host:channel:id:lun

 - "sessions" file, which lists the currently connected initiators
   (open sessions)

 - "sgv" file, which provides some statistics about the block sizes of
   commands coming from remote initiators and how effective the
   sgv_pool is in serving those allocations from the cache, i.e.
   without memory allocation requests to the kernel. "Size" is the
   command's data size rounded up to a power of 2, "Hit" is the number
   of allocations served from the cache, "Total" is the total number
   of allocations.

 - "threads" file, which allows reading and setting the number of
   SCST's threads

 - "version" file, which shows the version of SCST

 - "trace_level" file, which allows reading and setting the trace
   (logging) level for SCST. See the "help" file for the list of trace
   levels. If you want to enable logging options that produce a lot of
   events, like "debug", then in order not to lose logged events you
   should also:

   * Increase the CONFIG_LOG_BUF_SHIFT variable in the .config of your
     kernel to a much bigger value, then recompile it. For example, I
     use 25, but to use it I needed to modify the maximum allowed
     value for CONFIG_LOG_BUF_SHIFT in the corresponding Kconfig.

   * Change your /etc/syslog.conf, or the config file of whichever
     logging program you use, to store kernel logs asynchronously. For
     example, I added the line "kern.info -/var/log/kernel" to my
     rsyslog.conf and added "kern.none" to the line for
     /var/log/messages, so I had:
     "*.info;kern.none;mail.none;authpriv.none;cron.none /var/log/messages"

Each dev handler has its own subdirectory. Most dev handlers have only
two files in this subdirectory: "trace_level" and "type". The first
one is similar to the main SCST "trace_level" file; the latter shows
the SCSI type number of this handler as well as a short text
description.

For example, "echo "assign 1:0:1:0 dev_disk" >/proc/scsi_tgt/scsi_tgt"
will assign the device handler "dev_disk" to the real device sitting
on host 1, channel 0, ID 1, LUN 0.
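As another sketch of this interface (the value 8 below is an arbitrary
illustration):

  # Show the SCST version and the current number of threads:
  cat /proc/scsi_tgt/version
  cat /proc/scsi_tgt/threads

  # Set the number of SCST threads to 8:
  echo 8 >/proc/scsi_tgt/threads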
Access and devices visibility management (LUN masking)
------------------------------------------------------

Access and devices visibility management allows an initiator or a
group of initiators to have its own view of the LUs/LUNs (a security
group), each with the appropriate access permissions. It is highly
recommended to use the scstadmin utility for this purpose instead of
the low-level interface described in this section.

An initiator is represented as an SCST session. The session is bound
to a security group at registration time by the character string
"name" parameter of the registration function, which is provided by
the target driver based on its internal authentication. For example,
for FC the "name" could be the WWN or just the loop ID; for iSCSI it
could be the iSCSI login credentials or the iSCSI initiator name. Each
security group has a set of names assigned to it by the system
administrator, and a session is bound to the security group containing
the provided name. If no such group is found, the session is bound to
either the "Default_target_name" group or the "Default" group,
depending on whether "Default_target_name" exists. In
"Default_target_name", "target_name" means the name of the target.

In /proc/scsi_tgt each group is represented by a "groups/GROUP_NAME/"
subdirectory containing the files "devices" and "names". The file
"devices" lists all devices and their LUNs in the group; the file
"names" lists all names that should be bound to this group.

To configure access and devices visibility management, SCST accepts
the following commands written to files under /proc/scsi_tgt:

 - "add_group GROUP" to /proc/scsi_tgt/scsi_tgt adds group "GROUP"

 - "del_group GROUP" to /proc/scsi_tgt/scsi_tgt deletes group "GROUP"

 - "add H:C:I:L lun [READ_ONLY]" to
   /proc/scsi_tgt/groups/GROUP/devices adds the device with
   host:channel:id:lun as LUN "lun" in group "GROUP". Optionally, the
   device can be marked as read-only.

 - "del H:C:I:L" to /proc/scsi_tgt/groups/GROUP/devices deletes the
   device with host:channel:id:lun from group "GROUP"

 - "add V_NAME lun [READ_ONLY]" to /proc/scsi_tgt/groups/GROUP/devices
   adds the device with virtual name "V_NAME" as LUN "lun" in group
   "GROUP". Optionally, the device can be marked as read-only.

 - "del V_NAME" to /proc/scsi_tgt/groups/GROUP/devices deletes the
   device with virtual name "V_NAME" from group "GROUP"

 - "clear" to /proc/scsi_tgt/groups/GROUP/devices clears the list of
   devices for group "GROUP"

 - "add NAME" to /proc/scsi_tgt/groups/GROUP/names adds name "NAME" to
   group "GROUP"

 - "del NAME" to /proc/scsi_tgt/groups/GROUP/names deletes name "NAME"
   from group "GROUP"

 - "clear" to /proc/scsi_tgt/groups/GROUP/names clears the list of
   names for group "GROUP"

There must be a LUN 0 in each security group, i.e. LU numbering must
not start from, e.g., 1.

Examples:

 - "echo "add 1:0:1:0 0" >/proc/scsi_tgt/groups/Default/devices" will
   add the real device sitting on host 1, channel 0, ID 1, LUN 0 to
   the "Default" group as LUN 0.

 - "echo "add disk1 1" >/proc/scsi_tgt/groups/Default/devices" will
   add the virtual VDISK device named "disk1" to the "Default" group
   as LUN 1.
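Putting the commands above together, a complete setup of a dedicated
group for a single iSCSI initiator could look like the following
sketch; the group name, initiator name and devices are examples only:

  # Create a group and bind the initiator's iSCSI name to it:
  echo "add_group Host1" >/proc/scsi_tgt/scsi_tgt
  echo "add iqn.1998-01.com.example:host1" >/proc/scsi_tgt/groups/Host1/names

  # Export the virtual disk "disk1" as LUN 0 (LUN 0 must exist) and a
  # real pass-through device as read-only LUN 1:
  echo "add disk1 0" >/proc/scsi_tgt/groups/Host1/devices
  echo "add 1:0:1:0 1 READ_ONLY" >/proc/scsi_tgt/groups/Host1/devices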
VDISK device handler
--------------------

After loading, the VDISK device handler creates the subdirectories
"vdisk" and "vcdrom" in "/proc/scsi_tgt/". They have a similar layout:

 - "trace_level" and "type" files, as described for the other dev
   handlers

 - "help" file, which provides online help for the VDISK commands

 - "vdisk"/"vcdrom" files, which on read provide information about the
   currently open device files. On write they support the following
   commands:

   * "open NAME [PATH] [BLOCK_SIZE] [FLAGS]" - opens file "PATH" as
     device "NAME" with a block size of "BLOCK_SIZE" bytes and flags
     "FLAGS". "PATH" may be empty only for a VDISK CDROM. "BLOCK_SIZE"
     and "FLAGS" are valid only for a disk VDISK. The block size must
     be a power of 2 and >= 512 bytes. The default is 512. Possible
     flags:

     - WRITE_THROUGH - write-back caching disabled. Note, this option
       makes sense only if you also *manually* disable the write-back
       cache in *all* your backstorage devices and make sure it is
       actually disabled, since many devices are known to lie about
       this mode to get better benchmark results.

     - READ_ONLY - read-only

     - O_DIRECT - both read and write caching disabled. This mode
       isn't fully implemented yet; use the user space fileio_tgt
       program in O_DIRECT mode instead (see below).

     - NULLIO - in this mode no real I/O is done, but success is
       returned. Intended to be used for performance measurements, in
       the same way as the "*_perf" handlers.

     - NV_CACHE - enables "non-volatile cache" mode. In this mode it
       is assumed that the target has a GOOD UPS with the ability to
       cleanly shut down the target in case of a power failure, and
       that it is free of software/hardware bugs, i.e. all data from
       the target's cache are guaranteed to reach the media sooner or
       later. Hence, all data synchronization operations with the
       media, like SYNCHRONIZE_CACHE, are ignored in order to gain
       more performance. Also, in this mode the target reports to
       initiators that the corresponding device has a write-through
       cache, to disable all write-back cache workarounds used by
       initiators. Use with extreme caution: in this mode, after a
       crash of the target, journaled file systems don't guarantee
       consistency after journal recovery, so a manual fsck MUST be
       run. Note that since the journal barrier protection (see the
       "IMPORTANT" note below) is usually turned off, enabling
       NV_CACHE may change nothing from the data protection point of
       view, since no data synchronization operations with the media
       will come from the initiator. This option overrides
       WRITE_THROUGH.

     - BLOCKIO - enables block mode, which performs direct block I/O
       with a block device, bypassing the page cache for all
       operations. This mode works ideally with high-end storage HBAs
       and for applications that either do not need caching between
       the application and the disk or need large-block throughput.
       See also below.

     - REMOVABLE - with this flag set, the device is reported to
       remote initiators as removable.

   * "close NAME" - closes device "NAME".

   * "change NAME [PATH]" - changes the virtual CD in the VDISK CDROM.

By default, if neither the BLOCKIO nor the NULLIO option is supplied,
FILEIO mode is used.

For example, "echo "open disk1 /vdisks/disk1" >/proc/scsi_tgt/vdisk/vdisk"
will open the file /vdisks/disk1 as a virtual FILEIO disk named
"disk1".
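Similarly, a sketch of opening devices with an explicit block size and
flags (the names and paths are examples only):

  # Export the whole block device /dev/sdb in BLOCKIO mode with a
  # 512-byte block size:
  echo "open disk2 /dev/sdb 512 BLOCKIO" >/proc/scsi_tgt/vdisk/vdisk

  # Export an ISO image as a virtual CDROM:
  echo "open cdrom1 /vdisks/image.iso" >/proc/scsi_tgt/vcdrom/vcdrom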
CAUTION: If you partitioned/formatted your device with block size X,
======== *NEVER* try to export and then mount it (even accidentally)
         with another block size. Otherwise you can *instantly* damage
         it pretty badly, as well as all your data on it. Messages on
         the initiator like "attempt to access beyond end of device"
         are a sign of such damage.

         Moreover, if you want to compare how well different block
         sizes work for you, you **MUST** EVERY TIME AFTER CHANGING
         THE BLOCK SIZE **COMPLETELY** **WIPE OFF** ALL THE DATA FROM
         THE DEVICE. In other words, THE **WHOLE** DEVICE **MUST**
         CONTAIN ONLY **ZEROS** AS DATA AFTER YOU SWITCH TO A NEW
         BLOCK SIZE. Switching block sizes isn't like switching
         between FILEIO and BLOCKIO: after changing the block size,
         all data previously written with another block size MUST BE
         ERASED. Otherwise you will see a full set of very weird
         behaviors, because the block addressing will have changed,
         but in most cases the initiators will have no way to detect
         that the old addresses written on the device, e.g. in the
         partition table, no longer refer to what they are intended to
         refer to.

IMPORTANT: By default, for performance reasons, VDISK FILEIO devices
=========  use a write-back caching policy. This is generally safe
           from the point of view of the consistency of journaled file
           systems laid over them, but your unsaved cached data will
           be lost in case of a power/hardware/software failure, so
           you must supply your target server with some kind of UPS or
           disable write-back caching using the WRITE_THROUGH flag.
           You should also note that file system journaling over
           devices with write-back caching enabled works reliably
           *ONLY* if the order of journal writes is guaranteed, or if
           some kind of data protection barriers are used (i.e. after
           writing the journal data, some kind of synchronization with
           the media is performed); otherwise, because of possible
           reordering in the cache, even after a successful journal
           rollback you very much risk losing data on the FS.
           Currently, the Linux I/O subsystem guarantees the order of
           write operations only when using data protection barriers.
           Some information about this from the XFS point of view can
           be found at http://oss.sgi.com/projects/xfs/faq.html#wcache.
           On Linux initiators, for the EXT3 and ReiserFS file systems
           the barrier protection can be turned on using the
           "barrier=1" and "barrier=flush" mount options,
           respectively. Note that it is usually turned off by
           default, and the status of barrier usage isn't reported
           anywhere in the system logs, nor is there a (known) way to
           find it out for a mounted file system. Windows and, AFAIK,
           other UNIXes don't need any special explicit options and
           perform the necessary barrier actions on write-back caching
           devices by default. Also note that on some real-life
           workloads write-through caching might perform better than
           write-back caching with the barrier protection turned on.
           You should also realize that Linux doesn't guarantee that
           after sync()/fsync() all written data have really hit
           permanent storage; they can still be in the cache of your
           backstorage device and be lost on a power failure. Thus,
           even with write-through cache mode you still need a good
           UPS to protect yourself from data loss (note: data loss,
           not file system integrity corruption).
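For example, a sketch of enabling barriers on a Linux initiator with
the mount option named above (the device and mount point are examples
only):

  # EXT3 on an imported SCST device, with journal barriers enabled:
  mount -o barrier=1 /dev/sdb1 /mnt/data

  # Or the equivalent /etc/fstab line:
  # /dev/sdb1  /mnt/data  ext3  defaults,barrier=1  0  2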
IMPORTANT: Some disk and partition table management utilities don't
=========  support block sizes >512 bytes, so make sure that your
           favorite one supports them. Currently, only cfdisk is known
           to work only with 512-byte blocks; other utilities, like
           fdisk on Linux or the standard disk manager on Windows,
           have proved to work well with non-512-byte blocks. Note, if
           you export a disk file or device with a block size
           different from the one with which it was already
           partitioned, you may get various weird things, like
           utilities hanging or other unexpected behavior. Hence, to
           be sure, zero the exported file or device before the first
           access to it from a remote initiator with another block
           size. On a Windows initiator, make sure you "Set Signature"
           in the disk manager on the drive imported from the target
           before doing any other partitioning on it. After you have
           successfully mounted a file system over a non-512-byte
           block size device, the block size stops mattering: any
           program will work with files on such a file system.

BLOCKIO VDISK mode
------------------

This module works best for these types of scenarios:

1) Data that are not aligned to 4K sector boundaries, with <4K block
sizes in use; this is normally found in virtualization environments
where operating systems start partitions on odd sectors (Windows and
its sector 63).

2) Large-block data transfers, normally found in database loads/dumps
and streaming media.

3) Advanced relational database systems that perform their own
caching, which prefer or demand direct I/O access and, because of the
nature of their data access, can actually see worse performance with
indiscriminate caching.

4) Multiple layers of targets, where the secondary and higher layers
need a consistent view of the primary targets in order to preserve
data integrity, which a page cache backed I/O type might not provide
reliably.

It also has the advantage over FILEIO that it doesn't copy data
between the system cache and the commands' data buffers, so it saves a
considerable amount of CPU power and memory bandwidth.

IMPORTANT: Since data in BLOCKIO and FILEIO modes are not consistent
=========  between the two, if you try to use a device in both modes
           simultaneously, you will almost instantly corrupt your data
           on that device.

Pass-through mode
-----------------

In pass-through mode (i.e. using the pass-through device handlers
scst_disk, scst_tape, etc.), SCSI commands coming from remote
initiators are passed to the local SCSI hardware on the target as is,
without any modification. Like any other hardware, the local SCSI
hardware cannot handle commands whose amount of data and/or
scatter-gather segment count exceed some limits. Therefore, when using
pass-through mode, you should make sure that the values for the
maximum number of segments and the maximum amount of transferred data
per SCSI command on the initiators' devices are not bigger than the
corresponding values of the corresponding SCSI devices on the target.
Otherwise you will see symptoms like small transfers working well
while large ones stall, and messages like "Unable to complete command
due to SG IO count limitation" printed in the kernel logs.

You can't control the scatter-gather segment limit from user space,
but for block devices it is usually sufficient to set
/sys/block/DEVICE_NAME/queue/max_sectors_kb on the initiators to the
same or a lower value than
/sys/block/DEVICE_NAME/queue/max_hw_sectors_kb for the corresponding
devices on the target.

For non-block devices, SCSI commands are usually generated directly by
applications, so if you experience stalls of large transfers, you
should check the documentation of your application for how to limit
the transfer sizes.
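For example, a sketch of the block device case described above (the
device names are examples; the two commands run on different hosts):

  # On the target: read the hardware limit of the exported disk:
  cat /sys/block/sda/queue/max_hw_sectors_kb

  # On the initiator: cap the imported device (here sdb) to the same
  # or a lower value, e.g. 128:
  echo 128 >/sys/block/sdb/queue/max_sectors_kb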
User space mode using the scst_user dev handler
-----------------------------------------------

The user space program fileio_tgt uses the interface of the scst_user
dev handler and allows you to see how it works in various modes.
Fileio_tgt provides mostly the same functionality as the scst_vdisk
handler, with the most noticeable difference being that it supports
O_DIRECT mode. O_DIRECT mode is basically the same as BLOCKIO, but it
also supports files, so for some loads it can be significantly faster
than regular FILEIO access. Everything said above about BLOCKIO
applies to O_DIRECT as well. See fileio_tgt's README file for more
details.

Performance
-----------

Before doing any performance measurements, note that:

I. Performance results depend very much on your type of load, so it is
crucial that you choose the access mode (FILEIO, BLOCKIO, O_DIRECT,
pass-through) which suits your needs best.

II. In order to get the maximum performance you should:

1. For SCST:

 - Disable in the Makefile CONFIG_SCST_STRICT_SERIALIZING,
   CONFIG_SCST_EXTRACHECKS, CONFIG_SCST_TRACING, CONFIG_SCST_DEBUG*,
   CONFIG_SCST_STRICT_SECURITY

 - For pass-through devices, enable
   CONFIG_SCST_ALLOW_PASSTHROUGH_IO_SUBMIT_IN_SIRQ.

2. For target drivers:

 - Disable in the Makefiles CONFIG_SCST_EXTRACHECKS,
   CONFIG_SCST_TRACING, CONFIG_SCST_DEBUG*

3. For device handlers, including VDISK:

 - Disable in the Makefile CONFIG_SCST_TRACING and CONFIG_SCST_DEBUG.

 - If your initiator(s) use dedicated virtual SCSI devices exported
   from the target, and have at least as much memory as the target, it
   is recommended to use the O_DIRECT option (currently available only
   with the fileio_tgt user space program) or BLOCKIO. With them you
   can get up to a 100% increase in throughput.

IMPORTANT: Some of these compilation options are enabled by default,
=========  i.e. SCST is currently optimized for development and bug
           hunting rather than for performance.

If you use an SCST version taken directly from the SVN repository, you
can set the above options, except
CONFIG_SCST_ALLOW_PASSTHROUGH_IO_SUBMIT_IN_SIRQ, using the debug2perf
Makefile target.

4. For other target and initiator software parts:

 - Don't enable debug/hacking features in the kernel, i.e. use it as
   it is by default.

 - The default kernel read-ahead and queuing settings are optimized
   for locally attached disks, so they are not optimal for remotely
   attached ones (the SCSI target case), which can sometimes lead to
   unexpectedly low throughput. You should increase the read-ahead
   size to at least 512KB, or even more, on all initiators and on the
   target.

   You should also limit the maximum number of sectors per SCSI
   command on all initiators. To do so on Linux initiators, run:

   echo 64 >/sys/block/sdX/queue/max_sectors_kb

   where instead of X you specify the letter of your device imported
   from the target, like 'b', i.e. sdb.

   To increase the read-ahead size on Linux, run:

   blockdev --setra N /dev/sdX

   where N is the read-ahead size in 512-byte sectors and X is a
   device letter as above.

   Note: you need to set the read-ahead setting for device sdX again
   after you have changed the maximum number of sectors per SCSI
   command for that device (see the combined sketch after this list).

 - You may need to increase the number of requests that the OS on the
   initiator sends to the target device. To do so on Linux initiators,
   run:

   echo 64 >/sys/block/sdX/queue/nr_requests

   where X is a device letter as above.

   You may also experiment with other parameters in the
   /sys/block/sdX directory; they also affect performance. If you find
   the best values, please share them with us.

 - Use the CFQ I/O scheduler on the target. In most cases it has a
   performance advantage over the other I/O schedulers, sometimes a
   huge one (2+ times aggregate throughput increase).

 - It is recommended to turn kernel preemption off, i.e. set the
   kernel preemption model to "No Forced Preemption (Server)".

 - XFS looks like the best file system on the target for storing
   device files, because it allows considerably better linear write
   throughput than ext3.
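For example, a combined sketch of the initiator-side settings above,
applied in the required order (sdb and the values are examples only):

  # First limit the transfer size per SCSI command, then (re)set the
  # read-ahead to 512KB (1024 x 512-byte sectors):
  echo 64 >/sys/block/sdb/queue/max_sectors_kb
  blockdev --setra 1024 /dev/sdb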
5. For the hardware on the target:

 - Make sure that your target hardware (e.g. the target FC or network
   card) and the underlying I/O hardware (e.g. the I/O card, like
   SATA, SCSI or RAID, to which your disks are connected) don't share
   the same PCI bus. You can check this using the lspci utility (see
   the sketch after this item). They have to work in parallel, so it
   is better if they don't compete for the bus. The problem is not
   only the bandwidth they have to share, but also the interaction
   between the cards during that competition. This is very important:
   in some cases, if the target and backend storage controllers share
   the same PCI bus, it can lead to 5-10 times less performance than
   expected. Moreover, some motherboards (particularly by Supermicro)
   have serious stability issues if several high-speed devices work in
   parallel on the same bus. If you have no choice but to share the
   PCI bus, set the PCI latency in the BIOS as low as possible.
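A minimal sketch of such a check (how to interpret the bus numbers
depends on your PCI topology):

  # Show the PCI device tree; devices under the same bridge share
  # that bus:
  lspci -tv

  # Or compare the bus numbers (the first field, bus:dev.fn) of the
  # target and storage controllers directly:
  lspci | grep -iE "fibre|ethernet|raid|sata|scsi"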
6. If you use the VDISK I/O module in FILEIO mode, the NV_CACHE option
will provide the best performance. But when using it, make sure you
use a good UPS with the ability to shut down the target on power
failure.

IMPORTANT: If you use some versions of Windows (at least W2K) on the
=========  initiator, you can't get good write performance for VDISK
           FILEIO devices with the default 512-byte block size; you
           may get about 10% of the expected performance. This is
           because of the partition alignment, which is (simplifying)
           incompatible with how the Linux page cache works, so for
           each write the corresponding block must be read first. Use
           a 4096-byte block size for VDISK devices and you will get
           the expected write performance. Actually, any OS on the
           initiators, not only Windows, will benefit from a block
           size of max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS), where
           PAGE_SIZE is the page size and BLOCK_SIZE_ON_UNDERLYING_FS
           is the block size of the underlying FS on which the device
           file is located, or 0 if a device node is used. Both values
           are taken on the target. See also the important notes about
           setting block sizes >512 bytes for VDISK FILEIO devices
           above.

What if the target's backstorage is too slow
--------------------------------------------

If under high load you experience I/O stalls, or see abort or reset
messages in the kernel log on the target, then your backstorage is too
slow compared with your target link speed and the number of
simultaneously queued commands. On some seek-intensive workloads, even
fast disks or RAIDs which are able to serve a continuous data stream
at 500+ MB/s can be as slow as 0.3 MB/s. Another possible cause can be
MD/LVM/RAID on your target, as in http://lkml.org/lkml/2008/2/27/96
(check the whole thread as well).

Thus, in such situations simply processing one or more commands takes
too long, so the initiator decides that they are stuck on the target
and tries to recover. In particular, it is known that the default
number of simultaneously queued commands (48) is sometimes too high if
you do intensive writes from VMware on a target disk which uses LVM in
snapshot mode. In this case a value like 16, or even 8-10 depending on
your backstorage speed, could be more appropriate.

Unfortunately, SCST currently lacks dynamic I/O flow control, where
the queue depth on the target would be dynamically
decreased/increased based on how slow/fast the backstorage is compared
to the target link. So there are only 5 possible actions you can take
to work around or fix this issue:

1. Ignore incoming task management (TM) commands. This is fine if
there are not too many of them, so that average performance isn't hurt
and the corresponding device isn't put offline, i.e. if the
backstorage isn't too slow.

2. Decrease /sys/block/sdX/device/queue_depth on the initiator if it
is Linux (see below for how), and/or the SCST_MAX_TGT_DEV_COMMANDS
constant in the scst_priv.h file, until you stop seeing incoming TM
commands. The iSCSI-SCST driver also has its own iSCSI-specific
parameter for this.

3. Try to avoid such seek-intensive workloads.

4. Increase the speed of the target's backstorage.

5. Implement dynamic I/O flow control in SCST.

To decrease the device queue depth on Linux initiators, run:

# echo Y >/sys/block/sdX/device/queue_depth

where Y is the new number of simultaneously queued commands and X is
the letter of your imported device, like 'a' for the sda device. There
are no special limitations on the value of Y; it can be anything from
1 to the possible maximum (usually 32), so start by dividing the
current value by 2, i.e. set 16 if /sys/block/sdX/device/queue_depth
contains 32.

Note that logged messages about QUEUE_FULL status are quite different
in nature. They reflect normal operation, just the SCSI flow control
in action. Simply don't enable the "mgmt_minor" logging level; or,
alternatively, if you are confident in the worst-case performance of
your back-end storage, you can increase SCST_MAX_TGT_DEV_COMMANDS in
scst_priv.h to 64. Usually initiators don't try to push more commands
than that to the target.

Credits
-------

Thanks to:

 * Mark Buechler for a lot of useful suggestions, bug reports and help
   in debugging.

 * Ming Zhang for fixes and comments.

 * Nathaniel Clark for fixes and comments.

 * Calvin Morrow for testing and useful suggestions.

 * Hu Gang for the original version of the LSI target driver.

 * Erik Habbinga for fixes and support of the LSI target driver.

 * Ross S. W. Walker for the original block I/O code, and Vu Pham, who
   updated it for the VDISK dev handler.

 * Michael G. Byrnes for fixes.

 * Alessandro Premoli for fixes.

 * Nathan Bullock for fixes.

 * Terry Greeniaus for fixes.

 * Krzysztof Blaszkowski for many fixes and bug reports.

 * Jianxi Chen for fixing a problem with devices larger than 2TB.

 * Bart Van Assche for a lot of help.

Vladislav Bolkhovitin, http://scst.sourceforge.net