Please review SCST, a new (although, perhaps, the oldest) SCSI target
framework for Linux, and 4 target drivers for it: for Qlogic
22xx/23xx cards (qla2x00t), for iSCSI (iscsi-scst), for Infiniband SRP
(srpt) and for local access to SCST-provided backends for the purpose of
testing and creating target drivers in user space (scst_local).
The activity on the mailing lists of the existing Linux SCSI storage
target implementations shows that many people are using Linux to build
networked storage solutions. We believe that a SCSI storage target
implementation should run inside the Linux kernel and not in user space.
It would be a great service to the users of SCSI target software to
include a mature and high-performance SCSI target framework in the
mainline kernel. We are convinced that SCST is the implementation that
is best suited for inclusion in the mainline Linux kernel. The current
SCST implementation and selected target drivers are posted here as a
series of patches to gather feedback about how inclusion in the mainline
kernel should proceed. Any comments and suggestions would be greatly
appreciated.
The posted modules are almost ready for inclusion into the kernel; the
main thing left to be done is changing the interface with user space
from procfs to sysfs. If a procfs-based interface were allowed for new
kernel modules, SCST and the above target drivers could be called fully
mainline ready. See the SCST proc interface patch below for a
description of the proposed sysfs interface replacement.
The strengths of SCST are explained below by comparing SCST with STGT.
Although we have a lot of respect for the STGT project and the people
who have worked on it, the STGT SCSI framework doesn't satisfy its users
in the following areas:
I. Performance.
Many questions have been raised about the performance of STGT in
particular. The architecture of STGT places the SCSI target state machine
and memory management in user space. This approach is poorly suited to
target drivers that have to be written in kernel space. In that case
incoming SCSI commands have to cross the user-kernel space boundary many
times during processing, with all the related overhead, which increases
command processing latency and limits the resulting performance. See,
for instance,
http://thread.gmane.org/gmane.linux.iscsi.tgt.devel/219 thread as well
as http://lkml.org/lkml/2008/1/29/387 or
http://lists.wpkg.org/pipermail/stgt/2007-December/001211.html.
Modern SCSI transports, e.g. Infiniband, have link latency in the range
of 1-3 microseconds. This is the same order of magnitude as the time
needed for a context switch on a modern system (between 1 and 4
microseconds) and only an order of magnitude more than the time needed
for an empty system call (about 0.1 microseconds). So, only ten empty
syscalls or one context switch add the same latency as the link. Even
1Gbps Ethernet has less than 100 microseconds of round-trip latency.
For instance, there was recently a comparison, using the SCST SRP target
driver and the NULLIO backend, between processing done in a thread and
processing done in SIRQ context (tasklet). The load was multithreaded 4K
random writes from a single initiator. The advantage of SIRQ processing
was quite impressive:
- 1 drive - 57% improvement (140K vs 89K IOPS)
- 2 drives - 51% improvement (155K vs 102K IOPS)
- 3 drives - 47% improvement (173K vs 118K IOPS)
Note that 173K IOPS on 4K blocks means ~700MB/s of throughput and 118K
means ~470MB/s. I.e., eliminating a few context switches from command
processing brings a 230MB/s increase!
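As a rough check of the figures above, the conversion from IOPS to
throughput can be sketched as follows. This is only an illustration: the
helper name throughput_mb_s is hypothetical, and it assumes that "4K"
means 4096-byte blocks and that MB means 10^6 bytes, so the results land
near, not exactly on, the rounded numbers quoted above.

```python
def throughput_mb_s(iops: int, block_bytes: int = 4096) -> float:
    """Convert an IOPS figure into throughput in decimal MB/s."""
    return iops * block_bytes / 1_000_000


# IOPS figures for the 3-drive case quoted above.
sirq = throughput_mb_s(173_000)    # processing in SIRQ context (tasklet)
thread = throughput_mb_s(118_000)  # processing in a thread

print(f"SIRQ:   {sirq:.0f} MB/s")
print(f"thread: {thread:.0f} MB/s")
print(f"gain:   {sirq - thread:.0f} MB/s")
```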
Another source of additional unavoidable latency with the user space
approach is copying data to and from the page cache. Modern memory has
about 2.5GB/s copy throughput. So, with a high speed interface, like
20Gbps Infiniband, memory copy almost doubles the overall latency. With
the fully kernel space approach, the page cache can be used directly in
a zero-copy manner, including by user space backend handlers (see below).
The only way a zero-copy cache can be implemented in STGT is to
completely duplicate the functionality of the page cache, including
read-ahead, in user space.
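The "almost doubles" estimate above can be reproduced with a small
back-of-the-envelope sketch. It takes the 20Gbps link rate and the
2.5GB/s copy throughput at face value (ignoring encoding and protocol
overhead); the helper name and the 64KB payload size are assumptions
chosen only for illustration.

```python
def transfer_time_us(num_bytes: int, bytes_per_second: float) -> float:
    """Time in microseconds needed to move num_bytes at the given rate."""
    return num_bytes / bytes_per_second * 1_000_000


LINK_BYTES_PER_S = 20e9 / 8   # 20Gbps link, taken at face value
COPY_BYTES_PER_S = 2.5e9      # memory copy throughput quoted above

payload = 64 * 1024           # hypothetical 64KB transfer
wire = transfer_time_us(payload, LINK_BYTES_PER_S)
copy = transfer_time_us(payload, COPY_BYTES_PER_S)

print(f"wire: {wire:.1f} us, copy: {copy:.1f} us, "
      f"total with one copy: {wire + copy:.1f} us "
      f"({(wire + copy) / wire:.1f}x)")
```

Since both rates come out at 2.5GB/s, a single extra copy costs as much
time as the wire transfer itself, whatever the payload size.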
In the past there have been objections to the above argument, claiming
that the latency of backing storage (e.g. rotating magnetic disks) is a
lot more than one microsecond, especially for seek-intensive workloads.
This objection is void, however: system operators know very well that in
order to reach acceptable performance, a network storage device has to
run in write-back mode and has to be equipped with sufficient RAM.
Nothing prevents a target from having 8 or even 64GB of cache, so most
accesses, even random ones, could be served from it.
II. Microkernel-like architecture and overall simplicity.
The general architecture of STGT doesn't fit well with the widely
acknowledged Linux paradigm of avoiding microkernel-like distributed
processing. This is because STGT has two parts: an in-kernel
"microkernel" and a user space part, which does most of the job, but not
all. For a SCSI target, especially one with a hardware target card, data
arrive in the kernel and are eventually served by the kernel, which does
the actual I/O or gets/puts data from/to the cache. Dividing request
processing between user and kernel space creates an unnecessary
interface boundary and effectively makes request processing distributed,
with all its complexity and reliability problems. As an example, what
will currently happen in STGT if the user space part suddenly dies? Will
the kernel part gracefully recover from it? How much effort will be
needed to implement that?
Yes, such an architecture requires less kernel code, but at the expense
of more complex user space code, a more complex interface between kernel
and user space and, hence, more complicated maintenance of the combined
code. See http://lkml.org/lkml/2007/4/24/364 for a perfect description
of why.
III. Complete pass-through mode.
When a SCSI target provides direct access to its backend SCSI devices in
pass-through mode, it must, to comply with SAM, intercept incoming
commands and emulate some functionality of a SCSI host. This is
necessary because backend devices see only a single nexus (the SCSI
target host), not each initiator as a separate nexus. See
http://thread.gmane.org/gmane.linux.scsi/31288 for more details.
In particular, local access from the target itself to the exported
backend SCSI devices should come from a separate nexus as well. To
achieve that, all commands to those devices should go through the SCSI
target engine so that it can perform all the necessary emulation.
Passing all local commands through a user space process is definitely
unacceptable, because it would ruin the performance of the system. But
with an in-kernel SCSI target it can be done simply and painlessly.
SCST addresses all the above issues. It has the best performance, solid
processing architecture and allows straightforward, high performance
implementation of the local nexus handling. But it also:
1. Has more target drivers, which cover most of SCSI transports: Fibre
Channel, iSCSI, Infiniband SRP, parallel SCSI and SAS (most likely).
2. Has additional features. Particularly:
- Faster backend handlers in user space. An architecture where the SCSI
target state machine and memory management are in the kernel allows
user space backend handlers to handle SCSI commands with 1
syscall/command, no additional context switches (no kernel threads
involved in processing) and no need to map/unmap user space pages. In
the future, it is possible to add a splice-like interface, which would
allow complete zero-copy with the page cache. (At the moment zero-copy
is only on the target side, i.e. between SCST and user space handlers;
the backend IO side with the cache has to be done by the handlers using
regular read()/write() calls, which copy data.)
- Near complete 1:N pass-through mode.
- Advanced per-initiator device visibility management (LUN masking),
which allows different initiators to see different sets of devices with
different access permissions. For instance, initiator A could see
devices X and Y exported from target T as read-writable, while initiator
B could see, from the same target T, device Y as read-only and device Z
as read-writable.
3. Is stable, mature and complete for all necessary basic functionality[*].
At least 3 companies have already designed and are selling products
with SCST-based engines (for instance,
http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=35&menu_section=34;
unfortunately, I don't have rights to name others) and at least 5
companies are developing their SCST-based products in areas of HA
solutions, VTLs, FC/iSCSI bridges, thin provisioning appliances, regular
SANs, etc. How many commercially sold STGT-based products are there?
You can also compare for yourself how widely SCST and STGT are used, and
how much of that use is in production, by simply looking at their
mailing list archives:
http://sourceforge.net/mailarchive/forum.php?forum_name=scst-devel for
SCST and http://lists.wpkg.org/pipermail/stgt for STGT. In the SCST
archive you can find many questions from people actively using it in
production or developing their own SCST-based products. In the STGT
archive you can find people implementing basic things, like a management
interface, and fixing basic problems, like a crash when there are more
than 40 LUNs per target (SCST has been successfully tested with >4000
LUNs). (STGT developers, no personal offense here, please.)
Neither STGT nor LIO can claim to be complete in the basic
functionality area. For instance, both lack support for transports that
don't supply expected transfer values, e.g. parallel SCSI/SAS, and
adding this feature requires major code surgery. But SCST can do
it *now*. Sure, there are still several areas for improvement (see,
e.g., http://scst.sourceforge.net/contributing.html), but those are
pretty advanced features, like zero-copy cache IO, persistent
reservations, dynamic flow control, etc.
Thus, we believe that SCST should replace STGT in the kernel. From the
kernel point of view STGT doesn't have any advantage over SCST, because
SCST also allows creating backend device handlers and target mode
drivers in user space. The only exception is the in-kernel target driver
for IBM virtual SCSI (ibmvscsi), which was written for STGT and doesn't
support SCST. So, until it is changed to work with SCST, I guess both
SCSI target frameworks should coexist.
The target driver for iSER is completely in user space, so it will not
be affected. Such drivers are the place where SCST and the user space
part of STGT can (and should) supplement each other. The user space part
of STGT is a good framework/library for creating user space target
drivers, and together with the scst_local driver SCST can be a good
backend for them. This would allow them to exploit all of SCST's
features, including pass-through mode and user space backend handlers.
The overhead of such an architecture would be (per command):
1. For pass-through mode - 2 context switches (to the backend thread and
back), which can be avoided (see below).
2. For vdisk mode (FILEIO/BLOCKIO) - the overhead of the SG/BSG driver
and the SCSI subsystem/block layer, plus 2 context switches, which can be
avoided (see below), minus STGT's own overhead, which it currently incurs
by submitting requests in this mode using pthreads (quite high, in fact).
3. For user space backend handlers - the overhead of the SG/BSG driver
and the SCSI subsystem/block layer, plus 2 context switches.
In all three cases data would be passed in a zero-copy manner; for user
space backend handlers, using shared memory.
In all these cases I don't count SCST overhead, because, basically, STGT
at the moment does about the same processing. If the duplicated code
were removed, this overhead would be almost completely eliminated.
The context switches referred to in (1) and (2) are necessary because at
the moment queuecommand() is called under host_lock with IRQs disabled.
If we can find a way to drop that lock and reenable IRQs, those context
switches wouldn't be needed.
Thus, since people who decide to do something in user space are willing
to accept some performance loss, the overhead of the SG/BSG driver and
the SCSI subsystem/block layer should be very acceptable for them.
This set of patches places SCST in drivers/scst. It isn't
drivers/scsi/scst, as one might expect, because, in fact, SCST shares
almost nothing with the Linux SCSI subsystem. The relation between them
is the same as between a client and a server: for instance, between
Firefox (HTTP client) and Apache (HTTP server), or between the NFS
client and server. How much code is shared between them? Near zero? For
instance, in Linux 2.6.27:
$ find fs/nfs/ -name '*.[ch]' | xargs wc -l
...
30004 total
$ find fs/nfsd/ -name '*.[ch]' | xargs wc -l
...
19905 total
$ find fs/nfs_common/ -name '*.[ch]' | xargs wc -l
...
277 total
I.e., the NFS client and server share only 277 lines of code out of
about 50000 (<0.6%).
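The shared-code percentage follows directly from the three wc totals
above; a minimal sketch of the arithmetic (the variable names are only
for illustration):

```python
# Line counts taken from the wc output above (Linux 2.6.27).
nfs_client = 30004   # fs/nfs/
nfs_server = 19905   # fs/nfsd/
nfs_common = 277     # fs/nfs_common/

total = nfs_client + nfs_server + nfs_common
shared_pct = nfs_common / total * 100

print(f"total lines: {total}")
print(f"shared:      {shared_pct:.2f}%")
```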
The same is true for SCSI target and initiator. The only code I can see
that could be shared between them is scsi_alloc_sgtable(), if it were
exported rather than static as at the moment. Using this function would
make scst_alloc() more robust, because it uses a mempool. scst_alloc()
is used for buffer allocation when processing corner cases.
Everything else, including memory management, serves different needs, so
making it shared would only make it more complicated for no gain.
In particular, none of the SCSI target code can be used by SCSI initiator
code. For example, the SCSI initiator subsystem deals with memory already
mapped from user space or allocated by the VM layer. Then, after the
corresponding command has completed, that memory can't be reused by it.
In contrast, the SCSI target subsystem always manages memory itself, and
reusing that memory after the corresponding command has completed is a
good chance to gain some performance. If you disagree and think that
there is much more code that can be shared, please be specific and
point out the *exact* functions to share.
STGT gives another example of why coupling the SCSI target and initiator
subsystems together is a bad idea. Suppose I build a general purpose
kernel, for which 1% of users would run target mode. I would have to
enable as modules "SCSI target support" as well as "SCSI target support
for transport attributes". Now the 99% of users of my kernel who don't
need SCSI target, but do need SCSI initiator drivers, would have to have
scsi_tgt loaded, because the transport attribute drivers would depend on
it:
# lsmod
Module                  Size  Used by
qla2xxx               130844  0
firmware_class          8064  1 qla2xxx
scsi_transport_fc      40900  1 qla2xxx
scsi_tgt               12196  1 scsi_transport_fc
brd                     6924  0
xfs                   511280  1
dm_mirror              24368  0
dm_mod                 51148  1 dm_mirror
uhci_hcd               21400  0
sg                     31784  0
e1000                 114536  0
pcspkr                  3328  0
No target functionality is needed, but the target mode subsystem is
still required. Is that a good design? SCST doesn't have this issue.
Finally, why SCST and not LIO? The most important reasons:
1. LIO is terribly overengineered, hence the code isn't clear and is very
hard (near to impossible, actually) to review and audit. Its interfaces
are a lot more complicated than SCST's, with basically the same
functionality.
2. LIO only supports a software iSCSI target and is iSCSI-centric by
design. There is a lot of work to do to allow a non-iSCSI target driver
to be run with no iSCSI-specific code loaded.
3. LIO supports neither user space backends nor user space target drivers.
...
Also, a few years ago one of the main reasons for rejecting the
core-iscsi iSCSI initiator from inclusion in the kernel was its support
for MC/S. Then, to be accepted, support for MC/S was removed from
open-iscsi. The LIO iSCSI target supports MC/S, and I'm not sure that
times have changed so much that MC/S has now become an advantage from the
kernel inclusion POV.
SCST, in contrast, has clear, well commented and documented (see
http://scst.sourceforge.net/scst_pg.html) code, supports many target
drivers, is transport-neutral by design, supports user space
backend/target drivers and has an iSCSI target driver that is not
overcomplicated by MC/S.
So, these patches add a completely new subsystem to Linux. The code
layout follows the NFS example referred to above: the client is in
drivers/scsi, the server is in drivers/scst. Possible future shared code
can be placed in drivers/scsi_common.
Currently SCST is maintained separately from the kernel, so the patches
sometimes contain code that isn't needed once SCST is included in the
kernel. This code will be removed in the next iteration.
These patches were run through checkpatch and sparse (huge thanks to
Bart!). See
http://sourceforge.net/mailarchive/forum.php?thread_name=e2e108260811261131k52d248bs10d52064273620e%40mail.gmail.com&forum_name=scst-devel.
All the warnings and errors are either false positives acknowledged by
the checkpatch authors, or acknowledged problems in the latest
development sparse (see http://lkml.org/lkml/2008/12/2/256 thread), or
harmless and acceptable in our opinion.
The patches in this iteration are made against 2.6.27.x. In the next
iteration they will be prepared against whichever kernel version we are
told to use (linux-next, I guess?).
The patch set is quite big, but self-contained, and makes only minimal,
straightforward changes in other parts of the kernel. The only exception
is the put_page_callback patch, which implements notifications in TCP
about completion of data transmission. This patch is an optimization
necessary to implement zero-copy data transfer by iSCSI-SCST from user
space backend handlers. But this patch isn't required. Without it, data
from the user space backend handlers will be transferred in the regular
way, with a data copy to TCP send buffers, i.e. in the same way as the
STGT iSCSI target driver does. Data from in-kernel backends will still
be transmitted zero-copy.
Here is the list of patches. They depend on each other; only
put_page_callback.diff is independent and optional.
[PATCH][RFC 1/23]: SCST public headers
[PATCH][RFC 2/23]: SCST core
[PATCH][RFC 3/23]: SCST core docs
[PATCH][RFC 4/23]: SCST debug support
[PATCH][RFC 5/23]: SCST /proc interface
[PATCH][RFC 6/23]: SCST SGV cache
[PATCH][RFC 7/23]: SCST integration into the kernel
[PATCH][RFC 8/23]: SCST pass-through backend handlers
[PATCH][RFC 9/23]: SCST virtual disk backend handler
[PATCH][RFC 10/23]: SCST user space backend handler
[PATCH][RFC 11/23]: Makefile for SCST backend handlers
[PATCH][RFC 12/23]: Patch to add necessary support for SCST pass-through
[PATCH][RFC 13/23]: Export of alloc_io_context() function
[PATCH][RFC 14/23]: Necessary functionality in qla2xxx driver to support
target mode
[PATCH][RFC 15/23]: Qlogic target driver
[PATCH][RFC 16/23]: Documentation for Qlogic target driver
[PATCH][RFC 17/23]: InfiniBand SRP target driver
[PATCH][RFC 18/23]: Documentation for SRP target driver
[PATCH][RFC 19/23]: scst_local target driver
[PATCH][RFC 20/23]: Documentation for scst_local driver
[PATCH][RFC 21/23]: iSCSI target driver
[PATCH][RFC 22/23]: Documentation for iSCSI-SCST
[PATCH][RFC 23/23]: Support for zero-copy TCP transmit of user space data
This patchset contains more than the 15 patches allowed. Sorry for that,
but I don't see how to make it smaller, except either by making each
particular patch bigger, or by excluding some target drivers.
See further comments in descriptions of each patch.
SCST home page is http://scst.sourceforge.net. You can always find the
latest complete SCST source code with additional target drivers in its
SVN repository by command:
$ svn co https://scst.svn.sourceforge.net/svnroot/scst/trunk
Thanks in advance,
Vlad
[*] The only area which can *possibly* be not fully complete is the
handling of various target hardware limitations, like transfer length
alignment, etc. But I think it's complete, because:
1. SCST allocates page aligned data buffers, so they should satisfy the
buffers alignment requirements.
2. It looks like if a SCSI card is modern, i.e. it is possible to
buy it from its manufacturer (hence it is worth writing a driver for
it), and it supports target mode, then it's sufficiently advanced to
have sane limitations, like support for full 64-bit addressing space, no
length alignment, etc. At least, so far I've not seen exceptions.
In Linux it has always been accepted that only features for real life
needs should be implemented, while "just in case" features with no real
life justification should be rejected. Hence, I will start thinking
about handling more hardware restrictions when there is such target
hardware, not earlier. Until then I'd prefer memory management
simplicity. See section 2 subsection 4 in SubmittingPatches: "Don't
over-design".
Also I don't believe that the lack of SG chaining support in SCST is
something to be fixed, because this is one of the areas where the
requirements for target and initiator are completely opposite. To
minimize latency, an initiator should build commands with as much data
to transfer as possible, but a target should transfer that data in
chunks as small as possible. This technique is called pipelining and is
widely used in many areas, especially in modern CPUs. All SCSI
transports I know, including iSCSI, Fibre Channel, SRP, parallel SCSI
and SAS, support pipelining. So, a per-page limit of 512K is more than
sufficient for a SCSI target. I can elaborate more, if someone's
interested.
P.S. SCST can also be used with non-SCSI transports, like AoE. This is
possible since those transports' commands can, AFAIK, be seen as a subset
of SCSI. So, the only thing necessary to use them with SCST is to
convert their internal commands to the corresponding SCSI commands, then
send them to SCST, and on the way back from SCST convert SCSI sense
codes to their internal status codes.
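To illustrate the kind of conversion meant here, below is a minimal
sketch that maps an ATA read/write command, such as AoE carries, to a
SCSI READ(16)/WRITE(16) CDB. It is not SCST code: the function and table
names are hypothetical, only two commands are mapped, and details such
as ATA's special meaning of a zero sector count and the reverse
sense-to-status conversion are left out.

```python
import struct

# ATA opcodes (as carried by AoE) and the SCSI opcodes they map to here.
ATA_READ_DMA_EXT = 0x25
ATA_WRITE_DMA_EXT = 0x35
SCSI_READ_16 = 0x88
SCSI_WRITE_16 = 0x8A

# Hypothetical translation table; a real one would cover more commands.
ATA_TO_SCSI = {
    ATA_READ_DMA_EXT: SCSI_READ_16,
    ATA_WRITE_DMA_EXT: SCSI_WRITE_16,
}


def ata_rw_to_cdb(ata_opcode: int, lba: int, sectors: int) -> bytes:
    """Build a 16-byte READ(16)/WRITE(16) CDB for an ATA read/write."""
    scsi_opcode = ATA_TO_SCSI[ata_opcode]  # KeyError for unmapped commands
    # CDB layout: opcode, flags, 8-byte LBA, 4-byte transfer length,
    # group number, control. Multi-byte fields are big-endian.
    return struct.pack(">BBQIBB", scsi_opcode, 0, lba, sectors, 0, 0)


cdb = ata_rw_to_cdb(ATA_READ_DMA_EXT, lba=0x1234, sectors=8)
print(cdb.hex())
```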
This patch contains declarations of all externally visible constants,
types and function prototypes.
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
include/scst/scst.h | 2676 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
include/scst/scst_const.h | 316 ++++++
2 files changed, 2992 insertions(+)
The patch is too big to be submitted inline. You can find it in
http://scst.sourceforge.net/patches/scst_public_headers.diff
This patch contains SCST core code.
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
drivers/scst/Kconfig | 256 ++
drivers/scst/Makefile | 12
drivers/scst/scst_cdbprobe.h | 519 +++++
drivers/scst/scst_lib.c | 3689 +++++++++++++++++++++++++++++++++++++
drivers/scst/scst_main.c | 1919 +++++++++++++++++++
drivers/scst/scst_module.c | 69
drivers/scst/scst_priv.h | 513 +++++
drivers/scst/scst_targ.c | 5458 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
8 files changed, 12435 insertions(+)
The patch is too big to be submitted inline. You can find it in
http://scst.sourceforge.net/patches/scst_core.diff
This patch contains documentation for SCST core.
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
Documentation/scst/README.scst | 823 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 823 insertions(+)
diff -uprN orig/linux-2.6.27/Documentation/scst/README.scst linux-2.6.27/Documentation/scst/README.scst
--- orig/linux-2.6.27/Documentation/scst/README.scst
+++ linux-2.6.27/Documentation/scst/README.scst
@@ -0,0 +1,823 @@
+Generic SCSI target mid-level for Linux (SCST)
+==============================================
+
+SCST is designed to provide unified, consistent interface between SCSI
+target drivers and Linux kernel and simplify target drivers development
+as much as possible. A detailed description of SCST's features and
+internals can be found in the "Generic SCSI Target Middle Level for
+Linux" document on SCST's Internet page http://scst.sourceforge.net.
+
+SCST supports the following I/O modes:
+
+ * Pass-through mode with one to many relationship, i.e. when multiple
+ initiators can connect to the exported pass-through devices, for
+ the following SCSI devices types: disks (type 0), tapes (type 1),
+ processors (type 3), CDROMs (type 5), MO disks (type 7), medium
+ changers (type 8) and RAID controllers (type 0xC)
+
+ * FILEIO mode, which allows using files on file systems or block
+ devices as virtual remotely available SCSI disks or CDROMs with
+ benefits of the Linux page cache
+
+ * BLOCKIO mode, which performs direct block IO with a block device,
+ bypassing page-cache for all operations. This mode works ideally with
+ high-end storage HBAs and for applications that either do not need
+ caching between application and disk or need large block
+ throughput
+
+ * User space mode using scst_user device handler, which allows
+ implementing virtual SCSI devices in user space in the SCST
+ environment
+
+ * "Performance" device handlers, which provide in pseudo pass-through
+ mode a way for direct performance measurements without overhead of
+ actual data transferring from/to underlying SCSI device
+
+In addition, SCST supports advanced per-initiator access and devices
+visibility management, so different initiators could see different set
+of devices with different access permissions. See below for details.
+
+This is a quite stable (but still beta) version.
+
+Installation
+------------
+
+To see your devices remotely, you need to add them to at least "Default"
+security group (see below how). By default, no local devices are seen
+remotely. There must be LUN 0 in each security group, i.e. LU
+numbering must not start from, e.g., 1.
+
+It is highly recommended to use scstadmin utility for configuring
+devices and security groups.
+
+If you experience problems during modules load or running, check your
+kernel logs (or run dmesg command for the few most recent messages).
+
+IMPORTANT: Without loading appropriate device handler, corresponding devices
+========= will be invisible for remote initiators, which could lead to holes
+ in the LUN addressing, so automatic device scanning by remote SCSI
+ mid-level could not notice the devices. Therefore you will have
+ to add them manually via
+ 'echo "- - -" >/sys/class/scsi_host/hostX/scan',
+ where X - is the host number.
+
+IMPORTANT: Working of target and initiator on the same host isn't
+========= supported. This is a limitation of the Linux memory/cache
+ manager, because in this case an OOM deadlock like: system
+ needs some memory -> it decides to clear some cache -> cache
+ needs to write on a target exported device -> initiator sends
+ request to the target -> target needs memory -> problem is
+ possible.
+
+IMPORTANT: In the current version simultaneous access to local SCSI devices
+========= via standard high-level SCSI drivers (sd, st, sg, etc.) and
+ SCST's target drivers is unsupported. Especially it is
+ important for execution via sg and st commands that change
+ the state of devices and their parameters, because that could
+ lead to data corruption. If any such command is done, at
+ least related device handler(s) must be restarted. For block
+ devices READ/WRITE commands using direct disk handler look to
+ be safe.
+
+Device handlers
+---------------
+
+Device specific drivers (device handlers) are plugins for SCST, which
+help SCST to analyze incoming requests and determine parameters,
+specific to various types of devices. If an appropriate device handler
+for a SCSI device type isn't loaded, SCST doesn't know how to handle
+devices of this type, so they will be invisible for remote initiators
+(more precisely, "LUN not supported" sense code will be returned).
+
+In addition to device handlers for real devices, there are VDISK, user
+space and "performance" device handlers.
+
+The VDISK device handler works over files on file systems and turns
+them into virtual remotely available SCSI disks or CDROMs. In addition,
+it can work directly over a block device, e.g. a local IDE or SCSI disk
+or even a disk partition, where there is no file system overhead. Using
+block devices, compared to sending SCSI commands directly to the SCSI
+mid-level via scsi_do_req()/scsi_execute_async(), has the advantage that
+data are transferred via the system cache, so it is possible to fully
+benefit from caching and read-ahead performed by Linux's VM subsystem.
+The only disadvantage is that in FILEIO mode there is superfluous data
+copying between the cache and SCST's buffers. This issue is going to be
+addressed in the next release. Virtual CDROMs are useful for remote
+installation. See below for details on how to set up and use the VDISK
+device handler.
+
+The SCST user space device handler provides an interface between SCST
+and user space, which allows creating pure user space devices. The
+simplest example where one would want it is writing a VTL: with
+scst_user it can be written purely in user space. Another is when the
+passed data need processing that is too sophisticated for kernel space,
+like encrypting them or making snapshots.
+
+"Performance" device handlers for disks, MO disks and tapes in their
+exec() method skip (pretend to execute) all READ and WRITE operations
+and thus provide a way for direct link performance measurements without
+overhead of actual data transferring from/to underlying SCSI device.
+
+NOTE: Since "perf" device handlers on READ operations don't touch the
+==== commands' data buffer, it is returned to remote initiators as it
+ was allocated, without even being zeroed. Thus, "perf" device
+ handlers impose some security risk, so use them with caution.
+
+Compilation options
+-------------------
+
+There are the following compilation options, which can be changed using
+your favorite kernel configuration Makefile target, e.g. "make xconfig":
+
+ - CONFIG_SCST_DEBUG - if defined, turns on some debugging code,
+ including some logging. Makes the driver considerably bigger and slower,
+ producing large amount of log data.
+
+ - CONFIG_SCST_TRACING - if defined, turns on ability to log events. Makes the
+ driver considerably bigger and leads to some performance loss.
+
+ - CONFIG_SCST_EXTRACHECKS - if defined, adds extra validity checks in
+ the various places.
+
+ - CONFIG_SCST_USE_EXPECTED_VALUES - if not defined (default), initiator
+ supplied expected data transfer length and direction will be used only for
+ verification purposes to return error or warn in case if one of them
+ is invalid. Instead, locally decoded from SCSI command values will be
+ used. This is necessary for security reasons, because otherwise a
+ faulty initiator can crash target by supplying invalid value in one
+ of those parameters. This is especially important in case of
+ pass-through mode. If CONFIG_SCST_USE_EXPECTED_VALUES is defined, initiator
+ supplied expected data transfer length and direction will override
+ the locally decoded values. This might be necessary if internal SCST
+ commands translation table doesn't contain SCSI command, which is
+ used in your environment. You can know that if you have messages like
+ "Unknown opcode XX for YY. Should you update scst_scsi_op_table?" in
+ your kernel log and your initiator returns an error. Also report
+ those messages in the SCST mailing list
+ [email protected]. Note, that not all SCSI transports
+ support supplying expected values.
+
+ - CONFIG_SCST_DEBUG_TM - if defined, turns on task management functions
+ debugging, when on LUN 0 in the default access control group some of the
+ commands will be delayed for about 60 sec., so making the remote
+ initiator send TM functions, e.g. ABORT TASK and TARGET RESET. Also
+ define CONFIG_SCST_TM_DBG_GO_OFFLINE symbol in the Makefile if you
+ want that the device eventually become completely unresponsive, or
+ otherwise to circle around ABORTs and RESETs code. Needs CONFIG_SCST_DEBUG
+ turned on.
+
+ - CONFIG_SCST_STRICT_SERIALIZING - if defined, makes SCST send all commands to
+ underlying SCSI device synchronously, one after one. This makes task
+ management more reliable, at the cost of some performance penalty. This
+ is mostly relevant for stateful SCSI devices like tapes, where the
+ result of a command's execution depends on device settings defined
+ by previous commands. Disk and RAID devices are stateless in most
+ cases. The current SCSI core in Linux doesn't allow aborting all
+ commands reliably if they are sent asynchronously to a stateful device.
+ Turned off by default, turn it on if you use stateful device(s) and
+ need as much error recovery reliability as possible. As a side
+ effect, no kernel patching is necessary.
+
+ - CONFIG_SCST_ALLOW_PASSTHROUGH_IO_SUBMIT_IN_SIRQ - if defined, allows
+ pass-through commands to be submitted to real SCSI devices via the SCSI
+ middle layer using the scsi_execute_async() function from soft IRQ
+ context (tasklets). This used to be the default, but the SCSI middle
+ layer now seems to expect only thread context on the IO submit path,
+ so it is disabled by default. Enabling it decreases the number of
+ context switches and improves performance. It is more or less safe:
+ in the worst case, if in your configuration the SCSI middle layer
+ really doesn't expect SIRQ context in scsi_execute_async(), you will
+ get a warning message in the kernel log.
+
+ - CONFIG_SCST_STRICT_SECURITY - if defined, makes SCST zero allocated data
+ buffers. Undefining it (default) considerably improves performance
+ and eases CPU load, but could create a security hole (information
+ leakage), so enable it if you have strict security requirements.
+
+ - CONFIG_SCST_ABORT_CONSIDER_FINISHED_TASKS_AS_NOT_EXISTING - if defined,
+ when the TASK MANAGEMENT function ABORT TASK tries to abort a command
+ that has already finished, the remote initiator that sent the ABORT
+ TASK request receives a TASK NOT EXIST (or ABORT FAILED) response.
+ This is the more logical response, because the command has finished
+ and so the attempt to abort it failed. However, some initiators,
+ particularly the VMware iSCSI initiator, treat a TASK NOT EXIST
+ response as a sign that the target has misbehaved and try to RESET
+ it, and then sometimes misbehave themselves. So this option is
+ disabled by default.
+
+ - CONFIG_SCST_MEASURE_LATENCY - if defined, provides the average command
+ processing latency in the /proc/scsi_tgt/latency file. You can clear
+ already measured results by writing 0 to this file. Note that you
+ need a non-preemptible kernel to get correct results.
+
+HIGHMEM kernel configurations are fully supported, but not recommended
+for performance reasons. The exception is scst_user, where they are not
+supported, because this module deals with user supplied memory in a
+zero-copy manner. If you need to use it, consider changing the VMSPLIT
+option or using a 64-bit system configuration instead.
+
+To change the VMSPLIT option (CONFIG_VMSPLIT to be precise), set the
+following options in "make menuconfig":
+
+ - General setup->Configure standard kernel features (for small systems): ON
+
+ - General setup->Prompt for development and/or incomplete code/drivers: ON
+
+ - Processor type and features->High Memory Support: OFF
+
+ - Processor type and features->Memory split: according to the amount of
+ memory you have. If it is less than 800MB, you don't need to touch
+ this option at all.
+
+Module parameters
+-----------------
+
+Module scst supports the following parameters:
+
+ - scst_threads - sets the number of SCST threads. The default is the
+ number of CPUs.
+
+ - scst_max_cmd_mem - sets the maximum amount of memory in MB that SCST
+ commands may consume for data buffers at any given time. The default
+ is approximately TotalMem/4.
+
+SCST "/proc" commands
+---------------------
+
+For communication with user space programs SCST provides a proc-based
+interface in the "/proc/scsi_tgt" directory. It contains the following
+entries:
+
+ - "help" file, which provides online help for SCST commands
+
+ - "scsi_tgt" file, which on read lists the devices served by SCST
+ and their dev handlers. On write it supports the following
+ command:
+
+ * "assign H:C:I:L HANDLER_NAME" assigns dev handler "HANDLER_NAME"
+ to the device with host:channel:id:lun
+
+ - "sessions" file, which lists currently connected initiators (open sessions)
+
+ - "sgv" file, which provides statistics about the block sizes of
+ commands coming from remote initiators and how effective sgv_pool is
+ at serving those allocations from the cache, i.e. without memory
+ allocation requests to the kernel. "Size" is the command's data
+ size rounded up to a power of 2, "Hit" is the number of allocations
+ served from the cache, "Total" is the total number of allocations.
+
+ - "threads" file, which allows reading and setting the number of SCST
+ threads
+
+ - "version" file, which shows version of SCST
+
+ - "trace_level" file, which allows reading and setting the trace (logging)
+ level for SCST. See the "help" file for the list of trace levels. If you
+ want to enable logging options that produce a lot of events, like
+ "debug", then to avoid losing logged events you should also:
+
+ * Increase the CONFIG_LOG_BUF_SHIFT variable in your kernel's .config
+ to a much bigger value, then recompile the kernel. For example, I use
+ 25, but to use it I needed to raise the maximum allowed value for
+ CONFIG_LOG_BUF_SHIFT in the corresponding Kconfig.
+
+ * Change /etc/syslog.conf, or the config file of your favorite
+ logging program, to store kernel logs asynchronously. For example,
+ I added the line "kern.info -/var/log/kernel" to my rsyslog.conf
+ and added "kern.none" to the line for /var/log/messages, so I had:
+ "*.info;kern.none;mail.none;authpriv.none;cron.none /var/log/messages"
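
The "Size", "Hit" and "Total" columns of the "sgv" file described above can be related with a small user space sketch. It assumes only what the description says (sizes rounded up to a power of 2, hits counted against the total). The helper names sgv_size_bucket() and sgv_hit_percent() are hypothetical and are not part of SCST.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Round a command's data size up to the next power of 2, as described
 * for the "Size" column. Hypothetical helper, not SCST code.
 */
size_t sgv_size_bucket(size_t bytes)
{
	size_t bucket = 1;

	/* The upper bound keeps the shift below from overflowing. */
	assert(bytes > 0 && bytes <= (SIZE_MAX >> 1) + 1);
	while (bucket < bytes)
		bucket <<= 1;
	return bucket;
}

/* Share of allocations served from the cache: "Hit" / "Total", in %. */
unsigned int sgv_hit_percent(unsigned long hit, unsigned long total)
{
	assert(hit <= total);
	if (total == 0)
		return 0;
	return (unsigned int)(hit * 100 / total);
}
```

With these helpers a 4097 byte command would be accounted in the 8192 bucket, and 75 hits out of 100 allocations would be a 75% hit rate.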
+
+Each dev handler has its own subdirectory. Most dev handlers have only two
+files in this subdirectory: "trace_level" and "type". The first one is
+similar to the main SCST "trace_level" file, the latter shows the SCSI type
+number of this handler as well as a text description.
+
+For example, "echo "assign 1:0:1:0 dev_disk" >/proc/scsi_tgt/scsi_tgt"
+will assign device handler "dev_disk" to real device sitting on host 1,
+channel 0, ID 1, LUN 0.
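
The same "assign" command could also be issued from a C program. The following is a minimal sketch, assuming only the command syntax shown above. scst_format_assign() and scst_write_cmd() are hypothetical helper names, not an SCST API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "assign H:C:I:L HANDLER" in buf. Returns 0 on success, -1 if
 * the command does not fit into the buffer. Hypothetical helper.
 */
int scst_format_assign(char *buf, size_t size, unsigned int host,
		       unsigned int channel, unsigned int id,
		       unsigned int lun, const char *handler)
{
	int n;

	assert(buf != NULL && handler != NULL && strlen(handler) > 0);
	n = snprintf(buf, size, "assign %u:%u:%u:%u %s",
		     host, channel, id, lun, handler);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Write one command line to a proc file, like "echo CMD >PATH". */
int scst_write_cmd(const char *path, const char *cmd)
{
	FILE *f = fopen(path, "w");
	int ret = 0;

	if (f == NULL)
		return -1;
	if (fprintf(f, "%s\n", cmd) < 0)
		ret = -1;
	if (fclose(f) != 0)
		ret = -1;
	return ret;
}
```

Calling scst_format_assign(buf, sizeof(buf), 1, 0, 1, 0, "dev_disk") and then scst_write_cmd("/proc/scsi_tgt/scsi_tgt", buf) is intended to have the same effect as the echo command above.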
+
+Access and devices visibility management (LUN masking)
+------------------------------------------------------
+
+Access and devices visibility management allows an initiator or
+group of initiators to have different views of LUs/LUNs (security groups),
+each with appropriate access permissions. It is highly recommended to
+use the scstadmin utility for that purpose instead of the low level
+interface described in this section.
+
+An initiator is represented as an SCST session. The session is bound to
+a security group at registration time by the character "name" parameter
+of the registration function, which is provided by the target driver
+based on its internal authentication. For example, for FC "name" could
+be the WWN or just the loop ID. For iSCSI it could be the iSCSI login
+credentials or the iSCSI initiator name. Each security group has a set
+of names assigned to it by the system administrator. A session is bound
+to the security group that contains the provided name. If no such group
+is found, the session is bound to either the "Default_target_name" or
+the "Default" group, depending on whether "Default_target_name" exists.
+In "Default_target_name", "target_name" stands for the name of the
+target.
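
The binding order described above (matching name first, then the per-target default group, then "Default") can be sketched in user space C. This is only an illustration of the lookup order under the assumptions stated in the text; struct sec_group, find_group_by_name() and select_group() are hypothetical and do not reflect SCST's internal data structures.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One security group: its name plus the initiator names assigned to it. */
struct sec_group {
	const char *name;
	const char *const *names;
	size_t n_names;
};

/* Find a group by its own name, or return NULL. */
const struct sec_group *find_group_by_name(const struct sec_group *groups,
					   size_t n, const char *name)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(groups[i].name, name) == 0)
			return &groups[i];
	return NULL;
}

/*
 * Pick the group for a session: a group listing the initiator name,
 * else "Default_<target>", else "Default" (NULL if none exists).
 */
const struct sec_group *select_group(const struct sec_group *groups, size_t n,
				     const char *initiator, const char *target)
{
	const struct sec_group *g;
	char def_name[128];
	size_t i, j;
	int len;

	assert(groups != NULL && initiator != NULL && target != NULL);

	for (i = 0; i < n; i++)
		for (j = 0; j < groups[i].n_names; j++)
			if (strcmp(groups[i].names[j], initiator) == 0)
				return &groups[i];

	len = snprintf(def_name, sizeof(def_name), "Default_%s", target);
	if (len >= 0 && (size_t)len < sizeof(def_name)) {
		g = find_group_by_name(groups, n, def_name);
		if (g != NULL)
			return g;
	}
	return find_group_by_name(groups, n, "Default");
}
```

In this sketch an unknown initiator connecting to target "tgt1" lands in "Default_tgt1" if that group exists, and in "Default" otherwise.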
+
+In /proc/scsi_tgt each group is represented as a "groups/GROUP_NAME/"
+subdirectory. It contains the files "devices" and "names". File
+"devices" lists all devices and their LUNs in the group, file "names"
+lists all names that should be bound to this group.
+
+To configure access and devices visibility management, SCST accepts the
+following commands written to files under /proc/scsi_tgt:
+
+ - "add_group GROUP" to /proc/scsi_tgt/scsi_tgt adds group "GROUP"
+
+ - "del_group GROUP" to /proc/scsi_tgt/scsi_tgt deletes group "GROUP"
+
+ - "add H:C:I:L lun [READ_ONLY]" to /proc/scsi_tgt/groups/GROUP/devices adds
+ device with host:channel:id:lun as LUN "lun" in group "GROUP". Optionally,
+ the device could be marked as read only.
+
+ - "del H:C:I:L" to /proc/scsi_tgt/groups/GROUP/devices deletes device with
+ host:channel:id:lun from group "GROUP"
+
+ - "add V_NAME lun [READ_ONLY]" to /proc/scsi_tgt/groups/GROUP/devices adds
+ device with virtual name "V_NAME" as LUN "lun" in group "GROUP".
+ Optionally, the device could be marked as read only.
+
+ - "del V_NAME" to /proc/scsi_tgt/groups/GROUP/devices deletes device with
+ virtual name "V_NAME" from group "GROUP"
+
+ - "clear" to /proc/scsi_tgt/groups/GROUP/devices clears the list of devices
+ for group "GROUP"
+
+ - "add NAME" to /proc/scsi_tgt/groups/GROUP/names adds name "NAME" to group
+ "GROUP"
+
+ - "del NAME" to /proc/scsi_tgt/groups/GROUP/names deletes name "NAME" from group
+ "GROUP"
+
+ - "clear" to /proc/scsi_tgt/groups/GROUP/names clears the list of names
+ for group "GROUP"
+
+There must be a LUN 0 in each security group, i.e. LU numbering must not
+start from, e.g., 1.
+
+Examples:
+
+ - "echo "add 1:0:1:0 0" >/proc/scsi_tgt/groups/Default/devices" will
+ add real device sitting on host 1, channel 0, ID 1, LUN 0 to "Default"
+ group with LUN 0.
+
+ - "echo "add disk1 1" >/proc/scsi_tgt/groups/Default/devices" will
+ add virtual VDISK device with name "disk1" to "Default" group
+ with LUN 1.
+
+VDISK device handler
+--------------------
+
+After loading, the VDISK device handler creates the subdirectories
+"vdisk" and "vcdrom" in "/proc/scsi_tgt/". They have a similar layout:
+
+ - "trace_level" and "type" files as described for other dev handlers
+
+ - "help" file, which provides online help for VDISK commands
+
+ - "vdisk"/"vcdrom" files, which on read provide information about
+ currently open device files. On write they support the following
+ commands:
+
+ * "open NAME [PATH] [BLOCK_SIZE] [FLAGS]" - opens file "PATH" as
+ device "NAME" with block size "BLOCK_SIZE" bytes with flags
+ "FLAGS". "PATH" could be empty only for VDISK CDROM. "BLOCK_SIZE"
+ and "FLAGS" are valid only for disk VDISK. The block size must be
+ power of 2 and >= 512 bytes. Default is 512. Possible flags:
+
+ - WRITE_THROUGH - write back caching disabled. Note that this option
+ makes sense only if you also *manually* disable the write-back cache
+ in *all* your backstorage devices and make sure it's actually
+ disabled, since many devices are known to lie about this mode to
+ get better benchmark results.
+
+ - READ_ONLY - read only
+
+ - O_DIRECT - both read and write caching disabled. This mode
+ isn't currently fully implemented, you should use user space
+ fileio_tgt program in O_DIRECT mode instead (see below).
+
+ - NULLIO - in this mode no real IO is done, but success is
+ returned. Intended for performance measurements in the same
+ way as the "*_perf" handlers.
+
+ - NV_CACHE - enables "non-volatile cache" mode. In this mode it is
+ assumed that the target has a GOOD UPS able to cleanly shut down
+ the target in case of power failure and that it is free of
+ software/hardware bugs, i.e. all data from the target's cache are
+ guaranteed to reach the media sooner or later. Hence all
+ operations that synchronize data with the media, like
+ SYNCHRONIZE_CACHE, are ignored in order to gain performance.
+ Also in this mode the target reports to initiators that the
+ corresponding device has a write-through cache, to disable all
+ write-back cache workarounds used by initiators. Use with
+ extreme caution, since in this mode after a crash of the target
+ journaled file systems don't guarantee consistency after
+ journal recovery, therefore a manual fsck MUST be run. Note that
+ since the journal barrier protection (see the "IMPORTANT" note
+ below) is usually turned off, enabling NV_CACHE could change
+ nothing from the data protection point of view, since no
+ operations that synchronize data with the media will come from
+ the initiator. This option overrides WRITE_THROUGH.
+
+ - BLOCKIO - enables block mode, which will perform direct block
+ IO with a block device, bypassing page-cache for all operations.
+ This mode works ideally with high-end storage HBAs and for
+ applications that either do not need caching between application
+ and disk or need the large block throughput. See also below.
+
+ - REMOVABLE - with this flag set the device is reported to remote
+ initiators as removable.
+
+ * "close NAME" - closes device "NAME".
+
+ * "change NAME [PATH]" - changes a virtual CD in the VDISK CDROM.
+
+By default, if neither BLOCKIO, nor NULLIO option is supplied, FILEIO
+mode is used.
+
+For example, "echo "open disk1 /vdisks/disk1" >/proc/scsi_tgt/vdisk/vdisk"
+will open file /vdisks/disk1 as virtual FILEIO disk with name "disk1".
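
The block size rule for the "open" command (a power of 2 and at least 512 bytes) is easy to check before writing the command. The sketch below assumes the command syntax shown above for disk VDISK devices; the helper names are hypothetical, and the flags string is passed through verbatim, so how several flags are separated is left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* A VDISK block size must be a power of 2 and at least 512 bytes. */
int vdisk_block_size_valid(unsigned int block_size)
{
	return block_size >= 512 && (block_size & (block_size - 1)) == 0;
}

/*
 * Build "open NAME PATH BLOCK_SIZE [FLAGS]" for a disk VDISK device.
 * flags may be NULL or "". Returns 0 on success, -1 on an invalid
 * block size or if the command does not fit. Hypothetical helper.
 */
int vdisk_format_open(char *buf, size_t size, const char *name,
		      const char *path, unsigned int block_size,
		      const char *flags)
{
	int n;

	assert(buf != NULL && name != NULL && path != NULL);
	if (!vdisk_block_size_valid(block_size))
		return -1;
	if (flags != NULL && strlen(flags) > 0)
		n = snprintf(buf, size, "open %s %s %u %s",
			     name, path, block_size, flags);
	else
		n = snprintf(buf, size, "open %s %s %u",
			     name, path, block_size);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

The resulting string would then be written to /proc/scsi_tgt/vdisk/vdisk in the same way as the echo example above.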
+
+CAUTION: If you partitioned/formatted your device with block size X, *NEVER*
+======== ever try to export and then mount it (even accidentally) with another
+ block size. Otherwise you can *instantly* damage it pretty
+ badly as well as all your data on it. Messages on the initiator
+ like "attempt to access beyond end of device" are a sign of
+ such damage.
+
+ Moreover, if you want to compare how well different block sizes
+ work for you, you **MUST** EVERY TIME AFTER CHANGING BLOCK SIZE
+ **COMPLETELY** **WIPE OFF** ALL THE DATA FROM THE DEVICE. In
+ other words, THE **WHOLE** DEVICE **MUST** HAVE ONLY **ZEROS**
+ AS THE DATA AFTER YOU SWITCH TO NEW BLOCK SIZE. Switching block
+ sizes isn't like switching between FILEIO and BLOCKIO: after
+ changing the block size, all data previously written with
+ another block size MUST BE ERASED. Otherwise you will see a full
+ set of very weird behaviors, because block addressing will have
+ changed, but initiators in most cases will have no way to
+ detect that old addresses written on the device, e.g. in the
+ partition table, no longer refer to what they were intended to
+ refer to.
+
+IMPORTANT: By default for performance reasons VDISK FILEIO devices use write
+========= back caching policy. This is generally safe for the consistency
+ of journaled file systems lying over them, but your unsaved
+ cached data will be lost in case of power/hardware/software
+ failure, so you must supply your target server with some kind
+ of UPS or disable write back caching using the WRITE_THROUGH
+ flag. You should also note that file system journaling over
+ devices with write back caching enabled works reliably *ONLY*
+ if the order of journal writes is guaranteed or it uses some
+ kind of data protection barriers (i.e. after writing journal
+ data some kind of synchronization with the media is
+ performed). Otherwise, because of possible reordering in the
+ cache, even after a successful journal rollback you very much
+ risk losing your data on the FS. Currently, the Linux IO
+ subsystem guarantees the order of write operations only when
+ data protection barriers are used. Some info about it from
+ the XFS point of view can be found at
+ http://oss.sgi.com/projects/xfs/faq.html#wcache.
+ On Linux initiators for EXT3 and ReiserFS file systems the
+ barrier protection can be turned on using the "barrier=1" and
+ "barrier=flush" mount options respectively. Note that it is
+ usually turned off by default, the status of barrier usage
+ isn't reported anywhere in the system logs, and there is no
+ way to query it on a mounted file system (at least no known
+ one). Windows and, AFAIK, other UNIX'es don't need any special
+ explicit options and do the necessary barrier actions on
+ write-back caching devices by default. Also note that on some
+ real-life workloads write through caching might perform
+ better than write back caching with the barrier protection
+ turned on.
+ Also you should realize that Linux doesn't provide a
+ guarantee that after sync()/fsync() all written data have
+ really hit permanent storage; they can still be in the cache
+ of your backstorage device and be lost on a power failure.
+ Thus, even with write-through cache mode, you still need a
+ good UPS to protect yourself from data loss (note, data loss,
+ not file system integrity corruption).
+
+IMPORTANT: Some disk and partition table management utilities don't support
+========= block sizes >512 bytes, therefore make sure that your favorite one
+ supports it. Currently only cfdisk is known to work only with
+ 512 byte blocks; other utilities like fdisk on Linux or the
+ standard disk manager on Windows are proven to work well with
+ non-512 byte blocks. Note that if you export a disk file or
+ device with a block size different from the one with which
+ it was already partitioned, you could get various weird
+ things like utilities hanging or other unexpected behavior.
+ Hence, to be sure, zero the exported file or device before
+ the first access to it from the remote initiator with another
+ block size. On a Windows initiator make sure you "Set Signature"
+ in the disk manager on the drive imported from the target
+ before doing any other partitioning on it. After you have
+ successfully mounted a file system over a non-512 byte block
+ size device, the block size stops mattering: any program will
+ work with files on such a file system.
+
+BLOCKIO VDISK mode
+------------------
+
+This module works best for these types of scenarios:
+
+1) Data that are not aligned to 4K sector boundaries and <4K block sizes
+are used, which is normally found in virtualization environments where
+operating systems start partitions on odd sectors (Windows and its
+sector 63).
+
+2) Large block data transfers normally found in database loads/dumps and
+streaming media.
+
+3) Advanced relational database systems that perform their own caching
+which prefer or demand direct IO access and, because of the nature of
+their data access, can actually see worse performance with
+non-discriminate caching.
+
+4) Multiple layers of targets where the secondary and above layers need
+to have a consistent view of the primary targets in order to preserve
+data integrity, which a page cache backed IO type might not provide
+reliably.
+
+Also it has an advantage over FILEIO that it doesn't copy data between
+the system cache and the commands data buffers, so it saves a
+considerable amount of CPU power and memory bandwidth.
+
+IMPORTANT: Since data in BLOCKIO and FILEIO modes are not consistent between
+========= them, if you try to use a device in both those modes simultaneously,
+ you will almost instantly corrupt your data on that device.
+
+Pass-through mode
+-----------------
+
+In the pass-through mode (i.e. using the pass-through device handlers
+scst_disk, scst_tape, etc) SCSI commands coming from remote initiators
+are passed to the local SCSI hardware on the target as is, without any
+modifications. Like any other hardware, the local SCSI hardware cannot
+handle commands whose amount of data and/or number of segments in the
+scatter-gather array exceed certain values. Therefore, when using the
+pass-through mode, note that the maximum number of segments and the
+maximum amount of transferred data for each SCSI command on the devices
+on initiators cannot be bigger than the corresponding values of the
+corresponding SCSI devices on the target. Otherwise you will see
+symptoms like small transfers working well but large ones stalling, and
+messages like "Unable to complete command due to SG IO count
+limitation" printed in the kernel logs.
+
+You can't control the limit of scatter-gather segments from user space,
+but for block devices it is usually sufficient to set
+/sys/block/DEVICE_NAME/queue/max_sectors_kb on the initiators to the
+same or a lower value than /sys/block/DEVICE_NAME/queue/max_hw_sectors_kb
+for the corresponding devices on the target.
+
+For non-block devices SCSI commands are usually generated directly by
+applications, so, if you experience stalls of large transfers, you should
+check the documentation of your application for how to limit the
+transfer sizes.
+
+User space mode using scst_user dev handler
+-------------------------------------------
+
+The user space program fileio_tgt uses the interface of the scst_user dev
+handler and lets you see how it works in various modes. Fileio_tgt
+provides mostly the same functionality as the scst_vdisk handler, with
+the most noticeable difference being that it supports O_DIRECT mode.
+O_DIRECT mode is basically the same as BLOCKIO, but also supports
+files, so for some loads it could be significantly faster than regular
+FILEIO access. Everything said about BLOCKIO above applies to O_DIRECT
+as well. See fileio_tgt's README file for more details.
+
+Performance
+-----------
+
+Before doing any performance measurements note that:
+
+I. Performance results depend very much on your type of load,
+so it is crucial that you choose the access mode (FILEIO, BLOCKIO,
+O_DIRECT, pass-through) that suits your needs best.
+
+II. In order to get the maximum performance you should:
+
+1. For SCST:
+
+ - Disable in Makefile CONFIG_SCST_STRICT_SERIALIZING, CONFIG_SCST_EXTRACHECKS,
+ CONFIG_SCST_TRACING, CONFIG_SCST_DEBUG*, CONFIG_SCST_STRICT_SECURITY
+
+ - For pass-through devices enable
+ CONFIG_SCST_ALLOW_PASSTHROUGH_IO_SUBMIT_IN_SIRQ.
+
+2. For target drivers:
+
+ - Disable in Makefiles CONFIG_SCST_EXTRACHECKS, CONFIG_SCST_TRACING,
+ CONFIG_SCST_DEBUG*
+
+3. For device handlers, including VDISK:
+
+ - Disable in Makefile CONFIG_SCST_TRACING and CONFIG_SCST_DEBUG.
+
+ - If your initiator(s) use dedicated virtual SCSI devices exported
+ from the target and have at least as much memory as the target,
+ it is recommended to use the O_DIRECT option (currently available
+ only with the fileio_tgt user space program) or BLOCKIO. With
+ them you could get up to a 100% increase in throughput.
+
+IMPORTANT: Some of the compilation options are enabled by default, i.e. SCST
+========= is currently optimized for development and bug hunting rather
+ than for performance.
+
+If you use SCST version taken directly from the SVN repository, you can
+set the above options, except CONFIG_SCST_ALLOW_PASSTHROUGH_IO_SUBMIT_IN_SIRQ,
+using debug2perf Makefile target.
+
+4. For other target and initiator software parts:
+
+ - Don't enable debug/hacking features in the kernel, i.e. leave them
+ at their defaults.
+
+ - The default kernel read-ahead and queuing settings are optimized
+ for locally attached disks, therefore they are not optimal for
+ remotely attached disks (the SCSI target case), which sometimes leads
+ to unexpectedly low throughput. You should increase the read-ahead
+ size to at least 512KB or even more on all initiators and the target.
+
+ You should also limit the maximum amount of sectors per SCSI command
+ on all initiators. To do it on Linux initiators, run:
+
+ echo "64" > /sys/block/sdX/queue/max_sectors_kb
+
+ where X is the letter of your device imported from the target,
+ like 'b', i.e. sdb.
+
+ To increase read-ahead size on Linux, run:
+
+ blockdev --setra N /dev/sdX
+
+ where N is a read-ahead number in 512-byte sectors and X is a device
+ letter like above.
+
+ Note: you need to set read-ahead setting for device sdX again after
+ you changed the maximum amount of sectors per SCSI command for that
+ device.
+
+ - You may need to increase the number of requests that the OS on the
+ initiator sends to the target device. To do it on Linux initiators, run:
+
+ echo "64" > /sys/block/sdX/queue/nr_requests
+
+ where X is a device letter like above.
+
+ You may also experiment with other parameters in the /sys/block/sdX
+ directory; they also affect performance. If you find the best values,
+ please share them with us.
+
+ - On the target, use the CFQ IO scheduler. In most cases it has a
+ performance advantage over other IO schedulers, sometimes a huge one
+ (2+ times increase in aggregate throughput).
+
+ - It is recommended to turn kernel preemption off, i.e. set
+ the kernel preemption model to "No Forced Preemption (Server)".
+
+ - XFS appears to be the best filesystem on the target for storing
+ device files, because it allows considerably better linear write
+ throughput than ext3.
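
The read-ahead value N passed to "blockdev --setra" above is counted in 512-byte sectors, so a size given in KB has to be converted first. A minimal sketch of that arithmetic follows; the function name is hypothetical.

```c
#include <assert.h>

/*
 * Convert a read-ahead size in KB to the number of 512-byte sectors
 * expected by "blockdev --setra". Hypothetical helper.
 */
unsigned long readahead_kb_to_sectors(unsigned long kb)
{
	/* Guard against overflow in the multiplication below. */
	assert(kb <= (unsigned long)-1 / 1024UL);
	return kb * 1024UL / 512UL;
}
```

For the 512KB minimum recommended above this gives N = 1024.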
+
+5. For hardware on target.
+
+ - Make sure that your target hardware (e.g. target FC or network card)
+ and underlying IO hardware (e.g. an IO card, like SATA, SCSI or RAID,
+ to which your disks are connected) don't share the same PCI bus. You
+ can check it using the lspci utility. They have to work in parallel,
+ so it is better if they don't compete for the bus. The problem is not
+ only the bandwidth they have to share, but also the interaction
+ between the cards during that competition. This is very important,
+ because in some cases, if the target and backend storage controllers
+ share the same PCI bus, performance can be 5-10 times lower than
+ expected. Moreover, some motherboards (by Supermicro, particularly)
+ have serious stability issues if there are several high speed
+ devices on the same bus working in parallel. If you have no choice
+ but to share a PCI bus, set the PCI latency in the BIOS as low as
+ possible.
+
+6. If you use the VDISK IO module in FILEIO mode, the NV_CACHE option will
+give you the best performance. But when using it, make sure you use a
+good UPS with the ability to shut down the target on power failure.
+
+IMPORTANT: If you use some versions of Windows (at least W2K) on initiators,
+========= you can't get good write performance for VDISK FILEIO devices with
+ the default 512 byte block size. You could get about 10% of
+ the expected performance. This is because of the partition
+ alignment, which is (simplifying) incompatible with how the
+ Linux page cache works, so for each write the corresponding
+ block must be read first. Use a 4096 byte block size for
+ VDISK devices and you will get the expected write
+ performance. Actually, any OS on initiators, not only
+ Windows, will benefit from a block size of
+ max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS), where PAGE_SIZE
+ is the page size and BLOCK_SIZE_ON_UNDERLYING_FS is the block
+ size of the underlying FS on which the device file is
+ located, or 0 if a device node is used. Both values are taken
+ from the target. See also the important notes above about
+ setting block sizes >512 bytes for VDISK FILEIO devices.
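
The max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS) rule above can be written down directly. The sketch below is meant to be run on the target; it uses the POSIX sysconf() call to obtain the page size, and both function names are hypothetical.

```c
#include <assert.h>
#include <unistd.h>

/* Page size of the machine this runs on (the target), via POSIX sysconf(). */
long target_page_size(void)
{
	return sysconf(_SC_PAGESIZE);
}

/*
 * Block size suggested by the rule above. fs_block_size is 0 when a
 * device node is used instead of a file on a file system.
 */
unsigned long vdisk_recommended_block_size(unsigned long page_size,
					   unsigned long fs_block_size)
{
	assert(page_size > 0);
	return page_size > fs_block_size ? page_size : fs_block_size;
}
```

With a 4096 byte page size, a device file on a file system with 1024 byte blocks, or a device node, would both give 4096.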
+
+What if target's backstorage is too slow
+----------------------------------------
+
+If under high load you experience I/O stalls or see abort or reset
+messages in the kernel log on the target, then your backstorage is too
+slow compared to your target link speed and the number of simultaneously
+queued commands. On some seek intensive workloads even fast disks or
+RAIDs, which are able to serve a continuous data stream at 500+ MB/s,
+can be as slow as 0.3 MB/s. Another possible cause can be
+MD/LVM/RAID on your target as in http://lkml.org/lkml/2008/2/27/96
+(check the whole thread as well).
+
+Thus, in such situations processing of one or more commands simply takes
+too long, hence the initiator decides that they are stuck on the target
+and tries to recover. In particular, it is known that the default number
+of simultaneously queued commands (48) is sometimes too high if you do
+intensive writes from VMware on a target disk that uses LVM in
+snapshot mode. In this case a value like 16 or even 8-10, depending on
+your backstorage speed, could be more appropriate.
+
+Unfortunately, SCST currently lacks dynamic I/O flow control, where the
+queue depth on the target is dynamically decreased/increased based on
+how slow/fast the backstorage is compared to the target link. So,
+there are only 5 possible actions you can take to work around or fix
+this issue:
+
+1. Ignore incoming task management (TM) commands. It's fine if there are
+not too many of them, so average performance isn't hurt and the
+corresponding device isn't put offline, i.e. if the backstorage isn't
+too slow.
+
+2. Decrease /sys/block/sdX/device/queue_depth on the initiator if
+it's Linux (see below how) and/or the SCST_MAX_TGT_DEV_COMMANDS constant
+in the scst_priv.h file until you stop seeing incoming TM commands.
+The ISCSI-SCST driver also has its own iSCSI specific parameter for that.
+
+3. Try to avoid such seek intensive workloads.
+
+4. Increase the speed of the target's backstorage.
+
+5. Implement dynamic I/O flow control in SCST.
+
+To decrease device queue depth on Linux initiators run command:
+
+# echo Y >/sys/block/sdX/device/queue_depth
+
+where Y is the new number of simultaneously queued commands and X is
+your imported device letter, like 'a' for the sda device. There are no
+special limitations on Y; it can be any value from 1 to the possible
+maximum (usually 32), so start by dividing the current value by 2,
+i.e. set 16 if /sys/block/sdX/device/queue_depth contains 32.
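
The "divide by 2" starting point above can be expressed as a tiny helper that never goes below the minimum queue depth of 1. This is only a sketch of the suggested tuning step; the function name is hypothetical.

```c
#include <assert.h>

/*
 * Next queue depth to try: half of the current value, but never less
 * than 1. Hypothetical helper for the tuning procedure above.
 */
unsigned int next_queue_depth(unsigned int current)
{
	assert(current >= 1);
	return current / 2 >= 1 ? current / 2 : 1;
}
```

Starting from 32 this yields 16, then 8, and so on; the idea is to stop at the first value at which TM commands are no longer seen on the target.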
+
+Note that logged messages about QUEUE_FULL status are quite different
+by nature. This is normal operation, just SCSI flow control in action.
+Simply don't enable the "mgmt_minor" logging level, or, alternatively, if
+you are confident in the worst case performance of your back-end
+storage, you can increase SCST_MAX_TGT_DEV_COMMANDS in scst_priv.h to
+64. Usually initiators don't try to push more commands to the target.
+
+Credits
+-------
+
+Thanks to:
+
+ * Mark Buechler <[email protected]> for a lot of useful
+ suggestions, bug reports and help in debugging.
+
+ * Ming Zhang <[email protected]> for fixes and comments.
+
+ * Nathaniel Clark <[email protected]> for fixes and comments.
+
+ * Calvin Morrow <[email protected]> for testing and useful
+ suggestions.
+
+ * Hu Gang <[email protected]> for the original version of the
+ LSI target driver.
+
+ * Erik Habbinga <[email protected]> for fixes and support
+ of the LSI target driver.
+
+ * Ross S. W. Walker <[email protected]> for the original block IO
+ code and Vu Pham <[email protected]> who updated it for the VDISK dev
+ handler.
+
+ * Michael G. Byrnes <[email protected]> for fixes.
+
+ * Alessandro Premoli <[email protected]> for fixes
+
+ * Nathan Bullock <[email protected]> for fixes.
+
+ * Terry Greeniaus <[email protected]> for fixes.
+
+ * Krzysztof Blaszkowski <[email protected]> for many fixes and bug reports.
+
+ * Jianxi Chen <[email protected]> for fixing problem with
+ devices >2TB in size
+
+ * Bart Van Assche <[email protected]> for a lot of help
+
+Vladislav Bolkhovitin <[email protected]>, http://scst.sourceforge.net
This patch contains definitions and functions for tracing various
internal events. It can be fully compiled out and have no runtime overhead.
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
drivers/scst/scst_debug.c | 128 ++++++++++++++++
include/scst/scst_debug.h | 361 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 489 insertions(+)
diff -uprN orig/linux-2.6.27/include/scst/scst_debug.h linux-2.6.27/include/scst/scst_debug.h
--- orig/linux-2.6.27/include/scst/scst_debug.h
+++ linux-2.6.27/include/scst/scst_debug.h
@@ -0,0 +1,361 @@
+/*
+ * include/scst_debug.h
+ *
+ * Copyright (C) 2004 - 2008 Vladislav Bolkhovitin <[email protected]>
+ * Copyright (C) 2004 - 2005 Leonid Stoljar
+ * Copyright (C) 2007 - 2008 CMS Distribution Limited
+ *
+ * Contains macros for execution tracing and error reporting
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation, version 2
+ * of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __SCST_DEBUG_H
+#define __SCST_DEBUG_H
+
+#include <linux/autoconf.h> /* for CONFIG_* */
+
+#include <linux/bug.h> /* for WARN_ON_ONCE */
+
+
+
+#ifdef CONFIG_SCST_EXTRACHECKS
+#define EXTRACHECKS_BUG_ON(a) BUG_ON(a)
+#define EXTRACHECKS_WARN_ON(a) WARN_ON(a)
+#define EXTRACHECKS_WARN_ON_ONCE(a) WARN_ON_ONCE(a)
+#else
+#define EXTRACHECKS_BUG_ON(a)
+#define EXTRACHECKS_WARN_ON(a)
+#define EXTRACHECKS_WARN_ON_ONCE(a)
+#endif
+
+#ifdef CONFIG_SCST_DEBUG
+/*# define LOG_FLAG KERN_DEBUG*/
+# define LOG_FLAG KERN_INFO
+# define INFO_FLAG KERN_INFO
+# define ERROR_FLAG KERN_INFO
+#else
+# define LOG_FLAG KERN_INFO
+# define INFO_FLAG KERN_INFO
+# define ERROR_FLAG KERN_ERR
+#endif
+
+#define CRIT_FLAG KERN_CRIT
+
+#define NO_FLAG ""
+
+#define TRACE_NULL 0x00000000
+#define TRACE_DEBUG 0x00000001
+#define TRACE_FUNCTION 0x00000002
+#define TRACE_LINE 0x00000004
+#define TRACE_PID 0x00000008
+#define TRACE_ENTRYEXIT 0x00000010
+#define TRACE_BUFF 0x00000020
+#define TRACE_MEMORY 0x00000040
+#define TRACE_SG_OP 0x00000080
+#define TRACE_OUT_OF_MEM 0x00000100
+#define TRACE_MINOR 0x00000200 /* less important events */
+#define TRACE_MGMT 0x00000400
+#define TRACE_MGMT_MINOR 0x00000800
+#define TRACE_MGMT_DEBUG 0x00001000
+#define TRACE_SCSI 0x00002000
+#define TRACE_SPECIAL 0x00004000 /* filtering debug, etc */
+#define TRACE_ALL 0xffffffff
+/* Flags 0xXXXX0000 are local for users */
+
+/*
+ * Note: in the next two printk() statements the KERN_CONT macro is only
+ * present to suppress a checkpatch warning (KERN_CONT is defined as "").
+ */
+#define PRINT(log_flag, format, args...) \
+ printk(KERN_CONT "%s" format "\n", log_flag, ## args)
+#define PRINTN(log_flag, format, args...) \
+ printk(KERN_CONT "%s" format, log_flag, ## args)
+
+#ifdef LOG_PREFIX
+#define __LOG_PREFIX LOG_PREFIX
+#else
+#define __LOG_PREFIX NULL
+#endif
+
+#if defined(CONFIG_SCST_DEBUG) || defined(CONFIG_SCST_TRACING)
+
+#ifndef CONFIG_SCST_DEBUG
+#define ___unlikely(a) (a)
+#else
+#define ___unlikely(a) unlikely(a)
+#endif
+
+extern int debug_print_prefix(unsigned long trace_flag, const char *log_level,
+ const char *prefix, const char *func, int line);
+extern void debug_print_buffer(const char *log_level, const void *data,
+ int len);
+
+#define TRACE(trace, format, args...) \
+do { \
+ if (___unlikely(trace_flag & (trace))) { \
+ char *__tflag = LOG_FLAG; \
+ if (debug_print_prefix(trace_flag, __tflag, __LOG_PREFIX, \
+ __func__, __LINE__) > 0) { \
+ __tflag = NO_FLAG; \
+ } \
+ PRINT(NO_FLAG, "%s" format, __tflag, args); \
+ } \
+} while (0)
+
+#define PRINT_BUFFER(message, buff, len) \
+do { \
+ PRINT(NO_FLAG, "%s:", message); \
+ debug_print_buffer(INFO_FLAG, buff, len); \
+} while (0)
+
+#define PRINT_BUFF_FLAG(flag, message, buff, len) \
+do { \
+ if (___unlikely(trace_flag & (flag))) { \
+ char *__tflag = INFO_FLAG; \
+ if (debug_print_prefix(trace_flag, __tflag, NULL, __func__,\
+ __LINE__) > 0) { \
+ __tflag = NO_FLAG; \
+ } \
+ PRINT(NO_FLAG, "%s%s:", __tflag, message); \
+ debug_print_buffer(INFO_FLAG, buff, len); \
+ } \
+} while (0)
+
+#else /* CONFIG_SCST_DEBUG || CONFIG_SCST_TRACING */
+
+#define TRACE(trace, args...) do {} while (0)
+#define PRINT_BUFFER(message, buff, len) do {} while (0)
+#define PRINT_BUFF_FLAG(flag, message, buff, len) do {} while (0)
+
+#endif /* CONFIG_SCST_DEBUG || CONFIG_SCST_TRACING */
+
+#ifdef CONFIG_SCST_DEBUG
+
+#define __TRACE(trace, format, args...) \
+do { \
+ if (trace_flag & (trace)) { \
+ char *__tflag = LOG_FLAG; \
+ if (debug_print_prefix(trace_flag, __tflag, NULL, __func__,\
+ __LINE__) > 0) { \
+ __tflag = NO_FLAG; \
+ } \
+ PRINT(NO_FLAG, "%s" format, __tflag, args); \
+ } \
+} while (0)
+
+#define TRACE_MEM(args...) __TRACE(TRACE_MEMORY, args)
+#define TRACE_SG(args...) __TRACE(TRACE_SG_OP, args)
+#define TRACE_DBG(args...) __TRACE(TRACE_DEBUG, args)
+#define TRACE_DBG_SPECIAL(args...) __TRACE(TRACE_DEBUG|TRACE_SPECIAL, args)
+#define TRACE_MGMT_DBG(args...) __TRACE(TRACE_MGMT_DEBUG, args)
+#define TRACE_MGMT_DBG_SPECIAL(args...) \
+ __TRACE(TRACE_MGMT_DEBUG|TRACE_SPECIAL, args)
+
+#define TRACE_BUFFER(message, buff, len) \
+do { \
+ if (trace_flag & TRACE_BUFF) { \
+ char *__tflag = LOG_FLAG; \
+ if (debug_print_prefix(trace_flag, __tflag, NULL, __func__, \
+ __LINE__) > 0) { \
+ __tflag = NO_FLAG; \
+ } \
+ PRINT(NO_FLAG, "%s%s:", __tflag, message); \
+ debug_print_buffer(LOG_FLAG, buff, len); \
+ } \
+} while (0)
+
+#define TRACE_BUFF_FLAG(flag, message, buff, len) \
+do { \
+ if (trace_flag & (flag)) { \
+ char *__tflag = LOG_FLAG; \
+ if (debug_print_prefix(trace_flag, __tflag, NULL, __func__, \
+ __LINE__) > 0) { \
+ __tflag = NO_FLAG; \
+ } \
+ PRINT(NO_FLAG, "%s%s:", __tflag, message); \
+ debug_print_buffer(LOG_FLAG, buff, len); \
+ } \
+} while (0)
+
+#define PRINT_LOG_FLAG(log_flag, format, args...) \
+do { \
+ char *__tflag = log_flag; \
+ if (debug_print_prefix(trace_flag, __tflag, __LOG_PREFIX, \
+ __func__, __LINE__) > 0) { \
+ __tflag = NO_FLAG; \
+ } \
+ PRINT(NO_FLAG, "%s" format, __tflag, args); \
+} while (0)
+
+#define PRINT_WARNING(format, args...) \
+do { \
+ if (strcmp(INFO_FLAG, LOG_FLAG)) { \
+ PRINT_LOG_FLAG(LOG_FLAG, "***WARNING*** " format, args); \
+ } \
+ PRINT_LOG_FLAG(INFO_FLAG, "***WARNING*** " format, args); \
+} while (0)
+
+#define PRINT_ERROR(format, args...) \
+do { \
+ if (strcmp(ERROR_FLAG, LOG_FLAG)) { \
+ PRINT_LOG_FLAG(LOG_FLAG, "***ERROR*** " format, args); \
+ } \
+ PRINT_LOG_FLAG(ERROR_FLAG, "***ERROR*** " format, args); \
+} while (0)
+
+#define PRINT_CRIT_ERROR(format, args...) \
+do { \
+ /* if (strcmp(CRIT_FLAG, LOG_FLAG)) \
+ { \
+ PRINT_LOG_FLAG(LOG_FLAG, "***CRITICAL ERROR*** " format, args); \
+ }*/ \
+ PRINT_LOG_FLAG(CRIT_FLAG, "***CRITICAL ERROR*** " format, args); \
+} while (0)
+
+#define PRINT_INFO(format, args...) \
+do { \
+ if (strcmp(INFO_FLAG, LOG_FLAG)) { \
+ PRINT_LOG_FLAG(LOG_FLAG, format, args); \
+ } \
+ PRINT_LOG_FLAG(INFO_FLAG, format, args); \
+} while (0)
+
+#define TRACE_ENTRY() \
+do { \
+ if (trace_flag & TRACE_ENTRYEXIT) { \
+ if (trace_flag & TRACE_PID) { \
+ PRINT(LOG_FLAG, "[%d]: ENTRY %s", current->pid, \
+ __func__); \
+ } \
+ else { \
+ PRINT(LOG_FLAG, "ENTRY %s", __func__); \
+ } \
+ } \
+} while (0)
+
+#define TRACE_EXIT() \
+do { \
+ if (trace_flag & TRACE_ENTRYEXIT) { \
+ if (trace_flag & TRACE_PID) { \
+ PRINT(LOG_FLAG, "[%d]: EXIT %s", current->pid, \
+ __func__); \
+ } \
+ else { \
+ PRINT(LOG_FLAG, "EXIT %s", __func__); \
+ } \
+ } \
+} while (0)
+
+#define TRACE_EXIT_RES(res) \
+do { \
+ if (trace_flag & TRACE_ENTRYEXIT) { \
+ if (trace_flag & TRACE_PID) { \
+ PRINT(LOG_FLAG, "[%d]: EXIT %s: %ld", current->pid, \
+ __func__, (long)(res)); \
+ } \
+ else { \
+ PRINT(LOG_FLAG, "EXIT %s: %ld", \
+ __func__, (long)(res)); \
+ } \
+ } \
+} while (0)
+
+#define TRACE_EXIT_HRES(res) \
+do { \
+ if (trace_flag & TRACE_ENTRYEXIT) { \
+ if (trace_flag & TRACE_PID) { \
+ PRINT(LOG_FLAG, "[%d]: EXIT %s: 0x%lx", current->pid, \
+ __func__, (long)(res)); \
+ } \
+ else { \
+ PRINT(LOG_FLAG, "EXIT %s: 0x%lx", \
+ __func__, (long)(res)); \
+ } \
+ } \
+} while (0)
+
+#else /* CONFIG_SCST_DEBUG */
+
+#define TRACE_MEM(format, args...) do {} while (0)
+#define TRACE_SG(format, args...) do {} while (0)
+#define TRACE_DBG(format, args...) do {} while (0)
+#define TRACE_DBG_SPECIAL(format, args...) do {} while (0)
+#define TRACE_MGMT_DBG(format, args...) do {} while (0)
+#define TRACE_MGMT_DBG_SPECIAL(format, args...) do {} while (0)
+#define TRACE_BUFFER(message, buff, len) do {} while (0)
+#define TRACE_BUFF_FLAG(flag, message, buff, len) do {} while (0)
+#define TRACE_ENTRY() do {} while (0)
+#define TRACE_EXIT() do {} while (0)
+#define TRACE_EXIT_RES(res) do {} while (0)
+#define TRACE_EXIT_HRES(res) do {} while (0)
+
+#ifdef LOG_PREFIX
+
+#define PRINT_INFO(format, args...) \
+do { \
+ PRINT(INFO_FLAG, "%s: " format, LOG_PREFIX, args); \
+} while (0)
+
+#define PRINT_WARNING(format, args...) \
+do { \
+ PRINT(INFO_FLAG, "%s: ***WARNING*** " \
+ format, LOG_PREFIX, args); \
+} while (0)
+
+#define PRINT_ERROR(format, args...) \
+do { \
+ PRINT(ERROR_FLAG, "%s: ***ERROR*** " \
+ format, LOG_PREFIX, args); \
+} while (0)
+
+#define PRINT_CRIT_ERROR(format, args...) \
+do { \
+ PRINT(CRIT_FLAG, "%s: ***CRITICAL ERROR*** " \
+ format, LOG_PREFIX, args); \
+} while (0)
+
+#else
+
+#define PRINT_INFO(format, args...) \
+do { \
+ PRINT(INFO_FLAG, format, args); \
+} while (0)
+
+#define PRINT_WARNING(format, args...) \
+do { \
+ PRINT(INFO_FLAG, "***WARNING*** " \
+ format, args); \
+} while (0)
+
+#define PRINT_ERROR(format, args...) \
+do { \
+ PRINT(ERROR_FLAG, "***ERROR*** " \
+ format, args); \
+} while (0)
+
+#define PRINT_CRIT_ERROR(format, args...) \
+do { \
+ PRINT(CRIT_FLAG, "***CRITICAL ERROR*** " \
+ format, args); \
+} while (0)
+
+#endif /* LOG_PREFIX */
+
+#endif /* CONFIG_SCST_DEBUG */
+
+#if defined(CONFIG_SCST_DEBUG) && defined(CONFIG_DEBUG_SLAB)
+#define SCST_SLAB_FLAGS (SLAB_RED_ZONE | SLAB_POISON)
+#else
+#define SCST_SLAB_FLAGS 0L
+#endif
+
+#endif /* __SCST_DEBUG_H */
diff -uprN orig/linux-2.6.27/drivers/scst/scst_debug.c linux-2.6.27/drivers/scst/scst_debug.c
--- orig/linux-2.6.27/drivers/scst/scst_debug.c
+++ linux-2.6.27/drivers/scst/scst_debug.c
@@ -0,0 +1,128 @@
+/*
+ * scst_debug.c
+ *
+ * Copyright (C) 2004 - 2008 Vladislav Bolkhovitin <[email protected]>
+ * Copyright (C) 2004 - 2005 Leonid Stoljar
+ * Copyright (C) 2007 - 2008 CMS Distribution Limited
+ *
+ * Contains helper functions for execution tracing and error reporting.
+ * Intended to be included in main .c file.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation, version 2
+ * of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include "scst.h"
+#include "scst_debug.h"
+
+#if defined(CONFIG_SCST_DEBUG) || defined(CONFIG_SCST_TRACING)
+
+#define TRACE_BUF_SIZE 512
+
+static char trace_buf[TRACE_BUF_SIZE];
+static DEFINE_SPINLOCK(trace_buf_lock);
+
+static inline int get_current_tid(void)
+{
+ /* Code should be the same as in sys_gettid() */
+ if (in_interrupt()) {
+ /*
+ * Unfortunately, task_pid_vnr() isn't IRQ-safe, so otherwise
+ * it can oops. ToDo.
+ */
+ return 0;
+ }
+ return task_pid_vnr(current);
+}
+
+int debug_print_prefix(unsigned long trace_flag, const char *log_level,
+ const char *prefix, const char *func, int line)
+{
+ int i = 0;
+ unsigned long flags;
+
+ spin_lock_irqsave(&trace_buf_lock, flags);
+
+ if (trace_flag & TRACE_PID)
+ i += snprintf(&trace_buf[i], TRACE_BUF_SIZE - i, "[%d]: ",
+ get_current_tid());
+ if (prefix != NULL)
+ i += snprintf(&trace_buf[i], TRACE_BUF_SIZE - i, "%s: ",
+ prefix);
+ if (trace_flag & TRACE_FUNCTION)
+ i += snprintf(&trace_buf[i], TRACE_BUF_SIZE - i, "%s:", func);
+ if (trace_flag & TRACE_LINE)
+ i += snprintf(&trace_buf[i], TRACE_BUF_SIZE - i, "%i:", line);
+
+ if (i > 0)
+ PRINTN(log_level, "%s", trace_buf);
+
+ spin_unlock_irqrestore(&trace_buf_lock, flags);
+
+ return i;
+}
+EXPORT_SYMBOL(debug_print_prefix);
+
+void debug_print_buffer(const char *log_level, const void *data, int len)
+{
+ int z, z1, i;
+ const unsigned char *buf = (const unsigned char *) data;
+ int f = 0;
+ unsigned long flags;
+
+ if (buf == NULL)
+ return;
+
+ spin_lock_irqsave(&trace_buf_lock, flags);
+
+ PRINT(NO_FLAG, " (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F");
+ for (z = 0, z1 = 0, i = 0; z < len; z++) {
+ if (z % 16 == 0) {
+ if (z != 0) {
+ i += snprintf(&trace_buf[i], TRACE_BUF_SIZE - i,
+ " ");
+ for (; (z1 < z) && (i < TRACE_BUF_SIZE - 1);
+ z1++) {
+ if ((buf[z1] >= 0x20) &&
+ (buf[z1] < 0x80))
+ trace_buf[i++] = buf[z1];
+ else
+ trace_buf[i++] = '.';
+ }
+ trace_buf[i] = '\0';
+ PRINT(NO_FLAG, "%s", trace_buf);
+ i = 0;
+ f = 1;
+ }
+ i += snprintf(&trace_buf[i], TRACE_BUF_SIZE - i,
+ "%4x: ", z);
+ }
+ i += snprintf(&trace_buf[i], TRACE_BUF_SIZE - i, "%02x ",
+ buf[z]);
+ }
+ i += snprintf(&trace_buf[i], TRACE_BUF_SIZE - i, " ");
+ for (; (z1 < z) && (i < TRACE_BUF_SIZE - 1); z1++) {
+ if ((buf[z1] >= 0x20) && (buf[z1] < 0x80))
+ trace_buf[i++] = buf[z1];
+ else
+ trace_buf[i++] = '.';
+ }
+ trace_buf[i] = '\0';
+ if (f)
+ PRINT(log_level, "%s", trace_buf);
+ else
+ PRINT(NO_FLAG, "%s", trace_buf);
+
+ spin_unlock_irqrestore(&trace_buf_lock, flags);
+ return;
+}
+EXPORT_SYMBOL(debug_print_buffer);
+
+#endif /* CONFIG_SCST_DEBUG || CONFIG_SCST_TRACING */
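As a side note for reviewers: the row layout that debug_print_buffer()
produces (offset, hex bytes, then a printable-ASCII column) can be tried
out in user space with a sketch like the one below. It is not part of the
patch. format_dump_row(), DUMP_ROW_BYTES and DUMP_ROW_BUF are names made
up for this illustration, and, unlike the kernel code, the sketch treats
0x7f (DEL) as non-printable.

```c
#include <stddef.h>
#include <stdio.h>

#define DUMP_ROW_BYTES	16	/* bytes shown per row, as in the patch */
#define DUMP_ROW_BUF	96	/* assumed size: one full row plus the NUL */

/*
 * Format 'n' (at most DUMP_ROW_BYTES) bytes of 'data', starting at byte
 * offset 'off', as one dump row: hex offset, hex bytes, a separating
 * space, then each byte as printable ASCII or '.'.
 * Returns the row length, or -1 if the arguments are out of range.
 */
int format_dump_row(const unsigned char *data, size_t off, size_t n,
		    char *out, size_t out_len)
{
	size_t i;
	int pos;

	if (out_len < DUMP_ROW_BUF || n > DUMP_ROW_BYTES)
		return -1;

	pos = snprintf(out, out_len, "%4zx: ", off);
	for (i = 0; i < n; i++)
		pos += snprintf(out + pos, out_len - (size_t)pos, "%02x ",
				data[off + i]);
	pos += snprintf(out + pos, out_len - (size_t)pos, " ");
	for (i = 0; i < n; i++) {
		unsigned char c = data[off + i];

		/* Sketch-only choice: DEL (0x7f) is shown as '.' too. */
		out[pos++] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
	}
	out[pos] = '\0';
	return pos;
}
```

The fixed DUMP_ROW_BUF check keeps the sketch free of truncation
handling; the kernel code instead shares one 512-byte buffer under a
spinlock.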
This patch contains the SCST /proc interface.
A description of this interface can be found in the patch with the
SCST core documentation.
Since a procfs-based configuration interface is unacceptable for new
kernel modules, SCST's configuration interface will be replaced by a
sysfs-based one in the next review iteration.
This patch is not intended to be included in the Linux kernel. It is
posted here because, as of today, this configuration interface is
necessary when using SCST.
Unfortunately, configfs is not (yet) suited for configuring SCST. This
is because configfs is driven from user space, so the kernel can't
create subdirectories in it, and all files in configfs are limited to 4K
in size. This makes it impossible for the kernel to show, e.g., a list
of connected initiators. Hence, with configfs it would be necessary to
have one more interface, e.g. a sysfs-based one, to show such data. That
would lead to two interfaces in two different places for configuring
SCSI targets: one configfs-based and one sysfs-based. A single
sysfs-based interface is clearly better than two interfaces. On the
other hand, for what SCST needs, sysfs provides basically the same
capabilities as configfs, and it is already widely used in the kernel to
configure various parameters. See, for instance, bonding devices or IO
schedulers.
The proposed /sys interface would keep the current /proc layout largely
intact, except for cases where output larger than PAGE_SIZE is needed.
In such cases each entry, i.e. each line, of the file would be presented
as a subdirectory named after the first element of that line, and each
other element of the line would be presented as a separate file
(attribute). For instance, /proc/scsi_tgt/sessions, which lists
connected sessions, would be converted to:
/sys/scsi_tgt/
/sys/scsi_tgt/sessions/
/sys/scsi_tgt/sessions/session1_name/
/sys/scsi_tgt/sessions/session1_name/target_name
/sys/scsi_tgt/sessions/session1_name/initiator_name
/sys/scsi_tgt/sessions/session1_name/acl -> ../../acls/aclX
/sys/scsi_tgt/sessions/session1_name/commands
/sys/scsi_tgt/sessions/session2_name/
/sys/scsi_tgt/sessions/session2_name/target_name
/sys/scsi_tgt/sessions/session2_name/initiator_name
/sys/scsi_tgt/sessions/session2_name/acl -> ../../acls/aclY
/sys/scsi_tgt/sessions/session2_name/commands
.
.
.
New devices, e.g. vdisk devices, would be added by echoing commands to,
in this example, the /sys/scsi_tgt/vdisk/mgmt file, similarly to how it
is currently done with /proc/scsi_tgt/vdisk/vdisk (with the same
commands, actually).
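The conversion rule for the sessions file (the first element of a line
becomes a subdirectory, each other element becomes an attribute file)
can be illustrated with a small user-space sketch. Only the
/sys/scsi_tgt/sessions/ layout and the four attribute names come from
the proposal above; session_attr_path() and the session_attrs table are
hypothetical helpers, not SCST code.

```c
#include <stddef.h>
#include <stdio.h>

/* Attribute files proposed for each session subdirectory. */
static const char *const session_attrs[] = {
	"target_name", "initiator_name", "acl", "commands",
};

#define N_SESSION_ATTRS (sizeof(session_attrs) / sizeof(session_attrs[0]))

/*
 * Build the path of attribute number 'idx' for the session named
 * 'session' (the first element of its /proc line). Returns the length
 * snprintf() would have written, or -1 if 'idx' is out of range.
 */
int session_attr_path(const char *session, size_t idx, char *buf, size_t len)
{
	if (idx >= N_SESSION_ATTRS)
		return -1;
	return snprintf(buf, len, "/sys/scsi_tgt/sessions/%s/%s",
			session, session_attrs[idx]);
}
```

Calling session_attr_path("session1_name", 0, buf, sizeof(buf)) yields
/sys/scsi_tgt/sessions/session1_name/target_name, matching the first
attribute in the listing above.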
Any comments and suggestions will be greatly appreciated.
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
drivers/scst/scst_proc.c | 2196 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 2196 insertions(+)
The patch is too big to be submitted inline. You can find it in
http://scst.sourceforge.net/patches/scst_proc.diff
This patch contains the SCST SGV cache. The SGV cache is a memory
management subsystem in SCST. One could call it a "memory pool", but the
Linux kernel already has a mempool interface, which serves different
purposes. The SGV cache provides the SCST core, target drivers and
backend dev handlers with facilities to allocate and build SG vectors
for data buffers. Its main feature is that it doesn't immediately free
each vector that is no longer used back to the system, but keeps it for
a while so that it can be reused by a subsequent command. This reduces
command processing latency and, hence, improves performance. The freed
SG vectors are kept by the SGV cache either for some predefined time, or
until the system needs more memory and asks to free some using the
set_shrinker() interface. The SGV cache also makes it possible to:
- Cluster pages together to minimize the number of SG entries in the
vector and improve the performance of handling the SG vector.
- Set a custom page allocator function. For instance, the scst_user
device handler uses this facility to eliminate unneeded
mapping/unmapping of user space pages and to avoid unneeded IOCTL calls
for buffer allocations. In the fileio_tgt application this leads to ~30%
less CPU load and a considerable performance increase.
- Prevent each initiator, or all initiators together, from allocating
too much memory and effectively DoS'ing the target. Consider 10
initiators, each of which has access to 10 devices. Each of them can
queue up to 64 commands per device, and each command can transfer up to
1MB of data. So, at peak, all of them together can allocate up to
10*10*64*1MB = ~6.4GB of memory for data buffers. This amount must be
limited somehow, and the SGV cache performs this function. This feature
was implemented after people reported such DoS'es occurring with many
fast initiators and a slow target.
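The worst case in the example above is the product of four factors:
initiators, devices per initiator, queued commands per device and bytes
per command. With the numbers given that is 6400MB. The sketch below
spells out that arithmetic and shows one simple way such a cap could be
enforced. It is an illustration only: struct mem_budget,
budget_charge() and budget_uncharge() are hypothetical names and are not
the SGV cache's actual limiting code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Worst-case bytes of data buffers outstanding at once: every initiator
 * queues the maximum number of commands to every device it can access,
 * and every command carries the maximum transfer size.
 */
uint64_t peak_buffer_bytes(uint64_t initiators, uint64_t devices_each,
			   uint64_t cmds_per_device, uint64_t bytes_per_cmd)
{
	return initiators * devices_each * cmds_per_device * bytes_per_cmd;
}

/* Hypothetical budget; callers must keep used <= limit. */
struct mem_budget {
	uint64_t limit;	/* bytes allowed in flight */
	uint64_t used;	/* bytes currently charged */
};

/* Charge 'bytes' against the budget; refuse if the cap would be exceeded. */
bool budget_charge(struct mem_budget *b, uint64_t bytes)
{
	if (bytes > b->limit - b->used)
		return false;
	b->used += bytes;
	return true;
}

/* Return previously charged 'bytes' to the budget. */
void budget_uncharge(struct mem_budget *b, uint64_t bytes)
{
	b->used -= bytes;
}
```

A refused charge would translate into the command waiting or being
rejected, which is what keeps many fast initiators from exhausting the
target's memory.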
From an implementation point of view, the SGV cache is a simple
extension of the kmem cache. Each SGV cache, called a pool (struct
sgv_pool), has SGV_POOL_ELEMENTS (currently 11) kmem caches. Each of
those kmem caches keeps SGV pool objects (struct sgv_pool_obj)
corresponding to SG vectors of order X pages in size. For instance, a
request to allocate 4 pages will be served from kmem cache[2] (order 2).
If a request to allocate 11KB then comes, the same 4-page SG vector will
be reused (see below).
When a request to allocate a new SG vector comes, sgv_pool_alloc(), via
sgv_pool_cached_get(), checks whether there is already a cached vector
of that order. If so, that vector will be reused and its length, if
necessary, will be modified to match the requested size. In the above
example of a request for 11KB, the 4-page vector will be reused and
modified using trans_tbl to contain 3 pages, and the last entry will be
modified to contain the requested length - 2*PAGE_SIZE. If there is no
cached object, a new sgv_pool_obj will be allocated from the
corresponding kmem cache, chosen by the order of the number of requested
pages. That vector will then be filled with pages and returned.
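The order selection and the length adjustment in the 11KB example can be
checked with a few lines of user-space C. This is a sketch under stated
assumptions: 4KB pages, no page clustering (one SG entry per page), and
helper names (size_to_pages(), pages_to_order(), last_entry_len()) that
are invented here and do not appear in scst_mem.c.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4KB pages */

/* Number of pages needed to hold 'size' bytes, rounded up. */
unsigned int size_to_pages(size_t size)
{
	return (unsigned int)((size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE);
}

/* Smallest order with (1 << order) >= pages; picks the kmem cache index. */
unsigned int pages_to_order(unsigned int pages)
{
	unsigned int order = 0;

	while ((1u << order) < pages)
		order++;
	return order;
}

/* Bytes used in the last page of a buffer of 'size' bytes (size > 0). */
size_t last_entry_len(size_t size)
{
	return size - (size_t)(size_to_pages(size) - 1) * SKETCH_PAGE_SIZE;
}
```

For 11KB this gives 3 pages, order 2 (the same cache that serves a
4-page request) and a last entry of 11KB - 2*4KB = 3072 bytes.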
Cached sgv_pool_obj objects are freed to the system either by the
apit_pool work or in sgv_pool_cached_shrinker(), which the system calls
when it is asking for memory.
P.S. Solaris COMSTAR also has a similar facility.
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
drivers/scst/scst_mem.c | 1336 ++++++++++++++++++++++++++++++++++++++++++++++++
drivers/scst/scst_mem.h | 149 +++++
include/scst/scst_sgv.h | 60 ++
3 files changed, 1545 insertions(+)
The patch is too big to be submitted inline. You can find it in
http://scst.sourceforge.net/patches/scst_sgv.diff
This patch contains the changes necessary to integrate SCST into the
kernel build system by adding the SCST target framework as an entry
under drivers.
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
drivers/Kconfig | 2 ++
drivers/Makefile | 1 +
2 files changed, 3 insertions(+)
diff -upkr -X linux-2.6.26/Documentation/dontdiff linux-2.6.26/drivers/Kconfig linux-2.6.26/drivers/Kconfig
--- orig/linux-2.6.27/drivers/Kconfig 01:51:29.000000000 +0400
+++ linux-2.6.27/drivers/Kconfig 14:14:46.000000000 +0400
@@ -24,6 +24,8 @@ source "drivers/ide/Kconfig"
source "drivers/scsi/Kconfig"
+source "drivers/scst/Kconfig"
+
source "drivers/ata/Kconfig"
source "drivers/md/Kconfig"
diff -upkr -X linux-2.6.26/Documentation/dontdiff linux-2.6.26/drivers/Makefile linux-2.6.26/drivers/Makefile
--- orig/linux-2.6.27/drivers/Makefile 01:51:29.000000000 +0400
+++ linux-2.6.27/drivers/Makefile 14:15:29.000000000 +0400
@@ -39,6 +39,7 @@ obj-$(CONFIG_ATM) += atm/
obj-y += macintosh/
obj-$(CONFIG_IDE) += ide/
obj-$(CONFIG_SCSI) += scsi/
+obj-$(CONFIG_SCST) += scst/
obj-$(CONFIG_ATA) += ata/
obj-$(CONFIG_FUSION) += message/
obj-$(CONFIG_FIREWIRE) += firewire/
This patch contains the SCST pass-through backend dev handlers. There
are handlers for disks (type 0), tapes (type 1), processors (type 3),
CDROMs (type 5), MO disks (type 7), medium changers (type 8) and RAID
controllers (type 0xC).
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
drivers/scst/dev_handlers/scst_cdrom.c | 299 +++++++++++++++++++++++++++++++++++++
drivers/scst/dev_handlers/scst_changer.c | 229 ++++++++++++++++++++++++++++
drivers/scst/dev_handlers/scst_dev_handler.h | 113 ++++++++++++++
drivers/scst/dev_handlers/scst_disk.c | 384 +++++++++++++++++++++++++++++++++++++++++++++++
drivers/scst/dev_handlers/scst_modisk.c | 406 ++++++++++++++++++++++++++++++++++++++++++++++++++
drivers/scst/dev_handlers/scst_processor.c | 229 ++++++++++++++++++++++++++++
drivers/scst/dev_handlers/scst_raid.c | 229 ++++++++++++++++++++++++++++
drivers/scst/dev_handlers/scst_tape.c | 426 +++++++++++++++++++++++++++++++++++++++++++++++++++++
8 files changed, 2315 insertions(+)
The patch is too big to be submitted inline. You can find it in
http://scst.sourceforge.net/patches/scst_passthrough.diff
This patch contains the SCST virtual disk backend device handler. This
is an SCST device handler that allows exporting local files or local
block devices as SCSI devices through the SCST target framework. This
handler has three modes:
- FILEIO mode, which allows using files on file systems or block
devices as virtual, remotely available SCSI disks or CDROMs, with the
benefits of the Linux page cache.
- BLOCKIO mode, which performs direct block IO with a block device,
bypassing the page cache for all operations. This mode works best with
high-end storage HBAs and for applications that either do not need
caching between application and disk or need large block throughput.
- NULLIO mode, in which no real IO is performed. It is intended for
performance testing.
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
drivers/scst/dev_handlers/scst_vdisk.c | 3540 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 3540 insertions(+)
The patch is too big to be submitted inline. You can find it in
http://scst.sourceforge.net/patches/scst_vdisk.diff
This patch contains the user space backend dev handler. It allows
virtual SCSI devices to be implemented in user space within the SCST
environment. A description of its interface can be found here:
http://scst.sourceforge.net/scst_user_spec.txt
An example of an application that uses this interface is fileio_tgt. You
can download this software from
https://sourceforge.net/project/showfiles.php?group_id=110471&package_id=283232.
This application is a full-featured virtual disk emulator similar to
the scst_vdisk backend dev handler.
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
drivers/scst/dev_handlers/scst_user.c | 3059 ++++++++++++++++++++++++++++++++++
include/scst/scst_user.h | 266 ++
2 files changed, 3325 insertions(+)
The patch is too big to be submitted inline. You can find it in
http://scst.sourceforge.net/patches/scst_user.diff
This patch contains the Makefile for the SCST backend handlers.
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
drivers/scst/dev_handlers/Makefile | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff -uprN orig/linux-2.6.27/drivers/scst/dev_handlers/Makefile linux-2.6.27/drivers/scst/dev_handlers/Makefile
--- orig/linux-2.6.27/drivers/scst/dev_handlers/Makefile
+++ linux-2.6.27/drivers/scst/dev_handlers/Makefile
@@ -0,0 +1,14 @@
+EXTRA_CFLAGS += -Iinclude/scst -Wno-unused-parameter
+
+obj-m := scst_cdrom.o scst_changer.o scst_disk.o scst_modisk.o scst_tape.o \
+ scst_vdisk.o scst_raid.o scst_processor.o scst_user.o
+
+obj-$(CONFIG_SCST_DISK) += scst_disk.o
+obj-$(CONFIG_SCST_TAPE) += scst_tape.o
+obj-$(CONFIG_SCST_CDROM) += scst_cdrom.o
+obj-$(CONFIG_SCST_MODISK) += scst_modisk.o
+obj-$(CONFIG_SCST_CHANGER) += scst_changer.o
+obj-$(CONFIG_SCST_RAID) += scst_raid.o
+obj-$(CONFIG_SCST_PROCESSOR) += scst_processor.o
+obj-$(CONFIG_SCST_VDISK) += scst_vdisk.o
+obj-$(CONFIG_SCST_USER) += scst_user.o
This patch adds a new function, scsi_execute_async_fifo(), to the
kernel. This function is the same as scsi_execute_async(), but it queues
requests in FIFO order, not in LIFO order as scsi_execute_async() does.
scsi_execute_async_fifo() is needed by SCST to queue commands from
remote initiators to SCST devices in pass-through mode. There is no
simple alternative to this function, because SCST has to deal with SG
vectors and can't deal with bio's; hardware target drivers require that.
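The difference between the two functions comes down to the at_head
argument that is passed on to blk_execute_rq_nowait(). The toy model
below shows the effect on ordering. It is a user-space sketch only:
struct toy_queue, toy_insert() and toy_dispatch() are made-up names, and
the real block layer queue is of course far more involved.

```c
#include <stddef.h>

#define QUEUE_MAX 8

/* Toy dispatch queue: requests are always taken from the head (index 0). */
struct toy_queue {
	int req[QUEUE_MAX];
	size_t count;
};

/* Insert 'id' at the head (at_head != 0) or at the tail (at_head == 0). */
int toy_insert(struct toy_queue *q, int id, int at_head)
{
	size_t i;

	if (q->count == QUEUE_MAX)
		return -1;
	if (at_head) {
		/* Shift everything back by one slot to free the head. */
		for (i = q->count; i > 0; i--)
			q->req[i] = q->req[i - 1];
		q->req[0] = id;
	} else {
		q->req[q->count] = id;
	}
	q->count++;
	return 0;
}

/* Remove and return the request at the head, or -1 if the queue is empty. */
int toy_dispatch(struct toy_queue *q)
{
	size_t i;
	int id;

	if (q->count == 0)
		return -1;
	id = q->req[0];
	for (i = 1; i < q->count; i++)
		q->req[i - 1] = q->req[i];
	q->count--;
	return id;
}
```

Queuing requests 1, 2, 3 with at_head set dispatches them as 3, 2, 1;
with at_head cleared they come out as 1, 2, 3, which is the order a
target needs in order to preserve the initiator's command sequence.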
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
drivers/scsi/scsi_lib.c | 58 +++++++++++++++++++++++++++++++++++++++++----
include/scsi/scsi_device.h | 8 ++++++
2 files changed, 62 insertions(+), 4 deletions(-)
diff -upr linux-2.6.26/drivers/scsi/scsi_lib.c linux-2.6.26/drivers/scsi/scsi_lib.c
--- linux-2.6.26/drivers/scsi/scsi_lib.c 2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/drivers/scsi/scsi_lib.c 2008-07-31 21:20:00.000000000 +0400
@@ -372,7 +372,7 @@ free_bios:
}
/**
- * scsi_execute_async - insert request
+ * __scsi_execute_async - insert request
* @sdev: scsi device
* @cmd: scsi command
* @cmd_len: length of scsi cdb
@@ -385,11 +385,14 @@ free_bios:
* @privdata: data passed to done()
* @done: callback function when done
* @gfp: memory allocation flags
+ * @at_head: insert request at head or tail of queue
*/
-int scsi_execute_async(struct scsi_device *sdev, const unsigned char *cmd,
+static inline int __scsi_execute_async(struct scsi_device *sdev,
+ const unsigned char *cmd,
int cmd_len, int data_direction, void *buffer, unsigned bufflen,
int use_sg, int timeout, int retries, void *privdata,
- void (*done)(void *, char *, int, int), gfp_t gfp)
+ void (*done)(void *, char *, int, int), gfp_t gfp,
+ int at_head)
{
struct request *req;
struct scsi_io_context *sioc;
@@ -426,7 +429,7 @@ int scsi_execute_async(struct scsi_devic
sioc->data = privdata;
sioc->done = done;
- blk_execute_rq_nowait(req->q, NULL, req, 1, scsi_end_async);
+ blk_execute_rq_nowait(req->q, NULL, req, at_head, scsi_end_async);
return 0;
free_req:
@@ -435,8 +438,55 @@ free_sense:
kmem_cache_free(scsi_io_context_cache, sioc);
return DRIVER_ERROR << 24;
}
+
+/**
+ * scsi_execute_async - insert request
+ * @sdev: scsi device
+ * @cmd: scsi command
+ * @cmd_len: length of scsi cdb
+ * @data_direction: data direction
+ * @buffer: data buffer (this can be a kernel buffer or scatterlist)
+ * @bufflen: len of buffer
+ * @use_sg: if buffer is a scatterlist this is the number of elements
+ * @timeout: request timeout in seconds
+ * @retries: number of times to retry request
+ * @gfp: memory allocation flags
+ **/
+int scsi_execute_async(struct scsi_device *sdev, const unsigned char *cmd,
+ int cmd_len, int data_direction, void *buffer,
+ unsigned bufflen, int use_sg, int timeout,
+ int retries, void *privdata,
+ void (*done)(void *, char *, int, int), gfp_t gfp)
+{
+ return __scsi_execute_async(sdev, cmd, cmd_len, data_direction, buffer,
+ bufflen, use_sg, timeout, retries, privdata, done, gfp, 1);
+}
EXPORT_SYMBOL_GPL(scsi_execute_async);
+/**
+ * scsi_execute_async_fifo - insert request at tail, in FIFO order
+ * @sdev: scsi device
+ * @cmd: scsi command
+ * @cmd_len: length of scsi cdb
+ * @data_direction: data direction
+ * @buffer: data buffer (this can be a kernel buffer or scatterlist)
+ * @bufflen: len of buffer
+ * @use_sg: if buffer is a scatterlist this is the number of elements
+ * @timeout: request timeout in seconds
+ * @retries: number of times to retry request
+ * @gfp: memory allocation flags
+ **/
+int scsi_execute_async_fifo(struct scsi_device *sdev, const unsigned char *cmd,
+ int cmd_len, int data_direction, void *buffer,
+ unsigned bufflen, int use_sg, int timeout, int retries,
+ void *privdata,
+ void (*done)(void *, char *, int, int), gfp_t gfp)
+{
+ return __scsi_execute_async(sdev, cmd, cmd_len, data_direction, buffer,
+ bufflen, use_sg, timeout, retries, privdata, done, gfp, 0);
+}
+EXPORT_SYMBOL_GPL(scsi_execute_async_fifo);
+
/*
* Function: scsi_init_cmd_errh()
*
diff -upr linux-2.6.26/include/scsi/scsi_device.h linux-2.6.26/include/scsi/scsi_device.h
--- linux-2.6.26/include/scsi/scsi_device.h 2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/include/scsi/scsi_device.h 2008-07-31 21:20:39.000000000 +0400
@@ -365,6 +365,14 @@ extern int scsi_execute_async(struct scs
int timeout, int retries, void *privdata,
void (*done)(void *, char *, int, int),
gfp_t gfp);
+#define SCSI_EXEC_REQ_FIFO_DEFINED
+extern int scsi_execute_async_fifo(struct scsi_device *sdev,
+ const unsigned char *cmd, int cmd_len,
+ int data_direction, void *buffer,
+ unsigned bufflen, int use_sg,
+ int timeout, int retries, void *privdata,
+ void (*done)(void *, char *, int, int),
+ gfp_t gfp);
static inline int __must_check scsi_device_reprobe(struct scsi_device *sdev)
{
This patch exports the alloc_io_context() function. For performance
reasons SCST queues commands using a pool of IO threads. Performance is
considerably better (>30% increase on sequential reads) if the threads
in a pool share the same IO context. Since SCST can be built as a
module, it needs the alloc_io_context() function to be exported.
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
block/blk-ioc.c | 1 +
1 file changed, 1 insertion(+)
diff -upkr linux-2.6.27.2/block/blk-ioc.c linux-2.6.27.2/block/blk-ioc.c
--- linux-2.6.27.2/block/blk-ioc.c 2008-10-10 02:13:53.000000000 +0400
+++ linux-2.6.27.2/block/blk-ioc.c 2008-11-25 21:27:01.000000000 +0300
@@ -105,6 +105,7 @@ struct io_context *alloc_io_context(gfp_
return ret;
}
+EXPORT_SYMBOL(alloc_io_context);
/*
* If the current task has no IO context then create one and initialise it.
This patch adds the functionality to the qla2xxx driver that is
necessary to support the SCST target mode add-on. Basically, it only
adds the constants, hooks, sysfs entries and initialization actions
needed by the target mode add-on. It doesn't touch any core
functionality of the qla2xxx driver.
This patch is meant to be used together with patches
http://article.gmane.org/gmane.linux.scsi/43495 and
http://article.gmane.org/gmane.linux.scsi/43475, but doesn't strictly
require them.
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
drivers/scsi/qla2xxx/Kconfig | 10 +
drivers/scsi/qla2xxx/qla2x_tgt.h | 120 ++++++++++++++++++
drivers/scsi/qla2xxx/qla2x_tgt_def.h | 362 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
drivers/scsi/qla2xxx/qla_attr.c | 324 ++++++++++++++++++++++++++++++++++++++++++++++++++
drivers/scsi/qla2xxx/qla_def.h | 11 +
drivers/scsi/qla2xxx/qla_gbl.h | 9 +
drivers/scsi/qla2xxx/qla_gs.c | 11 +
drivers/scsi/qla2xxx/qla_init.c | 235 ++++++++++++++++++++++++++++++++----
drivers/scsi/qla2xxx/qla_iocb.c | 7 -
drivers/scsi/qla2xxx/qla_isr.c | 123 +++++++++++++++++++
drivers/scsi/qla2xxx/qla_mbx.c | 14 +-
drivers/scsi/qla2xxx/qla_os.c | 14 ++
12 files changed, 1201 insertions(+), 39 deletions(-)
The patch is too big to be submitted inline. You can find it in
http://scst.sourceforge.net/patches/qla_tgt.diff
This patch contains the target mode add-on for the qla2xxx driver for
QLogic 22xx/23xx cards.
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
drivers/scst/qla2xxx-target/Kconfig | 19
drivers/scst/qla2xxx-target/Makefile | 6
drivers/scst/qla2xxx-target/qla2x00t.c | 2280 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
drivers/scst/qla2xxx-target/qla2x00t.h | 163 ++++
4 files changed, 2468 insertions(+)
The patch is too big to be submitted inline. You can find it in
http://scst.sourceforge.net/patches/qla2x00t.diff
This patch contains the documentation for the QLogic target mode add-on.
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
/Documentation/scst/README.qla2x00t | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 117 insertions(+)
diff -uprN orig/linux-2.6.27/Documentation/scst/README.qla2x00t linux-2.6.27/Documentation/scst/README.qla2x00t
--- orig/linux-2.6.27/Documentation/scst/README.qla2x00t
+++ linux-2.6.27/Documentation/scst/README.qla2x00t
@@ -0,0 +1,117 @@
+Target driver for QLogic 2200/2300 Fibre Channel cards
+======================================================
+
+Version 1.0.1, XX XXXX 2008
+---------------------------
+
+This driver has all required features and looks to be quite stable (for
+beta) and useful. It consists of two parts: the target mode driver
+itself and the modified initiator driver from the Linux kernel, which,
+in particular, performs all the initialization and shutdown tasks. The
+initiator driver was changed to provide target mode support and all
+necessary callbacks, but it is still capable of working as an initiator
+only. The mode in which a host acts as the initiator and the target
+simultaneously is supported as well.
+
+This version is compatible with SCST core version 1.0.1 and higher and
+Linux kernel 2.6.27 and higher. Backport patches to kernels earlier than
+2.6.27 are welcome.
+
+The original initiator driver was taken from the kernel 2.6.27.
+
+If you need to use this driver on kernels prior to 2.6.27, it is
+recommended to use version 1.0.0.x of this driver, taken from branch
+1.0.0.x. For 2.6.26 you can also use the version of this driver from SVN
+trunk/ revision 577.
+
+See also the "ToDo" file for the list of known issues and unimplemented
+features.
+
+Installation
+------------
+
+Only vanilla kernels from kernel.org are supported, but the driver should
+work on vendors' kernels, if you manage to compile it on them. The
+main problem with vendors' kernels is that they often contain patches
+that will appear only in the next version of the vanilla kernel,
+so it is quite hard to track such changes. Thus, if during
+compilation for some vendor kernel your compiler complains about
+redefinition of some symbol, you should either switch to a vanilla kernel
+or adjust the "#if LINUX_VERSION_CODE" statement that corresponds to
+that symbol.
+
+First, make sure that the link "/lib/modules/`your_kernel_version`/build"
+points to the source code for your currently running kernel.
+
+Then replace the "qla2xxx" subdirectory in kernel_source/drivers/scsi/
+of the currently running kernel with the initiator driver from this
+package (or link to it). Using your favorite kernel configuration tool,
+enable target mode support in the QLogic QLA2XXX Fibre Channel driver
+(CONFIG_SCSI_QLA2XXX_TARGET). Then rebuild the kernel and its
+modules. During this step you will compile the initiator driver. To
+install it, install the built kernel and its modules.
+
+Then edit qla2x00-target/Makefile and set the SCST_INC_DIR variable to
+point to the directory where SCST's public include files are located. If
+you install the QLA2x00 target driver's source code in SCST's directory,
+then SCST_INC_DIR will be set correctly for you.
+
+You can also set the SCST_DIR variable to the directory where SCST was
+built, but this is optional. If you don't set it or set it incorrectly,
+you will get a bunch of harmless warnings during compilation, like
+"WARNING: "scst_rx_data" [/XXX/qla2x00tgt.ko] undefined!"
+
+To compile the target driver, type 'make' in the qla2x00-target/
+subdirectory. It will build the qla2x00tgt.ko module.
+
+To install the target driver, type 'make install' in the qla2x00-target/
+subdirectory. The target driver will be installed in
+/lib/modules/`your_kernel_version`/extra. To uninstall it, type 'make
+uninstall'.
+
+After the drivers are loaded and the adapters are successfully initialized
+by the initiator driver, including firmware image load, target mode
+should be enabled via a sysfs interface on a per-card basis. Under the
+appropriate scsi_host there is an entry target_mode_enabled, to which
+you should write "1", like:
+
+echo "1" >/sys/class/scsi_host/host0/target_mode_enabled
+
+Then you should configure exported devices using the corresponding
+interface of the SCST core. It is highly recommended to use the scstadmin
+utility for that purpose.
+
+Compilation options
+-------------------
+
+The following compilation options can be commented in or out in the
+Makefile:
+
+ - CONFIG_SCST_DEBUG - turns on some debugging code, including some logging.
+ Makes the driver considerably bigger and slower, and produces a large
+ amount of log data.
+
+ - CONFIG_SCST_TRACING - turns on ability to log events. Makes the driver
+ considerably bigger and leads to some performance loss.
+
+ - CONFIG_QLA_TGT_DEBUG_WORK_IN_THREAD - makes SCST process incoming
+ commands from the qla2x00t target driver and call the driver's
+ callbacks in the context of internal SCST threads instead of the SIRQ
+ context in which those commands were received. Useful for debugging,
+ but leads to some performance loss.
+
+Credits
+-------
+
+Thanks to
+
+ * Nathaniel Clark <[email protected]> for porting to the new 2.6 kernel
+initiator driver.
+
+ * Mark Buechler <[email protected]> for the original
+WWN-based authentication, a lot of useful suggestions, bug reports and
+help in debugging.
+
+ * Ming Zhang <[email protected]> for his fixes.
+
+Vladislav Bolkhovitin <[email protected]>, http://scst.sourceforge.net
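For reviewers who want to script the per-card enable step from the README above, here is a minimal user space sketch of what the "echo" command does. It is an illustration only and not part of the patch: the helper names (target_mode_path, write_attr, enable_target_mode) are hypothetical, and the only thing taken from the README is the attribute path /sys/class/scsi_host/hostN/target_mode_enabled.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Build the attribute path for SCSI host number host_no (hypothetical helper). */
int target_mode_path(char *buf, size_t len, int host_no)
{
	int n;

	assert(buf != NULL);
	n = snprintf(buf, len,
		     "/sys/class/scsi_host/host%d/target_mode_enabled",
		     host_no);
	return (n < 0 || (size_t)n >= len) ? -ENAMETOOLONG : 0;
}

/*
 * Write a short string to an attribute file.
 * Returns 0 on success or a negative errno value on failure.
 */
int write_attr(const char *path, const char *value)
{
	FILE *f;
	size_t len;
	int err = 0;

	assert(path != NULL && value != NULL);
	f = fopen(path, "w");
	if (f == NULL)
		return -errno;
	len = strlen(value);
	if (fwrite(value, 1, len, f) != len)
		err = -EIO;
	/* Buffered data is flushed on close, so close errors matter too. */
	if (fclose(f) == EOF && err == 0)
		err = -errno;
	return err;
}

/* Enable target mode on one card; fails if the attribute does not exist. */
int enable_target_mode(int host_no)
{
	char path[128];
	int err = target_mode_path(path, sizeof(path), host_no);

	return err ? err : write_attr(path, "1\n");
}
```

A caller would invoke enable_target_mode(0) for host0 after the initiator driver has finished initializing the adapter, and treat a negative return value as "target mode support is not present on this host".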
This patch contains the target mode driver for InfiniBand SRP.
This driver works directly on top of the InfiniBand stack and SCST.
Signed-off-by: Vu Pham <[email protected]>
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
drivers/scst/srpt/Kconfig | 12
drivers/scst/srpt/Makefile | 4
drivers/scst/srpt/ib_dm_mad.h | 106 ++
drivers/scst/srpt/ib_srpt.c | 2307 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
drivers/scst/srpt/ib_srpt.h | 201 +++++
5 files changed, 2630 insertions(+)
The patch is too big to be submitted inline. You can find it in
http://scst.sourceforge.net/patches/srpt.diff
This patch contains documentation for the SRP target driver.
Signed-off-by: Vu Pham <[email protected]>
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
Documentation/scst/README.srpt | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 85 insertions(+)
diff -uprN orig/linux-2.6.27/Documentation/scst/README.srpt linux-2.6.27/Documentation/scst/README.srpt
--- orig/linux-2.6.27/Documentation/scst/README.srpt
+++ linux-2.6.27/Documentation/scst/README.srpt
@@ -0,0 +1,85 @@
+SCSI RDMA Protocol (SRP) Target driver for Linux
+=================================================
+
+The SRP Target driver is designed to work directly on top of the
+OpenFabrics OFED-1.x software stack (http://www.openfabrics.org) or
+the InfiniBand drivers in the Linux kernel tree
+(http://www.kernel.org). The SRP target driver also interfaces with
+the generic SCSI target mid-level driver called SCST
+(http://scst.sourceforge.net).
+
+How-to run
+-----------
+
+A. On the SRP target machine
+1. Please refer to SCST's README for loading the scst driver and its
+dev_handlers drivers (scst_disk, scst_vdisk block or file IO mode, nullio, ...)
+
+Example 1: working with real back-end scsi disks
+a. modprobe scst
+b. modprobe scst_disk
+c. cat /proc/scsi_tgt/scsi_tgt
+
+ibstor00:~ # cat /proc/scsi_tgt/scsi_tgt
+Device (host:ch:id:lun or name) Device handler
+0:0:0:0 dev_disk
+4:0:0:0 dev_disk
+5:0:0:0 dev_disk
+6:0:0:0 dev_disk
+7:0:0:0 dev_disk
+
+Now, to exclude the first SCSI disk and expose the last 4 SCSI disks as
+IB/SRP LUNs for I/O:
+echo "add 4:0:0:0 0" >/proc/scsi_tgt/groups/Default/devices
+echo "add 5:0:0:0 1" >/proc/scsi_tgt/groups/Default/devices
+echo "add 6:0:0:0 2" >/proc/scsi_tgt/groups/Default/devices
+echo "add 7:0:0:0 3" >/proc/scsi_tgt/groups/Default/devices
+
+Example 2: working with VDISK FILEIO mode (using md0 device and file 10G-file)
+a. modprobe scst
+b. modprobe scst_vdisk
+c. echo "open vdisk0 /dev/md0" > /proc/scsi_tgt/vdisk/vdisk
+d. echo "open vdisk1 /10G-file" > /proc/scsi_tgt/vdisk/vdisk
+e. echo "add vdisk0 0" >/proc/scsi_tgt/groups/Default/devices
+f. echo "add vdisk1 1" >/proc/scsi_tgt/groups/Default/devices
+
+Example 3: working with VDISK BLOCKIO mode (using md0 device, sda, and cciss/c1d0)
+a. modprobe scst
+b. modprobe scst_vdisk
+c. echo "open vdisk0 /dev/md0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
+d. echo "open vdisk1 /dev/sda BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
+e. echo "open vdisk2 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
+f. echo "add vdisk0 0" >/proc/scsi_tgt/groups/Default/devices
+g. echo "add vdisk1 1" >/proc/scsi_tgt/groups/Default/devices
+h. echo "add vdisk2 2" >/proc/scsi_tgt/groups/Default/devices
+
+2. modprobe ib_srpt
+
+
+B. On initiator machines you can manually do the following steps:
+1. modprobe ib_srp
+2. ibsrpdm -c (to discover new SRP target)
+3. echo <new target info> > /sys/class/infiniband_srp/srp-mthca0-1/add_target
+4. fdisk -l (will show newly discovered SCSI disks)
+
+Example:
+Assume that you use port 1 of the first HCA in the system, i.e. mthca0
+
+[root@lab104 ~]# ibsrpdm -c -d /dev/infiniband/umad0
+id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,
+dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4
+[root@lab104 ~]# echo id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,
+dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4 >
+/sys/class/infiniband_srp/srp-mthca0-1/add_target
+
+OR
+
++ You can edit /etc/infiniband/openib.conf to load the srp driver and srp HA
+daemon automatically, i.e. set SRP_LOAD=yes and SRPHA_ENABLE=yes
++ To set up and use the high availability feature you need the dm-multipath
+driver and multipath tool
++ Please refer to the OFED-1.x SRP user manual for more detailed instructions
+on how to enable/use the HA feature
+
+To minimize QUEUEFULL conditions, you can apply the scst_increase_max_tgt_cmds
+patch from the SRPT package at http://sourceforge.net/project/showfiles.php?group_id=110471
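As an illustration of the initiator-side step in the README above, the following user space sketch builds the single-line target description that is written to add_target from the five fields printed by "ibsrpdm -c". The struct and function names are hypothetical and not part of the patch; only the key names and their order come from the README example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields printed by "ibsrpdm -c", kept as the hex strings it emits. */
struct srp_target_id {
	const char *id_ext;
	const char *ioc_guid;
	const char *dgid;
	const char *pkey;
	const char *service_id;
};

/* A field must be non-empty and must not contain the ',' separator. */
static int valid_field(const char *s)
{
	return s != NULL && *s != '\0' && strchr(s, ',') == NULL;
}

/*
 * Format the target description for add_target into buf.
 * Returns the string length, or -1 on a bad field or a too small buffer.
 */
int srp_format_target(char *buf, size_t len, const struct srp_target_id *t)
{
	int n;

	assert(buf != NULL && t != NULL);
	if (!valid_field(t->id_ext) || !valid_field(t->ioc_guid) ||
	    !valid_field(t->dgid) || !valid_field(t->pkey) ||
	    !valid_field(t->service_id))
		return -1;

	n = snprintf(buf, len,
		     "id_ext=%s,ioc_guid=%s,dgid=%s,pkey=%s,service_id=%s",
		     t->id_ext, t->ioc_guid, t->dgid, t->pkey, t->service_id);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

The resulting string is what would be written, as one line, to /sys/class/infiniband_srp/srp-mthca0-1/add_target in the README example.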
This patch contains a driver which allows the creation of user space
target drivers.
It was written by Richard Sharpe based on the scsi_debug driver. It allows
access to devices that are exported via SCST directly on the same Linux
system that they are exported from. Those devices appear as regular
/dev/sg, sd, st, etc. devices, so a user space target driver can use
these device nodes to execute SCSI commands on. A possible work flow is
as follows:
- A target driver receives a new connection.
- Using the corresponding interface of the scst_local driver (see below)
it will create a new SCST session, which will be assigned to the
corresponding ACL, and get a set of the corresponding sg devices.
- It will open them exclusively, so no other program is able to
interfere with it.
- It will start exchanging SCSI commands using the open sg devices.
- Then, when the initiator closes the session, the target driver will
delete the corresponding SCST session.
An ACL here is an "Access Control Group", also called a "security group"
in the SCST documentation. It allows different initiators to see
different sets of devices with different access permissions. This feature
is also often called "LUN masking".
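To make the LUN masking idea concrete, here is a small stand-alone sketch of a group lookup. It is not SCST code: the structures and function names are hypothetical, and the fallback to the first ("Default") group for unknown initiators is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ACG_MAX_INITIATORS 4
#define ACG_MAX_LUNS 4

/* One exported device and the LUN under which a group sees it. */
struct lun_map {
	const char *device;	/* e.g. "vdisk0" or "4:0:0:0" */
	unsigned int lun;
	int read_only;
};

/* A security group: a set of initiators sharing one view of the LUNs. */
struct acg {
	const char *name;
	const char *initiators[ACG_MAX_INITIATORS];	/* NULL-terminated */
	struct lun_map luns[ACG_MAX_LUNS];
	size_t nr_luns;
};

/*
 * Find the group an initiator belongs to. groups[0] is assumed to be the
 * default group, used when no other group lists the initiator.
 */
const struct acg *acg_find(const struct acg *groups, size_t nr,
			   const char *initiator)
{
	size_t i, j;

	assert(groups != NULL && nr > 0 && initiator != NULL);
	for (i = 1; i < nr; i++)
		for (j = 0; j < ACG_MAX_INITIATORS &&
			    groups[i].initiators[j] != NULL; j++)
			if (strcmp(groups[i].initiators[j], initiator) == 0)
				return &groups[i];
	return &groups[0];
}

/* Return the mapping for a LUN, or NULL if it is masked for this group. */
const struct lun_map *acg_lookup_lun(const struct acg *g, unsigned int lun)
{
	size_t i;

	assert(g != NULL);
	for (i = 0; i < g->nr_luns; i++)
		if (g->luns[i].lun == lun)
			return &g->luns[i];
	return NULL;
}
```

With such a table, two initiators sending a command to the same LUN number can reach different devices, or one of them can find the LUN absent, depending only on which group the session was assigned to.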
At the moment the scst_local driver isn't complete in the area of the
interface that allows each initiator to have its own dedicated session
assigned to the corresponding ACL. Only basic functionality via the
scst_ini_targ_debug /sys entry is implemented.
There are two possible approaches. We need your advice about which
approach is better.
On load, the scst_local module would register an SCST target and session
with target and initiator names taken from scst_local_targ_tmpl.name. Then
it would create all devices available from SCST. It would be possible to
override both target and initiator names when loading the scst_local
module via the corresponding module parameters.
Each created SCSI host would have three attributes in sysfs, namely
target_name, initiator_name and host_no. The host_no attribute would hold
the host number of the created SCSI host.
Possible approaches are the following:
1. Sysfs-based.
- scst_local_add_host_store() would accept 2 parameters, separated by
',': target name and initiator name. If a target with that name already
exists, it would be reused, otherwise created. Then
scst_local_add_host_store() would create a new SCSI host and add to it
the devices available from SCST.
- If the first non-blank character scst_local_add_host_store() receives
is '-', it would remove the corresponding SCST session. If the target has
no other sessions after the session removal, it would be removed too.
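To help judge the sysfs variant, here is a stand-alone sketch of how the proposed string could be parsed. It is written as plain user space C so it can be tried outside the kernel, and it rests on assumptions that the proposal does not spell out: that a removal request has the form "-target,initiator", that surrounding whitespace (such as the newline added by echo) is ignored, and that names are limited to a hypothetical SCST_LOCAL_NAME_LEN. The struct and function names are illustrative only.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define SCST_LOCAL_NAME_LEN 64	/* hypothetical name length limit */

struct add_host_req {
	int remove;	/* 1 if the request started with '-' */
	char target[SCST_LOCAL_NAME_LEN];
	char initiator[SCST_LOCAL_NAME_LEN];
};

/* Copy len bytes of s into dst, dropping trailing whitespace. */
static int copy_name(char *dst, const char *s, size_t len)
{
	while (len > 0 && isspace((unsigned char)s[len - 1]))
		len--;
	if (len == 0 || len >= SCST_LOCAL_NAME_LEN)
		return -1;
	memcpy(dst, s, len);
	dst[len] = '\0';
	return 0;
}

/* Parse "[-]target,initiator". Returns 0 on success, -1 on bad input. */
int parse_add_host(const char *buf, struct add_host_req *req)
{
	const char *comma;

	assert(buf != NULL && req != NULL);
	memset(req, 0, sizeof(*req));

	while (isspace((unsigned char)*buf))
		buf++;
	if (*buf == '-') {	/* first non-blank character selects removal */
		req->remove = 1;
		buf++;
	}
	while (isspace((unsigned char)*buf))
		buf++;

	comma = strchr(buf, ',');
	if (comma == NULL)
		return -1;
	if (copy_name(req->target, buf, (size_t)(comma - buf)) < 0)
		return -1;

	comma++;
	while (isspace((unsigned char)*comma))
		comma++;
	return copy_name(req->initiator, comma, strlen(comma));
}
```

A store method built this way would then either look up or create the target and add the host, or remove the session, depending on the remove flag.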
For instance, on a new connection (session) from a remote initiator, a
user space target driver:
- Using the "add_host" command would create a new SCST session, in a new
target if it's the first session.
- Then it would search through sysfs and, using the target and
initiator names, find the host number of the newly created SCSI host.
- Then, e.g. by using the lsscsi utility, it would find the corresponding
sg devices and open them.
- Then it would start exchanging commands using the open sg devices.
- Then, when the initiator closes the session, it would delete the
corresponding SCST session and, for the last session, the target.
2. IOCTL-based.
- scst_local would have two IOCTL functions: create session and delete
session. Both would accept 2 parameters: target name and initiator name.
- Create session: if a target with the given name already exists, it
would be reused, otherwise created. Then it would create a new SCSI host
and add to it the devices available from SCST. It would return the
created SCSI host number on success and -1 on failure.
- Delete session would remove the corresponding SCST session. If the
target has no other sessions after the session removal, it would be
removed too.
For instance, on a new connection (session) from a remote initiator, a
user space target driver:
- Using the "create session" IOCTL would create a new SCST session, in a
new target if it's the first session. It would return the new SCSI host
number.
- Then, e.g. by using the lsscsi utility, it would find the corresponding
sg devices and open them.
- Then it would start exchanging commands using the open sg devices.
- Then, when the initiator closes the session, it would delete the
corresponding SCST session using the "delete session" IOCTL and, for the
last session, the target.
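For discussion, the user space side of the IOCTL variant might look roughly like the sketch below. Nothing in it exists in the posted driver: the request structure, the name length limit, the magic character 's' and the request numbers are all hypothetical placeholders that would have to be agreed on.

```c
#include <assert.h>
#include <string.h>
#include <sys/ioctl.h>

#define SCST_LOCAL_NAME_LEN 64	/* hypothetical name length limit */

/* Argument shared by the two proposed requests (hypothetical layout). */
struct scst_local_session_req {
	char target_name[SCST_LOCAL_NAME_LEN];
	char initiator_name[SCST_LOCAL_NAME_LEN];
};

/* Hypothetical request numbers; the magic 's' is a placeholder. */
#define SCST_LOCAL_CREATE_SESSION \
	_IOW('s', 1, struct scst_local_session_req)
#define SCST_LOCAL_DELETE_SESSION \
	_IOW('s', 2, struct scst_local_session_req)

/* Fill a request, rejecting names that do not fit. Returns 0 or -1. */
int scst_local_fill_req(struct scst_local_session_req *req,
			const char *target, const char *initiator)
{
	assert(req != NULL && target != NULL && initiator != NULL);
	if (strlen(target) >= sizeof(req->target_name) ||
	    strlen(initiator) >= sizeof(req->initiator_name))
		return -1;
	memset(req, 0, sizeof(*req));
	strcpy(req->target_name, target);
	strcpy(req->initiator_name, initiator);
	return 0;
}

/*
 * Intended use, given a file descriptor fd for a control device that
 * scst_local would have to provide:
 *
 *	host_no = ioctl(fd, SCST_LOCAL_CREATE_SESSION, &req);
 *	...
 *	ioctl(fd, SCST_LOCAL_DELETE_SESSION, &req);
 *
 * where host_no is the new SCSI host number, or -1 on failure.
 */
```

Compared with the sysfs variant, the host number comes back directly from the call, so the search through sysfs by target and initiator name is not needed.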
I personally would prefer the IOCTL approach as easier to use.
A configfs-based approach isn't considered, because it's better to keep
all SCSI target related entries in one place in /sys/scsi_target, not in
two places in /sys/scsi_target and /sys/config/scst_local.
Signed-off-by: Richard Sharpe <[email protected]>
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
drivers/scst/scst_local/Kconfig | 10
drivers/scst/scst_local/Makefile | 10
drivers/scst/scst_local/scst_local.c | 1054 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 1074 insertions(+)
diff -uprN orig/linux-2.6.27/drivers/scst/scst_local/Kconfig linux-2.6.27/drivers/scst/scst_local/Kconfig
--- orig/linux-2.6.27/drivers/scst/scst_local/Kconfig
+++ linux-2.6.27/drivers/scst/scst_local/Kconfig
@@ -0,0 +1,10 @@
+config SCST_LOCAL
+ tristate "SCST Local driver"
+ depends on SCST
+ ---help---
+ This module provides an LLD SCSI driver that connects to
+ the SCST target mode subsystem in a loop-back manner.
+ It allows you to test target-mode device-handlers locally.
+ You will need the SCST subsystem as well.
+
+ If unsure whether you really want or need this, say N.
diff -uprN orig/linux-2.6.27/drivers/scst/scst_local/Makefile linux-2.6.27/drivers/scst/scst_local/Makefile
--- orig/linux-2.6.27/drivers/scst/scst_local/Makefile
+++ linux-2.6.27/drivers/scst/scst_local/Makefile
@@ -0,0 +1,10 @@
+SCST_INC_DIR := include/scst
+SCST_DIR := drivers/scst
+EXTRA_CFLAGS += -I$(SCST_INC_DIR) -I$(SCST_DIR)
+
+#EXTRA_CFLAGS += -DCONFIG_SCST_EXTRACHECKS
+#EXTRA_CFLAGS += -DCONFIG_SCST_TRACING
+#EXTRA_CFLAGS += -DCONFIG_SCST_DEBUG
+
+obj-$(CONFIG_SCST_LOCAL) += scst_local.o
+
diff -uprN orig/linux-2.6.27/drivers/scst/scst_local/scst_local.c linux-2.6.27/drivers/scst/scst_local/scst_local.c
--- orig/linux-2.6.27/drivers/scst/scst_local/scst_local.c
+++ linux-2.6.27/drivers/scst/scst_local/scst_local.c
@@ -0,0 +1,1054 @@
+/*
+ * Copyright (C) 2008 Richard Sharpe
+ * Copyright (C) 1992 Eric Youngdale
+ *
+ * Simulate a host adapter and an SCST target adapter back to back
+ *
+ * Based on the scsi_debug.c driver originally by Eric Youngdale and
+ * others, including D Gilbert et al
+ *
+ */
+
+#include <linux/module.h>
+
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/timer.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/genhd.h>
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/proc_fs.h>
+#include <linux/vmalloc.h>
+#include <linux/moduleparam.h>
+#include <linux/scatterlist.h>
+#include <linux/blkdev.h>
+#include <linux/completion.h>
+#include <linux/stat.h>
+
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_host.h>
+#include <scsi/scsi_tcq.h>
+#include <scsi/scsicam.h>
+#include <scsi/scsi_eh.h>
+
+/* SCST includes ... */
+#include <scst_const.h>
+#include <scst.h>
+
+#define LOG_PREFIX "scst_local"
+
+#include <scst_debug.h>
+
+
+#if defined(CONFIG_HIGHMEM4G) || defined(CONFIG_HIGHMEM64G)
+#warning "HIGHMEM kernel configurations are not supported by this module, \
+ because nowadays it isn't worth the effort. Consider changing \
+ VMSPLIT option or use a 64-bit configuration instead. See SCST core \
+ README file for details."
+#endif
+
+#ifdef CONFIG_SCST_DEBUG
+#define SCST_LOCAL_DEFAULT_LOG_FLAGS (TRACE_FUNCTION | TRACE_PID | \
+ TRACE_OUT_OF_MEM | TRACE_MGMT | TRACE_MGMT_MINOR | \
+ TRACE_MGMT_DEBUG | TRACE_MINOR | TRACE_SPECIAL)
+#else
+# ifdef CONFIG_SCST_TRACING
+#define SCST_LOCAL_DEFAULT_LOG_FLAGS (TRACE_OUT_OF_MEM | TRACE_MGMT | \
+ TRACE_MINOR | TRACE_SPECIAL)
+# endif
+#endif
+
+#if defined(CONFIG_SCST_DEBUG) || defined(CONFIG_SCST_TRACING)
+#define trace_flag scst_local_trace_flag
+static unsigned long scst_local_trace_flag = SCST_LOCAL_DEFAULT_LOG_FLAGS;
+#endif
+
+#define TRUE 1
+#define FALSE 0
+
+/*
+ * Some definitions needed by the scst portion
+ */
+static void scst_local_remove_adapter(void);
+static int scst_local_add_adapter(void);
+
+#define SCST_LOCAL_VERSION "0.9"
+static const char *scst_local_version_date = "20081130";
+
+/*
+ * Target structures that are shared between the two pieces
+ * This will have to change if we have more than one target
+ */
+static struct scst_tgt_template scst_local_targ_tmpl;
+
+/*
+ * Some max values
+ */
+#define DEF_NUM_HOST 1
+#define DEF_NUM_TGTS 1
+#define SCST_LOCAL_MAX_TARGETS 16
+#define DEF_MAX_LUNS 256
+
+/*
+ * These following defines are the SCSI Host LLD (the initiator).
+ * SCST Target Driver is below
+ */
+
+static int scst_local_add_host = DEF_NUM_HOST;
+static int scst_local_num_tgts = DEF_NUM_TGTS;
+static int scst_local_max_luns = DEF_MAX_LUNS;
+
+static int num_aborts;
+static int num_dev_resets;
+static int num_target_resets;
+
+/*
+ * Each host has multiple targets, each of which has a separate session
+ * to SCST.
+ */
+
+struct scst_local_host_info {
+ struct list_head host_list;
+ struct Scsi_Host *shost;
+ struct scst_tgt *target;
+ struct scst_session *session[SCST_LOCAL_MAX_TARGETS];
+ struct device dev;
+};
+
+#define to_scst_lcl_host(d) \
+ container_of(d, struct scst_local_host_info, dev)
+
+/*
+ * Maintains data that is needed during command processing ...
+ */
+struct scst_local_tgt_specific {
+ struct scsi_cmnd *cmnd;
+ void (*done)(struct scsi_cmnd *);
+};
+
+/*
+ * We use a pool of objects maintained by the kernel so that it is less
+ * likely to have to allocate them when we are in the data path.
+ */
+static struct kmem_cache *tgt_specific_pool;
+
+static LIST_HEAD(scst_local_host_list);
+static DEFINE_SPINLOCK(scst_local_host_list_lock);
+
+static char scst_local_proc_name[] = "scst_ini_targ_debug";
+
+static struct bus_type scst_fake_lld_bus;
+static struct device scst_fake_primary;
+
+static struct device_driver scst_local_driverfs_driver = {
+ .name = scst_local_proc_name,
+ .bus = &scst_fake_lld_bus,
+};
+
+module_param_named(add_host, scst_local_add_host, int, S_IRUGO | S_IWUSR);
+module_param_named(num_tgts, scst_local_num_tgts, int, S_IRUGO | S_IWUSR);
+module_param_named(max_luns, scst_local_max_luns, int, S_IRUGO | S_IWUSR);
+
+MODULE_AUTHOR("Richard Sharpe + ideas from SCSI_DEBUG");
+MODULE_DESCRIPTION("SCSI+SCST local adapter driver");
+MODULE_LICENSE("GPL");
+MODULE_VERSION(SCST_LOCAL_VERSION);
+
+MODULE_PARM_DESC(add_host, "0..127 hosts can be created (def=1)");
+MODULE_PARM_DESC(num_tgts, "number of targets per host (def=1)");
+MODULE_PARM_DESC(max_luns, "number of luns per target (def=256)");
+
+static int scst_local_target_register(void);
+
+static int scst_local_proc_info(struct Scsi_Host *host, char *buffer,
+ char **start, off_t offset, int length,
+ int inout)
+{
+ int len, pos, begin;
+
+ TRACE_ENTRY();
+
+ if (inout == 1)
+ return -EACCES;
+
+ begin = 0;
+ pos = len = sprintf(buffer, "scst_local adapter driver, version "
+ "%s [%s]\n"
+ "num_tgts=%d, Aborts=%d, Device Resets=%d, "
+ "Target Resets=%d\n",
+ SCST_LOCAL_VERSION, scst_local_version_date,
+ scst_local_num_tgts, num_aborts, num_dev_resets,
+ num_target_resets);
+ if (pos < offset) {
+ len = 0;
+ begin = pos;
+ }
+ if (start)
+ *start = buffer + (offset - begin);
+ len -= (offset - begin);
+ if (len > length)
+ len = length;
+
+ TRACE_EXIT_RES(len);
+ return len;
+}
+
+static ssize_t scst_local_add_host_show(struct device_driver *ddp, char *buf)
+{
+ int len = 0;
+ struct scst_local_host_info *scst_lcl_host, *tmp;
+
+ TRACE_ENTRY();
+
+ list_for_each_entry_safe(scst_lcl_host, tmp, &scst_local_host_list,
+ host_list) {
+ len += scnprintf(&buf[len], PAGE_SIZE - len, " Initiator: %s\n",
+ scst_lcl_host->session[0]->initiator_name);
+ }
+
+ TRACE_EXIT_RES(len);
+ return len;
+}
+
+static ssize_t scst_local_add_host_store(struct device_driver *ddp,
+ const char *buf, size_t count)
+{
+ int delta_hosts;
+
+ TRACE_ENTRY();
+
+ if (sscanf(buf, "%d", &delta_hosts) != 1)
+ return -EINVAL;
+ if (delta_hosts > 0) {
+ do {
+ scst_local_add_adapter();
+ } while (--delta_hosts);
+ } else if (delta_hosts < 0) {
+ do {
+ scst_local_remove_adapter();
+ } while (++delta_hosts);
+ }
+
+ TRACE_EXIT_RES(count);
+ return count;
+}
+
+static DRIVER_ATTR(add_host, S_IRUGO | S_IWUSR, scst_local_add_host_show,
+ scst_local_add_host_store);
+
+static int do_create_driverfs_files(void)
+{
+ int ret;
+
+ TRACE_ENTRY();
+
+ ret = driver_create_file(&scst_local_driverfs_driver,
+ &driver_attr_add_host);
+
+ TRACE_EXIT_RES(ret);
+ return ret;
+}
+
+static void do_remove_driverfs_files(void)
+{
+ driver_remove_file(&scst_local_driverfs_driver,
+ &driver_attr_add_host);
+}
+
+static char scst_local_info_buf[256];
+
+static const char *scst_local_info(struct Scsi_Host *shp)
+{
+ TRACE_ENTRY();
+
+ sprintf(scst_local_info_buf, "scst_local, version %s [%s], "
+ "Aborts: %d, Device Resets: %d, Target Resets: %d",
+ SCST_LOCAL_VERSION, scst_local_version_date,
+ num_aborts, num_dev_resets, num_target_resets);
+
+ TRACE_EXIT();
+ return scst_local_info_buf;
+}
+
+/*
+static int scst_local_ioctl(struct scsi_device *dev, int cmd, void __user *arg)
+{
+ TRACE_ENTRY();
+
+ if (scst_local_opt_noise & SCST_LOCAL_OPT_LLD_NOISE)
+ printk(KERN_INFO "scst_local: ioctl: cmd=0x%x\n", cmd);
+ return -EINVAL;
+
+ TRACE_EXIT();
+}
+*/
+
+static int scst_local_abort(struct scsi_cmnd *SCpnt)
+{
+ struct scst_local_host_info *scst_lcl_host;
+ int ret = 0;
+ DECLARE_COMPLETION_ONSTACK(dev_reset_completion);
+
+ TRACE_ENTRY();
+
+ scst_lcl_host = to_scst_lcl_host(scsi_get_device(SCpnt->device->host));
+
+ ret = scst_rx_mgmt_fn_tag(scst_lcl_host->session[SCpnt->device->id],
+ SCST_ABORT_TASK, SCpnt->tag, FALSE,
+ &dev_reset_completion);
+
+ wait_for_completion_interruptible(&dev_reset_completion);
+
+ ++num_aborts;
+
+ TRACE_EXIT_RES(ret);
+ return ret;
+}
+
+/*
+ * We issue a mgmt function. We should pass a structure to the function
+ * that contains our private data, so we can retrieve the status from the
+ * done routine ... TODO
+ */
+static int scst_local_device_reset(struct scsi_cmnd *SCpnt)
+{
+ struct scst_local_host_info *scst_lcl_host;
+ int lun;
+ int ret = 0;
+ DECLARE_COMPLETION_ONSTACK(dev_reset_completion);
+
+ TRACE_ENTRY();
+
+ scst_lcl_host = to_scst_lcl_host(scsi_get_device(SCpnt->device->host));
+
+ lun = SCpnt->device->lun;
+ lun = (lun & 0xFF) << 8 | ((lun & 0xFF00) >> 8); /* FIXME: LE only */
+
+ ret = scst_rx_mgmt_fn_lun(scst_lcl_host->session[SCpnt->device->id],
+ SCST_LUN_RESET,
+ (const uint8_t *)&lun,
+ sizeof(lun), FALSE,
+ &dev_reset_completion);
+
+ /*
+ * Now wait for the completion ...
+ */
+ wait_for_completion_interruptible(&dev_reset_completion);
+
+ ++num_dev_resets;
+
+ TRACE_EXIT_RES(ret);
+ return ret;
+}
+
+static int scst_local_target_reset(struct scsi_cmnd *SCpnt)
+{
+ struct scst_local_host_info *scst_lcl_host;
+ int lun;
+ int ret = 0;
+ DECLARE_COMPLETION_ONSTACK(dev_reset_completion);
+
+ TRACE_ENTRY();
+
+ scst_lcl_host = to_scst_lcl_host(scsi_get_device(SCpnt->device->host));
+
+ lun = SCpnt->device->lun;
+ lun = (lun & 0xFF) << 8 | ((lun & 0xFF00) >> 8); /* FIXME: LE only */
+
+ ret = scst_rx_mgmt_fn_lun(scst_lcl_host->session[SCpnt->device->id],
+ SCST_TARGET_RESET,
+ (const uint8_t *)&lun,
+ sizeof(lun), FALSE,
+ &dev_reset_completion);
+
+ /*
+ * Now wait for the completion ...
+ */
+ wait_for_completion_interruptible(&dev_reset_completion);
+
+ ++num_target_resets;
+
+ TRACE_EXIT_RES(ret);
+ return ret;
+}
+
+static void copy_sense(struct scsi_cmnd *cmnd, struct scst_cmd *scst_cmnd)
+{
+ int scst_cmnd_sense_len = scst_cmd_get_sense_buffer_len(scst_cmnd);
+
+ TRACE_ENTRY();
+
+ scst_cmnd_sense_len = (SCSI_SENSE_BUFFERSIZE < scst_cmnd_sense_len ?
+ SCSI_SENSE_BUFFERSIZE : scst_cmnd_sense_len);
+ memcpy(cmnd->sense_buffer, scst_cmd_get_sense_buffer(scst_cmnd),
+ scst_cmnd_sense_len);
+
+ TRACE_BUFFER("Sense set", cmnd->sense_buffer, scst_cmnd_sense_len);
+
+ TRACE_EXIT();
+ return;
+}
+
+/*
+ * Utility function to handle processing of done and allow
+ * easy insertion of error injection if desired
+ */
+static int scst_local_send_resp(struct scsi_cmnd *cmnd,
+ struct scst_cmd *scst_cmnd,
+ void (*done)(struct scsi_cmnd *),
+ int scsi_result)
+{
+ int ret = 0;
+
+ TRACE_ENTRY();
+
+ if (cmnd && scst_cmnd) {
+ /* Simulate autosense by this driver */
+ if (SAM_STAT_CHECK_CONDITION == (scsi_result & 0xFF))
+ copy_sense(cmnd, scst_cmnd);
+ }
+
+ if (cmnd)
+ cmnd->result = scsi_result;
+ if (done)
+ done(cmnd);
+
+ TRACE_EXIT_RES(ret);
+ return ret;
+}
+
+/*
+ * This does the heavy lifting ... we pass all the commands on to the
+ * target driver and have it do its magic ...
+ */
+static int scst_local_queuecommand(struct scsi_cmnd *SCpnt,
+ void (*done)(struct scsi_cmnd *))
+{
+ struct scst_local_tgt_specific *tgt_specific = NULL;
+ struct scst_local_host_info *scst_lcl_host;
+ int target = SCpnt->device->id;
+ int lun;
+ struct scst_cmd *scst_cmd = NULL;
+ scst_data_direction dir;
+
+ TRACE_ENTRY();
+
+ TRACE_DBG("targ: %d, init id %d, lun %d, cmd: 0X%02X\n",
+ target, SCpnt->device->host->hostt->this_id, SCpnt->device->lun,
+ SCpnt->cmnd[0]);
+
+ scst_lcl_host = to_scst_lcl_host(scsi_get_device(SCpnt->device->host));
+
+ scsi_set_resid(SCpnt, 0);
+
+ if (target == SCpnt->device->host->hostt->this_id) {
+ printk(KERN_ERR "%s: initiator's id used as target\n",
+ __func__);
+ return scst_local_send_resp(SCpnt, NULL, done,
+ DID_NO_CONNECT << 16);
+ }
+
+ /*
+ * Tell the target that we have a command ... but first we need
+ * to get the LUN into a format that SCST understands
+ */
+ lun = SCpnt->device->lun;
+ lun = (lun & 0xFF) << 8 | ((lun & 0xFF00) >> 8); /* FIXME: LE only */
+ scst_cmd = scst_rx_cmd(scst_lcl_host->session[SCpnt->device->id],
+ (const uint8_t *)&lun,
+ sizeof(lun), SCpnt->cmnd,
+ SCpnt->cmd_len, TRUE);
+ if (!scst_cmd) {
+ printk(KERN_ERR "%s out of memory at line %d\n",
+ __func__, __LINE__);
+ return -ENOMEM;
+ }
+
+ scst_cmd_set_tag(scst_cmd, SCpnt->tag);
+ switch (scsi_get_tag_type(SCpnt->device)) {
+ case MSG_SIMPLE_TAG:
+ scst_cmd->queue_type = SCST_CMD_QUEUE_SIMPLE;
+ break;
+ case MSG_HEAD_TAG:
+ scst_cmd->queue_type = SCST_CMD_QUEUE_HEAD_OF_QUEUE;
+ break;
+ case MSG_ORDERED_TAG:
+ scst_cmd->queue_type = SCST_CMD_QUEUE_ORDERED;
+ break;
+ case SCSI_NO_TAG:
+ default:
+ scst_cmd->queue_type = SCST_CMD_QUEUE_UNTAGGED;
+ break;
+ }
+
+ dir = SCST_DATA_NONE;
+ switch (SCpnt->sc_data_direction) {
+ case DMA_TO_DEVICE:
+ dir = SCST_DATA_WRITE;
+ break;
+ case DMA_FROM_DEVICE:
+ dir = SCST_DATA_READ;
+ break;
+ case DMA_BIDIRECTIONAL:
+ printk(KERN_ERR "%s: DMA_BIDIRECTIONAL not allowed!\n",
+ __func__);
+ return scst_local_send_resp(SCpnt, NULL, done,
+ DID_ERROR << 16);
+ /*dir = SCST_DATA_UNKNOWN;*/
+ break;
+ case DMA_NONE:
+ default:
+ dir = SCST_DATA_NONE;
+ break;
+ }
+ scst_cmd_set_expected(scst_cmd, dir, scsi_bufflen(SCpnt));
+
+ /*
+ * Defer allocating memory until all error paths are done
+ */
+ tgt_specific = kmem_cache_alloc(tgt_specific_pool, GFP_ATOMIC);
+ if (!tgt_specific) {
+ printk(KERN_ERR "%s out of memory at line %d\n",
+ __func__, __LINE__);
+ return -ENOMEM;
+ }
+ tgt_specific->cmnd = SCpnt;
+ tgt_specific->done = done;
+
+ scst_cmd_set_tgt_priv(scst_cmd, tgt_specific);
+
+ /* Set the SGL things directly ... */
+ scst_cmd_set_tgt_sg(scst_cmd, scsi_sglist(SCpnt), scsi_sg_count(SCpnt));
+
+ /*
+ * Unfortunately, we are called with IRQs disabled, so we have no choice
+ * except to pass to the thread context.
+ */
+ scst_cmd_init_done(scst_cmd, SCST_CONTEXT_THREAD);
+
+ /*
+ * We are done here I think. Other callbacks move us forward.
+ */
+ TRACE_EXIT();
+ return 0;
+}
+
+static void scst_local_release_adapter(struct device *dev)
+{
+ struct scst_local_host_info *scst_lcl_host;
+ int i = 0;
+
+ TRACE_ENTRY();
+ scst_lcl_host = to_scst_lcl_host(dev);
+ if (scst_lcl_host) {
+ for (i = 0; i < scst_local_num_tgts; i++)
+ if (scst_lcl_host->session[i])
+ scst_unregister_session(
+ scst_lcl_host->session[i], TRUE, NULL);
+ scst_unregister(scst_lcl_host->target);
+ kfree(scst_lcl_host);
+ }
+
+ TRACE_EXIT();
+}
+
+/*
+ * Add an adapter on the host side ... We add the target before we add
+ * the host (initiator) so that we don't get any requests before we are
+ * ready for them.
+ *
+ * I want to convert this so we can map many hosts to a smaller number of
+ * targets to support the simulation of multiple initiators.
+ */
+static int scst_local_add_adapter(void)
+{
+ int error = 0, i = 0;
+ struct scst_local_host_info *scst_lcl_host;
+ char name[32];
+
+ TRACE_ENTRY();
+
+ scst_lcl_host = kzalloc(sizeof(struct scst_local_host_info),
+ GFP_KERNEL);
+ if (NULL == scst_lcl_host) {
+ printk(KERN_ERR "%s out of memory at line %d\n",
+ __func__, __LINE__);
+ return -ENOMEM;
+ }
+
+ spin_lock(&scst_local_host_list_lock);
+ list_add_tail(&scst_lcl_host->host_list, &scst_local_host_list);
+ spin_unlock(&scst_local_host_list_lock);
+
+ /*
+ * Register a target with SCST and add a session
+ */
+ sprintf(name, "scstlcltgt%d", scst_local_add_host);
+ scst_lcl_host->target = scst_register(&scst_local_targ_tmpl, name);
+ if (!scst_lcl_host->target) {
+ printk(KERN_WARNING "scst_register failed:\n");
+ error = -1;
+ goto cleanup;
+ }
+
+ /*
+ * Create a session for each device
+ */
+ for (i = 0; i < scst_local_num_tgts; i++) {
+ sprintf(name, "scstlclhst%d:%d", scst_local_add_host, i);
+ scst_lcl_host->session[i] = scst_register_session(
+ scst_lcl_host->target,
+ TRUE, name, NULL, NULL);
+ if (!scst_lcl_host->session[i]) {
+ printk(KERN_WARNING "scst_register_session failed:\n");
+ error = -1;
+ goto unregister_target;
+ }
+ }
+
+ scst_lcl_host->dev.bus = &scst_fake_lld_bus;
+ scst_lcl_host->dev.parent = &scst_fake_primary;
+ scst_lcl_host->dev.release = &scst_local_release_adapter;
+ sprintf(scst_lcl_host->dev.bus_id, "scst_adp_%d", scst_local_add_host);
+
+ error = device_register(&scst_lcl_host->dev);
+ if (error)
+ goto unregister_session;
+
+ scst_local_add_host++; /* keep count of what we have added */
+
+ TRACE_EXIT();
+ return error;
+
+unregister_session:
+ for (i = 0; i < scst_local_num_tgts; i++) {
+ if (scst_lcl_host->session[i])
+ scst_unregister_session(scst_lcl_host->session[i],
+ TRUE, NULL);
+ }
+unregister_target:
+ scst_unregister(scst_lcl_host->target);
+cleanup:
+ kfree(scst_lcl_host);
+ TRACE_EXIT();
+ return error;
+}
+
+/*
+ * Remove an adapter ...
+ */
+static void scst_local_remove_adapter(void)
+{
+ struct scst_local_host_info *scst_lcl_host = NULL;
+
+ TRACE_ENTRY();
+
+ spin_lock(&scst_local_host_list_lock);
+ if (!list_empty(&scst_local_host_list)) {
+ scst_lcl_host = list_entry(scst_local_host_list.prev,
+ struct scst_local_host_info,
+ host_list);
+ list_del(&scst_lcl_host->host_list);
+ }
+ spin_unlock(&scst_local_host_list_lock);
+
+ if (!scst_lcl_host)
+ return;
+
+ device_unregister(&scst_lcl_host->dev);
+
+ --scst_local_add_host;
+
+ TRACE_EXIT();
+}
+
+static struct scsi_host_template scst_lcl_ini_driver_template = {
+ .proc_info = scst_local_proc_info,
+ .proc_name = scst_local_proc_name,
+ .name = SCST_LOCAL_NAME,
+ .info = scst_local_info,
+/* .ioctl = scst_local_ioctl, */
+ .queuecommand = scst_local_queuecommand,
+ .eh_abort_handler = scst_local_abort,
+ .eh_device_reset_handler = scst_local_device_reset,
+ .eh_target_reset_handler = scst_local_target_reset,
+ .can_queue = 256,
+ .this_id = SCST_LOCAL_MAX_TARGETS,
+ /* SCST doesn't support sg chaining */
+ .sg_tablesize = SCSI_MAX_SG_SEGMENTS,
+ .cmd_per_lun = 32,
+ .max_sectors = 0xffff,
+ /*
+ * There's no gain to merge requests on this level. If necessary,
+ * they will be merged at the backstorage level.
+ */
+ .use_clustering = DISABLE_CLUSTERING,
+ .skip_settle_delay = 1,
+ .module = THIS_MODULE,
+};
+
+static void scst_fake_0_release(struct device *dev)
+{
+ TRACE_ENTRY();
+
+ TRACE_EXIT();
+}
+
+static struct device scst_fake_primary = {
+ .bus_id = "scst_fake_0",
+ .release = scst_fake_0_release,
+};
+
+static int __init scst_local_init(void)
+{
+ int ret, k, adapters;
+
+ TRACE_ENTRY();
+
+#if defined(CONFIG_HIGHMEM4G) || defined(CONFIG_HIGHMEM64G)
+ PRINT_ERROR("%s", "HIGHMEM kernel configurations are not supported. "
+ "Consider changing VMSPLIT option or use a 64-bit "
+ "configuration instead. See SCST core README file for "
+ "details.");
+ ret = -EINVAL;
+ goto out;
+#endif
+
+ TRACE_DBG("Adapters: %d\n", scst_local_add_host);
+
+ if (scst_local_num_tgts > SCST_LOCAL_MAX_TARGETS)
+ scst_local_num_tgts = SCST_LOCAL_MAX_TARGETS;
+
+ /*
+ * Allocate a pool of structures for tgt_specific structures
+ */
+ tgt_specific_pool = kmem_cache_create("scst_tgt_specific",
+ sizeof(struct scst_local_tgt_specific),
+ 0, SCST_SLAB_FLAGS, NULL);
+
+ if (!tgt_specific_pool) {
+ printk(KERN_WARNING "%s: out of memory for "
+ "tgt_specific structs\n",
+ __func__);
+ return -ENOMEM;
+ }
+
+ ret = device_register(&scst_fake_primary);
+ if (ret < 0) {
+ printk(KERN_WARNING "%s: device_register error: %d\n",
+ __func__, ret);
+ goto destroy_kmem;
+ }
+ ret = bus_register(&scst_fake_lld_bus);
+ if (ret < 0) {
+ printk(KERN_WARNING "%s: bus_register error: %d\n",
+ __func__, ret);
+ goto dev_unreg;
+ }
+ ret = driver_register(&scst_local_driverfs_driver);
+ if (ret < 0) {
+ printk(KERN_WARNING "%s: driver_register error: %d\n",
+ __func__, ret);
+ goto bus_unreg;
+ }
+ ret = do_create_driverfs_files();
+ if (ret < 0) {
+ printk(KERN_WARNING "%s: create_files error: %d\n",
+ __func__, ret);
+ goto driver_unregister;
+ }
+
+
+ /*
+ * register the target driver and then create a host. This makes sure
+ * that we see any targets that are there. Gotta figure out how to
+ * tell the system that there are new targets when SCST creates them.
+ */
+
+ ret = scst_local_target_register();
+ if (ret < 0) {
+ printk(KERN_WARNING "%s: unable to register targ driver: %d\n",
+ __func__, ret);
+ goto del_files;
+ }
+
+ /*
+ * Add adapters ...
+ */
+ adapters = scst_local_add_host;
+ scst_local_add_host = 0;
+ for (k = 0; k < adapters; k++) {
+ if (scst_local_add_adapter()) {
+ printk(KERN_ERR "%s: "
+ "scst_local_add_adapter failed: %d\n",
+ __func__, k);
+ break;
+ }
+ }
+
+out:
+ TRACE_EXIT_RES(ret);
+ return ret;
+
+del_files:
+ do_remove_driverfs_files();
+driver_unregister:
+ driver_unregister(&scst_local_driverfs_driver);
+bus_unreg:
+ bus_unregister(&scst_fake_lld_bus);
+dev_unreg:
+ device_unregister(&scst_fake_primary);
+destroy_kmem:
+ kmem_cache_destroy(tgt_specific_pool);
+ goto out;
+}
+
+static void __exit scst_local_exit(void)
+{
+ int k = scst_local_add_host;
+
+ TRACE_ENTRY();
+
+ for (; k; k--) {
+ printk(KERN_INFO "removing adapter in %s\n", __func__);
+ scst_local_remove_adapter();
+ }
+ do_remove_driverfs_files();
+ driver_unregister(&scst_local_driverfs_driver);
+ bus_unregister(&scst_fake_lld_bus);
+ device_unregister(&scst_fake_primary);
+
+ /*
+ * Now unregister the target template
+ */
+ scst_unregister_target_template(&scst_local_targ_tmpl);
+
+ /*
+ * Free the pool we allocated
+ */
+ if (tgt_specific_pool)
+ kmem_cache_destroy(tgt_specific_pool);
+
+ TRACE_EXIT();
+}
+
+device_initcall(scst_local_init);
+module_exit(scst_local_exit);
+
+/*
+ * Fake LLD Bus and functions
+ */
+
+static int scst_fake_lld_driver_probe(struct device *dev)
+{
+ int ret = 0;
+ struct scst_local_host_info *scst_lcl_host;
+ struct Scsi_Host *hpnt;
+
+ TRACE_ENTRY();
+
+ scst_lcl_host = to_scst_lcl_host(dev);
+
+ hpnt = scsi_host_alloc(&scst_lcl_ini_driver_template,
+ sizeof(scst_lcl_host));
+ if (NULL == hpnt) {
+ printk(KERN_ERR "%s: scsi_host_alloc failed\n", __func__);
+ ret = -ENODEV;
+ return ret;
+ }
+
+ scst_lcl_host->shost = hpnt;
+
+ /*
+ * We are going to have to register with SCST here I think
+ * and fill in some of these from that info?
+ */
+
+ *((struct scst_local_host_info **)hpnt->hostdata) = scst_lcl_host;
+ if ((hpnt->this_id >= 0) && (scst_local_num_tgts > hpnt->this_id))
+ hpnt->max_id = scst_local_num_tgts + 1;
+ else
+ hpnt->max_id = scst_local_num_tgts;
+ hpnt->max_lun = scst_local_max_luns - 1;
+
+ ret = scsi_add_host(hpnt, &scst_lcl_host->dev);
+ if (ret) {
+ printk(KERN_ERR "%s: scsi_add_host failed\n", __func__);
+ ret = -ENODEV;
+ scsi_host_put(hpnt);
+ } else
+ scsi_scan_host(hpnt);
+
+ TRACE_EXIT_RES(ret);
+ return ret;
+}
+
+static int scst_fake_lld_driver_remove(struct device *dev)
+{
+ struct scst_local_host_info *scst_lcl_host;
+
+ TRACE_ENTRY();
+
+ scst_lcl_host = to_scst_lcl_host(dev);
+
+ if (!scst_lcl_host) {
+ printk(KERN_ERR "%s: Unable to locate host info\n",
+ __func__);
+ return -ENODEV;
+ }
+
+ scsi_remove_host(scst_lcl_host->shost);
+
+ scsi_host_put(scst_lcl_host->shost);
+
+ TRACE_EXIT();
+ return 0;
+}
+
+static int scst_fake_lld_bus_match(struct device *dev,
+ struct device_driver *dev_driver)
+{
+ TRACE_ENTRY();
+
+ TRACE_EXIT();
+ return 1;
+}
+
+static struct bus_type scst_fake_lld_bus = {
+ .name = "scst_fake_bus",
+ .match = scst_fake_lld_bus_match,
+ .probe = scst_fake_lld_driver_probe,
+ .remove = scst_fake_lld_driver_remove,
+};
+
+/*
+ * SCST Target driver from here ... there are some forward declarations
+ * above
+ */
+
+static int scst_local_targ_detect(struct scst_tgt_template *tgt_template)
+{
+ int adapter_count;
+
+ TRACE_ENTRY();
+
+ /*
+ * Register the adapter(s)
+ */
+
+ adapter_count = scst_local_add_host;
+
+ TRACE_EXIT_RES(adapter_count);
+ return adapter_count;
+}
+
+static int scst_local_targ_release(struct scst_tgt *tgt)
+{
+ TRACE_ENTRY();
+
+ TRACE_EXIT();
+ return 0;
+}
+
+static int scst_local_targ_xmit_response(struct scst_cmd *scst_cmd)
+{
+ struct scst_local_tgt_specific *tgt_specific;
+
+ TRACE_ENTRY();
+
+ if (unlikely(scst_cmd_aborted(scst_cmd))) {
+ scst_set_delivery_status(scst_cmd, SCST_CMD_DELIVERY_ABORTED);
+ scst_tgt_cmd_done(scst_cmd, SCST_CONTEXT_SAME);
+ printk(KERN_INFO "%s aborted command handled\n", __func__);
+ return SCST_TGT_RES_SUCCESS;
+ }
+
+ tgt_specific = scst_cmd_get_tgt_priv(scst_cmd);
+
+ /*
+ * This might have to change to use the two status flags
+ */
+ if (scst_cmd_get_is_send_status(scst_cmd)) {
+ (void)scst_local_send_resp(tgt_specific->cmnd, scst_cmd,
+ tgt_specific->done,
+ scst_cmd_get_status(scst_cmd));
+ }
+
+ /*
+ * Now tell SCST that the command is done ...
+ */
+ scst_tgt_cmd_done(scst_cmd, SCST_CONTEXT_SAME);
+
+ TRACE_EXIT();
+
+ return SCST_TGT_RES_SUCCESS;
+}
+
+static void scst_local_targ_on_free_cmd(struct scst_cmd *scst_cmd)
+{
+ struct scst_local_tgt_specific *tgt_specific;
+
+ TRACE_ENTRY();
+
+ tgt_specific = scst_cmd_get_tgt_priv(scst_cmd);
+ kmem_cache_free(tgt_specific_pool, tgt_specific);
+
+ TRACE_EXIT();
+ return;
+}
+
+static void scst_local_targ_task_mgmt_done(struct scst_mgmt_cmd *mgmt_cmd)
+{
+ struct completion *tgt_specific;
+
+ TRACE_ENTRY();
+
+ tgt_specific = (struct completion *)
+ scst_mgmt_cmd_get_tgt_priv(mgmt_cmd);
+
+ if (tgt_specific)
+ complete(tgt_specific);
+
+ TRACE_EXIT();
+ return;
+}
+
+static struct scst_tgt_template scst_local_targ_tmpl = {
+ .name = "scst_local_tgt",
+ .xmit_response_atomic = 1,
+ .detect = scst_local_targ_detect,
+ .release = scst_local_targ_release,
+ .xmit_response = scst_local_targ_xmit_response,
+ .on_free_cmd = scst_local_targ_on_free_cmd,
+ .task_mgmt_fn_done = scst_local_targ_task_mgmt_done,
+};
+
+/*
+ * Register the target driver ... to get things going
+ */
+static int scst_local_target_register(void)
+{
+ int ret;
+
+ TRACE_ENTRY();
+
+ ret = scst_register_target_template(&scst_local_targ_tmpl);
+ if (ret < 0) {
+ printk(KERN_WARNING "scst_register_target_template "
+ "failed: %d\n",
+ ret);
+ goto error;
+ }
+
+ TRACE_EXIT();
+ return 0;
+
+error:
+ TRACE_EXIT_RES(ret);
+ return ret;
+}
+
This patch contains documentation for the scst_local driver.
Signed-off-by: Richard Sharpe <[email protected]>
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
Documentation/scst/README.scst_local | 95 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 95 insertions(+)
diff -uprN orig/linux-2.6.27/Documentation/scst/README.scst_local linux-2.6.27/Documentation/scst/README.scst_local
--- orig/linux-2.6.27/Documentation/scst/README.scst_local
+++ linux-2.6.27/Documentation/scst/README.scst_local
@@ -0,0 +1,95 @@
+SCST Debug ...
+Richard Sharpe, 30-Nov-2008
+
+This is the SCST Local driver. Its function is to allow you to access devices
+that are exported via SCST directly on the same Linux system that they are
+exported from.
+
+No assumptions are made in the code about the device types on the target, so
+any device handlers that you load in SCST should be visible, including tapes
+and so forth.
+
+To build, simply issue 'make' in the scst-local directory.
+
+Try 'modinfo scst_local' for a listing of module parameters so far.
+
+NOTE! This is now part of scst and supports being built in the scst tree.
+
+Here is how I have used it so far:
+
+1. Load up scst:
+
+ modprobe scst
+ modprobe scst_vdisk
+
+2. Create a virtual disk (or your own device handler):
+
+ dd if=/dev/zero of=/some/path/vdisk1.img bs=16384 count=1000000
+# dd if=/dev/zero of=/some/path/vdisk2.img bs=16384 count=1000000
+ echo "open vm_disk1 /some/path/vdisk1.img" > /proc/scsi_tgt/vdisk/vdisk
+# echo "open vm_disk2 /some/path/vdisk2.img" > /proc/scsi_tgt/vdisk/vdisk
+ echo "add vm_disk1 0" > /proc/scsi_tgt/groups/Default/devices
+# echo "add vm_disk2 1" > /proc/scsi_tgt/groups/Default/devices
+
+3. Load the scst_local driver:
+
+ modprobe scst_local
+# modprobe scst_local max_luns=8
+
+4. Check what you have
+
+ cat /proc/scsi/scsi
+ Attached devices:
+ Host: scsi0 Channel: 00 Id: 00 Lun: 00
+ Vendor: ATA Model: ST9320320AS Rev: 0303
+ Type: Direct-Access ANSI SCSI revision: 05
+ Host: scsi4 Channel: 00 Id: 00 Lun: 00
+ Vendor: TSSTcorp Model: CD/DVDW TS-L632D Rev: TO04
+ Type: CD-ROM ANSI SCSI revision: 05
+ Host: scsi7 Channel: 00 Id: 00 Lun: 00
+ Vendor: SCST_FIO Model: vm_disk1 Rev: 101
+ Type: Direct-Access ANSI SCSI revision: 04
+
+5. Have fun.
+
+Some of this was coded while in Santa Clara, some in Bangalore, and some in
+Hyderabad. No doubt some will be coded on the way back to Santa Clara.
+
+The code still has bugs, so if you encounter any, email me the fixes at:
+
+ [email protected]
+
+I am thinking of renaming this to something more interesting.
+
+6. Change log
+
+V0.1 24-Sep-2008 (Hyderabad) Initial coding, pretty chatty and messy,
+ but worked.
+
+V0.2 25-Sep-2008 (Hong Kong) Cleaned up the code a lot, reduced the log
+ chatter, fixed a bug where multiple LUNs did not
+ work. Also, added logging control. Tested with
+ five virtual disks. They all came up as /dev/sdb
+ through /dev/sdf and I could dd to them. Also
+ fixed a bug preventing multiple adapters.
+
+V0.3 26-Sep-2008 (Santa Clara) Added back a copyright plus cleaned up some
+ unused functions and structures.
+
+V0.4 5-Oct-2008 (Santa Clara) Changed name to scst_local as suggested, cleaned
+ up some unused variables (made them used) and
+ changed allocation to a kmem_cache pool.
+
+V0.5 5-Oct-2008 (Santa Clara) Added mgmt commands to handle dev reset and
+ aborts. Not sure if aborts work. Also corrected
+ the version info and renamed readme to README.
+
+V0.6 7-Oct-2008 (Santa Clara) Removed some redundant code and made some
+ changes suggested by Vladislav.
+
+V0.7 11-Oct-2008 (Santa Clara) Moved into the scst tree. Cleaned up some
+ unused functions, used TRACE macros etc.
+
+V0.9 30-Nov-2008 (Mtn View) Cleaned up an additional problem with symbols not
+ being defined in older versions of the kernel. Also
+ fixed some English and cleaned up this doc.
This patch contains the iSCSI-SCST target driver. This driver is a heavily
modified fork, made with all due respect, of IET
(http://iscsitarget.sourceforge.net). The modifications aim to make the
code clearer, more reviewable and more maintainable, as well as to fix many
problems and make many improvements. See
http://scst.sourceforge.net/target_iscsi.html for more details.
It has a split user/kernel space architecture: all management, session
creation, parameter negotiation, etc. are done in user space, while data
is transferred in kernel space. This architecture for iSCSI processing
has been acknowledged many times as the right one. In particular, the
in-kernel iSCSI initiator (open-iscsi) has the same architecture.
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
drivers/scst/iscsi-scst/Kconfig | 25
drivers/scst/iscsi-scst/Makefile | 6
drivers/scst/iscsi-scst/config.c | 597 +++++++
drivers/scst/iscsi-scst/conn.c | 488 +++++
drivers/scst/iscsi-scst/digest.c | 221 ++
drivers/scst/iscsi-scst/digest.h | 31
drivers/scst/iscsi-scst/event.c | 116 +
drivers/scst/iscsi-scst/iscsi.c | 3066 ++++++++++++++++++++++++++++++++++++
drivers/scst/iscsi-scst/iscsi.h | 577 ++++++
drivers/scst/iscsi-scst/iscsi_dbg.h | 73
drivers/scst/iscsi-scst/iscsi_hdr.h | 517 ++++++
drivers/scst/iscsi-scst/nthread.c | 1522 +++++++++++++++++
drivers/scst/iscsi-scst/param.c | 263 +++
drivers/scst/iscsi-scst/session.c | 210 ++
drivers/scst/iscsi-scst/target.c | 306 +++
include/scst/iscsi_scst.h | 156 +
include/scst/iscsi_scst_ver.h | 16
17 files changed, 8190 insertions(+)
The patch is too big to be submitted inline. You can find it at
http://scst.sourceforge.net/patches/iscsi-scst.diff
This patch contains documentation for iSCSI-SCST.
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
Documentation/scst/README.iscsi | 127 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 127 insertions(+)
diff -uprN orig/linux-2.6.27/Documentation/scst/README.iscsi linux-2.6.27/Documentation/scst/README.iscsi
--- orig/linux-2.6.27/Documentation/scst/README.iscsi
+++ linux-2.6.27/Documentation/scst/README.iscsi
@@ -0,0 +1,127 @@
+iSCSI SCST target driver
+========================
+
+Version 1.0.1/0.4.16r155, XX XXXX 2008
+--------------------------------------
+
+This driver is a fork, made with all due respect, of iSCSI Enterprise
+Target (IET) (http://iscsitarget.sourceforge.net/) with updates to work
+over SCST as well as with many improvements and bugfixes (see ChangeLog
+file). The reason for the fork is that the necessary changes are intrusive
+and, under the current IET merge policy, where only simple bugfix-like
+patches that don't touch the core code can be merged, it is very
+unlikely that they would be merged into the main IET trunk.
+
+To allow it to be installed and to work on the same host simultaneously
+with IET, all the driver's modules and files were renamed:
+
+ * ietd.conf -> iscsi-scstd.conf
+ * ietadm -> iscsi-scst-adm
+ * ietd -> iscsi-scstd
+ * iscsi-target -> iscsi-scst
+ * iscsi-target.ko -> iscsi-scst.ko
+
+To use the full power of the TCP zero-copy transmit functions, especially
+when dealing with user space memory supplied via the scst_user module,
+iSCSI-SCST needs to be notified when Linux networking has finished data
+transmission. For that you should enable the
+CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION kernel config option.
+This is highly recommended, but not required. Basically, you should consider
+this option an optimization that IET doesn't have, so if you don't use it,
+you will just revert to the original IET behavior, where for data transmission:
+
+ - For in-kernel allocated memory (scst_vdisk and pass-through
+ handlers), usage of the SGV cache on the transmit path (READ-type
+ commands) will be disabled. The performance hit will not be big, and
+ performance will still remain better than for IET, because the SGV cache
+ remains in use on the receive path, while IET has no such feature.
+
+ - For user space allocated memory (scst_user handler), all transmitted
+ data will be additionally copied into temporary TCP buffers. The
+ performance hit will be quite noticeable.
+
+Note that if your network hardware does not support TX offload
+functions or has them disabled, then the TCP zero-copy transmit functions
+will not be used by Linux networking on your system in any case, so the
+put_page_callback patch will not be able to improve performance for you.
+You can check your network hardware offload capabilities with the command
+"ethtool -k ethX", where X is the network device number. At least
+"tx-checksumming" and "scatter-gather" should be enabled.
+
+Usage
+-----
+
+iSCSI parameters like iSNS, CHAP and target parameters are configured in
+iscsi-scstd.conf. All LUN information is configured using the regular
+SCST interface. It is highly recommended to use the scstadmin utility for
+that purpose. The LUN information in iscsi-scstd.conf will be ignored.
+This is because responsibilities are now divided between the target
+driver (iSCSI-SCST) and the SCST core, as they logically should be:
+the target driver is responsible for handling targets and their
+parameters, and the SCST core is responsible for handling backstorage.
+
+If you need to configure different LUs for different targets, you should
+create for each target a group "Default_target_name", where "target_name"
+is the name of the target, for example:
+"Default_iqn.2007-05.com.example:storage.disk1.sys1.xyz", and add all
+the necessary LUNs to it. Check the SCST README file for details.
+
+Check the SCST README file for how to tune for the best performance.
+
+If under high load you experience I/O stalls or see abort or reset
+messages in the kernel log, then try reducing the QueuedCommands parameter in
+the iscsi-scstd.conf file for the corresponding target to some lower value,
+like 8 (default is 32). See also the SCST README file for more details about
+that issue.
+
+CAUTION: Running the target and the initiator on the same host isn't
+======== supported. See the SCST README file for details.
+
+
+Performance advice
+------------------
+
+1. If you use Windows XP or Windows 2003+ as initiators, you should
+consider decreasing the TcpAckFrequency parameter to 1. See
+http://support.microsoft.com/kb/328890/ or google for "TcpAckFrequency"
+for more details.
+
+
+Compilation options
+-------------------
+
+The following compilation options can be commented in or out in the
+kernel module's Makefile:
+
+ - CONFIG_SCST_DEBUG - turns on some debugging code, including some logging.
+ Makes the driver considerably bigger and slower, producing a large amount
+ of log data.
+
+ - CONFIG_SCST_TRACING - turns on the ability to log events. Makes the driver
+ considerably bigger and leads to some performance loss.
+
+ - CONFIG_SCST_EXTRACHECKS - adds extra validity checks in various places.
+
+ - CONFIG_SCST_ISCSI_DEBUG_DIGEST_FAILURES - simulates digest failures in
+ random places.
+
+
+Credits
+-------
+
+Thanks to:
+
+ * IET developers for IET
+
+ * Ming Zhang <[email protected]> for fixes
+
+ * Krzysztof Blaszkowski <[email protected]> for many fixes
+
+ * Alexey Kuznetsov <[email protected]> for comments and help in
+ debugging
+
+ * Tomasz Chmielewski <[email protected]> for testing and suggestions
+
+ * Bart Van Assche <[email protected]> for a lot of help
+
+Vladislav Bolkhovitin <[email protected]>, http://scst.sourceforge.net
This patch implements support for zero-copy TCP transmit of user space
data. It is necessary in the iSCSI-SCST target driver for transmitting data
from user space buffers supplied by user space backend handlers. In
this case the SCST core needs to know when TCP has finished transmitting the
data, so that the corresponding buffers can be reused or freed. Without this
patch that isn't possible, so iSCSI-SCST has to fall back to the function
sock_sendpage(), which copies the data into TCP send buffers. iSCSI-SCST
also works without this patch, but this patch gives a nice performance
improvement.
In the chosen approach a new optional field, void *net_priv, was added to
struct page. It is enclosed by
#if defined(CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION),
so if one doesn't need this functionality, net_priv won't consume space
in struct page.
Then, 2 new global callbacks, net_get_page_callback and
net_put_page_callback, together with 2 new inline functions,
net_get_page() and net_put_page(), were added. If
CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION is not defined,
net_get_page() and net_put_page() effectively become get_page() and
put_page() respectively.
Those functions call the corresponding net_get_page_callback or
net_put_page_callback if the page's net_priv field is set, then do
get_page() or put_page().
Then all get_page() calls in the net/ subdirectory were replaced by
net_get_page(), and all put_page() calls by net_put_page().
How it works: iSCSI-SCST assigns net_get_page_callback and
net_put_page_callback to its internal functions. Before each page is
passed to TCP's sendpage, its net_priv field is set to point to the
corresponding iSCSI command. Then each net_get_page_callback call
increases a reference counter for that command and each
net_put_page_callback call decreases it. When the counter reaches zero,
all the data for this command has been transferred, so the command and
its buffer can be freed.
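The reference counting described above can be modelled in plain user-space C. What follows is only an illustrative sketch under stated assumptions, not code from this patch or from iSCSI-SCST: struct model_page, struct model_cmd and every model_* function are hypothetical stand-ins for struct page, the iSCSI command and the net_get_page()/net_put_page() helpers, and the buffer_freed flag merely marks the point where a real driver could release the command.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an iSCSI command that owns a data buffer. */
struct model_cmd {
	int net_ref_cnt;	/* references held by the modelled network stack */
	int buffer_freed;	/* set once the buffer may be reused or freed */
};

/* Hypothetical stand-in for struct page with the new net_priv field. */
struct model_page {
	int count;		/* ordinary page reference count */
	void *net_priv;		/* owning command, or NULL for ordinary pages */
};

typedef void (*model_page_cb_t)(struct model_page *page);

static model_page_cb_t model_get_cb;
static model_page_cb_t model_put_cb;

/* Callback run on every get: one more reference to the command's data. */
static void model_cmd_get(struct model_page *page)
{
	struct model_cmd *cmd = page->net_priv;

	cmd->net_ref_cnt++;
}

/* Callback run on every put: the last put marks the buffer as free. */
static void model_cmd_put(struct model_page *page)
{
	struct model_cmd *cmd = page->net_priv;

	assert(cmd->net_ref_cnt > 0);
	if (--cmd->net_ref_cnt == 0)
		cmd->buffer_freed = 1;
}

/* Same shape as net_get_page(): callback only for tagged pages. */
static void model_net_get_page(struct model_page *page)
{
	if (page->net_priv != NULL)
		model_get_cb(page);
	page->count++;
}

/* Same shape as net_put_page(): callback only for tagged pages. */
static void model_net_put_page(struct model_page *page)
{
	if (page->net_priv != NULL)
		model_put_cb(page);
	page->count--;
}

/*
 * Model a transmit in which the stack takes nr_refs (>= 1) references to
 * one page and later drops them all. Returns 1 if the buffer was marked
 * free only when the last reference was dropped, 0 otherwise.
 */
static int model_transmit(int nr_refs)
{
	struct model_cmd cmd = { 0, 0 };
	struct model_page page = { 1, &cmd };
	int i;

	model_get_cb = model_cmd_get;
	model_put_cb = model_cmd_put;

	for (i = 0; i < nr_refs; i++)
		model_net_get_page(&page);
	for (i = 0; i < nr_refs - 1; i++)
		model_net_put_page(&page);
	if (cmd.buffer_freed)
		return 0;	/* released while still referenced */
	model_net_put_page(&page);
	return cmd.buffer_freed && page.count == 1;
}
```

Calling model_transmit(20) mimics TCP referencing one page 20 times; the command is marked free only on the final put, which is the property the real callbacks rely on.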
You can see how it is used in the iSCSI-SCST patch (number 21 in this series).
Global callbacks were chosen because this is the simplest and most
performance-effective approach, fully following section 2, subsection 4
of the SubmittingPatches file: "Don't over-design". If accepted, iSCSI-SCST
will be the only user of this functionality. The requirements for calling
net_set_get_put_page_callbacks() (see the comment in the patch) make it
unnecessary to protect those callbacks in any way. If in the future there is
another user of this functionality, it will be possible to convert those
callbacks to an RCU-protected list of callbacks. But for now there's no
need to overcomplicate the code.
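The single-user registration rule can be illustrated with a small user-space model. This is a sketch only: reg_set_callbacks() mirrors the checks of net_set_get_put_page_callbacks() as posted in this patch, while struct reg_page, the user_a_*/user_b_* callbacks and reg_demo() are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct reg_page;	/* opaque: only pointers are passed around */
typedef void (*reg_page_cb_t)(struct reg_page *page);

static reg_page_cb_t reg_get_callback;
static reg_page_cb_t reg_put_callback;

/*
 * A second user with different callbacks is refused with -EBUSY.
 * Re-registering the same callbacks, or clearing them with NULL,
 * always succeeds.
 */
static int reg_set_callbacks(reg_page_cb_t get_cb, reg_page_cb_t put_cb)
{
	if (reg_get_callback != NULL && get_cb != NULL &&
	    reg_get_callback != get_cb)
		return -EBUSY;
	if (reg_put_callback != NULL && put_cb != NULL &&
	    reg_put_callback != put_cb)
		return -EBUSY;

	reg_get_callback = get_cb;
	reg_put_callback = put_cb;
	return 0;
}

/* Two hypothetical users; the counters only keep the functions distinct. */
static int user_a_refs, user_b_refs;

static void user_a_get(struct reg_page *page) { (void)page; user_a_refs++; }
static void user_a_put(struct reg_page *page) { (void)page; user_a_refs--; }
static void user_b_get(struct reg_page *page) { (void)page; user_b_refs++; }
static void user_b_put(struct reg_page *page) { (void)page; user_b_refs--; }

/* Returns 1 if every step behaves as the rule above predicts. */
static int reg_demo(void)
{
	if (reg_set_callbacks(NULL, NULL) != 0)			/* start clean */
		return 0;
	if (reg_set_callbacks(user_a_get, user_a_put) != 0)	/* first user */
		return 0;
	if (reg_set_callbacks(user_a_get, user_a_put) != 0)	/* same user again */
		return 0;
	if (reg_set_callbacks(user_b_get, user_b_put) != -EBUSY) /* second user */
		return 0;
	if (reg_set_callbacks(NULL, NULL) != 0)			/* first user leaves */
		return 0;
	return 1;
}
```

In this model a driver would register its callbacks at module load, before any page carries a non-NULL net_priv, and clear them with NULL arguments at unload.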
During development the following approaches were also examined and rejected:
1. Add a net_priv analog to struct sk_buff, not to struct page. But then
all the pages in each skb would have to be from the same originator,
i.e. have the same net_priv. It is impractical to change all the
operations on skb's to forbid merging them if they have different
net_priv values. I tried, but quickly gave up. There are too many such
places in far from obvious pieces of code.
2. Have in iSCSI-SCST a hashed list to translate a page to its iSCSI cmd
by a simple search function. This approach was rejected because a modern
CPU using MMX needs about 1500 ticks to copy a page. It was observed
that each page can be referenced by TCP during transmit about 20 times
or even more. So, if each search needs, say, 20 ticks, the overall
search time will be 20*20*2 (for get() and put()) = 800 ticks. So this
approach would be considerably worse performance-wise than the chosen
approach and would provide little benefit over simply copying the page.
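The estimate for the rejected hashed-list approach can be written out as a small calculation. The constants below are the rough figures quoted in the text above, not measurements, and the function names are hypothetical.

```c
#include <assert.h>

/* Rough figures quoted in the description above; not measured here. */
enum {
	PAGE_COPY_TICKS  = 1500,	/* copying one page using MMX */
	REFS_PER_PAGE    = 20,		/* TCP references per page during transmit */
	TICKS_PER_SEARCH = 20,		/* one hashed-list lookup */
};

/* Each reference costs one search on get() and one on put(). */
static int lookup_overhead_ticks(int refs_per_page, int ticks_per_search)
{
	assert(refs_per_page >= 0 && ticks_per_search >= 0);
	return refs_per_page * ticks_per_search * 2;
}

/* Lookup overhead as a whole-number percentage of the page copy cost. */
static int lookup_overhead_percent_of_copy(void)
{
	return lookup_overhead_ticks(REFS_PER_PAGE, TICKS_PER_SEARCH) * 100 /
	       PAGE_COPY_TICKS;
}
```

With these inputs the lookups alone cost 800 ticks per page, a little over half the cost of copying the page outright, which is why the approach saves too little to be worthwhile.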
Please, if you reject this approach, advise any other way to implement
the required functionality.
Signed-off-by: Vladislav Bolkhovitin <[email protected]>
---
include/linux/mm_types.h | 12 +++++++++++
include/linux/net.h | 40 ++++++++++++++++++++++++++++++++++++++
net/Kconfig | 12 +++++++++++
net/core/skbuff.c | 14 ++++++-------
net/ipv4/Makefile | 1
net/ipv4/ip_output.c | 4 +--
net/ipv4/tcp.c | 8 +++----
net/ipv4/tcp_output.c | 2 -
net/ipv4/tcp_zero_copy.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++
net/ipv6/ip6_output.c | 2 -
10 files changed, 129 insertions(+), 15 deletions(-)
diff -upr linux-2.6.26/include/linux/mm_types.h linux-2.6.26/include/linux/mm_types.h
--- linux-2.6.26/include/linux/mm_types.h 2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/include/linux/mm_types.h 2008-07-22 20:30:21.000000000 +0400
@@ -92,6 +92,18 @@ struct page {
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
+
+#if defined(CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION)
+ /*
+ * Used to implement support for notification on zero-copy TCP transfer
+ * completion. It might not look good to have this field here, and it
+ * might seem better to have it in struct sk_buff, but that would make the
+ * code much more complicated and fragile, since every skb would then have
+ * to contain only pages with the same value in this field.
+ */
+ void *net_priv;
+#endif
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
unsigned long page_cgroup;
#endif
diff -upr linux-2.6.26/include/linux/net.h linux-2.6.26/include/linux/net.h
--- linux-2.6.26/include/linux/net.h 2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/include/linux/net.h 2008-07-29 20:48:07.000000000 +0400
@@ -57,6 +57,7 @@ typedef enum {
#include <linux/random.h>
#include <linux/wait.h>
#include <linux/fcntl.h> /* For O_CLOEXEC and O_NONBLOCK */
+#include <linux/mm.h>
struct poll_table_struct;
struct pipe_inode_info;
@@ -354,5 +354,44 @@ extern int net_msg_cost;
extern struct ratelimit_state net_ratelimit_state;
#endif
+#if defined(CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION)
+/* Support for notification on zero-copy TCP transfer completion */
+typedef void (*net_get_page_callback_t)(struct page *page);
+typedef void (*net_put_page_callback_t)(struct page *page);
+
+extern net_get_page_callback_t net_get_page_callback;
+extern net_put_page_callback_t net_put_page_callback;
+
+extern int net_set_get_put_page_callbacks(
+ net_get_page_callback_t get_callback,
+ net_put_page_callback_t put_callback);
+
+/*
+ * See the comment for net_set_get_put_page_callbacks() for why these
+ * functions don't need any protection.
+ */
+static inline void net_get_page(struct page *page)
+{
+ if (page->net_priv != NULL)
+ net_get_page_callback(page);
+ get_page(page);
+}
+static inline void net_put_page(struct page *page)
+{
+ if (page->net_priv != NULL)
+ net_put_page_callback(page);
+ put_page(page);
+}
+#else
+static inline void net_get_page(struct page *page)
+{
+ get_page(page);
+}
+static inline void net_put_page(struct page *page)
+{
+ put_page(page);
+}
+#endif /* CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION */
+
#endif /* __KERNEL__ */
#endif /* _LINUX_NET_H */
diff -upr linux-2.6.26/net/core/skbuff.c linux-2.6.26/net/core/skbuff.c
--- linux-2.6.26/net/core/skbuff.c 2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/net/core/skbuff.c 2008-07-22 20:28:41.000000000 +0400
@@ -319,7 +319,7 @@ static void skb_release_data(struct sk_b
if (skb_shinfo(skb)->nr_frags) {
int i;
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
- put_page(skb_shinfo(skb)->frags[i].page);
+ net_put_page(skb_shinfo(skb)->frags[i].page);
}
if (skb_shinfo(skb)->frag_list)
@@ -658,7 +658,7 @@ struct sk_buff *pskb_copy(struct sk_buff
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
- get_page(skb_shinfo(n)->frags[i].page);
+ net_get_page(skb_shinfo(n)->frags[i].page);
}
skb_shinfo(n)->nr_frags = i;
}
@@ -721,7 +721,7 @@ int pskb_expand_head(struct sk_buff *skb
sizeof(struct skb_shared_info));
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
- get_page(skb_shinfo(skb)->frags[i].page);
+ net_get_page(skb_shinfo(skb)->frags[i].page);
if (skb_shinfo(skb)->frag_list)
skb_clone_fraglist(skb);
@@ -990,7 +990,7 @@ drop_pages:
skb_shinfo(skb)->nr_frags = i;
for (; i < nfrags; i++)
- put_page(skb_shinfo(skb)->frags[i].page);
+ net_put_page(skb_shinfo(skb)->frags[i].page);
if (skb_shinfo(skb)->frag_list)
skb_drop_fraglist(skb);
@@ -1159,7 +1159,7 @@ pull_pages:
k = 0;
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
if (skb_shinfo(skb)->frags[i].size <= eat) {
- put_page(skb_shinfo(skb)->frags[i].page);
+ net_put_page(skb_shinfo(skb)->frags[i].page);
eat -= skb_shinfo(skb)->frags[i].size;
} else {
skb_shinfo(skb)->frags[k] = skb_shinfo(skb)->frags[i];
@@ -1916,7 +1916,7 @@ static inline void skb_split_no_header(s
* where splitting is expensive.
* 2. Split is accurately. We make this.
*/
- get_page(skb_shinfo(skb)->frags[i].page);
+ net_get_page(skb_shinfo(skb)->frags[i].page);
skb_shinfo(skb1)->frags[0].page_offset += len - pos;
skb_shinfo(skb1)->frags[0].size -= len - pos;
skb_shinfo(skb)->frags[i].size = len - pos;
@@ -2284,7 +2284,7 @@ struct sk_buff *skb_segment(struct sk_bu
BUG_ON(i >= nfrags);
*frag = skb_shinfo(skb)->frags[i];
- get_page(frag->page);
+ net_get_page(frag->page);
size = frag->size;
if (pos < offset) {
diff -upr linux-2.6.26/net/ipv4/ip_output.c linux-2.6.26/net/ipv4/ip_output.c
--- linux-2.6.26/net/ipv4/ip_output.c 2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/net/ipv4/ip_output.c 2008-07-22 20:28:41.000000000 +0400
@@ -1007,7 +1007,7 @@ alloc_new_skb:
err = -EMSGSIZE;
goto error;
}
- get_page(page);
+ net_get_page(page);
skb_fill_page_desc(skb, i, page, sk->sk_sndmsg_off, 0);
frag = &skb_shinfo(skb)->frags[i];
}
@@ -1165,7 +1165,7 @@ ssize_t ip_append_page(struct sock *sk,
if (skb_can_coalesce(skb, i, page, offset)) {
skb_shinfo(skb)->frags[i-1].size += len;
} else if (i < MAX_SKB_FRAGS) {
- get_page(page);
+ net_get_page(page);
skb_fill_page_desc(skb, i, page, offset, len);
} else {
err = -EMSGSIZE;
diff -upr linux-2.6.26/net/ipv4/Makefile linux-2.6.26/net/ipv4/Makefile
--- linux-2.6.26/net/ipv4/Makefile 2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/net/ipv4/Makefile 2008-07-22 20:35:05.000000000 +0400
@@ -50,6 +50,7 @@ obj-$(CONFIG_TCP_CONG_LP) += tcp_lp.o
obj-$(CONFIG_TCP_CONG_YEAH) += tcp_yeah.o
obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o
obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
+obj-$(CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION) += tcp_zero_copy.o
obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
xfrm4_output.o
diff -upr linux-2.6.26/net/ipv4/tcp.c linux-2.6.26/net/ipv4/tcp.c
--- linux-2.6.26/net/ipv4/tcp.c 2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/net/ipv4/tcp.c 2008-07-22 20:28:41.000000000 +0400
@@ -712,7 +712,7 @@ new_segment:
if (can_coalesce) {
skb_shinfo(skb)->frags[i - 1].size += copy;
} else {
- get_page(page);
+ net_get_page(page);
skb_fill_page_desc(skb, i, page, offset, copy);
}
@@ -917,7 +917,7 @@ new_segment:
goto new_segment;
} else if (page) {
if (off == PAGE_SIZE) {
- put_page(page);
+ net_put_page(page);
TCP_PAGE(sk) = page = NULL;
off = 0;
}
@@ -958,9 +958,9 @@ new_segment:
} else {
skb_fill_page_desc(skb, i, page, off, copy);
if (TCP_PAGE(sk)) {
- get_page(page);
+ net_get_page(page);
} else if (off + copy < PAGE_SIZE) {
- get_page(page);
+ net_get_page(page);
TCP_PAGE(sk) = page;
}
}
diff -upr linux-2.6.26/net/ipv4/tcp_output.c linux-2.6.26/net/ipv4/tcp_output.c
--- linux-2.6.26/net/ipv4/tcp_output.c 2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/net/ipv4/tcp_output.c 2008-07-22 20:28:41.000000000 +0400
@@ -854,7 +854,7 @@ static void __pskb_trim_head(struct sk_b
k = 0;
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
if (skb_shinfo(skb)->frags[i].size <= eat) {
- put_page(skb_shinfo(skb)->frags[i].page);
+ net_put_page(skb_shinfo(skb)->frags[i].page);
eat -= skb_shinfo(skb)->frags[i].size;
} else {
skb_shinfo(skb)->frags[k] = skb_shinfo(skb)->frags[i];
diff -upr linux-2.6.26/net/ipv4/tcp_zero_copy.c linux-2.6.26/net/ipv4/tcp_zero_copy.c
--- linux-2.6.26/net/ipv4/tcp_zero_copy.c 2008-07-22 20:12:35.000000000 +0400
+++ linux-2.6.26/net/ipv4/tcp_zero_copy.c 2008-07-31 21:21:13.000000000 +0400
@@ -0,0 +1,49 @@
+/*
+ * Support routines for TCP zero copy transmit
+ *
+ * Created by Vladislav Bolkhovitin
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ */
+
+#include <linux/skbuff.h>
+
+net_get_page_callback_t net_get_page_callback __read_mostly;
+EXPORT_SYMBOL(net_get_page_callback);
+
+net_put_page_callback_t net_put_page_callback __read_mostly;
+EXPORT_SYMBOL(net_put_page_callback);
+
+/*
+ * Caller of this function must ensure that at the moment when it's called
+ * there are no pages in the system with net_priv field set to non-zero
+ * value. Hence, this function, as well as net_get_page() and net_put_page(),
+ * don't need any protection.
+ */
+int net_set_get_put_page_callbacks(
+ net_get_page_callback_t get_callback,
+ net_put_page_callback_t put_callback)
+{
+ int res = 0;
+
+ if ((net_get_page_callback != NULL) && (get_callback != NULL) &&
+ (net_get_page_callback != get_callback)) {
+ res = -EBUSY;
+ goto out;
+ }
+
+ if ((net_put_page_callback != NULL) && (put_callback != NULL) &&
+ (net_put_page_callback != put_callback)) {
+ res = -EBUSY;
+ goto out;
+ }
+
+ net_get_page_callback = get_callback;
+ net_put_page_callback = put_callback;
+
+out:
+ return res;
+}
+EXPORT_SYMBOL(net_set_get_put_page_callbacks);
diff -upr linux-2.6.26/net/ipv6/ip6_output.c linux-2.6.26/net/ipv6/ip6_output.c
--- linux-2.6.26/net/ipv6/ip6_output.c 2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/net/ipv6/ip6_output.c 2008-07-22 20:28:41.000000000 +0400
@@ -1349,7 +1349,7 @@ alloc_new_skb:
err = -EMSGSIZE;
goto error;
}
- get_page(page);
+ net_get_page(page);
skb_fill_page_desc(skb, i, page, sk->sk_sndmsg_off, 0);
frag = &skb_shinfo(skb)->frags[i];
}
diff -upr linux-2.6.26/net/Kconfig linux-2.6.26/net/Kconfig
--- linux-2.6.26/net/Kconfig 2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/net/Kconfig 2008-07-29 21:15:39.000000000 +0400
@@ -59,6 +59,18 @@ config INET
Short answer: say Y.
+config TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION
+ bool "TCP/IP zero-copy transfer completion notification"
+ depends on INET
+ default SCST_ISCSI
+ ---help---
+ Adds support for sending a notification upon completion of a
+ zero-copy TCP/IP transfer. This can speed up certain TCP/IP
+ software. Currently this is only used by the iSCSI target driver
+ iSCSI-SCST.
+
+ If unsure, say N.
+
if INET
source "net/ipv4/Kconfig"
source "net/ipv6/Kconfig"
On Wed, Dec 10, 2008 at 09:30:37PM +0300, Vladislav Bolkhovitin wrote:
> This patch contains SCST core code.
>
> Signed-off-by: Vladislav Bolkhovitin <[email protected]>
> ---
> drivers/scst/Kconfig | 256 ++
Do not use "default n" - this is already the default value.
> drivers/scst/Makefile | 12
EXTRA_CFLAGS are deprecated in favour of ccflags-y
> drivers/scst/scst_cdbprobe.h | 519 +++++
static const struct scst_sdbops scst_scsi_op_table[]
This does not belong to a header file.
> drivers/scst/scst_lib.c | 3689 +++++++++++++++++++++++++++++++++++++
> drivers/scst/scst_main.c | 1919 +++++++++++++++++++
> drivers/scst/scst_module.c | 69
> drivers/scst/scst_priv.h | 513 +++++
> drivers/scst/scst_targ.c | 5458 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 8 files changed, 12435 insertions(+)
There was a lot of TRACE_ENTRY() / TRACE_EXIT() noise.
We should have proper tools for that by now (I hope).
We often ask for exported symbols to be documented - so one has
a slight idea of their purpose.
CONFIG_SCST_STRICT_SERIALIZING has bad impact on readability.
Could this be abstracted better?
I did not look into the source in detail. Just a few sparse comments.
Sam
Hi Vladislav.
On Wed, Dec 10, 2008 at 10:04:36PM +0300, Vladislav Bolkhovitin ([email protected]) wrote:
> In the chosen approach new optional field void *net_priv was added to
> struct page. It is enclosed by
There is a huge no-no in networking land on increasing skb.
The reason is simple: every skb will carry potentially unneeded data as long
as the given option is enabled, and most of the time it will be.
To break this barrier one has to have (I wanted to write ego, but then
decided to replace it with mojo) so huge a reason to do this that it is
almost impossible to have.
Something tells me that increasing the page structure by 8 bytes because
of zero-copy iscsi transfer is not that great an idea, since basically every
user out there will have it enabled in the distro config and will waste
a noticeable amount of RAM.
The same problem of not sending any kind of notification to the user
when his pages are 'acked' by receiving some packet or freeing the data
has existed for a long time, and there were several attempts to fix it.
The most applicable to your case may be the DST experience. DST is a block
layer device and, starting from quite recent kernels, none of its pages are
allowed to be slab ones (xfs was the last one to provide slab
pages in the bios), so each page has two 'unused' pointers in the lru list
entry, which you may reuse. If the scsi layer may get slab pages from some
place (although this does not sound like a good idea, ->sendpage() will
bug on them anyway), this hack will not work; otherwise you only need
to have the net_page_get/put stuff in and need not mess with increasing the page.
And this was tested 3-4 kernel releases ago, so things may have changed.
Another approach is to increase skb's shared data (at the end of the
skb->data), and this approach was not frowned upon too much either, but
it requires messing with skb->destructor, which may not be appropriate
in some cases. If iscsi does not use sockets (it does iirc), things are
much simpler.
Hope this helps.
--
Evgeniy Polyakov
On Wed, Dec 10 2008, Vladislav Bolkhovitin wrote:
> This patch exports alloc_io_context() function. For performance reasons
> SCST queues commands using a pool of IO threads. It is considerably
> better for performance (>30% increase on sequential reads) if threads in
> a pool have the same IO context. Since SCST can be built as a module,
> it needs alloc_io_context() function exported.
>
> Signed-off-by: Vladislav Bolkhovitin <[email protected]>
> ---
> block/blk-ioc.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff -upkr linux-2.6.27.2/block/blk-ioc.c linux-2.6.27.2/block/blk-ioc.c
> --- linux-2.6.27.2/block/blk-ioc.c 2008-10-10 02:13:53.000000000 +0400
> +++ linux-2.6.27.2/block/blk-ioc.c 2008-11-25 21:27:01.000000000 +0300
> @@ -105,6 +105,7 @@ struct io_context *alloc_io_context(gfp_
>
> return ret;
> }
> +EXPORT_SYMBOL(alloc_io_context);
Why is this needed, can't you just use CLONE_IO?
--
Jens Axboe
Sam Ravnborg wrote:
> On Wed, Dec 10, 2008 at 09:30:37PM +0300, Vladislav Bolkhovitin wrote:
>> This patch contains SCST core code.
>>
>> Signed-off-by: Vladislav Bolkhovitin <[email protected]>
>> ---
>> drivers/scst/Kconfig | 256 ++
>
> Do not use "default n" - this is already the default value.
OK
>> drivers/scst/Makefile | 12
> EXTRA_CFLAGS are deprecated in favour of ccflags-y
OK
>> drivers/scst/scst_cdbprobe.h | 519 +++++
> static const struct scst_sdbops scst_scsi_op_table[]
>
> This does not belong to a header file.
OK
>> drivers/scst/scst_lib.c | 3689 +++++++++++++++++++++++++++++++++++++
>> drivers/scst/scst_main.c | 1919 +++++++++++++++++++
>> drivers/scst/scst_module.c | 69
>> drivers/scst/scst_priv.h | 513 +++++
>> drivers/scst/scst_targ.c | 5458 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 8 files changed, 12435 insertions(+)
>
> There was a lot af TRACE_ENTRY() / TRACE_EXIT() noise.
> We should have proper tools for that by now (I hope).
Sorry, I don't see any such tools that can, for instance, be
compiled out in non-debug builds.
On the one hand, I can agree, those TRACE_ENTRY()/TRACE_EXIT() statements
*may* look like noise (I personally don't notice them), but, on the
other hand, they have proved their usefulness in the past. If an
SCST user has a problem, I simply ask him to make a debug build, then
enable entry_exit and some other logging levels, then reproduce the
problem and send me the logs. Then in most cases I can see what's wrong
and provide a fix without additional actions and questions.
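To illustrate what "compiled out in non-debug builds" means here, below is a
minimal user-space sketch. The macro bodies, the DEBUG_TRACING switch and the
add_checked() function are assumptions made up for illustration; SCST's actual
TRACE_ENTRY()/TRACE_EXIT() definitions are not shown in this thread.

```c
#include <stdio.h>

/* Hypothetical switch; a real build would derive it from the debug config. */
#ifdef DEBUG_TRACING
#define TRACE_ENTRY() fprintf(stderr, "ENTRY %s\n", __func__)
#define TRACE_EXIT()  fprintf(stderr, "EXIT %s\n", __func__)
#else
/* Non-debug build: the statements expand to an empty statement. */
#define TRACE_ENTRY() do { } while (0)
#define TRACE_EXIT()  do { } while (0)
#endif

/* Illustrative function, instrumented at entry and exit. */
static int add_checked(int a, int b)
{
	int res;

	TRACE_ENTRY();
	res = a + b;
	TRACE_EXIT();
	return res;
}
```

Building with -DDEBUG_TRACING enables the messages; without it the calls
leave no code behind, which is the property asked for above.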
> We often ask for exported symbols to be documented - so one has
> a slight idea of their purpose.
They are documented next to their prototypes in the public header files,
scst.h in particular. It was done this way on the assumption that someone
writing a target driver or dev handler would have the header
files at hand, not the source code.
Should we move those comments from the function prototypes to the
function definitions?
> CONFIG_SCST_STRICT_SERIALIZING has bad impact on readability.
> Could this be abstracted better?
We will try.
Thanks,
Vlad
Hi Evgeniy,
Evgeniy Polyakov wrote:
> Hi Vladislav.
>
> On Wed, Dec 10, 2008 at 10:04:36PM +0300, Vladislav Bolkhovitin ([email protected]) wrote:
>> In the chosen approach new optional field void *net_priv was added to
>> struct page. It is enclosed by
>
> There is a huge no-no in networking land on increasing skb.
> Reason is simple every skb will carry potentially unneded data as long
> as given option is enabled, and most of the time it will.
> To break this barrier one has to have (I wanted to write ego, but then
> decided to replace it with mojo) so huge reason to do this, that it is
> almost impossible to have.
>
> Something tells me that increasing page structure with 8 bytes because
> of zero-copy iscsi transfer is not that great idea, since basically every
> user out there will have it enabled in the distro config and will waste
> noticeble amount of ram.
The waste will be only 0.2% of RAM, or 2MB per 1GB. Not much. Perhaps
not noticeable at all for an average user of distro kernels. Embedded
people, who count each byte, almost never need iSCSI, so they won't
have any problem disabling the
TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION option.
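For reference, the arithmetic behind the 0.2% figure can be sketched as
follows, assuming 4 KiB pages and an 8-byte net_priv pointer (both are
assumptions about a typical 64-bit configuration; the helper name is
hypothetical):

```c
#include <stdint.h>

/*
 * Extra memory consumed by one additional per-page field.
 * One struct page exists per physical page, so the cost is
 * (number of pages) * (size of the added field).
 */
static uint64_t net_priv_overhead_bytes(uint64_t ram_bytes,
					uint64_t page_size,
					uint64_t field_size)
{
	return (ram_bytes / page_size) * field_size;
}
```

With 1 GiB of RAM this gives 262144 pages * 8 bytes = 2 MiB, i.e. 8/4096,
or about 0.2% of RAM.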
Also, as I wrote, iSCSI-SCST can work without this patch or with the
TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION option disabled, but with
user space device handlers it will work considerably worse. Only a few
users of distro kernels need an iSCSI target, and only a few among those
need to use a user space device handler. So, the option
TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION can be safely disabled in
the default configs of distro kernels. People who need both an iSCSI target
*and* a fast user space device handler would simply enable that
option and rebuild the kernel. Rejecting this patch leaves a much worse
alternative: those people would first have to *patch* the kernel,
only then enable that option, and then rebuild the kernel.
> The same problem of not sending any kind of notification to the user
> when his pages are 'acked' by receiving some packet or freeing the data
> exists long ago and was tried to be fixed several times.
>
> The most applicable to your case maybe DST experience. DST is a block
> layer device and all its pages starting from quite recent kernels are
> not allowed to be slab ones (xfs was the last one who provided slab
> pages in the bios), so each page has two 'unused' pointers in lru list
> entry, which you may reuse. If scsi layer may have slab pages from some
> place (although this does not sound like a good idea, ->sendpage() will
> bug on on them anyway), this hack will not work, otherwise you only need
> to have net_page_get/put stuff in and do not mess with increasing page.
> And this was tested 3-4 kernel releases ago, so things may be changed.
I've just rechecked whether any of the struct page fields can be shared with
the net_priv field. Unfortunately, it looks like none of the mandatory
fields can:
- Pages being transmitted can be mapped in some user space, so sharing in
unions with the _mapcount, mapping and index fields isn't possible
- Pages being transmitted can be in the page cache, so sharing the lru field
isn't possible either
It might, however, be shared with the "virtual" field, since SCST allocates
only lowmem pages, but on non-HIGHMEM kernels and most HIGHMEM kernels
WANT_PAGE_VIRTUAL isn't defined, so it would change basically nothing.
> Another appropach is to increase skb's shared data (at the end of the
> skb->data), and this approach was not frowned upon too much either, but
> it requires to mess with skb->destructor, which may not be appropriate
> in some cases. If iscsi does not use sockets (it does iirc), things are
> much simpler.
As I wrote, the approach of using net_priv on the skb level was examined, but
rejected as impractical. Simply too many places would have to be modified to
prevent merging skb's with different net_priv-like labels, and those
places are far from obvious. Implementing this approach and,
especially, maintaining it would be just a nightmare.
Thanks,
Vlad
Jens Axboe wrote:
> On Wed, Dec 10 2008, Vladislav Bolkhovitin wrote:
>> This patch exports alloc_io_context() function. For performance reasons
>> SCST queues commands using a pool of IO threads. It is considerably
>> better for performance (>30% increase on sequential reads) if threads in
>> a pool have the same IO context. Since SCST can be built as a module,
>> it needs alloc_io_context() function exported.
>>
>> Signed-off-by: Vladislav Bolkhovitin <[email protected]>
>> ---
>> block/blk-ioc.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff -upkr linux-2.6.27.2/block/blk-ioc.c linux-2.6.27.2/block/blk-ioc.c
>> --- linux-2.6.27.2/block/blk-ioc.c 2008-10-10 02:13:53.000000000 +0400
>> +++ linux-2.6.27.2/block/blk-ioc.c 2008-11-25 21:27:01.000000000 +0300
>> @@ -105,6 +105,7 @@ struct io_context *alloc_io_context(gfp_
>>
>> return ret;
>> }
>> +EXPORT_SYMBOL(alloc_io_context);
>
> Why is this needed, can't you just use CLONE_IO?
There are two reasons for that:
1. The kthread interface doesn't support passing the CLONE_IO flag.
2. Each (virtual) device has its own pool of threads, which serves it.
Threads in each such pool should have a common IO context, but
different pools should have different IO contexts. So, it would be
necessary to implement a two-level start of the IO threads in each pool.
First, one thread would be started. Then it would call get_io_context()
to obtain an io_context. Then it would create the remaining threads with
the CLONE_IO flag. Definitely, that is a lot more complicated than a simple
call of alloc_io_context() and assignment of the returned context to
each newly created thread in a loop before they are run.
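The one-pass assignment described in (2) can be modelled in user space
roughly as below. This is only a sketch: struct io_ctx, struct worker,
ctx_alloc() and pool_share_ctx() are hypothetical stand-ins invented for
illustration, not the kernel's struct io_context or alloc_io_context().

```c
#include <stdlib.h>

/* User-space stand-ins, NOT the kernel types. */
struct io_ctx {
	int refcount;
};

struct worker {
	struct io_ctx *ioc;	/* context shared with the rest of the pool */
};

/* Allocate a fresh context holding one (allocation) reference. */
static struct io_ctx *ctx_alloc(void)
{
	struct io_ctx *c = malloc(sizeof(*c));

	if (c)
		c->refcount = 1;
	return c;
}

/*
 * Give every worker of one pool a reference to the same, newly
 * allocated context. Each call creates a separate context, so
 * different pools never share one. Returns 0 on success, -1 on error.
 */
static int pool_share_ctx(struct worker *pool, int n)
{
	struct io_ctx *c;
	int i;

	if (n <= 0)
		return -1;
	c = ctx_alloc();
	if (!c)
		return -1;
	for (i = 0; i < n; i++) {
		pool[i].ioc = c;
		c->refcount++;
	}
	c->refcount--;	/* drop the allocation reference */
	return 0;
}
```

Calling pool_share_ctx() once per device pool yields one context per pool,
which is the layout described above.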
Thanks,
Vlad
On Thu, Dec 11 2008, Vladislav Bolkhovitin wrote:
> Jens Axboe wrote:
> >On Wed, Dec 10 2008, Vladislav Bolkhovitin wrote:
> >>This patch exports alloc_io_context() function. For performance reasons
> >>SCST queues commands using a pool of IO threads. It is considerably
> >>better for performance (>30% increase on sequential reads) if threads in
> >> a pool have the same IO context. Since SCST can be built as a module,
> >>it needs alloc_io_context() function exported.
> >>
> >>Signed-off-by: Vladislav Bolkhovitin <[email protected]>
> >>---
> >> block/blk-ioc.c | 1 +
> >> 1 file changed, 1 insertion(+)
> >>
> >>diff -upkr linux-2.6.27.2/block/blk-ioc.c linux-2.6.27.2/block/blk-ioc.c
> >>--- linux-2.6.27.2/block/blk-ioc.c 2008-10-10 02:13:53.000000000 +0400
> >>+++ linux-2.6.27.2/block/blk-ioc.c 2008-11-25 21:27:01.000000000 +0300
> >>@@ -105,6 +105,7 @@ struct io_context *alloc_io_context(gfp_
> >>
> >> return ret;
> >> }
> >>+EXPORT_SYMBOL(alloc_io_context);
> >
> >Why is this needed, can't you just use CLONE_IO?
>
> There are two reasons for that:
>
> 1. kthread interface doesn't support passing CLONE_IO flag.
Then you fix that instead of working around it! :-)
> 2. Each (virtual) device has own pool of threads, which serves it.
> Threads in each such pools should have a common IO context, but
> different pools should have different IO contexts. So, it would be
> necessary to implement two levels start of IO threads in each pool. At
> first, one thread would be started. Then it would call get_io_context()
> to gain io_context. Then it would create the remaining threads with
> CLONE_IO flag. Definitely, it's a lot more complicated than a simple
> call of alloc_io_context() and assignment of the returned context to
> each just created thread in a loop before they were ran.
Just start the first thread without CLONE_IO, and subsequent threads
fork off that with CLONE_IO set? I think we need to make sure that we
allocate an IO context for the 'parent' if it doesn't have one already
and CLONE_IO is set, but that is something that can easily be rectified.
It may seem more complex, but if you use this approach you are pretty
much free to worry about any changes in the future there.
--
Jens Axboe
Jens Axboe wrote:
> On Thu, Dec 11 2008, Vladislav Bolkhovitin wrote:
>> Jens Axboe wrote:
>>> On Wed, Dec 10 2008, Vladislav Bolkhovitin wrote:
>>>> This patch exports alloc_io_context() function. For performance reasons
>>>> SCST queues commands using a pool of IO threads. It is considerably
>>>> better for performance (>30% increase on sequential reads) if threads in
>>>> a pool have the same IO context. Since SCST can be built as a module,
>>>> it needs alloc_io_context() function exported.
>>>>
>>>> Signed-off-by: Vladislav Bolkhovitin <[email protected]>
>>>> ---
>>>> block/blk-ioc.c | 1 +
>>>> 1 file changed, 1 insertion(+)
>>>>
>>>> diff -upkr linux-2.6.27.2/block/blk-ioc.c linux-2.6.27.2/block/blk-ioc.c
>>>> --- linux-2.6.27.2/block/blk-ioc.c 2008-10-10 02:13:53.000000000 +0400
>>>> +++ linux-2.6.27.2/block/blk-ioc.c 2008-11-25 21:27:01.000000000 +0300
>>>> @@ -105,6 +105,7 @@ struct io_context *alloc_io_context(gfp_
>>>>
>>>> return ret;
>>>> }
>>>> +EXPORT_SYMBOL(alloc_io_context);
>>> Why is this needed, can't you just use CLONE_IO?
>> There are two reasons for that:
>>
>> 1. kthread interface doesn't support passing CLONE_IO flag.
>
> Then you fix that instead of working around it! :-)
It isn't worth the effort, because of (2) below.
>> 2. Each (virtual) device has own pool of threads, which serves it.
>> Threads in each such pools should have a common IO context, but
>> different pools should have different IO contexts. So, it would be
>> necessary to implement two levels start of IO threads in each pool. At
>> first, one thread would be started. Then it would call get_io_context()
>> to gain io_context. Then it would create the remaining threads with
>> CLONE_IO flag. Definitely, it's a lot more complicated than a simple
>> call of alloc_io_context() and assignment of the returned context to
>> each just created thread in a loop before they were ran.
>
> Just start the first thread without CLONE_IO, and subsequent threads
> fork off that with CLONE_IO set?
Yes, that would be the two-stage thread creation. A *LOT* more
complicated than the direct io_context assignment using
alloc_io_context().
> I think we need to make sure that we
> allocate an IO context for the 'parent' if it doesn't have one already
> and CLONE_IO is set, but that is something that can easily be rectified.
Sorry, I don't feel I understood you here..
> It may seem more complex, but if you use this approach you are pretty
> much free to worry about any changes in the future there.
Worrying about future changes is routine in the Linux kernel, where there is
no stable API ;-)
On Thu, Dec 11 2008, Vladislav Bolkhovitin wrote:
> Jens Axboe wrote:
> >On Thu, Dec 11 2008, Vladislav Bolkhovitin wrote:
> >>Jens Axboe wrote:
> >>>On Wed, Dec 10 2008, Vladislav Bolkhovitin wrote:
> >>>>This patch exports alloc_io_context() function. For performance reasons
> >>>>SCST queues commands using a pool of IO threads. It is considerably
> >>>>better for performance (>30% increase on sequential reads) if threads
> >>>>in a pool have the same IO context. Since SCST can be built as a
> >>>> module, it needs alloc_io_context() function exported.
> >>>>
> >>>>Signed-off-by: Vladislav Bolkhovitin <[email protected]>
> >>>>---
> >>>> block/blk-ioc.c | 1 +
> >>>> 1 file changed, 1 insertion(+)
> >>>>
> >>>>diff -upkr linux-2.6.27.2/block/blk-ioc.c linux-2.6.27.2/block/blk-ioc.c
> >>>>--- linux-2.6.27.2/block/blk-ioc.c 2008-10-10 02:13:53.000000000 +0400
> >>>>+++ linux-2.6.27.2/block/blk-ioc.c 2008-11-25 21:27:01.000000000 +0300
> >>>>@@ -105,6 +105,7 @@ struct io_context *alloc_io_context(gfp_
> >>>>
> >>>> return ret;
> >>>>}
> >>>>+EXPORT_SYMBOL(alloc_io_context);
> >>>Why is this needed, can't you just use CLONE_IO?
> >>There are two reasons for that:
> >>
> >>1. kthread interface doesn't support passing CLONE_IO flag.
> >
> >Then you fix that instead of working around it! :-)
>
> It doesn't worth the effort, because of (2) below.
>
> >>2. Each (virtual) device has own pool of threads, which serves it.
> >>Threads in each such pools should have a common IO context, but
> >>different pools should have different IO contexts. So, it would be
> >>necessary to implement two levels start of IO threads in each pool. At
> >>first, one thread would be started. Then it would call get_io_context()
> >>to gain io_context. Then it would create the remaining threads with
> >>CLONE_IO flag. Definitely, it's a lot more complicated than a simple
> >>call of alloc_io_context() and assignment of the returned context to
> >>each just created thread in a loop before they were ran.
> >
> >Just start the first thread without CLONE_IO, and subsequent threads
> >fork off that with CLONE_IO set?
>
> Yes, that would be the two stages threads creation. A *LOT* more
> complicated, than with the direct io_context assignment using
> alloc_io_context().
>
> >I think we need to make sure that we
> >allocate an IO context for the 'parent' if it doesn't have one already
> >and CLONE_IO is set, but that is something that can easily be rectified.
>
> Sorry, I don't feel I understood you here..
Sure I understand that it's then a two-stage rocket for the first
context you fork off. I don't see how you qualify that as a *LOT* more
complicated...
> >It may seem more complex, but if you use this approach you are pretty
> >much free to worry about any changes in the future there.
>
> Worrying about future changes is regular in Linux kernel, where there is
> no stable API ;-)
Sure, but if your stuff gets merged then *I* have to fiddle with your
stuff as well when making changes. If you plan to keep your stuff out of
the kernel and maintain it there, fine, but I think you probably don't.
It's not a HUGE deal for this case, since you basically just want to use
alloc_io_context() and ioc_task_link(). So we can make the export and be
done with it.
--
Jens Axboe
On Thu, 2008-12-11 at 21:16 +0300, Vladislav Bolkhovitin wrote:
> Hi Evgeniy,
>
> Evgeniy Polyakov wrote:
> > Hi Vladislav.
> >
> > On Wed, Dec 10, 2008 at 10:04:36PM +0300, Vladislav Bolkhovitin ([email protected]) wrote:
> >> In the chosen approach new optional field void *net_priv was added to
> >> struct page. It is enclosed by
> >
> > There is a huge no-no in networking land on increasing skb.
> > Reason is simple every skb will carry potentially unneded data as long
> > as given option is enabled, and most of the time it will.
> > To break this barrier one has to have (I wanted to write ego, but then
> > decided to replace it with mojo) so huge reason to do this, that it is
> > almost impossible to have.
> >
> > Something tells me that increasing page structure with 8 bytes because
> > of zero-copy iscsi transfer is not that great idea, since basically every
> > user out there will have it enabled in the distro config and will waste
> > noticeble amount of ram.
>
> The waste will be only 0.2% of RAM or 2MB per 1GB. Not much. Perhaps,
> not noticeable for an average user of distro kernels at all. Embedded
> people, who count each byte, almost always don't need iSCSI, so won't
> have any problems to disable
> TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION option.
Actually, there are several other considerations:
1. struct page is a lowmem structure, so increasing its size
becomes problematic on x86 PAE systems.
2. The current 64 bit struct page seems to be exactly pushing a
cacheline boundary. Increasing it so it spills over will have a
performance impact.
It's the performance problems that will be most critical, I suspect, so
you'll need buy-in from the mm people for doing this.
One thing that leaps immediately to mind is that you could isolate this
to the net layer by putting it in skb_frag_struct. However, such a move
would require a proper API for this in net ... right now it looks like
you're using the struct page addition to carry this information from
SCSI to net, which is a bit of a layering violation.
James
On Wed, 2008-12-10 at 21:37 +0300, Vladislav Bolkhovitin wrote:
> This patch contains SCST the /proc interface.
>
> A description of this interface can be found in the patch with the
> SCST core documentation.
>
> Since a procfs-based configuration interface is unacceptable for new
> kernel modules, in the next review iteration SCST's configuration
> interface will be replaced by a sysfs-based configuration interface.
> This patch is not intended to be included in the Linux kernel, but is
> posted here, because as of today this configuration interface is
> necessary when using SCST.
>
> Unfortunately, configfs is not (yet) suited for configuring SCST. This
> is, because configfs is user space driven, so kernel can't create
> subdirectories on it, and all files on configfs are limited to 4K in
> size. It makes impossible for kernel to show, e.g., a list of connected
> initiators. Hence, with configfs it is necessary to have one more
> interface to show such data, e.g. sysfs-based.
Btw, please stop spreading FUD about ConfigFS. ConfigFS works great for
Target_Core_Mod and LIO-Target v3.0, and is what I have found as the
*BEST* foundation for generic target mod moving forward. This is not
based on a hypothetical discussion or on a long term TODO list, this has
been determined from actually writing the code, which is located at:
http://git.kernel.org/?p=linux/kernel/git/nab/lio-core-2.6.git;a=blob;f=drivers/lio-core/target_core_configfs.c
http://git.kernel.org/?p=linux/kernel/git/nab/lio-core-2.6.git;a=blob;f=drivers/lio-core/iscsi_target_configfs.c
So please, just because you don't want to acknowledge ConfigFS in your
own work, do not act like there are not already thousands of lines of
ConfigFS code up and running for the generic target mode and LIO-Target.
--nab
> It would lead to 2
> interfaces in two different places for configuring SCSI targets:
> configfs and sysfs based. Definitely, it is better to have only one,
> sysfs-based interface, than 2 interfaces. From other side, sysfs in what
> SCST needs provides basically the same possibilities as configfs. And
> it's widely used in the kernel to configure various its parameters. See,
> for instance, bonding devices or IO schedulers.
>
> The proposed /sys interface would be very similar to the current /proc
> layout intact, except cases, where output >PAGE_SIZE is needed. For such
> cases each entry, i.e. line, in such files would be presented as a
> subdirectory with name the first element in that line and each other
> element in it would be presented as a separate file (attribute). For
> instance, /proc/scsi_tgt/sessions, which lists connected sessions, would
> be converted to:
>
> /sys/scsi_tgt/
> /sys/scsi_tgt/sessions/
> /sys/scsi_tgt/sessions/session1_name/
> /sys/scsi_tgt/sessions/session1_name/target_name
> /sys/scsi_tgt/sessions/session1_name/initiator_name
> /sys/scsi_tgt/sessions/session1_name/acl -> ../../acls/aclX
> /sys/scsi_tgt/sessions/session1_name/commands
> /sys/scsi_tgt/sessions/session2_name/
> /sys/scsi_tgt/sessions/session2_name/target_name
> /sys/scsi_tgt/sessions/session2_name/initiator_name
> /sys/scsi_tgt/sessions/session2_name/acl -> ../../acls/aclY
> /sys/scsi_tgt/sessions/session2_name/commands
> .
> .
> .
>
> Addition of new, e.g. vdisk devices, would be done via echo'ing commands
> to, in this example, /sys/scsi_tgt/vdisk/mgmt file similarly as it's
> currently done with /proc/scsi_tgt/vdisk/vdisk (same commands, actually).
>
> Any comments and suggestions will be greatly appreciated.
>
> Signed-off-by: Vladislav Bolkhovitin <[email protected]>
> ---
> drivers/scst/scst_proc.c | 2196 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 2196 insertions(+)
>
> The patch is too big to be submitted inline. You can find it in
> http://scst.sourceforge.net/patches/scst_proc.diff
> >> drivers/scst/scst_lib.c | 3689 +++++++++++++++++++++++++++++++++++++
> >> drivers/scst/scst_main.c | 1919 +++++++++++++++++++
> >> drivers/scst/scst_module.c | 69
> >> drivers/scst/scst_priv.h | 513 +++++
> >> drivers/scst/scst_targ.c | 5458 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> 8 files changed, 12435 insertions(+)
> >
> >There was a lot af TRACE_ENTRY() / TRACE_EXIT() noise.
> >We should have proper tools for that by now (I hope).
>
> Sorry, I don't see such tools with, for instance, a possibility to be
> compiled out in non-debug builds.
>
> From one side, I can agree, those TRACE_ENTRY()/TRACE_EXIT() statements
> *may* look like a noise (I personally don't notice them), but, from
> other side, in past times they have proved how usable they are. If an
> SCST user has a problem, I simply ask him to make a debug build, then
> enable entry_exit and some other logging levels, then reproduce the
> problem and send me the logs. Then in most cases I can see what's wrong
> and provide a fix without additional actions and questions.
I had ftrace in mind, but it has not hit mainline yet.
It will before this patchset does, though.
>
> >We often ask for exported symbols to be documented - so one has
> >a slight idea of their purpose.
>
> They are documented, near their prototypes in the public header files,
> particularly, scst.h. It was done so, because it was supposed that one,
> writing a target driver or dev handler will have on hands the header
> files, not source code.
>
> Should we move those comments from the functions prototypes to the
> functions definitions?
If there will be multiple implementations of the same prototype
we generally recommend sticking the comment in a .h file.
But otherwise keep it close to the source it describes, with the
minimal hope that it gets updated.
Oh - and we do not distribute a stripped-down, headers-only
version of the kernel. Users will see the full kernel source.
Sam
On Wed, 2008-12-10 at 22:01 +0300, Vladislav Bolkhovitin wrote:
> This patch contains iSCSI-SCST target driver. This driver is a heavily
> modified forked with all respects IET
> (http://iscsitarget.sourceforge.net). Modifications were aimed to make a
> clearer, more reviewable and maintainable code as well as to fix many
> problems and make many improvements. See
> http://scst.sourceforge.net/target_iscsi.html for more details.
>
> It has split user/kernel space architecture, where all management,
> sessions creation, parameters negotiation, etc. made in user space and
> data are transferred in the kernel space. Such architecture for iSCSI
> processing was many times acknowledged as the right one. Particularly,
> in-kernel iSCSI initiator (open-iscsi) has such architecture.
>
Just as with the Open/iSCSI Initiator, IMHO the split
architecture design is difficult to improve, debug and maintain,
and provides *ZERO* additional benefit in the context of traditional
iSCSI target mode for doing login and connection/session setup in
userspace.
Also, I appreciate that you spent a lot of time porting IET code over to
your engine, but during our previous discussion you did not seem
terribly interested in validation against core-iscsi-dv
(http://linux-iscsi.org/index.php/Core-iscsi-dv) to test RFC-3720
interop and stability. Because the Core-iSCSI Initiator supports every
possible parameter combination up to ErrorRecoveryLevel=0 defined in
RFC-3720, the Core-iSCSI-Dv tests can run badblocks (or any tool) to
check data integrity for *EVERY* possible traditional iSCSI key
combination and functionality for your iSCSI-SCST work, and any type of
serious iSCSI-SCST production deployments.
Until you can at least do that to prove both the stability and
completeness of your iSCSI Target stack up to RFC-3720
ErrorRecoveryLevel=0, I don't think this code makes sense for any type
of mainline acceptance. Also, I would appreciate it if you could stop the
handwaving about MC/S and ErrorRecoveryLevel=2 as well. MC/S (multiple
connection paths for iSCSI/TCP, not necessary with multiple association
LIO-Target SCTP) and ERL=2 (OS independent fabric recovery) are available
in both RFC-3720 (Traditional iSCSI) and RFC-5045 (iSER for iWARP and
IB).
Trying to argue against implementing the complete iSCSI RFC
functionality (like some folks did during the early Open/iSCSI days)
that customers and users *WANT* now is going to fall on deaf ears.
Please understand that there are going to be many, many different iSCSI
Initiators connecting to the upstream iSCSI Target Core infrastructure,
and trying to rush in an incomplete iSCSI Target $FABRIC_MOD
implementation benefits no one.
Regards,
--nab
> Signed-off-by: Vladislav Bolkhovitin <[email protected]>
> ---
> drivers/scst/iscsi-scst/Kconfig | 25
> drivers/scst/iscsi-scst/Makefile | 6
> drivers/scst/iscsi-scst/config.c | 597 +++++++
> drivers/scst/iscsi-scst/conn.c | 488 +++++
> drivers/scst/iscsi-scst/digest.c | 221 ++
> drivers/scst/iscsi-scst/digest.h | 31
> drivers/scst/iscsi-scst/event.c | 116 +
> drivers/scst/iscsi-scst/iscsi.c | 3066 ++++++++++++++++++++++++++++++++++++
> drivers/scst/iscsi-scst/iscsi.h | 577 ++++++
> drivers/scst/iscsi-scst/iscsi_dbg.h | 73
> drivers/scst/iscsi-scst/iscsi_hdr.h | 517 ++++++
> drivers/scst/iscsi-scst/nthread.c | 1522 +++++++++++++++++
> drivers/scst/iscsi-scst/param.c | 263 +++
> drivers/scst/iscsi-scst/session.c | 210 ++
> drivers/scst/iscsi-scst/target.c | 306 +++
> include/scst/iscsi_scst.h | 156 +
> include/scst/iscsi_scst_ver.h | 16
> 17 files changed, 8190 insertions(+)
>
> The patch is too big to be submitted inline. You can find it in
> http://scst.sourceforge.net/patches/iscsi-scst.diff
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
On Thu, 2008-12-11 at 14:55 -0800, Nicholas A. Bellinger wrote:
> On Wed, 2008-12-10 at 22:01 +0300, Vladislav Bolkhovitin wrote:
> > This patch contains iSCSI-SCST target driver. This driver is a heavily
> > modified forked with all respects IET
> > (http://iscsitarget.sourceforge.net). Modifications were aimed to make a
> > clearer, more reviewable and maintainable code as well as to fix many
> > problems and make many improvements. See
> > http://scst.sourceforge.net/target_iscsi.html for more details.
> >
> > It has split user/kernel space architecture, where all management,
> > sessions creation, parameters negotiation, etc. made in user space and
> > data are transferred in the kernel space. Such architecture for iSCSI
> > processing was many times acknowledged as the right one. Particularly,
> > in-kernel iSCSI initiator (open-iscsi) has such architecture.
> >
>
> Just as with the Open/iSCSI Initiator, IMHO I believe the split
> architecture design is difficult both to improve, debug and maintain,
> and provides *ZERO* additional benefit in the context of traditional
> iSCSI target mode for doing login and connection/session setup in
> userspace.
>
> Also, I appreciate that you spent a lot of time porting over IET code to
> your engine, but during our previous discussion you did not seem
> terribly interested in validation against core-iscsi-dv
> (http://linux-iscsi.org/index.php/Core-iscsi-dv) to test RFC-3720
> interop and stability. Because the Core-iSCSI Initiator supports every
> possible parameter combination up to ErrorRecoveryLevel=0 defined in
> RFC-3720, the Core-iSCSI-Dv tests can run badblocks (or any tool) to
> check data integrity for *EVERY* possible traditional iSCSI key
> combination and functionality for your iSCSI-SCST work, and any type of
> serious iSCSI-SCST production deployments.
>
> Until you can at least do that to prove both the stability and
> completeness of your iSCSI Target stack up to RFC-3720
> ErrorRecoveryLevel=0, I don't think this code makes sense for any type
> of mainline acceptance. Also, I would appreciate it if you could stop the
> handwaving about MC/S and ErrorRecoveryLevel=2 as well. MC/S (multiple
> connection paths for iSCSI/TCP, not necessary with multiple association
> LIO-Target SCTP) and ERL=2 (OS independent fabric recovery) are available
> in both RFC-3720 (Traditional iSCSI) and RFC-5045 (iSER for iWARP and
> IB).
>
Note, btw: that should have been RFC-5046 for iSER. RFC-5045 is the actual
case for RDMA/DDP over TCP and the Stream Control Transmission Protocol (SCTP).
Regards,
--nab
> Trying to argue against implementing the complete iSCSI RFC
> functionality (like some folks did during the early Open/iSCSI days)
> that customers and users *WANT* now is going to fall on deaf ears.
> Please understand that there are going to be many, many different iSCSI
> Initiators connecting to the upstream iSCSI Target Core infrastructure,
> and trying to rush in an incomplete iSCSI Target $FABRIC_MOD
> implementation benefits no one.
>
> Regards,
>
> --nab
>
> > Signed-off-by: Vladislav Bolkhovitin <[email protected]>
> > ---
> > drivers/scst/iscsi-scst/Kconfig | 25
> > drivers/scst/iscsi-scst/Makefile | 6
> > drivers/scst/iscsi-scst/config.c | 597 +++++++
> > drivers/scst/iscsi-scst/conn.c | 488 +++++
> > drivers/scst/iscsi-scst/digest.c | 221 ++
> > drivers/scst/iscsi-scst/digest.h | 31
> > drivers/scst/iscsi-scst/event.c | 116 +
> > drivers/scst/iscsi-scst/iscsi.c | 3066 ++++++++++++++++++++++++++++++++++++
> > drivers/scst/iscsi-scst/iscsi.h | 577 ++++++
> > drivers/scst/iscsi-scst/iscsi_dbg.h | 73
> > drivers/scst/iscsi-scst/iscsi_hdr.h | 517 ++++++
> > drivers/scst/iscsi-scst/nthread.c | 1522 +++++++++++++++++
> > drivers/scst/iscsi-scst/param.c | 263 +++
> > drivers/scst/iscsi-scst/session.c | 210 ++
> > drivers/scst/iscsi-scst/target.c | 306 +++
> > include/scst/iscsi_scst.h | 156 +
> > include/scst/iscsi_scst_ver.h | 16
> > 17 files changed, 8190 insertions(+)
> >
> > The patch is too big to be submitted inline. You can find it in
> > http://scst.sourceforge.net/patches/iscsi-scst.diff
> >
> >
> >
Jens Axboe wrote:
> On Thu, Dec 11 2008, Vladislav Bolkhovitin wrote:
>> Jens Axboe wrote:
>>> On Thu, Dec 11 2008, Vladislav Bolkhovitin wrote:
>>>> Jens Axboe wrote:
>>>>> On Wed, Dec 10 2008, Vladislav Bolkhovitin wrote:
>>>>>> This patch exports alloc_io_context() function. For performance reasons
>>>>>> SCST queues commands using a pool of IO threads. It is considerably
>>>>>> better for performance (>30% increase on sequential reads) if threads
>>>>>> in a pool have the same IO context. Since SCST can be built as a
>>>>>> module, it needs alloc_io_context() function exported.
>>>>>>
>>>>>> Signed-off-by: Vladislav Bolkhovitin <[email protected]>
>>>>>> ---
>>>>>> block/blk-ioc.c | 1 +
>>>>>> 1 file changed, 1 insertion(+)
>>>>>>
>>>>>> diff -upkr linux-2.6.27.2/block/blk-ioc.c linux-2.6.27.2/block/blk-ioc.c
>>>>>> --- linux-2.6.27.2/block/blk-ioc.c 2008-10-10 02:13:53.000000000 +0400
>>>>>> +++ linux-2.6.27.2/block/blk-ioc.c 2008-11-25 21:27:01.000000000 +0300
>>>>>> @@ -105,6 +105,7 @@ struct io_context *alloc_io_context(gfp_
>>>>>>
>>>>>> return ret;
>>>>>> }
>>>>>> +EXPORT_SYMBOL(alloc_io_context);
>>>>> Why is this needed, can't you just use CLONE_IO?
>>>> There are two reasons for that:
>>>>
>>>> 1. kthread interface doesn't support passing CLONE_IO flag.
>>> Then you fix that instead of working around it! :-)
>> It isn't worth the effort, because of (2) below.
>>
>>>> 2. Each (virtual) device has own pool of threads, which serves it.
>>>> Threads in each such pools should have a common IO context, but
>>>> different pools should have different IO contexts. So, it would be
>>>> necessary to implement two levels start of IO threads in each pool. At
>>>> first, one thread would be started. Then it would call get_io_context()
>>>> to gain io_context. Then it would create the remaining threads with
>>>> CLONE_IO flag. Definitely, it's a lot more complicated than a simple
>>>> call of alloc_io_context() and assignment of the returned context to
>>>> each just created thread in a loop before they were ran.
>>> Just start the first thread without CLONE_IO, and subsequent threads
>>> fork off that with CLONE_IO set?
>> Yes, that would be the two stages threads creation. A *LOT* more
>> complicated, than with the direct io_context assignment using
>> alloc_io_context().
>>
>>> I think we need to make sure that we
>>> allocate an IO context for the 'parent' if it doesn't have one already
>>> and CLONE_IO is set, but that is something that can easily be rectified.
>> Sorry, I don't feel I understood you here..
>
> Sure I understand that it's then a two-stage rocket for the first
> context you fork off. I don't see how you qualify that as a *LOT* more
> complicated...
It would split logic that can be handled in one place across two
places. Hence it would be harder to implement, understand and
maintain. That is always something to avoid.
Actually, there is another, simpler method to achieve the same thing that
alloc_io_context() does, without exporting it:
	struct io_context *ioc, *t;
	t = current->io_context;		/* save the caller's own context */
	current->io_context = NULL;		/* force a fresh allocation below */
	ioc = get_io_context(GFP_KERNEL, -1);
	current->io_context = t;		/* restore the caller's context */
But, I believe, you would like such dirty hacking even less ;-)
>>> It may seem more complex, but if you use this approach you are pretty
>>> much free to worry about any changes in the future there.
>> Worrying about future changes is regular in Linux kernel, where there is
>> no stable API ;-)
>
> Sure, but if your stuff gets merged then *I* have to fiddle with your
> stuff as well when making changes. If you plan to keep your stuff out of
> the kernel and maintain it there, fine, but I think you probably don't.
>
> It's not a HUGE deal for this case, since you basically just want to use
> alloc_io_context() and ioc_task_link(). So we can make the export and be
> done with it.
Thanks!
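As a rough illustration of the direct assignment discussed above, here is
a user-space sketch, not kernel code. struct io_ctx, struct worker,
pool_init() and pool_exit() are hypothetical names; io_ctx_alloc() only
stands in for alloc_io_context(), and the reference bump in the loop only
models what ioc_task_link() would do. It allocates one context per pool and
attaches it to every worker in a single loop before any worker runs.

```c
#include <stdlib.h>

/* Hypothetical user-space stand-in for the kernel's struct io_context. */
struct io_ctx {
	int refcount;	/* number of owners of this context */
};

/* Stand-in for a pool thread that has been created but not yet started. */
struct worker {
	struct io_ctx *ioc;
};

/* Models alloc_io_context(): a fresh context holding one reference. */
static struct io_ctx *io_ctx_alloc(void)
{
	struct io_ctx *ioc = calloc(1, sizeof(*ioc));

	if (ioc)
		ioc->refcount = 1;
	return ioc;
}

/* Drops one reference and frees the context when the last one is gone. */
static void io_ctx_put(struct io_ctx *ioc)
{
	if (ioc && --ioc->refcount == 0)
		free(ioc);
}

/*
 * One-stage pool setup: allocate a single context and give a reference
 * to every worker in one loop. Returns 0 on success, -1 on failure.
 */
static int pool_init(struct worker *w, int n)
{
	struct io_ctx *ioc = io_ctx_alloc();
	int i;

	if (!ioc)
		return -1;
	for (i = 0; i < n; i++) {
		ioc->refcount++;	/* models ioc_task_link() */
		w[i].ioc = ioc;
	}
	io_ctx_put(ioc);		/* drop the allocation reference */
	return 0;
}

/* Releases the reference held by each worker of a pool. */
static void pool_exit(struct worker *w, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		io_ctx_put(w[i].ioc);
		w[i].ioc = NULL;
	}
}
```

Calling pool_init() once per (virtual) device gives each pool its own
context, so threads within a pool share a context while different pools
do not.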
Nicholas A. Bellinger wrote:
> On Wed, 2008-12-10 at 21:37 +0300, Vladislav Bolkhovitin wrote:
>> This patch contains SCST the /proc interface.
>>
>> A description of this interface can be found in the patch with the
>> SCST core documentation.
>>
>> Since a procfs-based configuration interface is unacceptable for new
>> kernel modules, in the next review iteration SCST's configuration
>> interface will be replaced by a sysfs-based configuration interface.
>> This patch is not intended to be included in the Linux kernel, but is
>> posted here, because as of today this configuration interface is
>> necessary when using SCST.
>>
>> Unfortunately, configfs is not (yet) suited for configuring SCST. This
>> is, because configfs is user space driven, so kernel can't create
>> subdirectories on it, and all files on configfs are limited to 4K in
>> size. It makes impossible for kernel to show, e.g., a list of connected
>> initiators. Hence, with configfs it is necessary to have one more
>> interface to show such data, e.g. sysfs-based.
>
> Btw, please stop spreading FUD about ConfigFS. ConfigFS works great for
> Target_Core_Mod and LIO-Target v3.0, and is what I have found as the
> *BEST* foundation for generic target mod moving forward. This is not
> based on a hypothetical discussion or on a long term TODO list, this has
> been determined from actually writing the code, which is located at:
>
> http://git.kernel.org/?p=linux/kernel/git/nab/lio-core-2.6.git;a=blob;f=drivers/lio-core/target_core_configfs.c
>
> http://git.kernel.org/?p=linux/kernel/git/nab/lio-core-2.6.git;a=blob;f=drivers/lio-core/iscsi_target_configfs.c
>
> So please, just because you don't want to acknowledge ConfigFS in your
> own work, do not act like there is not already thounsands of lines of
> ConfigFS code up and running for the generic target mode and LIO-Target.
Nicholas,
You have an *exceptional* ability not to see what you don't want to see.
It has already happened with Persistent Reservations over pass-through
backend (see the end of http://lkml.org/lkml/2008/7/10/328 and
subsequent messages in this thread) and now this is happening with
configfs. I have already described to you twice why configfs isn't
appropriate for a SCSI target (the first time in
http://lkml.org/lkml/2008/10/21/259), but you keep refusing to see it.
In short:
1. Kernel can't create subdirectories in configfs
2. Configfs, like sysfs, doesn't allow files >4K
3. What you have been doing to live with the above limitations is
implementing the "access allowed only for explicitly specified initiators
and forbidden for all others" security approach. This approach is
unacceptable in practice. The majority of people simply define the
available devices for a target and don't bother with listing the
initiators that are allowed to connect to it. But you force them to do
that, and to keep doing it again and again for each related network change.
In contrast, sysfs allows the kernel to create subdirectories. That makes
it possible to work around the 4K limitation with a simple hierarchy of
subdirectories.
So, both security approaches ("access allowed only for explicitly
specified initiators and forbidden for all others" and "access forbidden
only for explicitly specified initiators and allowed for all others")
can be seamlessly implemented on sysfs as it is currently done on procfs
in SCST.
Vlad
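The two security approaches named above differ only in how a lookup result
is interpreted. The following is a minimal sketch of that difference; it is
not taken from SCST or LIO, and enum acl_mode, struct target_acl and both
functions are hypothetical names used purely for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical policy selector for one target. */
enum acl_mode {
	ACL_ALLOW_LISTED,	/* only listed initiators may connect */
	ACL_DENY_LISTED,	/* all but the listed initiators may connect */
};

/* Hypothetical per-target access list. */
struct target_acl {
	enum acl_mode mode;
	const char *const *names;	/* initiator names, may be NULL if count == 0 */
	size_t count;
};

/* Returns true if the initiator name appears in the target's list. */
static bool acl_listed(const struct target_acl *acl, const char *initiator)
{
	size_t i;

	for (i = 0; i < acl->count; i++)
		if (strcmp(acl->names[i], initiator) == 0)
			return true;
	return false;
}

/* Applies the target's policy to the lookup result. */
static bool acl_access_allowed(const struct target_acl *acl,
			       const char *initiator)
{
	bool listed = acl_listed(acl, initiator);

	return acl->mode == ACL_ALLOW_LISTED ? listed : !listed;
}
```

With ACL_DENY_LISTED and an empty list, every initiator is admitted, which
corresponds to the "define devices and don't list initiators" setup
described above.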
Sam Ravnborg wrote:
>>>> drivers/scst/scst_lib.c | 3689
>>>> +++++++++++++++++++++++++++++++++++++
>>>> drivers/scst/scst_main.c | 1919 +++++++++++++++++++
>>>> drivers/scst/scst_module.c | 69
>>>> drivers/scst/scst_priv.h | 513 +++++
>>>> drivers/scst/scst_targ.c | 5458
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> 8 files changed, 12435 insertions(+)
>>> There was a lot af TRACE_ENTRY() / TRACE_EXIT() noise.
>>> We should have proper tools for that by now (I hope).
>> Sorry, I don't see such tools with, for instance, a possibility to be
>> compiled out in non-debug builds.
>>
>> From one side, I can agree, those TRACE_ENTRY()/TRACE_EXIT() statements
>> *may* look like a noise (I personally don't notice them), but, from
>> other side, in past times they have proved how usable they are. If an
>> SCST user has a problem, I simply ask him to make a debug build, then
>> enable entry_exit and some other logging levels, then reproduce the
>> problem and send me the logs. Then in most cases I can see what's wrong
>> and provide a fix without additional actions and questions.
>
> I had ftrace in mind but it has not hit mainline yet.
> But will do before this patchset does.
Unfortunately, ftrace at the moment can't fully replace
TRACE_ENTRY()/TRACE_EXIT() statements, because it lacks:
1. Tracing in only some modules or source files. Noise from tracing
the whole kernel is not needed if we investigate a problem in only one
module (SCST in this case).
2. The ability to inject our own messages into the output stream. It's
needed to have a synchronized stream of output messages about both
function tracing and what those functions are doing. For instance:
    entry in func1()
    adding cmd C to list L
    exit from func1()
Otherwise there would be 2 traces:
    entry in func1()
    exit from func1()
and
    adding cmd C to list L
which are almost impossible to merge for analysis.
(Steven Rostedt added on CC)
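To make points 1 and 2 above concrete, here is a small user-space model of
entry/exit macros that share one output stream with ordinary messages and
compile to nothing in a non-debug build. It is only a sketch: these are not
the real SCST macro definitions, and TRACE_DEBUG, TRACE_MSG(), trace_emit()
and trace_buf are hypothetical names.

```c
#include <stdarg.h>
#include <stdio.h>

#define TRACE_DEBUG 1	/* remove this line to model a non-debug build */

static char trace_buf[256];	/* single stream shared by all trace output */
static size_t trace_len;

/* Appends one formatted message to the shared stream, truncating if full. */
static void trace_emit(const char *fmt, ...)
{
	va_list ap;
	int n;

	if (trace_len >= sizeof(trace_buf) - 1)
		return;
	va_start(ap, fmt);
	n = vsnprintf(trace_buf + trace_len, sizeof(trace_buf) - trace_len,
		      fmt, ap);
	va_end(ap);
	if (n > 0) {
		size_t room = sizeof(trace_buf) - 1 - trace_len;

		trace_len += (size_t)n < room ? (size_t)n : room;
	}
}

#ifdef TRACE_DEBUG
#define TRACE_ENTRY()	trace_emit("entry in %s()\n", __func__)
#define TRACE_EXIT()	trace_emit("exit from %s()\n", __func__)
#define TRACE_MSG(...)	trace_emit(__VA_ARGS__)
#else
/* Non-debug build: every trace statement expands to an empty statement. */
#define TRACE_ENTRY()	do { } while (0)
#define TRACE_EXIT()	do { } while (0)
#define TRACE_MSG(...)	do { } while (0)
#endif

/* Example function whose trace interleaves entry, message and exit. */
static void func1(void)
{
	TRACE_ENTRY();
	TRACE_MSG("adding cmd %c to list %c\n", 'C', 'L');
	TRACE_EXIT();
}
```

Because all three macros write through the same function, the message line
lands between the entry and exit lines without any later merging step.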
>>> We often ask for exported symbols to be documented - so one has
>>> a slight idea of their purpose.
>> They are documented, near their prototypes in the public header files,
>> particularly, scst.h. It was done so, because it was supposed that one,
>> writing a target driver or dev handler will have on hands the header
>> files, not source code.
>>
>> Should we move those comments from the functions prototypes to the
>> functions definitions?
> If there will be multiple implementations of the same prototype
> we generally recommend to stick the comment in a .h file.
> But otherwise keep it close to the source it describes with the
> minimal hope that it gets updated.
OK, we will move descriptions of functions from prototypes to definitions.
> Oh - and we do not distribute a stripped down headers only
> version of the kernel. Users will see full kernel source.
Those headers are not stripped down. A better analogy is with C++ private
and public class interfaces. Public SCST headers have public symbols for
external modules (target drivers mainly), while private headers are for
internal SCST use only. They are needed to better draw the boundary
between the public and the internal private interfaces.
Thanks,
Vlad
Nicholas A. Bellinger wrote:
> On Wed, 2008-12-10 at 22:01 +0300, Vladislav Bolkhovitin wrote:
>> This patch contains iSCSI-SCST target driver. This driver is a heavily
>> modified forked with all respects IET
>> (http://iscsitarget.sourceforge.net). Modifications were aimed to make a
>> clearer, more reviewable and maintainable code as well as to fix many
>> problems and make many improvements. See
>> http://scst.sourceforge.net/target_iscsi.html for more details.
>>
>> It has split user/kernel space architecture, where all management,
>> sessions creation, parameters negotiation, etc. made in user space and
>> data are transferred in the kernel space. Such architecture for iSCSI
>> processing was many times acknowledged as the right one. Particularly,
>> in-kernel iSCSI initiator (open-iscsi) has such architecture.
>>
>
> Just as with the Open/iSCSI Initiator, IMHO I believe the split
> architecture design is difficult both to improve, debug and maintain,
> and provides *ZERO* additional benefit in the context of traditional
> iSCSI target mode for doing login and connection/session setup in
> userspace.
>
> Also, I appreciate that you spent a lot of time porting over IET code to
> your engine, but during our previous discussion you did not seem
> terribly interested in validation against core-iscsi-dv
> (http://linux-iscsi.org/index.php/Core-iscsi-dv) to test RFC-3720
> interop and stability. Because the Core-iSCSI Initiator supports every
> possible parameter combination up to ErrorRecoveryLevel=0 defined in
> RFC-3720, the Core-iSCSI-Dv tests can run badblocks (or any tool) to
> check data integrity for *EVERY* possible traditional iSCSI key
> combination and functionality for your iSCSI-SCST work, and any type of
> serious iSCSI-SCST production deployments.
The fact that nobody has so far cared to run all those complicated,
time-consuming and rather academic tests doesn't mean that iSCSI-SCST won't
pass them. IET/iSCSI-SCST have been used for a long time in very different
setups, including xBSD and Solaris initiators on non-x86 architectures,
without any problems.
Vlad
James Bottomley wrote:
> On Thu, 2008-12-11 at 21:16 +0300, Vladislav Bolkhovitin wrote:
>> Hi Evgeniy,
>>
>> Evgeniy Polyakov wrote:
>>> Hi Vladislav.
>>>
>>> On Wed, Dec 10, 2008 at 10:04:36PM +0300, Vladislav Bolkhovitin ([email protected]) wrote:
>>>> In the chosen approach new optional field void *net_priv was added to
>>>> struct page. It is enclosed by
>>> There is a huge no-no in networking land on increasing skb.
>>> Reason is simple every skb will carry potentially unneded data as long
>>> as given option is enabled, and most of the time it will.
>>> To break this barrier one has to have (I wanted to write ego, but then
>>> decided to replace it with mojo) so huge reason to do this, that it is
>>> almost impossible to have.
>>>
>>> Something tells me that increasing page structure with 8 bytes because
>>> of zero-copy iscsi transfer is not that great idea, since basically every
>>> user out there will have it enabled in the distro config and will waste
>>> noticeble amount of ram.
>> The waste will be only 0.2% of RAM or 2MB per 1GB. Not much. Perhaps,
>> not noticeable for an average user of distro kernels at all. Embedded
>> people, who count each byte, almost always don't need iSCSI, so won't
>> have any problems to disable
>> TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION option.
>
> Actually, there are several other considerations:
>
> 1. struct page is a lowmem structure, so increasing its size
> becomes problematic on x86 PAE systems.
> 2. The current 64 bit struct page seems to be exactly pushing a
> cacheline boundary. Increasing it so it spills over will have a
> performance impact
This is why I suggest having
CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION disabled in
general kernels. iSCSI-SCST will still work with almost no performance
loss for in-kernel backends, and people who need the option would rather
just recompile the kernel than patch it and then recompile.
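The "0.2% of RAM or 2MB per 1GB" estimate quoted above can be checked with
a few lines of arithmetic. The sketch below assumes 4 KiB pages and an
8-byte pointer (a 64-bit build); both values are assumptions of this
example, and page_field_overhead() is a hypothetical helper name.

```c
#include <stdint.h>

/*
 * Extra memory, in bytes, consumed when every page descriptor grows by
 * field_size bytes: one descriptor exists per page of RAM.
 */
static uint64_t page_field_overhead(uint64_t ram_bytes, uint64_t page_size,
				    uint64_t field_size)
{
	return ram_bytes / page_size * field_size;
}
```

For 1 GiB of RAM this gives 262144 pages times 8 bytes, i.e. 2 MiB, and
8 / 4096 is about 0.195%. With a 4-byte pointer on a 32-bit build the same
formula gives 1 MiB per GiB.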
> It's the performance problems that will be most critical, I suspect, so
> you'll need mm people buy in for doing this.
I'll ask in linux-mm, thanks for the suggestion.
> One thing that leaps immediately to mind is that you could isolate this
> to the net layer by putting it in skb_frag_struct. However, such a move
> would require a proper API for this in net ...
Having a net_priv analog in skb was the first idea I tried. But I
quickly gave up, because it would require that all the pages in each
skb_frag_struct be from the same originator, i.e. with the same
net_priv. It is impractical to change all the operations on skb's to
forbid merging them if they have different net_priv. There are too many
such places in very non-obvious pieces of code.
> right now it looks like
> you're using the struct page addition to carry this information from
> SCSI to net, which is a bit of a layering violation.
I don't think there is any layering violation here. The lower layer just
notifies the upper layer that transmission of a page has finished. It's
done in a not quite straightforward way, but it is still basically the same
as, for instance, the on_free_cmd() callbacks which the SCST core uses to
notify target drivers and dev handlers that the corresponding command is
about to be freed, so they can free the data associated with it as well.
Thanks,
Vlad
On Fri, 2008-12-12 at 22:25 +0300, Vladislav Bolkhovitin wrote:
> > One thing that leaps immediately to mind is that you could isolate this
> > to the net layer by putting it in skb_frag_struct. However, such a move
> > would require a proper API for this in net ...
>
> To have net_priv analog in skb was the first idea I was tried. But I
> quickly gave up, because it would required that all the pages in each
> skb_frag_struct be from the same originator, i.e. with the same
> net_priv. It is unpractical to change all the operations with skb's to
> forbid merging them, if they have different net_priv. There are too many
> such places in very not obvious code pieces.
Actually, I said carry in skb_frag_struct not skb ... that allows for
merging of skbs with different page sources. The API changes would have
to allow setting at this level.
> > right now it looks like
> > you're using the struct page addition to carry this information from
> > SCSI to net, which is a bit of a layering violation.
>
> I don't think there is any layering violation here. Just lower layer
> notifies upper layer that transmission of a page has finished. It's done
> a bit not straightforward, but still basically the same as, for
> instance, on_free_cmd() callbacks which SCST core uses to notify target
> drivers and dev handlers that the corresponding command is about to be
> freed, so they can free associated with it data as well.
The way you convey that you want notification is via a private tag in
struct page, so you're carrying the information on an
object that belongs to neither layer ... that's the violation. It's
essentially an extension of the net API that goes via the mm layer.
James
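The exchange above can be illustrated with a small user-space model in
which the notification cookie lives in a per-fragment descriptor owned by
the transmit layer instead of in a shared page structure. This is only a
sketch of the idea: struct tx_frag is not the kernel's skb_frag_struct, and
every name below is hypothetical.

```c
#include <stddef.h>

/* Hypothetical per-fragment descriptor owned by the lower (transmit) layer. */
struct tx_frag {
	void *data;
	void (*tx_done)(void *priv);	/* set by the upper layer, may be NULL */
	void *priv;			/* opaque cookie handed back to tx_done */
};

/* Lower layer: called once a fragment has been sent and is being released. */
static void tx_frag_release(struct tx_frag *frag)
{
	if (frag->tx_done)
		frag->tx_done(frag->priv);
	frag->data = NULL;
}

/* Upper layer: per-command state counting fragments still in flight. */
struct upper_cmd {
	int frags_in_flight;
};

/* Upper layer's completion callback, invoked by the lower layer. */
static void upper_frag_done(void *priv)
{
	struct upper_cmd *cmd = priv;

	cmd->frags_in_flight--;
}
```

Since each fragment carries its own callback and cookie, fragments from
different originators can sit side by side, and the upper layer learns when
its last fragment has been released by watching the counter reach zero.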
(Added Ingo and Frederic)
On Fri, 2008-12-12 at 22:24 +0300, Vladislav Bolkhovitin wrote:
> Sam Ravnborg wrote:
> >>>> drivers/scst/scst_lib.c | 3689
> >>>> +++++++++++++++++++++++++++++++++++++
> >>>> drivers/scst/scst_main.c | 1919 +++++++++++++++++++
> >>>> drivers/scst/scst_module.c | 69
> >>>> drivers/scst/scst_priv.h | 513 +++++
> >>>> drivers/scst/scst_targ.c | 5458
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>> 8 files changed, 12435 insertions(+)
> >>> There was a lot af TRACE_ENTRY() / TRACE_EXIT() noise.
> >>> We should have proper tools for that by now (I hope).
> >> Sorry, I don't see such tools with, for instance, a possibility to be
> >> compiled out in non-debug builds.
> >>
> >> From one side, I can agree, those TRACE_ENTRY()/TRACE_EXIT() statements
> >> *may* look like a noise (I personally don't notice them), but, from
> >> other side, in past times they have proved how usable they are. If an
> >> SCST user has a problem, I simply ask him to make a debug build, then
> >> enable entry_exit and some other logging levels, then reproduce the
> >> problem and send me the logs. Then in most cases I can see what's wrong
> >> and provide a fix without additional actions and questions.
> >
> > I had ftrace in mind but it has not hit mainline yet.
I'm assuming you are talking about previous kernels. ftrace is in
2.6.27, and dynamic ftrace is back in 2.6.28-rcX
> > But will do before this patchset does.
>
> Unfortunately, ftrace at the moment can't fully replace
> TRACE_ENTRY()/TRACE_EXIT() statements, because it lacks:
>
> 1. Doing trace in only some modules or source files. Noise from tracing
> of the whole kernel is not needed if we investigate problem in only one
> module (SCST in this case)
With dynamic ftrace you can pick and choose which functions you want
to trace. 2.6.28 will not have the exit tracing, but 2.6.29 will (see
linux-tip for latest ftrace)
Here, for current 2.6.28, I can do:
# awk '$2=="t" && $4=="[e1000]" { print $3; }' \
/proc/kallsyms > /debug/tracing/set_ftrace_filter
and now only the e1000 functions will be traced.
# echo function > /debug/tracing/current_tracer
[root@bxrhel51 ~]# cat /debug/tracing/trace |head
# tracer: function
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
sshd-3411 [000] 567.748887: e1000_xmit_frame <-dev_hard_start_xmit
sshd-3411 [000] 567.748890: e1000_maybe_stop_tx <-e1000_xmit_frame
sshd-3411 [000] 567.748893: e1000_maybe_stop_tx <-e1000_xmit_frame
<idle>-0 [000] 567.748943: e1000_intr <-handle_IRQ_event
<idle>-0 [000] 567.748948: e1000_clean <-net_rx_action
<idle>-0 [000] 567.748949: e1000_unmap_and_free_tx_resource <-e1000_clean
I can pick any function I want to trace. The list of available functions
to trace is in /debug/tracing/available_filter_functions.
> .
>
> 2. Ability to inject own messages in the output stream. It's needed to
> have synchronized stream of output messages about both functions tracing
> and what those functions doing. For instance:
>
> entry in func1()
> adding cmd C to list L
> exit from func1()
Is this from within the driver?
By adding this simple patch:
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 872799b..b198dac 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -29,6 +29,8 @@
#include "e1000.h"
#include <net/ip6_checksum.h>
+#include <linux/ftrace.h>
+
char e1000_driver_name[] = "e1000";
static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver";
#define DRV_VERSION "7.3.20-k3-NAPI"
@@ -3810,6 +3812,8 @@ static int e1000_clean(struct napi_struct *napi, int budget)
if (tx_cleaned)
work_done = budget;
+ ftrace_printk("adapter is %p work_done: %d\n", adapter, work_done);
+
/* If budget not fully consumed, exit the polling mode */
if (work_done < budget) {
if (likely(adapter->itr_setting & 3))
I could do:
# awk '$2=="t" && $4=="[e1000]" { print $3; }' \
/proc/kallsyms > /debug/tracing/set_ftrace_filter
# echo 0 > /debug/tracing/tracing_enabled
# echo ftrace_printk > /debug/tracing/iter_ctrl
# echo function > /debug/tracing/current_tracer
# echo 1 > /debug/tracing/tracing_enabled
[wait]
# echo 0 > /debug/tracing/tracing_enabled
# cat /debug/tracing/trace | head
# tracer: function
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
bash-3393 [000] 469.809904: e1000_clean_rx_irq <-e1000_clean
bash-3393 [000] 469.809939: e1000_alloc_rx_buffers <-e1000_clean_rx_irq
bash-3393 [000] 469.809940: e1000_check_64k_bound <-e1000_alloc_rx_buffers
bash-3393 [000] 469.809945: e1000_clean: adapter is ffff88003d13a900 work_done: 64
bash-3393 [000] 469.809946: e1000_clean <-net_rx_action
Note in 2.6.29 (and current tip): s/iter_ctrl/tracing_options/
>
> Otherwise there would be 2 traces:
>
> entry in func1()
> exit from func1()
>
> and
>
> adding cmd C to list L
>
> which almost impossible to merge to analyze.
Perhaps you need to do it from userspace?
# echo 1 > /debug/tracing/tracing_enabled ; \
echo hi > /debug/tracing/trace_marker ; \
sleep 1 ; \
echo bye > /debug/tracing/trace_marker ; \
echo 0 > /debug/tracing/tracing_enabled
# cat /debug/tracing/trace
# tracer: function
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
bash-3393 [003] 1068.636510: 0: hi
<idle>-0 [002] 1069.039891: e1000_intr <-handle_IRQ_event
<idle>-0 [002] 1069.039898: e1000_clean <-net_rx_action
<idle>-0 [002] 1069.039901: e1000_clean_rx_irq <-e1000_clean
<idle>-0 [002] 1069.039950: e1000_alloc_rx_buffers <-e1000_clean_rx_irq
<idle>-0 [002] 1069.039951: e1000_check_64k_bound <-e1000_alloc_rx_buffers
<idle>-0 [002] 1069.039952: e1000_update_itr <-e1000_clean
[...]
<idle>-0 [002] 1069.487184: e1000_intr <-handle_IRQ_event
<idle>-0 [002] 1069.487190: e1000_clean <-net_rx_action
<idle>-0 [002] 1069.487191: e1000_clean_rx_irq <-e1000_clean
<idle>-0 [002] 1069.487192: e1000_update_itr <-e1000_clean
<idle>-0 [002] 1069.487192: e1000_update_itr <-e1000_clean
bash-3393 [003] 1069.638193: 0: bye
>
> (Steven Rostedt added on CC)
Thanks.
-- Steve
2008/12/13 Arnaldo Carvalho de Melo <[email protected]>:
> Steven, is it possible to trigger tracing for any function when a
> function in the list set up by the user is running and then stop tracing
> when it exits?
>
> - Arnaldo
>
Hi Arnaldo,
Yes, it's possible with ftrace, by using the function graph tracer (an
ftrace extension that adds tracing on return).
To enable it just do:
echo function_graph > /debugfs/tracing/current_tracer
And to trace by choosing a particular function as the root of the call
stack, just do the things explained in this patch:
http://git.kernel.org/?p=linux/kernel/git/mingo/linux-2.6-sched-devel.git;a=commit;h=ea4e2bc4d9f7370e57a343ccb5e7c0ad3222ec3c
And the tracing will finish after this function returns :-)
2008/12/13 Frédéric Weisbecker <[email protected]>:
> 2008/12/13 Arnaldo Carvalho de Melo <[email protected]>:
>> Steven, is it possible to trigger tracing for any function when a
>> function in the list set up by the user is running and then stop tracing
>> when it exits?
>>
>> - Arnaldo
>>
>
> Hi Arnaldo,
>
> Yes it's possible with ftrace, by using the function graph tracer ( an
> ftrace extension that adds the tracing on return).
>
> To enable it just do:
> echo function_graph > /debugfs/tracing/current_tracer
>
> And to trace by choosing a particular function as the root of the call
> stack, just do the things
> explained on this patch:
> http://git.kernel.org/?p=linux/kernel/git/mingo/linux-2.6-sched-devel.git;a=commit;h=ea4e2bc4d9f7370e57a343ccb5e7c0ad3222ec3c
>
> And the tracing will finish after this function returns :-)
>
I forgot to say that this add-on is not yet included in 2.6.28.
It will be available in 2.6.29.
On Fri, 2008-12-12 at 22:26 +0300, Vladislav Bolkhovitin wrote:
> Nicholas A. Bellinger wrote:
> > On Wed, 2008-12-10 at 22:01 +0300, Vladislav Bolkhovitin wrote:
> >> This patch contains iSCSI-SCST target driver. This driver is a heavily
> >> modified forked with all respects IET
> >> (http://iscsitarget.sourceforge.net). Modifications were aimed to make a
> >> clearer, more reviewable and maintainable code as well as to fix many
> >> problems and make many improvements. See
> >> http://scst.sourceforge.net/target_iscsi.html for more details.
> >>
> >> It has split user/kernel space architecture, where all management,
> >> sessions creation, parameters negotiation, etc. made in user space and
> >> data are transferred in the kernel space. Such architecture for iSCSI
> >> processing was many times acknowledged as the right one. Particularly,
> >> in-kernel iSCSI initiator (open-iscsi) has such architecture.
> >>
> >
> > Just as with the Open/iSCSI Initiator, IMHO I believe the split
> > architecture design is difficult both to improve, debug and maintain,
> > and provides *ZERO* additional benefit in the context of traditional
> > iSCSI target mode for doing login and connection/session setup in
> > userspace.
> >
> > Also, I appericate that you spent alot of time porting over IET code to
> > your engine, but during our previous discussion you did not seem
> > terribly interested in validation against core-iscsi-dv
> > (http://linux-iscsi.org/index.php/Core-iscsi-dv) to test RFC-3720
> > interopt and stability. Because the Core-iSCSI Initiator supports every
> > possible parameter combination up to ErrorRecoveryLevel=0 defined in
> > RFC-3720, the Core-iSCSI-Dv tests can run badblocks (or any too) to
> > check data integrity for *EVERY* possible traditional iSCSI key
> > combination and functionality for your iSCSI-SCST work, and any type of
> > serious iSCSI-SCST production deployments.
>
> The fact that nobody so far cared to do all those complicated and time
> consuming rather academic tests doesn't mean that iSCSI-SCST won't pass
> them. IET/iSCSI-SCST have been used for a long time in very different
> setups, including xBSD and Solaris initiators on non-x86 architectures,
> without any problems.
>
Heh, nice try.
Considering that core-iscsi-dv is used for validating the production
systems used for Linux-iSCSI.org services, I would hardly consider
self-hosted usage of LIO-Target (e.g. actually using code we write for
public project services) an "academic" endeavour. Last time I checked,
iSCSI-SCST was not running self-hosted production for your own project,
so I hardly think you are in a position to judge which RFC-3720 domain
validation tests are worthwhile.
Anyway, your having to guess whether your iSCSI target code will pass
RFC-3720 compliance testing hardly makes it mainline material. Considering
that iSCSI-SCST has never been independently reviewed for RFC-3720
compliance (as LIO-Target has) and has never had an iSCSI Initiator
doing non-selective iSCSI parameter domain validation (as LIO-Target
has), I find your claim of iSCSI-SCST RFC completeness and maturity a
dubious proposition at best. To this day I have not seen a single
iSCSI-SCST production setup anywhere, nor have I heard of anyone
considering moving it into any serious production environments.
Certainly iSCSI-SCST is lacking in the traditional iSCSI feature
department: no MC/S or ErrorRecoveryLevel=2, features from RFC-3720
that apply to iSER/IB and iSER/DDP, and will be included in LIO-iSER
code in 2009. Considering that you have not implemented your own iSCSI
Initiator or contributed any code to the Open/iSCSI Initiator project,
seeing how these RFC-3720 production-ready features would arrive in
iSCSI-SCST code any time soon is a stretch of the imagination.
That said, I do understand that you spent a number of months adapting
the code from the IET project after your disagreements in that community
caused you to fork their code as iSCSI-SCST. I think having multiple
open source iSCSI targets is a GOOD thing, and I am not going to try to
convince you that you should stop working on iSCSI-SCST. However,
please understand that I have been doing nothing but iSCSI since 2001,
and that the LIO-Target base code has been running in customer
production since 2004. Not to mention a team of senior folks working
full time on the code from 2005-2007, including a new team of senior
devels who will be working on it full time in 2009 as the upstream
process continues.
So, again, I really appreciate your work both with SCST Core and
iSCSI-SCST, but the song and dance of iSCSI-SCST being comparable in
maturity, feature completeness, or production track record to LIO-Target
is just that, a song and dance. Why..? Aside from the years of
commercial effort, resources and validation put into the LIO-Target
codebase, I do not have to guess about how RFC-3720 compliance will turn
out for LIO-Target. I wrote an iSCSI Initiator and domain validation
tool in parallel with LIO-Target to actually prove it to myself years
ago. So far, you have been unwilling and / or unable to prove even
the most basic RFC-3720 functionality with your work.
Regards,
--nab
> Vlad
>
>
On Sat, Dec 13, 2008 at 11:03 AM, Nicholas A. Bellinger
<[email protected]> wrote:
> To this day I have not seen a single iSCSI-SCST production setup anywhere, nor
> have I heard of anyone considering moving it into any serious production
> environments.
Many people are using iSCSI-SCST in a production setup. Which planet
do you live on ?
Bart.
On Sat, 2008-12-13 at 11:11 +0100, Bart Van Assche wrote:
> On Sat, Dec 13, 2008 at 11:03 AM, Nicholas A. Bellinger
> <[email protected]> wrote:
> > To this day I have not seen a single iSCSI-SCST production setup anywhere, nor
> > have I heard of anyone considering moving it into any serious production
> > environments.
>
> Many people are using iSCSI-SCST in a production setup. Which planet
> do you live on ?
>
You can continue to gloss over the real issues at hand here, but your
generic handwaving certainly will not get you past RFC-3720 domain
validation.
Best Regards,
--nab
> Bart.
>
On Sat, Dec 13, 2008 at 11:16 AM, Nicholas A. Bellinger
<[email protected]> wrote:
> You can continue to gloss over the real issues at hand here, but your
> generic handwaving certainly will not get you past RFC-3720 domain
> validation.
Every initiator ever tried against iSCSI-SCST works fine with it, and
that's what counts. So there is no "real issue".
Bart.
Steven Rostedt wrote:
> (Added Ingo and Frederic)
>
> On Fri, 2008-12-12 at 22:24 +0300, Vladislav Bolkhovitin wrote:
>> Sam Ravnborg wrote:
>>>>>> drivers/scst/scst_lib.c | 3689
>>>>>> +++++++++++++++++++++++++++++++++++++
>>>>>> drivers/scst/scst_main.c | 1919 +++++++++++++++++++
>>>>>> drivers/scst/scst_module.c | 69
>>>>>> drivers/scst/scst_priv.h | 513 +++++
>>>>>> drivers/scst/scst_targ.c | 5458
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> 8 files changed, 12435 insertions(+)
>>>>> There was a lot af TRACE_ENTRY() / TRACE_EXIT() noise.
>>>>> We should have proper tools for that by now (I hope).
>>>> Sorry, I don't see such tools with, for instance, a possibility to be
>>>> compiled out in non-debug builds.
>>>>
>>>> From one side, I can agree, those TRACE_ENTRY()/TRACE_EXIT() statements
>>>> *may* look like a noise (I personally don't notice them), but, from
>>>> other side, in past times they have proved how usable they are. If an
>>>> SCST user has a problem, I simply ask him to make a debug build, then
>>>> enable entry_exit and some other logging levels, then reproduce the
>>>> problem and send me the logs. Then in most cases I can see what's wrong
>>>> and provide a fix without additional actions and questions.
>>> I had ftrace in mind but it has not hit mainline yet.
>
> I'm assuming you are talking about previous kernels. ftrace is in
> 2.6.27, and dynamic ftrace is back in 2.6.28-rcX
>
>>> But will do before this patchset does.
>> Unfortunately, ftrace at the moment can't fully replace
>> TRACE_ENTRY()/TRACE_EXIT() statements, because it lacks:
>>
>> 1. Doing trace in only some modules or source files. Noise from tracing
>> of the whole kernel is not needed if we investigate problem in only one
>> module (SCST in this case)
>
> With dynamic ftrace you can pick and choose the what functions you want
> to trace. 2.6.28 will not have the exit tracing, but 2.6.29 will (see
> linux-tip for latest ftrace)
>
> Here, for current 2.6.28, I can do:
>
> # awk '$2=="t" && $4=="[e1000]" { print $3; }' \
> /proc/kallsyms > /debug/tracing/set_ftrace_filter
>
> and now only the e1000 functions will be traced.
>
> # echo function > /debug/tracing/current_tracer
>
> [root@bxrhel51 ~]# cat /debug/tracing/trace |head
> # tracer: function
> #
> # TASK-PID CPU# TIMESTAMP FUNCTION
> # | | | | |
> sshd-3411 [000] 567.748887: e1000_xmit_frame <-dev_hard_start_xmit
> sshd-3411 [000] 567.748890: e1000_maybe_stop_tx <-e1000_xmit_frame
> sshd-3411 [000] 567.748893: e1000_maybe_stop_tx <-e1000_xmit_frame
> <idle>-0 [000] 567.748943: e1000_intr <-handle_IRQ_event
> <idle>-0 [000] 567.748948: e1000_clean <-net_rx_action
> <idle>-0 [000] 567.748949: e1000_unmap_and_free_tx_resource <-e1000_clean
>
>
> I can pick any function I want to trace. The list of available functions
> to trace is in /debug/tracing/available_filter_functions.
>
>
>> .
>>
>> 2. Ability to inject own messages in the output stream. It's needed to
>> have synchronized stream of output messages about both functions tracing
>> and what those functions doing. For instance:
>>
>> entry in func1()
>> adding cmd C to list L
>> exit from func1()
>
> Is this from within the driver?
>
> By adding this simply patch:
>
> diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
> index 872799b..b198dac 100644
> --- a/drivers/net/e1000/e1000_main.c
> +++ b/drivers/net/e1000/e1000_main.c
> @@ -29,6 +29,8 @@
> #include "e1000.h"
> #include <net/ip6_checksum.h>
>
> +#include <linux/ftrace.h>
> +
> char e1000_driver_name[] = "e1000";
> static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver";
> #define DRV_VERSION "7.3.20-k3-NAPI"
> @@ -3810,6 +3812,8 @@ static int e1000_clean(struct napi_struct *napi, int budget)
> if (tx_cleaned)
> work_done = budget;
>
> + ftrace_printk("adapter is %p work_done: %d\n", adapter, work_done);
> +
> /* If budget not fully consumed, exit the polling mode */
> if (work_done < budget) {
> if (likely(adapter->itr_setting & 3))
>
>
> I could do:
>
> # awk '$2=="t" && $4=="[e1000]" { print $3; }' \
> /proc/kallsyms > /debug/tracing/set_ftrace_filter
> # echo 0 > /debug/tracing/tracing_enabled
> # echo ftrace_printk > /debug/tracing/iter_ctrl
> # echo function > /debug/tracing/current_tracer
> # echo 1 > /debug/tracing/tracing_enabled
> [wait]
> # echo 0 > /debug/tracing/tracing_enabled
> # cat /debug/tracing/trace | head
> # tracer: function
> #
> # TASK-PID CPU# TIMESTAMP FUNCTION
> # | | | | |
> bash-3393 [000] 469.809904: e1000_clean_rx_irq <-e1000_clean
> bash-3393 [000] 469.809939: e1000_alloc_rx_buffers <-e1000_clean_rx_irq
> bash-3393 [000] 469.809940: e1000_check_64k_bound <-e1000_alloc_rx_buffers
> bash-3393 [000] 469.809945: e1000_clean: adapter is ffff88003d13a900 work_done: 64
> bash-3393 [000] 469.809946: e1000_clean <-net_rx_action
>
>
>
> Note in 2.6.29 (and current tip): s/iter_ctrl/tracing_options/
>
>
>
>> Otherwise there would be 2 traces:
>>
>> entry in func1()
>> exit from func1()
>>
>> and
>>
>> adding cmd C to list L
>>
>> which almost impossible to merge to analyze.
>
> Perhaps you need to do it from userspace?
>
> # echo 1 > /debug/tracing/tracing_enabled ; \
> echo hi > /debug/tracing/trace_marker ; \
> sleep 1 ; \
> echo bye > /debug/tracing/trace_marker ; \
> echo 0 > /debug/tracing/tracing_enabled
> # cat /debug/tracing/trace
> # tracer: function
> #
> # TASK-PID CPU# TIMESTAMP FUNCTION
> # | | | | |
> bash-3393 [003] 1068.636510: 0: hi
> <idle>-0 [002] 1069.039891: e1000_intr <-handle_IRQ_event
> <idle>-0 [002] 1069.039898: e1000_clean <-net_rx_action
> <idle>-0 [002] 1069.039901: e1000_clean_rx_irq <-e1000_clean
> <idle>-0 [002] 1069.039950: e1000_alloc_rx_buffers <-e1000_clean_rx_irq
> <idle>-0 [002] 1069.039951: e1000_check_64k_bound <-e1000_alloc_rx_buffers
> <idle>-0 [002] 1069.039952: e1000_update_itr <-e1000_clean
> [...]
> <idle>-0 [002] 1069.487184: e1000_intr <-handle_IRQ_event
> <idle>-0 [002] 1069.487190: e1000_clean <-net_rx_action
> <idle>-0 [002] 1069.487191: e1000_clean_rx_irq <-e1000_clean
> <idle>-0 [002] 1069.487192: e1000_update_itr <-e1000_clean
> <idle>-0 [002] 1069.487192: e1000_update_itr <-e1000_clean
> bash-3393 [003] 1069.638193: 0: bye
>
All the above functionality is almost what we need. The only thing left,
which I forgot to mention, is the ability to also log a function's return
value on exit. This is what TRACE_EXIT_RES() in SCST does. Is it
possible to add that?
Also (maybe I'm simply missing something) it looks like ftrace doesn't
trace exit from functions, only entry to them. Is that true? Is it
possible to log exit from functions as well?
And one more question. Is it possible to redirect ftrace output to a
serial console or any other console (netconsole?)? It would be helpful for
investigating hard lockups in IRQ context or with IRQs disabled.
Thanks!
Vlad
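As an illustration of the TRACE_ENTRY()/TRACE_EXIT_RES() style discussed
above, here is a minimal user-space sketch of entry/exit macros that log
the return value and compile away in non-debug builds. The DEMO_* names
and the DEMO_DEBUG switch are hypothetical; they are not SCST's actual
definitions, and the real macros log through the kernel, not stderr.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical names; SCST's real tracing macros are defined differently. */
#ifdef DEMO_DEBUG
#define DEMO_TRACE_ENTRY() \
	fprintf(stderr, "ENTRY %s\n", __func__)
#define DEMO_TRACE_EXIT_RES(res) \
	fprintf(stderr, "EXIT %s: %ld\n", __func__, (long)(res))
#else
/* Non-debug build: the macros produce no output and no logging code. */
#define DEMO_TRACE_ENTRY()		do { } while (0)
#define DEMO_TRACE_EXIT_RES(res)	do { (void)(res); } while (0)
#endif

/* Toy function standing in for a driver routine that queues a command. */
static int demo_add_cmd(int queue_len)
{
	int res;

	DEMO_TRACE_ENTRY();
	assert(queue_len >= 0);
	res = queue_len + 1;	/* "adding cmd C to list L" */
	DEMO_TRACE_EXIT_RES(res);
	return res;
}
```

Building with -DDEMO_DEBUG would interleave the entry and exit lines with
any other messages the function prints, which is the single synchronized
stream asked for in the thread.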
Nicholas A. Bellinger wrote:
> On Fri, 2008-12-12 at 22:26 +0300, Vladislav Bolkhovitin wrote:
>> Nicholas A. Bellinger wrote:
>>> On Wed, 2008-12-10 at 22:01 +0300, Vladislav Bolkhovitin wrote:
>>>> This patch contains iSCSI-SCST target driver. This driver is a heavily
>>>> modified forked with all respects IET
>>>> (http://iscsitarget.sourceforge.net). Modifications were aimed to make a
>>>> clearer, more reviewable and maintainable code as well as to fix many
>>>> problems and make many improvements. See
>>>> http://scst.sourceforge.net/target_iscsi.html for more details.
>>>>
>>>> It has split user/kernel space architecture, where all management,
>>>> sessions creation, parameters negotiation, etc. made in user space and
>>>> data are transferred in the kernel space. Such architecture for iSCSI
>>>> processing was many times acknowledged as the right one. Particularly,
>>>> in-kernel iSCSI initiator (open-iscsi) has such architecture.
>>>>
>>> Just as with the Open/iSCSI Initiator, IMHO I believe the split
>>> architecture design is difficult both to improve, debug and maintain,
>>> and provides *ZERO* additional benefit in the context of traditional
>>> iSCSI target mode for doing login and connection/session setup in
>>> userspace.
>>>
>>> Also, I appericate that you spent alot of time porting over IET code to
>>> your engine, but during our previous discussion you did not seem
>>> terribly interested in validation against core-iscsi-dv
>>> (http://linux-iscsi.org/index.php/Core-iscsi-dv) to test RFC-3720
>>> interopt and stability. Because the Core-iSCSI Initiator supports every
>>> possible parameter combination up to ErrorRecoveryLevel=0 defined in
>>> RFC-3720, the Core-iSCSI-Dv tests can run badblocks (or any too) to
>>> check data integrity for *EVERY* possible traditional iSCSI key
>>> combination and functionality for your iSCSI-SCST work, and any type of
>>> serious iSCSI-SCST production deployments.
>> The fact that nobody so far cared to do all those complicated and time
>> consuming rather academic tests doesn't mean that iSCSI-SCST won't pass
>> them. IET/iSCSI-SCST have been used for a long time in very different
>> setups, including xBSD and Solaris initiators on non-x86 architectures,
>> without any problems.
>>
>
> Heh, nice try.
>
> Considering that core-iscsi-dv is used for validating the production
> systems used for Linux-iSCSI.org services, I would hardly consider self
> hosted usage of LIO-Target (eg: actually using code we write for public
> project services) an "academic" endevour. Last time I checked you where
> iSCSI-SCST was not running self-hosted production for your own project,
> so I hardly think you are in a place to judge which RFC-3720 domain
> validation tests are of worth or not.
>
> Anyways, you having to guess about if your iSCSI target code will pass a
> RFC-3720 compliance hardly makes it mainline material.
Simply answer: how many things in the kernel have gone through such
validation, and did the absence of it change anything for them from the
mainline acceptance POV? Has the mainline open-iscsi even been run
through your tests? And so?
I have nothing against any artificial tests, but one shouldn't
overestimate their value.
BTW, removing people from CC is usually considered a very impolite thing.
Vlad
Nicholas A. Bellinger wrote:
> On Sat, 2008-12-13 at 11:11 +0100, Bart Van Assche wrote:
>> On Sat, Dec 13, 2008 at 11:03 AM, Nicholas A. Bellinger
>> <[email protected]> wrote:
>>> To this day I have not seen a single iSCSI-SCST production setup anywhere, nor
>>> have I heard of anyone considering moving it into any serious production
>>> environments.
>> Many people are using iSCSI-SCST in a production setup. Which planet
>> do you live on ?
>>
>
> You can continue to gloss over the real issues at hand here, but your
> generic handwaving certainly will not get you past RFC-3720 domain
> validation.
Please stop flooding. If you believe that iSCSI-SCST is so bad that it
won't pass some test, prove it! Run your test and show us the
failures. Otherwise your words are worth nothing.
> Best Regards,
>
> --nab
>
>> Bart.
>>
>
>
2008/12/13 Vladislav Bolkhovitin <[email protected]>:
> Also (maybe I simply miss something) looks like ftrace doesn't trace exit
> from functions, only entrance to them. Is it true? Is it possibly to log
> exit from functions as well?
That's true with 2.6.28: the function tracer traces function entries only.
But there is an ftrace add-on that lets one trace on entry and on
return: the function graph tracer. This tracer uses that facility to
output a graph of function calls and to measure the time elapsed during
each function call.
You can also register two custom handlers to do whatever you need
on entry and on return.
> All the above functionality is almost what we need. The only thing left,
> which I forgot to mention, is possibility to log also functions return value
> on exit. This is what TRACE_EXIT_RES() in SCST does. Is it possible to add
> those?
I want to add that to the function graph tracer. That can be done
pretty easily. The only problem comes with the type of the return value.
Would this tracer be supposed to always return a 64-bit value regardless
of the real type of the value? There would be some pointless bytes
in most return values. I don't know how to proceed with this problem.
> And one more question. Is it possible to redirect ftrace tracing to serial
> console or any other console (netconsole?)? It can be helpful to investigate
> hard lockups in IRQ or with IRQs disabled.
>
> Thanks!
> Vlad
>
I would find it useful too. I thought about something like using
early_printk or something like that... I don't know. It would be good to
redirect the output to the tty device of the user's choice.
James Bottomley wrote:
> On Fri, 2008-12-12 at 22:25 +0300, Vladislav Bolkhovitin wrote:
>>> One thing that leaps immediately to mind is that you could isolate this
>>> to the net layer by putting it in skb_frag_struct. However, such a move
>>> would require a proper API for this in net ...
>> To have net_priv analog in skb was the first idea I was tried. But I
>> quickly gave up, because it would required that all the pages in each
>> skb_frag_struct be from the same originator, i.e. with the same
>> net_priv. It is unpractical to change all the operations with skb's to
>> forbid merging them, if they have different net_priv. There are too many
>> such places in very not obvious code pieces.
>
> Actually, I said carry in skb_frag_struct not skb ... that allows for
> merging of skbs with different page sources. The API changes would have
> to allow setting at this level.
Possibly, you are right, but the amount of changes would be very big,
while the net_priv patch I did is just 309 lines long including all the
descriptions and comments. And it's really simple and straightforward.
>>> right now it looks like
>>> you're using the struct page addition to carry this information from
>>> SCSI to net, which is a bit of a layering violation.
>> I don't think there is any layering violation here. Just lower layer
>> notifies upper layer that transmission of a page has finished. It's done
>> a bit not straightforward, but still basically the same as, for
>> instance, on_free_cmd() callbacks which SCST core uses to notify target
>> drivers and dev handlers that the corresponding command is about to be
>> freed, so they can free associated with it data as well.
>
> The way you transmit the information you want notification is done by a
> private tag in struct page, so you're carrying the information on an
> object that belongs to neither layer ... that's the violation. It's
> essentially an extension of the net API that goes via the mm layer.
I understand your fears, but I think there can be another view on this.
There are 2 layers: TCP and iSCSI. iSCSI asks TCP to send data and TCP
performs that. iSCSI communicates with TCP using the sendpage() function;
it asks TCP to send pages. So, the unit of communication is a page. When
TCP has finished transmitting a page, it notifies iSCSI that the page was
transmitted. Now iSCSI needs to find its own command associated
with that page. It does that using page->net_priv, which keeps a pointer
to the associated iSCSI command. I don't think there is any layering
violation here, because iSCSI and TCP communicate using pages, not any
other objects. If they communicated using, e.g., skbs, it would no doubt
be a layering violation.
But, since iSCSI and TCP communicate using pages, the situation is the
same as between SCST core and a target driver in the on_free_cmd()
example I used before. When SCST core notifies the target driver using
the on_free_cmd() callback that a command has finished and is about to be
freed, the target driver also needs to find its own data associated with
the command, and it does that using the tgt_priv pointer in struct scst_cmd.
Similarly, for instance, struct scsi_cmnd has the host_scribble pointer for
low level drivers.
Thanks,
Vlad
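The private-pointer pattern described in this message can be sketched in
plain user-space C. Everything below is a hypothetical stand-in: struct
demo_unit plays the role of the object exchanged between the layers (a
page with net_priv, or a command with tgt_priv), and none of the names
come from the SCST or networking code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical unit handed from the upper layer to the lower layer. */
struct demo_unit {
	void *priv;				/* opaque to the lower layer */
	void (*on_done)(struct demo_unit *);	/* set by the upper layer */
};

/* Upper-layer bookkeeping, reached only through the private pointer. */
struct demo_cmd {
	int pending_units;
};

/* Upper layer: recover the owning command from the unit and account for it. */
static void demo_upper_unit_done(struct demo_unit *u)
{
	struct demo_cmd *cmd = u->priv;

	assert(cmd->pending_units > 0);
	cmd->pending_units--;
}

/* Upper layer: tag a unit with its command before handing it down. */
static void demo_upper_attach(struct demo_cmd *cmd, struct demo_unit *u)
{
	u->priv = cmd;
	u->on_done = demo_upper_unit_done;
	cmd->pending_units++;
}

/* Lower layer: finishes with the unit and notifies; it never reads priv. */
static void demo_lower_finish(struct demo_unit *u)
{
	if (u->on_done)
		u->on_done(u);
}
```

The sketch only shows the mechanism. Whether the tag may live in struct
page itself, which is the point disputed in this thread, is a separate
question about memory overhead and ownership of that structure.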
If you don't believe that increasing struct page size for a fringe
feature is a no-go, submit the patch to the VM people and wait.
On Wed, Dec 10, 2008 at 10:45 PM, Evgeniy Polyakov <[email protected]> wrote:
> Another approach is to increase skb's shared data (at the end of the
> skb->data), and this approach was not frowned upon too much either, but
> it requires to mess with skb->destructor, which may not be appropriate
> in some cases. If iscsi does not use sockets (it does iirc), things are
> much simpler.
Hello Evgeniy,
Any idea whether it would be acceptable to extend "struct sock" with
one or two pointers to callback functions ? These callback functions
could be called from skb_release_data() etc. through the "struct sock*
sk" member of "struct sk_buff".
Bart.
Hi Bart.
On Tue, Dec 16, 2008 at 05:00:07PM +0100, Bart Van Assche ([email protected]) wrote:
> Any idea whether it would be acceptable to extend "struct sock" with
> one or two pointers to callback functions ? These callback functions
> could be called from skb_release_data() etc. through the "struct sock*
> sk" member of "struct sk_buff".
That's unlikely, since again this will be unused by all but iscsi
users. But you can be tricky and replace the skb destructor with your own
pointer and call the 'old' destructor yourself. Its main goal is to adjust
socket memory statistics, and since all iscsi sockets are under your full
control, you can play with it. Although this sounds like a hack, with a
proper implementation it is not that bad an idea. As an additional pointer
you can use sk_user_data, which is used by rpc socket calls only iirc.
--
Evgeniy Polyakov
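The destructor-replacement trick suggested above can be sketched as
follows. This is a user-space model under stated assumptions, not kernel
code: struct demo_buf is a hypothetical stand-in for a buffer with a
destructor hook (it is not struct sk_buff), and owner_data merely plays
the role that a per-socket user pointer would play.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffer with a destructor hook; not the kernel's sk_buff. */
struct demo_buf {
	void (*destructor)(struct demo_buf *);
	void *owner_data;	/* stands in for a per-socket user pointer */
	int accounted;		/* state the original destructor maintains */
};

/* Per-owner state: the saved destructor plus our own notification count. */
struct demo_owner {
	void (*old_destructor)(struct demo_buf *);
	int buffers_released;
};

/* Original destructor: only adjusts accounting. */
static void demo_orig_destructor(struct demo_buf *b)
{
	b->accounted = 0;
}

/* Replacement: notify the owner first, then call the saved destructor. */
static void demo_chained_destructor(struct demo_buf *b)
{
	struct demo_owner *o = b->owner_data;

	o->buffers_released++;
	if (o->old_destructor)
		o->old_destructor(b);
}

/* Save the current destructor and install the chained one. */
static void demo_hook_buf(struct demo_owner *o, struct demo_buf *b)
{
	assert(b->destructor != demo_chained_destructor); /* hook only once */
	o->old_destructor = b->destructor;
	b->owner_data = o;
	b->destructor = demo_chained_destructor;
}

/* Release path of the lower layer. */
static void demo_release_buf(struct demo_buf *b)
{
	if (b->destructor)
		b->destructor(b);
}
```

Keeping one saved destructor per owner is a simplification that assumes
all buffers of that owner share the same original destructor.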
Christoph Hellwig wrote:
> If you don't believe that increasing struct page size for a fringe
> feature is a no-go submit the patch to the VM people and wait..
I guessed it could be a no-go; this is why I wrote that this feature isn't
required for iSCSI-SCST functionality. But, since the implementation is
*so* simple and doesn't do any layering violation, I have a hope that
once it is disabled by default it will be harmless and, hence, could be
accepted. Only a few people need this feature. For them the choice is
between enabling that feature and recompiling (if it is accepted) vs
patching the kernel, enabling that feature and recompiling (if it is not).
When I was developing it my main goal was to make it as simple as
possible. I believe I succeeded in that. If it's rejected, it will simply
live out of tree until I or somebody else finds time to reimplement it
in an acceptable, although at least 10 times more complicated, manner.
Thanks, I'll follow your advice.
Vlad
Vladislav Bolkhovitin <[email protected]> writes:
>
> - Although usage of struct page to keep network related pointer might
> look as a layering violation, it isn't. I wrote in
> http://lkml.org/lkml/2008/12/15/190 why.
Sorry, but extending struct page for this is really a bad idea because
of the extreme memory overhead even when it's not used (which is a
problem on distribution kernels). Find some other way to store this
information. Even for patches with more general value it was not
acceptable.
-Andi
--
[email protected]
David M. Lloyd, on 12/18/2008 09:43 PM wrote:
> On 12/18/2008 12:35 PM, Vladislav Bolkhovitin wrote:
>> An iSCSI target driver iSCSI-SCST was a part of the patchset
>> (http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to
>> have TCP zero-copy transmit of user space data was implemented. Patch,
>> implementing this optimization was also sent in the patchset, see
>> http://lkml.org/lkml/2008/12/10/296.
>
> I'm probably ignorant of about 90% of the context here, but isn't this the
> sort of problem that was supposed to have been solved by vmsplice(2)?
No, vmsplice can't help here. iSCSI-SCST is a kernel space driver. But,
even if it were a user space driver, vmsplice wouldn't change anything
much. It gives the user no way to know when transmission of the data
has finished. So, it is intended to be used as:
vmsplice() buffer -> munmap() the buffer -> mmap() new buffer ->
vmsplice() it. But at the mmap() stage the kernel has to zero all the newly
mapped pages, and zeroing memory isn't much faster than copying it.
Hence, there would be no considerable performance increase.
Thanks,
Vlad
Andi Kleen, on 12/19/2008 02:27 PM wrote:
> Vladislav Bolkhovitin <[email protected]> writes:
>> - Although usage of struct page to keep network related pointer might
>> look as a layering violation, it isn't. I wrote in
>> http://lkml.org/lkml/2008/12/15/190 why.
>
> Sorry but extending struct page for this is really a bad idea because
> of the extreme memory overhead even when it's not used (which is a
> problem on distribution kernels) Find some other way to store this
> information. Even for patches with more general value it was not
> acceptable.
Sure, this is why I propose to disable that option by default in
distribution kernels, so it would do no harm. iSCSI-SCST can work
in this configuration quite well too. People who need both an iSCSI target
*and* fast user space device handlers would simply enable that
option and rebuild the kernel. Rejecting this patch leaves a much worse
alternative: those people would also have to *patch* the kernel
first, only then enable that option, then rebuild the kernel. (I'm
repeating it to make sure you didn't miss this point of mine; it was in the
part of my original message which you cut out.)
Thanks,
Vlad
> Sure, this is why I propose to disable that option by default in
> distribution kernels, so it would produce no harm.
That would make the option useless for most users. You might as well
not bother merging then.
> first, only then enable that option, then rebuild the kernel. (I'm
> repeating it to make sure you didn't miss this my point; it was in the
> part of my original message, which you cut out.)
That was such a ridiculous suggestion, I didn't take it seriously.
Also it should be really not rocket science to use a separate
table for this.
-Andi
--
[email protected]
Andi Kleen, on 12/19/2008 09:00 PM wrote:
>> Sure, this is why I propose to disable that option by default in
>> distribution kernels, so it would produce no harm.
>
> That would make the option useless for most users. You might as well
> not bother merging then.
I believe 99.(9)% of users prefer not to patch the kernel, if possible.
>> first, only then enable that option, then rebuild the kernel. (I'm
>> repeating it to make sure you didn't miss this my point; it was in the
>> part of my original message, which you cut out.)
>
> That was such a ridiculous suggestion, I didn't take it seriously.
>
> Also it should be really not rocket science to use a separate
> table for this.
Sorry, what do you mean? If you mean using something like a hash table to
map pages to the corresponding iSCSI commands, this approach was evaluated
and rejected, because it wouldn't provide enough of a performance increase
to be worth the effort. See details at the end of the patch
description in http://lkml.org/lkml/2008/12/10/296
Thanks,
Vlad
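For readers unsure what a "separate table" could look like, here is a
minimal user-space sketch of a side table mapping a page address to its
owning command, using open addressing with linear probing. All names and
the table size are hypothetical; removal, resizing and locking are left
out, and the sketch says nothing about the performance argument made
above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_TABLE_SIZE 64	/* power of two, arbitrary for the sketch */

struct demo_entry {
	const void *page;	/* key: page address, NULL means empty slot */
	void *cmd;		/* value: owning command */
};

static struct demo_entry demo_table[DEMO_TABLE_SIZE];

static size_t demo_hash(const void *page)
{
	/* Drop low bits that carry little information, then mask. */
	return ((uintptr_t)page >> 4) & (DEMO_TABLE_SIZE - 1);
}

/* Insert or update; returns 0 on success, -1 if the table is full. */
static int demo_table_insert(const void *page, void *cmd)
{
	size_t i, slot = demo_hash(page);

	assert(page != NULL);
	for (i = 0; i < DEMO_TABLE_SIZE; i++) {
		struct demo_entry *e =
			&demo_table[(slot + i) & (DEMO_TABLE_SIZE - 1)];

		if (e->page == NULL || e->page == page) {
			e->page = page;
			e->cmd = cmd;
			return 0;
		}
	}
	return -1;
}

/* Look up the command for a page; NULL if the page is not tracked. */
static void *demo_table_lookup(const void *page)
{
	size_t i, slot = demo_hash(page);

	for (i = 0; i < DEMO_TABLE_SIZE; i++) {
		struct demo_entry *e =
			&demo_table[(slot + i) & (DEMO_TABLE_SIZE - 1)];

		if (e->page == page)
			return e->cmd;
		if (e->page == NULL)
			return NULL;
	}
	return NULL;
}
```

Compared with a pointer stored in the page itself, every lookup here costs
a hash and a probe, but untracked pages cost no memory at all.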
On Fri, Dec 19 2008, Vladislav Bolkhovitin wrote:
> David M. Lloyd, on 12/18/2008 09:43 PM wrote:
> >On 12/18/2008 12:35 PM, Vladislav Bolkhovitin wrote:
> >>An iSCSI target driver iSCSI-SCST was a part of the patchset
> >>(http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to
> >>have TCP zero-copy transmit of user space data was implemented. Patch,
> >>implementing this optimization was also sent in the patchset, see
> >>http://lkml.org/lkml/2008/12/10/296.
> >
> >I'm probably ignorant of about 90% of the context here, but isn't this the
> >sort of problem that was supposed to have been solved by vmsplice(2)?
>
> No, vmsplice can't help here. ISCSI-SCST is a kernel space driver. But,
> even if it was a user space driver, vmsplice wouldn't change anything
> much. It doesn't have a possibility for a user to know, when
> transmission of the data finished. So, it is intended to be used as:
> vmsplice() buffer -> munmap() the buffer -> mmap() new buffer ->
> vmsplice() it. But on the mmap() stage kernel has to zero all the newly
> mapped pages and zeroing memory isn't much faster, than copying it.
> Hence, there would be no considerable performance increase.
vmsplice() isn't the right choice, but splice() very well could be. You
could easily use splice internally as well. The vmsplice() part sort-of
applies in the sense that you want to fill pages into a pipe, which is
essentially what vmsplice() does. You'd need some helper to do that. And
the ack-on-xmit-done bits is something that splice-to-socket needs
anyway, so I think it'd be quite a suitable choice for this.
--
Jens Axboe
Jens Axboe, on 12/19/2008 10:07 PM wrote:
> On Fri, Dec 19 2008, Vladislav Bolkhovitin wrote:
>> David M. Lloyd, on 12/18/2008 09:43 PM wrote:
>>> On 12/18/2008 12:35 PM, Vladislav Bolkhovitin wrote:
>>>> An iSCSI target driver iSCSI-SCST was a part of the patchset
>>>> (http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to
>>>> have TCP zero-copy transmit of user space data was implemented. Patch,
>>>> implementing this optimization was also sent in the patchset, see
>>>> http://lkml.org/lkml/2008/12/10/296.
>>> I'm probably ignorant of about 90% of the context here, but isn't this the
>>> sort of problem that was supposed to have been solved by vmsplice(2)?
>> No, vmsplice can't help here. ISCSI-SCST is a kernel space driver. But,
>> even if it was a user space driver, vmsplice wouldn't change anything
>> much. It doesn't have a possibility for a user to know, when
>> transmission of the data finished. So, it is intended to be used as:
>> vmsplice() buffer -> munmap() the buffer -> mmap() new buffer ->
>> vmsplice() it. But on the mmap() stage kernel has to zero all the newly
>> mapped pages and zeroing memory isn't much faster, than copying it.
>> Hence, there would be no considerable performance increase.
>
> vmsplice() isn't the right choice, but splice() very well could be. You
> could easily use splice internally as well. The vmsplice() part sort-of
> applies in the sense that you want to fill pages into a pipe, which is
> essentially what vmsplice() does. You'd need some helper to do that.
Sorry, Jens, but splice() works only if there is a file handle on the
other side, so user space doesn't see the data buffers. But SCST needs to
serve a wider range of use cases, like reading data with decompression
from a virtual tape, where decompression is done in user space. For those,
only complete zero-copy network send, which I implemented, can give the
best performance.
> And
> the ack-on-xmit-done bits is something that splice-to-socket needs
> anyway, so I think it'd be quite a suitable choice for this.
So, are you writing that splice() could also benefit from the zero-copy
transmit feature, like I implemented?
Thanks,
Vlad
On Fri, Dec 19 2008, Vladislav Bolkhovitin wrote:
> Jens Axboe, on 12/19/2008 10:07 PM wrote:
> >On Fri, Dec 19 2008, Vladislav Bolkhovitin wrote:
> >>David M. Lloyd, on 12/18/2008 09:43 PM wrote:
> >>>On 12/18/2008 12:35 PM, Vladislav Bolkhovitin wrote:
> >>>>An iSCSI target driver iSCSI-SCST was a part of the patchset
> >>>>(http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to
> >>>>have TCP zero-copy transmit of user space data was implemented. Patch,
> >>>>implementing this optimization was also sent in the patchset, see
> >>>>http://lkml.org/lkml/2008/12/10/296.
> >>>I'm probably ignorant of about 90% of the context here, but isn't this
> >>>the sort of problem that was supposed to have been solved by vmsplice(2)?
> >>No, vmsplice can't help here. ISCSI-SCST is a kernel space driver. But,
> >>even if it was a user space driver, vmsplice wouldn't change anything
> >>much. It doesn't have a possibility for a user to know, when
> >>transmission of the data finished. So, it is intended to be used as:
> >>vmsplice() buffer -> munmap() the buffer -> mmap() new buffer ->
> >>vmsplice() it. But on the mmap() stage kernel has to zero all the newly
> >>mapped pages and zeroing memory isn't much faster, than copying it.
> >>Hence, there would be no considerable performance increase.
> >
> >vmsplice() isn't the right choice, but splice() very well could be. You
> >could easily use splice internally as well. The vmsplice() part sort-of
> >applies in the sense that you want to fill pages into a pipe, which is
> >essentially what vmsplice() does. You'd need some helper to do that.
>
> Sorry, Jens, but splice() works only if there is a file handle on the
> other side, so user space doesn't see the data buffers. But SCST needs to
> serve a wider range of use cases, like reading data with decompression
> from a virtual tape, where decompression is done in user space. For those,
> only complete zero-copy network send, which I implemented, can give the
> best performance.
__splice_from_pipe() takes a pipe, a descriptor and an actor. There's
absolutely ZERO reason you could not reuse most of that for this
implementation. The big bonus here is that getting the put correct from
networking would even make splice() better for everyone. Win for Linux,
win for you since it'll make it MUCH easier for you to get this stuff
in. Looking at your original patch, I almost think it's flame bait
to induce discussion (nothing wrong with that, that approach works quite
well and has been used before). There's no way in HELL that it'd ever be
a merge candidate. And I suspect you know that, at least I hope you do
or you are farther away from going forward with this than you think.
So don't look at splice() the system call, look at the infrastructure
and check if that could be useful for your case. To me it looks
absolutely like it could, if your goal is just zero-copy transmit. The
only missing piece is dropping the reference and signalling page
consumption at the right point, which is when the data is safe to be
reused. That very bit is missing, but that should be all as far as I can
tell.
> >And
> >the ack-on-xmit-done bits is something that splice-to-socket needs
> >anyway, so I think it'd be quite a suitable choice for this.
>
> So, are you writing that splice() could also benefit from the zero-copy
> transmit feature, like I implemented?
I like how you want to reinvent everything, perhaps you should spend a
little more time looking into various other approaches? splice() already
does zero-copy network transmit, there are no copies going on. Ideally,
you'd have zero copies moving data into your pipe, but migrate/move
isn't quite there yet. But that doesn't apply to your case at all.
What is missing, as I wrote, is the 'release on ack' and not on pipe
buffer release. This is similar to the get_page/put_page stuff you did
in your patch, but don't go claiming that zero-copy transmit is a
Vladislav original - the ->sendpage() does no copies.
--
Jens Axboe
Vladislav Bolkhovitin wrote:
> This patch implements support for zero-copy TCP transmit of user space
> data. It is necessary in iSCSI-SCST target driver for transmitting data
> from user space buffers, supplied by user space backend handlers. In
> this case SCST core needs to know when TCP finished transmitting the
> data, so the corresponding buffers can be reused or freed. Without this
> patch that isn't possible, so iSCSI-SCST has to use the function that
> copies data to TCP send buffers, sock_sendpage(). iSCSI-SCST also works
> without this patch, but the patch gives a nice performance improvement.
>
In Xen networking it looks like we're going to need to solve a very
similar problem.
When a guest (non-privileged, with no direct hardware access) wants to
send a network packet, it passes it over to the privileged (host)
domain, who then puts it into the network stack for transmission.
The packet gets passed over in a page granted (read "borrowed") from the
guest domain. We can't return it to the guest while it's tangled up in
the host's network stack, so we need notification of when the stack has
finished with the page.
The out of tree Xen patches do this by marking a page as having been
allocated by a foreign allocator, and overloading the private memory of
struct page with a destructor function pointer, which put_page calls as
appropriate. We can do this because the page is definitely "owned" by
the Xen subsystem, so most of the fields are available for recycling;
the main problem is that we need to grab another page flag. Your case
sounds more complex because the source page can be mapped by userspace
and/or be in the pagecache, so everything is already claimed.
As with your case, we can simply copy the page data if this mechanism
isn't available. But it would be nice if it were.
> 1. Add net_priv analog in struct sk_buff, not in struct page. But then
> it would be required that all the pages in each skb must be from the
> same originator, i.e. with the same net_priv. It is impractical to
> change all the operations with skb's to forbid merging them, if they
> have different net_priv. I tried, but quickly gave up. There are too
> many such places in far from obvious pieces of code.
>
I think Rusty has a patch to put some kind of put notifier in struct
skb_shared_info, but I'm not sure of the details.
> 2. Have in iSCSI-SCST a hashed list to translate page to iSCSI cmd by a
> simple search function. This approach was rejected, because a modern
> CPU using MMX needs about 1500 ticks to copy a page.
Is that the cold cache timing?
> It was observed,
> that each page can be referenced by TCP during transmit about 20 times
> or even more. So, if each search needs, say, 20 ticks, the overall
> search time will be 20*20*2 (to get() and put()) = 800 ticks. So, this
> approach would be considerably worse performance-wise than the chosen
> approach and provide not too much benefit.
>
Wouldn't you only need to do the lookup on the last put?
An external lookup table might well work for us, if the net_put_page()
change is acceptable to the network folk.
J
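The tick estimate quoted above can be reproduced with a short user-space
sketch. The function names and constants below are hypothetical; the
numbers are simply the ones from the quoted message (about 20 references
per page, an assumed 20 ticks per search, about 1500 ticks per page copy)
and are not measurements. The second helper models the suggestion of doing
the lookup only on the last put.

```c
#include <stdio.h>

/* Illustrative figures taken from the estimate quoted in the thread. */
enum {
	REFS_PER_PAGE = 20,    /* times TCP references a page during transmit */
	TICKS_PER_SEARCH = 20, /* assumed cost of one lookup */
	TICKS_PER_COPY = 1500  /* assumed cost of copying one page */
};

/* Lookup on every get() and every put(): refs * cost * 2. */
static unsigned int lookup_overhead_ticks(unsigned int refs, unsigned int search)
{
	return refs * search * 2;
}

/* Lookup only when the last reference is dropped: a single search. */
static unsigned int last_put_overhead_ticks(unsigned int search)
{
	return search;
}
```

Under these assumptions, searching on every get() and put() costs roughly
half of a page copy, while a search on the last put only costs a single
lookup per page.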
On Fri, Dec 19, 2008 at 08:27:36PM +0100, Jens Axboe ([email protected]) wrote:
> What is missing, as I wrote, is the 'release on ack' and not on pipe
> buffer release. This is similar to the get_page/put_page stuff you did
> in your patch, but don't go claiming that zero-copy transmit is a
> Vladislav original - the ->sendpage() does no copies.
Just my small rant: it does, when the underlying device does not support
hardware tx checksumming and scatter/gather, which is more likely the
exception than the rule for modern NICs.
As for having notifications of the received ack (or, from the user's point
of view, notification of the freeing of the buffer), I have the following
idea in mind: extend skb shared info with a copy of the frag array and an
additional destructor field, which will be invoked when not only the skb
but also all its clones are freed (that's when shared info is freed), so
that the user could save some per-page context in the fraglist and work
with it when the data is not used anymore.
Extending page or skb structure is a no-go for sure, and actually even
shared info is not rubber, but there we can at least add something...
If only the destructor field is allowed (a similar patch was not rejected),
scsi can save its pages in a tree (indexed by the page pointer) and
traverse it when the destructor is invoked, selecting the pages found in
the freed skb.
--
Evgeniy Polyakov
On Fri, Dec 19, 2008 at 12:21:41PM -0800, Jeremy Fitzhardinge ([email protected]) wrote:
> I think Rusty has a patch to put some kind of put notifier in struct
> skb_shared_info, but I'm not sure of the details.
Yes, he added a destructor callback to shared info.
There may be a problem though, if iscsi runs over a xen network, but in
this case xen may copy the data, or iscsi may do that after determining
that the underlying network device does not allow a shared info destructor
(via a device/route flag, for example).
> Wouldn't you only need to do the lookup on the last put?
>
> An external lookup table might well work for us, if the net_put_page()
> change is acceptable to the network folk.
That sounds like the best solution for this problem.
David, will you accept shared info destructor?
And a second fraglist? (Put on a new line to make it easy to say no here :)
--
Evgeniy Polyakov
Evgeniy Polyakov wrote:
> On Fri, Dec 19, 2008 at 12:21:41PM -0800, Jeremy Fitzhardinge ([email protected]) wrote:
>
>> I think Rusty has a patch to put some kind of put notifier in struct
>> skb_shared_info, but I'm not sure of the details.
>>
>
> Yes, he added destructor callback into shared info.
>
> There may be a problem though, if iscsi will run over xen network, but in
> this case xen may copy the data, or iscsi may do that after determining
> that underlying network device does not allow shared info destructor
> (via device/route flag for example).
>
Xen only needs the callback in the case of traffic originating in
another Xen domain. If iscsi is involved, it will be running in the
other domain, and so all its callbacks and so on will happen there.
There's no conflict.
>> Wouldn't you only need to do the lookup on the last put?
>>
>> An external lookup table might well work for us, if the net_put_page()
>> change is acceptable to the network folk.
>>
>
> That sounds like the best solution for this problem.
>
An external lookup would just need to change put_page -> net_put_page, I
think.
> David, will you accept shared info destructor?
>
I'm not very familiar with the network stack, but am I right in assuming
that the shared_info destructor would be called when the network stack
has finished with all the pages it refers to? And those pages could be
the combination of pages separately submitted in different skbs?
J
On Fri, Dec 19, 2008 at 02:21:18PM -0800, Jeremy Fitzhardinge ([email protected]) wrote:
> >There may be a problem though, if iscsi will run over xen network, but in
> >this case xen may copy the data, or iscsi may do that after determining
> >that underlying network device does not allow shared info destructor
> >(via device/route flag for example).
> >
>
> Xen only needs the callback in the case of traffic originating in
> another Xen domain. If iscsi is involved, it will be running in the
> other domain, and so all its callbacks and so on will happen there.
> There's no conflict.
That's good news.
> >>Wouldn't you only need to do the lookup on the last put?
> >>
> >>An external lookup table might well work for us, if the net_put_page()
> >>change is acceptable to the network folk.
> >>
> >
> >That sounds like the best solution for this problem.
> >
>
> An external lookup would just need to change put_page -> net_put_page, I
> think.
I think that is not needed: an skb shared info destructor has access
to all the pages in that skb.
> >David, will you accept shared info destructor?
> >
>
> I'm not very familiar with the network stack, but am I right in assuming
> that the shared_info destructor would be called when the network stack
> has finished with all the pages it refers to? And those pages could be
> the combination of pages separately submitted in different skbs?
Shared info is freed when there are no skbs referring to the shared info
in question. An skb holds all pages in shared info in the fraglist array,
so when it is about to be freed, it means that the network stack does not
use it (in particular, it will put_page() every page in the fraglist).
Usually there are two skbs in the network stack per packet in TCP
(allocated at once though, via the fastclone mechanism): one is provided
to the device (and will be freed there) and another one is placed into the
retransmit queue, where it will be located and freed when the ack has been
received. There may be other layers which may clone the skb, but its
shared info structure (shared between the clones) will only be freed when
all users have freed their cloned skbs.
--
Evgeniy Polyakov
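The lifetime rule described above (shared info is released only when the
last skb referring to it goes away) can be sketched as a tiny user-space
model. This is only an illustration under stated assumptions: the model_*
names are hypothetical and are not the kernel's structures or API, and the
destructor field is the proposed one, not something that exists in the
tree.

```c
#include <stddef.h>

/* User-space model only: these are NOT the kernel's structures. */
struct model_shinfo {
	int dataref;                               /* skbs sharing this info */
	void (*destructor)(struct model_shinfo *); /* hypothetical callback */
	int destroyed;                             /* set by the callback */
};

struct model_skb {
	struct model_shinfo *shinfo;
};

static void model_mark_destroyed(struct model_shinfo *si)
{
	si->destroyed = 1;
}

static void model_skb_init(struct model_skb *skb, struct model_shinfo *si)
{
	si->dataref = 1;
	si->destructor = model_mark_destroyed;
	si->destroyed = 0;
	skb->shinfo = si;
}

/* A clone shares the same shared info and takes one more reference. */
static void model_skb_clone(struct model_skb *clone, const struct model_skb *orig)
{
	clone->shinfo = orig->shinfo;
	clone->shinfo->dataref++;
}

/* Dropping the last reference is the only point where the callback runs. */
static void model_skb_free(struct model_skb *skb)
{
	struct model_shinfo *si = skb->shinfo;

	skb->shinfo = NULL;
	if (--si->dataref == 0 && si->destructor)
		si->destructor(si);
}
```

In this model, freeing the skb handed to the device leaves the callback
untouched; it fires only once the clone sitting in the retransmit queue is
freed as well.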
Evgeniy Polyakov wrote:
> Shared info is freed when there are no skbs referring to the shared info
> in question. Skb holds all pages in shared info in the fraglist array,
> so when it is about to be freed, it means that network stack does not
> use it (particularly it will put_page() every page in fraglist). Usually
> there are two skbs in the network stack per packet in TCP (allocated at
> once though via fastclone mechanism): one is provided to the device
> (and will be freed there) and another one is placed into retransmit
> queue, where it will be located and freed when ack has been received.
>
> There may be another layers which may clone skb, but its shared info
> structure (shared between the clones) will only be freed when all users
> freed appropriate cloned skbs.
>
I see. One more question: how would I go about submitting some data
with the callback attached to it?
J
On Fri, Dec 19, 2008 at 05:56:11PM -0800, Jeremy Fitzhardinge wrote:
> Evgeniy Polyakov wrote:
>> Shared info is freed when there are no skbs referring to the shared info
>> in question. Skb holds all pages in shared info in the fraglist array,
>> so when it is about to be freed, it means that network stack does not
>> use it (particularly it will put_page() every page in fraglist). Usually
>> there are two skbs in the network stack per packet in TCP (allocated at
>> once though via fastclone mechanism): one is provided to the device
>> (and will be freed there) and another one is placed into retransmit
>> queue, where it will be located and freed when ack has been received.
>>
>> There may be another layers which may clone skb, but its shared info
>> structure (shared between the clones) will only be freed when all users
>> freed appropriate cloned skbs.
This is all correct. However, please note that if any clone
does a pskb_expand_head then it will get its own private copy
of the shared info. So you can't use the shared info to ref count
the pages in it.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Herbert Xu wrote:
> On Fri, Dec 19, 2008 at 05:56:11PM -0800, Jeremy Fitzhardinge wrote:
>
>> Evgeniy Polyakov wrote:
>>
>>> Shared info is freed when there are no skbs referring to the shared info
>>> in question. Skb holds all pages in shared info in the fraglist array,
>>> so when it is about to be freed, it means that network stack does not
>>> use it (particularly it will put_page() every page in fraglist). Usually
>>> there are two skbs in the network stack per packet in TCP (allocated at
>>> once though via fastclone mechanism): one is provided to the device
>>> (and will be freed there) and another one is placed into retransmit
>>> queue, where it will be located and freed when ack has been received.
>>>
>>> There may be another layers which may clone skb, but its shared info
>>> structure (shared between the clones) will only be freed when all users
>>> freed appropriate cloned skbs.
>>>
>
> This is all correct. However, please note that if any clone
> does a pskb_expand_head then it will get its own private copy
> of the shared info. So you can't use the shared info to ref count
> the pages in it.
Ah, so the lifetime of the shared_info structure doesn't match the
lifetime of the underlying pages, and this mechanism would be
insufficient for my purposes? If so, how can it be solved?
Thanks,
J
On Fri, Dec 19, 2008 at 10:14:47PM -0800, Jeremy Fitzhardinge wrote:
>
> Ah, so the lifetime of the shared_info structure doesn't match the
> lifetime of the underlying pages, and this mechanism would be
> insufficient for my purposes?
Right.
> If so, how can it be solved?
Well since each individual page can be pulled out I think you
just have to track them.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Herbert Xu wrote:
>> If so, how can it be solved?
>>
>
> Well since each individual page can be pulled out I think you
> just have to track them.
>
Hm. So if I get a destructor call from the shared_info, can I go and
inspect the page refcounts to see if it's really the last use?
J
On Fri, Dec 19, 2008 at 11:43:34PM -0800, Jeremy Fitzhardinge wrote:
>
> Hm. So if I get a destructor call from the shared_info, can I go and
> inspect the page refcounts to see if it's really the last use?
The pages that were originally in the shared_info at creation
time may no longer be there by the time it's freed because of
pskb_pull_tail.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Sat, Dec 20, 2008 at 07:10:45PM +1100, Herbert Xu ([email protected]) wrote:
> > Hm. So if I get a destructor call from the shared_info, can I go and
> > inspect the page refcounts to see if it's really the last use?
>
> The pages that were originally in the shared_info at creation
> time may no longer be there by the time it's freed because of
> pskb_pull_tail.
Things should work fine, since pskb_expand_head() copies the whole shared
info structure (and thus will copy the destructor), gets all pages, then
copies all pointers into the new skb, and then releases the old skb's data.
So the destructor for the pages should not rely on which skb it is called
on, and should check if the pages are about to be really freed (i.e. check
their reference counters).
__pskb_pull_tail() is tricky: it just puts some pages it does not want
to be present in the skb, but it could be possible to add there a
destructor callback from the original skb with a partial flag (or just
have a destructor with two parameters, skb and page: if page is not
NULL, then only the given page is actually freed, otherwise the whole skb).
--
Evgeniy Polyakov
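The two-parameter destructor idea above can be sketched in user space as
follows. All pp_* names are hypothetical and only model the proposal: a
non-NULL page means that a single page was dropped from the skb (as a pull
would do), a NULL page means the whole skb is going away. A real callback
would also have to check the page reference counters, which this sketch
replaces with a simple "released" flag.

```c
#include <stddef.h>

#define PP_MAX_FRAGS 4

/* User-space stand-ins, not kernel structures. */
struct pp_page {
	int released; /* set once the owner has been notified */
};

struct pp_skb {
	struct pp_page *frags[PP_MAX_FRAGS];
	int nr_frags;
};

/*
 * Hypothetical callback: page != NULL means only that page left the skb,
 * page == NULL means every remaining page of the skb is released.
 */
static void pp_destructor(struct pp_skb *skb, struct pp_page *page)
{
	int i;

	if (page) {
		page->released = 1;
		return;
	}
	for (i = 0; i < skb->nr_frags; i++)
		skb->frags[i]->released = 1;
}

/* Models a pull that drops the first fragment from the skb. */
static void pp_pull_first_frag(struct pp_skb *skb)
{
	int i;

	if (skb->nr_frags == 0)
		return;
	pp_destructor(skb, skb->frags[0]);
	for (i = 1; i < skb->nr_frags; i++)
		skb->frags[i - 1] = skb->frags[i];
	skb->nr_frags--;
}

/* Models freeing the skb with whatever fragments remain. */
static void pp_skb_free(struct pp_skb *skb)
{
	pp_destructor(skb, NULL);
	skb->nr_frags = 0;
}
```

With one callback covering both cases, the owner of the pages gets exactly
one notification per page, whether the page was pulled out early or freed
together with the skb.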
Hi Vladislav,
2008/12/18 Vladislav Bolkhovitin <[email protected]>:
> Frédéric Weisbecker, on 12/14/2008 03:35 AM wrote:
>>
>> 2008/12/13 Vladislav Bolkhovitin <[email protected]>:
>>>
>>> Also (maybe I simply miss something) looks like ftrace doesn't trace exit
>>> from functions, only entrance to them. Is it true? Is it possible to log
>>> exit from functions as well?
>>
>> That's true with 2.6.28, the function tracer traces on function entries
>> only.
>> But there is an add-on to ftrace which lets one trace on entry and
>> on return: the function graph tracer. This tracer uses this facility to
>> output a graph of function calls and measure the time elapsed during
>> each function call.
>> You can also register two custom handlers to do whatever you need
>> on entry and on return.
>
> Word "graph" is quite confusing. We don't need any graphs, we need a plain
> execution path tracing as in the attached example (this is what's currently
> done).
The word graph is actually there to explain that we not only trace
each function call but can also retrieve the whole call path of a
function and then draw it as if it were C code:
0) ! 108.528 us | }
0) | irq_exit() {
0) | do_softirq() {
0) | __do_softirq() {
0) 0.895 us | __local_bh_disable();
0) | run_timer_softirq() {
0) 0.827 us | hrtimer_run_pending();
0) 1.226 us | _spin_lock_irq();
0) | _spin_unlock_irq() {
0) 6.550 us | }
0) 0.924 us | _local_bh_enable();
0) + 12.129 us | }
0) + 13.911 us | }
0) 0.707 us | idle_cpu();
0) + 17.009 us | }
0) ! 137.419 us | }
0) <========== |
0) 1.045 us | }
0) ! 148.908 us | }
0) ! 151.022 us | }
0) ! 153.022 us | }
0) 0.963 us | journal_mark_dirty();
0) 0.925 us | __brelse();
>>> All the above functionality is almost what we need. The only thing left,
>>> which I forgot to mention, is possibility to log also functions return
>>> value
>>> on exit. This is what TRACE_EXIT_RES() in SCST does. Is it possible to
>>> add
>>> those?
>>
>> I want to add that to the function graph tracer. That can be done
>> pretty easily. The only problem comes with the type of the return value.
>> Would this tracer be supposed to always return a 64-bit value regardless
>> of the real type of the value? There would be some pointless bytes
>> on most return values. I don't know how to proceed with this problem.
>
> I think if the tracer always returns a machine word, as Ingo suggested, as
> "%d(%x)" it would be more than sufficient. In the remaining 0.01% of cases
> it wouldn't be hard to print a non-standard return value using
> ftrace_printk() just before returning it.
>
Yeah, I will do some tests to see if I can have a relevant result....
>>> And one more question. Is it possible to redirect ftrace tracing to
>>> serial
>>> console or any other console (netconsole?)? It can be helpful to
>>> investigate
>>> hard lockups in IRQ or with IRQs disabled.
>>>
>>> Thanks!
>>> Vlad
>>>
>>
>> I would find it useful too. I thought about something like using
>> early_printk or something like
>> that...I don't know. That would be good to redirect the output to the
>> tty device of the user choice.
>
> That would make ftrace a complete debug logging subsystem.
Ditto, I will try to do something to early_printk() a trace instead of
inserting it into the ring buffer.
That could be a trace_option, for example.
> Dec 15 17:09:48 linux-c07e kernel: [ 738.535571] [2519]: ENTRY
> dev_user_ioctl
> Dec 15 17:09:48 linux-c07e kernel: [ 738.535582] [2519]:
> dev_user_ioctl:1735:REGISTER_DEVICE
> Dec 15 17:09:48 linux-c07e kernel: [ 738.535589] [2519]: ENTRY
> dev_user_register_dev
> Dec 15 17:09:48 linux-c07e kernel: [ 738.535601] [2519]: ENTRY
> sgv_pool_create
> Dec 15 17:09:48 linux-c07e kernel: [ 738.535610] [2519]: ENTRY
> sgv_pool_init
> Dec 15 17:09:48 linux-c07e kernel: [ 738.535616] [2519]:
> sgv_pool_init:997:name dev0, sizeof(*obj)=52, clustering_type=0
> Dec 15 17:09:48 linux-c07e kernel: [ 738.535626] [2519]:
> sgv_pool_init:1026:pages=1, size=76
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:1026:pages=2, size=100
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:1026:pages=4, size=148
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:1026:pages=8, size=244
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:1026:pages=16, size=436
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:1026:pages=32, size=820
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:1026:pages=64, size=1588
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:1026:pages=128, size=3124
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:1026:pages=256, size=52
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:1026:pages=512, size=52
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:1026:pages=1024, size=52
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: EXIT
> sgv_pool_init: 0
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: EXIT
> sgv_pool_create: 1
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: ENTRY
> sgv_pool_create
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: ENTRY
> sgv_pool_init
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:997:name dev0-clust, sizeof(*obj)=52, clustering_type=1
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:1026:pages=1, size=80
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:1026:pages=2, size=108
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:1026:pages=4, size=164
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:1026:pages=8, size=276
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:1026:pages=16, size=500
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:1026:pages=32, size=948
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:1026:pages=64, size=1844
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:1026:pages=128, size=3636
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:1026:pages=256, size=1076
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:1026:pages=512, size=2100
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> sgv_pool_init:1026:pages=1024, size=52
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: EXIT
> sgv_pool_init: 0
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: EXIT
> sgv_pool_create: 1
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: ENTRY
> __dev_user_set_opt
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> __dev_user_set_opt:2580:dev dev0, parse_type 0, on_free_cmd_type 1,
> memory_reuse_type 1, partial_transfers_type 0, partial_len 0
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: ENTRY
> dev_user_setup_functions
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: EXIT
> dev_user_setup_functions
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: EXIT
> __dev_user_set_opt: 0
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> dev_user_register_dev:2499:dev b7bffad0, name dev0
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: ENTRY
> __scst_register_virtual_dev_driver
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: EXIT
> scst_dev_handler_check: 0
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: scst:
> __scst_register_virtual_dev_driver:1036:Virtual device handler dh-dev0 for
> type 1 registered successfully
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: EXIT
> __scst_register_virtual_dev_driver: 0
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: ENTRY
> scst_register_virtual_device
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: EXIT
> scst_dev_handler_check: 0
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: ENTRY
> scst_suspend_activity
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> scst_suspend_activity:484:suspend_count 0
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: ENTRY
> scst_susp_wait
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]:
> scst_susp_wait:463:wait_event() returned 0
> Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: EXIT
> scst_susp_wait: 0
>
>
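The ENTRY/EXIT lines in the log above, combined with the suggestion to
print the return value as a machine word in the "%d(%x)" style, can be
sketched in user space like this. The fmt_trace_* helpers are hypothetical
and only illustrate the output format; they are neither SCST's
TRACE_ENTRY()/TRACE_EXIT_RES() macros nor an ftrace interface. The return
value is printed as a long in both decimal and hex.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: formats an "ENTRY <function>" line. */
static int fmt_trace_entry(char *buf, size_t len, const char *func)
{
	return snprintf(buf, len, "ENTRY %s", func);
}

/*
 * Hypothetical helper: formats "EXIT <function>: <dec>(<hex>)", treating
 * the return value as one machine word regardless of its real type.
 */
static int fmt_trace_exit_res(char *buf, size_t len, const char *func, long res)
{
	return snprintf(buf, len, "EXIT %s: %ld(%lx)", func, res,
			(unsigned long)res);
}
```

A caller would format into a local buffer and hand the result to whatever
sink is in use (printk, a ring buffer, a console).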
Evgeniy Polyakov wrote:
> Things should work fine, since pskb_expand_head() copies whole shared
> info structure (and thus will copy destructor), get all pages and then
> copy all pointers into the new skb, and then release old skb's data.
>
> So destructor for the pages should not rely on which skb it is called on
> and check if pages are about to be really freed (i.e. check their
> reference counters).
>
OK.
> __pskb_pull_tail() is tricky, it just puts some pages it does not want
> to be present in the skb, but it could be possible to add there
> destructor callback from the original skb with partial flag (or just
> having destructor with two parameters: skb and page, and if page is not
> NULL, then actually only given page is freed, otherwise the whole skb).
>
Yes, that doesn't sound too bad.
J
On Sunday 21 December 2008 06:09:18 Jeremy Fitzhardinge wrote:
> Evgeniy Polyakov wrote:
> > Things should work fine, since pskb_expand_head() copies whole shared
> > info structure (and thus will copy destructor), get all pages and then
> > copy all pointers into the new skb, and then release old skb's data.
> >
> > So destructor for the pages should not rely on which skb it is called on
> > and check if pages are about to be really freed (i.e. check their
> > reference counters).
> >
>
> OK.
>
> > __pskb_pull_tail() is tricky, it just puts some pages it does not want
> > to be present in the skb, but it could be possible to add there
> > destructor callback from the original skb with partial flag (or just
> > having destructor with two parameters: skb and page, and if page is not
> > NULL, then actually only given page is freed, otherwise the whole skb).
> >
>
> Yes, that doesn't sound too bad.
That would be one approach. Actually, my patch solved this by keeping a
parent ref in various cases if the parent had a destructor: we only destroy
the parent when all the clones are gone.
Here's the patch for reference:
net: add destructor for skb data.
If we want to notify something when an skb is truly finished (such as
for tun vringfd support), we need a destructor on the data.
This turns out to be slightly non-trivial as fragments from one skb
get copied to another skb: if the first skb has a destructor (or its
parent does) we need to keep a reference to it and destroy it only
when (all the) children are destroyed. We add an 'orig' pointer to
the skb_shared_info to do this.
But there's currently no way to get from the shinfo to the head (to
kfree it), so we add a 'len' field. A better alternative to this
might be to move the skb_shared_info to before the head of the skb data.
Note that the destructor is responsible for calling kfree: for the tun
device, this is critical since the destructor can be called from any
context and it has to do a copy_to_user, so it queues the skb.
Signed-off-by: Rusty Russell <[email protected]>
---
include/linux/skbuff.h | 9 ++++++
net/core/skbuff.c | 66 ++++++++++++++++++++++++++++++++++++++++---------
2 files changed, 64 insertions(+), 11 deletions(-)
diff -r 3948a0050f81 include/linux/skbuff.h
--- a/include/linux/skbuff.h Tue May 06 21:18:01 2008 +1000
+++ b/include/linux/skbuff.h Thu May 08 13:12:47 2008 +1000
@@ -146,8 +146,12 @@ struct skb_shared_info {
unsigned short gso_segs;
unsigned short gso_type;
__be32 ip6_frag_id;
+ unsigned int len; /* Subtract from this shinfo to find skb->head */
struct sk_buff *frag_list;
skb_frag_t frags[MAX_SKB_FRAGS];
+ struct skb_shared_info *orig;
+ /* This is responsible for kfree() of header. */
+ void (*destructor)(struct skb_shared_info *);
};
/* We divide dataref into two halves. The higher 16 bits hold references
@@ -827,6 +831,11 @@ static inline void skb_fill_page_desc(st
#define SKB_PAGE_ASSERT(skb) BUG_ON(skb_shinfo(skb)->nr_frags)
#define SKB_FRAG_ASSERT(skb) BUG_ON(skb_shinfo(skb)->frag_list)
#define SKB_LINEAR_ASSERT(skb) BUG_ON(skb_is_nonlinear(skb))
+
+static inline unsigned char *skb_shinfo_to_head(struct skb_shared_info *shinfo)
+{
+ return (unsigned char *)shinfo - shinfo->len;
+}
#ifdef NET_SKBUFF_DATA_USES_OFFSET
static inline unsigned char *skb_tail_pointer(const struct sk_buff *skb)
diff -r 3948a0050f81 net/core/skbuff.c
--- a/net/core/skbuff.c Tue May 06 21:18:01 2008 +1000
+++ b/net/core/skbuff.c Thu May 08 13:12:47 2008 +1000
@@ -218,6 +218,9 @@ struct sk_buff *__alloc_skb(unsigned int
shinfo->gso_type = 0;
shinfo->ip6_frag_id = 0;
shinfo->frag_list = NULL;
+ shinfo->destructor = NULL;
+ shinfo->orig = NULL;
+ shinfo->len = skb_end_pointer(skb) - skb->head;
if (fclone) {
struct sk_buff *child = skb + 1;
@@ -311,21 +314,53 @@ static void skb_clone_fraglist(struct sk
skb_get(list);
}
+static void shinfo_put(struct skb_shared_info *shinfo, bool nohdr, bool clone)
+{
+ struct skb_shared_info *orig;
+
+ do {
+ if (clone &&
+ atomic_sub_return(nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
+ &shinfo->dataref)) {
+ return;
+ }
+
+ if (shinfo->nr_frags) {
+ int i;
+ for (i = 0; i < shinfo->nr_frags; i++)
+ put_page(shinfo->frags[i].page);
+ }
+
+ if (shinfo->frag_list)
+ skb_drop_list(&shinfo->frag_list);
+
+ orig = shinfo->orig;
+ if (shinfo->destructor)
+ shinfo->destructor(shinfo);
+ else
+ kfree(skb_shinfo_to_head(shinfo));
+
+ /* We hold a payload reference to our parent. */
+ nohdr = true;
+ clone = true;
+ } while ((shinfo = orig) != NULL);
+}
+
static void skb_release_data(struct sk_buff *skb)
{
- if (!skb->cloned ||
- !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
- &skb_shinfo(skb)->dataref)) {
- if (skb_shinfo(skb)->nr_frags) {
- int i;
- for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
- put_page(skb_shinfo(skb)->frags[i].page);
- }
+ shinfo_put(skb_shinfo(skb), skb->nohdr, skb->cloned);
+}
- if (skb_shinfo(skb)->frag_list)
- skb_drop_fraglist(skb);
+/* Now hold reference to older data, if has a destructor (recursively). */
+static void skb_ref_data_parent(struct sk_buff *parent,
+ struct skb_shared_info *shinfo)
+{
+ struct skb_shared_info *pshinfo = skb_shinfo(parent);
- kfree(skb->head);
+ if (pshinfo->destructor || pshinfo->orig) {
+ shinfo->orig = pshinfo;
+ atomic_add((1 << SKB_DATAREF_SHIFT) + 1, &pshinfo->dataref);
+ parent->cloned = 1;
}
}
@@ -656,6 +691,7 @@ struct sk_buff *pskb_copy(struct sk_buff
get_page(skb_shinfo(n)->frags[i].page);
}
skb_shinfo(n)->nr_frags = i;
+ skb_ref_data_parent(skb, skb_shinfo(n));
}
if (skb_shinfo(skb)->frag_list) {
@@ -721,6 +757,8 @@ int pskb_expand_head(struct sk_buff *skb
if (skb_shinfo(skb)->frag_list)
skb_clone_fraglist(skb);
+ skb_ref_data_parent(skb, (void *)(data + size));
+
skb_release_data(skb);
off = (data + nhead) - skb->head;
@@ -743,6 +781,8 @@ int pskb_expand_head(struct sk_buff *skb
skb->hdr_len = 0;
skb->nohdr = 0;
atomic_set(&skb_shinfo(skb)->dataref, 1);
+ skb_shinfo(skb)->len = skb_end_pointer(skb) - skb->head;
+ skb_shinfo(skb)->destructor = NULL;
return 0;
nodata:
@@ -1962,6 +2002,8 @@ void skb_split(struct sk_buff *skb, stru
skb_split_inside_header(skb, skb1, len, pos);
else /* Second chunk has no header, nothing to copy. */
skb_split_no_header(skb, skb1, len, pos);
+
+ skb_ref_data_parent(skb, skb_shinfo(skb1));
}
/**
@@ -2334,6 +2376,8 @@ struct sk_buff *skb_segment(struct sk_bu
nskb->data_len = len - hsize;
nskb->len += nskb->data_len;
nskb->truesize += nskb->data_len;
+
+ skb_ref_data_parent(skb, skb_shinfo(nskb));
} while ((offset += len) < skb->len);
return segs;
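
The parent-reference idea in the patch above can be modelled in user space. This is only a minimal sketch, not the kernel code: the names `shinfo_model`, `shinfo_model_put()`, `shinfo_model_ref_parent()` and `count_destroy()` are hypothetical, and the split payload/header halves of `dataref` are collapsed into a single plain counter for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of skb_shared_info with 'orig' and
 * 'destructor'; dataref is a single plain counter here. */
struct shinfo_model {
	int dataref;                      /* references to this data */
	struct shinfo_model *orig;        /* parent we hold a reference on */
	void (*destructor)(struct shinfo_model *);
	int freed;                        /* set when the data is released */
};

/* Illustrative destructor: only counts how often it was called. */
static int destroyed_count;
static void count_destroy(struct shinfo_model *s)
{
	(void)s;
	destroyed_count++;
}

/* Child takes a reference on the parent only if the parent (or one of
 * its ancestors) has a destructor, as skb_ref_data_parent() does. */
static void shinfo_model_ref_parent(struct shinfo_model *child,
				    struct shinfo_model *parent)
{
	if (parent->destructor || parent->orig) {
		child->orig = parent;
		parent->dataref++;
	}
}

/* Drop one reference; when it reaches zero, release the data and then
 * drop the reference held on the parent, walking up the chain. */
static void shinfo_model_put(struct shinfo_model *s)
{
	while (s) {
		struct shinfo_model *orig;

		assert(s->dataref > 0);
		if (--s->dataref)
			return;
		orig = s->orig;
		if (s->destructor)
			s->destructor(s);
		s->freed = 1;
		s = orig;
	}
}
```

With a parent that has a destructor and one child that borrowed its fragments, dropping the parent first does not call the destructor; it runs only once the child is dropped as well.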
Jens Axboe, on 12/19/2008 10:27 PM wrote:
>>>>>> An iSCSI target driver iSCSI-SCST was a part of the patchset
>>>>>> (http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to
>>>>>> have TCP zero-copy transmit of user space data was implemented. Patch,
>>>>>> implementing this optimization was also sent in the patchset, see
>>>>>> http://lkml.org/lkml/2008/12/10/296.
>>>>> I'm probably ignorant of about 90% of the context here, but isn't this
>>>>> the sort of problem that was supposed to have been solved by vmsplice(2)?
>>>> No, vmsplice can't help here. ISCSI-SCST is a kernel space driver. But,
>>>> even if it was a user space driver, vmsplice wouldn't change anything
>>>> much. It doesn't have a possibility for a user to know, when
>>>> transmission of the data finished. So, it is intended to be used as:
>>>> vmsplice() buffer -> munmap() the buffer -> mmap() new buffer ->
>>>> vmsplice() it. But on the mmap() stage kernel has to zero all the newly
>>>> mapped pages and zeroing memory isn't much faster, than copying it.
>>>> Hence, there would be no considerable performance increase.
>>> vmsplice() isn't the right choice, but splice() very well could be. You
>>> could easily use splice internally as well. The vmsplice() part sort-of
>>> applies in the sense that you want to fill pages into a pipe, which is
>>> essentially what vmsplice() does. You'd need some helper to do that.
>> Sorry, Jens, but splice() works only if there is a file handle on the
>> another side, so user space doesn't see data buffers. But SCST needs to
>> serve a wider usage cases, like reading data with decompression from a
>> virtual tape, where decompression is done in user space. For those only
>> complete zero-copy network send, which I implemented, can give the best
>> performance.
>
> __splice_from_pipe() takes a pipe, a descriptor and an actor. There's
> absolutely ZERO reason you could not reuse most of that for this
> implementation. The big bonus here is that getting the put correct from
> networking would even make splice() better for everyone. Win for Linux,
> win for you since it'll make it MUCH easier for you to get this stuff
> in. Looking at your original patch and I almost think it's a flame bait
> to induce discussion (nothing wrong with that, that approach works quite
> well and has been used before). There's no way in HELL that it'd ever be
> a merge candidate. And I suspect you know that, at least I hope you do
> or you are farther away from going forward with this than you think.
>
> So don't look at splice() the system call, look at the infrastructure
> and check if that could be useful for your case. To me it looks
> absolutely like it could, if you goal is just zero-copy transmit.
I looked at the splice code again to make sure I didn't miss anything.
__splice_from_pipe() leads to pipe_to_sendpage(), which leads to
sock_sendpage(), then to sock->sendpage(). Sorry, but I don't see any
point in going through all the complicated splice infrastructure instead
of directly calling sock->sendpage(), as I do.
> The
> only missing piece is dropping the reference and signalling page
> consumption at the right point, which is when the data is safe to be
> reused. That very bit is missing, but that should be all as far as I can
> tell.
This is exactly what I implemented in the patch we are discussing.
>>> And
>>> the ack-on-xmit-done bits is something that splice-to-socket needs
>>> anyway, so I think it'd be quite a suitable choice for this.
>> So, are you writing that splice() could also benefit from the zero-copy
>> transmit feature, like I implemented?
>
> I like how you want to reinvent everything, perhaps you should spend a
> little more time looking into various other approaches? splice() already
> does zero-copy network transmit, there are no copies going on. Ideally,
> you'd have zero copies moving data into your pipe, but migrade/move
> isn't quite there yet. But that doesn't apply to your case at all.
>
> What is missing, as I wrote, is the 'release on ack' and not on pipe
> buffer release. This is similar to the get_page/put_page stuff you did
> in your patch, but don't go claiming that zero-copy transmit is a
> Vladislav original - the ->sendpage() does no copies.
Jens, I have never claimed I reinvented ->sendpage(). Quite the opposite,
I use it. I only extended it with a missing feature. Although, it seems,
since you were misled, I should apologize for the not too good
description of the patch.
Thanks,
Vlad
Hi Frédéric,
Frédéric Weisbecker, on 12/20/2008 04:06 PM wrote:
> Hi Vladislav,
>
> 2008/12/18 Vladislav Bolkhovitin <[email protected]>:
>> Frédéric Weisbecker, on 12/14/2008 03:35 AM wrote:
>>> 2008/12/13 Vladislav Bolkhovitin <[email protected]>:
>>>> Also (maybe I simply miss something) looks like ftrace doesn't trace exit
>>>> from functions, only entrance to them. Is it true? Is it possibly to log
>>>> exit from functions as well?
>>> That's true with 2.6.28, the function tracer traces on function entries
>>> only.
>>> But there is an add-on on ftrace which let one to trace on entry and
>>> on return, the function
>>> graph tracer. This tracer uses this facility to output a graph of
>>> function calls and measure
>>> the time elapsed during each function call.
>>> You can also register two custom handlers to do some things you need
>>> on entry and on return.
>> Word "graph" is quite confusing. We don't need any graphs, we need a plain
>> execution path tracing as in the attached example (this is what's currently
>> done).
>
> The word graph is actually here to explain here that we not only trace
> each function call
> but we can actually retrieve all of the call path of a function and
> then draw it as if it was
> C code:
>
> 0) ! 108.528 us | }
> 0) | irq_exit() {
> 0) | do_softirq() {
> 0) | __do_softirq() {
> 0) 0.895 us | __local_bh_disable();
> 0) | run_timer_softirq() {
> 0) 0.827 us | hrtimer_run_pending();
> 0) 1.226 us | _spin_lock_irq();
> 0) | _spin_unlock_irq() {
> 0) 6.550 us | }
> 0) 0.924 us | _local_bh_enable();
> 0) + 12.129 us | }
> 0) + 13.911 us | }
> 0) 0.707 us | idle_cpu();
> 0) + 17.009 us | }
> 0) ! 137.419 us | }
> 0) <========== |
> 0) 1.045 us | }
> 0) ! 148.908 us | }
> 0) ! 151.022 us | }
> 0) ! 153.022 us | }
> 0) 0.963 us | journal_mark_dirty();
> 0) 0.925 us | __brelse();
Unfortunately, it lacks the very useful "TASK-PID, CPU#, TIMESTAMP"
header fields.
Jeremy Fitzhardinge, on 12/19/2008 11:21 PM wrote:
[...]
> As with your case, we can simply copy the page data if this mechanism
> isn't available. But it would be nice if it were.
>
>> 1. Add net_priv analog in struct sk_buff, not in struct page. But then
>> it would be required that all the pages in each skb must be from the
>> same originator, i.e. with the same net_priv. It is unpractical to
>> change all the operations with skb's to forbid merging them, if they
>> have different net_priv. I tried, but quickly gave up. There are too
>> many such places in very not obvious code pieces.
>>
>
> I think Rusty has a patch to put some kind of put notifier in struct
> skb_shared_info, but I'm not sure of the details.
>
>> 2. Have in iSCSI-SCST a hashed list to translate page to iSCSI cmd by a
>> simple search function. This approach was rejected, because to copy a
>> page a modern CPU needs using MMX about 1500 ticks.
>
> Is that the cold cache timing?
It should be L2 cache hot, which is almost always the case if FILEIO is
used, because the data are just copied from the page cache. Although,
frankly, at the moment I can't find where I got that number from.
>> It was observed,
>> that each page can be referenced by TCP during transmit about 20 times
>> or even more. So, if each search needs, say, 20 ticks, the overall
>> search time will be 20*20*2 (to get() and put()) = 800 ticks. So, this
>> approach would considerably worse performance-wise to the chosen
>> approach and provide not too much benefit.
>
> Wouldn't you only need to do the lookup on the last put?
No, because you can't say which one is the last. E.g., a page can be
mmap()ed into another process while it's being transmitted. So, the only
possible way is to track all gets and puts done by networking using some
external reference counting (net_ref_cnt in the case of iscsi-scst).
Vlad
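
The two points above (the lookup cost estimate and the external reference counting) can be sketched in user space. This is a minimal sketch under stated assumptions: `struct tracked_page`, `net_put_page()` and `lookup_cost_ticks()` are hypothetical names for illustration, plain integers stand in for atomic counters, and the tick figures are the estimates quoted in the message, not measurements.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page whose network references are counted
 * separately from the normal page refcount. */
struct tracked_page {
	int net_ref_cnt;   /* references currently held by networking */
	bool tx_done;      /* set when networking dropped its last one */
};

/* Called wherever the network stack takes a reference on the page. */
static void net_get_page(struct tracked_page *p)
{
	p->net_ref_cnt++;
}

/* Called wherever the network stack drops a reference. Only this
 * counter can tell the last network put apart from puts done by other
 * users of the page (e.g. another process that has it mapped). */
static void net_put_page(struct tracked_page *p)
{
	assert(p->net_ref_cnt > 0);
	if (--p->net_ref_cnt == 0)
		p->tx_done = true;
}

/* Cost model from the message: one lookup per get and one per put. */
static unsigned int lookup_cost_ticks(unsigned int refs_per_page,
				      unsigned int ticks_per_lookup)
{
	return refs_per_page * ticks_per_lookup * 2;
}
```

With the figures from the message, lookup_cost_ticks(20, 20) gives 800 ticks per page, which is the number compared against the roughly 1500 ticks estimated for copying the page.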
Rusty Russell, on 12/22/2008 03:43 AM wrote:
> On Sunday 21 December 2008 06:09:18 Jeremy Fitzhardinge wrote:
>> Evgeniy Polyakov wrote:
>>> Things should work fine, since pskb_expand_head() copies whole shared
>>> info structure (and thus will copy destructor), get all pages and then
>>> copy all pointers into the new skb, and then release old skb's data.
>>>
>>> So destructor for the pages should not rely on which skb it is called on
>>> and check if pages are about to be really freed (i.e. check theirs
>>> reference counter).
>>>
>> OK.
>>
>>> __pskb_pull_tail() is tricky, it just puts some pages it does not want
>>> to be present in the skb, but it could be possible to add there
>>> destructor callback from the original skb with partial flag (or just
>>> having destructor with two parameters: skb and page, and if page is not
>>> NULL, then actually only given page is freed, otherwise the whole skb).
>>>
>> Yes, that doesn't sound too bad.
>
> That would be one approach. Actually, my patch solved this by keeping a
> parent ref in various cases if the parent had a destructor: we only destroy
> the parent when all the clones are gone.
>
> Here's the patch for reference:
>
> net: add destructor for skb data.
>
> If we want to notify something when an skb is truly finished (such as
> for tun vringfd support), we need a destructor on the data.
>
> This turns out to be slightly non-trivial as fragments from one skb
> get copied to another skb: if the first skb has a destructor (or its
> parent does) we need to keep a reference to it and destroy it only
> when (all the) children are destroyed. We add an 'orig' pointer to
> the skb_shared_info to do this.
>
> But there's currently no way to get from the shinfo to the head (to
> kfree it), so we add a 'len' field. A better alternative to this
> might be to move the skb_shared_info to before the head of the skb data.
>
> Note that the destructor is responsible for calling kfree: for the tun
> device, this is critical since the destructor can be called from any
> context and it has to do a copy_to_user, so it queues the skb.
Rusty,
Can you describe how one should use your patch, please? Maybe, there is
some code you use to test it?
Thanks,
Vlad
Evgeniy Polyakov, on 12/20/2008 01:32 PM wrote:
> On Sat, Dec 20, 2008 at 07:10:45PM +1100, Herbert Xu ([email protected]) wrote:
>>> Hm. So if I get a destructor call from the shared_info, can I go an
>>> inspect the page refcounts to see if its really the last use?
>> The pages that were originally in the shared_info at creation
>> time may no longer be there by the time it's freed because of
>> pskb_pull_tail.
>
> Things should work fine, since pskb_expand_head() copies whole shared
> info structure (and thus will copy destructor), get all pages and then
> copy all pointers into the new skb, and then release old skb's data.
>
> So destructor for the pages should not rely on which skb it is called on
> and check if pages are about to be really freed (i.e. check theirs
> reference counter).
>
> __pskb_pull_tail() is tricky, it just puts some pages it does not want
> to be present in the skb, but it could be possible to add there
> destructor callback from the original skb with partial flag (or just
> having destructor with two parameters: skb and page, and if page is not
> NULL, then actually only given page is freed, otherwise the whole skb).
Actually, there's another way, which seems to be a lot simpler. Alexey
Kuznetsov privately suggested it to me.
A new pointer, transaction_token, would be added to skb_shared_info. It
would point to:
struct sk_transaction_token
{
atomic_t io_count;
struct sk_transaction_token *next;
unsigned long token;
unsigned long private;
void (*finish_callback)(struct sk_transaction_token *);
};
When an skb is translated, its transaction_token is inherited. If two
skbs are merged (the same places where I put net_get_page()'s in my
patch), the *older* token is inherited. This is the main point of the
idea.
Before starting a new asynchronous send, a client would open a new token.
Everything sent after that would receive that token. finish_callback()
would be called and the corresponding token freed when io_count == 0
*AND* all previous tokens are closed.
This idea seems to be even simpler than what Rusty implemented. Correct
me if I'm wrong. But, unfortunately, I will have no time to develop it
in the near future.. :-(
Vlad
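
The completion rule described above (finish a token only when its io_count is zero and all older tokens are finished) can be sketched in user space. This is only an illustration under several assumptions: `struct token_queue`, `token_open()`, `token_get()`, `token_put()` and `record_finish()` are hypothetical names, a plain int replaces atomic_t, the submitter is assumed to hold one reference of its own while it is still sending, and the "older token wins on merge" rule is not modelled.

```c
#include <assert.h>
#include <stddef.h>

struct sk_transaction_token {
	int io_count;                       /* atomic_t in the proposal */
	struct sk_transaction_token *next;  /* next (newer) token */
	unsigned long token;
	unsigned long private;
	void (*finish_callback)(struct sk_transaction_token *);
};

/* Hypothetical per-connection list of open tokens, oldest first. */
struct token_queue {
	struct sk_transaction_token *oldest, *newest;
};

/* Illustrative callback: records the order in which tokens finish. */
static unsigned long finished[8];
static int nfinished;
static void record_finish(struct sk_transaction_token *t)
{
	if (nfinished < 8)
		finished[nfinished++] = t->token;
}

/* Open a token before an asynchronous send; the submitter's own
 * reference (an assumption of this sketch) keeps it from completing
 * before anything was sent. */
static void token_open(struct token_queue *q, struct sk_transaction_token *t,
		       unsigned long id,
		       void (*cb)(struct sk_transaction_token *))
{
	t->io_count = 1;
	t->next = NULL;
	t->token = id;
	t->private = 0;
	t->finish_callback = cb;
	if (q->newest)
		q->newest->next = t;
	else
		q->oldest = t;
	q->newest = t;
}

/* An skb starts carrying data that belongs to this token. */
static void token_get(struct sk_transaction_token *t)
{
	t->io_count++;
}

/* A reference is dropped; tokens complete strictly oldest first. */
static void token_put(struct token_queue *q, struct sk_transaction_token *t)
{
	assert(t->io_count > 0);
	t->io_count--;
	while (q->oldest && q->oldest->io_count == 0) {
		struct sk_transaction_token *done = q->oldest;

		q->oldest = done->next;
		if (!q->oldest)
			q->newest = NULL;
		done->finish_callback(done);
	}
}
```

If a newer token reaches io_count == 0 while an older one is still in flight, nothing is reported; once the older one drains, both callbacks run in order.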
On Tue, Dec 23, 2008 at 10:16:25PM +0300, Vladislav Bolkhovitin ([email protected]) wrote:
> Actually, there's another way, which seems to be a lot simpler. Alexey
> Kuznetsov privately suggested it to me.
>
> In skb_shared_info new pointer transaction_token would be added, which
> would point on:
>
> struct sk_transaction_token
> {
> atomic_t io_count;
> struct sk_transaction_token *next;
> unsigned long token;
> unsigned long private;
> void (*finish_callback)(struct
> sk_transaction_token *);
> };
>
> When skb is translated, transaction_token inherited. If 2 skb are merged
> (the same places where I put net_get_page's in my patch), the *older*
> token is inherited. This is the main point of this idea.
>
> Before starting new asynchronous send a client would open a new token.
> Everything sent then would receive that token. Finish_callback() would
> be called and the corresponding token freed, when io_count == 0 *AND*
> all previous tokens closed.
>
> This idea seems to be simpler, than even what Rusty implemented. Correct
> me, if I wrong. But, unfortunately, in the near future I will have no
> time to develop it.. :-(
Yes, it is simpler and cleaner, but it requires an additional allocation.
This is additional (and quite noticeable) overhead.
--
Evgeniy Polyakov
Evgeniy Polyakov, on 12/24/2008 12:38 AM wrote:
> On Tue, Dec 23, 2008 at 10:16:25PM +0300, Vladislav Bolkhovitin ([email protected]) wrote:
>> Actually, there's another way, which seems to be a lot simpler. Alexey
>> Kuznetsov privately suggested it to me.
>>
>> In skb_shared_info new pointer transaction_token would be added, which
>> would point on:
>>
>> struct sk_transaction_token
>> {
>> atomic_t io_count;
>> struct sk_transaction_token *next;
>> unsigned long token;
>> unsigned long private;
>> void (*finish_callback)(struct
>> sk_transaction_token *);
>> };
>>
>> When skb is translated, transaction_token inherited. If 2 skb are merged
>> (the same places where I put net_get_page's in my patch), the *older*
>> token is inherited. This is the main point of this idea.
>>
>> Before starting new asynchronous send a client would open a new token.
>> Everything sent then would receive that token. Finish_callback() would
>> be called and the corresponding token freed, when io_count == 0 *AND*
>> all previous tokens closed.
>>
>> This idea seems to be simpler, than even what Rusty implemented. Correct
>> me, if I wrong. But, unfortunately, in the near future I will have no
>> time to develop it.. :-(
>
> Yes, it is simpler and cleaner, but it requires additional allocation.
> This is additional (and quite noticeable) overhead.
Not necessarily. For instance, in iscsi-scst sk_transaction_token
can (and should) be part of the iSCSI cmd structure, so no additional
allocations would be needed.
Thanks,
Vlad
On Wed, Dec 24, 2008 at 05:37:51PM +0300, Vladislav Bolkhovitin ([email protected]) wrote:
> >Yes, it is simpler and cleaner, but it requires additional allocation.
> >This is additional (and quite noticeable) overhead.
>
> Not necessary requires. For instance, in iscsi-scst sk_transaction_token
> can (and should) be part of iSCSI cmd structure, so no additional
> allocations would be needed.
This is a special case. I'm not sure it is always possible to cache that
token and attach it to every skb, but if it can be done, then of course
this does not end up with additional overhead.
--
Evgeniy Polyakov
Evgeniy Polyakov, on 12/24/2008 05:44 PM wrote:
> On Wed, Dec 24, 2008 at 05:37:51PM +0300, Vladislav Bolkhovitin ([email protected]) wrote:
>>> Yes, it is simpler and cleaner, but it requires additional allocation.
>>> This is additional (and quite noticeable) overhead.
>> Not necessary requires. For instance, in iscsi-scst sk_transaction_token
>> can (and should) be part of iSCSI cmd structure, so no additional
>> allocations would be needed.
>
> This is special case, I'm not sure it is always possible to cache that
> token and attach to every skb, but if it can be done, then of course
> this does not end up with additional overhead.
I think in most cases it would be possible to embed
sk_transaction_token into some higher-level structure. E.g., Xen
apparently should have something to track packets passed through the
host/guest boundary. On the other hand, the kmem cache is too well
polished to have much overhead. I doubt you would even notice it in this
application. In most cases allocating such a small object from it using
SLUB costs about the same as a list_del() under disabled IRQs.
Vlad
On Wed, Dec 24, 2008 at 08:46:56PM +0300, Vladislav Bolkhovitin ([email protected]) wrote:
> I think in most cases there would be possibility to embed
> sk_transaction_token to some higher level structure. E.g. Xen apparently
> should have something to track packets passed through host/guest
> boundary. From other side, kmem cache is too well polished to have much
> overhead. I doubt, you would even notice it in this application. In most
> cases allocation of such small object in it using SLUB is just about the
> same as a list_del() under disabled IRQs.
I definitely would not rely on that, especially at cache reclaim time.
But of course it depends on the workload and may be appropriate for the
cases in question. The best solution, I think, is to combine the tag and
a separate destructor, so that those who do not want to allocate a token
could still get a notification via the destructor callback.
--
Evgeniy Polyakov
* Vladislav Bolkhovitin <[email protected]> wrote:
>> The word graph is actually here to explain here that we not only trace
>> each function call but we can actually retrieve all of the call path of
>> a function and then draw it as if it was C code:
>>
>> 0) ! 108.528 us | }
>> 0) | irq_exit() {
>> 0) | do_softirq() {
>> 0) | __do_softirq() {
>> 0) 0.895 us | __local_bh_disable();
>> 0) | run_timer_softirq() {
>> 0) 0.827 us | hrtimer_run_pending();
>> 0) 1.226 us | _spin_lock_irq();
>> 0) | _spin_unlock_irq() {
>> 0) 6.550 us | }
>> 0) 0.924 us | _local_bh_enable();
>> 0) + 12.129 us | }
>> 0) + 13.911 us | }
>> 0) 0.707 us | idle_cpu();
>> 0) + 17.009 us | }
>> 0) ! 137.419 us | }
>> 0) <========== |
>> 0) 1.045 us | }
>> 0) ! 148.908 us | }
>> 0) ! 151.022 us | }
>> 0) ! 153.022 us | }
>> 0) 0.963 us | journal_mark_dirty();
>> 0) 0.925 us | __brelse();
>
> Unfortunately, it lacks very useful "TASK-PID, CPU#, TIMESTAMP" header
> fields..
those obscure readability in the typical usecases, but you can get them
anytime by using this existing trace_options runtime switch:
echo funcgraph-proc > /debug/tracing/trace_options
resulting in traces like this:
# CPU TASK/PID OVERHEAD/DURATION FUNCTION CALLS
# | | | | | | | |
------------------------------------------
0) distccd-28400 => cc1-30212
------------------------------------------
0) cc1-30212 | 0.270 us | }
0) cc1-30212 | | __do_fault() {
0) cc1-30212 | | filemap_fault() {
0) cc1-30212 | | find_lock_page() {
0) cc1-30212 | 0.453 us | find_get_page();
0) cc1-30212 | 0.997 us | }
0) cc1-30212 | | PageUptodate() {
0) cc1-30212 | 0.266 us | constant_test_bit();
0) cc1-30212 | 0.799 us | }
0) cc1-30212 | 0.379 us | mark_page_accessed();
0) cc1-30212 | 3.275 us | }
0) cc1-30212 | 0.276 us | _spin_lock();
0) cc1-30212 | 0.389 us | page_add_file_rmap();
0) cc1-30212 | | unlock_page() {
0) cc1-30212 | 0.266 us | page_waitqueue();
0) cc1-30212 | 0.381 us | __wake_up_bit();
0) cc1-30212 | 1.442 us | }
0) cc1-30212 | 6.897 us | }
0) cc1-30212 |+ 11.663 us | }
0) cc1-30212 | | up_read() {
0) cc1-30212 | 0.280 us | _spin_lock_irqsave();
Thanks,
Ingo
Ingo Molnar, on 12/27/2008 02:20 PM wrote:
> * Vladislav Bolkhovitin <[email protected]> wrote:
>
>>> The word graph is actually here to explain here that we not only trace
>>> each function call but we can actually retrieve all of the call path of
>>> a function and then draw it as if it was C code:
>>>
>>> 0) ! 108.528 us | }
>>> 0) | irq_exit() {
>>> 0) | do_softirq() {
>>> 0) | __do_softirq() {
>>> 0) 0.895 us | __local_bh_disable();
>>> 0) | run_timer_softirq() {
>>> 0) 0.827 us | hrtimer_run_pending();
>>> 0) 1.226 us | _spin_lock_irq();
>>> 0) | _spin_unlock_irq() {
>>> 0) 6.550 us | }
>>> 0) 0.924 us | _local_bh_enable();
>>> 0) + 12.129 us | }
>>> 0) + 13.911 us | }
>>> 0) 0.707 us | idle_cpu();
>>> 0) + 17.009 us | }
>>> 0) ! 137.419 us | }
>>> 0) <========== |
>>> 0) 1.045 us | }
>>> 0) ! 148.908 us | }
>>> 0) ! 151.022 us | }
>>> 0) ! 153.022 us | }
>>> 0) 0.963 us | journal_mark_dirty();
>>> 0) 0.925 us | __brelse();
>> Unfortunately, it lacks very useful "TASK-PID, CPU#, TIMESTAMP" header
>> fields..
>
> those obscure readability in the typical usecases, but you can get them
> anytime via using this existing trace_options runtime switch:
>
> echo funcgraph-proc > /debug/tracing/trace_options
>
> resulting in traces like this:
>
> # CPU TASK/PID OVERHEAD/DURATION FUNCTION CALLS
> # | | | | | | | |
> ------------------------------------------
> 0) distccd-28400 => cc1-30212
> ------------------------------------------
>
> 0) cc1-30212 | 0.270 us | }
> 0) cc1-30212 | | __do_fault() {
> 0) cc1-30212 | | filemap_fault() {
> 0) cc1-30212 | | find_lock_page() {
> 0) cc1-30212 | 0.453 us | find_get_page();
> 0) cc1-30212 | 0.997 us | }
> 0) cc1-30212 | | PageUptodate() {
> 0) cc1-30212 | 0.266 us | constant_test_bit();
> 0) cc1-30212 | 0.799 us | }
> 0) cc1-30212 | 0.379 us | mark_page_accessed();
> 0) cc1-30212 | 3.275 us | }
> 0) cc1-30212 | 0.276 us | _spin_lock();
> 0) cc1-30212 | 0.389 us | page_add_file_rmap();
> 0) cc1-30212 | | unlock_page() {
> 0) cc1-30212 | 0.266 us | page_waitqueue();
> 0) cc1-30212 | 0.381 us | __wake_up_bit();
> 0) cc1-30212 | 1.442 us | }
> 0) cc1-30212 | 6.897 us | }
> 0) cc1-30212 |+ 11.663 us | }
> 0) cc1-30212 | | up_read() {
> 0) cc1-30212 | 0.280 us | _spin_lock_irqsave();
This view is OK for us. I can suggest only two things:
1. Add an option to disable "OVERHEAD/DURATION". It isn't needed in our
case of troubleshooting SCSI command processing, so it would only hurt
readability and (I guess) add unneeded overhead.
2. If possible, make the view more compact, i.e. with fewer spaces on
each line. Our tracing lines can be quite long.
Thanks,
Vlad
Evgeniy Polyakov, on 12/24/2008 09:08 PM wrote:
> On Wed, Dec 24, 2008 at 08:46:56PM +0300, Vladislav Bolkhovitin ([email protected]) wrote:
>> I think in most cases there would be possibility to embed
>> sk_transaction_token to some higher level structure. E.g. Xen apparently
>> should have something to track packets passed through host/guest
>> boundary. From other side, kmem cache is too well polished to have much
>> overhead. I doubt, you would even notice it in this application. In most
>> cases allocation of such small object in it using SLUB is just about the
>> same as a list_del() under disabled IRQs.
>
> I definitely would not rely on that, especially at cache reclaim time.
> But it of course depends on the workload and maybe appropriate for the
> cases in question. The best solution I think is to combine tag and
> separate destructor, so that those who do not want to allocate a token
> could still get notification via destructor callback.
I agree that any additional allocation is something that should be
avoided, *if possible*. But you shouldn't overestimate the overhead of
the sk_transaction_token allocation in the cases where it would be
needed. First, sk_transaction_token is quite small, so a single page in
the kmem cache would hold about 100 of them, hence the slow allocation
path would be taken only once per 100 objects. Second, in many cases
->sendpage() needs to allocate a new skb, so there is already at least
one such allocation on the fast path.
Actually, it looks like the skb shared info destructor alone can't
solve the task we are solving, because we need to know not when an skb
transmission finished, but when transmission of our *set of pages*
finished. Hence, with the skb shared info destructor we would also need
to invent some way to track the set of pages <-> set of skbs translation
(you refer to it as combining a tag and a separate destructor), which
would bring this solution to an entirely new level of complexity for no
gain over the sk_transaction_token solution.
Thanks,
Vlad
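
The "about 100 per page" estimate above can be checked with a small user-space calculation. This is a rough sketch, assuming a 4096-byte page, a 4-byte atomic_t stand-in and no per-object allocator overhead (real slab layouts may differ); `SKETCH_PAGE_SIZE` and `tokens_per_page()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-in for the kernel's atomic_t (a wrapped int). */
typedef struct { int counter; } atomic_t;

struct sk_transaction_token {
	atomic_t io_count;
	struct sk_transaction_token *next;
	unsigned long token;
	unsigned long private;
	void (*finish_callback)(struct sk_transaction_token *);
};

/* Assumed page size; 4096 bytes is typical for x86 but not universal. */
#define SKETCH_PAGE_SIZE 4096u

/* How many tokens fit into one page if objects are packed back to back. */
static size_t tokens_per_page(void)
{
	return SKETCH_PAGE_SIZE / sizeof(struct sk_transaction_token);
}
```

On a typical 64-bit Linux build the structure is 40 bytes (4 bytes of counter, 4 of padding, and four 8-byte fields), so 102 objects fit in a page, which matches the estimate. On 32-bit builds the structure is smaller and more objects fit.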
On Tue, Dec 30, 2008 at 08:13:12PM +0300, Vladislav Bolkhovitin wrote:
> Ingo Molnar, on 12/27/2008 02:20 PM wrote:
>> * Vladislav Bolkhovitin <[email protected]> wrote:
>>
>>>> The word graph is actually here to explain here that we not only
>>>> trace each function call but we can actually retrieve all of the
>>>> call path of a function and then draw it as if it was C code:
>>>>
>>>> 0) ! 108.528 us | }
>>>> 0) | irq_exit() {
>>>> 0) | do_softirq() {
>>>> 0) | __do_softirq() {
>>>> 0) 0.895 us | __local_bh_disable();
>>>> 0) | run_timer_softirq() {
>>>> 0) 0.827 us | hrtimer_run_pending();
>>>> 0) 1.226 us | _spin_lock_irq();
>>>> 0) | _spin_unlock_irq() {
>>>> 0) 6.550 us | }
>>>> 0) 0.924 us | _local_bh_enable();
>>>> 0) + 12.129 us | }
>>>> 0) + 13.911 us | }
>>>> 0) 0.707 us | idle_cpu();
>>>> 0) + 17.009 us | }
>>>> 0) ! 137.419 us | }
>>>> 0) <========== |
>>>> 0) 1.045 us | }
>>>> 0) ! 148.908 us | }
>>>> 0) ! 151.022 us | }
>>>> 0) ! 153.022 us | }
>>>> 0) 0.963 us | journal_mark_dirty();
>>>> 0) 0.925 us | __brelse();
>>> Unfortunately, it lacks very useful "TASK-PID, CPU#, TIMESTAMP"
>>> header fields..
>>
>> those obscure readability in the typical usecases, but you can get them
>> anytime via using this existing trace_options runtime switch:
>>
>> echo funcgraph-proc > /debug/tracing/trace_options
>>
>> resulting in traces like this:
>>
>> # CPU TASK/PID OVERHEAD/DURATION FUNCTION CALLS
>> # | | | | | | | |
>> ------------------------------------------
>> 0) distccd-28400 => cc1-30212
>> ------------------------------------------
>>
>> 0) cc1-30212 | 0.270 us | }
>> 0) cc1-30212 | | __do_fault() {
>> 0) cc1-30212 | | filemap_fault() {
>> 0) cc1-30212 | | find_lock_page() {
>> 0) cc1-30212 | 0.453 us | find_get_page();
>> 0) cc1-30212 | 0.997 us | }
>> 0) cc1-30212 | | PageUptodate() {
>> 0) cc1-30212 | 0.266 us | constant_test_bit();
>> 0) cc1-30212 | 0.799 us | }
>> 0) cc1-30212 | 0.379 us | mark_page_accessed();
>> 0) cc1-30212 | 3.275 us | }
>> 0) cc1-30212 | 0.276 us | _spin_lock();
>> 0) cc1-30212 | 0.389 us | page_add_file_rmap();
>> 0) cc1-30212 | | unlock_page() {
>> 0) cc1-30212 | 0.266 us | page_waitqueue();
>> 0) cc1-30212 | 0.381 us | __wake_up_bit();
>> 0) cc1-30212 | 1.442 us | }
>> 0) cc1-30212 | 6.897 us | }
>> 0) cc1-30212 |+ 11.663 us | }
>> 0) cc1-30212 | | up_read() {
>> 0) cc1-30212 | 0.280 us | _spin_lock_irqsave();
>
> This view is OK for us, I can suggest only two things:
>
> 1. Add an option to disable "OVERHEAD/DURATION", it isn't needed in our
> case of SCSI commands processing troubleshooting, hence would only hurt
> readability and (I guess) add not needed overhead.
You can disable the overhead but not yet the duration.
But it's on my TODO list and should be available for 2.6.30
> 2. If possible, make the view more compact, i.e. with less spaces on
> each line. Our tracing lines can be quite long.
>
> Thanks,
> Vlad
>
I'm not sure... The spaces are mostly here to keep fixed width columns,
in case we have huge durations, or long cmdlines...
Hi Vlad.
On Tue, Dec 30, 2008 at 08:37:00PM +0300, Vladislav Bolkhovitin ([email protected]) wrote:
> I agree that any additional allocation is something that should be
> avoided, *if possible*. But you shouldn't overestimate the overhead of
> the sk_transaction_token allocation in the cases where it would be
> needed. First, sk_transaction_token is quite small, so a single page in
> the kmem cache would hold about 100 of them, hence the slow allocation
> path would be called only once per 100 objects. Second, in many cases
> ->sendpages() needs to allocate a new skb, so there is already at least
> one such allocation on the fast path.
Once per 100 objects? With millions of packets per second in extreme
cases this does not scale. Even the more common case of thousands of
ordinary packets per second with a 1.5k MTU will show it (especially the
freeing, actually).
Any additional overhead has to be avoided if possible, even if it looks
innocent.
The BSD guys already learned this lesson with packet processing tags at
every layer.
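To make the numbers in this exchange concrete, here is a rough back-of-the-envelope sketch in Python. It takes the quoted estimate (about 100 tokens per page, one slow-path allocation per page of objects) at face value; the 4096-byte page and the 40-byte token size are assumptions chosen only to land near that figure, and the function names are hypothetical.

```python
def objects_per_page(page_size, object_size):
    """How many fixed-size objects fit into one page (integer division)."""
    return page_size // object_size


def slow_path_refills_per_second(objects_per_second, per_page):
    """Slow-path allocations per second if one page is refilled per
    `per_page` objects, as assumed in the quoted estimate."""
    return objects_per_second / per_page


# Assumed sizes: 4096-byte page, 40-byte token (not taken from the patch).
PER_PAGE = objects_per_page(4096, 40)

# One token per packet at one million packets per second (extreme case).
EXTREME = slow_path_refills_per_second(1_000_000, PER_PAGE)

# A few thousand packets per second (the "usual" case mentioned above).
USUAL = slow_path_refills_per_second(5_000, PER_PAGE)
```

Under these assumptions the extreme case lands on the order of ten thousand slow-path refills per second; whether that matters in practice would have to be measured rather than estimated.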
> Actually, it looks like the skb shared info destructor alone can't
> solve the task we are solving, because we need to know not when an
> skb transmission finished, but when transmission of our *set of pages*
> finished. Hence, with the skb shared info destructor we would also need
> to invent some way to track the set of pages <-> set of skbs translation
> (you refer to it as combining a tag and a separate destructor), which
> would bring this solution to an entirely new level of complexity for no
> gain over the sk_transaction_token solution.
You really do not need to know when transmission is over, but when the
remote side acks it (or the connection is reset by the timeout). There is
no way to know when transmission is over without creating your own skbs
and submitting them, bypassing the usual TCP/IP stack machinery.
You do not need to know which skbs contain which pages; the system only
has to track page pointers freed at skb destruction (shared info
destruction, actually) time, no matter who owns those pages (since new
pages can be added into the page and some of the old ones can be freed
early).
This will effectively be the same token, but it does not mean that
everyone who needs notification will have to perform an additional
allocation. Put in two pointers, a destructor and a token, and do
whatever you like if one of them is non-empty, but try to avoid unneeded
overhead when it is possible.
--
Evgeniy Polyakov
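One way to picture the "two pointers" suggestion is the user-space Python model below. It is not kernel code and none of these names exist in the kernel: CompletionToken, SharedInfo, release_pages and submit are all hypothetical. Each modelled skb shared info optionally carries a destructor and a token, the destructor runs when the shared info is destroyed, and the submitter is notified once every page of its set has been released, without tracking which skb carried which page.

```python
class CompletionToken:
    """Counts outstanding page references for one transfer (hypothetical)."""

    def __init__(self, on_done):
        self.refs = 0
        self.on_done = on_done

    def get(self):
        self.refs += 1

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.on_done()


class SharedInfo:
    """Model of skb shared info carrying two optional pointers."""

    def __init__(self, pages, destructor=None, token=None):
        self.pages = list(pages)
        self.destructor = destructor
        self.token = token

    def destroy(self):
        # Common case: nobody asked for notification, so no extra work.
        if self.destructor is None and self.token is None:
            return
        if self.destructor is not None:
            self.destructor(self.pages, self.token)


def release_pages(pages, token):
    """Destructor: drop one token reference per page being freed."""
    for _ in pages:
        token.put()


def submit(page_groups, on_done):
    """Spread one set of pages over several modelled skbs."""
    token = CompletionToken(on_done)
    token.get()  # submitter's own reference, held until all skbs exist
    skbs = []
    for pages in page_groups:
        for _ in pages:
            token.get()
        skbs.append(SharedInfo(pages, release_pages, token))
    token.put()  # from here on, only in-flight pages keep the token alive
    return skbs
```

Shared info created without either pointer takes the early return in destroy(), which corresponds to the requirement that users who do not need notification pay nothing extra.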
On Tue, 2008-12-30 at 22:03 +0100, Frederic Weisbecker wrote:
>
> > 2. If possible, make the view more compact, i.e. with less spaces on
> > each line. Our tracing lines can be quite long.
> >
> > Thanks,
> > Vlad
> >
>
> I'm not sure... The spaces are mostly here to keep fixed width columns,
> in case we have huge durations, or long cmdlines...
>
I agree. The spaces can easily be chopped with any one of many shell
scripts.
-- Steve
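The kind of post-processing meant here could look like the following minimal Python sketch (a shell one-liner would do equally well). It collapses runs of spaces and tabs in each trace line; note that this also discards the indentation that shows call nesting, so it only suits cases where that nesting is not needed. The function name is hypothetical.

```python
import re

# Two or more consecutive spaces or tabs.
_RUN_OF_BLANKS = re.compile(r"[ \t]{2,}")


def compact_trace_line(line):
    """Strip the line and collapse every run of blanks into one space."""
    return _RUN_OF_BLANKS.sub(" ", line.strip())
```

Applied to each line read from the trace file, this keeps the column separators but removes the fixed-width padding.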
* Frédéric Weisbecker <[email protected]> wrote:
> > All the above functionality is almost what we need. The only thing
> > left, which I forgot to mention, is possibility to log also functions
> > return value on exit. This is what TRACE_EXIT_RES() in SCST does. Is
> > it possible to add those?
>
> I want to add that on the function graph tracer. That can be done pretty
> easily. The only problem comes with the type of the return value. Would
> this tracer be supposed to always return a 64 bits value regardless of
> the real type of the value? There would be some pointless bytes on most
> return values. I don't know how to proceed for this problem.
Things like mov ...,%eax zero-extend, so they'll zap the high 32 bits.
The real problem is byte return values generated via things like:
movb $1, %al
those won't zero-extend, so you could get garbage in the output. One
approach would be to try a quick hack just to see how common a problem
this is.
We could extract the return type from the debuginfo, hash it in a
read-mostly table and then look it up, but that seems complex both in
terms of build overhead and in terms of runtime overhead.
Ingo
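The difference between the two cases can be modelled in a few lines of Python. This is only a simulation of x86-64 register semantics (a 32-bit write clears the upper half of the 64-bit register, an 8-bit write leaves the other bits untouched), not tracer code; the helper names and the stale register value are made up for illustration.

```python
MASK64 = (1 << 64) - 1


def write_eax(rax, value):
    """32-bit write: the upper 32 bits of the 64-bit register become zero."""
    return value & 0xFFFFFFFF


def write_al(rax, value):
    """8-bit write: only the lowest byte changes, the rest stays stale."""
    return (rax & ~0xFF & MASK64) | (value & 0xFF)


def mask_to_width(rax, nbytes):
    """What a tracer could print if it knew the return type's size."""
    return rax & ((1 << (8 * nbytes)) - 1)


# Arbitrary leftover register contents before the function returns.
STALE = 0xDEADBEEFCAFEBABE

AFTER_MOV_EAX = write_eax(STALE, 1)   # clean value
AFTER_MOVB_AL = write_al(STALE, 1)    # low byte is 1, the rest is garbage
```

Printing AFTER_MOVB_AL as a full machine word shows the garbage; masking it to one byte recovers the intended value, but only if the return type is known, which is the hard part discussed above.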
2008/12/16 Ingo Molnar <[email protected]>:
>
> * Frédéric Weisbecker <[email protected]> wrote:
>
>> > All the above functionality is almost what we need. The only thing
>> > left, which I forgot to mention, is possibility to log also functions
>> > return value on exit. This is what TRACE_EXIT_RES() in SCST does. Is
>> > it possible to add those?
>>
>> I want to add that on the function graph tracer. That can be done pretty
>> easily. The only problem comes with the type of the return value. Would
>> this tracer be supposed to always return a 64 bits value regardless of
>> the real type of the value? There would be some pointless bytes on most
>> return values. I don't know how to proceed for this problem.
>
> Things like mov ...,%eax are zero-extend so they'll zap the high 32 bits.
That's right, but the problem occurs under 32 bits. There, 64-bit return
values are in eax and edx. And most of the time, the high part (edx) will
be junk.
> The real problem are byte return values generated via things like:
>
> movb $1, %al
>
> those won't zero-extend, so you could get garbage in the output. One
> approach would be to try a quick hack just to see how common a problem
> this is.
Yes, I will try something.
> We could extract the return type from the debuginfo, hash it in a
> read-mostly table and then look it up, but that seems complex both in
> terms of build overhead and in terms of runtime overhead.
I thought about it too and as you say it's rather complex. And thinking
about non primitive types (like pid_t....) that would require a second
pass of analysis to retrieve the corresponding primitive...
* Frédéric Weisbecker <[email protected]> wrote:
> 2008/12/16 Ingo Molnar <[email protected]>:
> >
> > * Frédéric Weisbecker <[email protected]> wrote:
> >
> >> > All the above functionality is almost what we need. The only thing
> >> > left, which I forgot to mention, is possibility to log also functions
> >> > return value on exit. This is what TRACE_EXIT_RES() in SCST does. Is
> >> > it possible to add those?
> >>
> >> I want to add that on the function graph tracer. That can be done pretty
> >> easily. The only problem comes with the type of the return value. Would
> >> this tracer be supposed to always return a 64 bits value regardless of
> >> the real type of the value? There would be some pointless bytes on most
> >> return values. I don't know how to proceed for this problem.
> >
> > Things like mov ...,%eax are zero-extend so they'll zap the high 32 bits.
>
>
> That's right, but the problem occurs under 32 bits. The return values
> for 64 bits are in eax and edx. And most of the time, the high part
> (edx) will be junk.
for wider types i'd suggest to just print the low bits. Most of the
interesting return types fit into machine word.
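As an illustration of the 32-bit case, the sketch below models a tracer that always reads the edx:eax pair. It is plain Python arithmetic with hypothetical helper names, not kernel code: for a function that really returns a 32-bit value, edx holds whatever was left there, so only the low word is meaningful.

```python
def combine_edx_eax(edx, eax):
    """Value seen when the register pair is read as one 64-bit result."""
    return ((edx & 0xFFFFFFFF) << 32) | (eax & 0xFFFFFFFF)


def low_word(edx, eax):
    """The "just print the low bits" approach: ignore edx entirely."""
    return eax & 0xFFFFFFFF


# A function returning the int 5 while edx still holds stale data.
JUNK_EDX = 0xDEADBEEF
AS_PAIR = combine_edx_eax(JUNK_EDX, 5)   # polluted by the junk in edx
AS_LOW = low_word(JUNK_EDX, 5)           # the real return value

# A function that really returns the 64-bit value 2**32.
WIDE_PAIR = combine_edx_eax(1, 0)
WIDE_LOW = low_word(1, 0)                # high part is lost
```

The trade-off is visible in the last two lines: printing only the low word is right for word-sized return types and silently truncates genuinely wider ones.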
> > The real problem are byte return values generated via things like:
> >
> > movb $1, %al
> >
> > those won't zero-extend, so you could get garbage in the output. One
> > approach would be to try a quick hack just to see how common a problem
> > this is.
>
>
> Yes, I will try something.
>
> > We could extract the return type from the debuginfo, hash it in a
> > read-mostly table and then look it up, but that seems complex both in
> > terms of build overhead and in terms of runtime overhead.
>
> I thought about it too and as you say it's rather complex. And thinking
> about non primitive types (like pid_t....) that would require a second
> pass of analysis to retrieve the corresponding primitive...
structure returns would be rather evil to handle, agreed.
Unrelated:
it would be really nice if we could extend ftrace to trace system calls
and their parameters and return values. We used to have that in the
latency-tracer in -rt, and it was rather useful. Controlled by a
trace_option i guess - and blendable into any of the tracer outputs
(function tracer most notably).
Explicit calls in the entry.S files would be OK for this - but maybe we
can get this via the tricky use of a TIF_ flag as well, to force the code
into the ptrace callbacks and then divert it for ftrace's pleasure?
Ingo
2008/12/16 Ingo Molnar <[email protected]>:
>
> * Frédéric Weisbecker <[email protected]> wrote:
>
>> 2008/12/16 Ingo Molnar <[email protected]>:
>> >
>> > * Frédéric Weisbecker <[email protected]> wrote:
>> >
>> >> > All the above functionality is almost what we need. The only thing
>> >> > left, which I forgot to mention, is possibility to log also functions
>> >> > return value on exit. This is what TRACE_EXIT_RES() in SCST does. Is
>> >> > it possible to add those?
>> >>
>> >> I want to add that on the function graph tracer. That can be done pretty
>> >> easily. The only problem comes with the type of the return value. Would
>> >> this tracer be supposed to always return a 64 bits value regardless of
>> >> the real type of the value? There would be some pointless bytes on most
>> >> return values. I don't know how to proceed for this problem.
>> >
>> > Things like mov ...,%eax are zero-extend so they'll zap the high 32 bits.
>>
>>
>> That's right, but the problem occurs under 32 bits. The return values
>> for 64 bits are in eax and edx. And most of the time, the high part
>> (edx) will be junk.
>
> for wider types i'd suggest to just print the low bits. Most of the
> interesting return types fit into machine word.
>
>> > The real problem are byte return values generated via things like:
>> >
>> > movb $1, %al
>> >
>> > those won't zero-extend, so you could get garbage in the output. One
>> > approach would be to try a quick hack just to see how common a problem
>> > this is.
>>
>>
>> Yes, I will try something.
>>
>> > We could extract the return type from the debuginfo, hash it in a
>> > read-mostly table and then look it up, but that seems complex both in
>> > terms of build overhead and in terms of runtime overhead.
>>
>> I thought about it too and as you say it's rather complex. And thinking
>> about non primitive types (like pid_t....) that would require a second
>> pass of analysis to retrieve the corresponding primitive...
>
> structure returns would be rather evil to handle, agreed.
>
> Unrelated:
>
> it would be really nice if we could extend ftrace to trace system calls
> and their parameters and return values. We used to have that in the
> latency-tracer in -rt, and it was rather useful. Controlled by a
> trace_option i guess - and blendable into any of the tracer outputs
> (function tracer most notably).
>
> Explicit calls in the entry.S files would be OK for this - but maybe we
> can get this via the tricky use of a TIF_ flags as well, to force the code
> into the ptrace callbacks and then divert it for ftrace's pleasure?
>
> Ingo
>
I recently thought about making a syscall tracer and only thought about
tracepoints. But it could be better to trace the syscalls and all the
functions they call with the function graph tracer. I am thinking of the
"graph" one because measuring execution time could be useful for -rt
development.
I think that putting new calls in entryxx.S would be overkill and would
overlap with what ptrace already does on syscall entry/exit.
So I like the idea of using the ptrace callbacks by manipulating the TIF_ flags.
Yeah that's interesting. I will do that for 2.6.30 :-)
Dec 15 17:09:48 linux-c07e kernel: [ 738.535571] [2519]: ENTRY dev_user_ioctl
Dec 15 17:09:48 linux-c07e kernel: [ 738.535582] [2519]: dev_user_ioctl:1735:REGISTER_DEVICE
Dec 15 17:09:48 linux-c07e kernel: [ 738.535589] [2519]: ENTRY dev_user_register_dev
Dec 15 17:09:48 linux-c07e kernel: [ 738.535601] [2519]: ENTRY sgv_pool_create
Dec 15 17:09:48 linux-c07e kernel: [ 738.535610] [2519]: ENTRY sgv_pool_init
Dec 15 17:09:48 linux-c07e kernel: [ 738.535616] [2519]: sgv_pool_init:997:name dev0, sizeof(*obj)=52, clustering_type=0
Dec 15 17:09:48 linux-c07e kernel: [ 738.535626] [2519]: sgv_pool_init:1026:pages=1, size=76
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:1026:pages=2, size=100
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:1026:pages=4, size=148
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:1026:pages=8, size=244
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:1026:pages=16, size=436
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:1026:pages=32, size=820
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:1026:pages=64, size=1588
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:1026:pages=128, size=3124
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:1026:pages=256, size=52
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:1026:pages=512, size=52
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:1026:pages=1024, size=52
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: EXIT sgv_pool_init: 0
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: EXIT sgv_pool_create: 1
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: ENTRY sgv_pool_create
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: ENTRY sgv_pool_init
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:997:name dev0-clust, sizeof(*obj)=52, clustering_type=1
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:1026:pages=1, size=80
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:1026:pages=2, size=108
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:1026:pages=4, size=164
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:1026:pages=8, size=276
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:1026:pages=16, size=500
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:1026:pages=32, size=948
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:1026:pages=64, size=1844
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:1026:pages=128, size=3636
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:1026:pages=256, size=1076
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:1026:pages=512, size=2100
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: sgv_pool_init:1026:pages=1024, size=52
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: EXIT sgv_pool_init: 0
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: EXIT sgv_pool_create: 1
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: ENTRY __dev_user_set_opt
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: __dev_user_set_opt:2580:dev dev0, parse_type 0, on_free_cmd_type 1, memory_reuse_type 1, partial_transfers_type 0, partial_len 0
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: ENTRY dev_user_setup_functions
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: EXIT dev_user_setup_functions
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: EXIT __dev_user_set_opt: 0
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: dev_user_register_dev:2499:dev b7bffad0, name dev0
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: ENTRY __scst_register_virtual_dev_driver
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: EXIT scst_dev_handler_check: 0
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: scst: __scst_register_virtual_dev_driver:1036:Virtual device handler dh-dev0 for type 1 registered successfully
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: EXIT __scst_register_virtual_dev_driver: 0
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: ENTRY scst_register_virtual_device
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: EXIT scst_dev_handler_check: 0
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: ENTRY scst_suspend_activity
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: scst_suspend_activity:484:suspend_count 0
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: ENTRY scst_susp_wait
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: scst_susp_wait:463:wait_event() returned 0
Dec 15 17:09:48 linux-c07e kernel: [ 738.538862] [2519]: EXIT scst_susp_wait: 0
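Logs of this ENTRY/EXIT style can be folded into an indented call tree, similar in spirit to the function graph output discussed earlier in the thread. The following Python sketch assumes the line layout seen in the excerpt above (a "[pid]:" field followed by ENTRY or EXIT and a function name, with an optional ": value" on EXIT); the regular expression and the function name are assumptions, and EXIT lines with no matching ENTRY are kept as comments rather than guessed at.

```python
import re

# "[2519]: ENTRY func" or "[2519]: EXIT func" or "[2519]: EXIT func: 0"
LINE_RE = re.compile(
    r"\[(?P<pid>\d+)\]: (?P<kind>ENTRY|EXIT) (?P<func>\w+)"
    r"(?:: (?P<ret>-?\d+))?\s*$"
)


def render_call_tree(lines):
    """Return indented text lines built from ENTRY/EXIT log messages."""
    stack = []   # functions entered but not yet exited
    out = []
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # ordinary trace message, not an ENTRY/EXIT marker
        func = m.group("func")
        if m.group("kind") == "ENTRY":
            out.append("  " * len(stack) + func + "() {")
            stack.append(func)
            continue
        ret = m.group("ret")
        label = func if ret is None else func + " = " + ret
        if func in stack:
            # Unwind to the matching ENTRY, dropping any unclosed callees.
            while stack.pop() != func:
                pass
            out.append("  " * len(stack) + "} /* " + label + " */")
        else:
            out.append("  " * len(stack) + "/* unmatched EXIT " + label + " */")
    return out
```

In the excerpt, scst_dev_handler_check logs an EXIT without a preceding ENTRY, which is why unmatched exits are handled explicitly instead of being allowed to shift the indentation.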
Hello linux-mm,
Recently I submitted a new SCSI target framework (SCST) and 4 target
drivers for it for the first iteration of review and comments. See
http://lkml.org/lkml/2008/12/10/245 for details.
An iSCSI target driver, iSCSI-SCST, was part of the patchset
(http://lkml.org/lkml/2008/12/10/293). For it, a nice optimization was
implemented: TCP zero-copy transmit of user space data. The patch
implementing this optimization was also sent in the patchset, see
http://lkml.org/lkml/2008/12/10/296.
I would like to ask whether the approach used in this patch is
acceptable from your point of view. I understand that extending struct
page is very much undesirable, but, on the other hand:
- This approach is very simple and straightforward. The patch is only
309 lines long, including comments. All other alternative
implementations would be at least an order of magnitude more complicated.
- The related kernel config option
TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION should be disabled by
default in general distro kernels, so there would be no harm at all from
this patch. iSCSI-SCST can work without this patch or with the
TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION option disabled, although
with user space device handlers it will work considerably worse. Only a
few distro kernel users need an iSCSI target, and only a few among those
users need user space device handlers. People who need both an iSCSI
target *and* fast user space device handlers would simply enable
that option and rebuild the kernel. Rejecting this patch provides a much
worse alternative: those people would first have to *patch* the kernel,
only then enable that option, and then rebuild the kernel.
- Although using struct page to keep a network-related pointer might
look like a layering violation, it isn't. I explained why in
http://lkml.org/lkml/2008/12/15/190.
Thanks,
Vlad
On 12/18/2008 12:35 PM, Vladislav Bolkhovitin wrote:
> An iSCSI target driver iSCSI-SCST was a part of the patchset
> (http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to
> have TCP zero-copy transmit of user space data was implemented. Patch,
> implementing this optimization was also sent in the patchset, see
> http://lkml.org/lkml/2008/12/10/296.
I'm probably ignorant of about 90% of the context here, but isn't this the
sort of problem that was supposed to have been solved by vmsplice(2)?
- DML