Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S261867AbVANC6T (ORCPT ); Thu, 13 Jan 2005 21:58:19 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S261876AbVANC6S (ORCPT ); Thu, 13 Jan 2005 21:58:18 -0500 Received: from opersys.com ([64.40.108.71]:51983 "EHLO www.opersys.com") by vger.kernel.org with ESMTP id S261867AbVANC44 (ORCPT ); Thu, 13 Jan 2005 21:56:56 -0500 Message-ID: <41E736BA.5070806@opersys.com> Date: Thu, 13 Jan 2005 22:04:26 -0500 From: Karim Yaghmour Reply-To: karim@opersys.com Organization: Opersys inc. User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.2) Gecko/20040805 Netscape/7.2 X-Accept-Language: en-us, en, fr, fr-be, fr-ca, fr-fr MIME-Version: 1.0 To: linux-kernel , LTT-Dev Subject: [PATCH 1/4] relayfs for 2.6.10: doc Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 38943 Lines: 835 This and the following are the relayfs patches for 2.6.10. Unsplit copies are available here: http://www.opersys.com/relayfs/index.html Comments welcomed as usual. Signed-off-by: Karim Yaghmour (karim@opersys.com) Karim P.S.: I was told earlier that my mail client did some whitespace mangling. I believe this has been corrected. Let me know otherwise. --- linux-2.6.10/Documentation/filesystems/relayfs.txt 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.10-relayfs/Documentation/filesystems/relayfs.txt 2005-01-13 10:40:25.000000000 -0800 @@ -0,0 +1,812 @@ + +relayfs - a high-speed data relay filesystem +============================================ + +relayfs is a filesystem designed to provide an efficient mechanism for +tools and facilities to relay large amounts of data from kernel space +to user space. + +The main idea behind relayfs is that every data flow is put into a +separate "channel" and each channel is a file. In practice, each +channel is a separate memory buffer allocated from within kernel space +upon channel instantiation. Software needing to relay data to user +space would open a channel or a number of channels, depending on its +needs, and would log data to that channel. All the buffering and +locking mechanics are taken care of by relayfs. The actual format and +protocol used for each channel is up to relayfs' clients. + +relayfs makes no provisions for copying the same data to more than a +single channel. This is for the clients of the relay to take care of, +and so is any form of data filtering. The purpose is to keep relayfs +as simple as possible. + + +Usage +===== + +In addition to the relayfs kernel API described below, relayfs +implements basic file operations. Here are the file operations that +are available and some comments regarding their behavior: + +open() enables user to open an _existing_ channel. A channel can be + opened in blocking or non-blocking mode, and can be opened + for reading as well as for writing. Readers will by default + be auto-consuming. + +mmap() results in channel's memory buffer being mmapped into the + caller's memory space. + +read() since we are dealing with circular buffers, the user is only + allowed to read forward. Some apps may want to loop around + read() waiting for incoming data - if there is no data + available, read will put the reader on a wait queue until + data is available (blocking mode). Non-blocking reads return + -EAGAIN if data is not available. + + +write() writing from user space operates exactly as relay_write() does + (described below). + +poll() POLLIN/POLLRDNORM/POLLOUT/POLLWRNORM/POLLERR supported. + +close() decrements the channel's refcount. When the refcount reaches + 0 i.e. when no process or kernel client has the file open + (see relay_close() below), the channel buffer is freed. + + +In order for a user application to make use of relayfs files, the +relayfs filesystem must be mounted. For example, + + mount -t relayfs relayfs /mountpoint + + +The relayfs kernel API +====================== + +relayfs channels are implemented as circular buffers subdivided into +'sub-buffers'. kernel clients write data into the channel using +relay_write(), and are notified via a set of callbacks when +significant events occur within the channel. 'Significant events' +include: + +- a sub-buffer has been filled i.e. the current write won't fit into the + current sub-buffer, and a 'buffer-switch' is triggered, after which + the data is written into the next buffer (if the next buffer is + empty). The client is notified of this condition via two callbacks, + one providing an opportunity to perform start-of-buffer tasks, the + other end-of-buffer tasks. + +- data is ready for the client to process. The client can choose to + be notified either on a per-sub-buffer basis (bulk delivery) or + per-write basis (packet delivery). + +- data has been written to the channel from user space. The client can + use this notification to accept and process 'commands' sent to the + channel via write(2). + +- the channel has been opened/closed/mapped/unmapped from user space. + The client can use this notification to trigger actions within the + kernel application, such as enabling/disabling logging to the + channel. It can also return result codes from the callback, + indicating that the operation should fail e.g. in order to restrict + more than one user space open or mmap. + +- the channel needs resizing, or needs to update its + state based on the results of the resize. Resizing the channel is + up to the kernel client to actually perform. If the channel is + configured for resizing, the client is notified when the unread data + in the channel passes a preset threshold, giving it the opportunity + to allocate a new channel buffer and replace the old one. + +Reader objects +-------------- + +Channel readers use an opaque rchan_reader object to read from +channels. For VFS readers (those using read(2) to read from a +channel), these objects are automatically created and used internally; +only kernel clients that need to directly read from channels, or whose +userspace applications use mmap to access channel data, need to know +anything about rchan_readers - others may skip this section. + +A relay channel can have any number of readers, each represented by an +rchan_reader instance, which is used to encapsulate reader settings +and state. rchan_reader objects should be treated as opaque by kernel +clients. To create a reader object for directly accessing a channel +from kernel space, call the add_rchan_reader() kernel API function: + +rchan_reader *add_rchan_reader(rchan_id, auto_consume) + +This function returns an rchan_reader instance if successful, which +should then be passed to relay_read() when the kernel client is +interested in reading from the channel. + +The auto_consume parameter indicates whether a read done by this +reader will automatically 'consume' that portion of the unread channel +buffer when relay_read() is called (see below for more details). + +To close the reader, call + +remove_rchan_reader(reader) + +which will remove the reader from the list of current readers. + + +To create a reader object representing a userspace mmap reader in the +kernel application, call the add_map_reader() kernel API function: + +rchan_reader *add_map_reader(rchan_id) + +This function returns an rchan_reader instance if successful, whose +main purpose is as an argument to be passed into +relay_buffers_consumed() when the kernel client becomes aware that +data has been read by a user application using mmap to read from the +channel buffer. There is no auto_consume option in this case, since +only the kernel client/user application knows when data has been read. + +To close the map reader, call + +remove_map_reader(reader) + +which will remove the reader from the list of current readers. + +Consumed count +-------------- + +A relayfs channel is a circular buffer, which means that if there is +no reader reading from it or a reader reading too slowly, at some +point the channel writer will 'lap' the reader and data will be lost. +In normal use, readers will always be able to keep up with writers and +the buffer is thus never in danger of becoming full. In many +applications, it's sufficient to ensure that this is practically +speaking always the case, by making the buffers large enough. These +types of applications can basically open the channel as +RELAY_MODE_CONTINOUS (the default anyway) and not worry about the +meaning of 'consume' and skip the rest of this section. + +If it's important for the application that a kernel client never allow +writers to overwrite unread data, the channel should be opened using +RELAY_MODE_NO_OVERWRITE and must be kept apprised of the count of +bytes actually read by the (typically) user-space channel readers. +This count is referred to as the 'consumed count'. read(2) channel +readers automatically update the channel's 'consumed count' as they +read. If the usage mode is to have only read(2) readers, which is +typically the case, the kernel client doesn't need to worry about any +of the relayfs functions having to do with 'bytes consumed' and can +skip the rest of this section. (Note that it is possible to have +multiple read(2) or auto-consuming readers, but like having multiple +readers on a pipe, these readers will race with each other i.e. it's +supported, but doesn't make much sense). + +If the kernel client cannot rely on an auto-consuming reader to keep +the 'consumed count' up-to-date, then it must do so manually, by +making the appropriate calls to relay_buffers_consumed() or +relay_bytes_consumed(). In most cases, this should only be necessary +for bulk mmap clients - almost all packet clients should be covered by +having auto-consuming read(2) readers. For mmapped bulk clients, for +instance, there are no auto-consuming VFS readers, so the kernel +client needs to make the call to relay_buffers_consumed() after +sub-buffers are read. + +Kernel API +---------- + +Here's a summary of the API relayfs provides to in-kernel clients: + +int relay_open(channel_path, bufsize, nbufs, channel_flags, + channel_callbacks, start_reserve, end_reserve, + rchan_start_reserve, resize_min, resize_max, mode, + init_buf, init_buf_size) +int relay_write(channel_id, *data_ptr, count, time_delta_offset, **wrote) +rchan_reader *add_rchan_reader(channel_id, auto_consume) +int remove_rchan_reader(rchan_reader *reader) +rchan_reader *add_map_reader(channel_id) +int remove_map_reader(rchan_reader *reader) +int relay_read(reader, buf, count, wait, *actual_read_offset) +void relay_buffers_consumed(reader, buffers_consumed) +void relay_bytes_consumed(reader, bytes_consumed, read_offset) +int relay_bytes_avail(reader) +int rchan_full(reader) +int rchan_empty(reader) +int relay_info(channel_id, *channel_info) +int relay_close(channel_id) +int relay_realloc_buffer(channel_id, nbufs, async) +int relay_replace_buffer(channel_id) +int relay_reset(int rchan_id) + +---------- +int relay_open(channel_path, bufsize, nbufs, + channel_flags, channel_callbacks, start_reserve, + end_reserve, rchan_start_reserve, resize_min, resize_max, mode) + +relay_open() is used to create a new entry in relayfs. This new entry +is created according to channel_path. channel_path contains the +absolute path to the channel file on relayfs. If, for example, the +caller sets channel_path to "/xlog/9", a "xlog/9" entry will appear +within relayfs automatically and the "xlog" directory will be created +in the filesystem's root. relayfs does not implement any policy on +its content, except to disallow the opening of two channels using the +same file. There are, nevertheless a set of guidelines for using +relayfs. Basically, each facility using relayfs should use a top-level +directory identifying it. The entry created above, for example, +presumably belongs to the "xlog" software. + +The remaining parameters for relay_open() are as follows: + +- channel_flags - an ORed combination of attribute values controlling + common channel characteristics: + + - logging scheme - relayfs use 2 mutually exclusive schemes + for logging data to a channel. The 'lockless scheme' + reserves and writes data to a channel without the need of + any type of locking on the channel. This is the preferred + scheme, but may not be available on a given architecture (it + relies on the presence of a cmpxchg instruction). It's + specified by the RELAY_SCHEME_LOCKLESS flag. The 'locking + scheme' either obtains a lock on the channel for writing or + disables interrupts, depending on whether the channel was + opened for SMP or global usage (see below). It's specified + by the RELAY_SCHEME_LOCKING flag. While a client may want + to explicitly specify a particular scheme to use, it's more + convenient to specify RELAY_SCHEME_ANY for this flag, which + will allow relayfs to choose the best available scheme i.e. + lockless if supported. + + - overwrite mode (default is RELAY_MODE_CONTINUOUS) - + If RELAY_MODE_CONTINUOUS is specified, writes to the channel + will succeed regardless of whether there are up-to-date + consumers or not. If RELAY_MODE_NO_OVERWRITE is specified, + the channel becomes 'full' when the total amount of buffer + space unconsumed by readers equals or exceeds the total + buffer size. With the buffer in this state, writes to the + buffer will fail - clients need to check the return code from + relay_write() to determine if this is the case and act + accordingly - 0 or a negative value indicate the write failed. + + - SMP usage - this applies only when the locking scheme is in + use. If RELAY_USAGE_SMP is specified, it's assumed that the + channel will be used in a per-CPU fashion and consequently, + the only locking that will be done for writes is to disable + local irqs. If RELAY_USAGE_GLOBAL is specified, it's assumed + that writes to the buffer can occur within any CPU context, + and spinlock_irq_save will be used to lock the buffer. + + - delivery mode - if RELAY_DELIVERY_BULK is specified, the + client will be notified via its deliver() callback whenever a + sub-buffer has been filled. Alternatively, + RELAY_DELIVERY_PACKET will cause delivery to occur after the + completion of each write. See the description of the channel + callbacks below for more details. + + - timestamping - if RELAY_TIMESTAMP_TSC is specified and the + architecture supports it, efficient TSC 'timestamps' can be + associated with each write, otherwise more expensive + gettimeofday() timestamping is used. At the beginning of + each sub-buffer, a gettimeofday() timestamp and the current + TSC, if supported, are read, and are passed on to the client + via the buffer_start() callback. This allows correlation of + the current time with the current TSC for subsequent writes. + Each subsequent write is associated with a 'time delta', + which is either the current TSC, if the channel is using + TSCs, or the difference between the buffer_start gettimeofday + timestamp and the gettimeofday time read for the current + write. Note that relayfs never writes either a timestamp or + time delta into the buffer unless explicitly asked to (see + the description of relay_write() for details). + +- bufsize - the size of the 'sub-buffers' making up the circular channel + buffer. For the lockless scheme, this must be a power of 2. + +- nbufs - the number of 'sub-buffers' making up the circular + channel buffer. This must be a power of 2. + + The total size of the channel buffer is bufsize * nbufs rounded up + to the next kernel page size. If the lockless scheme is used, both + bufsize and nbufs must be a power of 2. If the locking scheme is + used, the bufsize can be anything and nbufs must be a power of 2. If + RELAY_SCHEME_ANY is used, the bufsize and nbufs should be a power of 2. + + NOTE: if nbufs is 1, relayfs will bypass the normal size + checks and will allocate an rvmalloced buffer of size bufsize. + This buffer will be freed when relay_close() is called, if the channel + isn't still being referenced. + +- callbacks - a table of callback functions called when events occur + within the data relay that clients need to know about: + + - int buffer_start(channel_id, current_write_pos, buffer_id, + start_time, start_tsc, using_tsc) - + + called at the beginning of a new sub-buffer, the + buffer_start() callback gives the client an opportunity to + write data into space reserved at the beginning of a + sub-buffer. The client should only write into the buffer + if it specified a value for start_reserve and/or + channel_start_reserve (see below) when the channel was + opened. In the latter case, the client can determine + whether to write its one-time rchan_start_reserve data by + examining the value of buffer_id, which will be 0 for the + first sub-buffer. The address that the client can write + to is contained in current_write_pos (the client by + definition knows how much it can write i.e. the value it + passed to relay_open() for start_reserve/ + channel_start_reserve). start_time contains the + gettimeofday() value for the start of the buffer and start + TSC contains the TSC read at the same time. The using_tsc + param indicates whether or not start_tsc is valid (it + wouldn't be if TSC timestamping isn't being used). + + The client should return the number of bytes it wrote to + the channel, 0 if none. + + - int buffer_end(channel_id, current_write_pos, end_of_buffer, + end_time, end_tsc, using_tsc) + + called at the end of a sub-buffer, the buffer_end() + callback gives the client an opportunity to perform + end-of-buffer processing. Note that the current_write_pos + is the position where the next write would occur, but + since the current write wouldn't fit (which is the trigger + for the buffer_end event), the buffer is considered full + even though there may be unused space at the end. The + end_of_buffer param pointer value can be used to determine + exactly the size of the unused space. The client should + only write into the buffer if it specified a value for + end_reserve when the channel was opened. If the client + doesn't write anything i.e. returns 0, the unused space at + the end of the sub-buffer is available via relay_info() - + this data may be needed by the client later if it needs to + process raw sub-buffers (an alternative would be to save + the unused bytes count value in end_reserve space at the + end of each sub-buffer during buffer_end processing and + read it when needed at a later time. The other + alternative would be to use read(2), which makes the + unused count invisible to the caller). end_time contains + the gettimeofday() value for the end of the buffer and end + TSC contains the TSC read at the same time. The using_tsc + param indicates whether or not end_tsc is valid (it + wouldn't be if TSC timestamping isn't being used). + + The client should return the number of bytes it wrote to + the channel, 0 if none. + + - void deliver(channel_id, from, len) + + called when data is ready for the client. This callback + is used to notify a client when a sub-buffer is complete + (in the case of bulk delivery) or a single write is + complete (packet delivery). A bulk delivery client might + wish to then signal a daemon that a sub-buffer is ready. + A packet delivery client might wish to process the packet + or send it elsewhere. The from param is a pointer to the + delivered data and len specifies how many bytes are ready. + + - void user_deliver(channel_id, from, len) + + called when data has been written to the channel from user + space. This callback is used to notify a client when a + successful write from userspace has occurred, independent + of whether bulk or packet delivery is in use. This can be + used to allow userspace programs to communicate with the + kernel client through the channel via out-of-band write(2) + 'commands' instead of via ioctls, for instance. The from + param is a pointer to the delivered data and len specifies + how many bytes are ready. Note that this callback occurs + after the bytes have been successfully written into the + channel, which means that channel readers must be able to + deal with the 'command' data which will appear in the + channel data stream just as any other userspace or + non-userspace write would. + + - int needs_resize(channel_id, resize_type, + suggested_buf_size, suggested_n_bufs) + + called when a channel's buffers are in danger of becoming + full i.e. the number of unread bytes in the channel passes + a preset threshold, or when the current capacity of a + channel's buffer is no longer needed. Also called to + notify the client when a channel's buffer has been + replaced. If resize_type is RELAY_RESIZE_EXPAND or + RELAY_RESIZE_SHRINK, the kernel client should arrange to + call relay_realloc_buffer() with the suggested buffer size + and buffer count, which will allocate (but will not + replace the old one) a new buffer of the recommended size + for the channel. When the allocation has completed, + needs_resize() is again called, this time with a + resize_type of RELAY_RESIZE_REPLACE. The kernel client + should then arrange to call relay_replace_buffer() to + actually replace the old channel buffer with the newly + allocated buffer. Finally, once the buffer replacement + has completed, needs_resize() is again called, this time + with a resize_type of RELAY_RESIZE_REPLACED, to inform the + client that the replacement is complete and additionally + confirming the current sub-buffer size and number of + sub-buffers. Note that a resize can be canceled if + relay_realloc_buffer() is called with the async param + non-zero and the resize conditions no longer hold. In + this case, the RELAY_RESIZE_REPLACED suggested number of + sub-buffers will be the same as the number of sub-buffers + that existed before the RELAY_RESIZE_SHRINK or EXPAND i.e. + values indicating that the resize didn't actually occur. + + - int fileop_notify(channel_id, struct file *filp, enum relay_fileop) + + called when a userspace file operation has occurred or + will occur on a relayfs channel file. These notifications + can be used by the kernel client to trigger actions within + the kernel client when the corresponding event occurs, + such as enabling logging only when a userspace application + opens or mmaps a relayfs file and disabling it again when + the file is closed or unmapped. The kernel client can + also return its own return value, which can affect the + outcome of file operation - returning 0 indicates that the + operation should succeed, and returning a negative value + indicates that the operation should be failed, and that + the returned value should be returned to the ultimate + caller e.g. returning -EPERM from the open fileop will + cause the open to fail with -EPERM. Among other things, + the return value can be used to restrict a relayfs file + from being opened or mmap'ed more than once. The currently + implemented fileops are: + + RELAY_FILE_OPEN - a relayfs file is being opened. Return + 0 to allow it to succeed, negative to + have it fail. A negative return value will + be passed on unmodified to the open fileop. + RELAY_FILE_CLOSE- a relayfs file is being closed. The return + value is ignored. + RELAY_FILE_MAP - a relayfs file is being mmap'ed. Return 0 + to allow it to succeed, negative to have + it fail. A negative return value will be + passed on unmodified to the mmap fileop. + RELAY_FILE_UNMAP- a relayfs file is being unmapped. The return + value is ignored. + + - void ioctl(rchan_id, cmd, arg) + + called when an ioctl call is made using a relayfs file + descriptor. The cmd and arg are passed along to this + callback unmodified for it to do as it wishes with. The + return value from this callback is used as the return value + of the ioctl call. + + If the callbacks param passed to relay_open() is NULL, a set of + default do-nothing callbacks will be defined for the channel. + Likewise, any NULL rchan_callback function contained in a non-NULL + callbacks struct will be filled in with a default callback function + that does nothing. + +- start_reserve - the number of bytes to be reserved at the start of + each sub-buffer. The client can do what it wants with this number + of bytes when the buffer_start() callback is invoked. Typically + clients would use this to write per-sub-buffer header data. + +- end_reserve - the number of bytes to be reserved at the end of each + sub-buffer. The client can do what it wants with this number of + bytes when the buffer_end() callback is invoked. Typically clients + would use this to write per-sub-buffer footer data. + +- channel_start_reserve - the number of bytes to be reserved, in + addition to start_reserve, at the beginning of the first sub-buffer + in the channel. The client can do what it wants with this number of + bytes when the buffer_start() callback is invoked. Typically + clients would use this to write per-channel header data. + +- resize_min - if set, this signifies that the channel is + auto-resizeable. The value specifies the size that the channel will + try to maintain as a normal working size, and that it won't go + below. The client makes use of the resizing callbacks and + relay_realloc_buffer() and relay_replace_buffer() to actually effect + the resize. + +- resize_max - if set, this signifies that the channel is + auto-resizeable. The value specifies the maximum size the channel + can have as a result of resizing. + +- mode - if non-zero, specifies the file permissions that will be given + to the channel file. If 0, the default rw user perms will be used. + +- init_buf - if non-NULL, rather than allocating the channel buffer, + this buffer will be used as the initial channel buffer. The kernel + API function relay_discard_init_buf() can later be used to have + relayfs allocate a normal mmappable channel buffer and switch over + to using it after copying the init_buf contents into it. Currently, + the size of init_buf must be exactly buf_size * n_bufs. The caller + is responsible for managing the init_buf memory. This feature is + typically used for init-time channel use and should normally be + specified as NULL. + +- init_buf_size - the total size of init_buf, if init_buf is specified + as non-NULL. Currently, the size of init_buf must be exactly + buf_size * n_bufs. + +Upon successful completion, relay_open() returns a channel id +to be used for all other operations with the relay. All buffers +managed by the relay are allocated using rvmalloc/rvfree to allow +for easy mmapping to user-space. + +---------- +int relay_write(channel_id, *data_ptr, count, time_delta_offset, **wrote_pos) + +relay_write() reserves space in the channel and writes count bytes of +data pointed to by data_ptr to it. Automatically performs any +necessary locking, depending on the scheme and SMP usage in effect (no +locking is done for the lockless scheme regardless of usage). It +returns the number of bytes written, or 0/negative on failure. If +time_delta_offset is >= 0, the internal time delta, the internal time +delta calculated when the slot was reserved will be written at that +offset. This is the TSC or gettimeofday() delta between the current +write and the beginning of the buffer, whichever method is being used +by the channel. Trying to write a count larger than the bufsize +specified to relay_open() (taking into account the reserved +start-of-buffer and end-of-buffer space as well) will fail. If +wrote_pos is non-NULL, it will receive the location the data was +written to, which may be needed for some applications but is not +normally interesting. Most applications should pass in NULL for this +param. + +---------- +struct rchan_reader *add_rchan_reader(int rchan_id, int auto_consume) + +add_rchan_reader creates and initializes a reader object for a +channel. An opaque rchan_reader object is returned on success, and is +passed to relay_read() when reading the channel. If the boolean +auto_consume parameter is 1, the reader is defined to be +auto-consuming. auto-consuming reader objects are automatically +created and used for VFS read(2) readers. + +---------- +void remove_rchan_reader(struct rchan_reader *reader) + +remove_rchan_reader finds and removes the given reader from the +channel. This function is used only by non-VFS read(2) readers. VFS +read(2) readers are automatically removed when the corresponding file +object is closed. + +---------- +reader add_map_reader(int rchan_id) + +Creates and initializes an rchan_reader object for channel map +readers, and is needed for updating relay_bytes/buffers_consumed() +when kernel clients become aware of the need to do so by their mmap +user clients. + +---------- +int remove_map_reader(reader) + +Finds and removes the given map reader from the channel. This function +is useful only for map readers. + +---------- +int relay_read(reader, buf, count, wait, *actual_read_offset) + +Reads count bytes from the channel, or as much as is available within +the sub-buffer currently being read. The read offset that will be +read from is the position contained within the reader object. If the +wait flag is set, buf is non-NULL, and there is nothing available, it +will wait until there is. If the wait flag is 0 and there is nothing +available, -EAGAIN is returned. If buf is NULL, the value returned is +the number of bytes that would have been read. actual_read_offset is +the value that should be passed as the read offset to +relay_bytes_consumed, needed only if the reader is not auto-consuming +and the channel is MODE_NO_OVERWRITE, but in any case, it must not be +NULL. + +---------- + +int relay_bytes_avail(reader) + +Returns the number of bytes available relative to the reader's current +read position within the corresponding sub-buffer, 0 if there is +nothing available. Note that this doesn't return the total bytes +available in the channel buffer - this is enough though to know if +anything is available, however, or how many bytes might be returned +from the next read. + +---------- +void relay_buffers_consumed(reader, buffers_consumed) + +Adds to the channel's consumed buffer count. buffers_consumed should +be the number of buffers newly consumed, not the total number +consumed. NOTE: kernel clients don't need to call this function if +the reader is auto-consuming or the channel is MODE_CONTINUOUS. + +In order for the relay to detect the 'buffers full' condition for a +channel, it must be kept up-to-date with respect to the number of +buffers consumed by the client. If the addition of the value of the +bufs_consumed param to the current bufs_consumed count for the channel +would exceed the bufs_produced count for the channel, the channel's +bufs_consumed count will be set to the bufs_produced count for the +channel. This allows clients to 'catch up' if necessary. + +---------- +void relay_bytes_consumed(reader, bytes_consumed, read_offset) + +Adds to the channel's consumed count. bytes_consumed should be the +number of bytes actually read e.g. return value of relay_read() and +the read_offset should be the actual offset the bytes were read from +e.g. the actual_read_offset set by relay_read(). NOTE: kernel clients +don't need to call this function if the reader is auto-consuming or +the channel is MODE_CONTINUOUS. + +In order for the relay to detect the 'buffers full' condition for a +channel, it must be kept up-to-date with respect to the number of +bytes consumed by the client. For packet clients, it makes more sense +to update after each read rather than after each complete sub-buffer +read. The bytes_consumed count updates bufs_consumed when a buffer +has been consumed so this count remains consistent. + +---------- +int relay_info(channel_id, *channel_info) + +relay_info() fills in an rchan_info struct with channel status and +attribute information such as usage modes, sub-buffer size and count, +the allocated size of the entire buffer, buffers produced and +consumed, current buffer id, count of writes lost due to buffers full +condition. + +The virtual address of the channel buffer is also available here, for +those clients that need it. + +Clients may need to know how many 'unused' bytes there are at the end +of a given sub-buffer. This would only be the case if the client 1) +didn't either write this count to the end of the sub-buffer or +otherwise note it (it's available as the difference between the buffer +end and current write pos params in the buffer_end callback) (if the +client returned 0 from the buffer_end callback, it's assumed that this +is indeed the case) 2) isn't using the read() system call to read the +buffer. In other words, if the client isn't annotating the stream and +is reading the buffer by mmaping it, this information would be needed +in order for the client to 'skip over' the unused bytes at the ends of +sub-buffers. + +Additionally, for the lockless scheme, clients may need to know +whether a particular sub-buffer is actually complete. An array of +boolean values, one per sub-buffer, contains non-zero if the buffer is +complete, non-zero otherwise. + +---------- +int relay_close(channel_id) + +relay_close() is used to close the channel. It finalizes the last +sub-buffer (the one currently being written to) and marks the channel +as finalized. The channel buffer and channel data structure are then +freed automatically when the last reference to the channel is given +up. + +---------- +int relay_realloc_buffer(channel_id, nbufs, async) + +Allocates a new channel buffer using the specified sub-buffer count +(note that resizing can't change sub-buffer sizes). If async is +non-zero, the allocation is done in the background using a work queue. +When the allocation has completed, the needs_resize() callback is +called with a resize_type of RELAY_RESIZE_REPLACE. This function +doesn't replace the old buffer with the new - see +relay_replace_buffer(). + +This function is called by kernel clients in response to a +needs_resize() callback call with a resize type of RELAY_RESIZE_EXPAND +or RELAY_RESIZE_SHRINK. That callback also includes a suggested +new_bufsize and new_nbufs which should be used when calling this +function. + +Returns 0 on success, or errcode if the channel is busy or if +the allocation couldn't happen for some reason. + +NOTE: if async is not set, this function should not be called with a +lock held, as it may sleep. + +---------- +int relay_replace_buffer(channel_id) + +Replaces the current channel buffer with the new buffer allocated by +relay_realloc_buffer and contained in the channel struct. When the +replacement is complete, the needs_resize() callback is called with +RELAY_RESIZE_REPLACED. This function is called by kernel clients in +response to a needs_resize() callback having a resize type of +RELAY_RESIZE_REPLACE. + +Returns 0 on success, or errcode if the channel is busy or if the +replacement or previous allocation didn't happen for some reason. + +NOTE: This function will not sleep, so can called in any context and +with locks held. The client should, however, ensure that the channel +isn't actively being read from or written to. + +---------- +int relay_reset(rchan_id) + +relay_reset() has the effect of erasing all data from the buffer and +restarting the channel in its initial state. The buffer itself is not +freed, so any mappings are still in effect. NOTE: Care should be +taken that the channnel isn't actually being used by anything when +this call is made. + +---------- +int rchan_full(reader) + +returns 1 if the channel is full with respect to the reader, 0 if not. + +---------- +int rchan_empty(reader) + +returns 1 if the channel is empty with respect to the reader, 0 if not. + +---------- +int relay_discard_init_buf(rchan_id) + +allocates an mmappable channel buffer, copies the contents of init_buf +into it, and sets the current channel buffer to the newly allocated +buffer. This function is used only in conjunction with the init_buf +and init_buf_size params to relay_open(), and is typically used when +the ability to write into the channel at init-time is needed. The +basic usage is to specify an init_buf and init_buf_size to relay_open, +then call this function when it's safe to switch over to a normally +allocated channel buffer. 'Safe' means that the caller is in a +context that can sleep and that nothing is actively writing to the +channel. Returns 0 if successful, negative otherwise. + + +Writing directly into the channel +================================= + +Using the relay_write() API function as described above is the +preferred means of writing into a channel. In some cases, however, +in-kernel clients might want to write directly into a relay channel +rather than have relay_write() copy it into the buffer on the client's +behalf. Clients wishing to do this should follow the model used to +implement relay_write itself. The general sequence is: + +- get a pointer to the channel via rchan_get(). This increments the + channel's reference count. +- call relay_lock_channel(). This will perform the proper locking for + the channel given the scheme in use and the SMP usage. +- reserve a slot in the channel via relay_reserve() +- write directly to the reserved address +- call relay_commit() to commit the write +- call relay_unlock_channel() +- call rchan_put() to release the channel reference + +In particular, clients should make sure they call rchan_get() and +rchan_put() and not hold on to references to the channel pointer. +Also, forgetting to use relay_lock_channel()/relay_unlock_channel() +has no effect if the lockless scheme is being used, but could result +in corrupted buffer contents if the locking scheme is used. + + +Limitations +=========== + +Writes made via the write() system call are currently limited to 2 +pages worth of data. There is no such limit on the in-kernel API +function relay_write(). + +User applications can currently only mmap the complete buffer (it +doesn't really make sense to mmap only part of it, given its purpose). + + +Latest version +============== + +The latest version can be found at: + +http://www.opersys.com/relayfs + +Example relayfs clients, such as dynamic printk and the Linux Trace +Toolkit, can also be found there. + + +Credits +======= + +The ideas and specs for relayfs came about as a result of discussions +on tracing involving the following: + +Michel Dagenais +Richard Moore +Bob Wisniewski +Karim Yaghmour +Tom Zanussi + +Also thanks to Hubertus Franke for a lot of useful suggestions and bug +reports, and for contributing the klog code. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/