Design Notes on Asynchronous I/O (aio) for Linux
-----------------------------------------------
Date: 1/30/2002
- Based on Benjamin LaHaise's in-kernel aio patches
http://www.kernel.org/pub/linux/kernel/people/bcrl/aio
http://www.kvack.org/~blah/aio
http://people.redhat.com/bcrl/aio
(Last version up at the point of writing this: aio-0.3.7)
- Refers to some other earlier aio implementations, interfaces on some
other operating systems, and DAFS specifications for context/comparison.
- Includes several inputs and/or review comments from
Ben LaHaise (overall)
Andi Kleen (experiences from an alternative aio design)
Al Viro (i/o multiplexor namespace)
John Myers (concurrency control, prioritized event delivery)
but any errors/inaccuracies are mine, and feedback or further inputs
are more than welcome.
Regards
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Centre
IBM Software Labs, India
Contents:
----------
1. Motivation
1.1 Where aio could be used
1.2 Things that aio helps with
1.3 Alternatives to aio
2. Design Philosophy and Interface Design
2.1 System and interface design philosophy
2.2 Approaches for implementing aio
2.3 Extent of true async behaviour - queue depth/throttle points
2.4 Sequencing of aio operations
2.5 Completion/readiness notification
2.6 Wakeup policies for event notification
2.7 Other goals
3. Interfaces
3.1 Interfaces provided by this implementation
3.2 Extending the interfaces
4. Design Internals
4.1 Low level primitives
4.2 Generic async event handling pieces
4.3 In-kernel interfaces
4.4 Async poll
4.5 Raw disk aio
4.6 Filesystem/buffered aio
4.7 [Placeholder for network aio]
4.8 Extending aio to other operations (todo)
5. Placeholder for performance characteristics
6. Todo Items/Pending Issues
------------------------------------------------------------------
1. Motivation
Asynchronous i/o overlaps application processing with i/o operations
for improved utilization of CPU and devices, and improved application
performance, in a dynamic/adaptive manner, especially under high loads
involving large numbers of i/o operations.
1.1 Where aio could be used:
Application performance and scalable connection management:
(a) Communications aio:
Web Servers, Proxy servers, LDAP servers, X-server
(b) Disk/File aio:
Databases, I/O intensive applications
(c) Combination
Streaming content servers (video/audio/web/ftp)
(transferring/serving data/files directly between disk and network)
Note:
The POSIX spec has examples of using aio in a journalization model, a data
acquisition model and in supercomputing applications. It mentions that
supercomputing and database architectures may often have specialized h/w
that can provide true asynchrony underlying the logical aio interface.
Aio enables an application to keep a device busy (e.g. raw i/o), potentially
improving throughput. While maximum gains are likely to be for unbuffered
i/o case, aio should be supported by all types of files and devices in
the same standard manner.
Besides, being able to do things like hands-off zero-copy async sendfile
can be quite useful for streaming content servers.
1.2 Things that aio helps with:
- Ability for a thread to initiate operations or trigger actions
without having to wait for them to complete.
- Ability to queue up batches of operations and later issue a single wait to
wait for completion of any of the operations, or at least a certain number of
operations (Note: Currently it's only "at least one" that's supported).
- Multiplexing large numbers of connections or input sources in a scalable manner,
typically into an event driven service model.
[This can significantly reduce the cost of idle connections, which
could be important in protocols like IMAP or IRC for example where
connections may be idle for most of the time]
- Flexible/dynamic concurrency control tuning and load balancing.
- Performance implications
(a) Application thread gets to utilize its CPU time better
(b) Avoids overhead of extra threads (8KB per kernel thread in linux)
(c) Helps system throughput by reducing context switches (since a wait
typically ends a thread's run before its time slice is used up)
- Ability to perform true zero-copy network i/o on arbitrary user buffers
Currently sendfile or an in-kernel server is the only clean way to use the
zero-copy networking features of 2.4. The async i/o api would enable
extending this to arbitrary user buffers.
[Note: Standard O_NONBLOCK doesn't help as the API doesn't take the buffer
away from the user. As a result the kernel can avoid a copy only if it
MMU write protects the buffer and relies on COW to avoid overwrites while
the buffer is in use. This would be rather expensive due to the TLB
flushing requirements, especially as it involves IPIs on SMP. ]
1.2.1 Other expected features (aka POSIX):
- Support for synchronous polling as well as asynchronous notification
(signals/callbacks) of completion status, with the ability to correlate
event(s) with the i/o request(s).
[Note: Right now async notification is not performed by the core kernel aio
implementation, but delivered via glibc userspace threads which wait for
events and then signal the application.
TBD: There are some suggestions for a direct signal delivery mechanism from
the kernel for aio requests to avoid the pthreads overhead for some users
of POSIX aio which use SIGEV_SIGNAL and do not link with the pthreads
library. Possibly a SIGEV_EVENT opcode could be introduced to make the
native API closer to a POSIX extension.
]
- Allow multiple outstanding aio's to the same open instance and to multiple
open instances (sequence might be affected by synchronized data integrity
requirements or priorities)
[Note: Currently there are firmer guarantees on ordering for sockets by the
in-kernel aio, while for file/disk aio barrier operations may need
to be added in the future]
- Option to wait for notification of aio and non-aio events through a single
interface
[TBD: Ties in with Ben's recent idea of implementing userland wait queues ]
- Support for cancellation of outstanding i/o requests
[Note: Not implemented as yet but in plan (just done for aio_poll, others
to follow). Cancellation can by its very nature only be best effort]
- Specification of relative priorities of aio requests (optional)
[Note: Not implemented as yet. Should be linked to the new priority based
disk i/o scheduler when that happens]
1.2.2 Also Desirable:
- Ability to drive certain sequences of related async operations/transfers
in one shot from an application e.g. zero-copy async transfers across
devices (zero-copy sendfile aio)
1.3 Alternatives to aio
1.Using more threads (has its costs)
- static committed resource overhead per thread
- potentially more context switches
2.Communications aio alternatives - /dev/*poll
- specialized device node based interface for registration and
notifications of events
- suitable for readiness notification on sockets, but not for
driving i/o.
3.Real-time signals
- only a notification mechanism
- requires fcntl (F_SETSIG) for edge triggered readiness notification
enablement or aio interfaces (aio_sigevent settings: SIGEV_SIGNAL)
for completion notification enablement through RT signals.
- the mechanism has potential overflow issues (when signal queue
limits are hit) where signals could get lost, especially with
fasync route (which tends to generate a signal for every event
rather than aggregate for an fd) and needs to be supplemented with
some other form of polling over the sigtimedwait interface.
The only way to tune the queue lengths is via sysctl.
- relatively heavy when it comes to large numbers of events
(btw, signal delivery with signal handlers is costly and not very
suitable for this purpose because of the complications of locking
against them in user space; so the sigtimedwait sort of
interface is preferable)
- there are some other problems with flexibility in setting the
recipient of the signal via F_SETOWN (per process queues), which
hinders concurrency.
[Question to Ponder: More efficient implementation and extensions to RT signal
interfaces, or have a different interface altogether ? ]
Please refer to www.kegel.com/c10k.html for a far more detailed coverage of
these mechanisms, and how they can be used by applications.
Reasons for preferring aio:
- Desirable to have a unified approach, rather than multiple isolated
mechanisms if it can be done efficiently
- Multiplexing across different kinds of operations and sources
- Clear cut well-known system call interface preferable to more indirect
interfaces
- Driving optimizations from low level/core primitives can be more efficient
and beneficial across multiple subsystems
- Can separate the core event completion queue and notification mechanisms for
flexibility and efficiency. (Can have tunable wakeup semantics, tunable
queue lengths, more efficient event ring buffer implementation)
Note: There are synchronization concerns when the two are not
unified from a caller's perspective though, so the interfaces need to
be designed with that in mind.
2. Design Philosophy and Interface Design
2.1 System and Interface design philosophy:
Alternatives:
a. Entire system built on an asynchronous model, all the way through
(e.g NT i/o subsystem). So most operations can be invoked in sync or async
mode (sub-options of the same operation specific interface).
Internally, the sync mode = async mode + wait for completion.
b. Async operations are initiated through a separate interface, and could
follow a separate path from the synchronous operations, to a degree
(use common code, and low down things may be truly async and common
for both, but at the higher level the paths could be different)
The POSIX aio interface is aligned with (b). This is the approach that the
Linux implementation takes. Submission of all async i/o ops happens
through a single call with different command options, and data used for
representing different operations.
Advantages:
- No change in existing sync interfaces (can't afford to do that anyway)
- Less impact on existing sync i/o path. This code does not have the overhead
of maintaining async state (can use the stack), and can stay simple.
Disadvantages:
- Need to introduce interfaces or cmd structures for each operation
that can be async. (A little akin to an ioctl style approach)
- Different code paths implies some amount of duplication/maintenance
concerns. Can be minimized by using as much common code as possible.
2.2 Approaches for implementing aio
2.2.1 Alternative ways of driving the operation to completion
1. Using threads to make things _look_ async to the application
a. User level threads
- glibc approach (one user thread per operation ?)
poor scalability, performance
b. Pool of threads
- have a pool of threads servicing an aio request queue for the
task - tradeoff between degree of concurrency/utilization and
resource consumption.
2. Hybrid approach (SGI kaio uses this)
- If the underlying operation is async in nature, initiate it right away
(better utilization of underlying device), and just handle waiting for
completion via thread pool (could become a serialization point depending
on load and number of threads) unless the operation completes right away in a
non-blocking manner.
- If underlying operation is sync, then initiate it via the thread pool
Note:
- SGI kaio has internal async i/o initiation interfaces for raw i/o and
generic read.
- SGI kaio has these slave threads in the context of the aio task => at
least one per task
- SGI kaio slave threads perform a blocking wait for the operation
just dequeued to complete before checking for completion of the next
operation in the queue => the number of slave threads determines the degree of
asynchrony.
3. Implement a true async state machine for each type of aio operation.
(i.e a sequence of non-blocking steps, continuation driven by IRQ and event
threads, based on low level primitives designed for this purpose)
- Relatively harder to get right, and harder to debug, but provides
more flexibility, and greater asynchrony
This aio implementation takes approach 3 (with some caveats, as we shall see
later).
Andi Kleen had experimented with a new raw i/o device which would be
truly async from the application's perspective until it had to block waiting
for request queue slots. Instead of using a thread for completion as in
approach 3 above, it sent RT signals to the application to signal i/o
completion. The experience indicated that RT signals didn't seem to be very
suitable and that synchronization was rather complicated. There also were
problems with flow control with the elevator (application blocking on
request queue slots and plugging issues).
A paper from Univ of Wisconsin-Madison talks about a block async i/o
implementation called BAIO. This scheme uses one slave thread per task
similar to the SGI kaio approach, but in this case the BAIO service thread
checks for completion in a non-blocking manner (it gets notified of i/o
completion by the device driver) and in turn notifies the application.
BAIO does not have to deal with synchronous underlying operations
(doesn't access filesystems, as it only intends to expose a low level
disk access mechanism enabling customized user level filesystems), and
hence its async state machine is simple.
2.2.1.1 Optimization/Fast-path for non-blocking case
In case an operation can complete in a non-blocking manner via the
normal path, the additional async state path can be avoided. An F_ATOMIC
flag check has been introduced down the sync i/o path to check for this,
thus providing a fast path for aio. This idea comes from TUX.
2.2.2 Handling User Space Data Transfer
With asynchronous i/o, steps of the operation aren't guaranteed to execute
in the caller's context. Hence transfers/copies to/from user space need to be
handled carefully. Most of this discussion is relevant for buffered i/o,
since direct i/o avoids user/kernel space data copies.
In a thread pool approach, if a per-task thread pool is used, then such
transfers can happen in the context of one of these threads. Typically
the copy_to_user operations required to read transferred data into user
space buffers after i/o completion would be handled by these aio threads.
Both SGI kaio and BAIO rely on per-task service threads for this purpose.
It may be possible to pass down all the user space data for the operation
when initiating i/o while in the caller's context without blocking, though
this is inherently likely to use extra kernel space memory. The same is
true on the way up on i/o completion, where it may be possible to continue
holding on to the in-kernel buffers until the caller actually gathers
completion data, so that copy into user space can happen in the caller's
context. However this again holds up additional memory resources which may
not be suitable especially for large data transfers.
[BTW, on windows NT, iirc this sort of stuff happens through APCs or
asynchronous procedure calls, in a very crude sense somewhat like softirqs
running in the context of a specified task]
Instead, an approach similar to that taken with direct i/o has been adopted,
where the user space buffers are represented in terms of physical memory
descriptors (a list of tuples of the form <page, offset, len>), called kvecs,
rather than by virtual address, so that they are uniformly accessible in any
process context. This required new in-kernel *kvec* interfaces which operate on
this form of i/o currency or memory descriptors. Each entry/tuple in the kvec is
called a kveclet, and represents a contiguous area of physical memory. A
virtual address range or iovec (in the case of readv/writev) would map to a set
of such tuples which makes up a kvec.
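To make the mapping concrete, here is a sketch (not the actual patch code)
of how a user virtual address range breaks down into kveclets, assuming the
pages have already been pinned and that lookup_pinned_page() is a
hypothetical helper returning the struct page backing a user address:

	unsigned long va = (unsigned long)buf;
	size_t remaining = len;
	int i = 0;

	while (remaining) {
		unsigned offset = va & ~PAGE_MASK;	/* offset within the page */
		unsigned chunk = PAGE_SIZE - offset;	/* bytes left in this page */
		if (chunk > remaining)
			chunk = remaining;
		vec->veclet[i].page   = lookup_pinned_page(va);	/* hypothetical */
		vec->veclet[i].offset = offset;
		vec->veclet[i].length = chunk;
		va += chunk;
		remaining -= chunk;
		i++;
	}
	vec->nr = i;

For example, a 6000 byte buffer starting 2048 bytes into a page would map to
two kveclets: <page0, 2048, 2048> and <page1, 0, 3952>.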
Note:
-----
This fits in very nicely with the current multi-page bio implementation
which also uses a similar vector representation, and also with the
zero-copy network code implementation. Ben has submitted some patches to
make this all a common data structure.
TBD: Some simple changes are needed in the multi-page bio code to get this
to work properly without requiring a copy of the descriptors.
There is a discussion on various alternative representations that have
been considered in the past in sec 1.2.2 of:
http://lse.sourceforge.net/io/bionotes.txt
The only possible drawback is that this approach does keep the user pages
pinned in memory all the while until the i/o completes. However, it neatly
avoids the per-task service thread requirement of other aio implementations.
2.3 Extent of true async behaviour - Queue depth/Throttle points
There has been some discussion about the extent to which asynchronous
behaviour should be supported in case the operation has to wait for some
resource to become available (typically memory, or request queue slots).
There obviously has to be some kind of throttling of requests by the system
beyond which it cannot take in any more asynchronous io for processing.
In such cases, it should return an error (as it does for non-blocking i/o)
indicating temporary resource unavailability (-EAGAIN), rather than block
waiting for the resource (or could there be value in the latter option ?).
It seems appropriate for these bounds to be determined by the aio queue depth
and associated resource limits, rather than by other system resources (though
the allowable queue depth could be related to general resource availability).
This would mean that ideally, when one initiates an async i/o
operation, the operation gets queued without blocking anywhere, or returns
an error in the event it hits the aio resource limits.
[Note/TBD: This is the intended direction, but this aspect of the code is
still under construction and is not complete. Currently async raw aio would
probably block if it needs to wait for request queue slots. Async file i/o
attempts to avoid blocking the app due to sub i/os for bmap kind of operations
but it currently could block waiting for the inode semaphore. The long term
direction is to convert this wait to an async state driven mechanism. The
async state machine also has to be extended to the waits for bmap operations
which has so far only been pushed out of the app's context to that of the
event thread that drives the next step of the state machine (which means
that it could block keventd temporarily).]
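In application terms, the intended model would make submission behave
roughly like this (a sketch against the syscall interfaces of Sec 3.1; the
retry policy shown is the application's choice, not something the kernel
mandates):

	ret = __io_submit(ctx, nr, iocbs);
	if (ret == -EAGAIN) {
		/* aio queue limits hit: reap some completions to make
		 * room in the ring buffer, then retry the submission */
		n = __io_getevents(ctx, batch, events, &timeout);
		/* ... process the n events, then resubmit the iocbs ... */
	}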
2.4 Sequencing of aio operations
Specifying serialization restrictions or relative priorities:
- posix_synchronized_io (for multiple requests to the same fd)
says that reads should see data written by requests preceding it - enforces
ordering to that extent, if specified.
- aio_req_prio (not supported in the current implementation as yet)
app can indicate some requests are lower priority than others, so the system
can optimize system throughput and latency of other requests at the cost
of the latency of such requests.
Some Notes:
- This feature should get linked to the priority based i/o scheduler when
that goes in, in order to make sure that the i/os really get scheduled
as per the priorities.
- The priority of a request is specified relative to (and is lower than)
the process priority, so it can't starve other processes' requests etc. when
passed down to the i/o scheduler, for example. Besides, the i/o scheduler
would also have some kind of aging scheme of its own, or translate
priorities to deadlines or latency estimates to handle things fairly.
- [TBD: Priorities typically indicate hints or expectations unlike
i/o barriers or synchronized i/o requirements for strict ordering (except
possibly for real time applications ?) ]
- POSIX says that same-priority requests to a character device should
be handled fifo.
- As John Myers suggested, considering priorities on the event delivery
path in itself may be useful even without control on i/o scheduling. This
aspect could possibly be implemented early, since it would be needed in
any case in the complete implementation to make sure that priorities are
respected all the way through from initiation to completion processing.
(See point 5 on prioritized event delivery under Sec 2.5)
- To account for priorities at the intermediate steps in the async
state machine, multiple priority task queues could be used instead of
a single task queue to drive the steps.
Beyond these restrictions and hints, sequencing is up to the system, with
dual goals:
- Maximize throughput (global decision)
- Minimize latency (local, for a request)
There are inherent tradeoffs between the above, though improving system
throughput could help with average latency, provided pipeline startup time
isn't significant. A balanced objective could be to maximize throughput within
reasonable latency bounds.
Since each operation may involve several steps which could potentially
run into temporary resource contention or availability delay points, the
sequence in which operations complete, or even reach the target device are
affected by system scheduling decisions in terms of resource acquisition
at each of these stages.
Note/TBD: Since the current implementation uses event threads to drive
stages of the async state machine, in situations where a sub-step isn't
completely non-blocking (as desired), the implementation ends up
causing some degree of serialization, or rather further accentuating the
order in which the requests reached the sub-step. This may seem
reasonable and possibly even beneficial for operations that are likely
to contend for the same resources (e.g requests to the same device),
but not optimal for requests that can proceed in a relatively independent
fashion. The eventual objective is to make sure that sub-steps are
indeed non-blocking, and there is a plan to introduce some debugging aids
to help enforce this. As discussed in Section 2.3, things like bmap,
wait for request, and inode semaphore acquisition are still to be converted
to non-blocking steps (currently a todo).
2.5 Completion/Readiness notification:
Comment: Readiness notification can be treated as a completion of an
asynchronous operation to await readiness.
POSIX aio provides for waiting for completion of a particular request, or
for an array of requests, either by means of polling, or asynchronously
through signals. On some operating systems, there is a notion
of an I/O Completion port (IOCP), which provides a flexible and scalable way
of grouping completion events. One can associate multiple file descriptors
with such a completion port, so that all completion events for requests on
those files are sent to the completion port. The application can thus issue
a wait on the completion port in order to get notified of any completion
event for that group. The level of concurrency can be increased simply by
increasing the number of threads waiting on the completion port. There are
also certain additional concurrency control features that can be associated
with IOCPs (as on NT), where the system decides how many threads to
wake up when completion events occur, depending on the concurrency limits
set for the queue, and the actual number of runnable threads at that moment.
Keeping the number of runnable threads constant in this manner protects
against blocking due to page faults and other operations that cannot be
performed asynchronously.
On a similar note, the DAFS api spec incorporates completion groups for
handling async i/o completion, the design being motivated by VI completion
queues, NT IOCPs and the Solaris aiowait interfaces. Association of an
i/o with a completion group (NULL would imply the default completion queue)
happens at the time of i/o submission which lets the provider know where
to place the event when it completes, contrary to aio_suspend style interface
which specifies the grouping only when waiting on completion.
This implementation for Linux makes use of a similar notion to provide
support for completion queues. There are api's to setup and destroy such
completion queues, specifying the maximum queue lengths that a queue is
configured for. Every asynchronous i/o request is associated with a completion
queue when it is submitted (like the DAFS interfaces), and an application
can issue a wait on a given queue to be notified of a completion event for
any request associated with that queue.
BSD kqueue (Jonathan Lemon) provides a very generic method for registering
for and handling notification of events or conditions based on the concept
of filters of different types. This covers a wide range of conditions
including file/socket readiness notification (as in poll), directory/file
(vnode) change notifications, process create/exit/stop notifications, signal
notification, timer notification and also aio completion notification
(via SIGEV_EVENT). The kqueue is equivalent to a completion queue, and
the interface allows one to both register for events and wait for (and
pick up) any events on the queue within the same call. It is rather flexible
in terms of providing for various kinds of event registration/notification
requirements, e.g. one-shot or every time, temporary disabling, clearing
state if transitions need to be notified, and it supports both edge and
level triggered types of filters.
2.5.1 Some Requirements which are addressed:
1. Efficient for large numbers of events and connections
- The interface to register events to wait for should be separate from
the interface used to actually poll/wait for the registered events to
complete (unlike traditional poll/select), so that registrations can
hold across multiple poll waits with minimum user-kernel transfers.
(It is better to handle this at interface definition level than
through some kind of an internal poll cache)
The i/o submission routine takes a completion queue as a parameter,
which associates/registers the events with a given completion group/queue.
The application can issue multiple waits on the completion queue using a
separate interface.
- Ability to reap many events together (unlike current sigtimedwait
and sigwaitinfo interfaces)
The interface used to wait for and retrieve events, can return an
array of completed events rather than just a single event.
- Scalable/tunable queue limits - at least have a limit per queue rather
than system wide limits
Queue limits can be specified when creating a completion group.
TBD: A control interface for changing queue parameters/limits (e.g
io_queue_grow) might be useful
- Room for more flexible/tunable wakeup semantics for better concurrency
control
Since the core event queue can be separated from the notification mechanism,
the design allows one to provide for alternative wakeup semantics
to optimize concurrency and reduce redundant or under-utilized context
switches. Implementing these might require some additional parameters or
interfaces to be defined. BTW, it is desirable to provide a unified interface
for notification and event retrieval to a caller, to avoid synchronization
complexities, even if the core policies are separable underneath in-kernel.
[See the discussion in Sec 2.6 on wakeup policies for a more
detailed discussion on this]
2. Enable flexible grouping of operations
- Flexible grouping at the time of i/o submission
(different operations on the same fd can belong to different groups,
operations on different fds can belong to the same group)
- Ability to wait for at least a specified number of operations from
a specified group to complete (at least N vs at least 1 helps with
batching on the way up, so that the application can perform its post
processing activities in a batch, without redundant context switches)
The DAFS api supports such a notion, both in its cg_batch_wait interface
which returns when either N events have completed, or with less than N
events in case of a timeout, and also in the form of a num_completions
hint at the time of i/o submission. The latter is a hint that gets sent
out to the server as a characteristic of the completion queue or session,
so the server can use this hint to batch its responses accordingly.
Knowing that the caller is interested only in batch completions helps
with appropriate optimizations.
Note: The Linux aio implementation today only supports "at least one"
and not "at least N" (e.g the aio_nwait interface on AIX).
The tradeoffs between responsiveness and fairness issues tend
to get amplified when considering "at least N" types of semantics,
and this is one of the main concerns in supporting it.
[See discussion on wakeup policies later]
- Support dynamic additions to the group rather than a static or one time
list passed through a single call
Multiple i/o submissions can specify the same completion group, enabling
events to be added to the group.
[Question: Is the option of the completion group being different from the
submission batch/group (i.e. per iocb grouping field) useful to have ?
POSIX allows this]
3. Should also be able to wait for a specific operation to complete (without
being very inefficient about it)
One could either have low overhead group setup/teardown so such an operation
may be assigned a group of its own (costs can be amortized across multiple
such operations by reusing the same group if possible) or provide an
interface to wait for a specific operation to complete.
The latter would be more useful, though it requires a per-request wait queue
or something similar. The current implementation has a syscall interface
defined for this (io_wait), which hasn't been coded up as yet. The plan is
to use hashed wait queues to conserve on space.
There are also some semantics issues in terms of possibilities of another
waiter on the queue picking up the corresponding completion event for this
operation. To address this, the io_wait interface might be modified to
include an argument for the returned event.
BTW, there is an option of dealing with this using the group primitives
either in user space, or even in kernel by waiting in a loop for any event
in the group until the desired event occurs, but this could involve some
extra interim wakeups / context switches under the covers, and a user
level event distribution mechanism for the other events picked up in the
meantime.
4. Enable Flexible distribution of responsibility across multiple
threads/components
Different threads can handle submission for different operations,
and another pool of threads could wait on completion.
The degree of concurrency can be improved simply by increasing threads
in the pool that wait for and process completion of operations for
that group.
5. Support for Prioritized Event Delivery
This involves the basic infrastructure to be able to accord higher
priority to the delivery of certain completion events over others,
(e.g. depending on the request priority settings of the corresponding
request), i.e. if multiple completion events have arrived on the
queue, then the events for higher priorities should be picked up
first by the application.
TBD/Todo:
One way of implementing this would be to have separate queues
for different priorities and attempt to build an aggregate (virtual)
queue. There are some design issues to be considered here as in any
scheduling logic, and this needs to be looked at in totality in
conjunction with some of the other requirements. For example, things
like aging of events on the queue, could get a little complex to do.
One of the approaches under consideration is to try to handle the
interpretation of priorities in userspace, leaving some such decisions
to the application. It is the application which decides the limits
for each of the queues, so the kernel avoids having to handle that
or balance space across the queues. Only kernel support for making
a multiplexed wait on a group of completion queues possible
might suffice to get this to work. Ben has in mind a rather generic
way of doing this (across not just completion queues, but also possibly
across other sorts of waits) by providing primitives that expose the
richness of the kernel's wait queue interfaces directly to userspace.
The idea is that something like the following would become possible:
user_wait_queue_t wait;
int ret;
add_wait_queue(high_pri_ctx, wait);
add_wait_queue(low_pri_ctx, wait);
ret = process_wait(); /* call it schedule() if you want */
while (vsys_getevents(high_pri_ctx, ...) > 0)
...
...
i.e. a very similar interface to what the kernel uses, which can be mixed
and matched across the different kinds of things that need to be waited
upon (locks, io completion, etc). Such a mechanism can also be used for
building the more complex locks that glibc needs to provide
efficiently without sacrificing a rich and simple interface.
Notice that for true aio_req_prio, the kernel would have to
be aware of completion queue priorities, but that it may still be
possible for the order in which events are picked up (across the
queues) to be handled by the application.
BTW, another possibility is to maintain a userland queue (or set of
queues, for each priority), into which events get drawn in whenever
events are requested and then later distributed/picked up by the
application's threads. One of the tricky issues with such multi-level
queues is handling flow control, which is not very appealing.
(Interestingly Viro's suggested interface (3.2) also deals with composite
queues. Just one level of aggregation suffices for the prioritized
delivery requirements, while Viro's interface supports multiple
levels of aggregation. )
2.6 Wakeup Policies for Event Notification
2.6.1 The wakeup policy used in this implementation
The design is geared towards minimizing the latency of completion
processing, which is directly related to the responsiveness of the
(server) application to events. Ensuring fairness (or even starvation
avoidance) is not expected to be an issue with the expected application
model of symmetric server threads (i.e. threads which take the same actions
on completion of given events), except in so far as it affects load
balancing which in turn could affect latency.
[TBD: I'm not sure of this, but starvation may be an issue when it
comes to non-symmetric threads, where the event is a readiness
indicator which the thread uses to decide on availability of space
in order to push its data or something of that sort.]
The wakeup policy in the current implementation is to wake up a thread
on the completion queue whenever an i/o completes. Any thread that picks up
the event first (this could even be a new caller who wasn't already waiting
on the queue) gets it, and if no events are available for a thread to pick
up, it goes back to sleep again. By ensuring that the thread that gets to
the event earliest picks it up, this keeps the latency minimal. Also, with a
view to better cache utilization, the wakeup mechanism is LIFO by default.
(A new exclusive LIFO wakeup option has been introduced for this purpose)
Making the wakeups exclusive reduces some contention or spurious wakeups.
When events are not coming in rapidly enough for all the threads to get
enough events that they can use up their full time slice processing, there
is a likelihood of some contention and redundant or rather underutilized
context switches. (While it just might happen that a thread gets deprived
of events as other threads keep picking them up instead, as discussed, that
may not be significant, and probably just an indicator that the number of
threads could be reduced. )
In the situation when there are a lot of events in the queue, then every
thread tries to pick up as many events as it can (up to the number specified
by the caller), but one at a time. The latter aspect (of holding the lock
across the acquisition of only a single event at a time) helps with some amount
of load balancing (for event distribution, or completion work) on SMP when
these threads are running in parallel on multiple CPUs.
2.6.2 TBD: Note on at least N semantics:
In some situations where an application is interested in batch results
and overall throughput (vs responsiveness to individual events), an "at
least N" kind of wakeup semantics, vs "at least one" can help amortize the cost
of a wakeup/context switch across multiple completions. (This is better than
just a time based sleep which doesn't have any co-relation with the i/o
or event completion rates - one could have too many events building up or
perhaps too few depending on the load). This makes sense when the amount
of post-processing on receipt of an event is very small and the resulting
latency is tolerable (combination of timeout + N lets one specify the
bounds), so the application would rather receive notifications in batches.
Things get a little tricky when trying to define the policies for "at
least N" when multiple threads are involved, possibly with different values
of N (though that is not a typical situation), in terms of event distribution,
simply because the tradeoffs between latency and fairness tend to widen in
this case.
A natural extension of the current scheme to an at least N scheme,
would be to wake up only waiters whose "N-value" is matched or exceeded by
the number of events available, and then have them try to pick up their N
events in one shot (i.e. atomically) if available, or go back to sleep. If a thread
finds more events available after it picks up its N events, or after it
times out, then just as before it keeps picking up as many events as it can
(up to the specified limit) but one at a time. This helps reduce the load
balancing vs batching conflict (the policy is batch upto n, balance beyond
that).
[TBD: Implementing the "check for N and wakeup" scheme above correctly in
the presence of exclusive waits may require support in the wait queue
wakeup logic to account for the status returned by a wait queue function
to decide if the entry should be treated as done/woken up. The approach
would be that the earliest waiter whose conditions are satisfied would
get woken up]
Obviously the possibility of starvation is relatively more glaring
in this case, than with at-least-one, e.g. consider the case when 2N-1 events
are just picked up by one thread, while the other thread is idle, and the
2Nth event comes in just then. As mentioned earlier, starvation is not an
issue in itself, but the load balancing implication is worth keeping in
mind. The maximum number of events requested in one shot and the timeout
provide the bounds on this sort of a thing from an application's
perspective. (BTW, the DAFS cg_batch_wait interface is "exactly N", which
is one other way of handling this; actually it is exactly N or less on a
timeout)
Notice that trying to implement at-least-N semantics purely in user space
above at-least-one primitives with multiple waits has latency issues in the
multiple waiters case (besides the extra wakeups/context switches). In the
worst case, with m threads, the latency for actual completion processing
(where completion processing happens in batches of N events), could be
delayed up to the arrival of the (m*N - 1)th event.
Remark: "At least N" is still a TBD.
2.6.3 TBD: Load/Queue Length based wakeup semantics:
This is another option, from a networking analogy, where the system could
tune the N-value for wakeup on a queue, based on event rates or space
available to queue more requests. This is however based on the expectation
that completion processing would trigger a fresh batch of aio requests on the
queue.
Note: Being able to wait on a specific aio, or a submit and wait
for all the submitted events to complete (the way it is supported in
BSD kqueues) are other interfaces that could potentially reduce
the number of context switches, and are useful in some situations
(not implemented as yet).
2.6.4 TBD/Future: Per Completion Queue Concurrency Control
There have been some thoughts about achieving IOCP concurrency control
via associated scheduling group definitions, independently of aio completion
queue semantics, so an application could possibly choose to use both aio
and scheduling groups together.
This might make sense because the system has no persistent association of
the completion queue with threads that aren't waiting on that queue.
Implicit grouping (e.g. association of a thread with the last ctx it
invoked io_getevents on) is possible, but does make some assumptions
(even if these might reflect the most typical cases) on the way the
application threads handle completions and IOCP waits.
On the other hand as John Myers indicated, a pure scheduling group feature
that only looks at wakeups, without knowledge of the reason for the wakeup
(the ability to distinguish between more events/work coming in which can
be handled by any one from the set of threads, or indicating completion
of synchronous actions meant for a specific thread) may not be able to
take the kind of more informed decisions that a more tightly coupled
feature or abstraction that operates at a slightly higher level can.
One way to solve this would be for the scheduling group implementation
(if and when it is implemented) to also allow for (in-kernel) priority
indicators for waiters (or the wait queues, whichever seems appropriate),
so that it can handle such decisions. Components like aio could take care of
setting up such priorities as it sees fit (e.g accord lower priorities for
the completion wait queue waits) to cause the desired behaviour.
2.7 Other Goals
- Support POSIX as well as completion port style interfaces/wrappers
The base kernel interfaces are designed to provide the minimum native support
required for the library to implement both styles of interfaces.
- Low overhead for kernel and user
[Potential todos: Possibly via an mmaped ring buffer, vsyscalls]
- Extensible to newer operations
e.g. aio_sendfile, aio_readv/writev + anything else that seems
useful in the future (semaphores, notifications etc ?)
3. Interfaces
3.1 The interfaces provided by this implementation
The interfaces are based on a new set of system calls for aio:
- Create/Setup a new completion context/queue. This completion context
can only be shared across tasks that share the same mm (i.e. threads).
__io_setup(int maxevents, io_context_t *ctxp)
- Submit an aio operation. The iocb describes the kind of operation and
the associated parameters. The completion queue to associate the
operations with is specified too.
__io_submit(io_context_t ctx, int nr, struct iocb **iocbs)
- Retrieve completion events for operations associated with the completion
queue. If no events are present, then wait for up to the timeout for at
least one event to arrive.
__io_getevents(io_context_t ctx, int nr, struct io_event *events,
struct timespec *timeout)
- Wait up to the timeout for the i/o described by the specific iocb to complete.
[Ques: Should this interface be modified to retrieve the event as well ?]
__io_wait(io_context_t ctx, struct iocb *iocb,
struct timespec *timeout)
- Cancel the operation described by the specified iocb.
__io_cancel(io_context_t ctx, void *iocb)
- Teardown/Destroy a completion context/queue (happens by default upon
process/mm exit)
Pending requests would be cancelled if possible, and the resources would
get cleaned up when all in-flight requests get completed/cancelled.
Naturally any unclaimed events would automatically be lost.
__io_destroy(io_context_t ctx)
The library interface that a user sees is built on top of the above system
calls. It also provides a mechanism to associate callback routines with the
iocb's which are invoked in user space as part of an event processing loop
when the corresponding event arrives. There are helper routines (io_prep_read,
io_prep_write, io_prep_poll, etc) which can be used for filling up an iocb
structure for a given operation, before passing it down through io_submit.
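For orientation, a minimal usage sketch built on these calls follows (the
wrapper names mirror the system calls above; the argument order of
io_prep_read is an assumption based on this description and may differ
across aio releases):

	int read_async(io_context_t ctx, int fd)
	{
		struct iocb cb;
		struct iocb *cbs[1] = { &cb };
		struct io_event events[1];
		struct timespec ts = { 5, 0 };	/* wait for up to 5 seconds */
		static char buf[4096];
		int ret;

		io_prep_read(&cb, fd, buf, sizeof(buf), 0); /* fill in the iocb */
		ret = io_submit(ctx, 1, cbs);	/* initiate the read */
		if (ret < 1)
			return ret;
		/* reap at least one completion event, waiting if needed */
		return io_getevents(ctx, 1, events, &ts);
	}

Here ctx would have been created beforehand with io_setup (e.g.
io_setup(32, &ctx)) and eventually torn down with io_destroy(ctx).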
Please refer to the aio man pages for details on the interfaces and how
to use them [Todo: Reference to man pages from Ben]
POSIX aio is implemented in user space library code over these basic
system calls. This involves some amount of book-keeping and extra threads
to handle some of the notification work (e.g. SIGEV_NOTIFY is handled by
sending the notification signals to the kernel from the user space thread).
(Note: The plan is to add support for direct signal delivery from the
kernel for aio requests in which case this dependence on pthreads would
change)
3.2 Extending the Interfaces
Alternatives to using system calls for some of the aio interfaces,
particularly the event polling/retrieval pieces, include implementing a
pseudo device driver interface (like the /dev/poll and /dev/epoll
approaches), or a pseudo file system interface. A system call approach
appears to be a more direct and clear-cut interface than any specialized
device driver ioctl or read/write operations approach, which was one of the
reasons why the possibility of a /dev/aio was abandoned during aio
development.
TBD/Future:
The filesystem namespace based approach that Al Viro has suggested for
i/o multiplexors (for flexible and scalable event polling/notification),
provides for some interesting features over aio completion queues
like naming, sharing (across processes rather than just threads),
access control, persistence, and hierarchical grouping (i.e. more than
just a single level of grouping). The model uses AF_UNIX socket
sendmsg/recvmsg calls with specific datagram formats (SCM_RIGHTS datagrams)
on the namespace objects instead of any new apis for registration and polling
of events. The interface is defined so that recvmsg gets a set of new
open descriptors for each of the underlying channels with events. This
makes it feasible to share event registrations across processes, since the
fd used to register the event needn't be available when the event is
picked up.
However, it still would make sense to have a separate mechanism for
async i/o and associated notifications. Possibly if something like the
above is implemented, one could consider ways of associating aio completion
queues with it, if that fits semantically, or move things like async
poll out of aio in there. Most of the aio operations (other
than async poll today, and possibly aio_sendfile later) involve user
space buffers, so sharing across processes may not make much sense, except
perhaps in the case of shared memory buffers.
4. Design Internals
4.1 Low Level Primitives :
4.1.1 wait_queue functions
This primitive is based on an extension to the existing wait queue scheme.
The idea is that both asynchronous and synchronous waiters just use the
same wait queue associated with any given data structure transparent
to the caller of wakeup. (This avoids the need to attach new notify/fasync
sort of structures for every relevant operation/data structure involved
in the async state machine)
To support asynchronous waiters, the wait queue entry structure now
contains a function pointer for the callback to be invoked for async
notification. The default action, in case such a callback is not specified,
is to assume that the entry corresponds to a synchronous waiter (as before)
and to wake it up accordingly. The callback runs with interrupts disabled
and with the internal wait queue spinlock held, so the amount of work done
in the callback is expected to be very restricted. Additional spinlocks
should be avoided. The right thing to do if more processing is required
is to queue up a work-to-do action to be run in the context of an event
thread (see next section). Extreme caution is recommended in using wait
queue callbacks as it is rather prone to races if not used with care.
There is a routine to atomically check for a condition and add a wait queue
entry if the condition is not met (add_wait_queue_cond). The check for the
condition happens with the internal wait queue spin lock held. This avoids
missing events between the check and addition to the wait queue, which could
be fatal for the async state machine. The standard way of handling the
possibility of missed events with synchronous waiters was to add the wait
queue entry before performing the check for the condition, and to just
silently remove the entry thereafter if the condition has already been
met. However in the case of async waiters where the follow on action happens
in the wait queue function, this could lead to duplicate event detection,
which could be a problem if the follow on action is not defined to be
idempotent. The add_wait_queue_cond() feature helps guard against this.
[Note: An associated implication of this is that checks for wait_queue_active
outside of the internal wait queue lock are no longer appropriate, as it
could lead to a missed event]
The wait queue callback should check for the occurrence of the condition
before taking action just as in typical condition wait/signal scenarios.
Notice that the callback is responsible for pulling the entry off the wait
queue once it has been successfully signalled, unlike the synchronous case
where queueing and dequeueing happens in the same context.
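To make the shape of such a callback concrete, a sketch follows (the
callback signature, the condition_met() check and the use of the unlocked
dequeue variant are illustrative assumptions based on the description
above; wtd_queue() and the worktodo structure are covered in the next
section):

	/* Async waiter callback. The wait queue entry is embedded as
	 * the first member of a worktodo, so the containing structure
	 * can be recovered with a simple cast. */
	static void wtd_wait_func(wait_queue_t *wait)
	{
		struct worktodo *wtd = (struct worktodo *)wait;

		if (!condition_met())	/* re-check the condition, as in */
			return;		/* any condition wait/signal scheme */

		/* the callback must dequeue the entry itself; the wait
		 * queue lock is already held, so use the unlocked variant
		 * on the queue the entry was added to (assumed here) */
		__remove_wait_queue(waitq_head, &wtd->wait);

		wtd_queue(wtd);	/* defer the real work to an event thread */
	}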
4.1.2 Work-to-dos (wtd) for async state machine
Work-to-dos provide the basic abstraction for representing actions for
driving the async state machine through all the steps needed to complete
an async i/o operation. The design of work-to-dos in this aio implementation
is based on suggestions from Jeff Merkey for implementing such async state
machines, and is modelled on the approach taken in Ingo Molnar's
implementation of TUX.
As mentioned in the previous section, because of restricted conditions
under which wait queue functions are called, it isn't always possible to
drive steps of the async state machine purely through wait queue functions.
Instead the wait queue function in turn could queue a work-to-do action
to be invoked in a more suitable context, typically by a system worker thread.
This is achieved using the task-queue primitives on Linux. Currently aio
just uses the same task queue which is serviced by keventd (i.e. the context
task queue). In the future this could possibly be handled by a pool of
dedicated aio system worker threads. [TBD: Also, priorities may be supported
by having multiple task queues of different priority]
struct wtd_stack {
void (*fn)(void *data); /* action */
void *data; /* context data for the action */
};
struct worktodo {
wait_queue_t wait; /* this gets linked to the wait
queue for the event which is
expected to trigger/schedule this
wtd */
struct tq_struct tq; /* this gets linked to the task
queue on which the wtd has to be
scheduled (context_tq today) */
void *data; /* for use by the wtd_ primitives */
/* The stack of actions */
int sp;
struct wtd_stack stack[3];
};
A typical pattern for the asynchronous version corresponding to a
synchronous operation consisting of a set of non-blocking steps with
synchronous waits between steps, could be something like the following:
(Let's label this pattern A)
- Initiate step 1, and register an async waiter or callback
- Async waiter completes and queues a work-to-do for the next step
- The work-to-do initiates step 2 when it gets serviced, and registers an
async waiter or callback to catch completion
- Async waiter/callback initiates step 3 ...
.. and so on till step n.
Of course, there are other possible patterns, e.g where the operation can be
split off into multiple independent sub-steps which can be initiated at the
same time, and then use callbacks/async waiters to collect/consolidate the
results and if required queue a work-to-do action after that to drive the
follow-up action. (Let's label this pattern B)
The work-to-do structure is designed so that state information can be passed
along from one step to the next (unlike synchronous operations, state
can't be carried over on stack in this case). There is also support for
stacking actions within the same work-to-do structure. This feature has
been used in the network aio implementation (which is currently under a
revamp) to enable calling routines to stack their post completion actions
(and associated data) before invoking a routine that might involve an async
wait. For example, consider a nested construct of the form:
func1()
{
func2();
post_process1();
}
func2()
{
func3();
post_process2();
}
func3()
{
process;
wait for completion;
post_process3();
}
The asynchronous version of the above could have the following pattern,
assuming that a worktodo structure is shared/passed on in some manner down
the levels of nesting:
- func1 initializes the worktodo with the action post_process1(), before
calling func2
- func2 pushes the action post_process2() on the worktodo stack before
calling func3
- func3 pushes the action post_process3() on the worktodo stack
- func3 then replaces its synchronous wait by setting up an asynchronous
waiter which would schedule the worktodo sequence
- the worktodo sequence simply pops each action in turn and executes it
to achieve the desired effect.
(Let's label this pattern C; a code sketch follows below)
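In code, the innermost level (func3) of this pattern might look roughly as
follows (the wtd_push()/wtd_queue() helper names follow the description
above; the return convention assumed for add_wait_queue_cond() - non-zero
if the waiter was queued - is an illustration, not necessarily the patch's
actual signature):

	static void func3_async(struct worktodo *wtd)
	{
		/* stack the post-completion action; it is popped first */
		wtd_push(wtd, post_process3, wtd->data);

		initiate_step();	/* the non-blocking part of func3 */

		/* add the async waiter unless the step already completed;
		 * the condition is (re)checked under the wait queue lock,
		 * macro-style, to avoid missed events (see Sec 4.1.1) */
		if (!add_wait_queue_cond(&waitq, &wtd->wait, step_done()))
			wtd_queue(wtd);	/* already done: run the stack now */
	}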
Some caution is needed when using the async waiter + work-to-do combination,
e.g. maintaining the 1-1 association between an event and the queueing of the
worktodo, and guarding against duplicate or missed events (as discussed in the
previous section). Also, one needs to be very careful about recursion in
the chained operations (can't have stack overflows in the kernel).
4.2 Generic async event handling pieces
4.2.1 The completion queue
The in kernel representation of the completion queue structure (kioctx),
contains a list of in-use (active) and free requests (where each request is
in the in-kernel iocb representation, i.e. a kiocb), and also a circular
ring buffer, where completion events are queued up as they arrive, and
picked up in FIFO order. There is a per kioctx wait queue which is used to
wait for events on that queue. The reference count of a kioctx is incremented
when it is in use (i.e. when there are pending requests).
A completion queue is associated with the mm struct for the concerned task,
thus threads which share the same address space also share completion
queues. The ctx_id is unique per-mm. The completion queues for a given
address space are linked together with the list grounded in the mm struct.
On process exit (i.e. when the mm users count goes to zero), the completion
queue is released (the actual free could happen a little later depending on
the reference count, i.e. in case the kioctx is in use).
The ring buffer is designed to be virtually contiguous, so if necessary
(i.e. if the higher order page allocation needed to accommodate the specified
number of events fails) it may be vmalloc'ed. The requests/kiocbs are
also preallocated when the kioctx is created, but these needn't be contiguous
and are allocated from slab.
4.2.2 I/O Submission, Completion and Event Pickup
New requests can be submitted only if there is enough space left in the
ring buffer to accommodate completion events for all pending requests as
well as the new one in the ring buffer.
The io_submit interface invokes the corresponding async file op based on
the operation code specified in the iocb. The file descriptor reference
count is incremented to protect against the case where the process exits and
closes the file while i/o is still in progress. In such a scenario the
request, file descriptor and the kioctx state are not freed immediately,
but in a deferred manner as and when the completions (or cancellation
possibly once that is supported) happen, and it is safe to do so.
When the operation completes, the corresponding completion path (via
async waiters or worktodos) invokes aio_complete, which takes care of
queuing the completion status/event at the end of the ring buffer,
waking up any threads that may be waiting for events on the queue,
releasing the request and other related cleanups (e.g decrementing the
file descriptor reference count).
When the io_getevents interface is invoked for harvesting events, it picks
up completion events available in the circular ring buffer (i.e. from
the head of queue), or waits for events to come in, depending on the wakeup
and event distribution policies discussed in Sec 2.6.
4.2.3 TBD: User space memory mapping of the Ring Buffer
The design allows for the possibility of modifying the implementation to
allow for the events ring buffer to be mapped in user space, if that helps
with performance (avoiding some memory copies and system call overheads).
The current implementation prepares for avoiding the complexities of
user-kernel locking in such a case by making sure that only one side
updates any field (basically head and tail of the ring buffer), and also
banking on the assumption that reading an old value won't cause any real
harm. The Kernel/Producer updates the tail and the User/Consumer updates
the head. If the User sees an old value of tail, it may not see some just
arrived events, which is similar to the case when the events haven't
arrived, and so harmless. If the Kernel sees an old value of Head, then it
may think there isn't enough space in the queue and will try again later.
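The update discipline would look something like this (illustrative only;
the barrier placement is the standard single-producer/single-consumer
recipe rather than code lifted from the patch):

	/* Producer (kernel, in aio_complete), queueing one event: */
	ring->events[ring->tail] = *ev;
	wmb();		/* make the payload visible before moving tail */
	ring->tail = (ring->tail + 1) % ring->nr;

	/* Consumer (io_getevents, or user space if mapped), one event: */
	if (ring->head != ring->tail) {
		rmb();	/* read the payload only after seeing the new tail */
		*ev = ring->events[ring->head];
		ring->head = (ring->head + 1) % ring->nr;
	}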
TBD: As Andi Kleen observed, schemes like this could be rather fragile and
hard to change, as past experience with such optimizations in the networking
code has indicated, where proper spin locks had to be added eventually.
So we need to understand how significant a performance benefit is achieved
by moving to a user space mapped ring buffer to decide if it is worth it.
4.3 In-kernel interfaces
4.3.1 Operations
The in-kernel interfaces that were added for aio implementation include
the following:
- New filesystem operations (fops) to support asynchronous read/write
functions (f_op->aio_read/write)
- Several helper routines for manipulating and operating on kvecs, the common
i/o currency discussed in Sec 2.2.2 (e.g. mapping user space buffers to
kvecs, copying data to/from/across kvecs, i.e. the *kvec_dst* routines)
- New filesystem read/write operations (f_op->kvec_read/write) which
operate directly on kvec_cb's in asynchronous mode. These are the
operations that have to be defined for different file types, e.g.
raw aio, buffered filesystem aio and network aio.
The f_op->read/write operations are expected to be changed to support an
F_ATOMIC flag which can be used to service an aio operation synchronously
if it can be done in a non-blocking manner. This provides a fast path that
avoids some of the overheads of the async state machine when the operation
can complete without blocking/waiting at all. Currently, F_ATOMIC is
implemented via f_op->new_read/write for compatibility reasons.
[Todo: The plan is to add f_ops->flags_supported to enable read/write
operations to be converted wholesale without requiring additional code to
check for supported operations in all callers.]
The generic f_op->aio_read/write operations first attempt the non-blocking
synchronous path described above, and take the async route only if it fails
(i.e. returns an error indicating that the operation might block). In that
case they convert the user virtual address range to a kvec and then invoke
the appropriate async kvec fops. Notice that this mechanism should be
extendable to readv/writev in a relatively lightweight manner (compared to
kiobufs), though aio readv/writev is still a Todo right now.
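A sketch of this fast-path-with-fallback logic (hypothetical names and
simplified signatures - new_read, map_user_kvec and the completion callback
here are placeholders standing in for the patch's actual helpers):

static ssize_t generic_aio_read(struct file *file, struct kiocb *req,
				char *buf, size_t count, loff_t pos)
{
	struct kvec_cb cb;
	ssize_t ret;

	/* Fast path: synchronous, but guaranteed not to block */
	ret = file->f_op->new_read(file, buf, count, &pos, F_ATOMIC);
	if (ret != -EAGAIN)
		return ret;	/* completed (or failed) without blocking */

	/* Slow path: pin the user buffer and go async via the kvec op */
	cb.vec = map_user_kvec(READ, (unsigned long)buf, count);
	if (IS_ERR(cb.vec))
		return PTR_ERR(cb.vec);
	cb.fn = aio_read_complete;	/* ends up calling aio_complete() */
	cb.data = req;
	return file->f_op->kvec_read(file, cb, count, pos);
}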
4.3.2 The i/o Container Data structure, kvec_cb
The i/o unit which is passed around to the kvec fops is the kvec_cb
structure. This contains a pointer to the kvec array discussed earlier,
plus the associated callback state (i.e. callback routine and data pointer)
for i/o completion:
struct kveclet {
	struct page	*page;
	unsigned	offset;
	unsigned	length;
};

struct kvec {
	unsigned	max_nr;
	unsigned	nr;
	struct kveclet	veclet[0];
};

struct kvec_cb {
	struct kvec	*vec;
	void		(*fn)(void *data, struct kvec *vec, ssize_t res);
	void		*data;
};

struct kvec_cb_list {
	struct list_head	list;
	struct kvec_cb		cb;
};
The callback routine would typically be set to invoke aio_complete for
performing completion notification. For a compound operation like aio_sendfile,
which involves two i/os (input on one fd and output to the other), the
callback could be used for driving the next stage of processing, i.e. to
initiate the second i/o.
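A typical completion function would look something like the following
sketch (the unmap_kvec/free_kvec usage and exact argument conventions are
recalled from the patch and may differ in detail):

static void aio_read_complete(void *data, struct kvec *vec, ssize_t res)
{
	struct kiocb *iocb = data;

	unmap_kvec(vec, 1);	/* 1 => dirty the pages: this was a read */
	free_kvec(vec);
	aio_complete(iocb, res, 0);	/* queue the event, wake waiters */
}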
[TBD: With this framework, callback chaining is not inherently supported.
Intermediate layers could save pointers to higher layer callbacks as part
of their callback data, and thus implement chaining themselves, but a
standard mechanism would be preferable. ]
The *kvec_dst* helper routines which are used for retrieving or transferring
data from/to kvecs are designed to accept as argument a context structure
(kvec_dst) that maintains state related to the remaining portions to
transfer. Since a kvec contains fragments of non-uniform size, locating the
portion to transfer given the offset in number of bytes from the start of
the kvec is not a single step calculation, so it is more efficient to
maintain this information as part of the context structure. These routines
also take care of performing temporary kmaps of veclets for memory copy
operations, as needed.
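For example, a copy into a kvec might look roughly as follows (the helper
names and signatures here - kvec_dst_init, kvec_dst_set,
memcpy_to_kvec_dst - are recalled from the patch and may not be exact):

	struct kvec_dst dst;

	kvec_dst_init(&dst, KM_USER0);		/* kmap slot for temp maps */
	kvec_dst_set(&dst, vec->veclet);	/* position at kvec start */
	memcpy_to_kvec_dst(&dst, buffer, len);	/* copies across veclets,
						 * kmapping pages as needed;
						 * position state stays in
						 * 'dst' for the next
						 * partial transfer */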
The map_user_kvec() routine is used to map a user space buffer to a kvec
structure (it allocates the required number of veclet entries). It also
takes care of bringing in the corresponding physical pages if they are
swapped out, and increments the reference count of each page, essentially
pinning it in memory for the duration of the i/o.
(TBD/Check: Where does unmap_kvec happen ?)
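The essence of what map_user_kvec() does can be sketched as below
(simplified: partial first/last page handling, veclet setup and error
paths are omitted, and the helper name is hypothetical):

static int pin_user_buffer(int rw, unsigned long va, int nr_pages,
			   struct page **pages)
{
	int ret;

	down_read(&current->mm->mmap_sem);
	/* Faults the pages in if needed and takes a reference on each,
	 * keeping them resident for the duration of the i/o */
	ret = get_user_pages(current, current->mm, va & PAGE_MASK,
			     nr_pages, rw == READ /* write to memory ? */,
			     0 /* no force */, pages, NULL);
	up_read(&current->mm->mmap_sem);
	return ret;
}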
4.4 Async poll
Async poll enables applications to make use of the advantages of aio
completion queues for readiness notification, avoiding some of the
scalability limitations and quirks of traditional poll/select. Instead of
passing in an array of <fd, event> pairs, one prepares iocbs corresponding
to each such <fd, event> pair, and then submits these iocbs using io_submit
associating them with a completion queue. Notifications can now be obtained
by waiting for events on the completion queue. Unlike select/poll, one does
not need to rebuild the event set for every iteration of the event loop; the
application just has to resubmit iocbs for the events it has already reaped,
in case it needs to include them in the set again for the next poll wait.
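An event loop built on async poll would thus look roughly like the sketch
below (assuming an IOCB_CMD_POLL style opcode and an io_prep_poll() helper
along the lines of the patch's user library; ctx, fd and handle_readable
are presumed to be set up elsewhere):

	struct iocb cb;
	struct iocb *cbs[1] = { &cb };
	struct io_event ev;

	io_prep_poll(&cb, fd, POLLIN);
	io_submit(ctx, 1, cbs);		/* arm the poll */

	for (;;) {
		io_getevents(ctx, 1, 1, &ev, NULL);
		handle_readable(fd);	/* service the ready fd */
		io_submit(ctx, 1, cbs);	/* re-arm for the next wait */
	}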
The implementation is a simple extension of the existing poll/select code,
which associates an iocb with a poll table structure and replaces the
synchronous wait on a poll table entry with an asynchronous completion
sequence (using a wait queue function + worktodo construct) that issues
aio_complete for the corresponding iocb, thus effecting the notification.
4.5 Raw disk aio
The internal async kvec f_ops for raw disk i/o are implemented along the
lines of pattern B discussed in Sec 4.1.2. The common raw_rw_kvec routine
invokes brw_kvec_async, which shoots out all the i/o pieces to the low level
block layer, and sets up the block i/o completion callbacks to take care
of invoking the kvec_cb callback when all the pieces are done. The kvec_cb
callback takes care of issuing aio_complete for completion notification.
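The completion counting at the heart of this can be illustrated as follows
(structure and field names here are illustrative, not the patch's actual
brw_cb layout; serialization of the result folding, which in reality runs
in interrupt context, is glossed over):

struct brw_done {
	atomic_t	io_count;	/* pieces still in flight */
	struct kvec_cb	cb;		/* higher level callback */
	ssize_t		res;		/* accumulated byte count */
};

static void brw_piece_done(struct brw_done *d, int uptodate, int bytes)
{
	if (!uptodate)
		d->res = -EIO;		/* simplified error folding */
	else if (d->res >= 0)
		d->res += bytes;

	/* The last piece to finish invokes the kvec_cb callback, which
	 * in turn issues aio_complete() for the notification */
	if (atomic_dec_and_test(&d->io_count))
		d->cb.fn(d->cb.data, d->cb.vec, d->res);
}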
TBD/Todo:
There is one problem with the implementation today, in that if the
submit_bh/bio operation used by brw_kvec_async blocks waiting for
request queue slots to become free, then it blocks the caller, so the
operation wouldn't be truly async in that case. Fixing this is one of the
items on the current Todo list. For example, instead of the synchronous
request wait, a non-blocking option supplemented with an async waiter for
request queue slots, which in turn drives the corresponding i/o
once requests become available (using state machine steps along the lines
employed for file i/o), could be considered.
[Note/Todo: In the aio patches based off 2.4, brw_kvec_async sets up
buffer heads and keeps track of the io_count and the list of bhs (in
a brw_cb structure, which also embeds the kvec_cb structure) in order
to determine when all the pieces are done. In 2.5, it would allocate
a bio struct to represent the entire i/o unless the size exceeds the
maximum request size allowed, in which case multiple bios may need to be
allocated. The bio struct could be set up to directly point to the veclet
list in the kvec, avoiding the need to copy/translate descriptors in the
process]
4.6 Filesystem/buffered aio
The generic file kvec f_ops (generic_file_kvec_read/write) for buffered
i/o on filesystems employ a state machine that can be considered close to
pattern A (with a mix of pattern B) discussed in Sec 4.1.2. The state
information required through all the iterative steps of this state machine
is maintained in an iodesc structure that is set up at the beginning, and
passed along as context data for the worktodo actions.
The operation first maps the page cache pages corresponding to the specified
range. These would form the source/target of the i/o operation. It maintains
a list of these pages, as well as the kvec information representing the user
buffer from/to which the transfer has to happen, as part of the iodesc
structure, together with pointers or state information describing how much
of the transfer has completed (i/o to/from the page cache pages, and the
memcopy to/from the user buffer veclets). In the case of a read, the post
processing action on completion of i/o on a particular page involves
copying the data into the user space buffer, while for a write, the copy
from the user space buffer to the page happens earlier, before committing
the writeout of the page (i.e. between prepare_write and commit_write).
Notice that the potential blocking points down the typical read/write path
involve:
(a) Waiting to acquire locks on the concerned pages (page cache pages
corresponding to the range where i/o is requested) before starting i/o
(b) Waiting for the io to complete:
- for read, this involves waiting for the page locks again
(indicative that the page lock has been released after i/o
completion), and then checking if the page is now uptodate
- for write (O_SYNC case), this involves waiting for the page
buffers, i.e. waiting for the writeout to complete.
[TBD: Currently it's really only O_DSYNC, and not meta-data sync, that's
affected]
Each of these waits has been converted to an async wait (wtd_wait_on_page
and wtd_wait_on_buffer) operation that triggers the next step of the i/o
(i.e. as in pattern A). Notice that this becomes multi-step when the
i/o involves multiple pages and any of the lock acquisitions is expected to
require a wait. Some speedup is achieved by initiating as much work as
possible early, e.g. issuing as many readpage operations as possible early
on the read path, and initiating all the writeouts together down the write
path before waiting for completion of any (this is where the resemblance to
pattern B comes in).
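A skeleton of one such state machine step on the read path might look as
follows (a sketch only: the iodesc fields and helper usage are illustrative
of the pattern, not the patch's exact code):

static void aio_read_next_page(void *data)
{
	struct iodesc *io = data;
	struct page *page = io->pages[io->cur];

	if (TryLockPage(page)) {
		/* Would block: queue an async wait that re-enters this
		 * function when the page is unlocked (pattern A) */
		wtd_wait_on_page(&io->wtd, page);
		return;
	}

	if (!Page_Uptodate(page))
		page->mapping->a_ops->readpage(io->file, page);

	/* ... on i/o completion, copy the data into the user kvec, then
	 * either advance io->cur and run the next step, or issue
	 * aio_complete() once the last page is done ... */
}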
Currently the filesystems modified to support aio include ext2, ext3 and
nfs (nfs kvec f_ops internally make use of the generic_file_kvec* operations,
after calling nfs_revalidate_inode).
[Note/Todo: There is still some work to do to make the steps non-blocking.
The bmap/extent determination operations performed by the filesystem
are blocking, and the acquisition of the inode semaphore also needs to
be converted to a wtd based operation]
4.7 Network aio
[Todo: To be added later since the code is under a rewrite - pattern C
in 4.1.2 ? ]
4.8 Extending aio to other operations (e.g. sendfile)
[Todo/Plan:
The idea here is to make use of the kvec callbacks to kick the operation
into the next state, i.e. on completion of input from the source fd, trigger
the i/o to the output fd.
]
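For the record, a sketch of the intended chaining (all names hypothetical):

static void sendfile_read_done(void *data, struct kvec *vec, ssize_t res)
{
	struct sendfile_state *st = data;
	struct kvec_cb cb;

	if (res <= 0) {
		aio_complete(st->iocb, res, 0);
		return;
	}
	/* Stage 2: push the data just read out to the destination fd;
	 * sendfile_write_done() issues the final aio_complete() */
	cb.vec = vec;
	cb.fn = sendfile_write_done;
	cb.data = st;
	st->dst_file->f_op->kvec_write(st->dst_file, cb, res, st->dst_pos);
}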
5. Performance Characteristics
[Todo: Research/Inputs required]
6. Todo Items/Pending Issues
- aio fsync
- aio sendfile
- direct aio path (reorder vfs paths to have a single rw_kvec interface from
fs when it really needs to do i/o)
- aio readv/writev
- i/o cancellation implementation (best effort; cancel i/os on process exit ?)
- io_wait implementation (needs hashed waitqueues)
- check for any races in current filesystem implementation (?)
- implementations for other filesystems
- network aio rewrite
- in-kernel signal delivery mechanism for aio requests
- making sub-tasks truly async (waiting for request slots, bmap calls)
- debugging aids to help detect drivers which aren't totally async (e.g.
those that use semaphores - need to check which) or other sub-tasks which
aren't truly async
- flow control in aio (address the write throttling issue)
- implementing io_queue_grow (changing queue lengths)
- mmaped ring buffer (Could lockless approaches be more fragile than
we foresee now ? Is it worth it ? How much does it save ?)
- kernel memory pinning issue (pinning user buffers too early ? may be
able to improve this with cross-memory descriptors once aio flow control
is in place)
- explore at-least-N
- explore io_submit_wait
- aio request priorities (get the basic scheme in place, later relate it
to the priority based i/o scheduler when that happens)
- user space grouping of multiple completion queues (handling priorities,
concurrency control etc; expose wait-queue primitives to userspace)
- interfacing with generic event namespace (pollfs) approach (viro's idea)
7. References/Related patches:
1. Dan Kegel's c10k site: (http://www.kegel.com/c10k.html)
Talks about the /dev/epoll patch, RT signals, Signal-per-fd approach,
BSD kqueues and lots of links and discussions on various programming
models for handling large numbers of clients/connections, with
comparative studies.
2. NT I/O completion ports, Solaris and AIX aio, POSIX aio specs
3. SGI's kaio implementation - (http://oss.sgi.com/projects/kaio)
4. Block Asynchronous I/O: A Flexible Infrastructure for User Level Filesystems
- Muthian Sivathanu, Venkateshwaran Venkataramani, and Remzi H.
Arpaci-Dusseau, University of Wisconsin-Madison
(http://www.cs.wisc.edu/~muthian/baio-paper.pdf)
5. The Direct Access File System Protocol & API Specifications
- DAFS Collaborative (http://www.dafscollaborative.org)
6. 2.5 block i/o design notes - (http://lse.sourceforge.net/io/bionotes.txt)