diff options
author | Jonathan Corbet <corbet@lwn.net> | 2016-11-30 17:46:23 -0700 |
---|---|---|
committer | Jonathan Corbet <corbet@lwn.net> | 2016-11-30 17:46:23 -0700 |
commit | 2fc9254d0c473e26a668d8efc75dd71146a07852 (patch) | |
tree | 70a341f1fe80735c59003f682da0bfef5d1e162e /Documentation/core-api | |
parent | c3396656666c2c98db306dc91c480d9fbde35cc9 (diff) | |
parent | 326bc876fed4fad2f46dd1be637e90e1b525ee8c (diff) | |
download | linux-2fc9254d0c473e26a668d8efc75dd71146a07852.tar.bz2 |
Merge branch 'silvio' into docs-next
Diffstat (limited to 'Documentation/core-api')
-rw-r--r-- | Documentation/core-api/assoc_array.rst | 551 | ||||
-rw-r--r-- | Documentation/core-api/atomic_ops.rst | 658 | ||||
-rw-r--r-- | Documentation/core-api/index.rst | 3 | ||||
-rw-r--r-- | Documentation/core-api/local_ops.rst | 206 |
4 files changed, 1418 insertions, 0 deletions
diff --git a/Documentation/core-api/assoc_array.rst b/Documentation/core-api/assoc_array.rst new file mode 100644 index 000000000000..dcda7c623cec --- /dev/null +++ b/Documentation/core-api/assoc_array.rst @@ -0,0 +1,551 @@ +======================================== +Generic Associative Array Implementation +======================================== + +Overview +======== + +This associative array implementation is an object container with the following +properties: + +1. Objects are opaque pointers. The implementation does not care where they + point (if anywhere) or what they point to (if anything). +.. note:: Pointers to objects _must_ be zero in the least significant bit.** + +2. Objects do not need to contain linkage blocks for use by the array. This + permits an object to be located in multiple arrays simultaneously. + Rather, the array is made up of metadata blocks that point to objects. + +3. Objects require index keys to locate them within the array. + +4. Index keys must be unique. Inserting an object with the same key as one + already in the array will replace the old object. + +5. Index keys can be of any length and can be of different lengths. + +6. Index keys should encode the length early on, before any variation due to + length is seen. + +7. Index keys can include a hash to scatter objects throughout the array. + +8. The array can iterated over. The objects will not necessarily come out in + key order. + +9. The array can be iterated over whilst it is being modified, provided the + RCU readlock is being held by the iterator. Note, however, under these + circumstances, some objects may be seen more than once. If this is a + problem, the iterator should lock against modification. Objects will not + be missed, however, unless deleted. + +10. Objects in the array can be looked up by means of their index key. + +11. Objects can be looked up whilst the array is being modified, provided the + RCU readlock is being held by the thread doing the look up. + +The implementation uses a tree of 16-pointer nodes internally that are indexed +on each level by nibbles from the index key in the same manner as in a radix +tree. To improve memory efficiency, shortcuts can be emplaced to skip over +what would otherwise be a series of single-occupancy nodes. Further, nodes +pack leaf object pointers into spare space in the node rather than making an +extra branch until as such time an object needs to be added to a full node. + + +The Public API +============== + +The public API can be found in ``<linux/assoc_array.h>``. The associative +array is rooted on the following structure:: + + struct assoc_array { + ... + }; + +The code is selected by enabling ``CONFIG_ASSOCIATIVE_ARRAY`` with:: + + ./script/config -e ASSOCIATIVE_ARRAY + + +Edit Script +----------- + +The insertion and deletion functions produce an 'edit script' that can later be +applied to effect the changes without risking ``ENOMEM``. This retains the +preallocated metadata blocks that will be installed in the internal tree and +keeps track of the metadata blocks that will be removed from the tree when the +script is applied. + +This is also used to keep track of dead blocks and dead objects after the +script has been applied so that they can be freed later. The freeing is done +after an RCU grace period has passed - thus allowing access functions to +proceed under the RCU read lock. + +The script appears as outside of the API as a pointer of the type:: + + struct assoc_array_edit; + +There are two functions for dealing with the script: + +1. Apply an edit script:: + + void assoc_array_apply_edit(struct assoc_array_edit *edit); + +This will perform the edit functions, interpolating various write barriers +to permit accesses under the RCU read lock to continue. The edit script +will then be passed to ``call_rcu()`` to free it and any dead stuff it points +to. + +2. Cancel an edit script:: + + void assoc_array_cancel_edit(struct assoc_array_edit *edit); + +This frees the edit script and all preallocated memory immediately. If +this was for insertion, the new object is _not_ released by this function, +but must rather be released by the caller. + +These functions are guaranteed not to fail. + + +Operations Table +---------------- + +Various functions take a table of operations:: + + struct assoc_array_ops { + ... + }; + +This points to a number of methods, all of which need to be provided: + +1. Get a chunk of index key from caller data:: + + unsigned long (*get_key_chunk)(const void *index_key, int level); + +This should return a chunk of caller-supplied index key starting at the +*bit* position given by the level argument. The level argument will be a +multiple of ``ASSOC_ARRAY_KEY_CHUNK_SIZE`` and the function should return +``ASSOC_ARRAY_KEY_CHUNK_SIZE bits``. No error is possible. + + +2. Get a chunk of an object's index key:: + + unsigned long (*get_object_key_chunk)(const void *object, int level); + +As the previous function, but gets its data from an object in the array +rather than from a caller-supplied index key. + + +3. See if this is the object we're looking for:: + + bool (*compare_object)(const void *object, const void *index_key); + +Compare the object against an index key and return ``true`` if it matches and +``false`` if it doesn't. + + +4. Diff the index keys of two objects:: + + int (*diff_objects)(const void *object, const void *index_key); + +Return the bit position at which the index key of the specified object +differs from the given index key or -1 if they are the same. + + +5. Free an object:: + + void (*free_object)(void *object); + +Free the specified object. Note that this may be called an RCU grace period +after ``assoc_array_apply_edit()`` was called, so ``synchronize_rcu()`` may be +necessary on module unloading. + + +Manipulation Functions +---------------------- + +There are a number of functions for manipulating an associative array: + +1. Initialise an associative array:: + + void assoc_array_init(struct assoc_array *array); + +This initialises the base structure for an associative array. It can't fail. + + +2. Insert/replace an object in an associative array:: + + struct assoc_array_edit * + assoc_array_insert(struct assoc_array *array, + const struct assoc_array_ops *ops, + const void *index_key, + void *object); + +This inserts the given object into the array. Note that the least +significant bit of the pointer must be zero as it's used to type-mark +pointers internally. + +If an object already exists for that key then it will be replaced with the +new object and the old one will be freed automatically. + +The ``index_key`` argument should hold index key information and is +passed to the methods in the ops table when they are called. + +This function makes no alteration to the array itself, but rather returns +an edit script that must be applied. ``-ENOMEM`` is returned in the case of +an out-of-memory error. + +The caller should lock exclusively against other modifiers of the array. + + +3. Delete an object from an associative array:: + + struct assoc_array_edit * + assoc_array_delete(struct assoc_array *array, + const struct assoc_array_ops *ops, + const void *index_key); + +This deletes an object that matches the specified data from the array. + +The ``index_key`` argument should hold index key information and is +passed to the methods in the ops table when they are called. + +This function makes no alteration to the array itself, but rather returns +an edit script that must be applied. ``-ENOMEM`` is returned in the case of +an out-of-memory error. ``NULL`` will be returned if the specified object is +not found within the array. + +The caller should lock exclusively against other modifiers of the array. + + +4. Delete all objects from an associative array:: + + struct assoc_array_edit * + assoc_array_clear(struct assoc_array *array, + const struct assoc_array_ops *ops); + +This deletes all the objects from an associative array and leaves it +completely empty. + +This function makes no alteration to the array itself, but rather returns +an edit script that must be applied. ``-ENOMEM`` is returned in the case of +an out-of-memory error. + +The caller should lock exclusively against other modifiers of the array. + + +5. Destroy an associative array, deleting all objects:: + + void assoc_array_destroy(struct assoc_array *array, + const struct assoc_array_ops *ops); + +This destroys the contents of the associative array and leaves it +completely empty. It is not permitted for another thread to be traversing +the array under the RCU read lock at the same time as this function is +destroying it as no RCU deferral is performed on memory release - +something that would require memory to be allocated. + +The caller should lock exclusively against other modifiers and accessors +of the array. + + +6. Garbage collect an associative array:: + + int assoc_array_gc(struct assoc_array *array, + const struct assoc_array_ops *ops, + bool (*iterator)(void *object, void *iterator_data), + void *iterator_data); + +This iterates over the objects in an associative array and passes each one to +``iterator()``. If ``iterator()`` returns ``true``, the object is kept. If it +returns ``false``, the object will be freed. If the ``iterator()`` function +returns ``true``, it must perform any appropriate refcount incrementing on the +object before returning. + +The internal tree will be packed down if possible as part of the iteration +to reduce the number of nodes in it. + +The ``iterator_data`` is passed directly to ``iterator()`` and is otherwise +ignored by the function. + +The function will return ``0`` if successful and ``-ENOMEM`` if there wasn't +enough memory. + +It is possible for other threads to iterate over or search the array under +the RCU read lock whilst this function is in progress. The caller should +lock exclusively against other modifiers of the array. + + +Access Functions +---------------- + +There are two functions for accessing an associative array: + +1. Iterate over all the objects in an associative array:: + + int assoc_array_iterate(const struct assoc_array *array, + int (*iterator)(const void *object, + void *iterator_data), + void *iterator_data); + +This passes each object in the array to the iterator callback function. +``iterator_data`` is private data for that function. + +This may be used on an array at the same time as the array is being +modified, provided the RCU read lock is held. Under such circumstances, +it is possible for the iteration function to see some objects twice. If +this is a problem, then modification should be locked against. The +iteration algorithm should not, however, miss any objects. + +The function will return ``0`` if no objects were in the array or else it will +return the result of the last iterator function called. Iteration stops +immediately if any call to the iteration function results in a non-zero +return. + + +2. Find an object in an associative array:: + + void *assoc_array_find(const struct assoc_array *array, + const struct assoc_array_ops *ops, + const void *index_key); + +This walks through the array's internal tree directly to the object +specified by the index key.. + +This may be used on an array at the same time as the array is being +modified, provided the RCU read lock is held. + +The function will return the object if found (and set ``*_type`` to the object +type) or will return ``NULL`` if the object was not found. + + +Index Key Form +-------------- + +The index key can be of any form, but since the algorithms aren't told how long +the key is, it is strongly recommended that the index key includes its length +very early on before any variation due to the length would have an effect on +comparisons. + +This will cause leaves with different length keys to scatter away from each +other - and those with the same length keys to cluster together. + +It is also recommended that the index key begin with a hash of the rest of the +key to maximise scattering throughout keyspace. + +The better the scattering, the wider and lower the internal tree will be. + +Poor scattering isn't too much of a problem as there are shortcuts and nodes +can contain mixtures of leaves and metadata pointers. + +The index key is read in chunks of machine word. Each chunk is subdivided into +one nibble (4 bits) per level, so on a 32-bit CPU this is good for 8 levels and +on a 64-bit CPU, 16 levels. Unless the scattering is really poor, it is +unlikely that more than one word of any particular index key will have to be +used. + + +Internal Workings +================= + +The associative array data structure has an internal tree. This tree is +constructed of two types of metadata blocks: nodes and shortcuts. + +A node is an array of slots. Each slot can contain one of four things: + +* A NULL pointer, indicating that the slot is empty. +* A pointer to an object (a leaf). +* A pointer to a node at the next level. +* A pointer to a shortcut. + + +Basic Internal Tree Layout +-------------------------- + +Ignoring shortcuts for the moment, the nodes form a multilevel tree. The index +key space is strictly subdivided by the nodes in the tree and nodes occur on +fixed levels. For example:: + + Level: 0 1 2 3 + =============== =============== =============== =============== + NODE D + NODE B NODE C +------>+---+ + +------>+---+ +------>+---+ | | 0 | + NODE A | | 0 | | | 0 | | +---+ + +---+ | +---+ | +---+ | : : + | 0 | | : : | : : | +---+ + +---+ | +---+ | +---+ | | f | + | 1 |---+ | 3 |---+ | 7 |---+ +---+ + +---+ +---+ +---+ + : : : : | 8 |---+ + +---+ +---+ +---+ | NODE E + | e |---+ | f | : : +------>+---+ + +---+ | +---+ +---+ | 0 | + | f | | | f | +---+ + +---+ | +---+ : : + | NODE F +---+ + +------>+---+ | f | + | 0 | NODE G +---+ + +---+ +------>+---+ + : : | | 0 | + +---+ | +---+ + | 6 |---+ : : + +---+ +---+ + : : | f | + +---+ +---+ + | f | + +---+ + +In the above example, there are 7 nodes (A-G), each with 16 slots (0-f). +Assuming no other meta data nodes in the tree, the key space is divided +thusly:: + + KEY PREFIX NODE + ========== ==== + 137* D + 138* E + 13[0-69-f]* C + 1[0-24-f]* B + e6* G + e[0-57-f]* F + [02-df]* A + +So, for instance, keys with the following example index keys will be found in +the appropriate nodes:: + + INDEX KEY PREFIX NODE + =============== ======= ==== + 13694892892489 13 C + 13795289025897 137 D + 13889dde88793 138 E + 138bbb89003093 138 E + 1394879524789 12 C + 1458952489 1 B + 9431809de993ba - A + b4542910809cd - A + e5284310def98 e F + e68428974237 e6 G + e7fffcbd443 e F + f3842239082 - A + +To save memory, if a node can hold all the leaves in its portion of keyspace, +then the node will have all those leaves in it and will not have any metadata +pointers - even if some of those leaves would like to be in the same slot. + +A node can contain a heterogeneous mix of leaves and metadata pointers. +Metadata pointers must be in the slots that match their subdivisions of key +space. The leaves can be in any slot not occupied by a metadata pointer. It +is guaranteed that none of the leaves in a node will match a slot occupied by a +metadata pointer. If the metadata pointer is there, any leaf whose key matches +the metadata key prefix must be in the subtree that the metadata pointer points +to. + +In the above example list of index keys, node A will contain:: + + SLOT CONTENT INDEX KEY (PREFIX) + ==== =============== ================== + 1 PTR TO NODE B 1* + any LEAF 9431809de993ba + any LEAF b4542910809cd + e PTR TO NODE F e* + any LEAF f3842239082 + +and node B:: + + 3 PTR TO NODE C 13* + any LEAF 1458952489 + + +Shortcuts +--------- + +Shortcuts are metadata records that jump over a piece of keyspace. A shortcut +is a replacement for a series of single-occupancy nodes ascending through the +levels. Shortcuts exist to save memory and to speed up traversal. + +It is possible for the root of the tree to be a shortcut - say, for example, +the tree contains at least 17 nodes all with key prefix ``1111``. The +insertion algorithm will insert a shortcut to skip over the ``1111`` keyspace +in a single bound and get to the fourth level where these actually become +different. + + +Splitting And Collapsing Nodes +------------------------------ + +Each node has a maximum capacity of 16 leaves and metadata pointers. If the +insertion algorithm finds that it is trying to insert a 17th object into a +node, that node will be split such that at least two leaves that have a common +key segment at that level end up in a separate node rooted on that slot for +that common key segment. + +If the leaves in a full node and the leaf that is being inserted are +sufficiently similar, then a shortcut will be inserted into the tree. + +When the number of objects in the subtree rooted at a node falls to 16 or +fewer, then the subtree will be collapsed down to a single node - and this will +ripple towards the root if possible. + + +Non-Recursive Iteration +----------------------- + +Each node and shortcut contains a back pointer to its parent and the number of +slot in that parent that points to it. None-recursive iteration uses these to +proceed rootwards through the tree, going to the parent node, slot N + 1 to +make sure progress is made without the need for a stack. + +The backpointers, however, make simultaneous alteration and iteration tricky. + + +Simultaneous Alteration And Iteration +------------------------------------- + +There are a number of cases to consider: + +1. Simple insert/replace. This involves simply replacing a NULL or old + matching leaf pointer with the pointer to the new leaf after a barrier. + The metadata blocks don't change otherwise. An old leaf won't be freed + until after the RCU grace period. + +2. Simple delete. This involves just clearing an old matching leaf. The + metadata blocks don't change otherwise. The old leaf won't be freed until + after the RCU grace period. + +3. Insertion replacing part of a subtree that we haven't yet entered. This + may involve replacement of part of that subtree - but that won't affect + the iteration as we won't have reached the pointer to it yet and the + ancestry blocks are not replaced (the layout of those does not change). + +4. Insertion replacing nodes that we're actively processing. This isn't a + problem as we've passed the anchoring pointer and won't switch onto the + new layout until we follow the back pointers - at which point we've + already examined the leaves in the replaced node (we iterate over all the + leaves in a node before following any of its metadata pointers). + + We might, however, re-see some leaves that have been split out into a new + branch that's in a slot further along than we were at. + +5. Insertion replacing nodes that we're processing a dependent branch of. + This won't affect us until we follow the back pointers. Similar to (4). + +6. Deletion collapsing a branch under us. This doesn't affect us because the + back pointers will get us back to the parent of the new node before we + could see the new node. The entire collapsed subtree is thrown away + unchanged - and will still be rooted on the same slot, so we shouldn't + process it a second time as we'll go back to slot + 1. + +.. note:: + + Under some circumstances, we need to simultaneously change the parent + pointer and the parent slot pointer on a node (say, for example, we + inserted another node before it and moved it up a level). We cannot do + this without locking against a read - so we have to replace that node too. + + However, when we're changing a shortcut into a node this isn't a problem + as shortcuts only have one slot and so the parent slot number isn't used + when traversing backwards over one. This means that it's okay to change + the slot number first - provided suitable barriers are used to make sure + the parent slot number is read after the back pointer. + +Obsolete blocks and leaves are freed up after an RCU grace period has passed, +so as long as anyone doing walking or iteration holds the RCU read lock, the +old superstructure should not go away on them. diff --git a/Documentation/core-api/atomic_ops.rst b/Documentation/core-api/atomic_ops.rst new file mode 100644 index 000000000000..55e43f1c80de --- /dev/null +++ b/Documentation/core-api/atomic_ops.rst @@ -0,0 +1,658 @@ +======================================================= +Semantics and Behavior of Atomic and Bitmask Operations +======================================================= + +:Author: David S. Miller + +This document is intended to serve as a guide to Linux port +maintainers on how to implement atomic counter, bitops, and spinlock +interfaces properly. + +Atomic Type And Operations +========================== + +The atomic_t type should be defined as a signed integer and +the atomic_long_t type as a signed long integer. Also, they should +be made opaque such that any kind of cast to a normal C integer type +will fail. Something like the following should suffice:: + + typedef struct { int counter; } atomic_t; + typedef struct { long counter; } atomic_long_t; + +Historically, counter has been declared volatile. This is now discouraged. +See :ref:`Documentation/process/volatile-considered-harmful.rst +<volatile_considered_harmful>` for the complete rationale. + +local_t is very similar to atomic_t. If the counter is per CPU and only +updated by one CPU, local_t is probably more appropriate. Please see +:ref:`Documentation/core-api/local_ops.rst <local_ops>` for the semantics of +local_t. + +The first operations to implement for atomic_t's are the initializers and +plain reads. :: + + #define ATOMIC_INIT(i) { (i) } + #define atomic_set(v, i) ((v)->counter = (i)) + +The first macro is used in definitions, such as:: + + static atomic_t my_counter = ATOMIC_INIT(1); + +The initializer is atomic in that the return values of the atomic operations +are guaranteed to be correct reflecting the initialized value if the +initializer is used before runtime. If the initializer is used at runtime, a +proper implicit or explicit read memory barrier is needed before reading the +value with atomic_read from another thread. + +As with all of the ``atomic_`` interfaces, replace the leading ``atomic_`` +with ``atomic_long_`` to operate on atomic_long_t. + +The second interface can be used at runtime, as in:: + + struct foo { atomic_t counter; }; + ... + + struct foo *k; + + k = kmalloc(sizeof(*k), GFP_KERNEL); + if (!k) + return -ENOMEM; + atomic_set(&k->counter, 0); + +The setting is atomic in that the return values of the atomic operations by +all threads are guaranteed to be correct reflecting either the value that has +been set with this operation or set with another operation. A proper implicit +or explicit memory barrier is needed before the value set with the operation +is guaranteed to be readable with atomic_read from another thread. + +Next, we have:: + + #define atomic_read(v) ((v)->counter) + +which simply reads the counter value currently visible to the calling thread. +The read is atomic in that the return value is guaranteed to be one of the +values initialized or modified with the interface operations if a proper +implicit or explicit memory barrier is used after possible runtime +initialization by any other thread and the value is modified only with the +interface operations. atomic_read does not guarantee that the runtime +initialization by any other thread is visible yet, so the user of the +interface must take care of that with a proper implicit or explicit memory +barrier. + +.. warning:: + + ``atomic_read()`` and ``atomic_set()`` DO NOT IMPLY BARRIERS! + + Some architectures may choose to use the volatile keyword, barriers, or + inline assembly to guarantee some degree of immediacy for atomic_read() + and atomic_set(). This is not uniformly guaranteed, and may change in + the future, so all users of atomic_t should treat atomic_read() and + atomic_set() as simple C statements that may be reordered or optimized + away entirely by the compiler or processor, and explicitly invoke the + appropriate compiler and/or memory barrier for each use case. Failure + to do so will result in code that may suddenly break when used with + different architectures or compiler optimizations, or even changes in + unrelated code which changes how the compiler optimizes the section + accessing atomic_t variables. + +Properly aligned pointers, longs, ints, and chars (and unsigned +equivalents) may be atomically loaded from and stored to in the same +sense as described for atomic_read() and atomic_set(). The READ_ONCE() +and WRITE_ONCE() macros should be used to prevent the compiler from using +optimizations that might otherwise optimize accesses out of existence on +the one hand, or that might create unsolicited accesses on the other. + +For example consider the following code:: + + while (a > 0) + do_something(); + +If the compiler can prove that do_something() does not store to the +variable a, then the compiler is within its rights transforming this to +the following:: + + tmp = a; + if (a > 0) + for (;;) + do_something(); + +If you don't want the compiler to do this (and you probably don't), then +you should use something like the following:: + + while (READ_ONCE(a) < 0) + do_something(); + +Alternatively, you could place a barrier() call in the loop. + +For another example, consider the following code:: + + tmp_a = a; + do_something_with(tmp_a); + do_something_else_with(tmp_a); + +If the compiler can prove that do_something_with() does not store to the +variable a, then the compiler is within its rights to manufacture an +additional load as follows:: + + tmp_a = a; + do_something_with(tmp_a); + tmp_a = a; + do_something_else_with(tmp_a); + +This could fatally confuse your code if it expected the same value +to be passed to do_something_with() and do_something_else_with(). + +The compiler would be likely to manufacture this additional load if +do_something_with() was an inline function that made very heavy use +of registers: reloading from variable a could save a flush to the +stack and later reload. To prevent the compiler from attacking your +code in this manner, write the following:: + + tmp_a = READ_ONCE(a); + do_something_with(tmp_a); + do_something_else_with(tmp_a); + +For a final example, consider the following code, assuming that the +variable a is set at boot time before the second CPU is brought online +and never changed later, so that memory barriers are not needed:: + + if (a) + b = 9; + else + b = 42; + +The compiler is within its rights to manufacture an additional store +by transforming the above code into the following:: + + b = 42; + if (a) + b = 9; + +This could come as a fatal surprise to other code running concurrently +that expected b to never have the value 42 if a was zero. To prevent +the compiler from doing this, write something like:: + + if (a) + WRITE_ONCE(b, 9); + else + WRITE_ONCE(b, 42); + +Don't even -think- about doing this without proper use of memory barriers, +locks, or atomic operations if variable a can change at runtime! + +.. warning:: + + ``READ_ONCE()`` OR ``WRITE_ONCE()`` DO NOT IMPLY A BARRIER! + +Now, we move onto the atomic operation interfaces typically implemented with +the help of assembly code. :: + + void atomic_add(int i, atomic_t *v); + void atomic_sub(int i, atomic_t *v); + void atomic_inc(atomic_t *v); + void atomic_dec(atomic_t *v); + +These four routines add and subtract integral values to/from the given +atomic_t value. The first two routines pass explicit integers by +which to make the adjustment, whereas the latter two use an implicit +adjustment value of "1". + +One very important aspect of these two routines is that they DO NOT +require any explicit memory barriers. They need only perform the +atomic_t counter update in an SMP safe manner. + +Next, we have:: + + int atomic_inc_return(atomic_t *v); + int atomic_dec_return(atomic_t *v); + +These routines add 1 and subtract 1, respectively, from the given +atomic_t and return the new counter value after the operation is +performed. + +Unlike the above routines, it is required that these primitives +include explicit memory barriers that are performed before and after +the operation. It must be done such that all memory operations before +and after the atomic operation calls are strongly ordered with respect +to the atomic operation itself. + +For example, it should behave as if a smp_mb() call existed both +before and after the atomic operation. + +If the atomic instructions used in an implementation provide explicit +memory barrier semantics which satisfy the above requirements, that is +fine as well. + +Let's move on:: + + int atomic_add_return(int i, atomic_t *v); + int atomic_sub_return(int i, atomic_t *v); + +These behave just like atomic_{inc,dec}_return() except that an +explicit counter adjustment is given instead of the implicit "1". +This means that like atomic_{inc,dec}_return(), the memory barrier +semantics are required. + +Next:: + + int atomic_inc_and_test(atomic_t *v); + int atomic_dec_and_test(atomic_t *v); + +These two routines increment and decrement by 1, respectively, the +given atomic counter. They return a boolean indicating whether the +resulting counter value was zero or not. + +Again, these primitives provide explicit memory barrier semantics around +the atomic operation:: + + int atomic_sub_and_test(int i, atomic_t *v); + +This is identical to atomic_dec_and_test() except that an explicit +decrement is given instead of the implicit "1". This primitive must +provide explicit memory barrier semantics around the operation:: + + int atomic_add_negative(int i, atomic_t *v); + +The given increment is added to the given atomic counter value. A boolean +is return which indicates whether the resulting counter value is negative. +This primitive must provide explicit memory barrier semantics around +the operation. + +Then:: + + int atomic_xchg(atomic_t *v, int new); + +This performs an atomic exchange operation on the atomic variable v, setting +the given new value. It returns the old value that the atomic variable v had +just before the operation. + +atomic_xchg must provide explicit memory barriers around the operation. :: + + int atomic_cmpxchg(atomic_t *v, int old, int new); + +This performs an atomic compare exchange operation on the atomic value v, +with the given old and new values. Like all atomic_xxx operations, +atomic_cmpxchg will only satisfy its atomicity semantics as long as all +other accesses of \*v are performed through atomic_xxx operations. + +atomic_cmpxchg must provide explicit memory barriers around the operation, +although if the comparison fails then no memory ordering guarantees are +required. + +The semantics for atomic_cmpxchg are the same as those defined for 'cas' +below. + +Finally:: + + int atomic_add_unless(atomic_t *v, int a, int u); + +If the atomic value v is not equal to u, this function adds a to v, and +returns non zero. If v is equal to u then it returns zero. This is done as +an atomic operation. + +atomic_add_unless must provide explicit memory barriers around the +operation unless it fails (returns 0). + +atomic_inc_not_zero, equivalent to atomic_add_unless(v, 1, 0) + + +If a caller requires memory barrier semantics around an atomic_t +operation which does not return a value, a set of interfaces are +defined which accomplish this:: + + void smp_mb__before_atomic(void); + void smp_mb__after_atomic(void); + +For example, smp_mb__before_atomic() can be used like so:: + + obj->dead = 1; + smp_mb__before_atomic(); + atomic_dec(&obj->ref_count); + +It makes sure that all memory operations preceding the atomic_dec() +call are strongly ordered with respect to the atomic counter +operation. In the above example, it guarantees that the assignment of +"1" to obj->dead will be globally visible to other cpus before the +atomic counter decrement. + +Without the explicit smp_mb__before_atomic() call, the +implementation could legally allow the atomic counter update visible +to other cpus before the "obj->dead = 1;" assignment. + +A missing memory barrier in the cases where they are required by the +atomic_t implementation above can have disastrous results. Here is +an example, which follows a pattern occurring frequently in the Linux +kernel. It is the use of atomic counters to implement reference +counting, and it works such that once the counter falls to zero it can +be guaranteed that no other entity can be accessing the object:: + + static void obj_list_add(struct obj *obj, struct list_head *head) + { + obj->active = 1; + list_add(&obj->list, head); + } + + static void obj_list_del(struct obj *obj) + { + list_del(&obj->list); + obj->active = 0; + } + + static void obj_destroy(struct obj *obj) + { + BUG_ON(obj->active); + kfree(obj); + } + + struct obj *obj_list_peek(struct list_head *head) + { + if (!list_empty(head)) { + struct obj *obj; + + obj = list_entry(head->next, struct obj, list); + atomic_inc(&obj->refcnt); + return obj; + } + return NULL; + } + + void obj_poke(void) + { + struct obj *obj; + + spin_lock(&global_list_lock); + obj = obj_list_peek(&global_list); + spin_unlock(&global_list_lock); + + if (obj) { + obj->ops->poke(obj); + if (atomic_dec_and_test(&obj->refcnt)) + obj_destroy(obj); + } + } + + void obj_timeout(struct obj *obj) + { + spin_lock(&global_list_lock); + obj_list_del(obj); + spin_unlock(&global_list_lock); + + if (atomic_dec_and_test(&obj->refcnt)) + obj_destroy(obj); + } + +.. note:: + + This is a simplification of the ARP queue management in the generic + neighbour discover code of the networking. Olaf Kirch found a bug wrt. + memory barriers in kfree_skb() that exposed the atomic_t memory barrier + requirements quite clearly. + +Given the above scheme, it must be the case that the obj->active +update done by the obj list deletion be visible to other processors +before the atomic counter decrement is performed. + +Otherwise, the counter could fall to zero, yet obj->active would still +be set, thus triggering the assertion in obj_destroy(). The error +sequence looks like this:: + + cpu 0 cpu 1 + obj_poke() obj_timeout() + obj = obj_list_peek(); + ... gains ref to obj, refcnt=2 + obj_list_del(obj); + obj->active = 0 ... + ... visibility delayed ... + atomic_dec_and_test() + ... refcnt drops to 1 ... + atomic_dec_and_test() + ... refcount drops to 0 ... + obj_destroy() + BUG() triggers since obj->active + still seen as one + obj->active update visibility occurs + +With the memory barrier semantics required of the atomic_t operations +which return values, the above sequence of memory visibility can never +happen. Specifically, in the above case the atomic_dec_and_test() +counter decrement would not become globally visible until the +obj->active update does. + +As a historical note, 32-bit Sparc used to only allow usage of +24-bits of its atomic_t type. This was because it used 8 bits +as a spinlock for SMP safety. Sparc32 lacked a "compare and swap" +type instruction. However, 32-bit Sparc has since been moved over +to a "hash table of spinlocks" scheme, that allows the full 32-bit +counter to be realized. Essentially, an array of spinlocks are +indexed into based upon the address of the atomic_t being operated +on, and that lock protects the atomic operation. Parisc uses the +same scheme. + +Another note is that the atomic_t operations returning values are +extremely slow on an old 386. + + +Atomic Bitmask +============== + +We will now cover the atomic bitmask operations. You will find that +their SMP and memory barrier semantics are similar in shape and scope +to the atomic_t ops above. + +Native atomic bit operations are defined to operate on objects aligned +to the size of an "unsigned long" C data type, and are least of that +size. The endianness of the bits within each "unsigned long" are the +native endianness of the cpu. :: + + void set_bit(unsigned long nr, volatile unsigned long *addr); + void clear_bit(unsigned long nr, volatile unsigned long *addr); + void change_bit(unsigned long nr, volatile unsigned long *addr); + +These routines set, clear, and change, respectively, the bit number +indicated by "nr" on the bit mask pointed to by "ADDR". + +They must execute atomically, yet there are no implicit memory barrier +semantics required of these interfaces. :: + + int test_and_set_bit(unsigned long nr, volatile unsigned long *addr); + int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr); + int test_and_change_bit(unsigned long nr, volatile unsigned long *addr); + +Like the above, except that these routines return a boolean which +indicates whether the changed bit was set _BEFORE_ the atomic bit +operation. + +WARNING! It is incredibly important that the value be a boolean, +ie. "0" or "1". Do not try to be fancy and save a few instructions by +declaring the above to return "long" and just returning something like +"old_val & mask" because that will not work. + +For one thing, this return value gets truncated to int in many code +paths using these interfaces, so on 64-bit if the bit is set in the +upper 32-bits then testers will never see that. + +One great example of where this problem crops up are the thread_info +flag operations. Routines such as test_and_set_ti_thread_flag() chop +the return value into an int. There are other places where things +like this occur as well. + +These routines, like the atomic_t counter operations returning values, +must provide explicit memory barrier semantics around their execution. +All memory operations before the atomic bit operation call must be +made visible globally before the atomic bit operation is made visible. +Likewise, the atomic bit operation must be visible globally before any +subsequent memory operation is made visible. For example:: + + obj->dead = 1; + if (test_and_set_bit(0, &obj->flags)) + /* ... */; + obj->killed = 1; + +The implementation of test_and_set_bit() must guarantee that +"obj->dead = 1;" is visible to cpus before the atomic memory operation +done by test_and_set_bit() becomes visible. Likewise, the atomic +memory operation done by test_and_set_bit() must become visible before +"obj->killed = 1;" is visible. + +Finally there is the basic operation:: + + int test_bit(unsigned long nr, __const__ volatile unsigned long *addr); + +Which returns a boolean indicating if bit "nr" is set in the bitmask +pointed to by "addr". + +If explicit memory barriers are required around {set,clear}_bit() (which do +not return a value, and thus does not need to provide memory barrier +semantics), two interfaces are provided:: + + void smp_mb__before_atomic(void); + void smp_mb__after_atomic(void); + +They are used as follows, and are akin to their atomic_t operation +brothers:: + + /* All memory operations before this call will + * be globally visible before the clear_bit(). + */ + smp_mb__before_atomic(); + clear_bit( ... ); + + /* The clear_bit() will be visible before all + * subsequent memory operations. + */ + smp_mb__after_atomic(); + +There are two special bitops with lock barrier semantics (acquire/release, +same as spinlocks). These operate in the same way as their non-_lock/unlock +postfixed variants, except that they are to provide acquire/release semantics, +respectively. This means they can be used for bit_spin_trylock and +bit_spin_unlock type operations without specifying any more barriers. :: + + int test_and_set_bit_lock(unsigned long nr, unsigned long *addr); + void clear_bit_unlock(unsigned long nr, unsigned long *addr); + void __clear_bit_unlock(unsigned long nr, unsigned long *addr); + +The __clear_bit_unlock version is non-atomic, however it still implements +unlock barrier semantics. This can be useful if the lock itself is protecting +the other bits in the word. + +Finally, there are non-atomic versions of the bitmask operations +provided. They are used in contexts where some other higher-level SMP +locking scheme is being used to protect the bitmask, and thus less +expensive non-atomic operations may be used in the implementation. +They have names similar to the above bitmask operation interfaces, +except that two underscores are prefixed to the interface name. :: + + void __set_bit(unsigned long nr, volatile unsigned long *addr); + void __clear_bit(unsigned long nr, volatile unsigned long *addr); + void __change_bit(unsigned long nr, volatile unsigned long *addr); + int __test_and_set_bit(unsigned long nr, volatile unsigned long *addr); + int __test_and_clear_bit(unsigned long nr, volatile unsigned long *addr); + int __test_and_change_bit(unsigned long nr, volatile unsigned long *addr); + +These non-atomic variants also do not require any special memory +barrier semantics. + +The routines xchg() and cmpxchg() must provide the same exact +memory-barrier semantics as the atomic and bit operations returning +values. + +.. note:: + + If someone wants to use xchg(), cmpxchg() and their variants, + linux/atomic.h should be included rather than asm/cmpxchg.h, unless the + code is in arch/* and can take care of itself. + +Spinlocks and rwlocks have memory barrier expectations as well. +The rule to follow is simple: + +1) When acquiring a lock, the implementation must make it globally + visible before any subsequent memory operation. + +2) When releasing a lock, the implementation must make it such that + all previous memory operations are globally visible before the + lock release. + +Which finally brings us to _atomic_dec_and_lock(). There is an +architecture-neutral version implemented in lib/dec_and_lock.c, +but most platforms will wish to optimize this in assembler. :: + + int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock); + +Atomically decrement the given counter, and if will drop to zero +atomically acquire the given spinlock and perform the decrement +of the counter to zero. If it does not drop to zero, do nothing +with the spinlock. + +It is actually pretty simple to get the memory barrier correct. +Simply satisfy the spinlock grab requirements, which is make +sure the spinlock operation is globally visible before any +subsequent memory operation. + +We can demonstrate this operation more clearly if we define +an abstract atomic operation:: + + long cas(long *mem, long old, long new); + +"cas" stands for "compare and swap". It atomically: + +1) Compares "old" with the value currently at "mem". +2) If they are equal, "new" is written to "mem". +3) Regardless, the current value at "mem" is returned. + +As an example usage, here is what an atomic counter update +might look like:: + + void example_atomic_inc(long *counter) + { + long old, new, ret; + + while (1) { + old = *counter; + new = old + 1; + + ret = cas(counter, old, new); + if (ret == old) + break; + } + } + +Let's use cas() in order to build a pseudo-C atomic_dec_and_lock():: + + int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock) + { + long old, new, ret; + int went_to_zero; + + went_to_zero = 0; + while (1) { + old = atomic_read(atomic); + new = old - 1; + if (new == 0) { + went_to_zero = 1; + spin_lock(lock); + } + ret = cas(atomic, old, new); + if (ret == old) + break; + if (went_to_zero) { + spin_unlock(lock); + went_to_zero = 0; + } + } + + return went_to_zero; + } + +Now, as far as memory barriers go, as long as spin_lock() +strictly orders all subsequent memory operations (including +the cas()) with respect to itself, things will be fine. + +Said another way, _atomic_dec_and_lock() must guarantee that +a counter dropping to zero is never made visible before the +spinlock being acquired. + +.. note:: + + Note that this also means that for the case where the counter is not + dropping to zero, there are no memory ordering requirements. diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index 91b3a010817a..2872ca1a52f1 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -11,6 +11,9 @@ Core utilities .. toctree:: :maxdepth: 1 + assoc_array + atomic_ops + local_ops workqueue Interfaces for kernel debugging diff --git a/Documentation/core-api/local_ops.rst b/Documentation/core-api/local_ops.rst new file mode 100644 index 000000000000..1062ddba62c7 --- /dev/null +++ b/Documentation/core-api/local_ops.rst @@ -0,0 +1,206 @@ + +.. _local_ops: + +================================================= +Semantics and Behavior of Local Atomic Operations +================================================= + +:Author: Mathieu Desnoyers + + +This document explains the purpose of the local atomic operations, how +to implement them for any given architecture and shows how they can be used +properly. It also stresses on the precautions that must be taken when reading +those local variables across CPUs when the order of memory writes matters. + +.. note:: + + Note that ``local_t`` based operations are not recommended for general + kernel use. Please use the ``this_cpu`` operations instead unless there is + really a special purpose. Most uses of ``local_t`` in the kernel have been + replaced by ``this_cpu`` operations. ``this_cpu`` operations combine the + relocation with the ``local_t`` like semantics in a single instruction and + yield more compact and faster executing code. + + +Purpose of local atomic operations +================================== + +Local atomic operations are meant to provide fast and highly reentrant per CPU +counters. They minimize the performance cost of standard atomic operations by +removing the LOCK prefix and memory barriers normally required to synchronize +across CPUs. + +Having fast per CPU atomic counters is interesting in many cases: it does not +require disabling interrupts to protect from interrupt handlers and it permits +coherent counters in NMI handlers. It is especially useful for tracing purposes +and for various performance monitoring counters. + +Local atomic operations only guarantee variable modification atomicity wrt the +CPU which owns the data. Therefore, care must taken to make sure that only one +CPU writes to the ``local_t`` data. This is done by using per cpu data and +making sure that we modify it from within a preemption safe context. It is +however permitted to read ``local_t`` data from any CPU: it will then appear to +be written out of order wrt other memory writes by the owner CPU. + + +Implementation for a given architecture +======================================= + +It can be done by slightly modifying the standard atomic operations: only +their UP variant must be kept. It typically means removing LOCK prefix (on +i386 and x86_64) and any SMP synchronization barrier. If the architecture does +not have a different behavior between SMP and UP, including +``asm-generic/local.h`` in your architecture's ``local.h`` is sufficient. + +The ``local_t`` type is defined as an opaque ``signed long`` by embedding an +``atomic_long_t`` inside a structure. This is made so a cast from this type to +a ``long`` fails. The definition looks like:: + + typedef struct { atomic_long_t a; } local_t; + + +Rules to follow when using local atomic operations +================================================== + +* Variables touched by local ops must be per cpu variables. +* *Only* the CPU owner of these variables must write to them. +* This CPU can use local ops from any context (process, irq, softirq, nmi, ...) + to update its ``local_t`` variables. +* Preemption (or interrupts) must be disabled when using local ops in + process context to make sure the process won't be migrated to a + different CPU between getting the per-cpu variable and doing the + actual local op. +* When using local ops in interrupt context, no special care must be + taken on a mainline kernel, since they will run on the local CPU with + preemption already disabled. I suggest, however, to explicitly + disable preemption anyway to make sure it will still work correctly on + -rt kernels. +* Reading the local cpu variable will provide the current copy of the + variable. +* Reads of these variables can be done from any CPU, because updates to + "``long``", aligned, variables are always atomic. Since no memory + synchronization is done by the writer CPU, an outdated copy of the + variable can be read when reading some *other* cpu's variables. + + +How to use local atomic operations +================================== + +:: + + #include <linux/percpu.h> + #include <asm/local.h> + + static DEFINE_PER_CPU(local_t, counters) = LOCAL_INIT(0); + + +Counting +======== + +Counting is done on all the bits of a signed long. + +In preemptible context, use ``get_cpu_var()`` and ``put_cpu_var()`` around +local atomic operations: it makes sure that preemption is disabled around write +access to the per cpu variable. For instance:: + + local_inc(&get_cpu_var(counters)); + put_cpu_var(counters); + +If you are already in a preemption-safe context, you can use +``this_cpu_ptr()`` instead:: + + local_inc(this_cpu_ptr(&counters)); + + + +Reading the counters +==================== + +Those local counters can be read from foreign CPUs to sum the count. Note that +the data seen by local_read across CPUs must be considered to be out of order +relatively to other memory writes happening on the CPU that owns the data:: + + long sum = 0; + for_each_online_cpu(cpu) + sum += local_read(&per_cpu(counters, cpu)); + +If you want to use a remote local_read to synchronize access to a resource +between CPUs, explicit ``smp_wmb()`` and ``smp_rmb()`` memory barriers must be used +respectively on the writer and the reader CPUs. It would be the case if you use +the ``local_t`` variable as a counter of bytes written in a buffer: there should +be a ``smp_wmb()`` between the buffer write and the counter increment and also a +``smp_rmb()`` between the counter read and the buffer read. + + +Here is a sample module which implements a basic per cpu counter using +``local.h``:: + + /* test-local.c + * + * Sample module for local.h usage. + */ + + + #include <asm/local.h> + #include <linux/module.h> + #include <linux/timer.h> + + static DEFINE_PER_CPU(local_t, counters) = LOCAL_INIT(0); + + static struct timer_list test_timer; + + /* IPI called on each CPU. */ + static void test_each(void *info) + { + /* Increment the counter from a non preemptible context */ + printk("Increment on cpu %d\n", smp_processor_id()); + local_inc(this_cpu_ptr(&counters)); + + /* This is what incrementing the variable would look like within a + * preemptible context (it disables preemption) : + * + * local_inc(&get_cpu_var(counters)); + * put_cpu_var(counters); + */ + } + + static void do_test_timer(unsigned long data) + { + int cpu; + + /* Increment the counters */ + on_each_cpu(test_each, NULL, 1); + /* Read all the counters */ + printk("Counters read from CPU %d\n", smp_processor_id()); + for_each_online_cpu(cpu) { + printk("Read : CPU %d, count %ld\n", cpu, + local_read(&per_cpu(counters, cpu))); + } + del_timer(&test_timer); + test_timer.expires = jiffies + 1000; + add_timer(&test_timer); + } + + static int __init test_init(void) + { + /* initialize the timer that will increment the counter */ + init_timer(&test_timer); + test_timer.function = do_test_timer; + test_timer.expires = jiffies + 1; + add_timer(&test_timer); + + return 0; + } + + static void __exit test_exit(void) + { + del_timer_sync(&test_timer); + } + + module_init(test_init); + module_exit(test_exit); + + MODULE_LICENSE("GPL"); + MODULE_AUTHOR("Mathieu Desnoyers"); + MODULE_DESCRIPTION("Local Atomic Ops"); |