Age  Commit message  Author  Files  Lines
2012-07-30  rbd: set snapc->seq only when refreshing header  Alex Elder  1  -8/+2
In rbd_header_add_snap() there is code to set snapc->seq to the just-added snapshot id. This is the only remnant left of the use of that field for recording which snapshot an rbd_dev was associated with. That functionality is no longer supported, so get rid of that final bit of code. Doing so means we never actually set snapc->seq any more. On the server, the snapshot context's sequence value represents the highest snapshot id ever issued for a particular rbd image. So we'll make it have that meaning here as well. To do so, set this value whenever the rbd header is (re-)read. That way it will always be consistent with the rest of the snapshot context we maintain. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30  rbd: preserve snapc->seq in rbd_header_set_snap()  Alex Elder  1  -11/+7
In rbd_header_set_snap(), there is logic to make the snap context's seq field get set to a particular snapshot id, or 0 if there is no snapshot for the rbd image. This seems to be an artifact of how the current snapshot id for an rbd_dev was recorded before the rbd_dev->snap_id field began to be used for that purpose. There's no need to update the value of snapc->seq here any more, so stop doing it. Tidy up a few local variables in that function while we're at it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30  rbd: don't use snapc->seq that way  Alex Elder  1  -14/+0
In what appears to be an artifact of a different way of encoding whether an rbd image maps a snapshot, __rbd_refresh_header() has code that arranges to update the seq value in an rbd image's snapshot context to point to the first entry in its snapshot array if that's where it was pointing initially. We now use rbd_dev->snap_id to record the snapshot id--using the special value CEPH_NOSNAP to indicate the rbd_dev is not mapping a snapshot at all. There is therefore no need to check for this case, nor to update the seq value, in __rbd_refresh_header(). Just preserve the seq value that rbd_read_header() provides (which, at the moment, is nothing). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30  rbd: send header version when notifying  Josh Durgin  1  -2/+5
Previously the original header version was sent. Now, we update it when the header changes. Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com> Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30  rbd: use reference counting for the snap context  Josh Durgin  1  -18/+18
This prevents a race between requests with a given snap context and header updates that free it. The osd client was already expecting the snap context to be reference counted, since it get()s it in ceph_osdc_build_request and put()s it when the request completes. Also remove the second down_read()/up_read() on header_rwsem in rbd_do_request, which wasn't actually preventing this race or protecting any other data. Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com> Reviewed-by: Alex Elder <elder@inktank.com>
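The get/put pattern described above can be illustrated with a minimal userspace sketch. It is not the kernel code: the struct, the function names, and the free counter are hypothetical, and a plain int is used for the count because the sketch is single-threaded (the kernel uses an atomic counter).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a snapshot context; not the kernel's struct. */
struct snap_ctx {
    int refs;            /* plain int: this sketch is single-threaded */
    unsigned long seq;
};

/* Counts frees so the lifetime is observable from outside. */
static int snap_ctx_frees;

/* Allocate a context holding one reference (the header's own). */
struct snap_ctx *snap_ctx_new(unsigned long seq)
{
    struct snap_ctx *sc = malloc(sizeof(*sc));

    if (!sc)
        return NULL;
    sc->refs = 1;
    sc->seq = seq;
    return sc;
}

/* Take an extra reference, e.g. for an in-flight request. */
struct snap_ctx *snap_ctx_get(struct snap_ctx *sc)
{
    assert(sc->refs > 0);
    sc->refs++;
    return sc;
}

/* Drop a reference; the context is freed only when the last one goes. */
void snap_ctx_put(struct snap_ctx *sc)
{
    assert(sc->refs > 0);
    if (--sc->refs == 0) {
        snap_ctx_frees++;
        free(sc);
    }
}
```

A request that took its own reference can keep reading the context after a header update has dropped the header's reference; the memory goes away only on the final put.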
2012-07-30  rbd: set image size when header is updated  Josh Durgin  1  -0/+1
The image may have been resized. Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com> Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30  rbd: expose the correct size of the device in sysfs  Josh Durgin  1  -3/+8
If an image was mapped to a snapshot, the size of the head version would be shown. Protect capacity with header_rwsem, since it may change. Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com> Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30  rbd: only reset capacity when pointing to head  Josh Durgin  1  -1/+6
Snapshots cannot be resized, and the new capacity of head should not be reflected by the snapshot. Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30  rbd: return errors for mapped but deleted snapshot  Josh Durgin  1  -2/+30
When a snapshot is deleted, the OSD will return ENOENT when reading from it. This is normally interpreted as a hole by rbd, which will return zeroes. To minimize the time in which this can happen, stop requests early when we are notified that our snapshot no longer exists. [elder@inktank.com: updated __rbd_init_snaps_header() logic] Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30  libceph: trivial fix for the incorrect debug output  Jiaju Zhang  1  -1/+1
This is a trivial fix for the debug output, as it is inconsistent with the function name so may confuse people when debugging. [elder@inktank.com: switched to use __func__] Signed-off-by: Jiaju Zhang <jjzhang@suse.de> Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30  ceph: fix potential double free  Alan Cox  1  -0/+1
We re-run the loop but we don't re-set the attrs pointer back to NULL. Signed-off-by: Alan Cox <alan@linux.intel.com> Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30  libceph: reset connection retry on successfully negotiation  Sage Weil  1  -0/+2
We exponentially back off when we encounter connection errors. If several errors accumulate, we will eventually wait ages before even trying to reconnect. Fix this by resetting the backoff counter after a successful negotiation/ connection with the remote node. Fixes ceph issue #2802. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>
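The effect of resetting the backoff counter can be sketched as follows. This is an illustration only: the struct, the function names, and the base and maximum delays are hypothetical values chosen for the example, not the ones libceph uses.

```c
#include <assert.h>

/* Hypothetical delays, in milliseconds, chosen for illustration. */
#define BASE_DELAY_MS 250u
#define MAX_DELAY_MS  300000u

struct conn_backoff {
    unsigned int delay_ms;   /* 0 means "no backoff in effect" */
};

/* Called after a connection error; returns how long to wait before retrying. */
unsigned int backoff_on_error(struct conn_backoff *b)
{
    if (b->delay_ms == 0)
        b->delay_ms = BASE_DELAY_MS;
    else if (b->delay_ms < MAX_DELAY_MS / 2)
        b->delay_ms *= 2;
    else
        b->delay_ms = MAX_DELAY_MS;

    assert(b->delay_ms <= MAX_DELAY_MS);
    return b->delay_ms;
}

/* Called after a successful negotiation: forget the accumulated delay. */
void backoff_on_success(struct conn_backoff *b)
{
    b->delay_ms = 0;
}
```

Without the reset in backoff_on_success(), a connection that had failed many times in the past would start its next retry cycle at the accumulated delay instead of at the base delay.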
2012-07-30  libceph: protect ceph_con_open() with mutex  Sage Weil  1  -0/+2
Take the con mutex while we are initiating a ceph open. This is necessary because the con may have previously been in use and then closed, which could result in a racing workqueue running con_work(). Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30  ceph: close old con before reopening on mds reconnect  Sage Weil  1  -0/+1
When we detect a mds session reset, close the old ceph_connection before reopening it. This ensures we clean up the old socket properly and keep the ceph_connection state correct. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-30  libceph: (re)initialize bio_iter on start of message receive  Sage Weil  1  -5/+6
Previously, we were opportunistically initializing the bio_iter if it appeared to be uninitialized in the middle of the read path. The problem is a sequence like:
- start reading message
- initialize bio_iter
- read half a message
- messenger fault, reconnect
- restart reading message
- ** bio_iter now non-NULL, not reinitialized **
- read past end of bio, crash
Instead, initialize the bio_iter unconditionally when we allocate/claim the message for read.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
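The idea of resetting the read position unconditionally at the start of each receive can be sketched in userspace. The struct and function names are hypothetical, and a simple byte cursor stands in for the bio iterator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical receive state; "cursor" stands in for the bio iterator. */
struct rx_msg {
    size_t data_len;   /* total payload bytes expected */
    size_t cursor;     /* bytes consumed so far */
};

/* Called whenever a message is allocated/claimed for reading.
 * The cursor is reset unconditionally, even if it looks initialised. */
void rx_start(struct rx_msg *m)
{
    m->cursor = 0;
}

/* Consume up to n bytes; never moves the cursor past the end of the data. */
size_t rx_read(struct rx_msg *m, size_t n)
{
    size_t left;

    assert(m->cursor <= m->data_len);
    left = m->data_len - m->cursor;
    if (n > left)
        n = left;
    m->cursor += n;
    return n;
}
```

With the unconditional reset, a receive that restarts after a fault begins again at offset zero, so the full message can be read without running past the end of the buffer.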
2012-07-30  libceph: resubmit linger ops when pg mapping changes  Sage Weil  1  -5/+21
The linger op registration (i.e., watch) modifies the object state. As such, the OSD will reply with success if it has already applied without doing the associated side-effects (setting up the watch session state). If we lose the ACK and resubmit, we will see success but the watch will not be correctly registered and we won't get notifies. To fix this, always resubmit the linger op with a new tid. We accomplish this by re-registering as a linger (i.e., 'registered') if we are not yet registered. Then the second loop will treat this just like a normal case of re-registering. This mirrors a similar fix on the userland ceph.git, commit 5dd68b95, and ceph bug #2796. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-30  libceph: fix mutex coverage for ceph_con_close  Sage Weil  1  -1/+7
Hold the mutex while twiddling all of the state bits to avoid possible races. While we're here, make note of why we cannot close the socket directly. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-30  libceph: report socket read/write error message  Sage Weil  1  -2/+6
We need to set error_msg to something useful before calling ceph_fault(); do so here for try_{read,write}(). This is more informative than libceph: osd0 192.168.106.220:6801 (null) Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-30  libceph: support crush tunables  Sage Weil  4  -7/+58
The server side recently added support for tuning some magic crush variables. Decode these variables if they are present, or use the default values if they are not present. Corresponds to ceph.git commit 89af369c25f274fe62ef730e5e8aad0c54f1e5a5. Signed-off-by: caleb miles <caleb.miles@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-30  libceph: move feature bits to separate header  Sage Weil  6  -22/+29
This is simply cleanup that will keep things more closely synced with the userland code. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-30  rbd: kill num_reply parameters  Alex Elder  1  -13/+6
Several functions include a num_reply parameter, but it is never used. Just get rid of it everywhere--it seems to be something that never got fully implemented. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30  rbd: option symbol renames  Alex Elder  1  -22/+22
Use the name "ceph_opts" consistently (rather than just "opt") for pointers to a ceph_options structure. Change the few spots that don't use "rbd_opts" for a rbd_options pointer to match the rest. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30  rbd: more symbol renames  Alex Elder  1  -26/+27
Rename variables named "obj" which represent object names so they're consistently named "object_name". Rename the "cls" and "method" parameters in rbd_req_sync_exec() to be "class_name" and "method_name", and make similar changes to the names of local variables in that function representing the lengths of those names. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30  rbd: rename some fields in struct rbd_dev  Alex Elder  1  -27/+28
An rbd image is not a single object, but a logical construct made up of an aggregation of objects. Rename some fields in struct rbd_dev, in hopes of reinforcing this.
    obj --> image_name
    obj_len --> image_name_len
    obj_md_name --> header_name
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30  rbd: use rbd_dev consistently  Alex Elder  1  -61/+64
Most variables that represent a struct rbd_device are named "rbd_dev", but in some cases "dev" is used instead. Change all the "dev" references so they use "rbd_dev" consistently, to make it clear from the name that we're working with an RBD device (as opposed to, for example, a struct device). Similarly, change the name of the "dev" field in struct rbd_notify_info to be "rbd_dev". Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30  rbd: dynamically allocate snapshot name  Alex Elder  1  -10/+16
There is no need to impose a small limit on the length of the snapshot name recorded for an rbd image in a struct rbd_dev. Remove the limitation by allocating space for the snapshot name dynamically. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30  rbd: dynamically allocate image name  Alex Elder  2  -16/+13
There is no need to impose a small limit on the length of the rbd image name recorded in a struct rbd_dev. Remove the limitation by allocating space for the image name dynamically. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30  rbd: dynamically allocate image header name  Alex Elder  1  -11/+20
There is no need to impose a small limit on the length of the header name recorded for an rbd image in a struct rbd_dev. Remove the limitation by allocating space for the header name dynamically. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30  rbd: dynamically allocate object prefix  Alex Elder  1  -8/+26
There is no need to impose a small limit on the length of the object prefix recorded for an rbd image in a struct rbd_image_header. Remove the limitation by allocating space for the object prefix dynamically. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30  rbd: dynamically allocate pool name  Alex Elder  1  -8/+19
There is no need to impose a small limit on the length of the pool name recorded for an rbd image in a struct rbd_device. Remove the limitation by allocating space for the pool name dynamically. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30  rbd: create pool_id device attribute  Alex Elder  2  -6/+22
Add an entry under /sys/bus/rbd/devices/<N>/ named "pool_id" that provides the id for the pool the rbd image is associated with. This is in addition to the pool name already provided. Rename the "poolid" field in struct rbd_device to be "pool_id". Update the documentation to reflect the addition of this new entry. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30  rbd: rename rbd_dev->block_name  Alex Elder  1  -6/+6
Each rbd image has a name that forms the basis of all data objects backing the device. Old (format 1) images refer to this name as the "block name," while new (format 2) images use the term "object prefix" for this. Change the field name in the in-core rbd image header structure to reflect the more modern usage. We intentionally keep the name "block_name" in the on-disk definition for format 1 image headers. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30  rbd: define dup_token()  Alex Elder  1  -0/+36
Define a new function dup_token(), to be used during argument parsing for making dynamically-allocated copies of tokens being parsed. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
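A userspace approximation of such a token-duplicating helper might look like the sketch below. It is not the kernel function: the name dup_token_sketch, the use of malloc() in place of a kernel allocator, and the choice of whitespace as the token delimiter are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Characters treated as token separators in this sketch (an assumption). */
static const char SPACES[] = " \f\n\r\t\v";

/*
 * Skip leading whitespace in *buf, copy the next token into freshly
 * allocated memory, and advance *buf past the token.  If lenp is not
 * NULL, the token length is stored there.  Returns NULL on allocation
 * failure; an empty string is returned if no token remains.
 */
char *dup_token_sketch(const char **buf, size_t *lenp)
{
    size_t len;
    char *dup;

    assert(buf && *buf);
    *buf += strspn(*buf, SPACES);    /* skip leading separators */
    len = strcspn(*buf, SPACES);     /* length of the token itself */

    dup = malloc(len + 1);
    if (!dup)
        return NULL;
    memcpy(dup, *buf, len);
    dup[len] = '\0';

    *buf += len;
    if (lenp)
        *lenp = len;
    return dup;
}
```

Because each token gets its own allocation sized to fit, the caller no longer needs fixed-size buffers for names, which is the direction the surrounding "dynamically allocate" commits take. The caller owns the returned buffer and must free it.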
2012-07-30  libceph: define ceph_extract_encoded_string()  Alex Elder  1  -0/+47
This adds a new utility routine which will return a dynamically- allocated buffer containing a string that has been decoded from ceph over-the-wire format. It also returns the length of the string if the address of a size variable is supplied to receive it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
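The behaviour described above can be sketched in userspace C. The sketch assumes the wire format is a 32-bit little-endian length followed by that many bytes; the function name, signature, and error handling here are hypothetical and are not the kernel routine's.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Decode one length-prefixed string from [*p, end).  On success, returns
 * a NUL-terminated copy, advances *p past the string, and stores the
 * length in *lenp if lenp is not NULL.  Returns NULL, leaving *p
 * untouched, if the input is truncated or allocation fails.
 */
char *extract_encoded_string_sketch(const uint8_t **p, const uint8_t *end,
                                    size_t *lenp)
{
    const uint8_t *sp = *p;
    uint32_t len;
    char *buf;

    assert(sp <= end);
    if ((size_t)(end - sp) < 4)
        return NULL;                 /* not enough room for the length */

    /* Assumed encoding: 32-bit little-endian length prefix. */
    len = (uint32_t)sp[0] | ((uint32_t)sp[1] << 8) |
          ((uint32_t)sp[2] << 16) | ((uint32_t)sp[3] << 24);
    sp += 4;

    if ((size_t)(end - sp) < len)
        return NULL;                 /* payload is truncated */

    buf = malloc((size_t)len + 1);
    if (!buf)
        return NULL;
    memcpy(buf, sp, len);
    buf[len] = '\0';

    *p = sp + len;
    if (lenp)
        *lenp = len;
    return buf;
}
```

Checking the remaining space before reading both the length and the payload keeps a malformed or truncated buffer from causing an out-of-bounds read.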
2012-07-30  rbd: drop a useless local variable  Alex Elder  1  -2/+1
In rbd_req_sync_notify_ack(), a local variable was needlessly being used to hold a null pointer. Just pass NULL instead. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30  libceph: fix off-by-one bug in ceph_encode_filepath()  Alex Elder  1  -1/+1
There is a BUG_ON() call that doesn't account for the single byte structure version at the start of an encoded filepath in ceph_encode_filepath(). Fix that. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30  ceph: clean up useless d_parent checks  Sage Weil  2  -15/+3
d_parent is never NULL, and IS_ROOT() is the proper way to check for a (non-self-referential) parent. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30  libceph: prevent the race of incoming work during teardown  Guanjun He  3  -0/+8
Add an atomic variable 'stopping' as a flag in struct ceph_messenger, set this flag to 1 in ceph_destroy_client(), and add a check in ceph_data_ready() that tests the flag and simply returns if it is set (1). Signed-off-by: Guanjun He <gjhe@suse.com> Reviewed-by: Sage Weil <sage@inktank.com>
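The flag-check pattern this commit describes can be sketched with C11 atomics. The struct and function names below are hypothetical userspace stand-ins, and a counter replaces the real work queue so the effect is observable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the messenger; not the kernel's struct. */
struct messenger_sketch {
    atomic_int stopping;   /* set to 1 once teardown has begun */
    int work_queued;       /* counts work items that were accepted */
};

void messenger_init(struct messenger_sketch *m)
{
    assert(m != NULL);
    atomic_init(&m->stopping, 0);
    m->work_queued = 0;
}

/* Teardown path: mark the messenger as stopping before freeing anything. */
void messenger_stop(struct messenger_sketch *m)
{
    atomic_store(&m->stopping, 1);
}

/* Socket callback: ignore incoming data once teardown has started. */
void data_ready(struct messenger_sketch *m)
{
    if (atomic_load(&m->stopping))
        return;
    m->work_queued++;      /* stands in for queueing connection work */
}
```

Once the flag is set, late socket callbacks return without queueing new work, so nothing new is scheduled against state that is about to be released.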
2012-07-30  libceph: fix messenger retry  Sage Weil  2  -16/+8
In ancient times, the messenger could both initiate and accept connections. An artifact of that was data structures to store/process an incoming ceph_msg_connect request and send an outgoing ceph_msg_connect_reply. Sadly, the negotiation code was referencing those structures and ignoring important information (like the peer's connect_seq) from the correct ones. Among other things, this fixes tight reconnect loops where the server sends RETRY_SESSION and we (the client) retry with the same connect_seq as last time. This bug is pretty easily triggered by injecting socket failures on the MDS and running some fs workload like workunits/direct_io/test_sync_io. Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30  libceph: initialize rb, list nodes in ceph_osd_request  Sage Weil  1  -0/+3
These don't strictly need to be initialized based on how they are used, but it is good practice to do so. Reported-by: Alex Elder <elder@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30  libceph: initialize msgpool message types  Sage Weil  3  -7/+10
Initialize the type field for messages in a msgpool. The caller was doing this for osd ops, but not for the reply messages. Reported-by: Alex Elder <elder@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-05  libceph: allow sock transition from CONNECTING to CLOSED  Sage Weil  1  -12/+13
It is possible to close a socket that is in the OPENING state. For example, it can happen if ceph_con_close() is called on the con before the TCP connection is established. con_work() will come around and shut down the socket. Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-05  libceph: initialize mon_client con only once  Sage Weil  1  -4/+3
Do not re-initialize the con on every connection attempt. When we ceph_con_close, there may still be work queued on the socket (e.g., to close it), and re-initializing will clobber the work_struct state. Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-05  libceph: set peer name on con_open, not init  Sage Weil  5  -16/+21
The peer name may change on each open attempt, even when the connection is reused. Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-05  libceph: drop declaration of ceph_con_get()  Alex Elder  1  -2/+0
For some reason the declaration of ceph_con_get() and ceph_con_put() did not get deleted in this commit: d59315ca libceph: drop ceph_con_get/put helpers and nref member Clean that up. Signed-off-by: Alex Elder <elder@inktank.com>
2012-07-05  libceph: add some fine ASCII art  Alex Elder  1  -1/+41
Sage liked the state diagram I put in my commit description so I'm putting it in with the code. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05  libceph: small changes to messenger.c  Alex Elder  1  -32/+31
This patch gathers a few small changes in "net/ceph/messenger.c":
out_msg_pos_next()
- small logic change that mostly affects indentation
write_partial_msg_pages()
- use a local variable trail_off to represent the offset into a message of the trail portion of the data (if present)
- once we are in the trail portion we will always be there, so we don't always need to check against our data position
- avoid computing len twice after we've reached the trail
- get rid of the variable tmpcrc, which is not needed
- trail_off and trail_len never change so mark them const
- update some comments
read_partial_message_bio()
- bio_iovec_idx() will never return an error, so don't bother checking for it
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05  libceph: distinguish two phases of connect sequence  Alex Elder  1  -24/+28
Currently a ceph connection enters a "CONNECTING" state when it begins the process of (re-)connecting with its peer. Once the two ends have successfully exchanged their banner and addresses, an additional NEGOTIATING bit is set in the ceph connection's state to indicate the connection information exchange has begun. The CONNECTING bit/state continues to be set during this phase. Rather than have the CONNECTING state continue while the NEGOTIATING bit is set, interpret these two phases as distinct states. In other words, when NEGOTIATING is set, clear CONNECTING. That way only one of them will be active at a time. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
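The "clear CONNECTING when NEGOTIATING is set" rule can be shown with a small bit-flag sketch. The macro names, bit positions, and helper functions are hypothetical and chosen only for illustration; they are not the kernel's definitions.

```c
#include <assert.h>

/* Hypothetical state bits; positions are arbitrary for this sketch. */
#define ST_CONNECTING  (1UL << 0)
#define ST_NEGOTIATING (1UL << 1)

/* Phase 1: banner and address exchange begins. */
void begin_connect(unsigned long *state)
{
    *state &= ~ST_NEGOTIATING;
    *state |= ST_CONNECTING;
}

/* Phase 2: banner exchange succeeded; switch to negotiation.
 * CONNECTING is cleared so the two phases are never active together. */
void begin_negotiation(unsigned long *state)
{
    assert(*state & ST_CONNECTING);
    *state &= ~ST_CONNECTING;
    *state |= ST_NEGOTIATING;
}
```

Treating the two phases as mutually exclusive means code that inspects the state can test a single bit to learn which phase the connection is in, rather than reasoning about combinations.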
2012-07-05  libceph: separate banner and connect writes  Alex Elder  1  -9/+11
There are two phases in the process of linking together the two ends of a ceph connection. The first involves exchanging a banner and IP addresses, and if that is successful a second phase exchanges some detail about each side's connection capabilities. When initiating a connection, the client side now queues to send its information for both phases of this process at the same time. This is probably a bit more efficient, but it is slightly messier from a layering perspective in the code. So rearrange things so that the client doesn't send the connection information until it has received and processed the response in the initial banner phase (in process_banner()). Move the code (in the (con->sock == NULL) case in try_write()) that prepares for writing the connection information, delaying doing that until the banner exchange has completed. Move the code that begins the transition to this second "NEGOTIATING" phase out of process_banner() and into its caller, so preparing to write the connection information and preparing to read the response are adjacent to each other. Finally, preparing to write the connection information now requires the output kvec to be reset in all cases, so move that into the prepare_write_connect() and delete it from all callers. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05  libceph: define and use an explicit CONNECTED state  Alex Elder  2  -2/+8
There is no state explicitly defined when a ceph connection is fully operational. So define one. It's set when the connection sequence completes successfully, and is cleared when the connection gets closed. Be a little more careful when examining the old state when a socket disconnect event is reported. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>