summaryrefslogtreecommitdiffstats
path: root/net/ceph
AgeCommit message (Collapse)AuthorFilesLines
2011-04-07Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6Linus Torvalds1-1/+1
* 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6: Fix common misspellings
2011-03-31Fix common misspellingsLucas De Marchi1-1/+1
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-29libceph: Create a new key type "ceph".Tommi Virtanen3-8/+77
This allows us to use existence of the key type as a feature test, from userspace. Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-29libceph: Get secret from the kernel keys api when mounting with key=NAME.Tommi Virtanen2-0/+59
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-29ceph: Move secret key parsing earlier.Tommi Virtanen6-15/+59
This makes the base64 logic be contained in mount option parsing, and prepares us for replacing the homebew key management with the kernel key retention service. Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-29libceph: fix null dereference when unregistering linger requestsSage Weil1-3/+3
We should only clear r_osd if we are neither registered as a linger or a regular request. We may unregister as a linger while still registered as a regular request (e.g., in reset_osd). Incorrectly clearing r_osd there leads to a null pointer dereference in __send_request. Also simplify the parallel check in __unregister_request() where we just removed r_osd_item and know it's empty. Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-29ceph: unlock on error in ceph_osdc_start_request()Dan Carpenter1-1/+3
There was a missing unlock on the error path if __map_request() failed. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-26ceph: fix possible NULL pointer dereferenceMariusz Kozlowski1-1/+1
This patch fixes 'event_work' dereference before it is checked for NULL. Signed-off-by: Mariusz Kozlowski <mk@lab.zgora.pl> Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-25ceph: flush msgr_wq during mds_client shutdownSage Weil1-2/+2
The release method for mds connections uses a backpointer to the mds_client, so we need to flush the workqueue of any pending work (and ceph_connection references) prior to freeing the mds_client. This fixes an oops easily triggered under UML by while true ; do mount ... ; umount ... ; done Also fix an outdated comment: the flush in ceph_destroy_client only flushes OSD connections out. This bug is basically an artifact of the ceph -> ceph+libceph conversion. Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-22libceph: add lingering request and watch/notify event frameworkYehuda Sadeh2-12/+374
Lingering requests are requests that are sent to the OSD normally but tracked also after we get a successful request. This keeps the OSD connection open and resends the original request if the object moves to another OSD. The OSD can then send notification messages back to us if another client initiates a notify. This framework will be used by RBD so that the client gets notification when a snapshot is created by another node or tool. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-21libceph: fix osd request queuing on osdmap updatesSage Weil1-133/+122
If we send a request to osd A, and the request's pg remaps to osd B and then back to A in quick succession, we need to resend the request to A. The old code was only calling kick_requests after processing all incremental maps in a message, so it was very possible to not resend a request that needed to be resent. This would make the osd eventually time out (at least with the current default of osd timeouts enabled). The correct approach is to scan requests on every map incremental. This patch refactors the kick code in a few ways: - all requests are either on req_lru (in flight), req_unsent (ready to send), or req_notarget (currently map to no up osd) - mapping always done by map_request (previous map_osds) - if the mapping changes, we requeue. requests are resent only after all map incrementals are processed. - some osd reset code is moved out of kick_requests into a separate function - the "kick this osd" functionality is moved to kick_osd_requests, as it is unrelated to scanning for request->pg->osd mapping changes Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-15libceph: Fix base64-decoding when input ends in newline.Tommi Virtanen1-1/+3
It used to return -EINVAL because it thought the end was not aligned to 4 bytes. Clean up superfluous src < end test in if, the while itself guarantees that. Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-04libceph: fix msgr standby handlingSage Weil1-8/+22
The standby logic used to be pretty dependent on the work requeueing behavior that changed when we switched to WQ_NON_REENTRANT. It was also very fragile. Restructure things so that: - We clear WRITE_PENDING when we set STANDBY. This ensures we will requeue work when we wake up later. - con_work backs off if STANDBY is set. There is nothing to do if we are in standby. - clear_standby() helper is called by both con_send() and con_keepalive(), the two actions that can wake us up again. Move the connect_seq++ logic here. Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-04libceph: fix msgr keepalive flagSage Weil1-5/+4
There was some broken keepalive code using a dead variable. Shift to using the proper bit flag. Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-04libceph: fix msgr backoffSage Weil1-2/+28
With commit f363e45f we replaced a bunch of hacky workqueue mutual exclusion logic with the WQ_NON_REENTRANT flag. One pieces of fallout is that the exponential backoff breaks in certain cases: * con_work attempts to connect. * we get an immediate failure, and the socket state change handler queues immediate work. * con_work calls con_fault, we decide to back off, but can't queue delayed work. In this case, we add a BACKOFF bit to make con_work reschedule delayed work next time it runs (which should be immediately). Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-03libceph: retry after authorization failureSage Weil1-2/+0
If we mark the connection CLOSED we will give up trying to reconnect to this server instance. That is appropriate for things like a protocol version mismatch that won't change until the server is restarted, at which point we'll get a new addr and reconnect. An authorization failure like this is probably due to the server not properly rotating it's secret keys, however, and should be treated as transient so that the normal backoff and retry behavior kicks in. Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-03libceph: fix handling of short returns from get_user_pagesSage Weil1-5/+13
get_user_pages() can return fewer pages than we ask for. We were returning a bogus pointer/error code in that case. Instead, loop until we get all the pages we want or get an error we can return to the caller. Signed-off-by: Sage Weil <sage@newdream.net>
2011-02-21Merge branch 'for-linus' of ↵Linus Torvalds1-31/+31
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: keep reference to parent inode on ceph_dentry ceph: queue cap_snaps once per realm libceph: fix socket write error handling libceph: fix socket read error handling
2011-01-25libceph: fix socket write error handlingSage Weil1-11/+12
Pass errors from writing to the socket up the stack. If we get -EAGAIN, return 0 from the helper to simplify the callers' checks. Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-25libceph: fix socket read error handlingSage Weil1-20/+19
If we get EAGAIN when trying to read from the socket, it is not an error. Return 0 from the helper in this case to simplify the error handling cases in the caller (indirectly, try_read). Fix try_read to pass any error to it's caller (con_work) instead of almost always returning 0. This let's us respond to things like socket disconnects. Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-13Merge branch 'for-linus' of ↵Linus Torvalds3-45/+8
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: rbd: fix cleanup when trying to mount inexistent image net/ceph: make ceph_msgr_wq non-reentrant ceph: fsc->*_wq's aren't used in memory reclaim path ceph: Always free allocated memory in osdmap_decode() ceph: Makefile: Remove unnessary code ceph: associate requests with opening sessions ceph: drop redundant r_mds field ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS ceph: add dir_layout to inode
2011-01-12net/ceph: make ceph_msgr_wq non-reentrantTejun Heo1-44/+2
ceph messenger code does a rather complex dancing around multithread workqueue to make sure the same work item isn't executed concurrently on different CPUs. This restriction can be provided by workqueue with WQ_NON_REENTRANT. Make ceph_msgr_wq non-reentrant workqueue with the default concurrency level and remove the QUEUED/BUSY logic. * This removes backoff handling in con_work() but it couldn't reliably block execution of con_work() to begin with - queue_con() can be called after the work started but before BUSY is set. It seems that it was an optimization for a rather cold path and can be safely removed. * The number of concurrent work items is bound by the number of connections and connetions are independent from each other. With the default concurrency level, different connections will be executed independently. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Sage Weil <sage@newdream.net> Cc: ceph-devel@vger.kernel.org Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12ceph: Always free allocated memory in osdmap_decode()Jesper Juhl1-1/+3
Always free memory allocated to 'pi' in net/ceph/osdmap.c::osdmap_decode(). Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12ceph: add dir_layout to inodeSage Weil1-0/+3
Add a ceph_dir_layout to the inode, and calculate dentry hash values based on the parent directory's specified dir_hash function. This is needed because the old default Linux dcache hash function is extremely week and leads to a poor distribution of files among dir fragments. Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-26Merge branch 'master' of ↵David S. Miller3-27/+35
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/ipv4/fib_frontend.c
2010-12-20Merge branch 'for-linus' of ↵Linus Torvalds2-11/+12
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: handle partial result from get_user_pages ceph: mark user pages dirty on direct-io reads ceph: fix null pointer dereference in ceph_init_dentry for nfs reexport ceph: fix direct-io on non-page-aligned buffers ceph: fix msgr_init error path
2010-12-17ceph: handle partial result from get_user_pagesHenry C Chang1-2/+2
The get_user_pages() helper can return fewer than the requested pages. Error out in that case, and clean up the partial result. Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17ceph: mark user pages dirty on direct-io readsHenry C Chang1-4/+7
For read operation, we have to set the argument _write_ of get_user_pages to 1 since we will write data to pages. Also, we need to SetPageDirty before releasing these pages. Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-13ceph: fix msgr_init error pathSage Weil1-5/+3
create_workqueue() returns NULL on failure. Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-08Merge branch 'master' of ↵David S. Miller2-23/+1
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/wireless/ath/ath9k/ar9003_eeprom.c net/llc/af_llc.c
2010-11-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds1-22/+0
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits) af_unix: limit recursion level pch_gbe driver: The wrong of initializer entry pch_gbe dreiver: chang author ucc_geth: fix ucc halt problem in half duplex mode inet: Fix __inet_inherit_port() to correctly increment bsockets and num_owners ehea: Add some info messages and fix an issue hso: fix disable_net NET: wan/x25_asy, move lapb_unregister to x25_asy_close_tty cxgb4vf: fix setting unicast/multicast addresses ... net, ppp: Report correct error code if unit allocation failed DECnet: don't leak uninitialized stack byte au1000_eth: fix invalid address accessing the MAC enable register dccp: fix error in updating the GAR tcp: restrict net.ipv4.tcp_adv_win_scale (#20312) netns: Don't leak others' openreq-s in proc Net: ceph: Makefile: Remove unnessary code vhost/net: fix rcu check usage econet: fix CVE-2010-3848 econet: fix CVE-2010-3850 econet: disallow NULL remote addr for sendmsg(), fixes CVE-2010-3849 ...
2010-11-27Net: ceph: Makefile: Remove unnessary codeTracey Dent1-22/+0
Remove the if and else conditional because the code is in mainline and there is no need in it being there. Signed-off-by: Tracey Dent <tdent48227@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds1-1/+1
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: of/phylib: Use device tree properties to initialize Marvell PHYs. phylib: Add support for Marvell 88E1149R devices. phylib: Use common page register definition for Marvell PHYs. qlge: Fix incorrect usage of module parameters and netdev msg level ipv6: fix missing in6_ifa_put in addrconf SuperH IrDA: correct Baud rate error correction atl1c: Fix hardware type check for enabling OTP CLK net: allow GFP_HIGHMEM in __vmalloc() bonding: change list contact to netdev@vger.kernel.org e1000: fix screaming IRQ
2010-11-22Net: ceph: Makefile: remove deprecated kbuild goal definitionsTracey Dent1-1/+1
Changed Makefile to use <modules>-y instead of <modules>-objs because -objs is deprecated and not mentioned in Documentation/kbuild/makefiles.txt. Signed-off-by: Tracey Dent <tdent48227@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-21net: allow GFP_HIGHMEM in __vmalloc()Eric Dumazet1-1/+1
We forgot to use __GFP_HIGHMEM in several __vmalloc() calls. In ceph, add the missing flag. In fib_trie.c, xfrm_hash.c and request_sock.c, using vzalloc() is cleaner and allows using HIGHMEM pages as well. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-09ceph: explicitly specify page alignment in network messagesSage Weil2-5/+8
The alignment used for reading data into or out of pages used to be taken from the data_off field in the message header. This only worked as long as the page alignment matched the object offset, breaking direct io to non-page aligned offsets. Instead, explicitly specify the page alignment next to the page vector in the ceph_msg struct, and use that instead of the message header (which probably shouldn't be trusted). The alloc_msg callback is responsible for filling in this field properly when it sets up the page vector. Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-09ceph: make page alignment explicit in osd interfaceSage Weil1-8/+14
We used to infer alignment of IOs within a page based on the file offset, which assumed they matched. This broke with direct IO that was not aligned to pages (e.g., 512-byte aligned IO). We were also trusting the alignment specified in the OSD reply, which could have been adjusted by the server. Explicitly specify the page alignment when setting up OSD IO requests. Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-09ceph: fix comment, remove extraneous argsSage Weil1-2/+1
The offset/length arguments aren't used. Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-01ceph: fix small seq message skippingSage Weil1-2/+1
If the client gets out of sync with the server message sequence number, we normally skip low seq messages (ones we already received). The skip code was also incrementing the expected seq, such that all subsequent messages also appeared old and got skipped, and an eventual timeout on the osd connection. This resulted in some lagging requests and console messages like [233480.882885] ceph: skipping osd22 10.138.138.13:6804 seq 2016, expected 2017 [233480.882919] ceph: skipping osd22 10.138.138.13:6804 seq 2017, expected 2018 [233480.882963] ceph: skipping osd22 10.138.138.13:6804 seq 2018, expected 2019 [233480.883488] ceph: skipping osd22 10.138.138.13:6804 seq 2019, expected 2020 [233485.219558] ceph: skipping osd22 10.138.138.13:6804 seq 2020, expected 2021 [233485.906595] ceph: skipping osd22 10.138.138.13:6804 seq 2021, expected 2022 [233490.379536] ceph: skipping osd22 10.138.138.13:6804 seq 2022, expected 2023 [233495.523260] ceph: skipping osd22 10.138.138.13:6804 seq 2023, expected 2024 [233495.923194] ceph: skipping osd22 10.138.138.13:6804 seq 2024, expected 2025 [233500.534614] ceph: tid 6023602 timed out on osd22, will reset osd Reported-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20ceph: fix num_pages_free accounting in pagelistSage Weil1-0/+1
Decrement the free page counter when removing a page from the free_list. Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20ceph: don't crash when passed bad mount optionsYehuda Sadeh1-1/+1
This only happened when parse_extra_token was not passed to ceph_parse_option() (hence, only happened in rbd). Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2010-10-20ceph: add pagelist_reserve, pagelist_truncate, pagelist_set_cursorGreg Farnum1-9/+97
These facilitate preallocation of pages so that we can encode into the pagelist in an atomic context. Signed-off-by: Greg Farnum <gregf@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20rbd: introduce rados block device (rbd), based on libcephYehuda Sadeh1-2/+1
The rados block device (rbd), based on osdblk, creates a block device that is backed by objects stored in the Ceph distributed object storage cluster. Each device consists of a single metadata object and data striped over many data objects. The rbd driver supports read-only snapshots. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20ceph: factor out libceph from Ceph file systemYehuda Sadeh27-0/+10660
This factors out protocol and low-level storage parts of ceph into a separate libceph module living in net/ceph and include/linux/ceph. This is mostly a matter of moving files around. However, a few key pieces of the interface change as well: - ceph_client becomes ceph_fs_client and ceph_client, where the latter captures the mon and osd clients, and the fs_client gets the mds client and file system specific pieces. - Mount option parsing and debugfs setup is correspondingly broken into two pieces. - The mon client gets a generic handler callback for otherwise unknown messages (mds map, in this case). - The basic supported/required feature bits can be expanded (and are by ceph_fs_client). No functional change, aside from some subtle error handling cases that got cleaned up in the refactoring process. Signed-off-by: Sage Weil <sage@newdream.net>