summaryrefslogtreecommitdiffstats
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
2016-05-26libceph: support for subscribing to "mdsmap.<id>" mapsIlya Dryomov2-0/+3
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: replace ceph_monc_request_next_osdmap()Ilya Dryomov2-1/+1
... with a wrapper around maybe_request_map() - no need for two osdmap-specific functions. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: pool deletion detectionIlya Dryomov1-0/+6
This adds the "map check" infrastructure for sending osdmap version checks on CALC_TARGET_POOL_DNE and completing in-flight requests with -ENOENT if the target pool doesn't exist or has just been deleted. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: async MON client generic requestsIlya Dryomov1-3/+16
For map check, we are going to need to send CEPH_MSG_MON_GET_VERSION messages asynchronously and get a callback on completion. Refactor MON client to allow firing off generic requests asynchronously and add an async variant of ceph_monc_get_version(). ceph_monc_do_statfs() is switched over and remains sync. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: support for checking on status of watchIlya Dryomov1-0/+4
Implement ceph_osdc_watch_check() to be able to check on status of watch. Note that the time it takes for a watch/notify event to get delivered through the notify_wq is taken into account. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: support for sending notifiesIlya Dryomov2-0/+23
Implement ceph_osdc_notify() for sending notifies. Due to the fact that the current messenger can't do read-in into pagelists (it can only do write-out from them), I had to go with a page vector for a NOTIFY_COMPLETE payload, for now. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph, rbd: ceph_osd_linger_request, watch/notify v2Ilya Dryomov3-44/+75
This adds support and switches rbd to a new, more reliable version of watch/notify protocol. As with the OSD client update, this is mostly about getting the right structures linked into the right places so that reconnects are properly sent when needed. watch/notify v2 also requires sending regular pings to the OSDs - send_linger_ping(). A major change from the old watch/notify implementation is the introduction of ceph_osd_linger_request - linger requests no longer piggy back on ceph_osd_request. ceph_osd_event has been merged into ceph_osd_linger_request. All the details are now hidden within libceph, the interface consists of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack(). ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep the lifetime management simple. ceph_osdc_notify_ack() accepts an optional data payload, which is relayed back to the notifier. Portions of this patch are loosely based on work by Douglas Fuller <dfuller@redhat.com> and Mike Christie <michaelc@cs.wisc.edu>. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: a major OSD client updateIlya Dryomov1-11/+7
This is a major sync up, up to ~Jewel. The highlights are: - per-session request trees (vs a global per-client tree) - per-session locking (vs a global per-client rwlock) - homeless OSD session - no ad-hoc global per-client lists - support for pool quotas - foundation for watch/notify v2 support - foundation for map check (pool deletion detection) support The switchover is incomplete: lingering requests can be setup and teared down but aren't ever reestablished. This functionality is restored with the introduction of the new lingering infrastructure (ceph_osd_linger_request, linger_work, etc) in a later commit. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: protect osdc->osd_lru list with a spinlockIlya Dryomov1-0/+1
OSD client is getting moved from the big per-client lock to a set of per-session locks. The big rwlock would only be held for read most of the time, so a global osdc->osd_lru needs additional protection. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: handle_one_map()Ilya Dryomov2-0/+3
Separate osdmap handling from decoding and iterating over a bag of maps in a fresh MOSDMap message. This sets up the scene for the updated OSD client. Of particular importance here is the addition of pi->was_full, which can be used to answer "did this pool go full -> not-full in this map?". This is the key bit for supporting pool quotas. We won't be able to downgrade map_sem for much longer, so drop downgrade_write(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: allocate dummy osdmap in ceph_osdc_init()Ilya Dryomov1-0/+1
This leads to a simpler osdmap handling code, particularly when dealing with pi->was_full, which is introduced in a later commit. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: redo callbacks and factor out MOSDOpReply decodingIlya Dryomov1-2/+3
If you specify ACK | ONDISK and set ->r_unsafe_callback, both ->r_callback and ->r_unsafe_callback(true) are called on ack. This is very confusing. Redo this so that only one of them is called: ->r_unsafe_callback(true), on ack ->r_unsafe_callback(false), on commit or ->r_callback, on ack|commit Decode everything in decode_MOSDOpReply() to reduce clutter. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: drop msg argument from ceph_osdc_callback_tIlya Dryomov1-2/+1
finish_read(), its only user, uses it to get to hdr.data_len, which is what ->r_result is set to on success. This gains us the ability to safely call callbacks from contexts other than reply, e.g. map check. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: switch to calc_target(), part 2Ilya Dryomov2-18/+18
The crux of this is getting rid of ceph_osdc_build_request(), so that MOSDOp can be encoded not before but after calc_target() calculates the actual target. Encoding now happens within ceph_osdc_start_request(). Also nuked is the accompanying bunch of pointers into the encoded buffer that was used to update fields on each send - instead, the entire front is re-encoded. If we want to support target->name_len != base->name_len in the future, there is no other way, because oid is surrounded by other fields in the encoded buffer. Encoding OSD ops and adding data items to the request message were mixed together in osd_req_encode_op(). While we want to re-encode OSD ops, we don't want to add duplicate data items to the message when resending, so all call to ceph_osdc_msg_data_add() are factored out into a new setup_request_data(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: switch to calc_target(), part 1Ilya Dryomov1-10/+5
Replace __calc_request_pg() and most of __map_request() with calc_target() and start using req->r_t. ceph_osdc_build_request() however still encodes base_oid, because it's called before calc_target() is and target_oid is empty at that point in time; a printf in osdc_show() also shows base_oid. This is fixed in "libceph: switch to calc_target(), part 2". Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: introduce ceph_osd_request_target, calc_target()Ilya Dryomov3-0/+62
Introduce ceph_osd_request_target, containing all mapping-related fields of ceph_osd_request and calc_target() for calculating mappings and populating it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: pi->min_size, pi->last_force_request_resendIlya Dryomov1-3/+6
Add and decode pi->min_size and pi->last_force_request_resend. These are going to be used by calc_target(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: make pgid_cmp() globalIlya Dryomov1-0/+2
calc_target() code is going to need to know how to compare PGs. Take lhs and rhs pgid by const * while at it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: rename ceph_calc_pg_primary()Ilya Dryomov1-2/+2
Rename ceph_calc_pg_primary() to ceph_pg_to_acting_primary() to emphasise that it returns acting primary. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: ceph_osds, ceph_pg_to_up_acting_osds()Ilya Dryomov1-3/+18
Knowning just acting set isn't enough, we need to be able to record up set as well to detect interval changes. This means returning (up[], up_len, up_primary, acting[], acting_len, acting_primary) and passing it around. Introduce and switch to ceph_osds to help with that. Rename ceph_calc_pg_acting() to ceph_pg_to_up_acting_osds() and return both up and acting sets from it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: rename ceph_oloc_oid_to_pg()Ilya Dryomov1-5/+4
Rename ceph_oloc_oid_to_pg() to ceph_object_locator_to_pg(). Emphasise that returned is raw PG and return -ENOENT instead of -EIO if the pool doesn't exist. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: fix ceph_eversion encodingIlya Dryomov1-1/+1
eversion_t is version+epoch in userspace and is encoded in that order. ceph_eversion is defined as epoch+version in rados.h, yet we memcpy it in __send_request(). Reoder ceph_eversion fields. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: DEFINE_RB_FUNCS macroIlya Dryomov1-0/+57
Given struct foo { u64 id; struct rb_node bar_node; }; generate insert_bar(), erase_bar() and lookup_bar() functions with DEFINE_RB_FUNCS(bar, struct foo, id, bar_node) The key is assumed to be an integer (u64, int, etc), compared with < and >. nodefld has to be initialized with RB_CLEAR_NODE(). Start using it for MDS, MON and OSD requests and OSD sessions. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: nuke unused fields and functionsIlya Dryomov3-13/+2
Either unused or useless: osdmap->mkfs_epoch osd->o_marked_for_keepalive monc->num_generic_requests osdc->map_waiters osdc->last_requested_map osdc->timeout_tid osd_req_op_cls_response_data() osdmap_apply_incremental() @msgr arg Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: variable-sized ceph_object_idIlya Dryomov1-25/+37
Currently ceph_object_id can hold object names of up to 100 (CEPH_MAX_OID_NAME_LEN) characters. This is enough for all use cases, expect one - long rbd image names: - a format 1 header is named "<imgname>.rbd" - an object that points to a format 2 header is named "rbd_id.<imgname>" We operate on these potentially long-named objects during rbd map, and, for format 1 images, during header refresh. (A format 2 header name is a small system-generated string.) Lift this 100 character limit by making ceph_object_id be able to point to an externally-allocated string. Apart from being able to work with almost arbitrarily-long named objects, this allows us to reduce the size of ceph_object_id from >100 bytes to 64 bytes. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26libceph: move message allocation out of ceph_osdc_alloc_request()Ilya Dryomov1-0/+1
The size of ->r_request and ->r_reply messages depends on the size of the object name (ceph_object_id), while the size of ceph_osd_request is fixed. Move message allocation into a separate function that would have to be called after ceph_object_id and ceph_object_locator (which is also going to become variable in size with RADOS namespaces) have been filled in: req = ceph_osdc_alloc_request(...); <fill in req->r_base_oid> <fill in req->r_base_oloc> ceph_osdc_alloc_messages(req); Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-14Merge branch 'for-linus' of ↵Linus Torvalds3-0/+15
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "Overlayfs fixes from Miklos, assorted fixes from me. Stable fodder of varying severity, all sat in -next for a while" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ovl: ignore permissions on underlying lookup vfs: add lookup_hash() helper vfs: rename: check backing inode being equal vfs: add vfs_select_inode() helper get_rock_ridge_filename(): handle malformed NM entries ecryptfs: fix handling of directory opening atomic_open(): fix the handling of create_error fix the copy vs. map logics in blk_rq_map_user_iov() do_splice_to(): cap the size before passing to ->splice_read()
2016-05-13Merge branch 'for-4.6-fixes' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "During v4.6-rc1 cgroup namespace support was merged. There is an issue where it's impossible to tell whether a given cgroup mount point is bind mounted or namespaced. Serge has been working on the issue but it took longer than expected to resolve, so the late pull request. Given that it's a completely new feature and the patches don't touch anything else, the risk seems acceptable. However, if this is too late, an alternative is plugging new cgroup ns creation for v4.6 and retrying for v4.7" * 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: fix compile warning kernfs: kernfs_sop_show_path: don't return 0 after seq_dentry call cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces kernfs_path_from_node_locked: don't overwrite nlen
2016-05-13Merge tag 'regulator-fix-v4.6-rc7' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator Pull regulator fixes from Mark Brown: "A small collection of driver specific fixes for the regulator subsysetem: - Fix handling of probe deferral for GPIO regulators - Fix a typo in the module alias for DA9053 - Fix the definition of BUCK9 in the S2MPS11 driver. This change looks larger than it is because an irregularity in the hardware means that the macro used to define bucks 6-10 needs duplicating and tweaking to have a separate macro for 9 - Fix a series of errors in the definitions of the LDOs the AXP20x regulators, some of which had always been present and some of which were introduced in the merge window" * tag 'regulator-fix-v4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: regulator: da9063: Correct module alias prefix to fix module autoloading regulator: axp20x: Fix axp22x ldo_io registration error on cold boot regulator: axp20x: Fix axp22x ldo_io voltage ranges regulator: axp20x: Fix LDO4 linear voltage range regulator: s2mps11: Fix invalid selector mask and voltages for buck9 regulator: gpio: check return value of of_get_named_gpio
2016-05-13Merge remote-tracking branches 'regulator/fix/axp20x', ↵Mark Brown1-0/+2
'regulator/fix/da9063', 'regulator/fix/gpio' and 'regulator/fix/s2mps11' into regulator-linus
2016-05-12mm: thp: calculate the mapcount correctly for THP pages during WP faultsAndrea Arcangeli2-3/+12
This will provide fully accuracy to the mapcount calculation in the write protect faults, so page pinning will not get broken by false positive copy-on-writes. total_mapcount() isn't the right calculation needed in reuse_swap_page(), so this introduces a page_trans_huge_mapcount() that is effectively the full accurate return value for page_mapcount() if dealing with Transparent Hugepages, however we only use the page_trans_huge_mapcount() during COW faults where it strictly needed, due to its higher runtime cost. This also provide at practical zero cost the total_mapcount information which is needed to know if we can still relocate the page anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we can reuse the page no matter if it's a pte or a pmd_trans_huge triggering the fault, but we can only relocate the page anon_vma to the local vma->anon_vma if we're sure it's only this "vma" mapping the whole THP physical range. Kirill A. Shutemov discovered the problem with moving the page anon_vma to the local vma->anon_vma in a previous version of this patch and another problem in the way page_move_anon_rmap() was called. Andrew Morton discovered that CONFIG_SWAP=n wouldn't build in a previous version, because reuse_swap_page must be a macro to call page_trans_huge_mapcount from swap.h, so this uses a macro again instead of an inline function. With this change at least it's a less dangerous usage than it was before, because "page" is used only once now, while with the previous code reuse_swap_page(page++) would have called page_mapcount on page+1 and it would have increased page twice instead of just once. Dean Luick noticed an uninitialized variable that could result in a rmap inefficiency for the non-THP case in a previous version. Mike Marciniszyn said: : Our RDMA tests are seeing an issue with memory locking that bisects to : commit 61f5d698cc97 ("mm: re-enable THP") : : The test program registers two rather large MRs (512M) and RDMA : writes data to a passive peer using the first and RDMA reads it back : into the second MR and compares that data. The sizes are chosen randomly : between 0 and 1024 bytes. : : The test will get through a few (<= 4 iterations) and then gets a : compare error. : : Tracing indicates the kernel logical addresses associated with the individual : pages at registration ARE correct , the data in the "RDMA read response only" : packets ARE correct. : : The "corruption" occurs when the packet crosse two pages that are not physically : contiguous. The second page reads back as zero in the program. : : It looks like the user VA at the point of the compare error no longer points to : the same physical address as was registered. : : This patch totally resolves the issue! Link: http://lkml.kernel.org/r/1462547040-1737-2-git-send-email-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: "Kirill A. Shutemov" <kirill@shutemov.name> Reviewed-by: Dean Luick <dean.luick@intel.com> Tested-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Tested-by: Josh Collier <josh.d.collier@intel.com> Cc: Marc Haber <mh+linux-kernel@zugschlus.de> Cc: <stable@vger.kernel.org> [4.5] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-11Merge branch 'ovl-fixes' into for-linusAl Viro88-330/+634
2016-05-10vfs: add lookup_hash() helperMiklos Szeredi1-0/+2
Overlayfs needs lookup without inode_permission() and already has the name hash (in form of dentry->d_name on overlayfs dentry). It also doesn't support filesystems with d_op->d_hash() so basically it only needs the actual hashed lookup from lookup_one_len_unlocked() So add a new helper that does unlocked lookup of a hashed name. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-05-10vfs: add vfs_select_inode() helperMiklos Szeredi1-0/+12
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> # v4.2+
2016-05-10export tc ife uapi headerJamal Hadi Salim1-0/+1
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-09uapi glibc compat: fix compile errors when glibc net/if.h included before ↵Mikko Rapeli2-0/+72
linux/if.h glibc's net/if.h contains copies of definitions from linux/if.h and these conflict and cause build failures if both files are included by application source code. Changes in uapi headers, which fixed header file dependencies to include linux/if.h when it was needed, e.g. commit 1ffad83d, made the net/if.h and linux/if.h incompatibilities visible as build failures for userspace applications like iproute2 and xtables-addons. This patch fixes compile errors when glibc net/if.h is included before linux/if.h: ./linux/if.h:99:21: error: redeclaration of enumerator ‘IFF_NOARP’ ./linux/if.h:98:23: error: redeclaration of enumerator ‘IFF_RUNNING’ ./linux/if.h:97:26: error: redeclaration of enumerator ‘IFF_NOTRAILERS’ ./linux/if.h:96:27: error: redeclaration of enumerator ‘IFF_POINTOPOINT’ ./linux/if.h:95:24: error: redeclaration of enumerator ‘IFF_LOOPBACK’ ./linux/if.h:94:21: error: redeclaration of enumerator ‘IFF_DEBUG’ ./linux/if.h:93:25: error: redeclaration of enumerator ‘IFF_BROADCAST’ ./linux/if.h:92:19: error: redeclaration of enumerator ‘IFF_UP’ ./linux/if.h:252:8: error: redefinition of ‘struct ifconf’ ./linux/if.h:203:8: error: redefinition of ‘struct ifreq’ ./linux/if.h:169:8: error: redefinition of ‘struct ifmap’ ./linux/if.h:107:23: error: redeclaration of enumerator ‘IFF_DYNAMIC’ ./linux/if.h:106:25: error: redeclaration of enumerator ‘IFF_AUTOMEDIA’ ./linux/if.h:105:23: error: redeclaration of enumerator ‘IFF_PORTSEL’ ./linux/if.h:104:25: error: redeclaration of enumerator ‘IFF_MULTICAST’ ./linux/if.h:103:21: error: redeclaration of enumerator ‘IFF_SLAVE’ ./linux/if.h:102:22: error: redeclaration of enumerator ‘IFF_MASTER’ ./linux/if.h:101:24: error: redeclaration of enumerator ‘IFF_ALLMULTI’ ./linux/if.h:100:23: error: redeclaration of enumerator ‘IFF_PROMISC’ The cases where linux/if.h is included before net/if.h need a similar fix in the glibc side, or the order of include files can be changed userspace code as a workaround. This change was tested in x86 userspace on Debian unstable with scripts/headers_compile_test.sh: $ make headers_install && \ cd usr/include && ../../scripts/headers_compile_test.sh -l -k ... cc -Wall -c -nostdinc -I /usr/lib/gcc/i586-linux-gnu/5/include -I /usr/lib/gcc/i586-linux-gnu/5/include-fixed -I . -I /home/mcfrisk/src/linux-2.6/usr/headers_compile_test_include.2uX2zH -I /home/mcfrisk/src/linux-2.6/usr/headers_compile_test_include.2uX2zH/i586-linux-gnu -o /dev/null ./linux/if.h_libc_before_kernel.h PASSED libc before kernel test: ./linux/if.h Reported-by: Jan Engelhardt <jengelh@inai.de> Reported-by: Josh Boyer <jwboyer@fedoraproject.org> Reported-by: Stephen Hemminger <shemming@brocade.com> Reported-by: Waldemar Brodkorb <mail@waldemar-brodkorb.de> Cc: Gabriel Laskar <gabriel@lse.epita.fr> Signed-off-by: Mikko Rapeli <mikko.rapeli@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds4-10/+7
Pull networking fixes from David Miller: 1) Check klogctl failure correctly, from Colin Ian King. 2) Prevent OOM when under memory pressure in flowcache, from Steffen Klassert. 3) Fix info leak in llc and rtnetlink ifmap code, from Kangjie Lu. 4) Memory barrier and multicast handling fixes in bnxt_en, from Michael Chan. 5) Endianness bug in mlx5, from Daniel Jurgens. 6) Fix disconnect handling in VSOCK, from Ian Campbell. 7) Fix locking of netdev list walking in get_bridge_ifindices(), from Nikolay Aleksandrov. 8) Bridge multicast MLD parser can look at wrong packet offsets, fix from Linus Lüssing. 9) Fix chip hang in qede driver, from Sudarsana Reddy Kalluru. 10) Fix missing setting of encapsulation before inner handling completes in udp_offload code, from Jarno Rajahalme. 11) Missing rollbacks during LAG join and flood configuration failures in mlxsw driver, from Ido Schimmel. 12) Fix error code checks in netxen driver, from Dan Carpenter. 13) Fix key size in new macsec driver, from Sabrina Dubroca. 14) Fix mlx5/VXLAN dependencies, from Arnd Bergmann. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (29 commits) net/mlx5e: make VXLAN support conditional Revert "net/mlx5: Kconfig: Fix MLX5_EN/VXLAN build issue" macsec: key identifier is 128 bits, not 64 Documentation/networking: more accurate LCO explanation macvtap: segmented packet is consumed tools: bpf_jit_disasm: check for klogctl failure qede: uninitialized variable in qede_start_xmit() netxen: netxen_rom_fast_read() doesn't return -1 netxen: reversed condition in netxen_nic_set_link_parameters() netxen: fix error handling in netxen_get_flash_block() mlxsw: spectrum: Add missing rollback in flood configuration mlxsw: spectrum: Fix rollback order in LAG join failure udp_offload: Set encapsulation before inner completes. udp_tunnel: Remove redundant udp_tunnel_gro_complete(). qede: prevent chip hang when increasing channels net: ipv6: tcp reset, icmp need to consider L3 domain bridge: fix igmp / mld query parsing net: bridge: fix old ioctl unlocked net device walk VSOCK: do not disconnect socket when peer has shutdown SEND only net/mlx4_en: Fix endianness bug in IPV6 csum calculation ...
2016-05-09compiler-gcc: require gcc 4.8 for powerpc __builtin_bswap16()Josh Poimboeuf1-1/+1
gcc support for __builtin_bswap16() was supposedly added for powerpc in gcc 4.6, and was then later added for other architectures in gcc 4.8. However, Stephen Rothwell reported that attempting to use it on powerpc in gcc 4.6 fails with: lib/vsprintf.c:160:2: error: initializer element is not constant lib/vsprintf.c:160:2: error: (near initialization for 'decpair[0]') lib/vsprintf.c:160:2: error: initializer element is not constant lib/vsprintf.c:160:2: error: (near initialization for 'decpair[1]') ... I'm not entirely sure what those errors mean, but I don't see them on gcc 4.8. So let's consider gcc 4.8 to be the official starting point for __builtin_bswap16(). Arnd Bergmann adds: "I found the commit in gcc-4.8 that replaced the powerpc-specific implementation of __builtin_bswap16 with an architecture-independent one. Apparently the powerpc version (gcc-4.6 and 4.7) just mapped to the lhbrx/sthbrx instructions, so it ended up not being a constant, though the intent of the patch was mainly to add support for the builtin to x86: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52624 has the patch that went into gcc-4.8 and more information." Fixes: 7322dd755e7d ("byteswap: try to avoid __builtin_constant_p gcc bug") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Tested-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-09cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespacesSerge E. Hallyn1-0/+2
Patch summary: When showing a cgroupfs entry in mountinfo, show the path of the mount root dentry relative to the reader's cgroup namespace root. Short explanation (courtesy of mkerrisk): If we create a new cgroup namespace, then we want both /proc/self/cgroup and /proc/self/mountinfo to show cgroup paths that are correctly virtualized with respect to the cgroup mount point. Previous to this patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo does not. Long version: When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup namespace, and then mounts a new instance of the freezer cgroup, the new mount will be rooted at /a/b. The root dentry field of the mountinfo entry will show '/a/b'. cat > /tmp/do1 << EOF mount -t cgroup -o freezer freezer /mnt grep freezer /proc/self/mountinfo EOF unshare -Gm bash /tmp/do1 > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer The task's freezer cgroup entry in /proc/self/cgroup will simply show '/': grep freezer /proc/self/cgroup 9:freezer:/ If instead the same task simply bind mounts the /a/b cgroup directory, the resulting mountinfo entry will again show /a/b for the dentry root. However in this case the task will find its own cgroup at /mnt/a/b, not at /mnt: mount --bind /sys/fs/cgroup/freezer/a/b /mnt 130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer In other words, there is no way for the task to know, based on what is in mountinfo, which cgroup directory is its own. Example (by mkerrisk): First, a little script to save some typing and verbiage: echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)" cat /proc/self/mountinfo | grep freezer | awk '{print "\tmountinfo:\t\t" $4 "\t" $5}' Create cgroup, place this shell into the cgroup, and look at the state of the /proc files: 2653 2653 # Our shell 14254 # cat(1) /proc/self/cgroup: 10:freezer:/a/b mountinfo: / /sys/fs/cgroup/freezer Create a shell in new cgroup and mount namespaces. The act of creating a new cgroup namespace causes the process's current cgroups directories to become its cgroup root directories. (Here, I'm using my own version of the "unshare" utility, which takes the same options as the util-linux version): Look at the state of the /proc files: /proc/self/cgroup: 10:freezer:/ mountinfo: / /sys/fs/cgroup/freezer The third entry in /proc/self/cgroup (the pathname of the cgroup inside the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which is rooted at /a/b in the outer namespace. However, the info in /proc/self/mountinfo is not for this cgroup namespace, since we are seeing a duplicate of the mount from the old mount namespace, and the info there does not correspond to the new cgroup namespace. However, trying to create a new mount still doesn't show us the right information in mountinfo: # propagating to other mountns /proc/self/cgroup: 7:freezer:/ mountinfo: /a/b /mnt/freezer The act of creating a new cgroup namespace caused the process's current freezer directory, "/a/b", to become its cgroup freezer root directory. In other words, the pathname directory of the directory within the newly mounted cgroup filesystem should be "/", but mountinfo wrongly shows us "/a/b". The consequence of this is that the process in the cgroup namespace cannot correctly construct the pathname of its cgroup root directory from the information in /proc/PID/mountinfo. With this patch, the dentry root field in mountinfo is shown relative to the reader's cgroup namespace. So the same steps as above: /proc/self/cgroup: 10:freezer:/a/b mountinfo: / /sys/fs/cgroup/freezer /proc/self/cgroup: 10:freezer:/ mountinfo: /../.. /sys/fs/cgroup/freezer /proc/self/cgroup: 10:freezer:/ mountinfo: / /mnt/freezer cgroup.clone_children freezer.parent_freezing freezer.state tasks cgroup.procs freezer.self_freezing notify_on_release 3164 2653 # First shell that placed in this cgroup 3164 # Shell started by 'unshare' 14197 # cat(1) Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com> Tested-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2016-05-09macsec: key identifier is 128 bits, not 64Sabrina Dubroca1-1/+3
The MACsec standard mentions a key identifier for each key, but doesn't specify anything about it, so I arbitrarily chose 64 bits. IEEE 802.1X-2010 specifies MKA (MACsec Key Agreement), and defines the key identifier to be 128 bits (96 bits "member identifier" + 32 bits "key number"). Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06udp_offload: Set encapsulation before inner completes.Jarno Rajahalme1-0/+3
UDP tunnel segmentation code relies on the inner offsets being set for an UDP tunnel GSO packet, but the inner *_complete() functions will set the inner offsets only if 'encapsulation' is set before calling them. Currently, udp_gro_complete() sets 'encapsulation' only after the inner *_complete() functions are done. This causes the inner offsets having invalid values after udp_gro_complete() returns, which in turn will make it impossible to properly segment the packet in case it needs to be forwarded, which would be visible to the user either as invalid packets being sent or as packet loss. This patch fixes this by setting skb's 'encapsulation' in udp_gro_complete() before calling into the inner complete functions, and by making each possible UDP tunnel gro_complete() callback set the inner_mac_header to the beginning of the tunnel payload. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Reviewed-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06udp_tunnel: Remove redundant udp_tunnel_gro_complete().Jarno Rajahalme1-9/+0
The setting of the UDP tunnel GSO type is already performed by udp[46]_gro_complete(). Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06Merge tag 'pm+acpi-4.6-rc7' of ↵Linus Torvalds2-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management and ACPI fixes from Rafael Wysocki: "Fixes for problems introduced or discovered recently (intel_pstate, sti-cpufreq, ARM64 cpuidle, Operating Performance Points framework, generic device properties framework) and one fix for a hotplug-related deadlock in ACPICA that's been there forever, but is nasty enough. Specifics: - Fix for a recent regression in the intel_pstate driver causing it to fail to restore the HWP (HW-managed P-states) configuration of the boot CPU after suspend-to-RAM (Rafael Wysocki). - Fix for two recent regressions in the intel_pstate driver, one that can trigger a divide by zero if the driver is accessed via sysfs before it manages to take the first sample and one causing it to fail to update a structure field used in a trace point, so the information coming from it is less useful (Rafael Wysocki). - Fix for a problem in the sti-cpufreq driver introduced during the 4.5 cycle that causes it to break CPU PM in multi-platform kernels by registering cpufreq-dt (which subsequently doesn't work) unconditionally and preventing the driver that would actually work from registering (Sudeep Holla). - Stable-candidate fix for an ARM64 cpuidle issue causing idle state usage counters to be incorrectly updated for idle states that were not entered due to errors (James Morse). - Fix for a recently introduced issue in the OPP (Operating Performance Points) framework causing it to print bogus error messages for missing optional regulators (Viresh Kumar). - Fix for a recently introduced issue in the generic device properties framework that may cause it to attempt to dereferece and invalid pointer in some cases (Heikki Krogerus). - Fix for a deadlock in the ACPICA core that may be triggered by device (eg Thunderbolt) hotplug (Prarit Bhargava)" * tag 'pm+acpi-4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM / OPP: Remove useless check ACPICA: Dispatcher: Update thread ID for recursive method calls intel_pstate: Fix intel_pstate_get() cpufreq: intel_pstate: Fix HWP on boot CPU after system resume cpufreq: st: enable selective initialization based on the platform ARM: cpuidle: Pass on arm_cpuidle_suspend()'s return value device property: Avoid potential dereferences of invalid pointers
2016-05-06Merge branches 'acpica-fixes' and 'device-properties-fixes'Rafael J. Wysocki2-3/+3
* acpica-fixes: ACPICA: Dispatcher: Update thread ID for recursive method calls * device-properties-fixes: device property: Avoid potential dereferences of invalid pointers
2016-05-05Merge branch 'akpm' (patches from Andrew)Linus Torvalds4-78/+116
Merge fixes from Andrew Morton: "14 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: byteswap: try to avoid __builtin_constant_p gcc bug lib/stackdepot: avoid to return 0 handle mm: fix kcompactd hang during memory offlining modpost: fix module autoloading for OF devices with generic compatible property proc: prevent accessing /proc/<PID>/environ until it's ready mm/zswap: provide unique zpool name mm: thp: kvm: fix memory corruption in KVM with THP enabled MAINTAINERS: fix Rajendra Nayak's address mm, cma: prevent nr_isolated_* counters from going negative mm: update min_free_kbytes from khugepaged after core initialization huge pagecache: mmap_sem is unlocked when truncation splits pmd rapidio/mport_cdev: fix uapi type definitions mm: memcontrol: let v2 cgroups follow changes in system swappiness mm: thp: correct split_huge_pages file permission
2016-05-05byteswap: try to avoid __builtin_constant_p gcc bugArnd Bergmann1-9/+15
This is another attempt to avoid a regression in wwn_to_u64() after that started using get_unaligned_be64(), which in turn ran into a bug on gcc-4.9 through 6.1. The regression got introduced due to the combination of two separate workarounds (commits e3bde9568d99: "include/linux/unaligned: force inlining of byteswap operations" and ef3fb2422ffe: "scsi: fc: use get/put_unaligned64 for wwn access") that each try to sidestep distinct problems with gcc behavior (code growth and increased stack usage). Unfortunately after both have been applied, a more serious gcc bug has been uncovered, leading to incorrect object code that discards part of a function and causes undefined behavior. As part of this problem is how __builtin_constant_p gets evaluated on an argument passed by reference into an inline function, this avoids the use of __builtin_constant_p() for all architectures that set CONFIG_ARCH_USE_BUILTIN_BSWAP. Most architectures do not set ARCH_SUPPORTS_OPTIMIZED_INLINING, which means they probably do not suffer from the problem in the qla2xxx driver, but they might still run into it elsewhere. Both of the original workarounds were only merged in the 4.6 kernel, and the bug that is fixed by this patch should only appear if both are there, so we probably don't need to backport the fix. On the other hand, it works by simplifying the code path and should not have any negative effects. [arnd@arndb.de: fix older gcc warnings] (http://lkml.kernel.org/r/12243652.bxSxEgjgfk@wuerfel) Link: https://lkml.org/lkml/headers/2016/4/12/1103 Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122 Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70232 Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646 Fixes: e3bde9568d99 ("include/linux/unaligned: force inlining of byteswap operations") Fixes: ef3fb2422ffe ("scsi: fc: use get/put_unaligned64 for wwn access") Link: http://lkml.kernel.org/r/1780465.XdtPJpi8Tt@wuerfel Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Tested-by: Josh Poimboeuf <jpoimboe@redhat.com> # on gcc-5.3 Tested-by: Quinn Tran <quinn.tran@qlogic.com> Cc: Martin Jambor <mjambor@suse.cz> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: James Bottomley <James.Bottomley@hansenpartnership.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Thomas Graf <tgraf@suug.ch> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Himanshu Madhani <himanshu.madhani@qlogic.com> Cc: Jan Hubicka <hubicka@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-05mm: thp: kvm: fix memory corruption in KVM with THP enabledAndrea Arcangeli1-0/+22
After the THP refcounting change, obtaining a compound pages from get_user_pages() no longer allows us to assume the entire compound page is immediately mappable from a secondary MMU. A secondary MMU doesn't want to call get_user_pages() more than once for each compound page, in order to know if it can map the whole compound page. So a secondary MMU needs to know from a single get_user_pages() invocation when it can map immediately the entire compound page to avoid a flood of unnecessary secondary MMU faults and spurious atomic_inc()/atomic_dec() (pages don't have to be pinned by MMU notifier users). Ideally instead of the page->_mapcount < 1 check, get_user_pages() should return the granularity of the "page" mapping in the "mm" passed to get_user_pages(). However it's non trivial change to pass the "pmd" status belonging to the "mm" walked by get_user_pages up the stack (up to the caller of get_user_pages). So the fix just checks if there is not a single pte mapping on the page returned by get_user_pages, and in turn if the caller can assume that the whole compound page is mapped in the current "mm" (in a pmd_trans_huge()). In such case the entire compound page is safe to map into the secondary MMU without additional get_user_pages() calls on the surrounding tail/head pages. In addition of being faster, not having to run other get_user_pages() calls also reduces the memory footprint of the secondary MMU fault in case the pmd split happened as result of memory pressure. Without this fix after a MADV_DONTNEED (like invoked by QEMU during postcopy live migration or balloning) or after generic swapping (with a failure in split_huge_page() that would only result in pmd splitting and not a physical page split), KVM would map the whole compound page into the shadow pagetables, despite regular faults or userfaults (like UFFDIO_COPY) may map regular pages into the primary MMU as result of the pte faults, leading to the guest mode and userland mode going out of sync and not working on the same memory at all times. Any other secondary MMU notifier manager (KVM is just one of the many MMU notifier users) will need the same information if it doesn't want to run a flood of get_user_pages_fast and it can support multiple granularity in the secondary MMU mappings, so I think it is justified to be exposed not just to KVM. The other option would be to move transparent_hugepage_adjust to mm/huge_memory.c but that currently has all kind of KVM data structures in it, so it's definitely not a cut-and-paste work, so I couldn't do a fix as cleaner as this one for 4.6. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: "Li, Liang Z" <liang.z.li@intel.com> Cc: Amit Shah <amit.shah@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-05rapidio/mport_cdev: fix uapi type definitionsAlexandre Bounine1-69/+75
Fix problems in uapi definitions reported by Gabriel Laskar: (see https://lkml.org/lkml/2016/4/5/205 for details) - move public header file rio_mport_cdev.h to include/uapi/linux directory - change types in data structures passed as IOCTL parameters - improve parameter checking in some IOCTL service routines Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Reported-by: Gabriel Laskar <gabriel@lse.epita.fr> Tested-by: Barry Wood <barry.wood@idt.com> Cc: Gabriel Laskar <gabriel@lse.epita.fr> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Aurelien Jacquiot <a-jacquiot@ti.com> Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com> Cc: Barry Wood <barry.wood@idt.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-05mm: memcontrol: let v2 cgroups follow changes in system swappinessJohannes Weiner1-0/+4
Cgroup2 currently doesn't have a per-cgroup swappiness setting. We might want to add one later - that's a different discussion - but until we do, the cgroups should always follow the system setting. Otherwise it will be unchangeably set to whatever the ancestor inherited from the system setting at the time of cgroup creation. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: <stable@vger.kernel.org> [4.5] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-05Merge tag 'asm-generic-4.6' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic Pull asm-generic syscall fix from Arnd Bergmann: "My last pull request for asm-generic had just one patch that added two new system calls to asm/unistd.h, but unfortunately it turned out to be wrong, pointing arch/tile compat mode at the native handlers rather than the compat ones. This was spotted by Yury Norov, who is working on ILP32 mode for arch/arm64, which would have the same problem when merged. This fixes the table to use the correct compat syscalls, like the other 64-bit architectures do. I'll try to find the time to come up with a solution that prevents this problem from happening again, by allowing all future system calls to just get added in a single file for use by all architectures" * tag 'asm-generic-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: asm-generic: use compat version for preadv2 and pwritev2