10 files changed, 2751 insertions, 0 deletions
diff --git a/Documentation/staging/crc32.rst b/Documentation/staging/crc32.rst
new file mode 100644
index 000000000000..8a6860f33b4e
--- /dev/null
+++ b/Documentation/staging/crc32.rst
@@ -0,0 +1,189 @@
+=================================
+brief tutorial on CRC computation
+=================================
+
+A CRC is a long-division remainder.  You add the CRC to the message,
+and the whole thing (message+CRC) is a multiple of the given
+CRC polynomial.  To check the CRC, you can either check that the
+CRC matches the recomputed value, *or* you can check that the
+remainder computed on the message+CRC is 0.  This latter approach
+is used by a lot of hardware implementations, and is why so many
+protocols put the end-of-frame flag after the CRC.
+
+It's actually the same long division you learned in school, except that:
+
+- We're working in binary, so the digits are only 0 and 1, and
+- When dividing polynomials, there are no carries.  Rather than add and
+  subtract, we just xor.  Thus, we tend to get a bit sloppy about
+  the difference between adding and subtracting.
+
+Like all division, the remainder is always smaller than the divisor.
+To produce a 32-bit CRC, the divisor is actually a 33-bit CRC polynomial.
+Since it's 33 bits long, bit 32 is always going to be set, so usually the
+CRC is written in hex with the most significant bit omitted.  (If you're
+familiar with the IEEE 754 floating-point format, it's the same idea.)
+
+Note that a CRC is computed over a string of *bits*, so you have
+to decide on the endianness of the bits within each byte.  To get
+the best error-detecting properties, this should correspond to the
+order they're actually sent.  For example, standard RS-232 serial is
+little-endian; the most significant bit (sometimes used for parity)
+is sent last.  And when appending a CRC word to a message, you should
+do it in the right order, matching the endianness.
+
+Just like with ordinary division, you proceed one digit (bit) at a time.
+Each step of the division you take one more digit (bit) of the dividend
+and append it to the current remainder.  Then you figure out the
+appropriate multiple of the divisor to subtract to being the remainder
+back into range.  In binary, this is easy - it has to be either 0 or 1,
+and to make the XOR cancel, it's just a copy of bit 32 of the remainder.
+
+When computing a CRC, we don't care about the quotient, so we can
+throw the quotient bit away, but subtract the appropriate multiple of
+the polynomial from the remainder and we're back to where we started,
+ready to process the next bit.
+
+A big-endian CRC written this way would be coded like::
+
+	for (i = 0; i < input_bits; i++) {
+		multiple = remainder & 0x80000000 ? CRCPOLY : 0;
+		remainder = (remainder << 1 | next_input_bit()) ^ multiple;
+	}
+
+Notice how, to get at bit 32 of the shifted remainder, we look
+at bit 31 of the remainder *before* shifting it.
+
+But also notice how the next_input_bit() bits we're shifting into
+the remainder don't actually affect any decision-making until
+32 bits later.  Thus, the first 32 cycles of this are pretty boring.
+Also, to add the CRC to a message, we need a 32-bit-long hole for it at
+the end, so we have to add 32 extra cycles shifting in zeros at the
+end of every message.
+
+These details lead to a standard trick: rearrange merging in the
+next_input_bit() until the moment it's needed.  Then the first 32 cycles
+can be precomputed, and merging in the final 32 zero bits to make room
+for the CRC can be skipped entirely.  This changes the code to::
+
+	for (i = 0; i < input_bits; i++) {
+		remainder ^= next_input_bit() << 31;
+		multiple = (remainder & 0x80000000) ? CRCPOLY : 0;
+		remainder = (remainder << 1) ^ multiple;
+	}
+
+With this optimization, the little-endian code is particularly simple::
+
+	for (i = 0; i < input_bits; i++) {
+		remainder ^= next_input_bit();
+		multiple = (remainder & 1) ? CRCPOLY : 0;
+		remainder = (remainder >> 1) ^ multiple;
+	}
+
+The most significant coefficient of the remainder polynomial is stored
+in the least significant bit of the binary "remainder" variable.
+The other details of endianness have been hidden in CRCPOLY (which must
+be bit-reversed) and next_input_bit().
+
+As long as next_input_bit is returning the bits in a sensible order, we don't
+*have* to wait until the last possible moment to merge in additional bits.
+We can do it 8 bits at a time rather than 1 bit at a time::
+
+	for (i = 0; i < input_bytes; i++) {
+		remainder ^= next_input_byte() << 24;
+		for (j = 0; j < 8; j++) {
+			multiple = (remainder & 0x80000000) ? CRCPOLY : 0;
+			remainder = (remainder << 1) ^ multiple;
+		}
+	}
+
+Or in little-endian::
+
+	for (i = 0; i < input_bytes; i++) {
+		remainder ^= next_input_byte();
+		for (j = 0; j < 8; j++) {
+			multiple = (remainder & 1) ? CRCPOLY : 0;
+			remainder = (remainder >> 1) ^ multiple;
+		}
+	}
+
+If the input is a multiple of 32 bits, you can even XOR in a 32-bit
+word at a time and increase the inner loop count to 32.
+
+You can also mix and match the two loop styles, for example doing the
+bulk of a message byte-at-a-time and adding bit-at-a-time processing
+for any fractional bytes at the end.
+
+To reduce the number of conditional branches, software commonly uses
+the byte-at-a-time table method, popularized by Dilip V. Sarwate,
+"Computation of Cyclic Redundancy Checks via Table Look-Up", Comm. ACM
+v.31 no.8 (August 1998) p. 1008-1013.
+
+Here, rather than just shifting one bit of the remainder to decide
+in the correct multiple to subtract, we can shift a byte at a time.
+This produces a 40-bit (rather than a 33-bit) intermediate remainder,
+and the correct multiple of the polynomial to subtract is found using
+a 256-entry lookup table indexed by the high 8 bits.
+
+(The table entries are simply the CRC-32 of the given one-byte messages.)
+
+When space is more constrained, smaller tables can be used, e.g. two
+4-bit shifts followed by a lookup in a 16-entry table.
+
+It is not practical to process much more than 8 bits at a time using this
+technique, because tables larger than 256 entries use too much memory and,
+more importantly, too much of the L1 cache.
+
+To get higher software performance, a "slicing" technique can be used.
+See "High Octane CRC Generation with the Intel Slicing-by-8 Algorithm",
+ftp://download.intel.com/technology/comms/perfnet/download/slicing-by-8.pdf
+
+This does not change the number of table lookups, but does increase
+the parallelism.  With the classic Sarwate algorithm, each table lookup
+must be completed before the index of the next can be computed.
+
+A "slicing by 2" technique would shift the remainder 16 bits at a time,
+producing a 48-bit intermediate remainder.  Rather than doing a single
+lookup in a 65536-entry table, the two high bytes are looked up in
+two different 256-entry tables.  Each contains the remainder required
+to cancel out the corresponding byte.  The tables are different because the
+polynomials to cancel are different.  One has non-zero coefficients from
+x^32 to x^39, while the other goes from x^40 to x^47.
+
+Since modern processors can handle many parallel memory operations, this
+takes barely longer than a single table look-up and thus performs almost
+twice as fast as the basic Sarwate algorithm.
+
+This can be extended to "slicing by 4" using 4 256-entry tables.
+Each step, 32 bits of data is fetched, XORed with the CRC, and the result
+broken into bytes and looked up in the tables.  Because the 32-bit shift
+leaves the low-order bits of the intermediate remainder zero, the
+final CRC is simply the XOR of the 4 table look-ups.
+
+But this still enforces sequential execution: a second group of table
+look-ups cannot begin until the previous groups 4 table look-ups have all
+been completed.  Thus, the processor's load/store unit is sometimes idle.
+
+To make maximum use of the processor, "slicing by 8" performs 8 look-ups
+in parallel.  Each step, the 32-bit CRC is shifted 64 bits and XORed
+with 64 bits of input data.  What is important to note is that 4 of
+those 8 bytes are simply copies of the input data; they do not depend
+on the previous CRC at all.  Thus, those 4 table look-ups may commence
+immediately, without waiting for the previous loop iteration.
+
+By always having 4 loads in flight, a modern superscalar processor can
+be kept busy and make full use of its L1 cache.
+
+Two more details about CRC implementation in the real world:
+
+Normally, appending zero bits to a message which is already a multiple
+of a polynomial produces a larger multiple of that polynomial.  Thus,
+a basic CRC will not detect appended zero bits (or bytes).  To enable
+a CRC to detect this condition, it's common to invert the CRC before
+appending it.  This makes the remainder of the message+crc come out not
+as zero, but some fixed non-zero value.  (The CRC of the inversion
+pattern, 0xffffffff.)
+
+The same problem applies to zero bits prepended to the message, and a
+similar solution is used.  Instead of starting the CRC computation with
+a remainder of 0, an initial remainder of all ones is used.  As long as
+you start the same way on decoding, it doesn't make a difference.
diff --git a/Documentation/staging/index.rst b/Documentation/staging/index.rst
new file mode 100644
index 000000000000..8e98517675ca
--- /dev/null
+++ b/Documentation/staging/index.rst
@@ -0,0 +1,32 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Unsorted Documentation
+======================
+
+.. toctree::
+   :maxdepth: 2
+
+   crc32
+   kprobes
+   lzo
+   remoteproc
+   rpmsg
+   speculation
+   static-keys
+   tee
+   xz
+
+Atomic Types
+============
+
+.. literalinclude:: ../atomic_t.txt
+
+Atomic bitops
+=============
+
+.. literalinclude:: ../atomic_bitops.txt
+
+Memory Barriers
+===============
+
+.. literalinclude:: ../memory-barriers.txt
diff --git a/Documentation/staging/kprobes.rst b/Documentation/staging/kprobes.rst
new file mode 100644
index 000000000000..8baab8832c5b
--- /dev/null
+++ b/Documentation/staging/kprobes.rst
@@ -0,0 +1,801 @@
+=======================
+Kernel Probes (Kprobes)
+=======================
+
+:Author: Jim Keniston <jkenisto@us.ibm.com>
+:Author: Prasanna S Panchamukhi <prasanna.panchamukhi@gmail.com>
+:Author: Masami Hiramatsu <mhiramat@redhat.com>
+
+.. CONTENTS
+
+  1. Concepts: Kprobes, and Return Probes
+  2. Architectures Supported
+  3. Configuring Kprobes
+  4. API Reference
+  5. Kprobes Features and Limitations
+  6. Probe Overhead
+  7. TODO
+  8. Kprobes Example
+  9. Kretprobes Example
+  10. Deprecated Features
+  Appendix A: The kprobes debugfs interface
+  Appendix B: The kprobes sysctl interface
+
+Concepts: Kprobes and Return Probes
+=========================================
+
+Kprobes enables you to dynamically break into any kernel routine and
+collect debugging and performance information non-disruptively. You
+can trap at almost any kernel code address [1]_, specifying a handler
+routine to be invoked when the breakpoint is hit.
+
+.. [1] some parts of the kernel code can not be trapped, see
+       :ref:`kprobes_blacklist`)
+
+There are currently two types of probes: kprobes, and kretprobes
+(also called return probes).  A kprobe can be inserted on virtually
+any instruction in the kernel.  A return probe fires when a specified
+function returns.
+
+In the typical case, Kprobes-based instrumentation is packaged as
+a kernel module.  The module's init function installs ("registers")
+one or more probes, and the exit function unregisters them.  A
+registration function such as register_kprobe() specifies where
+the probe is to be inserted and what handler is to be called when
+the probe is hit.
+
+There are also ``register_/unregister_*probes()`` functions for batch
+registration/unregistration of a group of ``*probes``. These functions
+can speed up unregistration process when you have to unregister
+a lot of probes at once.
+
+The next four subsections explain how the different types of
+probes work and how jump optimization works.  They explain certain
+things that you'll need to know in order to make the best use of
+Kprobes -- e.g., the difference between a pre_handler and
+a post_handler, and how to use the maxactive and nmissed fields of
+a kretprobe.  But if you're in a hurry to start using Kprobes, you
+can skip ahead to :ref:`kprobes_archs_supported`.
+
+How Does a Kprobe Work?
+-----------------------
+
+When a kprobe is registered, Kprobes makes a copy of the probed
+instruction and replaces the first byte(s) of the probed instruction
+with a breakpoint instruction (e.g., int3 on i386 and x86_64).
+
+When a CPU hits the breakpoint instruction, a trap occurs, the CPU's
+registers are saved, and control passes to Kprobes via the
+notifier_call_chain mechanism.  Kprobes executes the "pre_handler"
+associated with the kprobe, passing the handler the addresses of the
+kprobe struct and the saved registers.
+
+Next, Kprobes single-steps its copy of the probed instruction.
+(It would be simpler to single-step the actual instruction in place,
+but then Kprobes would have to temporarily remove the breakpoint
+instruction.  This would open a small time window when another CPU
+could sail right past the probepoint.)
+
+After the instruction is single-stepped, Kprobes executes the
+"post_handler," if any, that is associated with the kprobe.
+Execution then continues with the instruction following the probepoint.
+
+Changing Execution Path
+-----------------------
+
+Since kprobes can probe into a running kernel code, it can change the
+register set, including instruction pointer. This operation requires
+maximum care, such as keeping the stack frame, recovering the execution
+path etc. Since it operates on a running kernel and needs deep knowledge
+of computer architecture and concurrent computing, you can easily shoot
+your foot.
+
+If you change the instruction pointer (and set up other related
+registers) in pre_handler, you must return !0 so that kprobes stops
+single stepping and just returns to the given address.
+This also means post_handler should not be called anymore.
+
+Note that this operation may be harder on some architectures which use
+TOC (Table of Contents) for function call, since you have to setup a new
+TOC for your function in your module, and recover the old one after
+returning from it.
+
+Return Probes
+-------------
+
+How Does a Return Probe Work?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When you call register_kretprobe(), Kprobes establishes a kprobe at
+the entry to the function.  When the probed function is called and this
+probe is hit, Kprobes saves a copy of the return address, and replaces
+the return address with the address of a "trampoline."  The trampoline
+is an arbitrary piece of code -- typically just a nop instruction.
+At boot time, Kprobes registers a kprobe at the trampoline.
+
+When the probed function executes its return instruction, control
+passes to the trampoline and that probe is hit.  Kprobes' trampoline
+handler calls the user-specified return handler associated with the
+kretprobe, then sets the saved instruction pointer to the saved return
+address, and that's where execution resumes upon return from the trap.
+
+While the probed function is executing, its return address is
+stored in an object of type kretprobe_instance.  Before calling
+register_kretprobe(), the user sets the maxactive field of the
+kretprobe struct to specify how many instances of the specified
+function can be probed simultaneously.  register_kretprobe()
+pre-allocates the indicated number of kretprobe_instance objects.
+
+For example, if the function is non-recursive and is called with a
+spinlock held, maxactive = 1 should be enough.  If the function is
+non-recursive and can never relinquish the CPU (e.g., via a semaphore
+or preemption), NR_CPUS should be enough.  If maxactive <= 0, it is
+set to a default value.  If CONFIG_PREEMPT is enabled, the default
+is max(10, 2*NR_CPUS).  Otherwise, the default is NR_CPUS.
+
+It's not a disaster if you set maxactive too low; you'll just miss
+some probes.  In the kretprobe struct, the nmissed field is set to
+zero when the return probe is registered, and is incremented every
+time the probed function is entered but there is no kretprobe_instance
+object available for establishing the return probe.
+
+Kretprobe entry-handler
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Kretprobes also provides an optional user-specified handler which runs
+on function entry. This handler is specified by setting the entry_handler
+field of the kretprobe struct. Whenever the kprobe placed by kretprobe at the
+function entry is hit, the user-defined entry_handler, if any, is invoked.
+If the entry_handler returns 0 (success) then a corresponding return handler
+is guaranteed to be called upon function return. If the entry_handler
+returns a non-zero error then Kprobes leaves the return address as is, and
+the kretprobe has no further effect for that particular function instance.
+
+Multiple entry and return handler invocations are matched using the unique
+kretprobe_instance object associated with them. Additionally, a user
+may also specify per return-instance private data to be part of each
+kretprobe_instance object. This is especially useful when sharing private
+data between corresponding user entry and return handlers. The size of each
+private data object can be specified at kretprobe registration time by
+setting the data_size field of the kretprobe struct. This data can be
+accessed through the data field of each kretprobe_instance object.
+
+In case probed function is entered but there is no kretprobe_instance
+object available, then in addition to incrementing the nmissed count,
+the user entry_handler invocation is also skipped.
+
+.. _kprobes_jump_optimization:
+
+How Does Jump Optimization Work?
+--------------------------------
+
+If your kernel is built with CONFIG_OPTPROBES=y (currently this flag
+is automatically set 'y' on x86/x86-64, non-preemptive kernel) and
+the "debug.kprobes_optimization" kernel parameter is set to 1 (see
+sysctl(8)), Kprobes tries to reduce probe-hit overhead by using a jump
+instruction instead of a breakpoint instruction at each probepoint.
+
+Init a Kprobe
+^^^^^^^^^^^^^
+
+When a probe is registered, before attempting this optimization,
+Kprobes inserts an ordinary, breakpoint-based kprobe at the specified
+address. So, even if it's not possible to optimize this particular
+probepoint, there'll be a probe there.
+
+Safety Check
+^^^^^^^^^^^^
+
+Before optimizing a probe, Kprobes performs the following safety checks:
+
+- Kprobes verifies that the region that will be replaced by the jump
+  instruction (the "optimized region") lies entirely within one function.
+  (A jump instruction is multiple bytes, and so may overlay multiple
+  instructions.)
+
+- Kprobes analyzes the entire function and verifies that there is no
+  jump into the optimized region.  Specifically:
+
+  - the function contains no indirect jump;
+  - the function contains no instruction that causes an exception (since
+    the fixup code triggered by the exception could jump back into the
+    optimized region -- Kprobes checks the exception tables to verify this);
+  - there is no near jump to the optimized region (other than to the first
+    byte).
+
+- For each instruction in the optimized region, Kprobes verifies that
+  the instruction can be executed out of line.
+
+Preparing Detour Buffer
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Next, Kprobes prepares a "detour" buffer, which contains the following
+instruction sequence:
+
+- code to push the CPU's registers (emulating a breakpoint trap)
+- a call to the trampoline code which calls user's probe handlers.
+- code to restore registers
+- the instructions from the optimized region
+- a jump back to the original execution path.
+
+Pre-optimization
+^^^^^^^^^^^^^^^^
+
+After preparing the detour buffer, Kprobes verifies that none of the
+following situations exist:
+
+- The probe has a post_handler.
+- Other instructions in the optimized region are probed.
+- The probe is disabled.
+
+In any of the above cases, Kprobes won't start optimizing the probe.
+Since these are temporary situations, Kprobes tries to start
+optimizing it again if the situation is changed.
+
+If the kprobe can be optimized, Kprobes enqueues the kprobe to an
+optimizing list, and kicks the kprobe-optimizer workqueue to optimize
+it.  If the to-be-optimized probepoint is hit before being optimized,
+Kprobes returns control to the original instruction path by setting
+the CPU's instruction pointer to the copied code in the detour buffer
+-- thus at least avoiding the single-step.
+
+Optimization
+^^^^^^^^^^^^
+
+The Kprobe-optimizer doesn't insert the jump instruction immediately;
+rather, it calls synchronize_rcu() for safety first, because it's
+possible for a CPU to be interrupted in the middle of executing the
+optimized region [3]_.  As you know, synchronize_rcu() can ensure
+that all interruptions that were active when synchronize_rcu()
+was called are done, but only if CONFIG_PREEMPT=n.  So, this version
+of kprobe optimization supports only kernels with CONFIG_PREEMPT=n [4]_.
+
+After that, the Kprobe-optimizer calls stop_machine() to replace
+the optimized region with a jump instruction to the detour buffer,
+using text_poke_smp().
+
+Unoptimization
+^^^^^^^^^^^^^^
+
+When an optimized kprobe is unregistered, disabled, or blocked by
+another kprobe, it will be unoptimized.  If this happens before
+the optimization is complete, the kprobe is just dequeued from the
+optimized list.  If the optimization has been done, the jump is
+replaced with the original code (except for an int3 breakpoint in
+the first byte) by using text_poke_smp().
+
+.. [3] Please imagine that the 2nd instruction is interrupted and then
+   the optimizer replaces the 2nd instruction with the jump *address*
+   while the interrupt handler is running. When the interrupt
+   returns to original address, there is no valid instruction,
+   and it causes an unexpected result.
+
+.. [4] This optimization-safety checking may be replaced with the
+   stop-machine method that ksplice uses for supporting a CONFIG_PREEMPT=y
+   kernel.
+
+NOTE for geeks:
+The jump optimization changes the kprobe's pre_handler behavior.
+Without optimization, the pre_handler can change the kernel's execution
+path by changing regs->ip and returning 1.  However, when the probe
+is optimized, that modification is ignored.  Thus, if you want to
+tweak the kernel's execution path, you need to suppress optimization,
+using one of the following techniques:
+
+- Specify an empty function for the kprobe's post_handler.
+
+or
+
+- Execute 'sysctl -w debug.kprobes_optimization=n'
+
+.. _kprobes_blacklist:
+
+Blacklist
+---------
+
+Kprobes can probe most of the kernel except itself. This means
+that there are some functions where kprobes cannot probe. Probing
+(trapping) such functions can cause a recursive trap (e.g. double
+fault) or the nested probe handler may never be called.
+Kprobes manages such functions as a blacklist.
+If you want to add a function into the blacklist, you just need
+to (1) include linux/kprobes.h and (2) use NOKPROBE_SYMBOL() macro
+to specify a blacklisted function.
+Kprobes checks the given probe address against the blacklist and
+rejects registering it, if the given address is in the blacklist.
+
+.. _kprobes_archs_supported:
+
+Architectures Supported
+=======================
+
+Kprobes and return probes are implemented on the following
+architectures:
+
+- i386 (Supports jump optimization)
+- x86_64 (AMD-64, EM64T) (Supports jump optimization)
+- ppc64
+- ia64 (Does not support probes on instruction slot1.)
+- sparc64 (Return probes not yet implemented.)
+- arm
+- ppc
+- mips
+- s390
+- parisc
+
+Configuring Kprobes
+===================
+
+When configuring the kernel using make menuconfig/xconfig/oldconfig,
+ensure that CONFIG_KPROBES is set to "y". Under "General setup", look
+for "Kprobes".
+
+So that you can load and unload Kprobes-based instrumentation modules,
+make sure "Loadable module support" (CONFIG_MODULES) and "Module
+unloading" (CONFIG_MODULE_UNLOAD) are set to "y".
+
+Also make sure that CONFIG_KALLSYMS and perhaps even CONFIG_KALLSYMS_ALL
+are set to "y", since kallsyms_lookup_name() is used by the in-kernel
+kprobe address resolution code.
+
+If you need to insert a probe in the middle of a function, you may find
+it useful to "Compile the kernel with debug info" (CONFIG_DEBUG_INFO),
+so you can use "objdump -d -l vmlinux" to see the source-to-object
+code mapping.
+
+API Reference
+=============
+
+The Kprobes API includes a "register" function and an "unregister"
+function for each type of probe. The API also includes "register_*probes"
+and "unregister_*probes" functions for (un)registering arrays of probes.
+Here are terse, mini-man-page specifications for these functions and
+the associated probe handlers that you'll write. See the files in the
+samples/kprobes/ sub-directory for examples.
+
+register_kprobe
+---------------
+
+::
+
+	#include <linux/kprobes.h>
+	int register_kprobe(struct kprobe *kp);
+
+Sets a breakpoint at the address kp->addr.  When the breakpoint is
+hit, Kprobes calls kp->pre_handler.  After the probed instruction
+is single-stepped, Kprobe calls kp->post_handler.  If a fault
+occurs during execution of kp->pre_handler or kp->post_handler,
+or during single-stepping of the probed instruction, Kprobes calls
+kp->fault_handler.  Any or all handlers can be NULL. If kp->flags
+is set KPROBE_FLAG_DISABLED, that kp will be registered but disabled,
+so, its handlers aren't hit until calling enable_kprobe(kp).
+
+.. note::
+
+   1. With the introduction of the "symbol_name" field to struct kprobe,
+      the probepoint address resolution will now be taken care of by the kernel.
+      The following will now work::
+
+	kp.symbol_name = "symbol_name";
+
+      (64-bit powerpc intricacies such as function descriptors are handled
+      transparently)
+
+   2. Use the "offset" field of struct kprobe if the offset into the symbol
+      to install a probepoint is known. This field is used to calculate the
+      probepoint.
+
+   3. Specify either the kprobe "symbol_name" OR the "addr". If both are
+      specified, kprobe registration will fail with -EINVAL.
+
+   4. With CISC architectures (such as i386 and x86_64), the kprobes code
+      does not validate if the kprobe.addr is at an instruction boundary.
+      Use "offset" with caution.
+
+register_kprobe() returns 0 on success, or a negative errno otherwise.
+
+User's pre-handler (kp->pre_handler)::
+
+	#include <linux/kprobes.h>
+	#include <linux/ptrace.h>
+	int pre_handler(struct kprobe *p, struct pt_regs *regs);
+
+Called with p pointing to the kprobe associated with the breakpoint,
+and regs pointing to the struct containing the registers saved when
+the breakpoint was hit.  Return 0 here unless you're a Kprobes geek.
+
+User's post-handler (kp->post_handler)::
+
+	#include <linux/kprobes.h>
+	#include <linux/ptrace.h>
+	void post_handler(struct kprobe *p, struct pt_regs *regs,
+			  unsigned long flags);
+
+p and regs are as described for the pre_handler.  flags always seems
+to be zero.
+
+User's fault-handler (kp->fault_handler)::
+
+	#include <linux/kprobes.h>
+	#include <linux/ptrace.h>
+	int fault_handler(struct kprobe *p, struct pt_regs *regs, int trapnr);
+
+p and regs are as described for the pre_handler.  trapnr is the
+architecture-specific trap number associated with the fault (e.g.,
+on i386, 13 for a general protection fault or 14 for a page fault).
+Returns 1 if it successfully handled the exception.
+
+register_kretprobe
+------------------
+
+::
+
+	#include <linux/kprobes.h>
+	int register_kretprobe(struct kretprobe *rp);
+
+Establishes a return probe for the function whose address is
+rp->kp.addr.  When that function returns, Kprobes calls rp->handler.
+You must set rp->maxactive appropriately before you call
+register_kretprobe(); see "How Does a Return Probe Work?" for details.
+
+register_kretprobe() returns 0 on success, or a negative errno
+otherwise.
+
+User's return-probe handler (rp->handler)::
+
+	#include <linux/kprobes.h>
+	#include <linux/ptrace.h>
+	int kretprobe_handler(struct kretprobe_instance *ri,
+			      struct pt_regs *regs);
+
+regs is as described for kprobe.pre_handler.  ri points to the
+kretprobe_instance object, of which the following fields may be
+of interest:
+
+- ret_addr: the return address
+- rp: points to the corresponding kretprobe object
+- task: points to the corresponding task struct
+- data: points to per return-instance private data; see "Kretprobe
+	entry-handler" for details.
+
+The regs_return_value(regs) macro provides a simple abstraction to
+extract the return value from the appropriate register as defined by
+the architecture's ABI.
+
+The handler's return value is currently ignored.
+
+unregister_*probe
+------------------
+
+::
+
+	#include <linux/kprobes.h>
+	void unregister_kprobe(struct kprobe *kp);
+	void unregister_kretprobe(struct kretprobe *rp);
+
+Removes the specified probe.  The unregister function can be called
+at any time after the probe has been registered.
+
+.. note::
+
+   If the functions find an incorrect probe (ex. an unregistered probe),
+   they clear the addr field of the probe.
+
+register_*probes
+----------------
+
+::
+
+	#include <linux/kprobes.h>
+	int register_kprobes(struct kprobe **kps, int num);
+	int register_kretprobes(struct kretprobe **rps, int num);
+
+Registers each of the num probes in the specified array.  If any
+error occurs during registration, all probes in the array, up to
+the bad probe, are safely unregistered before the register_*probes
+function returns.
+
+- kps/rps: an array of pointers to ``*probe`` data structures
+- num: the number of the array entries.
+
+.. note::
+
+   You have to allocate(or define) an array of pointers and set all
+   of the array entries before using these functions.
+
+unregister_*probes
+------------------
+
+::
+
+	#include <linux/kprobes.h>
+	void unregister_kprobes(struct kprobe **kps, int num);
+	void unregister_kretprobes(struct kretprobe **rps, int num);
+
+Removes each of the num probes in the specified array at once.
+
+.. note::
+
+   If the functions find some incorrect probes (ex. unregistered
+   probes) in the specified array, they clear the addr field of those
+   incorrect probes. However, other probes in the array are
+   unregistered correctly.
+
+disable_*probe
+--------------
+
+::
+
+	#include <linux/kprobes.h>
+	int disable_kprobe(struct kprobe *kp);
+	int disable_kretprobe(struct kretprobe *rp);
+
+Temporarily disables the specified ``*probe``. You can enable it again by using
+enable_*probe(). You must specify the probe which has been registered.
+
+enable_*probe
+-------------
+
+::
+
+	#include <linux/kprobes.h>
+	int enable_kprobe(struct kprobe *kp);
+	int enable_kretprobe(struct kretprobe *rp);
+
+Enables ``*probe`` which has been disabled by disable_*probe(). You must specify
+the probe which has been registered.
+
+Kprobes Features and Limitations
+================================
+
+Kprobes allows multiple probes at the same address. Also,
+a probepoint for which there is a post_handler cannot be optimized.
+So if you install a kprobe with a post_handler, at an optimized
+probepoint, the probepoint will be unoptimized automatically.
+
+In general, you can install a probe anywhere in the kernel.
+In particular, you can probe interrupt handlers.  Known exceptions
+are discussed in this section.
+
+The register_*probe functions will return -EINVAL if you attempt
+to install a probe in the code that implements Kprobes (mostly
+kernel/kprobes.c and ``arch/*/kernel/kprobes.c``, but also functions such
+as do_page_fault and notifier_call_chain).
+
+If you install a probe in an inline-able function, Kprobes makes
+no attempt to chase down all inline instances of the function and
+install probes there.  gcc may inline a function without being asked,
+so keep this in mind if you're not seeing the probe hits you expect.
+
+A probe handler can modify the environment of the probed function
+-- e.g., by modifying kernel data structures, or by modifying the
+contents of the pt_regs struct (which are restored to the registers
+upon return from the breakpoint).  So Kprobes can be used, for example,
+to install a bug fix or to inject faults for testing.  Kprobes, of
+course, has no way to distinguish the deliberately injected faults
+from the accidental ones.  Don't drink and probe.
+
+Kprobes makes no attempt to prevent probe handlers from stepping on
+each other -- e.g., probing printk() and then calling printk() from a
+probe handler.  If a probe handler hits a probe, that second probe's
+handlers won't be run in that instance, and the kprobe.nmissed member
+of the second probe will be incremented.
+
+As of Linux v2.6.15-rc1, multiple handlers (or multiple instances of
+the same handler) may run concurrently on different CPUs.
+
+Kprobes does not use mutexes or allocate memory except during
+registration and unregistration.
+
+Probe handlers are run with preemption disabled or interrupt disabled,
+which depends on the architecture and optimization state.  (e.g.,
+kretprobe handlers and optimized kprobe handlers run without interrupt
+disabled on x86/x86-64).  In any case, your handler should not yield
+the CPU (e.g., by attempting to acquire a semaphore, or waiting I/O).
+
+Since a return probe is implemented by replacing the return
+address with the trampoline's address, stack backtraces and calls
+to __builtin_return_address() will typically yield the trampoline's
+address instead of the real return address for kretprobed functions.
+(As far as we can tell, __builtin_return_address() is used only
+for instrumentation and error reporting.)
+
+If the number of times a function is called does not match the number
+of times it returns, registering a return probe on that function may
+produce undesirable results. In such a case, a line:
+kretprobe BUG!: Processing kretprobe d000000000041aa8 @ c00000000004f48c
+gets printed. With this information, one will be able to correlate the
+exact instance of the kretprobe that caused the problem. We have the
+do_exit() case covered. do_execve() and do_fork() are not an issue.
+We're unaware of other specific cases where this could be a problem.
+
+If, upon entry to or exit from a function, the CPU is running on
+a stack other than that of the current task, registering a return
+probe on that function may produce undesirable results.  For this
+reason, Kprobes doesn't support return probes (or kprobes)
+on the x86_64 version of __switch_to(); the registration functions
+return -EINVAL.
+
+On x86/x86-64, since the Jump Optimization of Kprobes modifies
+instructions widely, there are some limitations to optimization. To
+explain it, we introduce some terminology. Imagine a 3-instruction
+sequence consisting of a two 2-byte instructions and one 3-byte
+instruction.
+
+::
+
+		IA
+		|
+	[-2][-1][0][1][2][3][4][5][6][7]
+		[ins1][ins2][  ins3 ]
+		[<-     DCR       ->]
+		[<- JTPR ->]
+
+	ins1: 1st Instruction
+	ins2: 2nd Instruction
+	ins3: 3rd Instruction
+	IA:  Insertion Address
+	JTPR: Jump Target Prohibition Region
+	DCR: Detoured Code Region
+
+The instructions in DCR are copied to the out-of-line buffer
+of the kprobe, because the bytes in DCR are replaced by
+a 5-byte jump instruction. So there are several limitations.
+
+a) The instructions in DCR must be relocatable.
+b) The instructions in DCR must not include a call instruction.
+c) JTPR must not be targeted by any jump or call instruction.
+d) DCR must not straddle the border between functions.
+
+Anyway, these limitations are checked by the in-kernel instruction
+decoder, so you don't need to worry about that.
+
+Probe Overhead
+==============
+
+On a typical CPU in use in 2005, a kprobe hit takes 0.5 to 1.0
+microseconds to process.  Specifically, a benchmark that hits the same
+probepoint repeatedly, firing a simple handler each time, reports 1-2
+million hits per second, depending on the architecture.  A return-probe
+hit typically takes 50-75% longer than a kprobe hit.
+When you have a return probe set on a function, adding a kprobe at
+the entry to that function adds essentially no overhead.
+
+Here are sample overhead figures (in usec) for different architectures::
+
+  k = kprobe; r = return probe; kr = kprobe + return probe
+  on same function
+
+  i386: Intel Pentium M, 1495 MHz, 2957.31 bogomips
+  k = 0.57 usec; r = 0.92; kr = 0.99
+
+  x86_64: AMD Opteron 246, 1994 MHz, 3971.48 bogomips
+  k = 0.49 usec; r = 0.80; kr = 0.82
+
+  ppc64: POWER5 (gr), 1656 MHz (SMT disabled, 1 virtual CPU per physical CPU)
+  k = 0.77 usec; r = 1.26; kr = 1.45
+
+Optimized Probe Overhead
+------------------------
+
+Typically, an optimized kprobe hit takes 0.07 to 0.1 microseconds to
+process. Here are sample overhead figures (in usec) for x86 architectures::
+
+  k = unoptimized kprobe, b = boosted (single-step skipped), o = optimized kprobe,
+  r = unoptimized kretprobe, rb = boosted kretprobe, ro = optimized kretprobe.
+
+  i386: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips
+  k = 0.80 usec; b = 0.33; o = 0.05; r = 1.10; rb = 0.61; ro = 0.33
+
+  x86-64: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips
+  k = 0.99 usec; b = 0.43; o = 0.06; r = 1.24; rb = 0.68; ro = 0.30
+
+TODO
+====
+
+a. SystemTap (http://sourceware.org/systemtap): Provides a simplified
+   programming interface for probe-based instrumentation.  Try it out.
+b. Kernel return probes for sparc64.
+c. Support for other architectures.
+d. User-space probes.
+e. Watchpoint probes (which fire on data references).
+
+Kprobes Example
+===============
+
+See samples/kprobes/kprobe_example.c
+
+Kretprobes Example
+==================
+
+See samples/kprobes/kretprobe_example.c
+
+For additional information on Kprobes, refer to the following URLs:
+
+- http://www-106.ibm.com/developerworks/library/l-kprobes.html?ca=dgr-lnxw42Kprobe
+- http://www.redhat.com/magazine/005mar05/features/kprobes/
+- http://www-users.cs.umn.edu/~boutcher/kprobes/
+- http://www.linuxsymposium.org/2006/linuxsymposium_procv2.pdf (pages 101-115)
+
+Deprecated Features
+===================
+
+Jprobes is now a deprecated feature. People who are depending on it should
+migrate to other tracing features or use older kernels. Please consider to
+migrate your tool to one of the following options:
+
+- Use trace-event to trace target function with arguments.
+
+  trace-event is a low-overhead (and almost no visible overhead if it
+  is off) statically defined event interface. You can define new events
+  and trace it via ftrace or any other tracing tools.
+
+  See the following urls:
+
+    - https://lwn.net/Articles/379903/
+    - https://lwn.net/Articles/381064/
+    - https://lwn.net/Articles/383362/
+
+- Use ftrace dynamic events (kprobe event) with perf-probe.
+
+  If you build your kernel with debug info (CONFIG_DEBUG_INFO=y), you can
+  find which register/stack is assigned to which local variable or arguments
+  by using perf-probe and set up new event to trace it.
+
+  See following documents:
+
+  - Documentation/trace/kprobetrace.rst
+  - Documentation/trace/events.rst
+  - tools/perf/Documentation/perf-probe.txt
+
+
+The kprobes debugfs interface
+=============================
+
+
+With recent kernels (> 2.6.20) the list of registered kprobes is visible
+under the /sys/kernel/debug/kprobes/ directory (assuming debugfs is mounted at //sys/kernel/debug).
+
+/sys/kernel/debug/kprobes/list: Lists all registered probes on the system::
+
+	c015d71a  k  vfs_read+0x0
+	c03dedc5  r  tcp_v4_rcv+0x0
+
+The first column provides the kernel address where the probe is inserted.
+The second column identifies the type of probe (k - kprobe and r - kretprobe)
+while the third column specifies the symbol+offset of the probe.
+If the probed function belongs to a module, the module name is also
+specified. Following columns show probe status. If the probe is on
+a virtual address that is no longer valid (module init sections, module
+virtual addresses that correspond to modules that've been unloaded),
+such probes are marked with [GONE]. If the probe is temporarily disabled,
+such probes are marked with [DISABLED]. If the probe is optimized, it is
+marked with [OPTIMIZED]. If the probe is ftrace-based, it is marked with
+[FTRACE].
+
+/sys/kernel/debug/kprobes/enabled: Turn kprobes ON/OFF forcibly.
+
+Provides a knob to globally and forcibly turn registered kprobes ON or OFF.
+By default, all kprobes are enabled. By echoing "0" to this file, all
+registered probes will be disarmed, till such time a "1" is echoed to this
+file. Note that this knob just disarms and arms all kprobes and doesn't
+change each probe's disabling state. This means that disabled kprobes (marked
+[DISABLED]) will be not enabled if you turn ON all kprobes by this knob.
+
+
+The kprobes sysctl interface
+============================
+
+/proc/sys/debug/kprobes-optimization: Turn kprobes optimization ON/OFF.
+
+When CONFIG_OPTPROBES=y, this sysctl interface appears and it provides
+a knob to globally and forcibly turn jump optimization (see section
+:ref:`kprobes_jump_optimization`) ON or OFF. By default, jump optimization
+is allowed (ON). If you echo "0" to this file or set
+"debug.kprobes_optimization" to 0 via sysctl, all optimized probes will be
+unoptimized, and any new probes registered after that will not be optimized.
+
+Note that this knob *changes* the optimized state. This means that optimized
+probes (marked [OPTIMIZED]) will be unoptimized ([OPTIMIZED] tag will be
+removed). If the knob is turned on, they will be optimized again.
+
diff --git a/Documentation/staging/lzo.rst b/Documentation/staging/lzo.rst
new file mode 100644
index 000000000000..f65b51523014
--- /dev/null
+++ b/Documentation/staging/lzo.rst
@@ -0,0 +1,202 @@
+===========================================================
+LZO stream format as understood by Linux's LZO decompressor
+===========================================================
+
+Introduction
+============
+
+  This is not a specification. No specification seems to be publicly available
+  for the LZO stream format. This document describes what input format the LZO
+  decompressor as implemented in the Linux kernel understands. The file subject
+  of this analysis is lib/lzo/lzo1x_decompress_safe.c. No analysis was made on
+  the compressor nor on any other implementations though it seems likely that
+  the format matches the standard one. The purpose of this document is to
+  better understand what the code does in order to propose more efficient fixes
+  for future bug reports.
+
+Description
+===========
+
+  The stream is composed of a series of instructions, operands, and data. The
+  instructions consist in a few bits representing an opcode, and bits forming
+  the operands for the instruction, whose size and position depend on the
+  opcode and on the number of literals copied by previous instruction. The
+  operands are used to indicate:
+
+    - a distance when copying data from the dictionary (past output buffer)
+    - a length (number of bytes to copy from dictionary)
+    - the number of literals to copy, which is retained in variable "state"
+      as a piece of information for next instructions.
+
+  Optionally depending on the opcode and operands, extra data may follow. These
+  extra data can be a complement for the operand (eg: a length or a distance
+  encoded on larger values), or a literal to be copied to the output buffer.
+
+  The first byte of the block follows a different encoding from other bytes, it
+  seems to be optimized for literal use only, since there is no dictionary yet
+  prior to that byte.
+
+  Lengths are always encoded on a variable size starting with a small number
+  of bits in the operand. If the number of bits isn't enough to represent the
+  length, up to 255 may be added in increments by consuming more bytes with a
+  rate of at most 255 per extra byte (thus the compression ratio cannot exceed
+  around 255:1). The variable length encoding using #bits is always the same::
+
+       length = byte & ((1 << #bits) - 1)
+       if (!length) {
+               length = ((1 << #bits) - 1)
+               length += 255*(number of zero bytes)
+               length += first-non-zero-byte
+       }
+       length += constant (generally 2 or 3)
+
+  For references to the dictionary, distances are relative to the output
+  pointer. Distances are encoded using very few bits belonging to certain
+  ranges, resulting in multiple copy instructions using different encodings.
+  Certain encodings involve one extra byte, others involve two extra bytes
+  forming a little-endian 16-bit quantity (marked LE16 below).
+
+  After any instruction except the large literal copy, 0, 1, 2 or 3 literals
+  are copied before starting the next instruction. The number of literals that
+  were copied may change the meaning and behaviour of the next instruction. In
+  practice, only one instruction needs to know whether 0, less than 4, or more
+  literals were copied. This is the information stored in the <state> variable
+  in this implementation. This number of immediate literals to be copied is
+  generally encoded in the last two bits of the instruction but may also be
+  taken from the last two bits of an extra operand (eg: distance).
+
+  End of stream is declared when a block copy of distance 0 is seen. Only one
+  instruction may encode this distance (0001HLLL), it takes one LE16 operand
+  for the distance, thus requiring 3 bytes.
+
+  .. important::
+
+     In the code some length checks are missing because certain instructions
+     are called under the assumption that a certain number of bytes follow
+     because it has already been guaranteed before parsing the instructions.
+     They just have to "refill" this credit if they consume extra bytes. This
+     is an implementation design choice independent on the algorithm or
+     encoding.
+
+Versions
+
+0: Original version
+1: LZO-RLE
+
+Version 1 of LZO implements an extension to encode runs of zeros using run
+length encoding. This improves speed for data with many zeros, which is a
+common case for zram. This modifies the bitstream in a backwards compatible way
+(v1 can correctly decompress v0 compressed data, but v0 cannot read v1 data).
+
+For maximum compatibility, both versions are available under different names
+(lzo and lzo-rle). Differences in the encoding are noted in this document with
+e.g.: version 1 only.
+
+Byte sequences
+==============
+
+  First byte encoding::
+
+      0..16   : follow regular instruction encoding, see below. It is worth
+                noting that code 16 will represent a block copy from the
+                dictionary which is empty, and that it will always be
+                invalid at this place.
+
+      17      : bitstream version. If the first byte is 17, and compressed
+                stream length is at least 5 bytes (length of shortest possible
+                versioned bitstream), the next byte gives the bitstream version
+                (version 1 only).
+                Otherwise, the bitstream version is 0.
+
+      18..21  : copy 0..3 literals
+                state = (byte - 17) = 0..3  [ copy <state> literals ]
+                skip byte
+
+      22..255 : copy literal string
+                length = (byte - 17) = 4..238
+                state = 4 [ don't copy extra literals ]
+                skip byte
+
+  Instruction encoding::
+
+      0 0 0 0 X X X X  (0..15)
+        Depends on the number of literals copied by the last instruction.
+        If last instruction did not copy any literal (state == 0), this
+        encoding will be a copy of 4 or more literal, and must be interpreted
+        like this :
+
+           0 0 0 0 L L L L  (0..15)  : copy long literal string
+           length = 3 + (L ?: 15 + (zero_bytes * 255) + non_zero_byte)
+           state = 4  (no extra literals are copied)
+
+        If last instruction used to copy between 1 to 3 literals (encoded in
+        the instruction's opcode or distance), the instruction is a copy of a
+        2-byte block from the dictionary within a 1kB distance. It is worth
+        noting that this instruction provides little savings since it uses 2
+        bytes to encode a copy of 2 other bytes but it encodes the number of
+        following literals for free. It must be interpreted like this :
+
+           0 0 0 0 D D S S  (0..15)  : copy 2 bytes from <= 1kB distance
+           length = 2
+           state = S (copy S literals after this block)
+         Always followed by exactly one byte : H H H H H H H H
+           distance = (H << 2) + D + 1
+
+        If last instruction used to copy 4 or more literals (as detected by
+        state == 4), the instruction becomes a copy of a 3-byte block from the
+        dictionary from a 2..3kB distance, and must be interpreted like this :
+
+           0 0 0 0 D D S S  (0..15)  : copy 3 bytes from 2..3 kB distance
+           length = 3
+           state = S (copy S literals after this block)
+         Always followed by exactly one byte : H H H H H H H H
+           distance = (H << 2) + D + 2049
+
+      0 0 0 1 H L L L  (16..31)
+           Copy of a block within 16..48kB distance (preferably less than 10B)
+           length = 2 + (L ?: 7 + (zero_bytes * 255) + non_zero_byte)
+        Always followed by exactly one LE16 :  D D D D D D D D : D D D D D D S S
+           distance = 16384 + (H << 14) + D
+           state = S (copy S literals after this block)
+           End of stream is reached if distance == 16384
+           In version 1 only, to prevent ambiguity with the RLE case when
+           ((distance & 0x803f) == 0x803f) && (261 <= length <= 264), the
+           compressor must not emit block copies where distance and length
+           meet these conditions.
+
+        In version 1 only, this instruction is also used to encode a run of
+           zeros if distance = 0xbfff, i.e. H = 1 and the D bits are all 1.
+           In this case, it is followed by a fourth byte, X.
+           run length = ((X << 3) | (0 0 0 0 0 L L L)) + 4
+
+      0 0 1 L L L L L  (32..63)
+           Copy of small block within 16kB distance (preferably less than 34B)
+           length = 2 + (L ?: 31 + (zero_bytes * 255) + non_zero_byte)
+        Always followed by exactly one LE16 :  D D D D D D D D : D D D D D D S S
+           distance = D + 1
+           state = S (copy S literals after this block)
+
+      0 1 L D D D S S  (64..127)
+           Copy 3-4 bytes from block within 2kB distance
+           state = S (copy S literals after this block)
+           length = 3 + L
+         Always followed by exactly one byte : H H H H H H H H
+           distance = (H << 3) + D + 1
+
+      1 L L D D D S S  (128..255)
+           Copy 5-8 bytes from block within 2kB distance
+           state = S (copy S literals after this block)
+           length = 5 + L
+         Always followed by exactly one byte : H H H H H H H H
+           distance = (H << 3) + D + 1
+
+Authors
+=======
+
+  This document was written by Willy Tarreau <w@1wt.eu> on 2014/07/19 during an
+  analysis of the decompression code available in Linux 3.16-rc5, and updated
+  by Dave Rodgman <dave.rodgman@arm.com> on 2018/10/30 to introduce run-length
+  encoding. The code is tricky, it is possible that this document contains
+  mistakes or that a few corner cases were overlooked. In any case, please
+  report any doubt, fix, or proposed updates to the author(s) so that the
+  document can be updated.
diff --git a/Documentation/staging/remoteproc.rst b/Documentation/staging/remoteproc.rst
new file mode 100644
index 000000000000..9cccd3dd6a4b
--- /dev/null
+++ b/Documentation/staging/remoteproc.rst
@@ -0,0 +1,359 @@
+==========================
+Remote Processor Framework
+==========================
+
+Introduction
+============
+
+Modern SoCs typically have heterogeneous remote processor devices in asymmetric
+multiprocessing (AMP) configurations, which may be running different instances
+of operating system, whether it's Linux or any other flavor of real-time OS.
+
+OMAP4, for example, has dual Cortex-A9, dual Cortex-M3 and a C64x+ DSP.
+In a typical configuration, the dual cortex-A9 is running Linux in a SMP
+configuration, and each of the other three cores (two M3 cores and a DSP)
+is running its own instance of RTOS in an AMP configuration.
+
+The remoteproc framework allows different platforms/architectures to
+control (power on, load firmware, power off) those remote processors while
+abstracting the hardware differences, so the entire driver doesn't need to be
+duplicated. In addition, this framework also adds rpmsg virtio devices
+for remote processors that supports this kind of communication. This way,
+platform-specific remoteproc drivers only need to provide a few low-level
+handlers, and then all rpmsg drivers will then just work
+(for more information about the virtio-based rpmsg bus and its drivers,
+please read Documentation/staging/rpmsg.rst).
+Registration of other types of virtio devices is now also possible. Firmwares
+just need to publish what kind of virtio devices do they support, and then
+remoteproc will add those devices. This makes it possible to reuse the
+existing virtio drivers with remote processor backends at a minimal development
+cost.
+
+User API
+========
+
+::
+
+  int rproc_boot(struct rproc *rproc)
+
+Boot a remote processor (i.e. load its firmware, power it on, ...).
+
+If the remote processor is already powered on, this function immediately
+returns (successfully).
+
+Returns 0 on success, and an appropriate error value otherwise.
+Note: to use this function you should already have a valid rproc
+handle. There are several ways to achieve that cleanly (devres, pdata,
+the way remoteproc_rpmsg.c does this, or, if this becomes prevalent, we
+might also consider using dev_archdata for this).
+
+::
+
+  void rproc_shutdown(struct rproc *rproc)
+
+Power off a remote processor (previously booted with rproc_boot()).
+In case @rproc is still being used by an additional user(s), then
+this function will just decrement the power refcount and exit,
+without really powering off the device.
+
+Every call to rproc_boot() must (eventually) be accompanied by a call
+to rproc_shutdown(). Calling rproc_shutdown() redundantly is a bug.
+
+.. note::
+
+  we're not decrementing the rproc's refcount, only the power refcount.
+  which means that the @rproc handle stays valid even after
+  rproc_shutdown() returns, and users can still use it with a subsequent
+  rproc_boot(), if needed.
+
+::
+
+  struct rproc *rproc_get_by_phandle(phandle phandle)
+
+Find an rproc handle using a device tree phandle. Returns the rproc
+handle on success, and NULL on failure. This function increments
+the remote processor's refcount, so always use rproc_put() to
+decrement it back once rproc isn't needed anymore.
+
+Typical usage
+=============
+
+::
+
+  #include <linux/remoteproc.h>
+
+  /* in case we were given a valid 'rproc' handle */
+  int dummy_rproc_example(struct rproc *my_rproc)
+  {
+	int ret;
+
+	/* let's power on and boot our remote processor */
+	ret = rproc_boot(my_rproc);
+	if (ret) {
+		/*
+		 * something went wrong. handle it and leave.
+		 */
+	}
+
+	/*
+	 * our remote processor is now powered on... give it some work
+	 */
+
+	/* let's shut it down now */
+	rproc_shutdown(my_rproc);
+  }
+
+API for implementors
+====================
+
+::
+
+  struct rproc *rproc_alloc(struct device *dev, const char *name,
+				const struct rproc_ops *ops,
+				const char *firmware, int len)
+
+Allocate a new remote processor handle, but don't register
+it yet. Required parameters are the underlying device, the
+name of this remote processor, platform-specific ops handlers,
+the name of the firmware to boot this rproc with, and the
+length of private data needed by the allocating rproc driver (in bytes).
+
+This function should be used by rproc implementations during
+initialization of the remote processor.
+
+After creating an rproc handle using this function, and when ready,
+implementations should then call rproc_add() to complete
+the registration of the remote processor.
+
+On success, the new rproc is returned, and on failure, NULL.
+
+.. note::
+
+  **never** directly deallocate @rproc, even if it was not registered
+  yet. Instead, when you need to unroll rproc_alloc(), use rproc_free().
+
+::
+
+  void rproc_free(struct rproc *rproc)
+
+Free an rproc handle that was allocated by rproc_alloc.
+
+This function essentially unrolls rproc_alloc(), by decrementing the
+rproc's refcount. It doesn't directly free rproc; that would happen
+only if there are no other references to rproc and its refcount now
+dropped to zero.
+
+::
+
+  int rproc_add(struct rproc *rproc)
+
+Register @rproc with the remoteproc framework, after it has been
+allocated with rproc_alloc().
+
+This is called by the platform-specific rproc implementation, whenever
+a new remote processor device is probed.
+
+Returns 0 on success and an appropriate error code otherwise.
+Note: this function initiates an asynchronous firmware loading
+context, which will look for virtio devices supported by the rproc's
+firmware.
+
+If found, those virtio devices will be created and added, so as a result
+of registering this remote processor, additional virtio drivers might get
+probed.
+
+::
+
+  int rproc_del(struct rproc *rproc)
+
+Unroll rproc_add().
+
+This function should be called when the platform specific rproc
+implementation decides to remove the rproc device. it should
+_only_ be called if a previous invocation of rproc_add()
+has completed successfully.
+
+After rproc_del() returns, @rproc is still valid, and its
+last refcount should be decremented by calling rproc_free().
+
+Returns 0 on success and -EINVAL if @rproc isn't valid.
+
+::
+
+  void rproc_report_crash(struct rproc *rproc, enum rproc_crash_type type)
+
+Report a crash in a remoteproc
+
+This function must be called every time a crash is detected by the
+platform specific rproc implementation. This should not be called from a
+non-remoteproc driver. This function can be called from atomic/interrupt
+context.
+
+Implementation callbacks
+========================
+
+These callbacks should be provided by platform-specific remoteproc
+drivers::
+
+  /**
+   * struct rproc_ops - platform-specific device handlers
+   * @start:	power on the device and boot it
+   * @stop:	power off the device
+   * @kick:	kick a virtqueue (virtqueue id given as a parameter)
+   */
+  struct rproc_ops {
+	int (*start)(struct rproc *rproc);
+	int (*stop)(struct rproc *rproc);
+	void (*kick)(struct rproc *rproc, int vqid);
+  };
+
+Every remoteproc implementation should at least provide the ->start and ->stop
+handlers. If rpmsg/virtio functionality is also desired, then the ->kick handler
+should be provided as well.
+
+The ->start() handler takes an rproc handle and should then power on the
+device and boot it (use rproc->priv to access platform-specific private data).
+The boot address, in case needed, can be found in rproc->bootaddr (remoteproc
+core puts there the ELF entry point).
+On success, 0 should be returned, and on failure, an appropriate error code.
+
+The ->stop() handler takes an rproc handle and powers the device down.
+On success, 0 is returned, and on failure, an appropriate error code.
+
+The ->kick() handler takes an rproc handle, and an index of a virtqueue
+where new message was placed in. Implementations should interrupt the remote
+processor and let it know it has pending messages. Notifying remote processors
+the exact virtqueue index to look in is optional: it is easy (and not
+too expensive) to go through the existing virtqueues and look for new buffers
+in the used rings.
+
+Binary Firmware Structure
+=========================
+
+At this point remoteproc supports ELF32 and ELF64 firmware binaries. However,
+it is quite expected that other platforms/devices which we'd want to
+support with this framework will be based on different binary formats.
+
+When those use cases show up, we will have to decouple the binary format
+from the framework core, so we can support several binary formats without
+duplicating common code.
+
+When the firmware is parsed, its various segments are loaded to memory
+according to the specified device address (might be a physical address
+if the remote processor is accessing memory directly).
+
+In addition to the standard ELF segments, most remote processors would
+also include a special section which we call "the resource table".
+
+The resource table contains system resources that the remote processor
+requires before it should be powered on, such as allocation of physically
+contiguous memory, or iommu mapping of certain on-chip peripherals.
+Remotecore will only power up the device after all the resource table's
+requirement are met.
+
+In addition to system resources, the resource table may also contain
+resource entries that publish the existence of supported features
+or configurations by the remote processor, such as trace buffers and
+supported virtio devices (and their configurations).
+
+The resource table begins with this header::
+
+  /**
+   * struct resource_table - firmware resource table header
+   * @ver: version number
+   * @num: number of resource entries
+   * @reserved: reserved (must be zero)
+   * @offset: array of offsets pointing at the various resource entries
+   *
+   * The header of the resource table, as expressed by this structure,
+   * contains a version number (should we need to change this format in the
+   * future), the number of available resource entries, and their offsets
+   * in the table.
+   */
+  struct resource_table {
+	u32 ver;
+	u32 num;
+	u32 reserved[2];
+	u32 offset[0];
+  } __packed;
+
+Immediately following this header are the resource entries themselves,
+each of which begins with the following resource entry header::
+
+  /**
+   * struct fw_rsc_hdr - firmware resource entry header
+   * @type: resource type
+   * @data: resource data
+   *
+   * Every resource entry begins with a 'struct fw_rsc_hdr' header providing
+   * its @type. The content of the entry itself will immediately follow
+   * this header, and it should be parsed according to the resource type.
+   */
+  struct fw_rsc_hdr {
+	u32 type;
+	u8 data[0];
+  } __packed;
+
+Some resources entries are mere announcements, where the host is informed
+of specific remoteproc configuration. Other entries require the host to
+do something (e.g. allocate a system resource). Sometimes a negotiation
+is expected, where the firmware requests a resource, and once allocated,
+the host should provide back its details (e.g. address of an allocated
+memory region).
+
+Here are the various resource types that are currently supported::
+
+  /**
+   * enum fw_resource_type - types of resource entries
+   *
+   * @RSC_CARVEOUT:   request for allocation of a physically contiguous
+   *		    memory region.
+   * @RSC_DEVMEM:     request to iommu_map a memory-based peripheral.
+   * @RSC_TRACE:	    announces the availability of a trace buffer into which
+   *		    the remote processor will be writing logs.
+   * @RSC_VDEV:       declare support for a virtio device, and serve as its
+   *		    virtio header.
+   * @RSC_LAST:       just keep this one at the end
+   * @RSC_VENDOR_START:	start of the vendor specific resource types range
+   * @RSC_VENDOR_END:	end of the vendor specific resource types range
+   *
+   * Please note that these values are used as indices to the rproc_handle_rsc
+   * lookup table, so please keep them sane. Moreover, @RSC_LAST is used to
+   * check the validity of an index before the lookup table is accessed, so
+   * please update it as needed.
+   */
+  enum fw_resource_type {
+	RSC_CARVEOUT		= 0,
+	RSC_DEVMEM		= 1,
+	RSC_TRACE		= 2,
+	RSC_VDEV		= 3,
+	RSC_LAST		= 4,
+	RSC_VENDOR_START	= 128,
+	RSC_VENDOR_END		= 512,
+  };
+
+For more details regarding a specific resource type, please see its
+dedicated structure in include/linux/remoteproc.h.
+
+We also expect that platform-specific resource entries will show up
+at some point. When that happens, we could easily add a new RSC_PLATFORM
+type, and hand those resources to the platform-specific rproc driver to handle.
+
+Virtio and remoteproc
+=====================
+
+The firmware should provide remoteproc information about virtio devices
+that it supports, and their configurations: a RSC_VDEV resource entry
+should specify the virtio device id (as in virtio_ids.h), virtio features,
+virtio config space, vrings information, etc.
+
+When a new remote processor is registered, the remoteproc framework
+will look for its resource table and will register the virtio devices
+it supports. A firmware may support any number of virtio devices, and
+of any type (a single remote processor can also easily support several
+rpmsg virtio devices this way, if desired).
+
+Of course, RSC_VDEV resource entries are only good enough for static
+allocation of virtio devices. Dynamic allocations will also be made possible
+using the rpmsg bus (similar to how we already do dynamic allocations of
+rpmsg channels; read more about it in rpmsg.txt).
diff --git a/Documentation/staging/rpmsg.rst b/Documentation/staging/rpmsg.rst
new file mode 100644
index 000000000000..24b7a9e1a5f9
--- /dev/null
+++ b/Documentation/staging/rpmsg.rst
@@ -0,0 +1,341 @@
+============================================
+Remote Processor Messaging (rpmsg) Framework
+============================================
+
+.. note::
+
+  This document describes the rpmsg bus and how to write rpmsg drivers.
+  To learn how to add rpmsg support for new platforms, check out remoteproc.txt
+  (also a resident of Documentation/).
+
+Introduction
+============
+
+Modern SoCs typically employ heterogeneous remote processor devices in
+asymmetric multiprocessing (AMP) configurations, which may be running
+different instances of operating system, whether it's Linux or any other
+flavor of real-time OS.
+
+OMAP4, for example, has dual Cortex-A9, dual Cortex-M3 and a C64x+ DSP.
+Typically, the dual cortex-A9 is running Linux in a SMP configuration,
+and each of the other three cores (two M3 cores and a DSP) is running
+its own instance of RTOS in an AMP configuration.
+
+Typically AMP remote processors employ dedicated DSP codecs and multimedia
+hardware accelerators, and therefore are often used to offload CPU-intensive
+multimedia tasks from the main application processor.
+
+These remote processors could also be used to control latency-sensitive
+sensors, drive random hardware blocks, or just perform background tasks
+while the main CPU is idling.
+
+Users of those remote processors can either be userland apps (e.g. multimedia
+frameworks talking with remote OMX components) or kernel drivers (controlling
+hardware accessible only by the remote processor, reserving kernel-controlled
+resources on behalf of the remote processor, etc..).
+
+Rpmsg is a virtio-based messaging bus that allows kernel drivers to communicate
+with remote processors available on the system. In turn, drivers could then
+expose appropriate user space interfaces, if needed.
+
+When writing a driver that exposes rpmsg communication to userland, please
+keep in mind that remote processors might have direct access to the
+system's physical memory and other sensitive hardware resources (e.g. on
+OMAP4, remote cores and hardware accelerators may have direct access to the
+physical memory, gpio banks, dma controllers, i2c bus, gptimers, mailbox
+devices, hwspinlocks, etc..). Moreover, those remote processors might be
+running RTOS where every task can access the entire memory/devices exposed
+to the processor. To minimize the risks of rogue (or buggy) userland code
+exploiting remote bugs, and by that taking over the system, it is often
+desired to limit userland to specific rpmsg channels (see definition below)
+it can send messages on, and if possible, minimize how much control
+it has over the content of the messages.
+
+Every rpmsg device is a communication channel with a remote processor (thus
+rpmsg devices are called channels). Channels are identified by a textual name
+and have a local ("source") rpmsg address, and remote ("destination") rpmsg
+address.
+
+When a driver starts listening on a channel, its rx callback is bound with
+a unique rpmsg local address (a 32-bit integer). This way when inbound messages
+arrive, the rpmsg core dispatches them to the appropriate driver according
+to their destination address (this is done by invoking the driver's rx handler
+with the payload of the inbound message).
+
+
+User API
+========
+
+::
+
+  int rpmsg_send(struct rpmsg_channel *rpdev, void *data, int len);
+
+sends a message across to the remote processor on a given channel.
+The caller should specify the channel, the data it wants to send,
+and its length (in bytes). The message will be sent on the specified
+channel, i.e. its source and destination address fields will be
+set to the channel's src and dst addresses.
+
+In case there are no TX buffers available, the function will block until
+one becomes available (i.e. until the remote processor consumes
+a tx buffer and puts it back on virtio's used descriptor ring),
+or a timeout of 15 seconds elapses. When the latter happens,
+-ERESTARTSYS is returned.
+
+The function can only be called from a process context (for now).
+Returns 0 on success and an appropriate error value on failure.
+
+::
+
+  int rpmsg_sendto(struct rpmsg_channel *rpdev, void *data, int len, u32 dst);
+
+sends a message across to the remote processor on a given channel,
+to a destination address provided by the caller.
+
+The caller should specify the channel, the data it wants to send,
+its length (in bytes), and an explicit destination address.
+
+The message will then be sent to the remote processor to which the
+channel belongs, using the channel's src address, and the user-provided
+dst address (thus the channel's dst address will be ignored).
+
+In case there are no TX buffers available, the function will block until
+one becomes available (i.e. until the remote processor consumes
+a tx buffer and puts it back on virtio's used descriptor ring),
+or a timeout of 15 seconds elapses. When the latter happens,
+-ERESTARTSYS is returned.
+
+The function can only be called from a process context (for now).
+Returns 0 on success and an appropriate error value on failure.
+
+::
+
+  int rpmsg_send_offchannel(struct rpmsg_channel *rpdev, u32 src, u32 dst,
+							void *data, int len);
+
+
+sends a message across to the remote processor, using the src and dst
+addresses provided by the user.
+
+The caller should specify the channel, the data it wants to send,
+its length (in bytes), and explicit source and destination addresses.
+The message will then be sent to the remote processor to which the
+channel belongs, but the channel's src and dst addresses will be
+ignored (and the user-provided addresses will be used instead).
+
+In case there are no TX buffers available, the function will block until
+one becomes available (i.e. until the remote processor consumes
+a tx buffer and puts it back on virtio's used descriptor ring),
+or a timeout of 15 seconds elapses. When the latter happens,
+-ERESTARTSYS is returned.
+
+The function can only be called from a process context (for now).
+Returns 0 on success and an appropriate error value on failure.
+
+::
+
+  int rpmsg_trysend(struct rpmsg_channel *rpdev, void *data, int len);
+
+sends a message across to the remote processor on a given channel.
+The caller should specify the channel, the data it wants to send,
+and its length (in bytes). The message will be sent on the specified
+channel, i.e. its source and destination address fields will be
+set to the channel's src and dst addresses.
+
+In case there are no TX buffers available, the function will immediately
+return -ENOMEM without waiting until one becomes available.
+
+The function can only be called from a process context (for now).
+Returns 0 on success and an appropriate error value on failure.
+
+::
+
+  int rpmsg_trysendto(struct rpmsg_channel *rpdev, void *data, int len, u32 dst)
+
+
+sends a message across to the remote processor on a given channel,
+to a destination address provided by the user.
+
+The user should specify the channel, the data it wants to send,
+its length (in bytes), and an explicit destination address.
+
+The message will then be sent to the remote processor to which the
+channel belongs, using the channel's src address, and the user-provided
+dst address (thus the channel's dst address will be ignored).
+
+In case there are no TX buffers available, the function will immediately
+return -ENOMEM without waiting until one becomes available.
+
+The function can only be called from a process context (for now).
+Returns 0 on success and an appropriate error value on failure.
+
+::
+
+  int rpmsg_trysend_offchannel(struct rpmsg_channel *rpdev, u32 src, u32 dst,
+							void *data, int len);
+
+
+sends a message across to the remote processor, using source and
+destination addresses provided by the user.
+
+The user should specify the channel, the data it wants to send,
+its length (in bytes), and explicit source and destination addresses.
+The message will then be sent to the remote processor to which the
+channel belongs, but the channel's src and dst addresses will be
+ignored (and the user-provided addresses will be used instead).
+
+In case there are no TX buffers available, the function will immediately
+return -ENOMEM without waiting until one becomes available.
+
+The function can only be called from a process context (for now).
+Returns 0 on success and an appropriate error value on failure.
+
+::
+
+  struct rpmsg_endpoint *rpmsg_create_ept(struct rpmsg_channel *rpdev,
+		void (*cb)(struct rpmsg_channel *, void *, int, void *, u32),
+		void *priv, u32 addr);
+
+every rpmsg address in the system is bound to an rx callback (so when
+inbound messages arrive, they are dispatched by the rpmsg bus using the
+appropriate callback handler) by means of an rpmsg_endpoint struct.
+
+This function allows drivers to create such an endpoint, and by that,
+bind a callback, and possibly some private data too, to an rpmsg address
+(either one that is known in advance, or one that will be dynamically
+assigned for them).
+
+Simple rpmsg drivers need not call rpmsg_create_ept, because an endpoint
+is already created for them when they are probed by the rpmsg bus
+(using the rx callback they provide when they registered to the rpmsg bus).
+
+So things should just work for simple drivers: they already have an
+endpoint, their rx callback is bound to their rpmsg address, and when
+relevant inbound messages arrive (i.e. messages which their dst address
+equals to the src address of their rpmsg channel), the driver's handler
+is invoked to process it.
+
+That said, more complicated drivers might do need to allocate
+additional rpmsg addresses, and bind them to different rx callbacks.
+To accomplish that, those drivers need to call this function.
+Drivers should provide their channel (so the new endpoint would bind
+to the same remote processor their channel belongs to), an rx callback
+function, an optional private data (which is provided back when the
+rx callback is invoked), and an address they want to bind with the
+callback. If addr is RPMSG_ADDR_ANY, then rpmsg_create_ept will
+dynamically assign them an available rpmsg address (drivers should have
+a very good reason why not to always use RPMSG_ADDR_ANY here).
+
+Returns a pointer to the endpoint on success, or NULL on error.
+
+::
+
+  void rpmsg_destroy_ept(struct rpmsg_endpoint *ept);
+
+
+destroys an existing rpmsg endpoint. user should provide a pointer
+to an rpmsg endpoint that was previously created with rpmsg_create_ept().
+
+::
+
+  int register_rpmsg_driver(struct rpmsg_driver *rpdrv);
+
+
+registers an rpmsg driver with the rpmsg bus. user should provide
+a pointer to an rpmsg_driver struct, which contains the driver's
+->probe() and ->remove() functions, an rx callback, and an id_table
+specifying the names of the channels this driver is interested to
+be probed with.
+
+::
+
+  void unregister_rpmsg_driver(struct rpmsg_driver *rpdrv);
+
+
+unregisters an rpmsg driver from the rpmsg bus. user should provide
+a pointer to a previously-registered rpmsg_driver struct.
+Returns 0 on success, and an appropriate error value on failure.
+
+
+Typical usage
+=============
+
+The following is a simple rpmsg driver, that sends an "hello!" message
+on probe(), and whenever it receives an incoming message, it dumps its
+content to the console.
+
+::
+
+  #include <linux/kernel.h>
+  #include <linux/module.h>
+  #include <linux/rpmsg.h>
+
+  static void rpmsg_sample_cb(struct rpmsg_channel *rpdev, void *data, int len,
+						void *priv, u32 src)
+  {
+	print_hex_dump(KERN_INFO, "incoming message:", DUMP_PREFIX_NONE,
+						16, 1, data, len, true);
+  }
+
+  static int rpmsg_sample_probe(struct rpmsg_channel *rpdev)
+  {
+	int err;
+
+	dev_info(&rpdev->dev, "chnl: 0x%x -> 0x%x\n", rpdev->src, rpdev->dst);
+
+	/* send a message on our channel */
+	err = rpmsg_send(rpdev, "hello!", 6);
+	if (err) {
+		pr_err("rpmsg_send failed: %d\n", err);
+		return err;
+	}
+
+	return 0;
+  }
+
+  static void rpmsg_sample_remove(struct rpmsg_channel *rpdev)
+  {
+	dev_info(&rpdev->dev, "rpmsg sample client driver is removed\n");
+  }
+
+  static struct rpmsg_device_id rpmsg_driver_sample_id_table[] = {
+	{ .name	= "rpmsg-client-sample" },
+	{ },
+  };
+  MODULE_DEVICE_TABLE(rpmsg, rpmsg_driver_sample_id_table);
+
+  static struct rpmsg_driver rpmsg_sample_client = {
+	.drv.name	= KBUILD_MODNAME,
+	.id_table	= rpmsg_driver_sample_id_table,
+	.probe		= rpmsg_sample_probe,
+	.callback	= rpmsg_sample_cb,
+	.remove		= rpmsg_sample_remove,
+  };
+  module_rpmsg_driver(rpmsg_sample_client);
+
+.. note::
+
+   a similar sample which can be built and loaded can be found
+   in samples/rpmsg/.
+
+Allocations of rpmsg channels
+=============================
+
+At this point we only support dynamic allocations of rpmsg channels.
+
+This is possible only with remote processors that have the VIRTIO_RPMSG_F_NS
+virtio device feature set. This feature bit means that the remote
+processor supports dynamic name service announcement messages.
+
+When this feature is enabled, creation of rpmsg devices (i.e. channels)
+is completely dynamic: the remote processor announces the existence of a
+remote rpmsg service by sending a name service message (which contains
+the name and rpmsg addr of the remote service, see struct rpmsg_ns_msg).
+
+This message is then handled by the rpmsg bus, which in turn dynamically
+creates and registers an rpmsg channel (which represents the remote service).
+If/when a relevant rpmsg driver is registered, it will be immediately probed
+by the bus, and can then start sending messages to the remote service.
+
+The plan is also to add static creation of rpmsg channels via the virtio
+config space, but it's not implemented yet.
diff --git a/Documentation/staging/speculation.rst b/Documentation/staging/speculation.rst
new file mode 100644
index 000000000000..8045d99bcf12
--- /dev/null
+++ b/Documentation/staging/speculation.rst
@@ -0,0 +1,92 @@
+===========
+Speculation
+===========
+
+This document explains potential effects of speculation, and how undesirable
+effects can be mitigated portably using common APIs.
+
+------------------------------------------------------------------------------
+
+To improve performance and minimize average latencies, many contemporary CPUs
+employ speculative execution techniques such as branch prediction, performing
+work which may be discarded at a later stage.
+
+Typically speculative execution cannot be observed from architectural state,
+such as the contents of registers. However, in some cases it is possible to
+observe its impact on microarchitectural state, such as the presence or
+absence of data in caches. Such state may form side-channels which can be
+observed to extract secret information.
+
+For example, in the presence of branch prediction, it is possible for bounds
+checks to be ignored by code which is speculatively executed. Consider the
+following code::
+
+	int load_array(int *array, unsigned int index)
+	{
+		if (index >= MAX_ARRAY_ELEMS)
+			return 0;
+		else
+			return array[index];
+	}
+
+Which, on arm64, may be compiled to an assembly sequence such as::
+
+	CMP	<index>, #MAX_ARRAY_ELEMS
+	B.LT	less
+	MOV	<returnval>, #0
+	RET
+  less:
+	LDR	<returnval>, [<array>, <index>]
+	RET
+
+It is possible that a CPU mis-predicts the conditional branch, and
+speculatively loads array[index], even if index >= MAX_ARRAY_ELEMS. This
+value will subsequently be discarded, but the speculated load may affect
+microarchitectural state which can be subsequently measured.
+
+More complex sequences involving multiple dependent memory accesses may
+result in sensitive information being leaked. Consider the following
+code, building on the prior example::
+
+	int load_dependent_arrays(int *arr1, int *arr2, int index)
+	{
+		int val1, val2,
+
+		val1 = load_array(arr1, index);
+		val2 = load_array(arr2, val1);
+
+		return val2;
+	}
+
+Under speculation, the first call to load_array() may return the value
+of an out-of-bounds address, while the second call will influence
+microarchitectural state dependent on this value. This may provide an
+arbitrary read primitive.
+
+====================================
+Mitigating speculation side-channels
+====================================
+
+The kernel provides a generic API to ensure that bounds checks are
+respected even under speculation. Architectures which are affected by
+speculation-based side-channels are expected to implement these
+primitives.
+
+The array_index_nospec() helper in <linux/nospec.h> can be used to
+prevent information from being leaked via side-channels.
+
+A call to array_index_nospec(index, size) returns a sanitized index
+value that is bounded to [0, size) even under cpu speculation
+conditions.
+
+This can be used to protect the earlier load_array() example::
+
+	int load_array(int *array, unsigned int index)
+	{
+		if (index >= MAX_ARRAY_ELEMS)
+			return 0;
+		else {
+			index = array_index_nospec(index, MAX_ARRAY_ELEMS);
+			return array[index];
+		}
+	}
diff --git a/Documentation/staging/static-keys.rst b/Documentation/staging/static-keys.rst
new file mode 100644
index 000000000000..38290b9f25eb
--- /dev/null
+++ b/Documentation/staging/static-keys.rst
@@ -0,0 +1,331 @@
+===========
+Static Keys
+===========
+
+.. warning::
+
+   DEPRECATED API:
+
+   The use of 'struct static_key' directly, is now DEPRECATED. In addition
+   static_key_{true,false}() is also DEPRECATED. IE DO NOT use the following::
+
+	struct static_key false = STATIC_KEY_INIT_FALSE;
+	struct static_key true = STATIC_KEY_INIT_TRUE;
+	static_key_true()
+	static_key_false()
+
+   The updated API replacements are::
+
+	DEFINE_STATIC_KEY_TRUE(key);
+	DEFINE_STATIC_KEY_FALSE(key);
+	DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count);
+	DEFINE_STATIC_KEY_ARRAY_FALSE(keys, count);
+	static_branch_likely()
+	static_branch_unlikely()
+
+Abstract
+========
+
+Static keys allows the inclusion of seldom used features in
+performance-sensitive fast-path kernel code, via a GCC feature and a code
+patching technique. A quick example::
+
+	DEFINE_STATIC_KEY_FALSE(key);
+
+	...
+
+        if (static_branch_unlikely(&key))
+                do unlikely code
+        else
+                do likely code
+
+	...
+	static_branch_enable(&key);
+	...
+	static_branch_disable(&key);
+	...
+
+The static_branch_unlikely() branch will be generated into the code with as little
+impact to the likely code path as possible.
+
+
+Motivation
+==========
+
+
+Currently, tracepoints are implemented using a conditional branch. The
+conditional check requires checking a global variable for each tracepoint.
+Although the overhead of this check is small, it increases when the memory
+cache comes under pressure (memory cache lines for these global variables may
+be shared with other memory accesses). As we increase the number of tracepoints
+in the kernel this overhead may become more of an issue. In addition,
+tracepoints are often dormant (disabled) and provide no direct kernel
+functionality. Thus, it is highly desirable to reduce their impact as much as
+possible. Although tracepoints are the original motivation for this work, other
+kernel code paths should be able to make use of the static keys facility.
+
+
+Solution
+========
+
+
+gcc (v4.5) adds a new 'asm goto' statement that allows branching to a label:
+
+https://gcc.gnu.org/ml/gcc-patches/2009-07/msg01556.html
+
+Using the 'asm goto', we can create branches that are either taken or not taken
+by default, without the need to check memory. Then, at run-time, we can patch
+the branch site to change the branch direction.
+
+For example, if we have a simple branch that is disabled by default::
+
+	if (static_branch_unlikely(&key))
+		printk("I am the true branch\n");
+
+Thus, by default the 'printk' will not be emitted. And the code generated will
+consist of a single atomic 'no-op' instruction (5 bytes on x86), in the
+straight-line code path. When the branch is 'flipped', we will patch the
+'no-op' in the straight-line codepath with a 'jump' instruction to the
+out-of-line true branch. Thus, changing branch direction is expensive but
+branch selection is basically 'free'. That is the basic tradeoff of this
+optimization.
+
+This lowlevel patching mechanism is called 'jump label patching', and it gives
+the basis for the static keys facility.
+
+Static key label API, usage and examples
+========================================
+
+
+In order to make use of this optimization you must first define a key::
+
+	DEFINE_STATIC_KEY_TRUE(key);
+
+or::
+
+	DEFINE_STATIC_KEY_FALSE(key);
+
+
+The key must be global, that is, it can't be allocated on the stack or dynamically
+allocated at run-time.
+
+The key is then used in code as::
+
+        if (static_branch_unlikely(&key))
+                do unlikely code
+        else
+                do likely code
+
+Or::
+
+        if (static_branch_likely(&key))
+                do likely code
+        else
+                do unlikely code
+
+Keys defined via DEFINE_STATIC_KEY_TRUE(), or DEFINE_STATIC_KEY_FALSE, may
+be used in either static_branch_likely() or static_branch_unlikely()
+statements.
+
+Branch(es) can be set true via::
+
+	static_branch_enable(&key);
+
+or false via::
+
+	static_branch_disable(&key);
+
+The branch(es) can then be switched via reference counts::
+
+	static_branch_inc(&key);
+	...
+	static_branch_dec(&key);
+
+Thus, 'static_branch_inc()' means 'make the branch true', and
+'static_branch_dec()' means 'make the branch false' with appropriate
+reference counting. For example, if the key is initialized true, a
+static_branch_dec(), will switch the branch to false. And a subsequent
+static_branch_inc(), will change the branch back to true. Likewise, if the
+key is initialized false, a 'static_branch_inc()', will change the branch to
+true. And then a 'static_branch_dec()', will again make the branch false.
+
+The state and the reference count can be retrieved with 'static_key_enabled()'
+and 'static_key_count()'.  In general, if you use these functions, they
+should be protected with the same mutex used around the enable/disable
+or increment/decrement function.
+
+Note that switching branches results in some locks being taken,
+particularly the CPU hotplug lock (in order to avoid races against
+CPUs being brought in the kernel while the kernel is getting
+patched). Calling the static key API from within a hotplug notifier is
+thus a sure deadlock recipe. In order to still allow use of the
+functionality, the following functions are provided:
+
+	static_key_enable_cpuslocked()
+	static_key_disable_cpuslocked()
+	static_branch_enable_cpuslocked()
+	static_branch_disable_cpuslocked()
+
+These functions are *not* general purpose, and must only be used when
+you really know that you're in the above context, and no other.
+
+Where an array of keys is required, it can be defined as::
+
+	DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count);
+
+or::
+
+	DEFINE_STATIC_KEY_ARRAY_FALSE(keys, count);
+
+4) Architecture level code patching interface, 'jump labels'
+
+
+There are a few functions and macros that architectures must implement in order
+to take advantage of this optimization. If there is no architecture support, we
+simply fall back to a traditional, load, test, and jump sequence. Also, the
+struct jump_entry table must be at least 4-byte aligned because the
+static_key->entry field makes use of the two least significant bits.
+
+* ``select HAVE_ARCH_JUMP_LABEL``,
+    see: arch/x86/Kconfig
+
+* ``#define JUMP_LABEL_NOP_SIZE``,
+    see: arch/x86/include/asm/jump_label.h
+
+* ``__always_inline bool arch_static_branch(struct static_key *key, bool branch)``,
+    see: arch/x86/include/asm/jump_label.h
+
+* ``__always_inline bool arch_static_branch_jump(struct static_key *key, bool branch)``,
+    see: arch/x86/include/asm/jump_label.h
+
+* ``void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type)``,
+    see: arch/x86/kernel/jump_label.c
+
+* ``__init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type)``,
+    see: arch/x86/kernel/jump_label.c
+
+* ``struct jump_entry``,
+    see: arch/x86/include/asm/jump_label.h
+
+
+5) Static keys / jump label analysis, results (x86_64):
+
+
+As an example, let's add the following branch to 'getppid()', such that the
+system call now looks like::
+
+  SYSCALL_DEFINE0(getppid)
+  {
+        int pid;
+
+  +     if (static_branch_unlikely(&key))
+  +             printk("I am the true branch\n");
+
+        rcu_read_lock();
+        pid = task_tgid_vnr(rcu_dereference(current->real_parent));
+        rcu_read_unlock();
+
+        return pid;
+  }
+
+The resulting instructions with jump labels generated by GCC is::
+
+  ffffffff81044290 <sys_getppid>:
+  ffffffff81044290:       55                      push   %rbp
+  ffffffff81044291:       48 89 e5                mov    %rsp,%rbp
+  ffffffff81044294:       e9 00 00 00 00          jmpq   ffffffff81044299 <sys_getppid+0x9>
+  ffffffff81044299:       65 48 8b 04 25 c0 b6    mov    %gs:0xb6c0,%rax
+  ffffffff810442a0:       00 00
+  ffffffff810442a2:       48 8b 80 80 02 00 00    mov    0x280(%rax),%rax
+  ffffffff810442a9:       48 8b 80 b0 02 00 00    mov    0x2b0(%rax),%rax
+  ffffffff810442b0:       48 8b b8 e8 02 00 00    mov    0x2e8(%rax),%rdi
+  ffffffff810442b7:       e8 f4 d9 00 00          callq  ffffffff81051cb0 <pid_vnr>
+  ffffffff810442bc:       5d                      pop    %rbp
+  ffffffff810442bd:       48 98                   cltq
+  ffffffff810442bf:       c3                      retq
+  ffffffff810442c0:       48 c7 c7 e3 54 98 81    mov    $0xffffffff819854e3,%rdi
+  ffffffff810442c7:       31 c0                   xor    %eax,%eax
+  ffffffff810442c9:       e8 71 13 6d 00          callq  ffffffff8171563f <printk>
+  ffffffff810442ce:       eb c9                   jmp    ffffffff81044299 <sys_getppid+0x9>
+
+Without the jump label optimization it looks like::
+
+  ffffffff810441f0 <sys_getppid>:
+  ffffffff810441f0:       8b 05 8a 52 d8 00       mov    0xd8528a(%rip),%eax        # ffffffff81dc9480 <key>
+  ffffffff810441f6:       55                      push   %rbp
+  ffffffff810441f7:       48 89 e5                mov    %rsp,%rbp
+  ffffffff810441fa:       85 c0                   test   %eax,%eax
+  ffffffff810441fc:       75 27                   jne    ffffffff81044225 <sys_getppid+0x35>
+  ffffffff810441fe:       65 48 8b 04 25 c0 b6    mov    %gs:0xb6c0,%rax
+  ffffffff81044205:       00 00
+  ffffffff81044207:       48 8b 80 80 02 00 00    mov    0x280(%rax),%rax
+  ffffffff8104420e:       48 8b 80 b0 02 00 00    mov    0x2b0(%rax),%rax
+  ffffffff81044215:       48 8b b8 e8 02 00 00    mov    0x2e8(%rax),%rdi
+  ffffffff8104421c:       e8 2f da 00 00          callq  ffffffff81051c50 <pid_vnr>
+  ffffffff81044221:       5d                      pop    %rbp
+  ffffffff81044222:       48 98                   cltq
+  ffffffff81044224:       c3                      retq
+  ffffffff81044225:       48 c7 c7 13 53 98 81    mov    $0xffffffff81985313,%rdi
+  ffffffff8104422c:       31 c0                   xor    %eax,%eax
+  ffffffff8104422e:       e8 60 0f 6d 00          callq  ffffffff81715193 <printk>
+  ffffffff81044233:       eb c9                   jmp    ffffffff810441fe <sys_getppid+0xe>
+  ffffffff81044235:       66 66 2e 0f 1f 84 00    data32 nopw %cs:0x0(%rax,%rax,1)
+  ffffffff8104423c:       00 00 00 00
+
+Thus, the disable jump label case adds a 'mov', 'test' and 'jne' instruction
+vs. the jump label case just has a 'no-op' or 'jmp 0'. (The jmp 0, is patched
+to a 5 byte atomic no-op instruction at boot-time.) Thus, the disabled jump
+label case adds::
+
+  6 (mov) + 2 (test) + 2 (jne) = 10 - 5 (5 byte jump 0) = 5 addition bytes.
+
+If we then include the padding bytes, the jump label code saves, 16 total bytes
+of instruction memory for this small function. In this case the non-jump label
+function is 80 bytes long. Thus, we have saved 20% of the instruction
+footprint. We can in fact improve this even further, since the 5-byte no-op
+really can be a 2-byte no-op since we can reach the branch with a 2-byte jmp.
+However, we have not yet implemented optimal no-op sizes (they are currently
+hard-coded).
+
+Since there are a number of static key API uses in the scheduler paths,
+'pipe-test' (also known as 'perf bench sched pipe') can be used to show the
+performance improvement. Testing done on 3.3.0-rc2:
+
+jump label disabled::
+
+ Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):
+
+        855.700314 task-clock                #    0.534 CPUs utilized            ( +-  0.11% )
+           200,003 context-switches          #    0.234 M/sec                    ( +-  0.00% )
+                 0 CPU-migrations            #    0.000 M/sec                    ( +- 39.58% )
+               487 page-faults               #    0.001 M/sec                    ( +-  0.02% )
+     1,474,374,262 cycles                    #    1.723 GHz                      ( +-  0.17% )
+   <not supported> stalled-cycles-frontend
+   <not supported> stalled-cycles-backend
+     1,178,049,567 instructions              #    0.80  insns per cycle          ( +-  0.06% )
+       208,368,926 branches                  #  243.507 M/sec                    ( +-  0.06% )
+         5,569,188 branch-misses             #    2.67% of all branches          ( +-  0.54% )
+
+       1.601607384 seconds time elapsed                                          ( +-  0.07% )
+
+jump label enabled::
+
+ Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):
+
+        841.043185 task-clock                #    0.533 CPUs utilized            ( +-  0.12% )
+           200,004 context-switches          #    0.238 M/sec                    ( +-  0.00% )
+                 0 CPU-migrations            #    0.000 M/sec                    ( +- 40.87% )
+               487 page-faults               #    0.001 M/sec                    ( +-  0.05% )
+     1,432,559,428 cycles                    #    1.703 GHz                      ( +-  0.18% )
+   <not supported> stalled-cycles-frontend
+   <not supported> stalled-cycles-backend
+     1,175,363,994 instructions              #    0.82  insns per cycle          ( +-  0.04% )
+       206,859,359 branches                  #  245.956 M/sec                    ( +-  0.04% )
+         4,884,119 branch-misses             #    2.36% of all branches          ( +-  0.85% )
+
+       1.579384366 seconds time elapsed
+
+The percentage of saved branches is .7%, and we've saved 12% on
+'branch-misses'. This is where we would expect to get the most savings, since
+this optimization is about reducing the number of branches. In addition, we've
+saved .2% on instructions, and 2.8% on cycles and 1.4% on elapsed time.
diff --git a/Documentation/staging/tee.rst b/Documentation/staging/tee.rst
new file mode 100644
index 000000000000..62e8ba64d04f
--- /dev/null
+++ b/Documentation/staging/tee.rst
@@ -0,0 +1,277 @@
+=============
+TEE subsystem
+=============
+
+This document describes the TEE subsystem in Linux.
+
+A TEE (Trusted Execution Environment) is a trusted OS running in some
+secure environment, for example, TrustZone on ARM CPUs, or a separate
+secure co-processor etc. A TEE driver handles the details needed to
+communicate with the TEE.
+
+This subsystem deals with:
+
+- Registration of TEE drivers
+
+- Managing shared memory between Linux and the TEE
+
+- Providing a generic API to the TEE
+
+The TEE interface
+=================
+
+include/uapi/linux/tee.h defines the generic interface to a TEE.
+
+User space (the client) connects to the driver by opening /dev/tee[0-9]* or
+/dev/teepriv[0-9]*.
+
+- TEE_IOC_SHM_ALLOC allocates shared memory and returns a file descriptor
+  which user space can mmap. When user space doesn't need the file
+  descriptor any more, it should be closed. When shared memory isn't needed
+  any longer it should be unmapped with munmap() to allow the reuse of
+  memory.
+
+- TEE_IOC_VERSION lets user space know which TEE this driver handles and
+  its capabilities.
+
+- TEE_IOC_OPEN_SESSION opens a new session to a Trusted Application.
+
+- TEE_IOC_INVOKE invokes a function in a Trusted Application.
+
+- TEE_IOC_CANCEL may cancel an ongoing TEE_IOC_OPEN_SESSION or TEE_IOC_INVOKE.
+
+- TEE_IOC_CLOSE_SESSION closes a session to a Trusted Application.
+
+There are two classes of clients, normal clients and supplicants. The latter is
+a helper process for the TEE to access resources in Linux, for example file
+system access. A normal client opens /dev/tee[0-9]* and a supplicant opens
+/dev/teepriv[0-9].
+
+Much of the communication between clients and the TEE is opaque to the
+driver. The main job for the driver is to receive requests from the
+clients, forward them to the TEE and send back the results. In the case of
+supplicants the communication goes in the other direction, the TEE sends
+requests to the supplicant which then sends back the result.
+
+The TEE kernel interface
+========================
+
+Kernel provides a TEE bus infrastructure where a Trusted Application is
+represented as a device identified via Universally Unique Identifier (UUID) and
+client drivers register a table of supported device UUIDs.
+
+TEE bus infrastructure registers following APIs:
+-  match(): iterates over the client driver UUID table to find a corresponding
+   match for device UUID. If a match is found, then this particular device is
+   probed via corresponding probe API registered by the client driver. This
+   process happens whenever a device or a client driver is registered with TEE
+   bus.
+-  uevent(): notifies user-space (udev) whenever a new device is registered on
+   TEE bus for auto-loading of modularized client drivers.
+
+TEE bus device enumeration is specific to underlying TEE implementation, so it
+is left open for TEE drivers to provide corresponding implementation.
+
+Then TEE client driver can talk to a matched Trusted Application using APIs
+listed in include/linux/tee_drv.h.
+
+TEE client driver example
+-------------------------
+
+Suppose a TEE client driver needs to communicate with a Trusted Application
+having UUID: ``ac6a4085-0e82-4c33-bf98-8eb8e118b6c2``, so driver registration
+snippet would look like::
+
+	static const struct tee_client_device_id client_id_table[] = {
+		{UUID_INIT(0xac6a4085, 0x0e82, 0x4c33,
+			   0xbf, 0x98, 0x8e, 0xb8, 0xe1, 0x18, 0xb6, 0xc2)},
+		{}
+	};
+
+	MODULE_DEVICE_TABLE(tee, client_id_table);
+
+	static struct tee_client_driver client_driver = {
+		.id_table	= client_id_table,
+		.driver		= {
+			.name		= DRIVER_NAME,
+			.bus		= &tee_bus_type,
+			.probe		= client_probe,
+			.remove		= client_remove,
+		},
+	};
+
+	static int __init client_init(void)
+	{
+		return driver_register(&client_driver.driver);
+	}
+
+	static void __exit client_exit(void)
+	{
+		driver_unregister(&client_driver.driver);
+	}
+
+	module_init(client_init);
+	module_exit(client_exit);
+
+OP-TEE driver
+=============
+
+The OP-TEE driver handles OP-TEE [1] based TEEs. Currently it is only the ARM
+TrustZone based OP-TEE solution that is supported.
+
+Lowest level of communication with OP-TEE builds on ARM SMC Calling
+Convention (SMCCC) [2], which is the foundation for OP-TEE's SMC interface
+[3] used internally by the driver. Stacked on top of that is OP-TEE Message
+Protocol [4].
+
+OP-TEE SMC interface provides the basic functions required by SMCCC and some
+additional functions specific for OP-TEE. The most interesting functions are:
+
+- OPTEE_SMC_FUNCID_CALLS_UID (part of SMCCC) returns the version information
+  which is then returned by TEE_IOC_VERSION
+
+- OPTEE_SMC_CALL_GET_OS_UUID returns the particular OP-TEE implementation, used
+  to tell, for instance, a TrustZone OP-TEE apart from an OP-TEE running on a
+  separate secure co-processor.
+
+- OPTEE_SMC_CALL_WITH_ARG drives the OP-TEE message protocol
+
+- OPTEE_SMC_GET_SHM_CONFIG lets the driver and OP-TEE agree on which memory
+  range to used for shared memory between Linux and OP-TEE.
+
+The GlobalPlatform TEE Client API [5] is implemented on top of the generic
+TEE API.
+
+Picture of the relationship between the different components in the
+OP-TEE architecture::
+
+      User space                  Kernel                   Secure world
+      ~~~~~~~~~~                  ~~~~~~                   ~~~~~~~~~~~~
+   +--------+                                             +-------------+
+   | Client |                                             | Trusted     |
+   +--------+                                             | Application |
+      /\                                                  +-------------+
+      || +----------+                                           /\
+      || |tee-      |                                           ||
+      || |supplicant|                                           \/
+      || +----------+                                     +-------------+
+      \/      /\                                          | TEE Internal|
+   +-------+  ||                                          | API         |
+   + TEE   |  ||            +--------+--------+           +-------------+
+   | Client|  ||            | TEE    | OP-TEE |           | OP-TEE      |
+   | API   |  \/            | subsys | driver |           | Trusted OS  |
+   +-------+----------------+----+-------+----+-----------+-------------+
+   |      Generic TEE API        |       |     OP-TEE MSG               |
+   |      IOCTL (TEE_IOC_*)      |       |     SMCCC (OPTEE_SMC_CALL_*) |
+   +-----------------------------+       +------------------------------+
+
+RPC (Remote Procedure Call) are requests from secure world to kernel driver
+or tee-supplicant. An RPC is identified by a special range of SMCCC return
+values from OPTEE_SMC_CALL_WITH_ARG. RPC messages which are intended for the
+kernel are handled by the kernel driver. Other RPC messages will be forwarded to
+tee-supplicant without further involvement of the driver, except switching
+shared memory buffer representation.
+
+OP-TEE device enumeration
+-------------------------
+
+OP-TEE provides a pseudo Trusted Application: drivers/tee/optee/device.c in
+order to support device enumeration. In other words, OP-TEE driver invokes this
+application to retrieve a list of Trusted Applications which can be registered
+as devices on the TEE bus.
+
+AMD-TEE driver
+==============
+
+The AMD-TEE driver handles the communication with AMD's TEE environment. The
+TEE environment is provided by AMD Secure Processor.
+
+The AMD Secure Processor (formerly called Platform Security Processor or PSP)
+is a dedicated processor that features ARM TrustZone technology, along with a
+software-based Trusted Execution Environment (TEE) designed to enable
+third-party Trusted Applications. This feature is currently enabled only for
+APUs.
+
+The following picture shows a high level overview of AMD-TEE::
+
+                                             |
+    x86                                      |
+                                             |
+ User space            (Kernel space)        |    AMD Secure Processor (PSP)
+ ~~~~~~~~~~            ~~~~~~~~~~~~~~        |    ~~~~~~~~~~~~~~~~~~~~~~~~~~
+                                             |
+ +--------+                                  |       +-------------+
+ | Client |                                  |       | Trusted     |
+ +--------+                                  |       | Application |
+     /\                                      |       +-------------+
+     ||                                      |             /\
+     ||                                      |             ||
+     ||                                      |             \/
+     ||                                      |         +----------+
+     ||                                      |         |   TEE    |
+     ||                                      |         | Internal |
+     \/                                      |         |   API    |
+ +---------+           +-----------+---------+         +----------+
+ | TEE     |           | TEE       | AMD-TEE |         | AMD-TEE  |
+ | Client  |           | subsystem | driver  |         | Trusted  |
+ | API     |           |           |         |         |   OS     |
+ +---------+-----------+----+------+---------+---------+----------+
+ |   Generic TEE API        |      | ASP     |      Mailbox       |
+ |   IOCTL (TEE_IOC_*)      |      | driver  | Register Protocol  |
+ +--------------------------+      +---------+--------------------+
+
+At the lowest level (in x86), the AMD Secure Processor (ASP) driver uses the
+CPU to PSP mailbox regsister to submit commands to the PSP. The format of the
+command buffer is opaque to the ASP driver. It's role is to submit commands to
+the secure processor and return results to AMD-TEE driver. The interface
+between AMD-TEE driver and AMD Secure Processor driver can be found in [6].
+
+The AMD-TEE driver packages the command buffer payload for processing in TEE.
+The command buffer format for the different TEE commands can be found in [7].
+
+The TEE commands supported by AMD-TEE Trusted OS are:
+
+* TEE_CMD_ID_LOAD_TA          - loads a Trusted Application (TA) binary into
+                                TEE environment.
+* TEE_CMD_ID_UNLOAD_TA        - unloads TA binary from TEE environment.
+* TEE_CMD_ID_OPEN_SESSION     - opens a session with a loaded TA.
+* TEE_CMD_ID_CLOSE_SESSION    - closes session with loaded TA
+* TEE_CMD_ID_INVOKE_CMD       - invokes a command with loaded TA
+* TEE_CMD_ID_MAP_SHARED_MEM   - maps shared memory
+* TEE_CMD_ID_UNMAP_SHARED_MEM - unmaps shared memory
+
+AMD-TEE Trusted OS is the firmware running on AMD Secure Processor.
+
+The AMD-TEE driver registers itself with TEE subsystem and implements the
+following driver function callbacks:
+
+* get_version - returns the driver implementation id and capability.
+* open - sets up the driver context data structure.
+* release - frees up driver resources.
+* open_session - loads the TA binary and opens session with loaded TA.
+* close_session -  closes session with loaded TA and unloads it.
+* invoke_func - invokes a command with loaded TA.
+
+cancel_req driver callback is not supported by AMD-TEE.
+
+The GlobalPlatform TEE Client API [5] can be used by the user space (client) to
+talk to AMD's TEE. AMD's TEE provides a secure environment for loading, opening
+a session, invoking commands and clossing session with TA.
+
+References
+==========
+
+[1] https://github.com/OP-TEE/optee_os
+
+[2] http://infocenter.arm.com/help/topic/com.arm.doc.den0028a/index.html
+
+[3] drivers/tee/optee/optee_smc.h
+
+[4] drivers/tee/optee/optee_msg.h
+
+[5] http://www.globalplatform.org/specificationsdevice.asp look for
+    "TEE Client API Specification v1.0" and click download.
+
+[6] include/linux/psp-tee.h
+
+[7] drivers/tee/amdtee/amdtee_if.h
diff --git a/Documentation/staging/xz.rst b/Documentation/staging/xz.rst
new file mode 100644
index 000000000000..b2f5ff12a161
--- /dev/null
+++ b/Documentation/staging/xz.rst
@@ -0,0 +1,127 @@
+============================
+XZ data compression in Linux
+============================
+
+Introduction
+============
+
+XZ is a general purpose data compression format with high compression
+ratio and relatively fast decompression. The primary compression
+algorithm (filter) is LZMA2. Additional filters can be used to improve
+compression ratio even further. E.g. Branch/Call/Jump (BCJ) filters
+improve compression ratio of executable data.
+
+The XZ decompressor in Linux is called XZ Embedded. It supports
+the LZMA2 filter and optionally also BCJ filters. CRC32 is supported
+for integrity checking. The home page of XZ Embedded is at
+<https://tukaani.org/xz/embedded.html>, where you can find the
+latest version and also information about using the code outside
+the Linux kernel.
+
+For userspace, XZ Utils provide a zlib-like compression library
+and a gzip-like command line tool. XZ Utils can be downloaded from
+<https://tukaani.org/xz/>.
+
+XZ related components in the kernel
+===================================
+
+The xz_dec module provides XZ decompressor with single-call (buffer
+to buffer) and multi-call (stateful) APIs. The usage of the xz_dec
+module is documented in include/linux/xz.h.
+
+The xz_dec_test module is for testing xz_dec. xz_dec_test is not
+useful unless you are hacking the XZ decompressor. xz_dec_test
+allocates a char device major dynamically to which one can write
+.xz files from userspace. The decompressed output is thrown away.
+Keep an eye on dmesg to see diagnostics printed by xz_dec_test.
+See the xz_dec_test source code for the details.
+
+For decompressing the kernel image, initramfs, and initrd, there
+is a wrapper function in lib/decompress_unxz.c. Its API is the
+same as in other decompress_*.c files, which is defined in
+include/linux/decompress/generic.h.
+
+scripts/xz_wrap.sh is a wrapper for the xz command line tool found
+from XZ Utils. The wrapper sets compression options to values suitable
+for compressing the kernel image.
+
+For kernel makefiles, two commands are provided for use with
+$(call if_needed). The kernel image should be compressed with
+$(call if_needed,xzkern) which will use a BCJ filter and a big LZMA2
+dictionary. It will also append a four-byte trailer containing the
+uncompressed size of the file, which is needed by the boot code.
+Other things should be compressed with $(call if_needed,xzmisc)
+which will use no BCJ filter and 1 MiB LZMA2 dictionary.
+
+Notes on compression options
+============================
+
+Since the XZ Embedded supports only streams with no integrity check or
+CRC32, make sure that you don't use some other integrity check type
+when encoding files that are supposed to be decoded by the kernel. With
+liblzma, you need to use either LZMA_CHECK_NONE or LZMA_CHECK_CRC32
+when encoding. With the xz command line tool, use --check=none or
+--check=crc32.
+
+Using CRC32 is strongly recommended unless there is some other layer
+which will verify the integrity of the uncompressed data anyway.
+Double checking the integrity would probably be waste of CPU cycles.
+Note that the headers will always have a CRC32 which will be validated
+by the decoder; you can only change the integrity check type (or
+disable it) for the actual uncompressed data.
+
+In userspace, LZMA2 is typically used with dictionary sizes of several
+megabytes. The decoder needs to have the dictionary in RAM, thus big
+dictionaries cannot be used for files that are intended to be decoded
+by the kernel. 1 MiB is probably the maximum reasonable dictionary
+size for in-kernel use (maybe more is OK for initramfs). The presets
+in XZ Utils may not be optimal when creating files for the kernel,
+so don't hesitate to use custom settings. Example::
+
+	xz --check=crc32 --lzma2=dict=512KiB inputfile
+
+An exception to above dictionary size limitation is when the decoder
+is used in single-call mode. Decompressing the kernel itself is an
+example of this situation. In single-call mode, the memory usage
+doesn't depend on the dictionary size, and it is perfectly fine to
+use a big dictionary: for maximum compression, the dictionary should
+be at least as big as the uncompressed data itself.
+
+Future plans
+============
+
+Creating a limited XZ encoder may be considered if people think it is
+useful. LZMA2 is slower to compress than e.g. Deflate or LZO even at
+the fastest settings, so it isn't clear if LZMA2 encoder is wanted
+into the kernel.
+
+Support for limited random-access reading is planned for the
+decompression code. I don't know if it could have any use in the
+kernel, but I know that it would be useful in some embedded projects
+outside the Linux kernel.
+
+Conformance to the .xz file format specification
+================================================
+
+There are a couple of corner cases where things have been simplified
+at expense of detecting errors as early as possible. These should not
+matter in practice all, since they don't cause security issues. But
+it is good to know this if testing the code e.g. with the test files
+from XZ Utils.
+
+Reporting bugs
+==============
+
+Before reporting a bug, please check that it's not fixed already
+at upstream. See <https://tukaani.org/xz/embedded.html> to get the
+latest code.
+
+Report bugs to <lasse.collin@tukaani.org> or visit #tukaani on
+Freenode and talk to Larhzu. I don't actively read LKML or other
+kernel-related mailing lists, so if there's something I should know,
+you should email to me personally or use IRC.
+
+Don't bother Igor Pavlov with questions about the XZ implementation
+in the kernel or about XZ Utils. While these two implementations
+include essential code that is directly based on Igor Pavlov's code,
+these implementations aren't maintained nor supported by him.