From 104dc5e20ff52748a16f756ae946391bdc6a4d0a Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Wed, 20 Sep 2017 02:26:00 +0200 Subject: PM: Document rules on using pm_runtime_resume() in system suspend callbacks It quite often is necessary to resume devices from runtime suspend during system suspend for various reasons (for example, if their wakeup settings need to be changed), but that requires middle-layer or subsystem code to follow additional rules which currently are not clearly documented. Namely, if a driver calls pm_runtime_resume() for the device from its ->suspend (or equivalent) system sleep callback, that may not work if the middle layer above it has updated the state of the device from its ->prepare or ->suspend callbacks already in an incompatible way. For this reason, all middle layers must follow the rule that, until the ->suspend callback provided by the device's driver is invoked, the only way in which the device's state can be updated is by calling pm_runtime_resume() for it, if necessary. Fortunately enough, all middle layers in the code base today follow this rule, but it is not explicitly stated anywhere, so do that. Note that calling pm_runtime_resume() from the ->suspend callback of a driver will cause the ->runtime_resume callback provided by the middle layer to be invoked, but the rule above guarantees that this callback will nest properly with the middle layer's ->suspend callback and it will play well with the ->prepare one invoked before. Signed-off-by: Rafael J. Wysocki Reviewed-by: Ulf Hansson --- Documentation/driver-api/pm/devices.rst | 25 ++++++++++++++++++++++++- 1 file changed, 24 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/driver-api/pm/devices.rst b/Documentation/driver-api/pm/devices.rst index bedd32388dac..a8b07ec732bd 100644 --- a/Documentation/driver-api/pm/devices.rst +++ b/Documentation/driver-api/pm/devices.rst @@ -328,7 +328,10 @@ the phases are: ``prepare``, ``suspend``, ``suspend_late``, ``suspend_noirq``. After the ``->prepare`` callback method returns, no new children may be registered below the device. The method may also prepare the device or driver in some way for the upcoming system power transition, but it - should not put the device into a low-power state. + should not put the device into a low-power state. Moreover, if the + device supports runtime power management, the ``->prepare`` callback + method must not update its state in case it is necessary to resume it + from runtime suspend later on. For devices supporting runtime power management, the return value of the prepare callback can be used to indicate to the PM core that it may @@ -356,6 +359,16 @@ the phases are: ``prepare``, ``suspend``, ``suspend_late``, ``suspend_noirq``. the appropriate low-power state, depending on the bus type the device is on, and they may enable wakeup events. + However, for devices supporting runtime power management, the + ``->suspend`` methods provided by subsystems (bus types and PM domains + in particular) must follow an additional rule regarding what can be done + to the devices before their drivers' ``->suspend`` methods are called. + Namely, they can only resume the devices from runtime suspend by + calling :c:func:`pm_runtime_resume` for them, if that is necessary, and + they must not update the state of the devices in any other way at that + time (in case the drivers need to resume the devices from runtime + suspend in their ``->suspend`` methods). + 3. 
For a number of devices it is convenient to split suspend into the "quiesce device" and "save device state" phases, in which cases ``suspend_late`` is meant to do the latter. It is always executed after @@ -729,6 +742,16 @@ state temporarily, for example so that its system wakeup capability can be disabled. This all depends on the hardware and the design of the subsystem and device driver in question. +If it is necessary to resume a device from runtime suspend during a system-wide +transition into a sleep state, that can be done by calling +:c:func:`pm_runtime_resume` for it from the ``->suspend`` callback (or its +counterpart for transitions related to hibernation) of either the device's driver +or a subsystem responsible for it (for example, a bus type or a PM domain). +That is guaranteed to work by the requirement that subsystems must not change +the state of devices (possibly except for resuming them from runtime suspend) +from their ``->prepare`` and ``->suspend`` callbacks (or equivalent) *before* +invoking device drivers' ``->suspend`` callbacks (or equivalent). + During system-wide resume from a sleep state it's easiest to put devices into the full-power state, as explained in :file:`Documentation/power/runtime_pm.txt`. Refer to that document for more information regarding this particular issue as -- cgit v1.2.3 From eeb2d80d502af28e5660ff4bbe00f90ceb82c2db Mon Sep 17 00:00:00 2001 From: Srinivas Pandruvada Date: Thu, 5 Oct 2017 16:24:03 -0700 Subject: ACPI / LPIT: Add Low Power Idle Table (LPIT) support Add functionality to read the LPIT table, which provides: - Sysfs interface to read residency counters via /sys/devices/system/cpu/cpuidle/low_power_idle_cpu_residency_us /sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us Here the count "low_power_idle_cpu_residency_us" shows the time spent by the CPU package in a low power state. This is read via the MSR interface, which points to the MSR for PKG C10. Here the count "low_power_idle_system_residency_us" shows the time the system spent in a low power state. This is read via the MMIO interface. This is mapped to SLP_S0 residency on modern Intel systems. This residency is achieved only when the CPU is in PKG C10 and all functional blocks are in a low power state. It is possible that none, only one, or both of the above counters are present. For example: On my Kabylake system both of the above counters are present. After suspend-to-idle these counters are updated and print: 6916179 6998564 These counters can be read by tools like turbostat for display, or used to debug whether modern systems are reaching the desired low power state. - Provides an interface to read the residency counter memory address This address can be used to get the base address of PMC memory mapped IO. This is utilized by the intel_pmc_core driver to print more debug information. In addition, to avoid code duplication when reading iomem, removed the read of iomem from acpi_os_read_memory() in osl.c and made a common function acpi_os_read_iomem(). This new function is used for reading iomem in both osl.c and acpi_lpit.c. Link: http://www.uefi.org/sites/default/files/resources/Intel_ACPI_Low_Power_S0_Idle.pdf Signed-off-by: Srinivas Pandruvada Signed-off-by: Rafael J.
Wysocki --- Documentation/acpi/lpit.txt | 25 +++++++ drivers/acpi/Kconfig | 5 ++ drivers/acpi/Makefile | 1 + drivers/acpi/acpi_lpit.c | 162 ++++++++++++++++++++++++++++++++++++++++++++ drivers/acpi/internal.h | 6 ++ drivers/acpi/osl.c | 42 +++++++----- drivers/acpi/scan.c | 1 + include/acpi/acpiosxf.h | 2 + include/linux/acpi.h | 9 +++ 9 files changed, 237 insertions(+), 16 deletions(-) create mode 100644 Documentation/acpi/lpit.txt create mode 100644 drivers/acpi/acpi_lpit.c (limited to 'Documentation') diff --git a/Documentation/acpi/lpit.txt b/Documentation/acpi/lpit.txt new file mode 100644 index 000000000000..b426398d2e97 --- /dev/null +++ b/Documentation/acpi/lpit.txt @@ -0,0 +1,25 @@ +To enumerate platform Low Power Idle states, Intel platforms are using +“Low Power Idle Table” (LPIT). More details about this table can be +downloaded from: +http://www.uefi.org/sites/default/files/resources/Intel_ACPI_Low_Power_S0_Idle.pdf + +Residencies for each low power state can be read via FFH +(Function fixed hardware) or a memory mapped interface. + +On platforms supporting S0ix sleep states, there can be two types of +residencies: +- CPU PKG C10 (Read via FFH interface) +- Platform Controller Hub (PCH) SLP_S0 (Read via memory mapped interface) + +The following attributes are added dynamically to the cpuidle +sysfs attribute group: + /sys/devices/system/cpu/cpuidle/low_power_idle_cpu_residency_us + /sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us + +The "low_power_idle_cpu_residency_us" attribute shows time spent +by the CPU package in PKG C10 + +The "low_power_idle_system_residency_us" attribute shows SLP_S0 +residency, or system time spent with the SLP_S0# signal asserted. +This is the lowest possible system power state, achieved only when CPU is in +PKG C10 and all functional blocks in PCH are in a low power state. diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig index 1ce52f84dc23..4bfef0f78cde 100644 --- a/drivers/acpi/Kconfig +++ b/drivers/acpi/Kconfig @@ -80,6 +80,11 @@ endif config ACPI_SPCR_TABLE bool +config ACPI_LPIT + bool + depends on X86_64 + default y + config ACPI_SLEEP bool depends on SUSPEND || HIBERNATION diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile index 90265ab4437a..6a19bd7aba21 100644 --- a/drivers/acpi/Makefile +++ b/drivers/acpi/Makefile @@ -56,6 +56,7 @@ acpi-$(CONFIG_DEBUG_FS) += debugfs.o acpi-$(CONFIG_ACPI_NUMA) += numa.o acpi-$(CONFIG_ACPI_PROCFS_POWER) += cm_sbs.o acpi-y += acpi_lpat.o +acpi-$(CONFIG_ACPI_LPIT) += acpi_lpit.o acpi-$(CONFIG_ACPI_GENERIC_GSI) += irq.o acpi-$(CONFIG_ACPI_WATCHDOG) += acpi_watchdog.o diff --git a/drivers/acpi/acpi_lpit.c b/drivers/acpi/acpi_lpit.c new file mode 100644 index 000000000000..e94e478dd18b --- /dev/null +++ b/drivers/acpi/acpi_lpit.c @@ -0,0 +1,162 @@ + +/* + * acpi_lpit.c - LPIT table processing functions + * + * Copyright (C) 2017 Intel Corporation. All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License version + * 2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ + +#include +#include +#include +#include + +struct lpit_residency_info { + struct acpi_generic_address gaddr; + u64 frequency; + void __iomem *iomem_addr; +}; + +/* Storage for an memory mapped and FFH based entries */ +static struct lpit_residency_info residency_info_mem; +static struct lpit_residency_info residency_info_ffh; + +static int lpit_read_residency_counter_us(u64 *counter, bool io_mem) +{ + int err; + + if (io_mem) { + u64 count = 0; + int error; + + error = acpi_os_read_iomem(residency_info_mem.iomem_addr, &count, + residency_info_mem.gaddr.bit_width); + if (error) + return error; + + *counter = div64_u64(count * 1000000ULL, residency_info_mem.frequency); + return 0; + } + + err = rdmsrl_safe(residency_info_ffh.gaddr.address, counter); + if (!err) { + u64 mask = GENMASK_ULL(residency_info_ffh.gaddr.bit_offset + + residency_info_ffh.gaddr. bit_width - 1, + residency_info_ffh.gaddr.bit_offset); + + *counter &= mask; + *counter >>= residency_info_ffh.gaddr.bit_offset; + *counter = div64_u64(*counter * 1000000ULL, residency_info_ffh.frequency); + return 0; + } + + return -ENODATA; +} + +static ssize_t low_power_idle_system_residency_us_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + u64 counter; + int ret; + + ret = lpit_read_residency_counter_us(&counter, true); + if (ret) + return ret; + + return sprintf(buf, "%llu\n", counter); +} +static DEVICE_ATTR_RO(low_power_idle_system_residency_us); + +static ssize_t low_power_idle_cpu_residency_us_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + u64 counter; + int ret; + + ret = lpit_read_residency_counter_us(&counter, false); + if (ret) + return ret; + + return sprintf(buf, "%llu\n", counter); +} +static DEVICE_ATTR_RO(low_power_idle_cpu_residency_us); + +int lpit_read_residency_count_address(u64 *address) +{ + if (!residency_info_mem.gaddr.address) + return -EINVAL; + + *address = residency_info_mem.gaddr.address; + + return 0; +} + +static void lpit_update_residency(struct lpit_residency_info *info, + struct acpi_lpit_native *lpit_native) +{ + info->frequency = lpit_native->counter_frequency ? 
+ lpit_native->counter_frequency : tsc_khz * 1000; + if (!info->frequency) + info->frequency = 1; + + info->gaddr = lpit_native->residency_counter; + if (info->gaddr.space_id == ACPI_ADR_SPACE_SYSTEM_MEMORY) { + info->iomem_addr = ioremap_nocache(info->gaddr.address, + info->gaddr.bit_width / 8); + if (!info->iomem_addr) + return; + + /* Silently fail, if cpuidle attribute group is not present */ + sysfs_add_file_to_group(&cpu_subsys.dev_root->kobj, + &dev_attr_low_power_idle_system_residency_us.attr, + "cpuidle"); + } else if (info->gaddr.space_id == ACPI_ADR_SPACE_FIXED_HARDWARE) { + /* Silently fail, if cpuidle attribute group is not present */ + sysfs_add_file_to_group(&cpu_subsys.dev_root->kobj, + &dev_attr_low_power_idle_cpu_residency_us.attr, + "cpuidle"); + } +} + +static void lpit_process(u64 begin, u64 end) +{ + while (begin + sizeof(struct acpi_lpit_native) < end) { + struct acpi_lpit_native *lpit_native = (struct acpi_lpit_native *)begin; + + if (!lpit_native->header.type && !lpit_native->header.flags) { + if (lpit_native->residency_counter.space_id == ACPI_ADR_SPACE_SYSTEM_MEMORY && + !residency_info_mem.gaddr.address) { + lpit_update_residency(&residency_info_mem, lpit_native); + } else if (lpit_native->residency_counter.space_id == ACPI_ADR_SPACE_FIXED_HARDWARE && + !residency_info_ffh.gaddr.address) { + lpit_update_residency(&residency_info_ffh, lpit_native); + } + } + begin += lpit_native->header.length; + } +} + +void acpi_init_lpit(void) +{ + acpi_status status; + u64 lpit_begin; + struct acpi_table_lpit *lpit; + + status = acpi_get_table(ACPI_SIG_LPIT, 0, (struct acpi_table_header **)&lpit); + + if (ACPI_FAILURE(status)) + return; + + lpit_begin = (u64)lpit + sizeof(*lpit); + lpit_process(lpit_begin, lpit_begin + lpit->header.length); +} diff --git a/drivers/acpi/internal.h b/drivers/acpi/internal.h index 4361c4415b4f..fc8c43e76707 100644 --- a/drivers/acpi/internal.h +++ b/drivers/acpi/internal.h @@ -248,4 +248,10 @@ void acpi_watchdog_init(void); static inline void acpi_watchdog_init(void) {} #endif +#ifdef CONFIG_ACPI_LPIT +void acpi_init_lpit(void); +#else +static inline void acpi_init_lpit(void) { } +#endif + #endif /* _ACPI_INTERNAL_H_ */ diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c index db78d353bab1..3bb46cb24a99 100644 --- a/drivers/acpi/osl.c +++ b/drivers/acpi/osl.c @@ -663,6 +663,29 @@ acpi_status acpi_os_write_port(acpi_io_address port, u32 value, u32 width) EXPORT_SYMBOL(acpi_os_write_port); +int acpi_os_read_iomem(void __iomem *virt_addr, u64 *value, u32 width) +{ + + switch (width) { + case 8: + *(u8 *) value = readb(virt_addr); + break; + case 16: + *(u16 *) value = readw(virt_addr); + break; + case 32: + *(u32 *) value = readl(virt_addr); + break; + case 64: + *(u64 *) value = readq(virt_addr); + break; + default: + return -EINVAL; + } + + return 0; +} + acpi_status acpi_os_read_memory(acpi_physical_address phys_addr, u64 *value, u32 width) { @@ -670,6 +693,7 @@ acpi_os_read_memory(acpi_physical_address phys_addr, u64 *value, u32 width) unsigned int size = width / 8; bool unmap = false; u64 dummy; + int error; rcu_read_lock(); virt_addr = acpi_map_vaddr_lookup(phys_addr, size); @@ -684,22 +708,8 @@ acpi_os_read_memory(acpi_physical_address phys_addr, u64 *value, u32 width) if (!value) value = &dummy; - switch (width) { - case 8: - *(u8 *) value = readb(virt_addr); - break; - case 16: - *(u16 *) value = readw(virt_addr); - break; - case 32: - *(u32 *) value = readl(virt_addr); - break; - case 64: - *(u64 *) value = readq(virt_addr); - break; - 
default: - BUG(); - } + error = acpi_os_read_iomem(virt_addr, value, width); + BUG_ON(error); if (unmap) iounmap(virt_addr); diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c index 602f8ff212f2..81367edc8a10 100644 --- a/drivers/acpi/scan.c +++ b/drivers/acpi/scan.c @@ -2122,6 +2122,7 @@ int __init acpi_scan_init(void) acpi_int340x_thermal_init(); acpi_amba_init(); acpi_watchdog_init(); + acpi_init_lpit(); acpi_scan_add_handler(&generic_device_handler); diff --git a/include/acpi/acpiosxf.h b/include/acpi/acpiosxf.h index c66eb8ffa454..d5c0f5153c4e 100644 --- a/include/acpi/acpiosxf.h +++ b/include/acpi/acpiosxf.h @@ -287,6 +287,8 @@ acpi_status acpi_os_write_port(acpi_io_address address, u32 value, u32 width); /* * Platform and hardware-independent physical memory interfaces */ +int acpi_os_read_iomem(void __iomem *virt_addr, u64 *value, u32 width); + #ifndef ACPI_USE_ALTERNATE_PROTOTYPE_acpi_os_read_memory acpi_status acpi_os_read_memory(acpi_physical_address address, u64 *value, u32 width); diff --git a/include/linux/acpi.h b/include/linux/acpi.h index d18c92d4ba19..2b1738f840ab 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -1248,4 +1248,13 @@ int acpi_irq_get(acpi_handle handle, unsigned int index, struct resource *res) } #endif +#ifdef CONFIG_ACPI_LPIT +int lpit_read_residency_count_address(u64 *address); +#else +static inline int lpit_read_residency_count_address(u64 *address) +{ + return -EINVAL; +} +#endif + #endif /*_LINUX_ACPI_H*/ -- cgit v1.2.3 From 20f97caf1120bd02e8ff4adbad3b44b63626feb5 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Fri, 13 Oct 2017 15:27:24 +0200 Subject: PM / QoS: Drop PM_QOS_FLAG_REMOTE_WAKEUP The PM QoS flag PM_QOS_FLAG_REMOTE_WAKEUP is not used consistently and the vast majority of code simply assumes that remote wakeup should be enabled for devices in runtime suspend if they can generate wakeup signals, so drop it. Signed-off-by: Rafael J. Wysocki Acked-by: Ulf Hansson Reviewed-by: Mika Westerberg --- Documentation/ABI/testing/sysfs-devices-power | 16 --------------- Documentation/power/pm_qos_interface.txt | 13 ++++++------- drivers/acpi/device_pm.c | 6 ++---- drivers/base/power/domain.c | 4 +--- drivers/base/power/sysfs.c | 28 --------------------------- include/linux/pm_qos.h | 1 - 6 files changed, 9 insertions(+), 59 deletions(-) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-devices-power b/Documentation/ABI/testing/sysfs-devices-power index 676fdf5f2a99..f4b24c327665 100644 --- a/Documentation/ABI/testing/sysfs-devices-power +++ b/Documentation/ABI/testing/sysfs-devices-power @@ -258,19 +258,3 @@ Description: This attribute has no effect on system-wide suspend/resume and hibernation. - -What: /sys/devices/.../power/pm_qos_remote_wakeup -Date: September 2012 -Contact: Rafael J. Wysocki -Description: - The /sys/devices/.../power/pm_qos_remote_wakeup attribute - is used for manipulating the PM QoS "remote wakeup required" - flag. If set, this flag indicates to the kernel that the - device is a source of user events that have to be signaled from - its low-power states. - - Not all drivers support this attribute. If it isn't supported, - it is not present. - - This attribute has no effect on system-wide suspend/resume and - hibernation. 
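With the remote-wakeup flag gone, PM_QOS_FLAG_NO_POWER_OFF is the only per-device PM QoS flag left for middle-layer code to query. As a minimal sketch (not part of this patch; the helper name is made up for illustration), a check mirroring the genpd governor's use of dev_pm_qos_flags() after this change could look like this:

        #include <linux/device.h>
        #include <linux/pm_qos.h>

        /* Illustrative helper (hypothetical): may this device be powered off? */
        static bool example_may_power_off(struct device *dev)
        {
                /*
                 * dev_pm_qos_flags() returns a value above PM_QOS_FLAGS_NONE
                 * only if at least one request sets PM_QOS_FLAG_NO_POWER_OFF,
                 * in which case powering the device off is not allowed.
                 */
                return dev_pm_qos_flags(dev, PM_QOS_FLAG_NO_POWER_OFF) <=
                                PM_QOS_FLAGS_NONE;
        }
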
diff --git a/Documentation/power/pm_qos_interface.txt b/Documentation/power/pm_qos_interface.txt index 21d2d48f87a2..19c5f7b1a7ba 100644 --- a/Documentation/power/pm_qos_interface.txt +++ b/Documentation/power/pm_qos_interface.txt @@ -98,8 +98,7 @@ Values are updated in response to changes of the request list. The target values of resume latency and active state latency tolerance are simply the minimum of the request values held in the parameter list elements. The PM QoS flags aggregate value is a gather (bitwise OR) of all list elements' -values. Two device PM QoS flags are defined currently: PM_QOS_FLAG_NO_POWER_OFF -and PM_QOS_FLAG_REMOTE_WAKEUP. +values. One device PM QoS flag is defined currently: PM_QOS_FLAG_NO_POWER_OFF. Note: The aggregated target values are implemented in such a way that reading the aggregated value does not require any locking mechanism. @@ -153,14 +152,14 @@ PM QoS list of resume latency constraints and remove sysfs attribute pm_qos_resume_latency_us from the device's power directory. int dev_pm_qos_expose_flags(device, value) -Add a request to the device's PM QoS list of flags and create sysfs attributes -pm_qos_no_power_off and pm_qos_remote_wakeup under the device's power directory -allowing user space to change these flags' value. +Add a request to the device's PM QoS list of flags and create sysfs attribute +pm_qos_no_power_off under the device's power directory allowing user space to +change the value of the PM_QOS_FLAG_NO_POWER_OFF flag. void dev_pm_qos_hide_flags(device) Drop the request added by dev_pm_qos_expose_flags() from the device's PM QoS list -of flags and remove sysfs attributes pm_qos_no_power_off and pm_qos_remote_wakeup -under the device's power directory. +of flags and remove sysfs attribute pm_qos_no_power_off from the device's power +directory. Notification mechanisms: The per-device PM QoS framework has a per-device notification tree. 
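For illustration only (not part of this patch, and the driver and function names are hypothetical), a driver that wants user space to be able to set the remaining flag could expose it from its probe path roughly as follows, per the dev_pm_qos_expose_flags() description above:

        #include <linux/platform_device.h>
        #include <linux/pm_qos.h>

        static int example_probe(struct platform_device *pdev)
        {
                int ret;

                /* Creates /sys/devices/.../power/pm_qos_no_power_off, initially clear. */
                ret = dev_pm_qos_expose_flags(&pdev->dev, 0);
                if (ret)
                        dev_warn(&pdev->dev, "failed to expose PM QoS flags: %d\n", ret);

                return 0;
        }
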
diff --git a/drivers/acpi/device_pm.c b/drivers/acpi/device_pm.c index fbcc73f7a099..e8c820129797 100644 --- a/drivers/acpi/device_pm.c +++ b/drivers/acpi/device_pm.c @@ -581,8 +581,7 @@ static int acpi_dev_pm_get_state(struct device *dev, struct acpi_device *adev, d_min = ret; wakeup = device_may_wakeup(dev) && adev->wakeup.flags.valid && adev->wakeup.sleep_state >= target_state; - } else if (dev_pm_qos_flags(dev, PM_QOS_FLAG_REMOTE_WAKEUP) != - PM_QOS_FLAGS_NONE) { + } else { wakeup = adev->wakeup.flags.valid; } @@ -865,8 +864,7 @@ int acpi_dev_runtime_suspend(struct device *dev) if (!adev) return 0; - remote_wakeup = dev_pm_qos_flags(dev, PM_QOS_FLAG_REMOTE_WAKEUP) > - PM_QOS_FLAGS_NONE; + remote_wakeup = acpi_device_can_wakeup(adev); if (remote_wakeup) { error = acpi_device_wakeup_enable(adev, ACPI_STATE_S0); if (error) diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c index e8ca5e2cf1e5..e6414e9998bb 100644 --- a/drivers/base/power/domain.c +++ b/drivers/base/power/domain.c @@ -346,9 +346,7 @@ static int genpd_power_off(struct generic_pm_domain *genpd, bool one_dev_on, list_for_each_entry(pdd, &genpd->dev_list, list_node) { enum pm_qos_flags_status stat; - stat = dev_pm_qos_flags(pdd->dev, - PM_QOS_FLAG_NO_POWER_OFF - | PM_QOS_FLAG_REMOTE_WAKEUP); + stat = dev_pm_qos_flags(pdd->dev, PM_QOS_FLAG_NO_POWER_OFF); if (stat > PM_QOS_FLAGS_NONE) return -EBUSY; diff --git a/drivers/base/power/sysfs.c b/drivers/base/power/sysfs.c index 156ab57bca77..29bf28fef136 100644 --- a/drivers/base/power/sysfs.c +++ b/drivers/base/power/sysfs.c @@ -309,33 +309,6 @@ static ssize_t pm_qos_no_power_off_store(struct device *dev, static DEVICE_ATTR(pm_qos_no_power_off, 0644, pm_qos_no_power_off_show, pm_qos_no_power_off_store); -static ssize_t pm_qos_remote_wakeup_show(struct device *dev, - struct device_attribute *attr, - char *buf) -{ - return sprintf(buf, "%d\n", !!(dev_pm_qos_requested_flags(dev) - & PM_QOS_FLAG_REMOTE_WAKEUP)); -} - -static ssize_t pm_qos_remote_wakeup_store(struct device *dev, - struct device_attribute *attr, - const char *buf, size_t n) -{ - int ret; - - if (kstrtoint(buf, 0, &ret)) - return -EINVAL; - - if (ret != 0 && ret != 1) - return -EINVAL; - - ret = dev_pm_qos_update_flags(dev, PM_QOS_FLAG_REMOTE_WAKEUP, ret); - return ret < 0 ? ret : n; -} - -static DEVICE_ATTR(pm_qos_remote_wakeup, 0644, - pm_qos_remote_wakeup_show, pm_qos_remote_wakeup_store); - #ifdef CONFIG_PM_SLEEP static const char _enabled[] = "enabled"; static const char _disabled[] = "disabled"; @@ -671,7 +644,6 @@ static const struct attribute_group pm_qos_latency_tolerance_attr_group = { static struct attribute *pm_qos_flags_attrs[] = { &dev_attr_pm_qos_no_power_off.attr, - &dev_attr_pm_qos_remote_wakeup.attr, NULL, }; static const struct attribute_group pm_qos_flags_attr_group = { diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h index 032b55909145..51f0d7e0b15f 100644 --- a/include/linux/pm_qos.h +++ b/include/linux/pm_qos.h @@ -39,7 +39,6 @@ enum pm_qos_flags_status { #define PM_QOS_LATENCY_ANY ((s32)(~(__u32)0 >> 1)) #define PM_QOS_FLAG_NO_POWER_OFF (1 << 0) -#define PM_QOS_FLAG_REMOTE_WAKEUP (1 << 1) struct pm_qos_request { struct plist_node node; -- cgit v1.2.3 From 7e95d9134e3851de57075254737bc434462bb293 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Thu, 19 Oct 2017 01:18:57 +0200 Subject: PM: docs: Fix formatting typo in devices.rst There is one word too many under formatting markup in one place in device.rst, so fix it. Signed-off-by: Rafael J. 
Wysocki --- Documentation/driver-api/pm/devices.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/driver-api/pm/devices.rst b/Documentation/driver-api/pm/devices.rst index a0dc2879a152..b8f1e3bdb743 100644 --- a/Documentation/driver-api/pm/devices.rst +++ b/Documentation/driver-api/pm/devices.rst @@ -274,7 +274,7 @@ sleep states and the hibernation state ("suspend-to-disk"). Each phase involves executing callbacks for every device before the next phase begins. Not all buses or classes support all these callbacks and not all drivers use all the callbacks. The various phases always run after tasks have been frozen and -before they are unfrozen. Furthermore, the ``*_noirq phases`` run at a time +before they are unfrozen. Furthermore, the ``*_noirq`` phases run at a time when IRQ handlers have been disabled (except for those marked with the IRQF_NO_SUSPEND flag). -- cgit v1.2.3 From 08810a4119aaebf6318f209ec5dd9828e969cba4 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Wed, 25 Oct 2017 14:12:29 +0200 Subject: PM / core: Add NEVER_SKIP and SMART_PREPARE driver flags The motivation for this change is to provide a way to work around a problem with the direct-complete mechanism used for avoiding system suspend/resume handling for devices in runtime suspend. The problem is that some middle layer code (the PCI bus type and the ACPI PM domain in particular) returns positive values from its system suspend ->prepare callbacks regardless of whether the driver's ->prepare returns a positive value or 0, which effectively prevents drivers from being able to control the direct-complete feature. Some drivers need that control, however, and the PCI bus type has grown its own flag to deal with this issue, but since it is not limited to PCI, it is better to address it by adding driver flags at the core level. To that end, add a driver_flags field to struct dev_pm_info for flags that can be set by device drivers at the probe time to inform the PM core and/or bus types, PM domains and so on on the capabilities and/or preferences of device drivers. Also add two static inline helpers for setting that field and testing it against a given set of flags and make the driver core clear it automatically on driver remove and probe failures. Define and document two PM driver flags related to the direct- complete feature: NEVER_SKIP and SMART_PREPARE that can be used, respectively, to indicate to the PM core that the direct-complete mechanism should never be used for the device and to inform the middle layer code (bus types, PM domains etc) that it can only request the PM core to use the direct-complete mechanism for the device (by returning a positive value from its ->prepare callback) if it also has been requested by the driver. While at it, make the core check pm_runtime_suspended() when setting power.direct_complete so that it doesn't need to be checked by ->prepare callbacks. Signed-off-by: Rafael J. 
Wysocki Acked-by: Greg Kroah-Hartman Acked-by: Bjorn Helgaas Reviewed-by: Ulf Hansson --- Documentation/driver-api/pm/devices.rst | 14 ++++++++++++++ Documentation/power/pci.txt | 19 +++++++++++++++++++ drivers/acpi/device_pm.c | 13 +++++++++---- drivers/base/dd.c | 2 ++ drivers/base/power/main.c | 4 +++- drivers/pci/pci-driver.c | 5 ++++- include/linux/device.h | 10 ++++++++++ include/linux/pm.h | 20 ++++++++++++++++++++ 8 files changed, 81 insertions(+), 6 deletions(-) (limited to 'Documentation') diff --git a/Documentation/driver-api/pm/devices.rst b/Documentation/driver-api/pm/devices.rst index 4a18ef9997c0..8add5b302a89 100644 --- a/Documentation/driver-api/pm/devices.rst +++ b/Documentation/driver-api/pm/devices.rst @@ -354,6 +354,20 @@ the phases are: ``prepare``, ``suspend``, ``suspend_late``, ``suspend_noirq``. is because all such devices are initially set to runtime-suspended with runtime PM disabled. + This feature also can be controlled by device drivers by using the + ``DPM_FLAG_NEVER_SKIP`` and ``DPM_FLAG_SMART_PREPARE`` driver power + management flags. [Typically, they are set at the time the driver is + probed against the device in question by passing them to the + :c:func:`dev_pm_set_driver_flags` helper function.] If the first of + these flags is set, the PM core will not apply the direct-complete + procedure described above to the given device and, consequenty, to any + of its ancestors. The second flag, when set, informs the middle layer + code (bus types, device types, PM domains, classes) that it should take + the return value of the ``->prepare`` callback provided by the driver + into account and it may only return a positive value from its own + ``->prepare`` callback if the driver's one also has returned a positive + value. + 2. The ``->suspend`` methods should quiesce the device to stop it from performing I/O. They also may save the device registers and put it into the appropriate low-power state, depending on the bus type the device is diff --git a/Documentation/power/pci.txt b/Documentation/power/pci.txt index a1b7f7158930..ab4e7d0540c1 100644 --- a/Documentation/power/pci.txt +++ b/Documentation/power/pci.txt @@ -961,6 +961,25 @@ dev_pm_ops to indicate that one suspend routine is to be pointed to by the .suspend(), .freeze(), and .poweroff() members and one resume routine is to be pointed to by the .resume(), .thaw(), and .restore() members. +3.1.19. Driver Flags for Power Management + +The PM core allows device drivers to set flags that influence the handling of +power management for the devices by the core itself and by middle layer code +including the PCI bus type. The flags should be set once at the driver probe +time with the help of the dev_pm_set_driver_flags() function and they should not +be updated directly afterwards. + +The DPM_FLAG_NEVER_SKIP flag prevents the PM core from using the direct-complete +mechanism allowing device suspend/resume callbacks to be skipped if the device +is in runtime suspend when the system suspend starts. That also affects all of +the ancestors of the device, so this flag should only be used if absolutely +necessary. + +The DPM_FLAG_SMART_PREPARE flag instructs the PCI bus type to only return a +positive value from pci_pm_prepare() if the ->prepare callback provided by the +driver of the device returns a positive value. That allows the driver to opt +out from using the direct-complete mechanism dynamically. + 3.2. 
Device Runtime Power Management ------------------------------------ In addition to providing device power management callbacks PCI device drivers diff --git a/drivers/acpi/device_pm.c b/drivers/acpi/device_pm.c index 17e8eb93a76c..b4dcc6144e6b 100644 --- a/drivers/acpi/device_pm.c +++ b/drivers/acpi/device_pm.c @@ -959,11 +959,16 @@ static bool acpi_dev_needs_resume(struct device *dev, struct acpi_device *adev) int acpi_subsys_prepare(struct device *dev) { struct acpi_device *adev = ACPI_COMPANION(dev); - int ret; - ret = pm_generic_prepare(dev); - if (ret < 0) - return ret; + if (dev->driver && dev->driver->pm && dev->driver->pm->prepare) { + int ret = dev->driver->pm->prepare(dev); + + if (ret < 0) + return ret; + + if (!ret && dev_pm_test_driver_flags(dev, DPM_FLAG_SMART_PREPARE)) + return 0; + } if (!adev || !pm_runtime_suspended(dev)) return 0; diff --git a/drivers/base/dd.c b/drivers/base/dd.c index ad44b40fe284..45575e134696 100644 --- a/drivers/base/dd.c +++ b/drivers/base/dd.c @@ -464,6 +464,7 @@ pinctrl_bind_failed: if (dev->pm_domain && dev->pm_domain->dismiss) dev->pm_domain->dismiss(dev); pm_runtime_reinit(dev); + dev_pm_set_driver_flags(dev, 0); switch (ret) { case -EPROBE_DEFER: @@ -869,6 +870,7 @@ static void __device_release_driver(struct device *dev, struct device *parent) if (dev->pm_domain && dev->pm_domain->dismiss) dev->pm_domain->dismiss(dev); pm_runtime_reinit(dev); + dev_pm_set_driver_flags(dev, 0); klist_remove(&dev->p->knode_driver); device_pm_check_callbacks(dev); diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index 9bbbbb13a9db..c0135cd95ada 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -1700,7 +1700,9 @@ unlock: * applies to suspend transitions, however. */ spin_lock_irq(&dev->power.lock); - dev->power.direct_complete = ret > 0 && state.event == PM_EVENT_SUSPEND; + dev->power.direct_complete = state.event == PM_EVENT_SUSPEND && + pm_runtime_suspended(dev) && ret > 0 && + !dev_pm_test_driver_flags(dev, DPM_FLAG_NEVER_SKIP); spin_unlock_irq(&dev->power.lock); return 0; } diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index 11bd267fc137..68a32703b30a 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -689,8 +689,11 @@ static int pci_pm_prepare(struct device *dev) if (drv && drv->pm && drv->pm->prepare) { int error = drv->pm->prepare(dev); - if (error) + if (error < 0) return error; + + if (!error && dev_pm_test_driver_flags(dev, DPM_FLAG_SMART_PREPARE)) + return 0; } return pci_dev_keep_suspended(to_pci_dev(dev)); } diff --git a/include/linux/device.h b/include/linux/device.h index c32e6f974d4a..fb9451599aca 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -1070,6 +1070,16 @@ static inline void dev_pm_syscore_device(struct device *dev, bool val) #endif } +static inline void dev_pm_set_driver_flags(struct device *dev, u32 flags) +{ + dev->power.driver_flags = flags; +} + +static inline bool dev_pm_test_driver_flags(struct device *dev, u32 flags) +{ + return !!(dev->power.driver_flags & flags); +} + static inline void device_lock(struct device *dev) { mutex_lock(&dev->mutex); diff --git a/include/linux/pm.h b/include/linux/pm.h index a0ceeccf2846..f10bad831bfa 100644 --- a/include/linux/pm.h +++ b/include/linux/pm.h @@ -550,6 +550,25 @@ struct pm_subsys_data { #endif }; +/* + * Driver flags to control system suspend/resume behavior. + * + * These flags can be set by device drivers at the probe time. 
They need not be + * cleared by the drivers as the driver core will take care of that. + * + * NEVER_SKIP: Do not skip system suspend/resume callbacks for the device. + * SMART_PREPARE: Check the return value of the driver's ->prepare callback. + * + * Setting SMART_PREPARE instructs bus types and PM domains which may want + * system suspend/resume callbacks to be skipped for the device to return 0 from + * their ->prepare callbacks if the driver's ->prepare callback returns 0 (in + * other words, the system suspend/resume callbacks can only be skipped for the + * device if its driver doesn't object against that). This flag has no effect + * if NEVER_SKIP is set. + */ +#define DPM_FLAG_NEVER_SKIP BIT(0) +#define DPM_FLAG_SMART_PREPARE BIT(1) + struct dev_pm_info { pm_message_t power_state; unsigned int can_wakeup:1; @@ -561,6 +580,7 @@ struct dev_pm_info { bool is_late_suspended:1; bool early_init:1; /* Owned by the PM core */ bool direct_complete:1; /* Owned by the PM core */ + u32 driver_flags; spinlock_t lock; #ifdef CONFIG_PM_SLEEP struct list_head entry; -- cgit v1.2.3 From 0eab11c9ae3b3cc5dd76f20b81d0247647a6e96f Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Thu, 26 Oct 2017 12:12:08 +0200 Subject: PM / core: Add SMART_SUSPEND driver flag Define and document a SMART_SUSPEND flag to instruct bus types and PM domains that the system suspend callbacks provided by the driver can cope with runtime-suspended devices, so from the driver's perspective it should be safe to leave devices in runtime suspend during system suspend. Setting that flag may also cause middle-layer code (bus types, PM domains etc.) to skip invocations of the ->suspend_late and ->suspend_noirq callbacks provided by the driver if the device is in runtime suspend at the beginning of the "late" phase of the system-wide suspend transition, in which case the driver's system-wide resume callbacks may be invoked back-to-back with its ->runtime_suspend callback, so the driver has to be able to cope with that too. Signed-off-by: Rafael J. Wysocki Acked-by: Greg Kroah-Hartman Reviewed-by: Ulf Hansson --- Documentation/driver-api/pm/devices.rst | 20 ++++++++++++++++++++ drivers/base/power/main.c | 3 +++ include/linux/pm.h | 8 ++++++++ 3 files changed, 31 insertions(+) (limited to 'Documentation') diff --git a/Documentation/driver-api/pm/devices.rst b/Documentation/driver-api/pm/devices.rst index 8add5b302a89..574dadd06dec 100644 --- a/Documentation/driver-api/pm/devices.rst +++ b/Documentation/driver-api/pm/devices.rst @@ -766,6 +766,26 @@ the state of devices (possibly except for resuming them from runtime suspend) from their ``->prepare`` and ``->suspend`` callbacks (or equivalent) *before* invoking device drivers' ``->suspend`` callbacks (or equivalent). +Some bus types and PM domains have a policy to resume all devices from runtime +suspend upfront in their ``->suspend`` callbacks, but that may not be really +necessary if the driver of the device can cope with runtime-suspended devices. +The driver can indicate that by setting ``DPM_FLAG_SMART_SUSPEND`` in +:c:member:`power.driver_flags` at the probe time, by passing it to the +:c:func:`dev_pm_set_driver_flags` helper. That also may cause middle-layer code +(bus types, PM domains etc.) 
to skip the ``->suspend_late`` and +``->suspend_noirq`` callbacks provided by the driver if the device remains in +runtime suspend at the beginning of the ``suspend_late`` phase of system-wide +suspend (or in the ``poweroff_late`` phase of hibernation), when runtime PM +has been disabled for it, under the assumption that its state should not change +after that point until the system-wide transition is over. If that happens, the +driver's system-wide resume callbacks, if present, may still be invoked during +the subsequent system-wide resume transition and the device's runtime power +management status may be set to "active" before enabling runtime PM for it, +so the driver must be prepared to cope with the invocation of its system-wide +resume callbacks back-to-back with its ``->runtime_suspend`` one (without the +intervening ``->runtime_resume`` and so on) and the final state of the device +must reflect the "active" status for runtime PM in that case. + During system-wide resume from a sleep state it's easiest to put devices into the full-power state, as explained in :file:`Documentation/power/runtime_pm.txt`. Refer to that document for more information regarding this particular issue as diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index c0135cd95ada..8d9024017645 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -1652,6 +1652,9 @@ static int device_prepare(struct device *dev, pm_message_t state) if (dev->power.syscore) return 0; + WARN_ON(dev_pm_test_driver_flags(dev, DPM_FLAG_SMART_SUSPEND) && + !pm_runtime_enabled(dev)); + /* * If a device's parent goes into runtime suspend at the wrong time, * it won't be possible to resume the device. To prevent this we diff --git a/include/linux/pm.h b/include/linux/pm.h index f10bad831bfa..43b5418e05bb 100644 --- a/include/linux/pm.h +++ b/include/linux/pm.h @@ -558,6 +558,7 @@ struct pm_subsys_data { * * NEVER_SKIP: Do not skip system suspend/resume callbacks for the device. * SMART_PREPARE: Check the return value of the driver's ->prepare callback. + * SMART_SUSPEND: No need to resume the device from runtime suspend. * * Setting SMART_PREPARE instructs bus types and PM domains which may want * system suspend/resume callbacks to be skipped for the device to return 0 from @@ -565,9 +566,16 @@ struct pm_subsys_data { * other words, the system suspend/resume callbacks can only be skipped for the * device if its driver doesn't object against that). This flag has no effect * if NEVER_SKIP is set. + * + * Setting SMART_SUSPEND instructs bus types and PM domains which may want to + * runtime resume the device upfront during system suspend that doing so is not + * necessary from the driver's perspective. It also may cause them to skip + * invocations of the ->suspend_late and ->suspend_noirq callbacks provided by + * the driver if they decide to leave the device in runtime suspend. */ #define DPM_FLAG_NEVER_SKIP BIT(0) #define DPM_FLAG_SMART_PREPARE BIT(1) +#define DPM_FLAG_SMART_SUSPEND BIT(2) struct dev_pm_info { pm_message_t power_state; -- cgit v1.2.3 From c4b65157aeefad29b2351a00a010e8c40ce7fd0e Mon Sep 17 00:00:00 2001 From: "Rafael J. 
Wysocki" Date: Thu, 26 Oct 2017 12:12:22 +0200 Subject: PCI / PM: Take SMART_SUSPEND driver flag into account Make the PCI bus type take DPM_FLAG_SMART_SUSPEND into account in its system-wide PM callbacks and make sure that all code that should not run in parallel with pci_pm_runtime_resume() is executed in the "late" phases of system suspend, freeze and poweroff transitions. [Note that the pm_runtime_suspended() check in pci_dev_keep_suspended() is an optimization, because if is not passed, all of the subsequent checks may be skipped and some of them are much more overhead in general.] Also use the observation that if the device is in runtime suspend at the beginning of the "late" phase of a system-wide suspend-like transition, its state cannot change going forward (runtime PM is disabled for it at that time) until the transition is over and the subsequent system-wide PM callbacks should be skipped for it (as they generally assume the device to not be suspended), so add checks for that in pci_pm_suspend_late/noirq(), pci_pm_freeze_late/noirq() and pci_pm_poweroff_late/noirq(). Moreover, if pci_pm_resume_noirq() or pci_pm_restore_noirq() is called during the subsequent system-wide resume transition and if the device was left in runtime suspend previously, its runtime PM status needs to be changed to "active" as it is going to be put into the full-power state, so add checks for that too to these functions. In turn, if pci_pm_thaw_noirq() runs after the device has been left in runtime suspend, the subsequent "thaw" callbacks need to be skipped for it (as they may not work correctly with a suspended device), so set the power.direct_complete flag for the device then to make the PM core skip those callbacks. In addition to the above add a core helper for checking if DPM_FLAG_SMART_SUSPEND is set and the device runtime PM status is "suspended" at the same time, which is done quite often in the new code (and will be done elsewhere going forward too). Signed-off-by: Rafael J. Wysocki Acked-by: Greg Kroah-Hartman Acked-by: Bjorn Helgaas --- Documentation/power/pci.txt | 14 ++++++ drivers/base/power/main.c | 6 +++ drivers/pci/pci-driver.c | 103 ++++++++++++++++++++++++++++++++++++-------- include/linux/pm.h | 2 + 4 files changed, 108 insertions(+), 17 deletions(-) (limited to 'Documentation') diff --git a/Documentation/power/pci.txt b/Documentation/power/pci.txt index ab4e7d0540c1..304162ea377e 100644 --- a/Documentation/power/pci.txt +++ b/Documentation/power/pci.txt @@ -980,6 +980,20 @@ positive value from pci_pm_prepare() if the ->prepare callback provided by the driver of the device returns a positive value. That allows the driver to opt out from using the direct-complete mechanism dynamically. +The DPM_FLAG_SMART_SUSPEND flag tells the PCI bus type that from the driver's +perspective the device can be safely left in runtime suspend during system +suspend. That causes pci_pm_suspend(), pci_pm_freeze() and pci_pm_poweroff() +to skip resuming the device from runtime suspend unless there are PCI-specific +reasons for doing that. Also, it causes pci_pm_suspend_late/noirq(), +pci_pm_freeze_late/noirq() and pci_pm_poweroff_late/noirq() to return early +if the device remains in runtime suspend in the beginning of the "late" phase +of the system-wide transition under way. 
Moreover, if the device is in +runtime suspend in pci_pm_resume_noirq() or pci_pm_restore_noirq(), its runtime +power management status will be changed to "active" (as it is going to be put +into D0 going forward), but if it is in runtime suspend in pci_pm_thaw_noirq(), +the function will set the power.direct_complete flag for it (to make the PM core +skip the subsequent "thaw" callbacks for it) and return. + 3.2. Device Runtime Power Management ------------------------------------ In addition to providing device power management callbacks PCI device drivers diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index 8d9024017645..6c6f1c74c24c 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -1861,3 +1861,9 @@ void device_pm_check_callbacks(struct device *dev) !dev->driver->suspend && !dev->driver->resume)); spin_unlock_irq(&dev->power.lock); } + +bool dev_pm_smart_suspend_and_suspended(struct device *dev) +{ + return dev_pm_test_driver_flags(dev, DPM_FLAG_SMART_SUSPEND) && + pm_runtime_status_suspended(dev); +} diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index c1aeeb10539e..d19bd54d337e 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -734,18 +734,25 @@ static int pci_pm_suspend(struct device *dev) if (!pm) { pci_pm_default_suspend(pci_dev); - goto Fixup; + return 0; } /* - * PCI devices suspended at run time need to be resumed at this point, - * because in general it is necessary to reconfigure them for system - * suspend. Namely, if the device is supposed to wake up the system - * from the sleep state, we may need to reconfigure it for this purpose. - * In turn, if the device is not supposed to wake up the system from the - * sleep state, we'll have to prevent it from signaling wake-up. + * PCI devices suspended at run time may need to be resumed at this + * point, because in general it may be necessary to reconfigure them for + * system suspend. Namely, if the device is expected to wake up the + * system from the sleep state, it may have to be reconfigured for this + * purpose, or if the device is not expected to wake up the system from + * the sleep state, it should be prevented from signaling wakeup events + * going forward. + * + * Also if the driver of the device does not indicate that its system + * suspend callbacks can cope with runtime-suspended devices, it is + * better to resume the device from runtime suspend here. */ - pm_runtime_resume(dev); + if (!dev_pm_test_driver_flags(dev, DPM_FLAG_SMART_SUSPEND) || + !pci_dev_keep_suspended(pci_dev)) + pm_runtime_resume(dev); pci_dev->state_saved = false; if (pm->suspend) { @@ -765,17 +772,27 @@ static int pci_pm_suspend(struct device *dev) } } - Fixup: - pci_fixup_device(pci_fixup_suspend, pci_dev); - return 0; } +static int pci_pm_suspend_late(struct device *dev) +{ + if (dev_pm_smart_suspend_and_suspended(dev)) + return 0; + + pci_fixup_device(pci_fixup_suspend, to_pci_dev(dev)); + + return pm_generic_suspend_late(dev); +} + static int pci_pm_suspend_noirq(struct device *dev) { struct pci_dev *pci_dev = to_pci_dev(dev); const struct dev_pm_ops *pm = dev->driver ? 
dev->driver->pm : NULL; + if (dev_pm_smart_suspend_and_suspended(dev)) + return 0; + if (pci_has_legacy_pm_support(pci_dev)) return pci_legacy_suspend_late(dev, PMSG_SUSPEND); @@ -834,6 +851,14 @@ static int pci_pm_resume_noirq(struct device *dev) struct device_driver *drv = dev->driver; int error = 0; + /* + * Devices with DPM_FLAG_SMART_SUSPEND may be left in runtime suspend + * during system suspend, so update their runtime PM status to "active" + * as they are going to be put into D0 shortly. + */ + if (dev_pm_smart_suspend_and_suspended(dev)) + pm_runtime_set_active(dev); + pci_pm_default_resume_early(pci_dev); if (pci_has_legacy_pm_support(pci_dev)) @@ -876,6 +901,7 @@ static int pci_pm_resume(struct device *dev) #else /* !CONFIG_SUSPEND */ #define pci_pm_suspend NULL +#define pci_pm_suspend_late NULL #define pci_pm_suspend_noirq NULL #define pci_pm_resume NULL #define pci_pm_resume_noirq NULL @@ -910,7 +936,8 @@ static int pci_pm_freeze(struct device *dev) * devices should not be touched during freeze/thaw transitions, * however. */ - pm_runtime_resume(dev); + if (!dev_pm_test_driver_flags(dev, DPM_FLAG_SMART_SUSPEND)) + pm_runtime_resume(dev); pci_dev->state_saved = false; if (pm->freeze) { @@ -925,11 +952,22 @@ static int pci_pm_freeze(struct device *dev) return 0; } +static int pci_pm_freeze_late(struct device *dev) +{ + if (dev_pm_smart_suspend_and_suspended(dev)) + return 0; + + return pm_generic_freeze_late(dev);; +} + static int pci_pm_freeze_noirq(struct device *dev) { struct pci_dev *pci_dev = to_pci_dev(dev); struct device_driver *drv = dev->driver; + if (dev_pm_smart_suspend_and_suspended(dev)) + return 0; + if (pci_has_legacy_pm_support(pci_dev)) return pci_legacy_suspend_late(dev, PMSG_FREEZE); @@ -959,6 +997,16 @@ static int pci_pm_thaw_noirq(struct device *dev) struct device_driver *drv = dev->driver; int error = 0; + /* + * If the device is in runtime suspend, the code below may not work + * correctly with it, so skip that code and make the PM core skip all of + * the subsequent "thaw" callbacks for the device. + */ + if (dev_pm_smart_suspend_and_suspended(dev)) { + dev->power.direct_complete = true; + return 0; + } + if (pcibios_pm_ops.thaw_noirq) { error = pcibios_pm_ops.thaw_noirq(dev); if (error) @@ -1008,11 +1056,13 @@ static int pci_pm_poweroff(struct device *dev) if (!pm) { pci_pm_default_suspend(pci_dev); - goto Fixup; + return 0; } /* The reason to do that is the same as in pci_pm_suspend(). */ - pm_runtime_resume(dev); + if (!dev_pm_test_driver_flags(dev, DPM_FLAG_SMART_SUSPEND) || + !pci_dev_keep_suspended(pci_dev)) + pm_runtime_resume(dev); pci_dev->state_saved = false; if (pm->poweroff) { @@ -1024,17 +1074,27 @@ static int pci_pm_poweroff(struct device *dev) return error; } - Fixup: - pci_fixup_device(pci_fixup_suspend, pci_dev); - return 0; } +static int pci_pm_poweroff_late(struct device *dev) +{ + if (dev_pm_smart_suspend_and_suspended(dev)) + return 0; + + pci_fixup_device(pci_fixup_suspend, to_pci_dev(dev)); + + return pm_generic_poweroff_late(dev); +} + static int pci_pm_poweroff_noirq(struct device *dev) { struct pci_dev *pci_dev = to_pci_dev(dev); struct device_driver *drv = dev->driver; + if (dev_pm_smart_suspend_and_suspended(dev)) + return 0; + if (pci_has_legacy_pm_support(to_pci_dev(dev))) return pci_legacy_suspend_late(dev, PMSG_HIBERNATE); @@ -1076,6 +1136,10 @@ static int pci_pm_restore_noirq(struct device *dev) struct device_driver *drv = dev->driver; int error = 0; + /* This is analogous to the pci_pm_resume_noirq() case. 
*/ + if (dev_pm_smart_suspend_and_suspended(dev)) + pm_runtime_set_active(dev); + if (pcibios_pm_ops.restore_noirq) { error = pcibios_pm_ops.restore_noirq(dev); if (error) @@ -1124,10 +1188,12 @@ static int pci_pm_restore(struct device *dev) #else /* !CONFIG_HIBERNATE_CALLBACKS */ #define pci_pm_freeze NULL +#define pci_pm_freeze_late NULL #define pci_pm_freeze_noirq NULL #define pci_pm_thaw NULL #define pci_pm_thaw_noirq NULL #define pci_pm_poweroff NULL +#define pci_pm_poweroff_late NULL #define pci_pm_poweroff_noirq NULL #define pci_pm_restore NULL #define pci_pm_restore_noirq NULL @@ -1243,10 +1309,13 @@ static const struct dev_pm_ops pci_dev_pm_ops = { .prepare = pci_pm_prepare, .complete = pci_pm_complete, .suspend = pci_pm_suspend, + .suspend_late = pci_pm_suspend_late, .resume = pci_pm_resume, .freeze = pci_pm_freeze, + .freeze_late = pci_pm_freeze_late, .thaw = pci_pm_thaw, .poweroff = pci_pm_poweroff, + .poweroff_late = pci_pm_poweroff_late, .restore = pci_pm_restore, .suspend_noirq = pci_pm_suspend_noirq, .resume_noirq = pci_pm_resume_noirq, diff --git a/include/linux/pm.h b/include/linux/pm.h index 43b5418e05bb..65d39115f06d 100644 --- a/include/linux/pm.h +++ b/include/linux/pm.h @@ -765,6 +765,8 @@ extern int pm_generic_poweroff_late(struct device *dev); extern int pm_generic_poweroff(struct device *dev); extern void pm_generic_complete(struct device *dev); +extern bool dev_pm_smart_suspend_and_suspended(struct device *dev); + #else /* !CONFIG_PM_SLEEP */ #define device_pm_lock() do {} while (0) -- cgit v1.2.3 From 0759e80b84e34a84e7e46e2b1adb528c83d84a47 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Tue, 7 Nov 2017 11:33:49 +0100 Subject: PM / QoS: Fix device resume latency framework The special value of 0 for device resume latency PM QoS means "no restriction", but there are two problems with that. First, device resume latency PM QoS requests with 0 as the value are always put in front of requests with positive values in the priority lists used internally by the PM QoS framework, causing 0 to be chosen as an effective constraint value. However, that 0 is then interpreted as "no restriction" effectively overriding the other requests with specific restrictions which is incorrect. Second, the users of device resume latency PM QoS have no way to specify that *any* resume latency at all should be avoided, which is an artificial limitation in general. To address these issues, modify device resume latency PM QoS to use S32_MAX as the "no constraint" value and 0 as the "no latency at all" one and rework its users (the cpuidle menu governor, the genpd QoS governor and the runtime PM framework) to follow these changes. Also add a special "n/a" value to the corresponding user space I/F to allow user space to indicate that it cannot accept any resume latencies at all for the given device. Fixes: 85dc0b8a4019 (PM / QoS: Make it possible to expose PM QoS latency constraints) Link: https://bugzilla.kernel.org/show_bug.cgi?id=197323 Reported-by: Reinette Chatre Signed-off-by: Rafael J. 
Wysocki Tested-by: Reinette Chatre Tested-by: Geert Uytterhoeven Tested-by: Tero Kristo Reviewed-by: Ramesh Thomas --- Documentation/ABI/testing/sysfs-devices-power | 4 ++- drivers/base/cpu.c | 3 +- drivers/base/power/domain.c | 2 +- drivers/base/power/domain_governor.c | 40 +++++++++++---------------- drivers/base/power/qos.c | 5 +++- drivers/base/power/runtime.c | 2 +- drivers/base/power/sysfs.c | 25 ++++++++++++++--- drivers/cpuidle/governors/menu.c | 4 +-- include/linux/pm_qos.h | 26 +++++++++++------ 9 files changed, 68 insertions(+), 43 deletions(-) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-devices-power b/Documentation/ABI/testing/sysfs-devices-power index f4b24c327665..80a00f7b6667 100644 --- a/Documentation/ABI/testing/sysfs-devices-power +++ b/Documentation/ABI/testing/sysfs-devices-power @@ -211,7 +211,9 @@ Description: device, after it has been suspended at run time, from a resume request to the moment the device will be ready to process I/O, in microseconds. If it is equal to 0, however, this means that - the PM QoS resume latency may be arbitrary. + the PM QoS resume latency may be arbitrary and the special value + "n/a" means that user space cannot accept any resume latency at + all for the given device. Not all drivers support this attribute. If it isn't supported, it is not present. diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 321cd7b4d817..227bac5f1191 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -377,7 +377,8 @@ int register_cpu(struct cpu *cpu, int num) per_cpu(cpu_sys_devices, num) = &cpu->dev; register_cpu_under_node(num, cpu_to_node(num)); - dev_pm_qos_expose_latency_limit(&cpu->dev, 0); + dev_pm_qos_expose_latency_limit(&cpu->dev, + PM_QOS_RESUME_LATENCY_NO_CONSTRAINT); return 0; } diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c index 679c79545e42..24e39ce27bd8 100644 --- a/drivers/base/power/domain.c +++ b/drivers/base/power/domain.c @@ -1326,7 +1326,7 @@ static struct generic_pm_domain_data *genpd_alloc_dev_data(struct device *dev, gpd_data->base.dev = dev; gpd_data->td.constraint_changed = true; - gpd_data->td.effective_constraint_ns = 0; + gpd_data->td.effective_constraint_ns = PM_QOS_RESUME_LATENCY_NO_CONSTRAINT_NS; gpd_data->nb.notifier_call = genpd_dev_pm_qos_notifier; spin_lock_irq(&dev->power.lock); diff --git a/drivers/base/power/domain_governor.c b/drivers/base/power/domain_governor.c index e4cca8adab32..99896fbf18e4 100644 --- a/drivers/base/power/domain_governor.c +++ b/drivers/base/power/domain_governor.c @@ -33,15 +33,10 @@ static int dev_update_qos_constraint(struct device *dev, void *data) * known at this point anyway). 
*/ constraint_ns = dev_pm_qos_read_value(dev); - if (constraint_ns > 0) - constraint_ns *= NSEC_PER_USEC; + constraint_ns *= NSEC_PER_USEC; } - /* 0 means "no constraint" */ - if (constraint_ns == 0) - return 0; - - if (constraint_ns < *constraint_ns_p || *constraint_ns_p == 0) + if (constraint_ns < *constraint_ns_p) *constraint_ns_p = constraint_ns; return 0; @@ -69,12 +64,12 @@ static bool default_suspend_ok(struct device *dev) } td->constraint_changed = false; td->cached_suspend_ok = false; - td->effective_constraint_ns = -1; + td->effective_constraint_ns = 0; constraint_ns = __dev_pm_qos_read_value(dev); spin_unlock_irqrestore(&dev->power.lock, flags); - if (constraint_ns < 0) + if (constraint_ns == 0) return false; constraint_ns *= NSEC_PER_USEC; @@ -87,25 +82,25 @@ static bool default_suspend_ok(struct device *dev) device_for_each_child(dev, &constraint_ns, dev_update_qos_constraint); - if (constraint_ns == 0) { + if (constraint_ns == PM_QOS_RESUME_LATENCY_NO_CONSTRAINT_NS) { /* "No restriction", so the device is allowed to suspend. */ - td->effective_constraint_ns = 0; + td->effective_constraint_ns = PM_QOS_RESUME_LATENCY_NO_CONSTRAINT_NS; td->cached_suspend_ok = true; - } else if (constraint_ns < 0) { + } else if (constraint_ns == 0) { /* * This triggers if one of the children that don't belong to a - * domain has a negative PM QoS constraint and it's better not - * to suspend then. effective_constraint_ns is negative already - * and cached_suspend_ok is false, so bail out. + * domain has a zero PM QoS constraint and it's better not to + * suspend then. effective_constraint_ns is zero already and + * cached_suspend_ok is false, so bail out. */ return false; } else { constraint_ns -= td->suspend_latency_ns + td->resume_latency_ns; /* - * effective_constraint_ns is negative already and - * cached_suspend_ok is false, so if the computed value is not - * positive, return right away. + * effective_constraint_ns is zero already and cached_suspend_ok + * is false, so if the computed value is not positive, return + * right away. */ if (constraint_ns <= 0) return false; @@ -174,13 +169,10 @@ static bool __default_power_down_ok(struct dev_pm_domain *pd, td = &to_gpd_data(pdd)->td; constraint_ns = td->effective_constraint_ns; /* - * Negative values mean "no suspend at all" and this runs only - * when all devices in the domain are suspended, so it must be - * 0 at least. - * - * 0 means "no constraint" + * Zero means "no suspend at all" and this runs only when all + * devices in the domain are suspended, so it must be positive. 
 		 */
-		if (constraint_ns == 0)
+		if (constraint_ns == PM_QOS_RESUME_LATENCY_NO_CONSTRAINT_NS)
 			continue;
 
 		if (constraint_ns <= off_on_time_ns)
diff --git a/drivers/base/power/qos.c b/drivers/base/power/qos.c
index 277d43a83f53..3382542b39b7 100644
--- a/drivers/base/power/qos.c
+++ b/drivers/base/power/qos.c
@@ -139,6 +139,9 @@ static int apply_constraint(struct dev_pm_qos_request *req,
 
 	switch(req->type) {
 	case DEV_PM_QOS_RESUME_LATENCY:
+		if (WARN_ON(action != PM_QOS_REMOVE_REQ && value < 0))
+			value = 0;
+
 		ret = pm_qos_update_target(&qos->resume_latency,
 					   &req->data.pnode, action, value);
 		break;
@@ -189,7 +192,7 @@ static int dev_pm_qos_constraints_allocate(struct device *dev)
 	plist_head_init(&c->list);
 	c->target_value = PM_QOS_RESUME_LATENCY_DEFAULT_VALUE;
 	c->default_value = PM_QOS_RESUME_LATENCY_DEFAULT_VALUE;
-	c->no_constraint_value = PM_QOS_RESUME_LATENCY_DEFAULT_VALUE;
+	c->no_constraint_value = PM_QOS_RESUME_LATENCY_NO_CONSTRAINT;
 	c->type = PM_QOS_MIN;
 	c->notifiers = n;
diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index 7bcf80fa9ada..13e015905543 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -253,7 +253,7 @@ static int rpm_check_suspend_allowed(struct device *dev)
 	    || (dev->power.request_pending
 	    && dev->power.request == RPM_REQ_RESUME))
 		retval = -EAGAIN;
-	else if (__dev_pm_qos_read_value(dev) < 0)
+	else if (__dev_pm_qos_read_value(dev) == 0)
 		retval = -EPERM;
 	else if (dev->power.runtime_status == RPM_SUSPENDED)
 		retval = 1;
diff --git a/drivers/base/power/sysfs.c b/drivers/base/power/sysfs.c
index 29bf28fef136..e153e28b1857 100644
--- a/drivers/base/power/sysfs.c
+++ b/drivers/base/power/sysfs.c
@@ -218,7 +218,14 @@ static ssize_t pm_qos_resume_latency_show(struct device *dev,
 					  struct device_attribute *attr,
 					  char *buf)
 {
-	return sprintf(buf, "%d\n", dev_pm_qos_requested_resume_latency(dev));
+	s32 value = dev_pm_qos_requested_resume_latency(dev);
+
+	if (value == 0)
+		return sprintf(buf, "n/a\n");
+	else if (value == PM_QOS_RESUME_LATENCY_NO_CONSTRAINT)
+		value = 0;
+
+	return sprintf(buf, "%d\n", value);
 }
 
 static ssize_t pm_qos_resume_latency_store(struct device *dev,
@@ -228,11 +235,21 @@ static ssize_t pm_qos_resume_latency_store(struct device *dev,
 	s32 value;
 	int ret;
 
-	if (kstrtos32(buf, 0, &value))
-		return -EINVAL;
+	if (!kstrtos32(buf, 0, &value)) {
+		/*
+		 * Prevent users from writing negative or "no constraint" values
+		 * directly.
+		 */
+		if (value < 0 || value == PM_QOS_RESUME_LATENCY_NO_CONSTRAINT)
+			return -EINVAL;
 
-	if (value < 0)
+		if (value == 0)
+			value = PM_QOS_RESUME_LATENCY_NO_CONSTRAINT;
+	} else if (!strcmp(buf, "n/a") || !strcmp(buf, "n/a\n")) {
+		value = 0;
+	} else {
 		return -EINVAL;
+	}
 
 	ret = dev_pm_qos_update_request(dev->power.qos->resume_latency_req,
 					value);
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 48eaf2879228..aa390404e85f 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -298,8 +298,8 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 		data->needs_update = 0;
 	}
 
-	/* resume_latency is 0 means no restriction */
-	if (resume_latency && resume_latency < latency_req)
+	if (resume_latency < latency_req &&
+	    resume_latency != PM_QOS_RESUME_LATENCY_NO_CONSTRAINT)
 		latency_req = resume_latency;
 
 	/* Special case when user has set very strict latency requirement */
diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h
index 51f0d7e0b15f..2a3b36da61b1 100644
--- a/include/linux/pm_qos.h
+++ b/include/linux/pm_qos.h
@@ -27,16 +27,19 @@ enum pm_qos_flags_status {
 	PM_QOS_FLAGS_ALL,
 };
 
-#define PM_QOS_DEFAULT_VALUE -1
+#define PM_QOS_DEFAULT_VALUE	(-1)
+#define PM_QOS_LATENCY_ANY	S32_MAX
+#define PM_QOS_LATENCY_ANY_NS	((s64)PM_QOS_LATENCY_ANY * NSEC_PER_USEC)
 
 #define PM_QOS_CPU_DMA_LAT_DEFAULT_VALUE	(2000 * USEC_PER_SEC)
 #define PM_QOS_NETWORK_LAT_DEFAULT_VALUE	(2000 * USEC_PER_SEC)
 #define PM_QOS_NETWORK_THROUGHPUT_DEFAULT_VALUE	0
 #define PM_QOS_MEMORY_BANDWIDTH_DEFAULT_VALUE	0
-#define PM_QOS_RESUME_LATENCY_DEFAULT_VALUE	0
+#define PM_QOS_RESUME_LATENCY_DEFAULT_VALUE	PM_QOS_LATENCY_ANY
+#define PM_QOS_RESUME_LATENCY_NO_CONSTRAINT	PM_QOS_LATENCY_ANY
+#define PM_QOS_RESUME_LATENCY_NO_CONSTRAINT_NS	PM_QOS_LATENCY_ANY_NS
 #define PM_QOS_LATENCY_TOLERANCE_DEFAULT_VALUE	0
 #define PM_QOS_LATENCY_TOLERANCE_NO_CONSTRAINT	(-1)
-#define PM_QOS_LATENCY_ANY			((s32)(~(__u32)0 >> 1))
 
 #define PM_QOS_FLAG_NO_POWER_OFF	(1 << 0)
 
@@ -173,7 +176,8 @@ static inline s32 dev_pm_qos_requested_flags(struct device *dev)
 static inline s32 dev_pm_qos_raw_read_value(struct device *dev)
 {
 	return IS_ERR_OR_NULL(dev->power.qos) ?
-		0 : pm_qos_read_value(&dev->power.qos->resume_latency);
+		PM_QOS_RESUME_LATENCY_NO_CONSTRAINT :
+		pm_qos_read_value(&dev->power.qos->resume_latency);
 }
 #else
 static inline enum pm_qos_flags_status __dev_pm_qos_flags(struct device *dev,
@@ -183,9 +187,9 @@ static inline enum pm_qos_flags_status dev_pm_qos_flags(struct device *dev,
 							s32 mask)
 			{ return PM_QOS_FLAGS_UNDEFINED; }
 static inline s32 __dev_pm_qos_read_value(struct device *dev)
-			{ return 0; }
+			{ return PM_QOS_RESUME_LATENCY_NO_CONSTRAINT; }
 static inline s32 dev_pm_qos_read_value(struct device *dev)
-			{ return 0; }
+			{ return PM_QOS_RESUME_LATENCY_NO_CONSTRAINT; }
 static inline int dev_pm_qos_add_request(struct device *dev,
 					 struct dev_pm_qos_request *req,
 					 enum dev_pm_qos_req_type type,
@@ -231,9 +235,15 @@ static inline int dev_pm_qos_expose_latency_tolerance(struct device *dev)
 			{ return 0; }
 static inline void dev_pm_qos_hide_latency_tolerance(struct device *dev) {}
 
-static inline s32 dev_pm_qos_requested_resume_latency(struct device *dev) { return 0; }
+static inline s32 dev_pm_qos_requested_resume_latency(struct device *dev)
+{
+	return PM_QOS_RESUME_LATENCY_NO_CONSTRAINT;
+}
 static inline s32 dev_pm_qos_requested_flags(struct device *dev) { return 0; }
-static inline s32 dev_pm_qos_raw_read_value(struct device *dev) { return 0; }
+static inline s32 dev_pm_qos_raw_read_value(struct device *dev)
+{
+	return PM_QOS_RESUME_LATENCY_NO_CONSTRAINT;
+}
 #endif
 
 #endif

-- cgit v1.2.3
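
A note on the convention the patch above establishes: PM_QOS_RESUME_LATENCY_NO_CONSTRAINT
(S32_MAX) now means "any resume latency is acceptable", a plain 0 means "no
latency at all" (which makes runtime suspend fail with -EPERM), and the sysfs
attribute maps a written "0" to the former and "n/a" to the latter.  The sketch
below is illustrative only and is not part of the patch; it shows how a driver
could use the existing device PM QoS request API under this convention.  The
request object and function names are hypothetical.

	/* Illustrative sketch; my_latency_req and the helpers are hypothetical. */
	#include <linux/device.h>
	#include <linux/pm_qos.h>

	static struct dev_pm_qos_request my_latency_req;

	static int example_limit_resume_latency(struct device *dev)
	{
		/* Ask that resuming this device take no more than 100 us. */
		return dev_pm_qos_add_request(dev, &my_latency_req,
					      DEV_PM_QOS_RESUME_LATENCY, 100);
	}

	static void example_relax_resume_latency(void)
	{
		/*
		 * Relax the request to "no constraint".  Passing 0 here would
		 * now mean "no latency at all" and prevent runtime suspend.
		 */
		dev_pm_qos_update_request(&my_latency_req,
					  PM_QOS_RESUME_LATENCY_NO_CONSTRAINT);
	}
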
From f7bc9b209e27c0b617378400136cc663a6314d0c Mon Sep 17 00:00:00 2001
From: "Gautham R. Shenoy"
Date: Tue, 7 Nov 2017 13:39:29 +0530
Subject: cpufreq: stats: Handle the case when trans_table goes beyond PAGE_SIZE

On platforms with a large number of Pstates, the transition table, which is
an NxN matrix, can overflow beyond the PAGE_SIZE boundary. This can be seen
on POWER9, which has 100+ Pstates.

As a result, each time the trans_table is read for any of the CPUs, we will
get the following error.
---------------------------------------------------
fill_read_buffer: show+0x0/0xa0 returned bad count
---------------------------------------------------

This patch ensures that, in case of an overflow, we print a warning once in
dmesg and return a FILE TOO LARGE (-EFBIG) error for this and all subsequent
accesses of trans_table.

Signed-off-by: Gautham R. Shenoy
Acked-by: Viresh Kumar
Signed-off-by: Rafael J. Wysocki
---
 Documentation/cpu-freq/cpufreq-stats.txt | 3 +++
 drivers/cpufreq/cpufreq_stats.c          | 7 +++++--
 2 files changed, 8 insertions(+), 2 deletions(-)

(limited to 'Documentation')

diff --git a/Documentation/cpu-freq/cpufreq-stats.txt b/Documentation/cpu-freq/cpufreq-stats.txt
index 2bbe207354ed..a873855c811d 100644
--- a/Documentation/cpu-freq/cpufreq-stats.txt
+++ b/Documentation/cpu-freq/cpufreq-stats.txt
@@ -90,6 +90,9 @@ Freq_i to Freq_j. Freq_i is in descending order with increasing rows and
 Freq_j is in descending order with increasing columns. The output here also
 contains the actual freq values for each row and column for better readability.
 
+If the transition table is bigger than PAGE_SIZE, reading this will
+return an -EFBIG error.
+
 --------------------------------------------------------------------------------
 :/sys/devices/system/cpu/cpu0/cpufreq/stats # cat trans_table
    From  :    To
diff --git a/drivers/cpufreq/cpufreq_stats.c b/drivers/cpufreq/cpufreq_stats.c
index e75880eb037d..1e55b5790853 100644
--- a/drivers/cpufreq/cpufreq_stats.c
+++ b/drivers/cpufreq/cpufreq_stats.c
@@ -118,8 +118,11 @@ static ssize_t show_trans_table(struct cpufreq_policy *policy, char *buf)
 			break;
 		len += snprintf(buf + len, PAGE_SIZE - len, "\n");
 	}
-	if (len >= PAGE_SIZE)
-		return PAGE_SIZE;
+
+	if (len >= PAGE_SIZE) {
+		pr_warn_once("cpufreq transition table exceeds PAGE_SIZE. Disabling\n");
+		return -EFBIG;
+	}
 	return len;
 }
 cpufreq_freq_attr_ro(trans_table);
-- cgit v1.2.3
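
The cpufreq_stats fix above follows a general sysfs rule: a show() callback
writes into a single PAGE_SIZE buffer, and returning a count of PAGE_SIZE or
more makes the sysfs read path reject the output with the "returned bad count"
message quoted in the changelog, so it is better to fail with -EFBIG than to
return a truncated page.  A minimal, hypothetical sketch of that pattern (the
attribute name and row count are invented for illustration):

	#include <linux/device.h>
	#include <linux/kernel.h>

	static ssize_t example_table_show(struct device *dev,
					  struct device_attribute *attr, char *buf)
	{
		ssize_t len = 0;
		int i;

		for (i = 0; i < 200; i++) {
			if (len >= PAGE_SIZE)
				break;
			len += snprintf(buf + len, PAGE_SIZE - len, "row %d\n", i);
		}

		/* Do not return a truncated page; report the overflow instead. */
		if (len >= PAGE_SIZE) {
			pr_warn_once("example: table exceeds PAGE_SIZE\n");
			return -EFBIG;
		}

		return len;
	}
	static DEVICE_ATTR_RO(example_table);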